dc.contributor.author
Patel, Minesh
dc.contributor.author
de Oliveira, Geraldo F.
dc.contributor.author
Mutlu, Onur
dc.date.accessioned
2021-11-25T08:35:16Z
dc.date.available
2021-11-18T04:54:03Z
dc.date.available
2021-11-25T08:35:16Z
dc.date.issued
2021
dc.identifier.isbn
978-1-4503-8557-2
en_US
dc.identifier.other
10.1145/3466752.3480061
en_US
dc.identifier.uri
http://hdl.handle.net/20.500.11850/515705
dc.description.abstract
Aggressive storage density scaling in modern main memories causes increasing error rates that are addressed using error-mitigation techniques. State-of-the-art techniques for addressing high error rates identify and repair bits that are at risk of error from within the memory controller. Unfortunately, modern main memory chips internally use on-die error correcting codes (on-die ECC) that obfuscate the memory controller's view of errors, complicating the process of identifying at-risk bits (i.e., error profiling). To understand the problems that on-die ECC causes for error profiling, we analytically study how on-die ECC changes the way that memory errors appear outside of the memory chip (e.g., to the memory controller). We show that on-die ECC introduces statistical dependence between errors in different bit positions, raising three key challenges for practical and effective error profiling: on-die ECC (1) exponentially increases the number of at-risk bits the profiler must identify; (2) makes individual at-risk bits more difficult to identify; and (3) interferes with commonly used memory data patterns that are designed to make at-risk bits easier to identify. To address the three challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new error profiling algorithm that rapidly achieves full coverage of at-risk bits based on two key insights. First, errors that on-die ECC fails to correct have two sources: (1) direct errors from raw bit errors in the data portion of the ECC word and (2) indirect errors that on-die ECC introduces when facing uncorrectable errors. Second, the maximum number of indirect errors that can occur concurrently is limited to the correction capability of on-die ECC. HARP's key idea is to first identify all bits at risk of direct errors using existing profiling techniques with the help of small modifications to the on-die ECC mechanism. Then, a secondary ECC within the memory controller with correction capability equal to or greater than that of on-die ECC can safely identify bits at risk of indirect errors, if and when they fail. We evaluate HARP in simulation relative to two state-of-the-art baseline error profiling algorithms. We show that HARP achieves full coverage of all at-risk bits faster (e.g., 99th-percentile coverage 20.6%/36.4%/52.9%/62.1% faster, on average, given 2/3/4/5 raw bit errors per ECC word) than the baseline algorithms, which sometimes fail to achieve full coverage. We perform a case study of how each profiler impacts the system's overall bit error rate (BER) when using a repair mechanism to tolerate DRAM data-retention errors. We show that HARP identifies all errors faster than the best-performing baseline algorithm (e.g., by 3.7× for a raw per-bit error probability of 0.75). We conclude that HARP effectively overcomes the three error profiling challenges introduced by on-die ECC.
en_US
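The distinction the abstract draws between direct and indirect errors can be illustrated with a toy single-error-correcting code. The sketch below is only an illustration under stated assumptions: it uses a textbook Hamming(7,4) code, which this record does not say the paper uses, and the names `encode`, `decode`, and `classify` are hypothetical helpers, not part of HARP. It shows that when two raw bit errors exceed the correction capability of one, the decoder flips a third bit, adding exactly one indirect error.

```python
# Toy Hamming(7,4) single-error-correcting code (an assumption for
# illustration; real on-die ECC words are much longer).
# Bit positions are 1..7; parity bits sit at 1, 2, 4; data bits at 3, 5, 6, 7.
DATA_POS = (3, 5, 6, 7)
PARITY_POS = (1, 2, 4)


def encode(data):
    """Return an 8-element list (index 0 unused) holding the codeword."""
    word = [0] * 8
    for pos, bit in zip(DATA_POS, data):
        word[pos] = bit
    # XOR of the positions of all set data bits.
    s = 0
    for pos in DATA_POS:
        if word[pos]:
            s ^= pos
    # Choose parity bits so the XOR over all set positions becomes zero.
    for p in PARITY_POS:
        word[p] = 1 if s & p else 0
    return word


def decode(word):
    """Correct at most one bit; return the data bits after correction."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    if syndrome:
        # The decoder assumes a single error at position `syndrome`.
        word[syndrome] ^= 1
    return [word[p] for p in DATA_POS]


def classify(data, raw_error_positions):
    """Split post-correction data errors into direct and indirect ones."""
    word = encode(data)
    for pos in raw_error_positions:
        word[pos] ^= 1  # inject a raw bit error
    decoded = decode(word)
    wrong = {p for p, got, want in zip(DATA_POS, decoded, data) if got != want}
    raw = set(raw_error_positions)
    direct = sorted(wrong & raw)    # raw errors left in the data bits
    indirect = sorted(wrong - raw)  # bits the decoder itself corrupted
    return direct, indirect


print(classify([1, 0, 1, 1], [3]))     # one raw error: corrected
print(classify([1, 0, 1, 1], [3, 5]))  # two raw errors: miscorrection
print(classify([1, 0, 1, 1], [1, 2]))  # raw errors in parity bits only
```

In this toy setting the second call reports the two raw data errors as direct and one extra data bit as indirect, and the third call shows an indirect error arising even though no data bit had a raw error. In both cases the number of indirect errors equals the code's correction capability of one, which matches the bound stated in the abstract.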
dc.language.iso
en
en_US
dc.publisher
ACM
en_US
dc.subject
On-Die ECC
en_US
dc.subject
DRAM
en_US
dc.subject
Memory Test
en_US
dc.subject
Repair
en_US
dc.subject
Error Profiling
en_US
dc.subject
Error Modeling
en_US
dc.subject
Memory Scaling
en_US
dc.subject
Reliability
en_US
dc.subject
Fault Tolerance
en_US
dc.title
HARP: Practically and Effectively Identifying Uncorrectable Errors in Memory Chips That Use On-Die Error-Correcting Codes
en_US
dc.type
Conference Paper
dc.date.published
2021-10-18
ethz.book.title
MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
en_US
ethz.pages.start
623
en_US
ethz.pages.end
640
en_US
ethz.event
54th IEEE/ACM International Symposium on Microarchitecture (MICRO 2021)
en_US
ethz.event.location
online
en_US
ethz.event.date
October 18-22, 2021
en_US
ethz.identifier.scopus
ethz.publication.place
New York
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::09483 - Mutlu, Onur / Mutlu, Onur
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::09483 - Mutlu, Onur / Mutlu, Onur
ethz.relation.isReferencedBy
10.3929/ethz-b-000542542
ethz.date.deposited
2021-11-18T04:54:16Z
ethz.source
SCOPUS
ethz.eth
yes
en_US
ethz.availability
Metadata only
en_US
ethz.rosetta.installDate
2021-11-25T08:35:31Z
ethz.rosetta.lastUpdated
2022-03-29T16:08:19Z
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=HARP:%20Practically%20and%20Effectively%20Identifying%20Uncorrectable%20Errors%20in%20Memory%20Chips%20That%20Use%20On-Die%20Error-Correcting%20Codes&rft.date=2021&rft.spage=623&rft.epage=640&rft.au=Patel,%20Minesh&de%20Oliveira,%20Geraldo%20F.&Mutlu,%20Onur&rft.isbn=978-1-4503-8557-2&rft.genre=proceeding&rft_id=info:doi/10.1145/3466752.3480061&rft.btitle=MICRO%20'21:%20MICRO-54:%2054th%20Annual%20IEEE/ACM%20International%20Symposium%20on%20Microarchitecture