Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes

Open access
Author
Date
2021
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Improvements in main memory storage density are primarily driven by process
technology shrinkage (i.e., technology scaling), which negatively impacts
reliability by exacerbating various circuit-level error mechanisms. To
compensate for growing error rates, both memory manufacturers and consumers
develop and incorporate error-mitigation mechanisms that improve manufacturing
yield and allow system designers to meet reliability targets. Developing
effective error mitigation techniques requires understanding the errors'
characteristics (e.g., worst-case behavior, statistical properties).
Unfortunately, we observe that proprietary on-die Error-Correcting Codes (ECC)
used in modern memory chips introduce new challenges to efficient error
mitigation by obfuscating CPU-visible error characteristics in an unpredictable,
ECC-dependent manner.
In this dissertation, we experimentally study memory errors, examine how on-die
ECC obfuscates their statistical characteristics, and develop new testing
techniques to overcome the obfuscation through four key steps. First, we
experimentally study DRAM data-retention error characteristics to identify the
challenges inherent in understanding and mitigating memory errors that are
related to technology scaling. Second, we study how on-die ECC affects these
characteristics to develop Error Inference (EIN), a new statistical inference
methodology for inferring key details of the on-die ECC mechanism and the raw
errors that it obfuscates. Third, we examine the on-die ECC mechanism in detail
to understand exactly how on-die ECC obfuscates raw bit error patterns. Using
this knowledge, we introduce Bit Exact ECC Recovery (BEER), a new testing
methodology that exploits uncorrectable error patterns to (1) reverse-engineer
the exact on-die ECC implementation used in a given memory chip and (2) identify
the bit-exact locations of the raw bit errors responsible for a set of errors
that are observed after on-die ECC correction. Fourth, we study how on-die ECC
impacts error profiling and show that on-die ECC introduces three key challenges
that negatively impact profiling practicality and effectiveness. To overcome
these challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new
error profiling strategy that uses simple modifications to the on-die ECC
mechanism to quickly and effectively identify bits at risk of error. Finally, we
conclude by discussing the critical need for transparency in DRAM reliability
characteristics in order to enable DRAM consumers to better understand and adapt
commodity DRAM chips to their system-specific needs.
This dissertation builds a detailed understanding of how on-die ECC obfuscates
the statistical properties of main memory error mechanisms using a combination
of real-chip experiments and statistical analyses. Our results show that the
error characteristics that on-die ECC obfuscates can be recovered using new
memory testing techniques that exploit the interaction between on-die ECC and
the statistical characteristics of memory error mechanisms to expose physical
cell behavior. We hope and believe that the analysis, techniques, and results we
present in this dissertation will enable the community to better understand and
tackle current and future reliability challenges as well as adapt commodity
memory to new advantageous applications.
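
The obfuscation described in the abstract can be illustrated with a toy example. The sketch below is not the dissertation's method and does not model any real chip: it assumes a textbook Hamming(7,4) single-error-correcting code, whereas the on-die ECC in real memory chips is proprietary and operates on much longer words. The names `encode`, `decode`, and `post_correction_errors` are hypothetical helpers introduced only for this illustration. It shows that a single raw bit error is hidden completely, while two raw bit errors can trigger a miscorrection, so the errors seen after correction appear at positions that differ from the raw error positions.

```python
# Toy model (assumption): Hamming(7,4) with 1-indexed codeword positions.
# Parity bits sit at positions 1, 2, 4; data bits at positions 3, 5, 6, 7.
DATA_POS = (3, 5, 6, 7)
PARITY_POS = (1, 2, 4)


def encode(data):
    """Encode 4 data bits into a 7-bit codeword (list index 0 is unused)."""
    cw = [0] * 8
    for pos, bit in zip(DATA_POS, data):
        cw[pos] = bit
    for p in PARITY_POS:
        # Parity bit p covers every other position whose index has bit p set.
        cw[p] = sum(cw[i] for i in range(1, 8) if i & p and i != p) % 2
    return cw


def syndrome(cw):
    """XOR of the indices of all set bits; 0 for a valid codeword."""
    s = 0
    for i in range(1, 8):
        if cw[i]:
            s ^= i
    return s


def decode(cw):
    """Flip the bit the syndrome points at, then return the 4 data bits."""
    cw = list(cw)
    s = syndrome(cw)
    if s:
        cw[s] ^= 1  # a correct fix for one raw error, a miscorrection for two
    return [cw[pos] for pos in DATA_POS]


def post_correction_errors(data, raw_error_positions):
    """Return the codeword positions of data bits that read back wrong."""
    cw = encode(data)
    for pos in raw_error_positions:
        cw[pos] ^= 1  # inject a raw bit error
    read = decode(cw)
    return [DATA_POS[k] for k in range(4) if read[k] != data[k]]


data = [1, 0, 1, 1]
print(post_correction_errors(data, [5]))     # one raw error: fully hidden
print(post_correction_errors(data, [1, 2]))  # two raw errors in parity bits
print(post_correction_errors(data, [3, 5]))  # two raw errors in data bits
```

In this toy code, raw errors at parity positions 1 and 2 surface as a single error at data position 3, a bit that never failed physically, and raw errors at positions 3 and 5 surface as three data errors. Which position gets miscorrected depends on the code's parity-check matrix, which is why the pattern of post-correction errors carries information about the ECC function itself.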
Permanent link
https://doi.org/10.3929/ethz-b-000542542
Publication status
published
Contributors
Examiner: Mutlu, Onur
Examiner: Erez, Mattan
Examiner: Qureshi, Moinuddin
Examiner: Sridharan, Vilas
Examiner: Weis, Christian
Publisher
ETH Zurich
Subject
Memory Reliability; Memory Systems; Memory Errors; DRAM; Computer Engineering; Error Correction; Error Characterization; On-Die ECC; ECC; Error Profiling; Fault Tolerance; Memory Repair; Memory Scaling; System Reliability; Simulation
Organisational unit
09483 - Mutlu, Onur / Mutlu, Onur
Related publications and datasets
References: http://hdl.handle.net/20.500.11850/192147
References: http://hdl.handle.net/20.500.11850/365008
References: http://hdl.handle.net/20.500.11850/457320
References: http://hdl.handle.net/20.500.11850/515705