Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes


Date

2021

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

Improvements in main memory storage density are primarily driven by process technology shrinkage (i.e., technology scaling), which negatively impacts reliability by exacerbating various circuit-level error mechanisms. To compensate for growing error rates, both memory manufacturers and consumers develop and incorporate error-mitigation mechanisms that improve manufacturing yield and allow system designers to meet reliability targets. Developing effective error mitigation techniques requires understanding the errors' characteristics (e.g., worst-case behavior, statistical properties). Unfortunately, we observe that proprietary on-die Error-Correcting Codes (ECC) used in modern memory chips introduce new challenges to efficient error mitigation by obfuscating CPU-visible error characteristics in an unpredictable, ECC-dependent manner.

In this dissertation, we experimentally study memory errors, examine how on-die ECC obfuscates their statistical characteristics, and develop new testing techniques to overcome the obfuscation through four key steps. First, we experimentally study DRAM data-retention error characteristics to understand the challenges inherent in understanding and mitigating memory errors that are related to technology scaling. Second, we study how on-die ECC affects these characteristics to develop Error Inference (EIN), a new statistical inference methodology for inferring key details of the on-die ECC mechanism and the raw errors that it obfuscates. Third, we examine the on-die ECC mechanism in detail to understand exactly how on-die ECC obfuscates raw bit error patterns. Using this knowledge, we introduce Bit Exact ECC Recovery (BEER), a new testing methodology that exploits uncorrectable error patterns to (1) reverse-engineer the exact on-die ECC implementation used in a given memory chip and (2) identify the bit-exact locations of the raw bit errors responsible for a set of errors that are observed after on-die ECC correction. Fourth, we study how on-die ECC impacts error profiling and show that on-die ECC introduces three key challenges that negatively impact profiling practicality and effectiveness. To overcome these challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new error profiling strategy that uses simple modifications to the on-die ECC mechanism to quickly and effectively identify bits at risk of error. Finally, we conclude by discussing the critical need for transparency in DRAM reliability characteristics in order to enable DRAM consumers to better understand and adapt commodity DRAM chips to their system-specific needs.

This dissertation builds a detailed understanding of how on-die ECC obfuscates the statistical properties of main memory error mechanisms using a combination of real-chip experiments and statistical analyses. Our results show that the error characteristics that on-die ECC obfuscates can be recovered using new memory testing techniques that exploit the interaction between on-die ECC and the statistical characteristics of memory error mechanisms to expose physical cell behavior. We hope and believe that the analysis, techniques, and results we present in this dissertation will enable the community to better understand and tackle current and future reliability challenges as well as adapt commodity memory to new advantageous applications.
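The obfuscation the abstract describes can be illustrated with a toy example. The sketch below is not taken from the thesis: it assumes a textbook Hamming(7,4) single-error-correcting code as a stand-in for the proprietary on-die ECC, and the function names (`encode`, `syndrome`, `inject`, `decode`) are hypothetical. It shows that one raw bit error is silently corrected, while two raw bit errors can cause a miscorrection, so the errors visible after correction no longer match the raw errors in the cells.

```python
# Toy illustration only: a textbook Hamming(7,4) code, assumed here as a
# stand-in for a proprietary on-die ECC. Real chips use other, larger codes.

def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout by 1-based position: p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def inject(word, positions):
    """Return a copy of word with the bits at the given 1-based positions flipped."""
    out = list(word)
    for pos in positions:
        out[pos - 1] ^= 1
    return out

def decode(word):
    """Flip the bit the syndrome points at (if any), then return the 4 data bits."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1  # correct for one raw error, a miscorrection for two
    return [fixed[2], fixed[4], fixed[5], fixed[6]]

def count_errors(a, b):
    """Number of bit positions where a and b differ."""
    return sum(x != y for x, y in zip(a, b))

data = [1, 0, 1, 1]
codeword = encode(data)

# One raw error (position 5, i.e. d2): the decoder hides it completely.
after_one = decode(inject(codeword, [5]))

# Two raw errors (positions 3 and 5, i.e. d1 and d2): the syndrome points at
# position 6, so the decoder flips the previously correct d3 as well.
after_two = decode(inject(codeword, [3, 5]))

print("errors seen after 1 raw error: ", count_errors(after_one, data))
print("errors seen after 2 raw errors:", count_errors(after_two, data))
```

In this toy code, which bit gets miscorrected depends entirely on the parity-check layout, which is why the post-correction error pattern is described as ECC-dependent.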

Publication status

published

Contributors

Examiner : Mutlu, Onur
Examiner : Erez, Mattan
Examiner : Qureshi, Moinuddin
Examiner : Sridharan, Vilas
Examiner : Weis, Christian

Publisher

ETH Zurich

Subject

Memory Reliability; Memory Systems; Memory Errors; DRAM; Computer Engineering; Error Correction; Error Characterization; On-Die ECC; ECC; Error Profiling; Fault Tolerance; Memory Repair; Memory Scaling; System Reliability; Simulation

Organisational unit

09483 - Mutlu, Onur / Mutlu, Onur
