This is a continuation of the previous post on the basics of DRAM technology. In this post we will look into DDR error types, their sources, and how they are mitigated in modern computer systems.
DDR Error types and sources
When talking about bit errors in DRAM, there are two types of errors. Hard errors are caused by physical factors, such as excessive temperature variation, voltage stress, or physical stress on the memory bits, and are typically non-correctable. Soft errors are random bit flips, typically associated with alpha particle radiation or solar winds, and are generally correctable.
Row hammer issue: This is a very specific type of error source that deserves its own discussion. An access to one memory address should not have unintended side effects on data stored in other addresses. However, as process technology scales down to smaller dimensions, memory chips become more vulnerable to disturbance, a phenomenon in which different memory cells interfere with each other’s operation. Repeatedly reading from the same address in DRAM can corrupt data in nearby addresses. Specifically, when a DRAM row is opened (i.e., activated) and closed (i.e., pre-charged) repeatedly (i.e., hammered) enough times within a DRAM refresh interval, one or more bits in physically adjacent DRAM rows can be flipped to the wrong value. This DRAM failure mode is now popularly called row hammer [10][12].
In general, disturbance errors occur whenever there is a strong enough interaction between two circuit components (e.g., capacitors, transistors, wires) that should be isolated from each other. Depending on which component interacts with which other component and how they interact, many different modes of disturbance are possible. Among them, Kim et al. [11] identify one particular disturbance mode that affects commodity DRAM chips. When a wordline’s voltage is toggled repeatedly, some cells in nearby rows leak charge at a much faster rate than others. Such vulnerable cells, if disturbed enough times, cannot retain enough charge for even 64 ms, the time interval at which they are refreshed. Ultimately, this leads to the cells losing data and experiencing disturbance errors [10].
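To make the "enough times within a refresh interval" condition concrete, here is a minimal Python sketch of a toy model. It only counts row activations per 64 ms window and flags the physically adjacent rows once a row crosses a threshold. The function name `rows_at_risk` and the `hammer_threshold` parameter are hypothetical, and the threshold is an illustrative input, not a measured value for any real DRAM part.

```python
from collections import Counter

REFRESH_INTERVAL_MS = 64.0  # refresh interval quoted in the text


def rows_at_risk(activations, hammer_threshold, num_rows):
    """Return the rows adjacent to any row that was activated at least
    `hammer_threshold` times inside a single refresh window.

    `activations` is an iterable of (time_ms, row) pairs.
    `hammer_threshold` is a hypothetical, illustrative value.
    """
    per_window = Counter()
    for time_ms, row in activations:
        window = int(time_ms // REFRESH_INTERVAL_MS)
        per_window[(window, row)] += 1

    victims = set()
    for (_window, row), count in per_window.items():
        if count >= hammer_threshold:
            # Only the immediate neighbours are modelled here.
            for neighbor in (row - 1, row + 1):
                if 0 <= neighbor < num_rows:
                    victims.add(neighbor)
    return sorted(victims)
```

The same number of activations spread over many refresh windows flags nothing in this model, which mirrors why the refresh interval matters: each refresh restores the charge of the victim cells before enough disturbance accumulates.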
Error mitigation: Memory RAS features
DDR4-class memory has a lot of performance and reliability knobs. Broadly, these controls fall into two categories: those present in the memory modules made by different manufacturers (onboard) and those present in the system (offboard). For example, Cisco UCS systems expose a variety of these settings. The onboard settings vary between manufacturers, while the offboard settings vary between system types, such as Intel, AMD, and ARM architectures.
We will try to understand some of these options and their impact on overall system behavior, such as power and performance.
Together, these two categories of settings and features improve the end-user experience through the platform’s ability to recover from consuming bad data, to detect a bad instruction and retry the transaction in an attempt to recover [1], and to mitigate security vulnerabilities.
Memory Mirroring: This is conceptually simple but fairly expensive in terms of hardware cost. It involves mirroring all DIMMs, so that in the event of a DIMM failure the server keeps running. It is outmatched only by the more extreme triple-redundant quorum/voting systems used on spaceflight computers. Mirroring is typically considered only for mission-critical systems in extremely difficult-to-reach places (submarines, mines, oil rigs, etc.).
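The mirroring idea, and the triple-redundant voting it is compared with, can be sketched in a few lines of Python. This is a toy software model under stated assumptions: the class `MirroredMemory` and the function `majority_vote` are hypothetical names, and a real memory controller does all of this in hardware.

```python
class MirroredMemory:
    """Toy model: every write goes to two copies; reads fail over."""

    def __init__(self):
        self.copies = [{}, {}]         # two mirrored address -> value stores
        self.failed = [False, False]   # set True to simulate a DIMM failure

    def write(self, address, value):
        for copy, is_failed in zip(self.copies, self.failed):
            if not is_failed:
                copy[address] = value

    def read(self, address):
        # Serve the read from the first copy that is still healthy.
        for copy, is_failed in zip(self.copies, self.failed):
            if not is_failed:
                return copy[address]
        raise RuntimeError("both mirror copies have failed")


def majority_vote(a, b, c):
    """Bitwise 2-of-3 vote across three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)
```

The hardware cost mentioned above shows up directly in the model: two full copies are stored for every address, so only half of the installed capacity is usable.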
SDDC, Single Device Data Correction: The x4 SDDC is an ECC algorithm designed to recover from a single DRAM chip failure of the data signals. x4 SDDC can be configured to correct errors in x4 chips or to correct in x8 chips. Data or data pin errors in the same chip are correctable. Double errors across two chips are detectable. The SxEC-DxED algorithm is similar to SEC-DED (x = number of bits, 4 or 8) [4].
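For readers unfamiliar with the SEC-DED code that SxEC-DxED is compared to, here is a minimal Python sketch of the textbook bit-level version: an extended Hamming (8,4) code that corrects any single bit error and detects any double bit error. It is not the x4 SDDC algorithm itself, which works on 4- or 8-bit symbols; the function names `encode` and `decode` are chosen for illustration only.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # Hamming positions that hold the data bits


def encode(nibble):
    """Encode a 4-bit value into 8 bits (index 0 is the overall parity)."""
    bits = [0] * 8
    for shift, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> shift) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1  # makes the parity of all 8 bits even
    return bits


def decode(bits):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: a single error at index `syndrome`
        # (index 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = sum(bits[pos] << shift for shift, pos in enumerate(DATA_POSITIONS))
    return nibble, status
```

SDDC applies the same correct-one, detect-two idea at the granularity of a whole DRAM device rather than a single bit.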
Out of the normal 18 memory devices on a DIMM, you keep 1 device for CRC and 1 device for parity. If one of the devices fails, its data can be reconstructed. This is called single-device data correction (SDDC). Think of this as a bit like RAID 4 (dedicated parity device), with checksums also stored on a dedicated device rather than with the block of data. Note that a +1 option effectively keeps a “hot spare” device, so that after one failure is mitigated the DIMM can tolerate another.
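The RAID 4 analogy can be illustrated with a minimal Python sketch. It assumes 16 data devices that each contribute one 4-bit nibble, plus a dedicated parity device holding the XOR of all of them, and it assumes the failed device has already been identified (the role the checksum device plays in the analogy). Real SDDC implementations typically use more elaborate symbol-based codes; the names below are hypothetical.

```python
from functools import reduce
from operator import xor

DATA_DEVICES = 16  # 18 devices minus one CRC and one parity device, per the text


def parity_nibble(nibbles):
    """XOR of the nibbles from all data devices (the parity device's content)."""
    return reduce(xor, nibbles, 0)


def reconstruct(nibbles, parity, failed_index):
    """Rebuild the nibble of the failed device from the survivors and parity.

    Whatever value sits at `failed_index` is ignored, because that device
    is assumed to be returning garbage.
    """
    survivors = [n for i, n in enumerate(nibbles) if i != failed_index]
    return reduce(xor, survivors, parity)
```

A short usage example: with `data = [(7 * i + 3) & 0xF for i in range(DATA_DEVICES)]` and `p = parity_nibble(data)`, overwriting any one entry of `data` and calling `reconstruct(data, p, that_index)` returns the original nibble.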
DDDC, Double Device Data Correction: By combining the single error correction and double error detection (SECDED) capabilities of two ECC-enabled DIMMs in a lockstep layout, their single-device data correction (SDDC) nature can be extended into double-device data correction (DDDC), providing protection against the failure of any single memory chip [5]. By combining two x4 DIMMs into the same memory channel, you can run a double parity scheme across both devices.
The approach forms a part of Intel’s lockstep memory architecture. The downsides of Intel’s lockstep memory layout are a reduction in the effectively usable amount of RAM (in the case of a triple-channel memory layout, the maximum amount of memory reduces to one third of the physically available maximum) and reduced performance of the memory subsystem [5].
Address Range Partial Memory Mirroring: This is an Intel-specific technology whose implementation varies somewhat depending on the OEM. Unlike DIMM mirroring (which is transparent), it requires an OS-to-firmware interface so that the OS is aware of the mirrored range. For example, this is what the vSphere reliable memory feature enables. Under the hood, kernel processes flagged to use this feature are placed in the mirrored memory and are protected up to and including a full DIMM failure. This feature requires higher-end Intel Xeon processors [13].
Adaptive Double DRAM Device Correction (ADDDC): Intel® Xeon® processors introduced a new approach to managing errors that a DDR4 DRAM DIMM may induce through the life of the product. ADDDC is deployed at runtime to dynamically map out the failing DRAM device and continue to provide SDDC ECC coverage on the DIMM, translating to longer DIMM longevity. The operation occurs at the fine granularity of DRAM Bank and/or Rank to have minimal impact on overall system performance [1].
With the advent of ADDDC, the memory subsystem is always configured to operate in performance mode. When the number of corrections on a DRAM device reaches the targeted threshold value, the UEFI runtime code helps place the identified failing DRAM region adaptively in lockstep mode, where the failing region of the DRAM device is mapped out of ECC. Once in ADDDC, cache line ECC continues to cover single DRAM (x4) error detection and applies a correction algorithm to the nibble. Depending on the processor SKU, each DDR4 channel supports one to two regions that can manage one or two faulty DRAMs, at Bank and/or full Rank granularity. Because the operation is dynamic, the performance implications of lockstep operation become material only after a DRAM device is detected to be failing. The overall lockstep impact on system performance is now a function of the number of bad DRAM devices on the channel, with the worst-case scenario of two bad Ranks on every DDR4 channel [1].
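The threshold-driven behavior described above can be sketched as a small Python model. This is an illustration only, not Intel's implementation: the class `AdddcChannel`, the `threshold` value, and the region keys are all hypothetical. The `max_regions` limit stands in for the "one to two regions" per channel mentioned in the text.

```python
class AdddcChannel:
    """Toy model of one DDR4 channel that counts corrected errors per region
    and moves a region into lockstep once a threshold is reached."""

    def __init__(self, threshold, max_regions=2):
        self.threshold = threshold       # hypothetical correction threshold
        self.max_regions = max_regions   # regions the channel can manage
        self.corrected = {}              # region -> corrected error count
        self.lockstep_regions = []       # regions currently in lockstep

    def record_corrected_error(self, region):
        """Count one corrected error; return True if this event moved
        `region` into lockstep mode."""
        if region in self.lockstep_regions:
            return False
        self.corrected[region] = self.corrected.get(region, 0) + 1
        if (self.corrected[region] >= self.threshold
                and len(self.lockstep_regions) < self.max_regions):
            self.lockstep_regions.append(region)
            return True
        return False
```

In this model the channel runs in performance mode until a region crosses the threshold, so the lockstep cost is paid only for regions that have actually been identified as failing.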
Advanced Error Detection and Correction (AEDC): AEDC improves the fault coverage within the core execution engine by using proprietary residue code fault-detection checking to identify and correct errors the processor may encounter in the internal pipelines of the execution engine (arrays and logic). AEDC will attempt to correct the fault by retrying the instruction. A successfully corrected retry is considered a corrected event; otherwise, a fatal MCERR is logged and signaled. AEDC technology in the processor is self-contained. It uses the existing error signaling and logs to flag errors, and needs no special assistance from the operating system to become operational [1].
Post Package Repair (PPR): PPR [7] is the capability of the DDR controller to use spare cells on the DDR silicon if some of the cells get damaged during installation on the motherboard. The feature repairs a failing memory location on a DIMM by disabling the location/address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.
Previously, this functionality was limited to the manufacturing process. Certain correctable memory errors will result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). PPR can be invoked by the BIOS memory training code or by ADDDC.
There should not be a system performance impact due to PPR.
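The row substitution that PPR performs can be pictured with a minimal Python sketch. It models only the address redirection from a failing row to a spare one; the actual repair happens inside the DRAM device. The class name `BankWithSpares` and its methods are hypothetical.

```python
class BankWithSpares:
    """Toy model of a DRAM bank with a limited pool of spare rows."""

    def __init__(self, spare_rows):
        self.free_spares = list(spare_rows)  # spare rows not yet used
        self.remap = {}                      # failing row -> spare row

    def repair(self, failing_row):
        """Redirect `failing_row` to a spare; return False if none is left."""
        if failing_row in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[failing_row] = self.free_spares.pop(0)
        return True

    def resolve(self, row):
        """Return the physical row that actually serves an access to `row`."""
        return self.remap.get(row, row)
```

Because the redirection is a fixed lookup rather than extra work on each access, the model is consistent with the statement that PPR should not affect system performance. The finite `free_spares` pool reflects the limited number of spare rows per device.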
Power and performance consequences
Almost every method to mitigate bit errors on DRAM, whether hard or soft, has a consequence on the overall system power and performance characteristics.
DIMMs per channel impact
We can configure systems to be 1DPC, 2DPC or 3DPC (one, two or three DIMMs per channel). Each step adds more memory to the system. More memory means more power consumption (more devices that are active or in standby at any given time) and greater capacitive loading on the data lines, so in terms of performance 1DPC is better than 2DPC, which in turn is better than 3DPC.
Correction mechanisms like ADDDC
Algorithms such as ADDDC can be considered system software features. Depending on how such algorithms are designed, they can affect data locality or require more frequent refreshes. These design factors ultimately determine the algorithms' impact on system power and performance.
A higher refresh rate can increase power consumption. Because memory that is being refreshed cannot serve requests at the same time, more frequent refreshes can also degrade system performance when memory access is already limited.
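A rough way to reason about the refresh cost is to estimate the fraction of time a rank spends refreshing. The Python sketch below assumes the refresh window is split into evenly spaced refresh commands, each of which occupies the rank for a fixed time. Apart from the 64 ms window quoted earlier, the numbers are illustrative assumptions (8192 refresh commands per window and 350 ns per command), not values taken from this article; the function name is hypothetical.

```python
def refresh_overhead(refresh_window_ms, refresh_commands, t_rfc_ns):
    """Fraction of time spent refreshing.

    refresh_window_ms: time within which every row must be refreshed
    refresh_commands:  refresh commands issued per window (assumed value)
    t_rfc_ns:          time one refresh command occupies the rank (assumed value)
    """
    # Average spacing between refresh commands, in nanoseconds.
    t_refi_ns = refresh_window_ms * 1e6 / refresh_commands
    return t_rfc_ns / t_refi_ns


baseline = refresh_overhead(64, 8192, 350)  # assumed, illustrative timings
doubled = refresh_overhead(32, 8192, 350)   # same rows refreshed twice as often
```

Under these assumptions the rank is busy refreshing for roughly 4.5% of the time, and halving the refresh window doubles that share, which is the power and performance cost the paragraph above refers to.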
Conclusion
This article covered the error handling aspects of DRAM, continuing the topic we started in the previous post. These areas are changing fast and have a tremendous influence on how data centers are designed and provisioned. As we move to DDR5 technology, these issues will only become more challenging, as well as more interesting, to handle.
References:
[1] https://software.intel.com/content/www/us/en/develop/articles/new-reliability-availability-and-serviceability-ras-features-in-the-intel-xeon-processor.html
[2] https://en.wikipedia.org/wiki/Double_data_rate
[3] http://www.intel.com/Assets/PDF/prodbrief/x58-product-brief.pdf
[4] https://www.intel.com/content/dam/doc/application-note/e7500-chipset-mch-x4-single-device-data-correction-note.pdf
[5] https://en.wikipedia.org/wiki/Lockstep_(computing)#MEMORY
[6] https://www.youtube.com/watch?v=kIpQXWTGnHA
[7] https://www.systemverilog.io/ddr4-basics
[8] https://en.wikipedia.org/wiki/DIMM
[9] https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
[10] https://arxiv.org/pdf/1904.09724.pdf
[11] Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” in ISCA, 2014
[12] https://en.wikipedia.org/wiki/Row_hammer
[13] https://software.intel.com/content/www/us/en/develop/articles/address-range-partial-memory-mirroring.html