Review of Memory RAS for Data Centers

Jiseong Lee, Min Joon Kim, Woo Seop Kim, Yong Sin Kim

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Multi-bit errors and downtime caused by uncorrectable errors (UEs) in dual in-line memory modules (DIMMs) have received great attention in data centers because of their high repair or replacement costs. These problems can be alleviated with error correction code (ECC) technology, which enables prompt correction when errors first occur and prediction of future UEs from recurring error patterns. The technologies for addressing errors can be categorized into reliability, availability, and serviceability (RAS), and need to be optimized against parameters such as accuracy, recall, F-measure, and cost reduction. This paper gives an overview of current RAS technologies and trends in memory for data centers, including an analysis of conventional ECC technologies and their recent developments. Because UEs cannot be completely eliminated with ECC, page offlining methods based on analysis of error patterns and characterization of UEs can be applied. Recent research on reducing the memory capacity wasted by UEs and page offlining has moved toward on-die ECC in high-bandwidth memory architectures.
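
    The abstract names accuracy, recall, and F-measure as parameters for tuning UE prediction. As a rough illustration only, the sketch below computes these metrics from a confusion matrix of predicted versus observed DIMM failures. The `Confusion` type, the `score` function, and the sample counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Hypothetical confusion matrix for a UE predictor (counts of DIMMs)."""
    tp: int  # predicted to fail, and a UE was observed
    fp: int  # predicted to fail, but no UE was observed
    fn: int  # predicted healthy, but a UE was observed
    tn: int  # predicted healthy, and no UE was observed


def score(c: Confusion) -> dict:
    """Return accuracy, precision, recall, and F1 for one confusion matrix."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up counts, used only to exercise the function.
    print(score(Confusion(tp=8, fp=2, fn=2, tn=88)))
```

    Because UEs are rare, accuracy alone can look high even for a predictor that misses most failures, which is one reason recall and F-measure are reported alongside it.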

    Original language: English
    Pages (from-to): 124782-124796
    Number of pages: 15
    Journal: IEEE Access
    Volume: 11
    Publication status: Published - 2023

    Bibliographical note

    Publisher Copyright:
    © 2013 IEEE.

    Keywords

    • Correctable error (CE)
    • availability
    • error correction code (ECC)
    • memory reliability
    • serviceability (RAS)
    • uncorrectable error (UE)

    ASJC Scopus subject areas

    • General Computer Science
    • General Materials Science
    • General Engineering
