Are the DIMMs going bad? Making sense of memory error messages. - Max Asböck

Biography

Max Asböck works at the Linux Technology Center at IBM. In the past he has worked on systems management software and has written a Linux device driver for a service processor. He currently works on Linux RAS (Reliability, Availability, Serviceability) for x86-based systems. He also supports customers and occasionally answers questions about the health of DIMMs.

Abstract

Linux reports memory errors through the machine check handler and the EDAC drivers. Easy access to memory error information is usually beneficial, because it gives better insight into the health of the hardware. However, memory error reporting in Linux in its current state still has a few issues. The main one is that memory errors are reported without relation to normal, expected DIMM error rates. Without knowledge of the hardware and its error thresholds, it is hard to judge from a number of reported corrected ECC memory errors whether a DIMM is faulty. Users who see a number of memory errors will likely ask "Are my DIMMs bad?" when in reality the DIMMs are fine and the error rate is normal. The author's experience shows that system administrators do indeed ask this very question.

Therefore it is useful to describe the current state of memory error reporting in Linux, to explain the problems that remain, and to point to possible enhancements. This talk intends to do so by describing the current infrastructure in detail:

  • The machine check handler: it uses the machine check architecture on x86_64 CPUs and periodically polls status registers for corrected errors. The mcelog user-space utility retrieves error reports from the kernel and displays or logs them in a readable format.
  • EDAC (Error Detection and Correction): EDAC drivers are chipset-specific and poll registers in the memory controller for errors. EDAC logs error messages through printk and reports error counts in sysfs. The edac-utils user-space package facilitates reading the sysfs files and helps associate the errors with specific DIMMs.
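
As a rough illustration of the sysfs interface mentioned above, the following sketch collects corrected-error counts per chip-select row. It assumes the csrow-based layout under /sys/devices/system/edac/mc (mcN/csrowM/ce_count); the exact layout depends on the kernel version and on the EDAC driver in use, and the function names here are hypothetical helpers, not part of edac-utils.

```python
from pathlib import Path


def read_count(path):
    """Return the integer stored in a sysfs counter file, or 0 if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return 0


def corrected_errors_by_csrow(edac_root="/sys/devices/system/edac/mc"):
    """Map 'mcN/csrowM' to its corrected-error count.

    Assumes the csrow-based sysfs layout; returns an empty dict when no
    EDAC driver is loaded or the directory does not exist.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for csrow in sorted(root.glob("mc*/csrow*")):
        if csrow.is_dir():
            key = f"{csrow.parent.name}/{csrow.name}"
            counts[key] = read_count(csrow / "ce_count")
    return counts
```

Calling corrected_errors_by_csrow() on a machine with an EDAC driver loaded would return a dictionary keyed by memory controller and row. Note that these are raw counts since the driver was loaded: on their own they say nothing about whether the rate is abnormal, which is exactly the problem described in the abstract.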

A number of issues and potential improvements shall be discussed as well:

  • The need for thresholds: Thresholds allow a better determination of DIMM failures, based on the number of errors seen within a certain time interval.
  • Relating memory errors to specific DIMMs: While both mcelog and edac-utils attempt to do this, there are still open issues.
  • Coordination with the BIOS: On many systems the BIOS performs predictive failure analysis (PFA) for memory. In that case, error reporting by EDAC might be redundant. EDAC also needs to make sure not to interfere with PFA in the BIOS.
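
The threshold idea from the first item can be sketched as a sliding-window counter: a DIMM is flagged only when more than a given number of corrected errors fall within a given time interval. This is a minimal sketch under stated assumptions, not an existing kernel or mcelog mechanism. The class name is hypothetical, and the limits in the usage note are placeholders, since real thresholds would have to come from the hardware vendor.

```python
from collections import deque


class CorrectedErrorThreshold:
    """Flag a DIMM when corrected errors exceed a limit within a time window.

    max_errors and window_seconds are illustrative parameters; window_seconds
    is assumed to be positive.
    """

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, count) pairs, oldest first
        self._total = 0         # sum of counts currently inside the window

    def record(self, timestamp, count=1):
        """Record `count` errors seen at `timestamp` (in seconds).

        Returns True if the total within the window now exceeds max_errors.
        Timestamps are assumed to be non-decreasing.
        """
        self._events.append((timestamp, count))
        self._total += count
        # Drop events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_count = self._events.popleft()
            self._total -= old_count
        return self._total > self.max_errors
```

With placeholder limits of 10 errors per 24 hours, CorrectedErrorThreshold(10, 86400) would stay quiet for a DIMM that logs a few corrected errors per day and report a problem only once errors cluster within the window.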

The talk hopes to show that with some improvements the current Linux memory error reporting mechanisms can be turned into a reliable instrument for DIMM failure prediction.