Proposals

Tolerating hardware device failures in software

One Line Summary

Improving reliability of Linux device drivers against hardware failures and hardware specification bugs

Abstract

In this talk, I will show that the Linux operating system is not robust against unreliable hardware, and describe a tool we have developed that detects the resulting problems and automatically patches them.

Most reliability and bug-fixing development tools target the driver-kernel interface and do not address problems caused by hardware. The device and driver interact through a protocol specified by the hardware. A failure to adhere to this protocol, or a flaw in its implementation, can lead to serious security and reliability issues because drivers run in a privileged domain in modern operating systems. Devices may deviate from their specification because of transient failures caused by device wear-out, electrical interference, or hardware bugs. Additionally, the driver may not even have been written from a correct specification. Many OS and device vendors publish guidelines on how device data should be handled inside driver or OS code. The Linux kernel mailing list contains numerous reports of drivers waiting forever, along with reminders from kernel experts to avoid infinite waits. Microsoft has also reported that a “hardened” driver reduces the incidence of unplanned reboots from 8% to 3%. Solutions designed for driver software bugs do not address the reliability problems caused by unreliable hardware. Hence, in this talk I will assess the nature of the hardware unreliability issues that plague modern drivers and describe our tools that address them.
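The infinite-wait pattern mentioned above can be illustrated with a small user-space sketch. This is not code from any real driver: `fake_device`, `read_status`, `STATUS_READY`, and `wait_ready_bounded` are hypothetical names, and the simulated device stands in for a memory-mapped status register. The sketch contrasts a polling loop that never terminates if the device is stuck with one that gives up after a fixed number of polls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bit; a real device defines its own register layout. */
#define STATUS_READY 0x1u

/* Simulated device: reports READY once more than `ready_after` reads have
 * happened, or never if `ready_after` is negative (a stuck device). */
struct fake_device {
    int reads;
    int ready_after;
};

/* Stand-in for reading a hardware status register. */
static uint32_t read_status(struct fake_device *dev)
{
    dev->reads++;
    if (dev->ready_after >= 0 && dev->reads > dev->ready_after)
        return STATUS_READY;
    return 0;
}

/*
 * Unhardened pattern (shown only as a comment, because it hangs when the
 * device never sets READY):
 *
 *     while (!(read_status(dev) & STATUS_READY))
 *         ;
 */

/* Bounded pattern: polls at most `max_polls` times and reports failure
 * instead of waiting forever. */
static bool wait_ready_bounded(struct fake_device *dev, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (read_status(dev) & STATUS_READY)
            return true;
    }
    return false; /* caller should treat this as a device failure */
}
```

With a device that becomes ready after three reads, `wait_ready_bounded(&dev, 10)` returns true on the fourth read. With a stuck device, it returns false after ten reads rather than hanging. A real driver would typically bound the wait by elapsed time rather than by a poll count.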

To solve this problem, we have developed Carburizer, a code manipulation tool with an optional runtime that automatically hardens drivers. A hardened driver is one that can survive the failure of its device and, if possible, return the device to full function. Carburizer consists of a static analysis component and an optional runtime. The static analysis component runs on commodity drivers and detects where a driver uses data from the device in critical control or data paths, where a corrupt value generated by the device could crash or hang the system. We categorize such uses as “hardware dependence bugs”. Carburizer also repairs this code, patching it to insert the necessary bounds, range, and timeout checks on device data before its risky use. Additionally, Carburizer finds code paths where a driver detects that the device has malfunctioned, indicated by returning an error as a result of a hardware action, and inserts an error reporting statement if the code does not already include one. Error reports about device failures help central fault management systems diagnose system failures and save debugging time. The result of the static analysis phase is a hardened binary driver that is resistant to hardware failures. Furthermore, the optional runtime component guards against stuck or missing interrupts by monitoring driver execution and device responses.
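As a rough illustration of the kind of check described above, the following user-space sketch validates a device-supplied array index before use and emits an error report when the value is out of range. It is a hand-written example under stated assumptions, not Carburizer's actual output: `fake_nic`, `rx_index_reg`, `RING_SIZE`, and `hardened_rx` are hypothetical names, and `fprintf` stands in for a kernel logging call.

```c
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8 /* hypothetical receive ring length */

/* Simulated device state; rx_index_reg plays the role of a register whose
 * value the driver cannot trust. */
struct fake_nic {
    uint32_t rx_index_reg;
    int ring[RING_SIZE];
    int error_reports; /* counts how many failures were reported */
};

/*
 * Unhardened pattern (comment only): indexing with an unchecked device
 * value can read out of bounds if the device returns garbage.
 *
 *     *out = nic->ring[nic->rx_index_reg];
 */

/* Hardened pattern: bounds-check the device value, report the failure,
 * and return an error instead of touching memory outside the ring. */
static int hardened_rx(struct fake_nic *nic, int *out)
{
    uint32_t idx = nic->rx_index_reg;

    if (idx >= RING_SIZE) {
        nic->error_reports++;
        fprintf(stderr, "fake_nic: bad rx index %u from device\n",
                (unsigned)idx);
        return -1;
    }
    *out = nic->ring[idx];
    return 0;
}
```

When `rx_index_reg` holds a valid index, `hardened_rx` copies the ring entry and returns 0. When the register holds an out-of-range value such as 4096, it leaves `*out` untouched, increments `error_reports`, and returns -1, so the caller can attempt recovery.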

We implemented Carburizer and found 992 hardware dependence bugs in the Linux 2.6.18.8 driver tree. We also found approximately 1100 cases where the driver was missing error reporting information. The Carburizer runtime imposed less than half a percent of CPU overhead compared to a regular system. When we re-ran the analysis on a newer Linux kernel (2.6.34), we found 1120 instances of hardware dependence bugs, indicating that the problem persists.

Tags

hardware failures, hardware bugs, code patching

Presentation Materials

slides

Speaker

  • Asim Kadav

    University of Wisconsin-Madison

    Biography

    Asim Kadav is a 4th year PhD student at the University of Wisconsin-Madison. At Wisconsin, he has worked on improving the reliability and functionality of device drivers in modern operating systems.