Remus - Transparent HA for Xen VMs with Stateful Failover

One Line Summary

Remus is an HA system for Xen VMs that preserves the complete guest OS runtime state (including network connections) upon failover.

Abstract

What’s missing in current HA systems?

Current state-of-the-art HA systems employ storage mirroring or active-standby techniques to recover from failures. Storage mirroring solutions require a “cold boot” of the VM and its applications, and they put the onus on applications to ensure “consistent recovery”. Downtime ranges from a few minutes to hours. Active-standby techniques don’t require a cold boot of VMs, but all existing client connectivity is lost. Replication solutions like MySQL binlog can lose one or more transactions. In general, active-standby techniques are more suitable for completely “stateless” services.

A stateful, consistency-preserving and transparent HA solution is a luxury that only a few can afford (e.g. proprietary solutions like VMware FT).

What does Remus offer?

Remus offers a totally transparent HA solution for Xen VMs based on a shared-nothing storage model. It takes checkpoints of a VM at high frequency (20-40 checkpoints/second) and asynchronously replicates them to a backup host. Network output from the primary is buffered until the checkpoint is committed at the backup. Disk writes are replicated to the backup, buffered there during a checkpoint interval, and flushed at checkpoint commit.

a) On failover, the VM at the backup machine “resumes” execution from the last consistent checkpoint, as though the failure never happened. All runtime OS state, including active TCP connections, is preserved. Ongoing transactions proceed as usual, with no loss of consistency in memory, network state or disk. At most, the clients’ TCP stacks see packet loss and retransmit the lost packets.

b) Remus is integrated with DRBD (via a newly added checkpoint replication protocol), so we can leverage DRBD’s efficient storage resynchronization capabilities without any downtime to VMs.
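
The checkpoint, buffer and commit cycle described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Remus code: the class names (Primary, Backup) and their methods are hypothetical, VM state is reduced to a dictionary, and replication is shown as a synchronous call, whereas Remus replicates asynchronously. It shows why a client never observes output from state that the backup does not hold.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup host: holds the last committed checkpoint."""
    committed_state: dict = field(default_factory=dict)
    committed_disk: list = field(default_factory=list)
    pending_disk: list = field(default_factory=list)

    def receive_disk_write(self, block):
        # Replicated writes are buffered during the checkpoint interval.
        self.pending_disk.append(block)

    def commit(self, state):
        # Checkpoint commit: adopt the new state, flush buffered writes.
        self.committed_state = dict(state)
        self.committed_disk.extend(self.pending_disk)
        self.pending_disk.clear()

    def failover(self):
        # Drop writes from the unfinished interval and resume execution
        # from the last committed checkpoint.
        self.pending_disk.clear()
        return dict(self.committed_state)


@dataclass
class Primary:
    """Hypothetical primary host running the protected VM."""
    backup: Backup
    state: dict = field(default_factory=dict)
    net_buffer: list = field(default_factory=list)
    released: list = field(default_factory=list)

    def run(self, key, value, packet, block):
        # The VM executes ahead, but its network output stays buffered.
        self.state[key] = value
        self.net_buffer.append(packet)
        self.backup.receive_disk_write(block)

    def checkpoint(self):
        # Replicate state, then release packets only after the commit.
        self.backup.commit(self.state)
        self.released.extend(self.net_buffer)
        self.net_buffer.clear()


backup = Backup()
primary = Primary(backup)
primary.run("txn", 1, "ack-1", "block-1")
primary.checkpoint()
primary.run("txn", 2, "ack-2", "block-2")  # primary fails before checkpointing
resumed = backup.failover()
print(resumed, primary.released, backup.committed_disk)
# prints: {'txn': 1} ['ack-1'] ['block-1']
```

In this sketch the second transaction is lost together with its acknowledgement ("ack-2" was never released), so the client sees nothing inconsistent with the state the backup resumes from and simply retransmits.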

What do Linux HA tools offer?

A robust failure detection service, split-brain detection, fencing, a quorum service and a cluster resource management stack. These features make up an excellent HA toolstack, and they are features that Remus lacks.

Can we get the best of both worlds?

a) Leverage Remus for efficient replication of VMs and recover all runtime state without loss of consistency or client connectivity.

b) Leverage features of time-tested HA systems like Pacemaker, Heartbeat, etc. for smart failure detection, network partition handling and resource/node fencing.

Tags

virtualization, xen, High Availability, Fault Tolerance

Speaker

    Shriram Rajagopalan

    CS Department, University of British Columbia, Vancouver

    Biography

    Shriram Rajagopalan is a PhD student at the University of British Columbia, Vancouver. His research primarily focuses on high availability and fault tolerance for virtual machines. He is also the maintainer of the Remus HA system, currently integrated into Xen.