13–15 Nov 2018
America/Vancouver timezone

Task Migration at Scale Using CRIU

15 Nov 2018, 14:45
45m
Pavillion/Ballroom-AB (Sheraton Vancouver Wall Center)

Refereed talk LPC Main Track

Speakers

Mr Victor Marmol (Google), Mr Andy Tucker (Google)

Description

The Google computing infrastructure uses containers to manage millions of simultaneously running jobs in data centers worldwide. Although the applications are container-aware and designed to be resilient to failures, evictions caused by resource contention and scheduled maintenance can reduce overall efficiency because of the time required to rebuild complex application state. This talk discusses the ongoing use of the open-source Checkpoint/Restore in Userspace (CRIU) software to migrate container workloads between machines without loss of application state, allowing improvements in efficiency and utilization. We’ll present our experiences with using CRIU at Google, including ongoing challenges in supporting production workloads, the current state of the project, changes required to integrate with our existing container infrastructure, new requirements that arise from running CRIU at scale, and lessons learned from managing and supporting migratable containers. We hope to start a discussion about the future direction of CRIU and of task migration in Linux as a whole.

Primary authors

Mr Victor Marmol (Google) Mr Andy Tucker (Google)
