# Linux Plumbers Conference 2020

All times are US/Pacific.

## August 24-28, virtually

The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.  LPC 2020 will be held virtually August 24-28.  We are looking forward to seeing you online!

• Monday, 24 August

### Android MC (07:00-11:00, Microconference2/Virtual-Room)
### BOFs Session (07:00-11:00, BOF1/Virtual-Room)

• 07:00
BoF: RCU Implementation 45m

This is a gathering to discuss Linux-kernel RCU internals.

The exact topics depend on all of you, the attendees. In 2018, the focus was entirely on the interaction between RCU and the -rt tree. In 2019, the main gathering had me developing a trivial implementation of RCU on a whiteboard, coding-interview style, complete with immediate feedback on the inevitable bugs.

Come (virtually!) and see what is in store in 2020!

• 07:45
Break (15 minutes) 15m
• 08:00
BoF: KernelCI Unified Reporting in Action 45m

See KernelCI's new Unified Reporting in action: from multi-CI submission, through common dashboards and notification subscriptions, to report emails.

Explore and discuss the report schema and protocol. Learn how to send testing results, using your own, or example data. Help us accommodate your reporting requirements in the schema, database, dashboards and emails.

Bootstrap automatic sending of your system's results to the common database, with our help. Discuss future development, dive into implementation details, explore and hack on the code, together with the development team.

Speaker: Nikolai Kondrashov (Red Hat)
• 08:45
Break (15 minutes) 15m
• 09:00
LLVM BOF 45m

Come join us to work through issues specific to building the Linux kernel with LLVM. In addition to our Micro Conference, let's carve out time to follow up on unresolved topics from our meetup in February:

• Status of each architecture
• Call to action / how to get started / Evangelism
• Improving Documentation/
• Maintainer model
• Minimum supported versions of LLVM
• s390 virtualized testing
• Follow ups to Rust in Kernel MC session

Potential Attendees: Nathan Chancellor, Sedat Dilek, Masahiro Yamada, Sami Tolvanen, Kees Cook, Arnd Bergmann, Vasily Gorbik.

Speakers: Nick Desaulniers (Google), Behan Webster (Converse in Code Inc.)
• 09:45
Break (15 minutes) 15m
### Containers and Checkpoint/Restore MC (07:00-11:00, Microconference1/Virtual-Room)

The Containers and Checkpoint/Restore MC at Linux Plumbers is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.

Common discussion topics tend to be improvements to user namespaces, opening up more kernel functionality to unprivileged users, new ways to dump and restore kernel state, Linux Security Modules, and syscall handling.

### GNU Tools Track (07:00-11:00, GNU Tools track/Virtual-Room)

The GNU Tools track will gather all GNU tools developers to discuss current and future work, coordinate efforts, exchange reports on ongoing efforts, discuss development plans for the next 12 months, hold developer tutorials, and have any other related discussions.
The track will also include a Toolchain Microconference on Friday to discuss topics that are more specific to the interaction between the Linux kernel and the toolchain.

• 07:00
BoF: GDB 25m

GDB BoF, for GDB developers to meet and discuss any topic about the GDB development process.

Some proposed discussion topics are:

• The move of gdbsupport and gdbserver: is anything left? Is there anything more to move from gdb to gdbsupport?
• Replacing macros with a more C++-like API (like what has been started with the type system). Other C++-ification.
• Feedback on the new version numbering scheme.
• Large changes that people would like to pre-announce.
• Unsure how to approach the task of contributing an upstream port? This would be a good time to ask.

But really this is about what you want to discuss, so don't hesitate to propose more topics. Please notify the moderator (Simon Marchi) in advance if possible, just so we can get a good overview of what people want to talk about.

Speaker: Simon Marchi (EfficiOS)
• 07:25
Break (5 minutes) 5m
• 07:30
BoF: DWARF5/DWARF64 25m

Can we switch to DWARF5 by default for GCC 11? What benefits does that bring? Which features work and which don't (LTO/early-debug, split DWARF, debug-types, debug_[pub]names, etc.)? Which DWARF consumers support DWARF5 (and which don't), and which features can be enabled by default?

Additionally, some larger applications are hitting the limits of 32-bit offsets on some arches. Should we introduce a -fdwarf(32|64) switch, so users can generate DWARF32 or DWARF64? And/or are there other ways to get around the offset size limits that we should explore?

I'll provide an overview and preliminary answers/patches for the above questions and we can discuss what the (new) defaults should be and which other DWARF5/DWARF64 questions/topics should be answered and/or worked on.

Speaker: Mark Wielaard
• 07:55
Break (5 minutes) 5m
• 08:00
Lightning Talk: elfutils debuginfod http-server progress: clients and servers 10m

We will recap the elfutils debuginfod server from last year. It has been integrated into a number of consumers, learned to handle a bunch of distro packaging formats, and some public servers are already online.

Speakers: Frank Eigler, Aaron Merey (Red Hat)
• 08:10
Break (5 minutes) 5m
• 08:15
Lightning Talk: Teaching GraalVM DWARFish : Debugging Native Java using gdb 10m

Or is it DWARVish? Whatever, GraalVM Native implements compilation of a
complete suite of Java application classes to a single, complete, native
ELF image. It's much like how a C/C++ program gets compiled. Well,
except that the image contains nothing to explain how the bits were
derived from source types and methods or where those elements were
defined. Oh and the generated code is heavily inlined and optimized
(think gcc -O2/3). Plus many JDK runtime classes and methods get
substituted with lightweight replacements. So, a debugging nightmare.

Anyway, we have resolved the debug problem much like how you do with
C/C++ by generating DWARF records to accompany and explain the program
bits. So far, we have file and line number resolution, breakpoints,
single stepping & stack backtraces. We're now working on type names and
layouts and type, location & liveness info for heap-mapped
values/objects, parameters and local vars. I'll explain how we obtain
the necessary input from the Java compiler, how we model it as DWARF
records and how we test it for correctness using objdump and gdb itself.
By that point I will probably need to stop to take a breath.

Speaker: Andrew Dinn (Red Hat)
• 08:25
Break (5 minutes) 5m
• 08:30
The Light-Weight JIT Compiler Project 25m

Recently CRuby got a JIT based on GCC or Clang. Experience with the CRuby JIT confirmed the known fact that GCC does not fit all JIT usage scenarios well. Ruby needs a light-weight JIT compiler used as a tier 1 compiler or as a single JIT compiler. This talk will cover the experience of using GCC for the CRuby JIT and the drawbacks of GCC as a tier 1 JIT compiler. It will also cover the light-weight JIT compiler project's motivations, and the current and possible future states of the project.

• 08:55
Break (5 minutes) 5m
• 09:00
Project Ranger Update 25m

The Ranger project was introduced at the GNU Tools Cauldron last year. This project provides GCC with enhanced ranges and an on-demand range query API. By the time the conference takes place, we expect to have the majority of the code in trunk and available for other passes to utilize.

In this update, we will:

• Cover what has changed since last fall.
• Describe current functionality, including the API that is available for use.
• Plans going forward / what's in the pipe.

Speakers: Aldy Hernandez (Red Hat), Andrew MacLeod (Red Hat)
• 09:25
Break (5 minutes) 5m
• 09:30
Tutorial: GNU poke, what is new in 2020 55m

It's been almost a year since the nascent GNU poke [1] was first introduced to the public at the GNU Tools Cauldron 2019 in Montreal. We have been hacking a lot during these turbulent months, and poke is maturing fast, approaching a first official release scheduled for late summer.

In this talk we will first do a quick introduction to the program for the benefit of the folk still unfamiliar with it. Then we will show (and demonstrate) the many new features introduced during this last year: full support for union types, styled output, struct constructors, methods and pretty-printers, integral structs, the machine-interface, support for Poke scripts, and many more. Finally, we will be tackling some practical matters (what we call "Applied Pokology"[2]) useful for toolchain developers, such as how to write binary utilities in Poke, how to best implement typical C data structures in Poke type descriptions, and our plans to
integrate poke with other toolchain components such as GDB.

GNU poke is an interactive, extensible editor for binary data. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them.

[1] http://www.jemarch.net/poke

[2] http://www.jemarch.net/pokology

Speaker: Jose E. Marchesi (GNU Project, Oracle Inc.)
### LPC Refereed Track (07:00-11:00, Refereed Track/Virtual-Room)
• 07:00
A theorem for the RT scheduling latency (and a measuring tool too!) 45m

Defining Linux as an RTOS might be risky when we are outside of the kernel community. We know how and why it works, but we have to admit that the black-box approach used by cyclictest to measure the PREEMPT_RT’s primary metric, the scheduling latency, might not be enough for trying to convince other communities about the properties of the kernel-rt.

In the real-time theory, a common approach is the categorization of a system as a set of independent variables and equations that describe its integrated timing behavior. Two years ago, Daniel presented a model that could explain the relationship between the kernel events and the latency, and last year he showed a way to observe such events efficiently. Still, the final touch, the definition of the bound for the scheduling latency of the PREEMPT_RT using an approach accepted by the theoretical community was missing. Yes, it was.

Closing the trilogy, Daniel will present the theorem that defines the scheduling latency bound, and how it can be efficiently measured, not only as a single value but as the composition of the variables that can influence the latency. He will also present a proof-of-concept tool that measures the latency. In addition to the analysis, the tool can also be used to pinpoint the root cause of latency spikes, which is another practical problem faced by PREEMPT_RT developers and users. However, discussions about how to make the tool more developer-friendly are still needed, and that is the goal of this talk.

The results presented in this talk were published at ECRTS 2020, a top-tier academic conference on real-time systems, with reference to the discussions held at the previous edition of Linux Plumbers.

Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
• 07:45
Break 15m
• 08:00
Morello and the challenges of a capability-based ABI 45m

The Morello project is an experimental branch of the Arm architecture for evaluating the deployment and impact of capability-based security. This experimental ISA extension builds on concepts from the CHERI project from Cambridge University.

As experimentations with Morello on Linux are underway, this talk will focus on the pure-capability execution environment, where all pointers are represented as 128-bit capabilities with tight bounds and limited permissions. After a brief introduction to the Morello architecture, we will outline the main challenges to overcome for the kernel to support a pure-capability userspace. Beyond the immediate issue of adding a syscall ABI where all pointers are 128 bits wide, the kernel is expected to honour the restrictions associated with user capability pointers when it dereferences them, in order to prevent the confused deputy problem.

These challenges can be approached in multiple ways, with different trade-offs between robustness, maintainability and invasiveness. We will attempt at covering a few of these approaches, in the hope of generating useful discussions with the community.

Speaker: Kevin Brodsky (Arm)
• 08:45
Break 15m
• 09:00
Core Scheduling: Taming Hyper-Threads to be secure 45m

The core idea behind core scheduling is to have SMT (Simultaneous Multi Threading) on and make sure that only trusted applications run concurrently on the hardware threads of a core. If there is no group of trusting applications runnable on the core, we need to make sure that remaining hardware threads are idle while applications run in isolation on the core. While doing so, we should also consider the performance aspects of the system. Theoretically it is impossible to reach the same level of performance where all hardware threads are allowed to run any runnable application. But if the performance of core scheduling is worse than or the same as that without SMT, we do not gain anything from this feature other than added complexity in the scheduler. So the idea is to achieve a considerable boost in performance compared to SMT turned off for the majority of production workloads.

This talk is a continuation of the core scheduling talk and microconference at LPC 2019. We would like to discuss the progress made over the last year and the newly identified use cases for this feature.

Progress has been made on the performance aspects of core scheduling. A couple of patches addressing load-balancing issues with core scheduling have improved performance, and the stability issues in v5 have been addressed as well.

One area of criticism was that the patches were not addressing all cases where untrusted tasks can run in parallel. Interrupts are one scenario where the kernel runs on a CPU in parallel with a user task on the sibling. While two user tasks running on the core could trust each other, when an interrupt arrives on one CPU the situation changes: the kernel starts running in interrupt context, and the kernel cannot trust the user task running on the other sibling CPU. A prototype fix has been developed for this case. One gap that still exists is the syscall boundary. Addressing the syscall issue would be a big hit to performance, and we would like to discuss possible ways to fix it without hurting performance.

Lastly, we would also like to discuss the APIs for exposing this feature to userland. As of now, we use CPU controller CGroups. During the last LPC, we had discussed this in the presentation, but we had not decided on any final APIs yet. ChromeOS has a prototype which uses prctl(2) to enable the core scheduling feature. We would like to discuss possible approaches suitable for all use cases to use the core scheduling feature.

Speakers: Vineeth Remanan Pillai (DigitalOcean), Julien Desfossez (DigitalOcean), Joel Fernandes
• 09:45
Break 15m
• 10:00
Data-race detection in the Linux kernel 45m

In this talk, we will discuss data-race detection in the Linux kernel. The talk starts by briefly providing background on data races, how they relate to the Linux-kernel Memory Consistency Model (LKMM), and why concurrency bugs can be so subtle and hard to diagnose (with a few examples). Following that, we will discuss past attempts at data-race detectors for the Linux kernel and why they never reached production quality to make it into the mainline Linux kernel. We argue that a key piece to the puzzle is the design of the data-race detector: it needs to be as non-intrusive as possible, simple, scalable, seamlessly evolve with the kernel, and favor false negatives over false positives. Following that, we will discuss the Kernel Concurrency Sanitizer (KCSAN) and its design and some implementation details. Our story also shows that a good baseline design only gets us so far, and most important was early community feedback and iterating. We also discuss how KCSAN goes even further, and can help detect concurrency bugs that are not data races.

Tentative Outline:
- Background
-- What are data races?
-- Concurrency bugs are subtle: some examples
- Data-race detection in the Linux kernel
-- Past attempts and why they never made it upstream
-- What is a reasonable design for the kernel?
-- The Kernel Concurrency Sanitizer (KCSAN)
--- Design
--- Implementation
-- Early community feedback and iterate!
- Beyond data races
-- Concurrency bugs that are not data races
-- How KCSAN can help find more bugs
- Conclusion

Keywords: testing, developer tools, concurrency, bug detection, data races
References: https://lwn.net/Articles/816850/, https://lwn.net/Articles/816854/

### Networking and BPF Summit (07:00-11:00, Networking and BPF Summit/Virtual-Room)
• 07:00
Traceloop and BPF 45m

We will present traceloop, a tool for tracing system calls in cgroups or in containers using in-kernel Berkeley Packet Filter (BPF) programs.

Many people use the “strace” tool to synchronously trace system calls using ptrace. Traceloop similarly traces system calls but with low overhead (no context switches) and asynchronously in the background, using BPF and tracing per cgroup. We will show how it is integrated with Kubernetes via Inspektor Gadget.

Traceloop's traces are recorded in perf ring buffers (BPF_MAP_TYPE_PERF_EVENT_ARRAY) configured to be overwritable like a flight recorder. As opposed to “strace”, the tracing is permanently enabled on Kubernetes pods but rarely read, only on-demand, for example in case of a crash.

We will present both past limitations with their workarounds, and how new BPF features can improve traceloop. This includes:

• Lack of bpf_get_current_cgroup_id() on Linux < 4.18 and systems not using cgroup-v2. Workaround using the mount namespace id.
• New BPF programs can only be inserted in a PROG_ARRAY map from userspace, making synchronous updates more complicated.
• BPF ringbuffer to replace BPF perf ringbuffer to improve memory usage.

Speakers: Alban Crequy (Kinvolk), Kai Lüke (Kinvolk)
• 07:45
Packet mark in the Cloud Native world 45m

The 32-bit "mark" associated with the skb has served as a metadata exchange format for Linux networking subsystems since the beginning of the century. Over that time, the interpretation and reuse of the field has grown to encapsulate a wide range of networking use cases, expanding to touch everything from iptables, tc, xfrm, openvswitch, sockets, routing, to eBPF. In recent years, more than a dozen network control applications have been written in the Cloud Native space alone, many of which are using the packet mark in different ways to solve networking problems. The kernel facilities define no specific semantics to these bits, which leaves it up to these applications to co-ordinate to avoid incompatible mark usage.

This talk will explore use cases for sharing metadata between Linux subsystems in light of recent containerization trends, including but not limited to: application identity, firewalling, ip masquerade, network isolation, service proxying and transparent encryption. Beyond that, Cilium's particular usage will be discussed with approaches used to mitigate conflicts due to the inevitable overload of the mark.

Speaker: Joe Stringer (Cilium.io)
• 08:30
Break 30m
• 09:00
Evaluation of tail call costs in eBPF 45m

We would like to present the results of an estimation of the cost of tail calls between eBPF programs. This was carried out for two kernel versions, 5.4 and 5.5. The latter introduces an optimization that, under certain conditions, removes the retpoline mitigating Spectre flaws. The numbers come from two benchmarks, executed over our eBPF software stack. The first one uses the in-kernel testing facility BPF_PROG_TEST_RUN. The second one uses kprobes, network namespaces and iperf3 to get figures from a production-like environment. The conditions to trigger the optimization from kernel 5.5 were met in both cases, resulting in a drop in the cost of one tail call from 20-30 ns to less than 10 ns.

More recent techniques to estimate CPU time cost of eBPF programs would be covered, as well as other improvements to the measurement system. At Cloudflare we have production deployment of eBPF programs with multiple tail calls. Thus, estimating and limiting the cost of these is important from a business perspective. As a result, examples of strategies used or considered to limit costs associated with tail calls would be outlined in the presentation too.

The desired outcome from the discussion is to get feedback on the methods deployed, both for benchmarks and to limit tail calls.

As this work is part of an internship for a master thesis, a paper would be written with the relevant elements of the thesis.

This would be a relatively short presentation, 20 minutes long, including questions.

Speaker: Clément Joly (Cloudflare)
• 09:45

In the proposed talk I would like to discuss the opportunity to create a core for XDP program offloading from a guest to a host. The main goal here is to increase packet processing speed.

There was an attempt to merge XDP offloading for virtio-net, but that work is still in progress.
After the addition of XDP processing to the xen-netfront driver, a similar task has to be solved for Xen as well.
The vmxnet3 driver currently doesn't support XDP processing, but once it is added the same problem will have to be solved there.

Speaker: Mr Denis Kirjanov
### Real-time MC (07:00-11:00, Microconference3/Virtual-Room)
• Tuesday, 25 August

### BOFs Session (07:00-11:00, BOF1/Virtual-Room)
• 07:00
BoF: upstream drivers for open source FPGA SoC peripherals 45m

There are active open source projects such as LiteX which have developed IP (e.g. chip-level hardware design) needed for building an open source SoC. The common workflow is that this SoC would be synthesized into a bitstream and loaded into a FPGA. (Aside: there is also the possibility of using these IP modules in an ASIC, but the scenario of supporting fixed-in-silicon hardware peripherals is already well-established in Linux).

The scenario of an open source SoC in a FPGA raises a question:

What is the best trade-off between complexity in the hardware peripheral IP and the software drivers?

Open source SoC design is done in a Hardware Description Language (HDL) with Verilog, VHDL, SystemVerilog or even newer languages (Chisel, SpinalHDL, Migen). This means we have the source and toolchain necessary to regenerate the design.

LiteX [1] is a good example of an open source SoC framework; it provides IP for common peripherals like the DRAM controller, Ethernet, PCIe, SATA, SD card, video and more. A key design decision for these peripherals is the layout of their Control and Status Registers (CSRs). The hardware design and the software drivers must agree on the structure of these CSRs.

The Linux kernel drivers for LiteX are currently being developed out-of-tree [2]. A sub-project called Linux-on-LiteX-Vexriscv [3] combines the Vexrisv core (32-bit RISC-V), LiteX modules, and a build system which results in a FPGA bitstream, kernel and rootfs.

There is a long-running effort led by Mateusz Holenko of Antmicro to land the LiteX drivers upstream, starting with the LiteX SoC controller and LiteUART serial driver [4]. Recently, support for Microwatt, a POWER-based core from IBM, has been added to LiteX, and Benjamin Herrenschmidt has rekindled discussion [5] of how best to structure the LiteX CSRs and driver code for upstream. In addition, an experienced Linux graphics developer, Martin Peres, has jumped into the scene with LiteDIP [6]: "Plug-and-play LiteX-based IP blocks enabling the creation of generic Linux drivers. Design your FPGA-based SoC with them and get a (potentially upstream-able) driver for it instantly!"

Martin has a blog post that dives further into the issues I've tried to describe above: "FPGA: Why So Few Open Source Drivers for Open Hardware?" [7]

I think this BoF will be useful in accelerating the discussion that is happening on different mailing lists and hopefully bringing us closer to consensus.

[1] https://github.com/enjoy-digital/litex
[2] https://github.com/litex-hub/linux/commits/litex-vexriscv-rebase/drivers
[3] https://github.com/enjoy-digital/litex
[4] https://lkml.org/lkml/2020/6/4/303
[6] https://gitlab.freedesktop.org/mupuf/litedip/
[7] https://mupuf.org/blog/2020/06/09/FPGA-why-so-few-drivers/

Speaker: Mr Drew Fustini (BeagleBoard.org Foundation)
• 07:45
Break (15 minutes) 15m
• 08:00
BoF: ASI: Efficiently Mitigating Speculative Execution Attacks with Address Space Isolation 45m

Speculative execution attacks such as L1TF, MDS and LVI pose significant security risks to hypervisors and VMs. A complete mitigation for these attacks requires very frequent flushing of buffers (e.g., the L1D cache) and halting of sibling cores. The performance cost of such mitigations is unacceptable in realistic scenarios. We are developing a high-performance security-enhancing mechanism to defeat speculative execution attacks, which we dub Address Space Isolation (ASI). In essence, ASI is an alternative way to manage virtual memory for hypervisors, providing very strong security guarantees at a minimal performance cost. In the talk, we will discuss the motivation for this technique as well as the initial results we have.

• 08:45
Break (15 minutes) 15m
• 09:00
BoF: Synchronizing timestamps of trace events between host and guest VM 45m

Synchronization of kernel trace event timestamps between host and guest VM is a key requirement for analyzing the interaction between host and guest kernels. The task is not trivial, although both kernels run on the same physical hardware. There is a non-linear scaling of the guest clock, implemented intentionally by the hypervisor in order to simplify live guest migration to another host.
I'll briefly describe our progress on this task, using a PTP-like algorithm for calculating the trace-event timestamp offset. Any new ideas, comments and suggestions are highly welcome.

Speaker: Tzvetomir Stoyanov
• 09:45
Break (15 minutes) 15m
• 10:00
BoF: IPE (Integrity Policy Enforcement) LSM merge discussion 45m

Gather stakeholders from security, block, and VFS to discuss potential merging of the IPE LSM vs. integration with IMA.

Background:

• IPE: https://microsoft.github.io/ipe/

Speakers: James Morris, Mimi Zohar (IBM)
### GNU Tools Track (07:00-11:00, GNU Tools track/Virtual-Room)


• 07:00
BoF: Binutils 25m

A BoF meeting for folks interested in the GNU Binutils.
Possible topics for discussion:
* Should GOLD be dropped?
* Automatic changelog generation.
* Configuring without support for old binary formats (e.g. ihex, srec, tekhex, verilog).

Speaker: Nick Clifton
• 07:25
Break (5 minutes) 5m
• 07:55
Break (5 minutes) 5m
• 08:25
Break (5 minutes) 5m
• 08:30
Lightning Talk: Fuzzing glibc's iconv program 10m

A while back, I found myself triaging an iconv bug report that found hangs
in the program when run with certain inputs. Not knowing a lot about iconv
internals, I wrote a rudimentary fuzzer to investigate the problem, which
caught over 160 different input combinations that led to hangs and a clear
pattern hinting at the cause.

In this short talk, I'll share my experiences with fuzzing iconv and
eventually cleaning up some of the iconv front-end with a patch.

Speaker: Arjun Shankar (Red Hat)
• 08:40
Break (5 minutes) 5m
• 08:55
Break (5 minutes) 5m
• 09:25
Break (5 minutes) 5m
• 09:30
New frontiers in CTF linking: type deduplication 25m

Last year we introduced support for the Compact C Type Format (CTF) to the GNU toolchain and presented it at the last Cauldron.

Back then, the binutils side was only doing slow, non-deduplicating linking and format dumping, but things have moved on. The libctf library and ld in binutils have gained the ability to properly deduplicate CTF: the output CTF in linked ELF objects is now often smaller than the CTF in any input object file. The performance hit of deduplication is usually in the noise, or at least no more than a second or two (and there are still some easy performance wins to pick up).

The libctf API has also improved somewhat, with support for a number of missing features, improved error reporting, and a much-improved way to iterate over things in the CTF world.

This talk will provide an overview of the novel type deduplication algorithm used to reduce the size of CTF, with occasional diversions into the API improvements where necessary, and (inevitably) discussion of upcoming work in the area, solicitations of advice from others working on similar things, etc.

Speaker: Nick Alcock (Oracle Corporation)
• 09:55
Break (5 minutes) 5m
• 10:00
GCC's -fanalyzer option 25m

I'll be talking about the -fanalyzer static analysis option I added in GCC 10: I'll give an overview of the internal implementation, its current strengths and limitations, how I'm reworking it for GCC 11, and ideas for future directions.

Speaker: David Malcolm (Red Hat)
### Kernel Dependability & Assurance MC (07:00-11:00, Microconference2/Virtual-Room)
### LPC Refereed Track (07:00-11:00, Refereed Track/Virtual-Room)
• 07:00
Write once, herd everywhere 45m

With the Linux Kernel Memory Model (LKMM) introduced into the kernel, litmus tests have proven to be a powerful tool for analyzing and designing parallel code. More and more C litmus tests are being written, some of which have been merged into Linux mainline.

Actually, the herd tool behind LKMM has models for most mainstream architectures: litmus tests in asm code are supported. So in theory we can verify a litmus test in different versions (C and asm code), and this will help us 1) verify the correctness of LKMM and 2) test the implementation of parallel primitives on a particular architecture, by comparing the results of exploring the state spaces of the different versions of the litmus tests.

This topic will present some work to make it possible to translate between litmus tests (mostly C to asm code). The work provides an interface for architecture maintainers to supply their rules for the litmus translation; in this way, we can verify the consistency between LKMM and the implementation of parallel primitives, and this could also help new architectures provide parallel primitives consistent with LKMM.

This topic will give an overview of the translation, and hopefully prompt some discussion of the interface during or after the session.

Speaker: Boqun Feng
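As a concrete example of what gets translated, here is the classic message-passing litmus test in LKMM's C flavor (a variant of this test ships in the kernel's tools/memory-model suite); an asm version produced by such a translation would express the same test using a given architecture's store-release and load-acquire instructions.

```
C MP+pooncerelease+poacquireonce

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(y, 1);
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(y);
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)
```

herd7 explores all states allowed by the model and reports whether the `exists` clause (here, the forbidden reordering) is reachable; comparing that verdict between the C and asm versions is exactly the consistency check described above.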
• 07:45
Break 15m
• 08:00
Desktop Resource Management (GNOME) 45m

Graphical user sessions have been plagued by various performance-related issues. Sometimes these are simply bugs, but often enough issues arise because workstations are loaded with other tasks. In this case, high memory, IO or CPU use may cause severe latency issues for graphical sessions. In the past, people have tried various ways to improve the situation, from running without swap to heuristically detecting low-memory situations and triggering the OOM killer. These techniques may help in certain conditions but also have their limitations.

GNOME and other desktops (currently KDE) are moving towards managing all applications using systemd. This change in architecture also means that every application is placed into a separate cgroup. These can be grouped to separate applications from essential services and they can also be adjusted dynamically to ensure that interactive applications have the resources they need. Examples of possible interventions are allocating more CPU weight to the currently focused application, creating memory and IO latency guarantees for essential services (compositor) or running oomd to kill applications when there is memory pressure.
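As a hedged sketch of what such an intervention can look like, a systemd resource-control drop-in for an application scope might set properties like the following (the unit path and values are illustrative; the actual policy GNOME applies may differ):

```
# Hypothetical drop-in, e.g.
# ~/.config/systemd/user/app-org.gnome.Example@.scope.d/resources.conf
[Scope]
CPUWeight=200     # favor this application over the default weight of 100
MemoryLow=512M    # soft protection against reclaim under memory pressure
IOWeight=150      # higher IO priority than the default 100
```

`CPUWeight=`, `MemoryLow=` and `IOWeight=` are standard systemd.resource-control properties backed by the corresponding cgroup v2 controllers, which is what makes dynamic, per-application adjustment possible once every application lives in its own cgroup.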

The talk will look at what GNOME (and KDE) currently does in this regard and how well it is working so far. This may show areas where further improvements in the stack are desirable.

Speaker: Benjamin Berg
• 08:45
Break 15m
• 09:00
Configuring a kernel for safety critical applications 45m

For security, there are various projects which provide guidelines on how to configure a secure kernel - e.g., the Kernel Self Protection Project. In addition, there are security enhancements which have been added to the Linux kernel by various groups - e.g., grsecurity or the PaX security patch.
We are looking to define appropriate guidelines for safety enhancements to the Linux kernel. The session will focus on the following:
1. Define the use cases (primarily in automotive domain) and the need for safety features.
2. Define criteria for safe kernel configurations.
3. Define a preliminary proposal for a serious workgroup to define requirements for relevant safety enhancements.
Note that the emphasis is 100% technical, and not related in any way to safety assessment processes. I will come with an initial set of proposals, to be discussed and followed up.
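For illustration, these are the kinds of hardening options such guidelines typically discuss (real kernel config symbols, but emphatically not a vetted safety configuration; any safety-oriented set would need its own justification):

```
# Examples of hardening-related kernel options, for discussion only:
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_FORTIFY_SOURCE=y
CONFIG_HARDENED_USERCOPY=y
CONFIG_RANDOMIZE_BASE=y
# Safety-critical systems often prefer failing fast over continuing:
CONFIG_PANIC_ON_OOPS=y
```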

Speaker: Dr Elana Copperman (Mobileye)
• 09:45
Break 15m
• 10:00

First investigations about Kernel Address Space Isolation (ASI) were presented at LPC last year as a way to mitigate some CPU hyper-threading data leaks possible with speculative execution attacks (like L1 Terminal Fault (L1TF) and Microarchitectural Data Sampling (MDS)). In particular, Kernel Address Space Isolation aims to provide a separate kernel address space for KVM when running virtual machines, in order to protect against a malicious guest VM attacking the host kernel using speculative execution attacks.

https://www.linuxplumbersconf.org/event/4/contributions/277/

At that time, a first proposal for implementing KVM Address Space Isolation was available. Since then, new proposals have been submitted. The implementation has become much more robust, and it now provides a more generic framework which can be used to implement KVM ASI but also Kernel Page Table Isolation (KPTI).

Currently, RFC version 4 of Kernel Address Space Isolation is available. The proposal is divided into three parts:

• Part I: ASI Infrastructure and PTI

https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@oracle.com/
• Part II: Decorated Page-Table

https://lore.kernel.org/lkml/20200504145810.11882-1-alexandre.chartre@oracle.com/
• Part III: ASI Test Driver and CLI

https://lore.kernel.org/lkml/20200504150235.12171-1-alexandre.chartre@oracle.com/

This presentation will show the progress and evolution of the Kernel Address Space Isolation project, and detail the kernel ASI framework and how it is used to implement KPTI and KVM ASI. It will also discuss possible ways to integrate the project upstream, concerns about making changes in some of the nastiest corners of the x86 code, and improvements to kernel page table management, in particular page table creation and population.

Speaker: Alexandre Chartre (Oracle)
• 07:00 11:00
Networking and BPF Summit Networking and BPF Summit/Virtual-Room (LPC 2020)

### Networking and BPF Summit/Virtual-Room

#### LPC 2020

150
• 07:00
The way to d_path helper 45m

d_path is an eBPF tracing helper that returns the string with
the full path for a given 'struct path' object; it was requested
long ago by many people.

Along the way of implementing it, other features had to be added:

• compile-time BTF ID resolving

This allows kernel objects' BTF IDs to be used without runtime
resolution, saving a few cycles otherwise spent resolving them
during kernel startup, and introduces a single interface for
accessing such IDs.

• passing BTF ID + offset as a helper argument

This allows passing a helper argument that is defined via a parent
BTF object + offset, as for bpf_d_path (added in the following changes):

SEC("fentry/filp_close")
int BPF_PROG(prog_close, struct file *file, void *id)
{
...
ret = bpf_d_path(&file->f_path, ...

In this talk I'll show the implementation details of the d_path
helper and of both aforementioned features, and explain why they
are important for the d_path helper.

Speaker: Jiri Olsa
• 07:45
NetGPU 45m

This introduces a working proof-of-concept alternative to RDMA, implementing a zero-copy DMA transfer between the NIC and GPU, while still performing the protocol processing on the host CPU. A normal NIC/host memory implementation is also presented.

By offloading most of the data transfer from the CPU, while not needing to reimplement the protocol stack, this should provide a balance between high performance and feature flexibility.

This presentation would cover the changes needed across the kernel: mm support, networking queues, skb handling, protocol delivery, and a proposed interface for zero-copy RX of data which is not directly accessible by the host CPU. It would also solicit input on further API design ideas in this area.

A paper is planned. This proposal was originally submitted for the main track and was recommended for the networking track instead.

• 08:30
Break 30m
• 09:00
Multidimensional fair-share rate limiting in BPF 45m

As UDP does not have flood attack protections such as SYN cookies, we developed a novel fair-share ratelimiter in unprivileged BPF, designed for a UDP reverse proxy, that is capable of applying rate limits to specific traffic streams while minimizing the impact on others. To achieve this, we base our work on Hierarchical Heavy Hitters, which proposes a method to group packets on source and destination IP address, and we are able to substantially simplify the algorithm for our rate-limiting use case in order to allow for an implementation in BPF.
We further extend the concept of a hierarchy from IP addresses to ports, providing us with precise rate limits based on the 4-tuple.

Our approach is capable of rate limiting floods originating from single addresses or subnets, as well as reflection attacks, and applies limits as specifically as possible. To verify its performance, we evaluated the approach against different simulated scenarios.
The outcome of this project is a single library that can be activated on any UDP socket and provides flood protection out of the box.

Speakers: Jonas Otten (Cloudflare), Lorenz Bauer (Cloudflare)
• 09:45
BPF LSM (Updates + Progress) 45m

The BPF LSM or Kernel Runtime Security Instrumentation (KRSI) aims to provide an extensible LSM by allowing privileged users to attach eBPF programs to security hooks to dynamically implement MAC and Audit Policies.

KRSI was introduced at LSS-US 2019 and has since seen multiple interesting updates and triggered some meaningful discussions. The talk provides an update on:

• Progress in the mainline kernel, the ongoing discussions, and a recap of the
interesting discussions that were resolved.
• New infrastructure merged into BPF to support the BPF LSM use-case.
• Some optimisations that can improve the performance characteristics of the
currently existing LSM framework which would not only benefit KRSI
but also all other LSMs.

The talk showcases how the design has evolved over time, what trade-offs were considered, and what's coming after the initial patches are merged.

Speaker: Mr KP Singh
• 07:00 11:00
Scheduler MC Microconference1/Virtual-Room (LPC 2020)

### Microconference1/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
linux/arch/* MC Microconference3/Virtual-Room (LPC 2020)

### Microconference3/Virtual-Room

#### LPC 2020

150
• Wednesday, 26 August
• 07:00 11:00
BOFs Session BOF1/Virtual-Room (LPC 2020)

### BOF1/Virtual-Room

#### LPC 2020

150
• 07:00
BoF: Android MC BoF 45m

This is a placeholder for the Android MC follow-up BoF that should be scheduled to run 48 to 72 hours after the Android MC.

Speakers: John Stultz (Linaro), Todd Kjos (Google), Lina Iyer, Sumit Semwal, Karim Yaghmour (Opersys inc.)
• 07:45
Break (15 minutes) 15m
• 08:00
BoF: Show off your pets! 45m

It's not an evening social but pets are good. Stop by and show off your pets on video camera!

Speaker: Laura Abbott
• 07:00 11:00
GNU Tools Track GNU Tools track/Virtual-Room (LPC 2020)

### GNU Tools track/Virtual-Room

#### LPC 2020

150

The GNU Tools track will gather all GNU tools developers to discuss current and future work, coordinate efforts, exchange reports on ongoing efforts, discuss development plans for the next 12 months, hold developer tutorials, and have any other related discussions.
The track will also include a Toolchain Microconference on Friday to discuss topics that are more specific to the interaction between the Linux kernel and the toolchain.

• 07:00
Q&A: GCC Steering Committee, GLIBC, GDB, Binutils Stewards 25m

Question and Answer session and general discussion with members of the GCC Steering Committee, GLIBC Stewards, GDB Stewards, Binutils Stewards, and GNU Toolchain Fund Trustees.

Speaker: David Edelsohn (IBM Research)
• 07:25
Break (5 minutes) 5m
• 07:30
The LLVM/GCC BoF 25m

We had a panel-led discussion at last year's GNU Tools Cauldron and, more recently, at the FOSDEM LLVM Developer's room on improving cooperation between the GNU and LLVM projects. This year we are proposing an open-format BoF, particularly because, being part of LPC and a virtual conference, we may have more LLVM and GNU developers in the same (virtual) room.

At both previous sessions we explored the issues, but struggled to come up with concrete actions to improve cooperation. This BoF will attempt to find concrete actions that can be taken.

Speaker: Dr Jeremy Bennett (Embecosm)
• 07:55
Break (5 minutes) 5m
• 08:00
Lightning Talk: Accelerating machine learning workloads using new GCC built-ins 10m

Basic Linear Algebra Subprograms (BLAS) are used everywhere in machine learning and deep learning applications today. OpenBLAS is an optimized BLAS open source library used widely in AI workloads that implement algebraic operations for specific processor types.
This talk covers recent optimizations in the OpenBLAS library for the POWER10 processor. As part of this optimization, assembly code for matrix multiplication kernels in OpenBLAS is converted to C code using new compiler built-ins. A sample optimization of matrix multiplication for POWER hardware in OpenBLAS will be used to explain how the built-ins are used and to show the impact on application performance.

Speaker: Rajalakshmi S
• 08:10
Break (5 minutes) 5m
• 08:25
Break (5 minutes) 5m
• 08:55
Break (5 minutes) 5m
• 09:25
Break (5 minutes) 5m
• 09:55
Break (5 minutes) 5m
• 10:00
Update on the BPF support in the GNU Toolchain 25m

In 2019 Oracle contributed support for the eBPF (as of late renamed to just BPF) in-kernel virtual architecture to binutils and GCC. Since then we have continued working on the port, and recently sent a patch series upstream adding support for GDB and the GNU simulator.

This talk will describe this later work and other current developments, such as the gradual introduction of xbpf, a variant of BPF that removes most of BPF's many restrictions. xbpf was originally conceived as a way to ease debugging of the port itself and of BPF programs, but it can also be leveraged in non-kernel contexts that could benefit from a fully toolchain-supported virtual architecture.

Speaker: Jose E. Marchesi (GNU Project, Oracle Inc.)
• 07:00 08:00
LPC Refereed Track Refereed Track/Virtual-Room (LPC 2020)

### Refereed Track/Virtual-Room

#### LPC 2020

150
• 07:00
Recent changes in the kernel memory accounting (or how to reduce the kernel memory footprint by ~40%) 45m

Not long ago, memcg accounting used the same approach for all types of pages. Each charged page had a pointer to the memory cgroup in its struct page, and held a single reference to the memory cgroup, so that the memory cgroup structure was pinned in memory by all charged pages.

This approach was simple and nice, but it didn't work well for some kernel objects, which are often shared between memory cgroups. E.g. an inode or a dentry can outlive the original memory cgroup by far, because it can be actively used by someone else. Because there was no mechanism for ownership change, the original memory cgroup was pinned in memory, and only very heavy memory pressure could get rid of it. This led to the so-called dying memory cgroups problem: an accumulation of dying memory cgroups with uptime.

It was solved by switching to an indirect scheme, where slab pages didn't reference the memory cgroup directly, but used a memcg pointer in the corresponding slab cache instead. The trick was that the pointer could be atomically swapped to the parent memory cgroup. In combination with slab cache reference counters this solved the dying memcg problem, but made the corresponding code even more complex: dynamic creation and destruction of per-memcg slab caches required tricky coordination between multiple objects with different life cycles.

And the resulting approach still had a serious flaw: each memory cgroup had its own set of slab caches and corresponding slab pages. On a modern system with many memory cgroups, this resulted in poor slab utilization, which varied around 50% in my case. This made the accounting quite expensive: it almost doubled the kernel memory footprint.

To solve this problem, the accounting has to be moved from the page level to the object level. If individual slab objects can be accounted effectively, there is no more need to create per-memcg slab caches. A single set of slab caches and slab pages can be used by all memory cgroups, which brings slab utilization back to >90% and saves ~40% of total kernel memory. To keep reparenting working without reintroducing the dying memcg problem, an intermediate accounting vessel called obj_cgroup is introduced. Of course, some memory has to be used to store an objcg pointer for each slab object, but that is far smaller than the consequences of poor slab utilization. The proposed new slab controller [1] implements the per-object accounting approach. It has been used on Facebook production hosts for several months and brought significant memory savings (in the range of 1 GB per host and more) without any known regressions.

The object-level approach can be used to add effective accounting of objects which are by nature not page-based: e.g. percpu memory. Each percpu allocation is scattered over multiple pages, but if it's small, it takes only a small portion of each page. Accounting such objects was nearly impossible on a per-page basis (duplicating the chunk infrastructure would result in terrible overhead), but with a per-object approach it's quite simple. Patchset [2] implements it. Percpu memory is used more and more as a way to solve contention problems on multi-CPU systems. Cgroup internals and bpf maps seem to be the biggest users at this time, but new use cases will likely be added. It can easily take hundreds of MBs on a host, so if it's not accounted it creates an issue in container memory isolation.

[1] https://lore.kernel.org/linux-mm/20200527223404.1008856-1-guro@fb.com/
[2] https://lore.kernel.org/linux-mm/20200528232508.1132382-1-guro@fb.com/

• 07:45
Break 15m
• 07:00 11:00
Networking and BPF Summit Networking and BPF Summit/Virtual-Room (LPC 2020)

### Networking and BPF Summit/Virtual-Room

#### LPC 2020

150
• 07:00
Multiple XDP programs on a single interface - status and next steps 45m

At last year's LPC I presented a proposal for how to attach multiple XDP programs to a single interface and have them run in sequence. In this presentation I will follow up on that, and present the current status and next steps on this feature.

Briefly, the solution we ended up with was a bit different from what I envisioned at the last LPC: We now rely on the new 'freplace' functionality in BPF which allows a BPF program to replace a function in another BPF program. This makes it possible to implement the dispatcher logic in BPF, which is now part of the 'libxdp' library in the xdp-tools package.

In this presentation I will explain how this works under the covers, what it takes for an application to support this mode of operation, and discuss how we can ensure compatibility between applications, whether or not they use libxdp itself. I am also hoping to solicit feedback on the solution in general, including any possible deficiencies or possible improvements.

Speaker: Toke Høiland-Jørgensen (Red Hat)
• 07:45

In this talk we introduce Per Thread Queues (PTQ). PTQ is a type of network packet steering that allows application threads to be assigned dedicated network queues for both transmit and receive. This facility provides highly granular traffic isolation between applications and can also help facilitate high performance when combined with other techniques such as busy polling. PTQ extends both XPS and aRFS.

A key concept of PTQ is "global queues". These are a device-independent, abstract representation of network queues. As their name implies, global queues can be treated as a managed resource not only across a system, but also across a data center, similar to how other resources (memory, CPU, network priority, etc.) are managed. User-defined semantics and QoS characteristics can be added to global queues. For instance, queue #233 in the data center might refer to a queue with QoS properties specific to handling video. Ultimately, in the data path, a global queue is resolved to a real device queue that provides the semantics and QoS associated with the global queue. This resolution happens via a device-specific mapping function that maps a global queue to a device queue.

Threads may be assigned a global queue for both transmit and receive. The assignment comes from pools of transmit and receive queues configured in a cgroup. When a thread starts in a cgroup, the queue pools of the cgroup are consulted. If a queue pool is configured, the kernel assigns a queue to the thread (either a TX queue, an RX queue, or both). The assigned queues are stored in the thread's task structure. To transmit, the mapped device queue for the assigned transmit queue is used in lieu of XPS queue selection; for receive, the mapped device queue for the assigned receive queue is programmed into the device via ndo_rx_flow_steer.

This talk will cover the design, implementation, and configuration of PTQ. Additionally, we will present performance numbers and discuss some of the many ways that this work can be further enhanced.

Speaker: Tom Herbert
• 08:30
Break 30m
• 09:00
A programmable Qdisc with eBPF 45m

Today we have a few dozen Qdiscs available in the Linux kernel, offering various algorithms to schedule network packets. You can change the parameters of each Qdisc, but you cannot change the core algorithm of a given Qdisc. A programmable Qdisc offers a way to customize your own scheduling algorithms without writing a Qdisc kernel module from scratch. With eBPF emerging across the Linux network stack, it is time to explore how to integrate eBPF with Qdiscs.

Unlike the existing eBPF TC filter and action, a programmable Qdisc is much more complicated, because we have to think about how to store skbs and what we can offer for users to program. More importantly, a hierarchical Qdisc is even harder, while it could offer more flexibility.

We will examine the latest eBPF functionalities and packet scheduler architecture, discuss those challenges with possible solutions for a programmable Qdisc with eBPF.

Speaker: Cong Wang
• 09:45
eBPF in kernel lockdown mode 45m

Linux has a new 'lockdown' security mode where changes to the running kernel
require verification with a cryptographic signature, and accesses to kernel
memory that may leak to userspace are restricted.

Lockdown's 'integrity' mode requires just the signature, while in
'confidentiality' mode, in addition to requiring a signature, the system can't
leak confidential information to userspace.

Work needs to be done to add cryptographic signatures to eBPF bytecode. The
signature would then be passed to the kernel via sys_bpf(), reusing the kernel
module signing infrastructure.

The main eBPF loader, libbpf, may perform relocations on the received bytecode
for things like CO-RE (Compile Once, Run Everywhere), thus invalidating a
signature made over the original bytecode.

Such modifications to the signed bytecode thus need to move from libbpf into
the kernel, so that they can be performed after the signature is verified.

This presentation is intended to provide a problem statement and present some
of the ideas being considered, so that BPF can be used in environments where
'lockdown' mode is required.

Speaker: Mr Arnaldo Melo (Red Hat Inc.)
• 07:00 11:00
RISC-V MC Microconference3/Virtual-Room (LPC 2020)

### Microconference3/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
Testing and Fuzzing MC Microconference1/Virtual-Room (LPC 2020)

### Microconference1/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
VFIO/IOMMU/PCI MC Microconference2/Virtual-Room (LPC 2020)

### Microconference2/Virtual-Room

#### LPC 2020

150
• 08:00 11:00
Kernel Summit Refereed Track/Virtual-Room (LPC 2020)

### Refereed Track/Virtual-Room

#### LPC 2020

150
• 08:00
SoC support lifecycle in the kernel 45m

The world of system-on-chip computing has changed drastically over the past years, with the current state being much less diverse as the industry keeps moving to 64-bit processors, to little-endian addressing, to larger memory capacities, and to a small number of instruction set architectures.

In this presentation, I discuss how and why these changes happen, and how we can find a balance between keeping older technologies working for those that rely on them, and identifying code that has reached the end of its useful life and should better get removed.

Speaker: Arnd Bergmann (Linaro)
• 08:45
Break 15m
• 09:00
seccomp feature development 45m

As outlined in https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/ the topics include:

• fd passing
• deep argument inspection
• changing structure sizes

Specifically, seccomp needs to grow the ability to inspect Extensible Argument syscalls, which requires that it inspect userspace memory without Time-of-Check/Time-of-Use races and without double-copying. Additionally, since the structures can grow and be nested, there needs to be a way to deal with flattening the arguments into a linear buffer that can be examined by seccomp's BPF dialect. All of this also needs to be handled by the USER_NOTIF implementation. Finally, fd passing needs to be finished, and there needs to be an exploration of syscall bitmasks to augment the existing filters to gain back some performance.

• 09:45
Break 15m
• 10:00
DAMON: Data Access Monitoring Framework for Fun and Memory Management Optimizations 45m

# Background

In an ideal world, memory management provides the optimal placement of data objects under accurate predictions of future data access. Current practical implementations, however, rely on coarse information and heuristics to keep the instrumentation overhead minimal. A number of memory management optimization works were therefore proposed, based on the finer-grained access information. Lots of those, however, incur high data access pattern instrumentation overhead, especially when the target workload is huge. A few of the others were able to keep the overhead small by inventing efficient instrumentation mechanisms for their use case, but such mechanisms are usually applicable to their use cases only.

We can list below four requirements for the data access information instrumentation that must be fulfilled to allow adoption into a wide range of production environments:

• Accuracy. The instrumented information should be useful for DRAM-level memory management. Cache-level accuracy would not be highly required, though.
• Light-weight overhead. The instrumentation overhead should be low enough to be applied online while making no impact on the performance of the main workload.
• Scalability. The upper-bound of the instrumentation overhead should be controllable regardless of the size of target workloads, to be adopted in general environments that could have huge workloads.
• Generality. The mechanism should be widely applicable.

# DAMON: Data Access MONitor

DAMON is a data access monitoring framework subsystem for the Linux kernel that is designed to mitigate this problem. Its core mechanisms, called 'region based sampling' and 'adaptive regions adjustment', make it fulfill the requirements. Moreover, its general design and flexible interface allow not only kernel code but also user space to use it.

Using this framework, the kernel's core memory management mechanisms, including reclamation and THP, can be optimized. Memory management optimization works that incurred high instrumentation overhead will be able to have another try. In user space, meanwhile, users who have special workloads will be able to write personalized tools or applications for deeper understanding and specialized optimization of their systems.

In addition to the basic monitoring, DAMON also provides a feature dedicated to semi-automated memory management optimizations, called DAMON-based Operation Schemes (DAMOS). Using this feature, the DAMON users can implement complex data access aware optimizations in only a few lines of human-readable schemes descriptions.

We evaluated DAMON's overhead, monitoring quality, and usefulness using 25 realistic workloads on my QEMU/KVM based virtual machine.

DAMON is lightweight. It changes system memory usage by only -0.39% and consumes less than 1% CPU time in the typical case. It slows target workloads down by only 0.63%.

DAMON is accurate and useful for memory management optimizations. An experimental DAMON-based operation scheme for THP removes 69.43% of THP memory overhead while preserving 37.11% of the THP speedup. Another experimental DAMON-based reclamation scheme reduces 89.30% of resident sets and 22.40% of system memory footprint while incurring only 1.98% runtime overhead in the best case.

# Current Status of The Project

Development of DAMON started in 2019, and several iterations were presented in academic papers[1,2,3], last year's kernel summit[4], and an LWN article[5]. The source code is available[6] for use and modification, and the patchsets[7] are periodically posted for review.

# Agenda

I will briefly introduce DAMON and share how it has evolved since last year's kernel summit talk. I will introduce some new features, including the DAMON-based operation schemes. There will be a live demonstration, and I will show performance evaluation results. I will outline the plans and roadmap of this project, leading to a Q&A session to collect feedback, with a view to getting it ready for general use and upstream inclusion.

[1] SeongJae Park, Yunjae Lee, Yunhee Kim, Heon Y. Yeom, Profiling Dynamic Data Access Patterns with Bounded Overhead and Accuracy. In IEEE International Workshop on Foundations and Applications of Self- Systems (FAS 2019), June 2019. https://ieeexplore.ieee.org/abstract/document/8791992
[2] SeongJae Park, Yunjae Lee, Heon Y. Yeom, Profiling Dynamic Data Access Patterns with Controlled Overhead and Quality. In 20th ACM/IFIP International Middleware Conference Industry, December 2019. https://dl.acm.org/citation.cfm?id=3368125
[3] Yunjae Lee, Yunhee Kim, and Heon. Y. Yeom, Lightweight Memory Tracing for Hot Data Identification, In Cluster computing, 2020. (Accepted but not published yet)
[4] SeongJae Park, Tracing Data Access Pattern with Bounded Overhead and Best-effort Accuracy. In The Linux Kernel Summit, September 2019. https://linuxplumbersconf.org/event/4/contributions/548/
[5] Jonathan Corbet, Memory-management optimization with DAMON. In Linux Weekly News, February 2020. https://lwn.net/Articles/812707/
[6] https://github.com/sjp38/linux/tree/damon/master
[7] https://lore.kernel.org/linux-mm/20200525091512.30391-1-sjpark@amazon.com/

Speaker: Dr SeongJae Park (Amazon)
• Thursday, 27 August
• 07:00 11:00
BOFs Session BOF1/Virtual-Room (LPC 2020)

### BOF1/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
GNU Tools Track GNU Tools track/Virtual-Room (LPC 2020)

### GNU Tools track/Virtual-Room

#### LPC 2020

150

The GNU Tools track will gather all GNU tools developers to discuss current and future work, coordinate efforts, exchange reports on ongoing efforts, discuss development plans for the next 12 months, hold developer tutorials, and have any other related discussions.
The track will also include a Toolchain Microconference on Friday to discuss topics that are more specific to the interaction between the Linux kernel and the toolchain.

• 07:25
Break (5 minutes) 5m
• 07:30
BoF: Speed vs accuracy for math library optimization 25m

Math library developers sometimes can trade slight loss of accuracy
for significant performance gains or slight loss of performance
for significant accuracy gains. This BoF is to review some recent
and coming libm/libgcc changes and share ideas on how to decide
where to draw the line for loss of performance vs improved accuracy
and vice-versa.

Speaker: Patrick McGehearty (Oracle)
• 07:55
Break (5 minutes) 5m
• 08:00
Lightning talk: RISC-V BMI optimizations 10m

Support for the bit manipulation extension to RISC-V is currently out-of-tree and represents work by Jim Wilson at SiFive, Claire Wolf at Symbiotic EDA and Maxim Blinov at Embecosm. Since last year, I have been working on additional optimizations for the bit manipulation extension, which I shall present.

Speaker: Maxim Blinov (Embecosm)
• 08:10
Break (5 minutes) 5m
• 08:15
Lightning Talk: The challenges of GNU tool chain support for CORE-V 10m

CORE-V is a family of 32- and 64-bit cores based on the RISC-V architecture, being developed by the Open Hardware Group, a consortium of 50+ companies, universities and other organizations. It is based on the family of RISC-V cores originally developed under the PULP project at ETH Zürich and the University of Bologna.

PULP cores already have an out-of-tree GNU tool chain, but it is based on a GCC from 2017 and, as would be expected, was developed as a research compiler to experiment with different extensions to the core. This talk will explore the challenges of getting from this tool chain to an up-to-date GNU tool chain, in-tree. The areas to be explored include

• migrating from a 2017 code base (still a lot of C) to the 2020 code
base (C++)
• retrospectively adding tests for 2,700 new instruction variants and
their associated compiler optimizations
• upstreaming extensions which, while present in manufactured silicon
and products, are not yet approved by the RISC-V Foundation
Speakers: Dr Jeremy Bennett (Embecosm), Dr Craig Blackmore (Embecosm)
• 08:25
Break (5 minutes) 5m
• 08:30
Kludging the editor with the compiler 25m

Emacs Lisp (Elisp) is the Lisp dialect used by the Emacs text editor
family. GNU Emacs can currently execute Elisp code either interpreted
or byte-interpreted after it has been compiled to byte-code. In this
presentation I'll discuss the libgccjit based Elisp compiler
implementation being integrated in Emacs. Though still a work in
progress, this implementation is able to bootstrap a functional Emacs
and compile all Emacs Elisp files, including the whole GNU Emacs
Lisp Package Archive (ELPA). Natively compiled Elisp shows a performance
increase ranging from ~2x up to ~40x with respect to the equivalent
byte-code, measured over a set of small benchmarks.

Speaker: Mr Andrea Corallo (Arm)
• 08:55
Break (5 minutes) 5m
• 09:00
State of flow-based diagnostics in GCC 25m

GCC has a robust set of diagnostics based on control- and data-flow analysis. They are able to detect many kinds of bugs primarily related to invalid accesses. In this talk I will give an overview of the latest state of some of these diagnostics and sketch out my ideas for future enhancements in this area.

Speaker: Mr Martin Sebor (Red Hat)
• 09:25
Break (5 minutes) 5m
• 09:55
Break (5 minutes) 5m
• 10:00
Enable Intel CET in Linux OS 25m
Speaker: H.J. Lu (Intel)
• 07:00 10:00
Kernel Summit Refereed Track/Virtual-Room (LPC 2020)

### Refereed Track/Virtual-Room

#### LPC 2020

150
• 07:00
Extensible Syscalls 45m

Most Linux syscall design conventions have been established through trial and
error. One well-known example is the missing flag argument in a range of
syscalls that triggered the addition of a revised version of these syscalls.
Nowadays, adding a flag argument to keep syscalls extensible is an accepted
convention recorded in our kernel docs.

In this session we'd like to propose and discuss a few simple conventions that
have proven useful over time and a few new ones that were just established
recently with the addition of new in-kernel apis. Ideally these conventions
would be added to the kernel docs and maintainers encouraged to use them as
guidance when new syscalls are added.
We believe that these conventions can lead to a more consistent (and possibly
more pleasant) uapi going forward making programming on Linux easier for
userspace. They hopefully also prevent new syscalls running into various
design pitfalls that have lead to quirky or cumbersome apis and (security) bugs.

Topics we'd like to discuss include the use of structs versioned by size in
syscalls such as openat2(), sched_{set,get}_attr(), and clone3() and the
associated api that we added last year, whether new syscalls should be allowed
to use nested pointers in general and specifically with an eye on being
conveniently filterable by seccomp, the convention to always use unsigned int
as the type for register-based flag arguments instead of the current potpourri
of types, naming conventions when revised versions of syscalls are added, and,
ideally, a uniform way to test whether a syscall supports a given feature.

Speakers: Christian Brauner, Aleksa Sarai (SUSE LLC)
• 07:45
Break 15m
• 08:00
Kernel documentation 45m

The long process of converting the kernel's documentation into RST is
finally coming to an end...what has that bought us? We have gone from a
chaotic pile of incomplete, crufty, and un-integrated docs to a slightly
better organized pile of incomplete, crufty, slightly better integrated
docs. Plus we have the infrastructure to make something better from here.

What are the next steps for kernel documentation? What would we really
like our docs to look like, and how might we find the resources to get
them to that point? What sorts of improvements to the build
infrastructure would be useful? I'll come with some ideas (some of which
you've certainly heard before) but will be more interested in listening.

Speaker: Jonathan Corbet (Linux Plumbers Conference)
• 08:45
Break 15m
• 09:00

This proposal is recycled from the one I've suggested to LSF/MM/BPF [0].
Unfortunately, LSF/MM/BPF was cancelled, but I think it is still
relevant.

Restricted mappings in the kernel mode may improve mitigation of hardware
speculation vulnerabilities and minimize the damage exploitable kernel bugs
can cause.

There are several ongoing efforts to use restricted address spaces in the
Linux kernel for various use cases:

• speculation vulnerabilities mitigation in KVM [1]
• support for memory areas with more restrictive protection than the
defaults ("secret", or "protected" memory) [2], [3], [4]
• hardening of Linux containers [ no reference yet :) ]

Last year we had vague ideas and possible directions, this year we have
several real challenges and design decisions we'd like to discuss:

• "Secret" memory userspace APIs

Should such an API follow "native" MM interfaces like mmap(), mprotect()
and madvise(), or would it be better to use a file descriptor, e.g. as
memfd_create() does?

MM "native" APIs would require a VM_something flag and probably a page flag
or page_ext. With a file descriptor, VM_SPECIAL and custom implementations of
.mmap() and .fault() would suffice. On the other hand, mmap() and
mprotect() seem a better fit semantically and they could be more easily
adopted by userspace.

• Direct/linear map fragmentation

Whenever we want to drop some mappings from the direct map or even change
the protection bits for some memory area, the gigantic and huge pages
that comprise the direct map need to be broken and there's no THP for the
kernel page tables to collapse them back. Moreover, the existing APIs
defined in <asm/set_memory.h> by several architectures do not really
presume they would be widely used.

For the "secret" memory use-case the fragmentation can be minimized by
caching large pages, use them to satisfy smaller "secret" allocations and
than collapse them back once the "secret" memory is freed. Another
possibility is to pre-allocate physical memory at boot time.

Yet another idea is to make page allocator aware of the direct map layout.

• Kernel page table management

Currently we presume that only one kernel page table exists (well,
mostly) and the page table abstraction is required only for the user page
tables. As such, we presume that 'page table == struct mm_struct' and the
mm_struct is used all over by the operations that manage the page tables.

The management of restricted address spaces in the kernel requires the
ability to create, update and remove kernel contexts the same way we do
for the userspace.

One way is to overload the mm_struct, like EFI and text poking did. But
it is quite an overkill, because most of the mm_struct contains
information required to manage user mappings.

My suggestion is to introduce a first class abstraction for the page
table and then it could be used in the same way for user and kernel
context management. For now I have a very basic POC that splits several
fields from the mm_struct into a new 'struct pg_table' [5]. This new
abstraction can be used e.g. by PTI implementation of the page table
cloning and the KVM ASI work.

[0] https://lore.kernel.org/linux-mm/20200206165900.GD17499@linux.ibm.com/
[1] https://lore.kernel.org/lkml/20200504145810.11882-1-alexandre.chartre@oracle.com
[2] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@amazon.de/
[3] https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
[4] https://lore.kernel.org/lkml/20200522125214.31348-1-kirill.shutemov@linux.intel.com
[5] https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=pg_table/v0.0

Speaker: Mike Rapoport (IBM)
• 07:00 11:00
LLVM MC Microconference1/Virtual-Room (LPC 2020)

### Microconference1/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
Networking and BPF Summit Networking and BPF Summit/Virtual-Room (LPC 2020)

### Networking and BPF Summit/Virtual-Room

#### LPC 2020

150
• 07:00
Kubernetes service load-balancing at scale with BPF & XDP 45m

With the incredible pace of containerisation in enterprises, the combination of Linux and Kubernetes as an orchestration base layer is often considered the "cloud OS". In this talk we provide a deep dive on Kubernetes' service abstraction and, related to it, the path of getting external network traffic into one's cluster.

With this understanding in mind, we then discuss issues and shortcomings of the existing kube-proxy implementation in Kubernetes for larger scale and high churn environments and how it can be replaced entirely with the help of Cilium by utilising BPF and XDP. Cilium's service load-balancing architecture consists of two main components, that is, BPF at the socket layer for handling East-West traffic and BPF at the driver layer for processing the North-South traffic path.

Given XDP has only recently been added to Cilium in order to accelerate service load-balancing, we'll discuss our path towards implementing the latter, lessons learned, provide a detailed performance analysis compared to kube-proxy in terms of forwarding cost as well as CPU consumption, and future extensions on kernel side.

Speakers: Daniel Borkmann (Cilium.io), Martynas Pumputis (Cilium)
• 07:45
Networking Androids 45m

Android Networking - update for 2020:
- what are our pain points wrt. kernel & networking in general,
- progress on upstreaming Android Common Kernel networking code,
- and the unknown depths of non-common vendor changes,
- how we're using bpf,
- how it's working,
- what's not working,
- how it's better than writing kernel code,
- why it's so much worse,
- etc...

Speaker: Mr Maciej Zenczykowski (Google, Inc.)
• 08:30
Break 30m
• 09:00
Right-sizing is hard, resizable BPF maps for optimum map size 45m

Right-sizing BPF maps is hard. By allocating for a worst-case scenario we build large maps consuming large chunks of memory for a corner case that may never occur. Alternatively, we may try to allocate for the normal case, choosing to ignore or fail in the corner cases. But for programs running across many different workloads and system parameters it's difficult to even decide what a normal case looks like. For a few maps we may consider using the BPF_F_NO_PREALLOC flag, but here we are penalized at allocation time and still need to charge our memory limits to match our max memory usage.

For a concrete example, consider a sockhash map. This map allows users to insert sockets into a map to build load balancers, socket hashing, policy, etc. But how do we know how many sockets will exist in a system? What do we do when we overrun the table?

In this talk we propose a notion of resizable maps. The kernel already supports resizable arrays and resizable hash tables giving us a solid grounding to extend the underlying data structures of similar maps in BPF. Additionally, we also have the advantage of allowing the BPF programmer to tell us when to grow these maps to avoid hard-coded heuristics.

We will provide two concrete examples where the above has proven useful. First, using the sockmap and sockhash tables noted above. This way we can issue a bpf_grow_map() indicating to the BPF map code that more slots should be allocated if possible. We can decide, using BPF program logic, where to put this low-water mark. Finally, we will also illustrate how using resizable arrays can ensure the system doesn't run out of slots for the associated data in an example program. This has become a particularly difficult problem to solve with the current implementations, where the worst case can be severe, requiring 10x or more entries than the normal case. With the addition of resizable maps we expect many of the issues with right-sizing can be eliminated.

Speaker: John Fastabend (Isovalent)
• 09:45
How we built Magic Transit 45m

In this talk we will present Magic Transit, Cloudflare's layer 3 DDoS protection service, as a case study in building a network product from the standard Linux networking stack. Linux provided us with flexibility and isolation that allowed us to stand up this product and on-board more than fifty customers within a year of conceptualization. Cloudflare runs all of our services on every server on our edge, and Magic Transit is not an exception to that rule - one of our biggest design challenges was working a layer 3 product into a networking environment tuned for proxy and server products. We'll cover how we built Magic Transit, what worked really well, and what challenges we encountered along the way.

Magic Transit is largely implemented as a “configurator”: that is, our software manages the network setup and lets the kernel do the heavy lifting, with network namespaces, policy routing and netfilter safely directing and scrubbing IP traffic for our customers. This design allows drop-in integration with our DDoS protection systems and our proxying and L7 products, and in a way that our operations team was familiar with. These benefits do not come without their caveats; specifically route placement/reporting inconsistencies, quirks revolving around icmp packets being generated from within a namespace when fragmentation occurs, problems stemming from conntrack and a mystery around offload… Finally we’ll touch on our future plans to migrate our web of namespaces to a Rust service that makes use of ebpf/xdp.
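The namespace-plus-policy-routing pattern described above can be sketched with standard iproute2 commands. All names and addresses here are illustrative, not Cloudflare's actual configuration:

```shell
# Create an isolated namespace for the L3 product and wire it to the host
# with a veth pair (names and addresses are hypothetical).
ip netns add mt
ip link add veth-host type veth peer name veth-mt
ip link set veth-mt netns mt
ip addr add 10.0.0.1/30 dev veth-host
ip netns exec mt ip addr add 10.0.0.2/30 dev veth-mt
ip link set veth-host up
ip netns exec mt ip link set veth-mt up

# Policy routing: steer traffic for a customer prefix into the namespace,
# where netfilter rules can scrub it before it is forwarded on.
ip rule add to 192.0.2.0/24 lookup 100
ip route add 192.0.2.0/24 via 10.0.0.2 dev veth-host table 100
```

The kernel does the forwarding and filtering; the "configurator" only installs and reconciles rules like these.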

Speakers: Mr Erich Heine (Cloudflare), Mr Connor Jones (Cloudflare)
• 07:00 11:00
System Boot and Security MC Microconference2/Virtual-Room (LPC 2020)

### Microconference2/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
You, Me, and IoT Two MC Microconference3/Virtual-Room (LPC 2020)

### Microconference3/Virtual-Room

#### LPC 2020

150
• Friday, 28 August
• 07:00 11:00
Application Ecosystem MC Microconference3/Virtual-Room (LPC 2020)

### Microconference3/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
BOFs Session BOF1/Virtual-Room (LPC 2020)

### BOF1/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
GNU Toolchain MC GNU Tools track/Virtual-Room (LPC 2020)

### GNU Tools track/Virtual-Room

#### LPC 2020

150

The GNU Toolchain microconference focuses on specific topics related to the GNU Toolchain that have a direct impact in the development of the Linux kernel, and that can benefit from some live discussion and agreement between the GNU toolchain and kernel developers.

• 07:00
System call wrappers for glibc 30m

Most programmers prefer to call system calls via functions from their C library of choice, rather than using the generic syscall function or custom inline-assembler sequences wrapping a system call instruction. This means that it is desirable to add C library support for new system calls, so that they become more widely usable.

This talk covers glibc-specific requirements for adding new system call wrappers to the GNU C Library (glibc), namely code, tests, documentation, patch review, and copyright assignment (not necessarily in that order). Developers can help out with some of the steps even if they are not familiar with glibc procedures or have reservations about the copyright assignment process.

I plan to describe the avoidable pitfalls we have encountered repeatedly over the years, such as tricky calling conventions and argument types, or multiplexing system calls with polymorphic types. The ever-present temptation of emulating system calls in userspace is demonstrated with examples.

Finally, I want to raise the issue of transition to new system call interfaces which are a superset of existing system calls, and the open problems related to container run-times and sandboxes with seccomp filters—and the emergence of non-Linux implementations of the Linux system call API.

The intended audience for this talk is developers who want to help with getting system call wrappers added to glibc, and kernel developers who define new system calls or review such patches.

Speaker: Florian Weimer (Red Hat)
• 07:00 11:00
Networking and BPF Summit Networking and BPF Summit/Virtual-Room (LPC 2020)

### Networking and BPF Summit/Virtual-Room

#### LPC 2020

150
• 07:00
Eliminating bugs in BPF JITs using automated formal verification 45m

This talk will present our ongoing efforts of using formal verification
to eliminate bugs in BPF JITs in the Linux kernel. Formal verification
rules out classes of bugs by mechanically proving that an implementation
adheres to an abstract specification of its desired behavior.

We have used our automated verification framework, Serval, to find 30+
new bugs in JITs for the x86-32, x86-64, arm32, arm64, and riscv64
architectures. We have also used Serval to develop a new BPF JIT for
riscv32, RISC-V compressed instruction support for riscv64, and new
optimizations in existing JITs.

The talk will roughly consist of the following parts:

• A report of the bugs we have found and fixed via verification, and
why they escaped selftests.
• A description of how the automated formal verification works,
including a specification of JIT correctness and a proof strategy for
automated verification.
• A discussion of future directions to make BPF JITs more amenable
to formal verification.

The following link points to a list of our patches in the kernel, as well as
the code for the verification tool and a guide on how to run it:

https://github.com/uw-unsat/serval-bpf

Speaker: Luke Nelson (University of Washington)
• 07:45
BPF extensible network: TCP header option, CC, and socket local storage 45m

This talk will discuss some recent works that extend the TCP stack with BPF: TCP header option, TCP Congestion Control (CC), and socket local storage.

Hopefully the talk can end with getting ideas/desires on which part of the stack can practically be realized in BPF.

Speaker: Martin Lau
• 08:30
Break 30m
• 09:00
Userspace OVS with HW Offload and AF_XDP 45m

OVS has two major datapaths: 1) the Linux kernel datapath, which ships with Linux distributions, and 2) the userspace datapath, which is usually coupled with the DPDK library as its packet I/O interface and is called OVS-DPDK. Recent OVS also supports two offload mechanisms: TC-flower for the kernel datapath, and DPDK rte_flow for the userspace datapath. The tc-flower API with the kernel datapath seems to be more feature-rich, with support for connection tracking. However, the userspace datapath is in general faster than the kernel datapath, due to more packet processing optimizations.

With the introduction of AF_XDP to OVS, the userspace datapath can process packets at a high rate without requiring the DPDK library. An AF_XDP socket creates a fast packet channel to the OVS userspace datapath and shows similar performance to using DPDK. In this case, the AF_XDP socket with the OVS userspace datapath enables a couple of new ideas. First, unlike OVS-DPDK, with AF_XDP the userspace datapath can enable TC-flower offload, because the device driver is still running in the kernel. Second, flows which can’t be offloaded to the hardware, e.g. L7 processing, can be redirected to the OVS userspace datapath over the AF_XDP socket, which is faster than processing them in the kernel. And finally, users can implement new features using a custom XDP program attached to the device when flows can’t be offloaded due to lack of hardware support.

In summary, with this architecture, we hope that a flow can be processed in the following sequence:
1) In hardware, with the tc-flower API. This gives the best performance with the latest hardware. If that is not possible,
2) In XDP. This gives performance second only to hardware, with the flexibility for new features and the eBPF verifier’s safety guarantee. If that is not possible,
3) In the OVS userspace datapath. This gives the best software switching performance.

Moving forward, we hope to unify the two extreme deployment scenarios, the high-performance NFV cases using OVS-DPDK and the enterprise hypervisor use cases using the OVS kernel module, by just using the OVS userspace datapath with AF_XDP. Currently we are exploring the feasibility and limitations of this design. We hope that by presenting this idea, we can get feedback from the community.

Speaker: William Tu (VMware)
• 07:00 11:00
Open Printing MC Microconference1/Virtual-Room (LPC 2020)

### Microconference1/Virtual-Room

#### LPC 2020

150
• 07:00 11:00
Power Management and Thermal Control MC Microconference2/Virtual-Room (LPC 2020)

150