Linux Plumbers Conference 2019



September 9-11, Lisbon, Portugal

The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond.  LPC 2019 will be held September 9-11 in Lisbon, Portugal.  We are looking forward to seeing you there!

    • 10:00–13:30
      Distribution Kernels MC

      The upstream kernel community is where active kernel development happens, but the majority of deployed kernels do not come directly from upstream; they come from distributions. "Distribution" here can refer to a traditional Linux distribution such as Debian or Gentoo, but also to Android or a custom cloud distribution. The goal of this microconference is to discuss common problems that arise when trying to maintain a kernel.

      Expected topics:
      Backporting kernel patches and how to make it easier
      Consuming the stable kernel trees
      Automated testing for distributions
      Managing ABIs
      Distribution packaging/infrastructure
      Cross distribution bug reporting and tracking
      Common distribution kconfig
      Distribution default settings
      Which patch sets are distributions carrying?
      More to be added based on CfP for this microconference

      "Distribution kernel" is used in a very broad manner. If you maintain a kernel tree for use by others, we welcome you to come and share your experiences.

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC lead
      Laura Abbott

      • 10:00
        Upstream 1st: Tools and workflows for multi kernel version juggling of short term fixes, long term support, board enablement and features with the upstream kernel 20m

        Having maintained a distribution-agnostic reference kernel (Yocto), an
        operating system vendor kernel (Wind River) and finally a semiconductor
        vendor kernel (Xilinx), I have seen many obvious workflows and tools
        used to deliver kernels and support them after release.

        The less than obvious workflows (and tools) are often related to distro kernel
        tree maintenance and balancing the needs of short term fixes (often security
        related), with a model that allows long term support, all in trees that may be
        carrying specific features or board support that are destined for upstream
        eventually. Many methods to juggle these demands are ad-hoc or specific to the
        various distros.

        If a tree does not keep a (somewhat) clean history, and patch history
        is not tracked over time, then moving to a new kernel version,
        understanding why a change was made, or debugging a problem all become
        much harder.

        All the competing demands are coupled with the need to have development supported
        with the goal of getting changes into the mainline kernel. Understanding the
        technical solutions (tools), workflows (tools + social) and how to support the
        community at large to reduce everyone's workload is often given limited time.
        Stepping back and looking at the different solutions that maintainers are using
        may highlight common patterns and opportunities to collaborate/standardize on
        various techniques. Less-than-ideal solutions are also valuable as lessons
        learned and are worth sharing.

        Speaker: Bruce Ashfield (Xilinx)
      • 10:20
        Using Yocto to build a distro and maintain a kernel tree 20m

        We'd like to spend a few minutes providing some background on how we're using Yocto to produce kernel builds, as well as bigger images that also contain userspace, and then try to address some of the issues we're seeing with this process.

        There are a few topics we'd like to discuss with the room:

        • Using a single kernel branch for multiple, very different projects?
        • Working with kernel config fragments?
        • Reproducible kernel builds/cloning sources?
        • Is there anything saner than cve-check for pointing out known security vulnerabilities?
        Speakers: Senthil Rajaram, Sasha Levin
      • 10:40
        Making it easier for distros to package kernel source 20m

        Every distro has to package the kernel tree using their own unique package
        files. Some parts of the process are built-in to the kernel source and are
        easy: build, install, and headers. Some parts are not: configs, devel
        package, userspace tools package, tests, distro versioning, changelogs,
        custom patches, etc.

        This discussion revolves around some of the issues and difficulties a
        distro maintainer faces when packaging the kernel source code. What
        changes can we agree to push upstream to make our lives easier?

        Further, we will discuss possibilities of plugging distro packaging
        into the kernel source tree (through external means or internal
        hooks). This would allow developers to quickly build (from a common
        devel env) a particular distro-like kernel for proper testing.

        Sample topics include:
        * config maintenance for distros
        * top-level Makefile hooks for distros
        * make devel_install -like command
        * distro versioning

        Speaker: Don Zickus (Red Hat)
      • 11:00
        Monitoring and Stabilizing the In-Kernel ABI 30m

        The kernel's API and ABI exposed to kernel modules is not something that is usually maintained upstream. Deliberately. In fact, the ability to break APIs and ABIs can greatly benefit development. Good reasons for this have been stated multiple times; see e.g. Documentation/process/stable-api-nonsense.rst.
        The reality for distributions might look different though. Especially - but not exclusively - enterprise distributions aim to guarantee ABI stability for the lifetime of their released kernels while constantly consuming upstream patches to improve stability and security for said kernels. Their customers rely on both: upstream fixes and the ability to use the released kernels with out-of-tree modules that are compiled and linked against the stable ABI.

        In this talk I will give a brief overview of how this very same requirement applies to the kernels that are part of the Android distribution. The methods presented here are reasonable measures to reduce the complexity of the problem by addressing issues introduced by ABI-influencing factors such as the build toolchain, configurations, etc.

        While we focus on Android kernels, the tools and mechanisms are generally useful for kernel distributors that aim for a similar level of stability. I will talk about the tools we use (e.g. libabigail), how we automate compliance checking, and how we eventually enforce ABI stability.

        Speaker: Matthias Maennich (Google)
      • 12:00
        KernelCI applied to distributions 20m

        While KernelCI as a project is dedicated to testing the
        upstream Linux kernel, the same KernelCI software may be reused
        for alternative purposes. One typical example is distribution
        kernels, which often track a stable branch but also carry some
        extra patches and a specific configuration. Aside from covering
        a particular downstream branch, having a separate KernelCI
        instance also makes it possible to add specific tests that cover
        user-space functionality.

        A key aspect of KernelCI however is that the moving part remains
        the kernel revision. It is in theory possible to cover a full OS
        image with moving parts in user-space too, but that is not
        something it was originally designed for - hence an interesting
        subject for discussion.

        Speaker: Guillaume Tucker (Collabora Limited)
      • 12:20
        Automatically testing distribution kernel packages 20m

        Providing better kernel packages to distribution users is a hot topic, as the kernel package is a fundamental part of any distribution.
        One way to provide a better-quality kernel is to implement quality control using automated tests.
        Each distribution is probably using different tools and test suites.
        Let's share our knowledge and which tools we are using.

        Which continuous integration tools are better to use (Buildbot, Jenkins)?
        Which kernel test tools are better to use (LTP, kselftest)?

        Speaker: Alice Ferrazzi
      • 12:40
        Distros and Syzkaller - Why bother? 20m

        Syzkaller is run on Upstream and Stable trees. When paired with KASAN it has proven its usefulness uncovering large numbers of Out-of-Bounds (OOB) and Use-after-free (UAF) bugs. These results are readily available on the syzbot dashboard. What do distros gain by running Syzkaller?

        Distros regularly add features to their kernels, fix bugs and add third party drivers. Syzkaller testing focused on these changes and additions can uncover bugs and detect regressions.

        Syzkaller can be part of a distro's continuous integration (CI) strategy. Dedicated Syzkaller CI servers can be running the distro's next release candidate, only being halted and restarted as features, bug fixes or third party drivers are added.

        How can distros collaborate? There are many third party drivers common to all distros. Distros can collaborate on the Syzkaller testing framework for these drivers. Likewise for features that are going Upstream.

      • 13:00
        Being a Kernel Maintainer at Oracle - Lessons & Challenges 20m

        Linux kernel maintenance is a widely discussed topic at many conferences. Yet it has its own complex share of problems that are unique to maintainers, subsystems and organizations.

        Oracle has a very open and challenging environment. With access to a lot of information and knowledge about our customers' products and strategies, things can get very tricky for a kernel maintainer, who needs to keep a keen eye on both internal and public discussions.

        In this talk, I would like to share my experiences of being a kernel maintainer at Oracle. The topics covered will be:
        - How we maintain the kernel (UEK)
        - How the kernel stays up to date with stable fixes
        - How we handle KABI breakages and updates
        - How we handle backports and security fixes (CVEs)

        In addition to the above, we would also like to talk about our upstream tracking
        project, which essentially helps developers keep their work up to date with mainline.

    • 10:00–19:30
      Kernel Summit Track
      • 11:30
        Break 30m
      • 13:30
        Lunch 1h 30m
      • 16:30
        Break 30m
      • 18:30
        TAB Elections 1h
    • 10:00–18:30
      LPC Refereed Track
      • 10:00
        Maintaining out of tree patches over the long term 45m

        The PREEMPT_RT patchset is the longest existing large patchset living outside the Linux kernel. Over the years, the realtime developers have had to maintain several stable kernel versions of the patchset. This talk will present the lessons learned from this experience, including the workflow, tooling and release management that have proven over time to scale. The workflow deals with both upstream changes and changes to the patchset itself. Now that the PREEMPT_RT patchset is about to be merged upstream, we want to share our toolset and methods with others who may be able to benefit from our experience.

        This talk is for people who want to maintain an external patchset with stable releases.

        Speakers: Daniel Wagner, Daniel Bristot de Oliveira (Red Hat, Inc.), Steven Rostedt, Tom Zanussi, John Kacur
      • 10:45
        Core Scheduling: Taming Hyper-Threads to be secure 45m

        Over the last couple of years, we have witnessed an onslaught of vulnerabilities in the design and architecture of CPUs. It is interesting and surprising to note that the vulnerabilities mainly target features designed to improve the performance of CPUs, most notably hyper-threading (SMT). While some of the vulnerabilities could be mitigated in software and CPU microcode, a couple of others had no satisfactory mitigation other than making sure that SMT is off and that every context switch flushes the cache to clear the data used by the task being switched out. Turning SMT off is not viable in many production scenarios, such as cloud environments, where you lose a considerable amount of computing power. To address this, there have been community efforts to keep SMT on while making sure that non-trusting applications never run concurrently on the hyper-threads of a core; these efforts are widely called core scheduling.

        This talk is about the development, testing and profiling efforts of core scheduling in the community. There were multiple proofs of concept which, while differing in design, all ultimately try to make sure that only mutually trusted applications run concurrently on a core. We discuss the design, implementation and performance of the POCs. We also discuss the profiling attempts to understand the correctness and performance of the patches: the powerful kernel features we leveraged to get the most time-sensitive data from the kernel, to understand the behaviour of the scheduler with the core scheduling feature. We plan to conclude with a brief discussion of the future directions of core scheduling.

        The core idea of core scheduling is to keep SMT on and make sure that only trusted applications run concurrently on the siblings of a core. If no group of mutually trusting applications is runnable on the core, the remaining siblings must idle while applications run in isolation on the core. This must also take the performance aspects of the system into account. Theoretically, it is impossible to reach the same level of performance as when cores are allowed to run any runnable application. But if the performance of core scheduling is worse than, or the same as, turning SMT off, we gain nothing from this feature other than added complexity in the scheduler. So the idea is to achieve a considerable performance boost over SMT-off for the majority of production workloads.

        The security boundary is another critically important aspect of core scheduling. What should be considered a trust boundary? Should it be at the user/group level, the process level or the thread level? Should the kernel be considered trusted by applications, or vice versa? With virtualization and nested virtualization in the picture, this gets even more complicated. But the answers to most of these questions are environment- and workload-dependent, and hence these are implemented as policies rather than hardcoded. Then the question arises: how should the policies be implemented? The kernel has a variety of mechanisms for implementing this kind of policy, and the proofs of concept posted upstream mainly use cgroups. This talk also discusses other viable options for implementing the policies.

        Speakers: Julien Desfossez (DigitalOcean), Vineeth Remanan Pillai
      • 11:30
        Break 30m
      • 12:00
        Scaling performance profiling infrastructure for data centers 45m

        Understanding application performance and utilization characteristics is critically important for cloud-based computing infrastructure. Minor improvements in the predictability and performance of tasks can result in large savings. Google runs all workloads inside containers, and as such, cgroup performance monitoring is heavily utilized for profiling. We rely on two approaches built on the Linux performance monitoring infrastructure to provide task, machine, and fleet performance views and trends. A sampling approach collects metrics across the machine and tries to attribute them back to cgroups, while a counting approach tracks when a cgroup is scheduled and maintains state per cgroup. There are a number of trade-offs associated with both approaches. We will present an overview and associated use cases for both approaches at Google.

        As servers have gotten bigger, the number of cores and containers on a machine has grown significantly. At this bigger scale, interference is a bigger problem for multi-tenant machines, and performance profiling becomes even more critical. However, we have hit multiple issues in scaling the underlying Linux performance monitoring infrastructure to provide fresh and accurate data for our fleet. Performance profiling has to deal with the following issues:

        • Interference: To be tolerated by workloads, monitoring overhead
          should be minimal - usually below 2%, some latency-sensitive workloads
          are certainly even less tolerant than that. As we gain more
          introspection into our workloads, we end up having to use more and
          more events, to pinpoint certain bottlenecks. That unavoidably
          incurs event multiplexing as the number of core hardware counters is
          very limited compared to containers profiled and number of events monitored. Adding counters is not free in hardware and similarly in the kernel as more
          work registers must be saved and restored on context switches which can cause jitters for applications being profiled.
        • Accuracy: Sampling at machine level reduces some of the associated costs, but attributing the counters back to containers is lossy and we see a large drop in accuracy of profiling. The attribution gets progressively worse as we move to bigger machines with large number of threads. The attribution errors severely limit the granularity of performance improvements and degradations we can measure in our fleet.
        • Kernel overheads: Perf_events event multiplexing is a complex and expensive algorithm that is especially taxing when run in cgroup mode. As implemented, scheduling of cgroup events is bound by the number of cgroup events per-cpu and not the number of counters, unlike regular per-cpu monitoring. To get a consistent view of activity on a server, Google needs to periodically count events per-cgroup. Cgroup monitoring is preferred over per-thread monitoring because Google workloads tend to use an extensive number of threads, so that would be prohibitively expensive to use. We have explored ways to avoid these scaling issues and make event multiplexing faster.
        • User-space overheads: The bigger the machines, the larger the volume of profiling data generated. Google relies extensively on the perf record tool to collect profiles. There are significant user-space overheads to merge the per-cpu profiles and post-process for attribution. As we look to make perf-record multi-threaded for scalability, data collection and merging becomes yet another challenge.
        • Symbolization overheads: Perf tools rely on /proc/PID/maps to understand process mappings and to symbolize samples. The parsing and scanning of /proc/PID/maps is time-consuming with large overheads. It is also riddled with race conditions as processes are created and destroyed during parsing.

        These are some of the challenges we have encountered while using perf_events and the perf tool at scale. For this infrastructure to remain popular, it needs to adapt quickly to new hardware and data-center realities. We plan to share our findings and optimizations, followed by an open discussion on how best to solve these challenges.

        Speakers: Rohit Jnagal, Stephane Eranian (Google Inc), Ian Rogers (Google Inc)
      • 12:45
        printk: Why is it so complicated? 45m

        The printk() function has a long history of issues and has undergone many iterations to improve performance and reliability. Yet it is still not an acceptable solution to reliably allow the kernel to send detailed information to the user. And these problems are even magnified when using a real-time system. So why is printk() so complicated and why are we having such a hard time finding a good solution?

        This talk will briefly cover the history of printk() and why the recent major rework was necessary. It will go through the details of the rework and why we believe it solves many of the issues. And it will present the issues still not solved (such as fully synchronous console writing), why these issues are particularly complex and controversial, and review some of the proposed solutions for moving forward.

        This talk may be of particular interest to developers with experience or interest in lockless ring buffers, memory barriers, and NMI-safe synchronization.

        Speaker: John Ogness (Linutronix GmbH)
      • 13:30
        Lunch 1h 30m
      • 15:00
        What does remote attestation buy you? 45m

        TPM remote attestation (a mechanism allowing remote sites to ask a computer to prove what software it booted) was an object of fear in the open source community in the 2000s, a potential existential threat to Linux's ability to interact with the free internet. These concerns have largely not been realised, and now there's increasing interest in ways we can use remote attestation to improve security while avoiding privacy concerns or attacks on user freedom.

        More modern uses of remote attestation include simplifying deployment of machines to remote locations, easy recovery of systems with nothing more than a network connection, automatic issuance of machine identity tokens, trust-based access control to sensitive resources and more. We've released a full implementation, so this presentation will discuss how it can be tied in to various layers of the Linux stack in ways that give us new functionality without sacrificing security or freedom.

        Speaker: Matthew Garrett (Google)
      • 15:45
        Linux kernel fastboot on the way 45m

        Linux kernel fastboot is critical for all kinds of platforms: from embedded/smartphone to desktop/cloud, and it has been hugely improved over the years. But is it all done? Not yet!

        This talk will first share the optimizations done for our platform, which cut the kernel (inside a VM) boot time from 3000ms to 300ms, and then list future potential optimization points.

        Here are our optimizations:
        1. really enable device drivers' asynchronous probing, like i915 to improve boot parallelization
        2. deferred memory init leveraging memory hotplug feature
        3. Optimize rootfs mounting (including storage driver and mounting)
        4. kernel modules and configs optimization
        5. reduce the hypervisor cost
        6. tools for profiling/analyzing

        Potential optimization spots for the future, which need discussion and collaboration from the whole community:
        1. how to make maximal use of multi-core and effectively distribute boot tasks to each core
        2. smp init for each CPU core costs about 8ms, a big burden for large systems
        3. force highest cpufreq as early as possible (kernel decompress time)
        4. devices enumeration for firmware (like ACPI) set to be parallel
        5. in-kernel deferred memory init (for 4GB+ platform)
        6. user space optimization like systemd

        Speaker: Mr Feng Tang
      • 16:30
        Break 30m
      • 17:00
        Red Hat joins CI party, brings cookies 45m

        For the past couple of years the CKI ("cookie") project at Red Hat has been transforming the way the company tests kernels, going from staged testing to continuous integration. We've been testing patches posted to internal mailing lists, responding with our results, and last year we started testing the stable queues maintained by Greg KH, posting results to the "stable" mailing list.

        Now we'd like to expand our efforts to more upstream mailing lists and join forces with the CI systems already out there. We'll introduce you to the way our CI works, which tests we run, our extensive park of hardware, and how we report results. We'd like to hear what you need from a CI system and how we can improve. We'd like to invite you to cooperate, both long-term and right there, at a hackfest organized during the conference.

        Naturally, real cookies will make an appearance.

        Speakers: Nikolai Kondrashov (Red Hat), Veronika Kabatova (Red Hat)
      • 17:45
        Challenges of the RDMA subsystem 45m

        The RDMA subsystem in Linux (drivers/infiniband) is now becoming widely used and deployed outside its traditional use case of HPC. This wider deployment is creating demand for new interactions with the rest of the kernel and many of these topics are challenging. 

        This talk will include a brief overview of RDMA technology followed by an examination & discussion of the main areas where the subsystem has presented challenges in Linux: 

        • Very complex user API: an overview of the current design, and some reflection on historical poor choices
        • The DMA-from-user-space programming model, and the challenge of matching it to the DMA API in Linux
        • Development of user space drivers along with kernel drivers
        • Delegation of security decisions to HW
        • Interaction with file systems, DAX, and the page cache for long-term DMA
        • Inter-operation with GPU, DMABUF, VFIO and other direct DMA subsystems
        • Growing breadth of networking functionality and overlap with netdev, virtio, and nvme
        • Fragmentation of wire protocols and resulting HW designs
        • Placing high performance as paramount, and how this results in HW restrictions limiting the architecture and APIs of the subsystem

        The advent of new general computation acceleration hardware is seeing new drivers proposed for Linux that have many similar properties to RDMA. These emerging drivers are likely to face these same challenges and can benefit from lessons learned. 

        RDMA has been a successful mini-conference at the last three LPC events, and this talk is intended to complement the proposed RDMA micro-conference this year. This longer more general topic is intended to engage people unfamiliar with the RDMA subsystem and the detailed topics that would be included in the RDMA track. 

        The main goal would be to help others in the kernel community have more background on RDMA and its role when making decisions. In part this proposal is motivated by the number of times I heard the word 'RDMA' mentioned at LSF/MM, often as some opaque consumer of some feature.

        Jason Gunthorpe is a Sr. Principal Engineer at Mellanox and has been the co-maintainer for the RDMA subsystem for the last year and a half. He has 20 years' experience working with the Linux kernel and in RDMA and InfiniBand technologies.

        Speaker: Mr Jason Gunthorpe (Mellanox Technologies)
    • 10:00–18:30
      Networking Summit Track
      • 10:00
        Welcome 25m
        Speakers: Daniel Borkmann, David Miller (Red Hat Inc.)
      • 10:45
        BPF packet capture helpers, libbpf interfaces 45m

        Packet capture is useful from a general debugging standpoint, and is useful in particular in debugging BPF programs that do packet processing. For general debugging, being able to initiate arbitrary packet capture from kprobes and tracepoints is highly valuable (e.g. what do the packets that reach kfree_skb() - representing error codepaths - look like?). Arbitrary packet capture is distinct from the traditional concept of pre-defined hooks, and gives much more flexibility in probing system behaviour. For packet-processing BPF programs, packet capture can be useful for doing things such as debugging checksum errors. The intent of this proposal is to help drive discussion around how to ease use of such features in BPF programs, namely:

        • should additional BPF helper(s) be provided to format packet data suitable for libpcap interpretation?
        • should libbpf provide interfaces for retrieving packet capture data?
        • should interfaces be provided for pushing filters?

        Note that while there has been some work in this area already, it seems such efforts would be made much simpler if such APIs were provided.

      • 11:30
        Break 30m
      • 12:00
        Multipath TCP Upstreaming 45m

        Multipath TCP (MPTCP) is an increasingly popular protocol that members of the kernel community are actively working to upstream. A Linux kernel fork implementing the protocol has been developed and maintained since March 2009. While there are some large MPTCP deployments using this custom kernel, an upstream implementation will make the protocol available on Linux devices of all flavors.

        MPTCP is closely coupled with TCP, but an implementation does not need to interfere with operation of normal TCP connections. Our roadmap for MPTCP in Linux begins with the server use case, where connections and additional TCP subflows are generally initiated by peer devices. This will start with RFC 6824 compliance, but with a minimal feature set to limit the code footprint for initial review and testing.

        The MPTCP upstreaming community has shared an RFC patch set on the netdev list that shows our progress and how we plan to build around the TCP stack. We'll share our roadmap for how this patch set will evolve before final submission, and discuss how this first step will differ from the forked implementation.

        Once we have merged our baseline code, we have plans to continue development of more advanced features for managing subflow creation (path management), scheduling outgoing packets across TCP subflows, and other capabilities important for client devices that initiate connections. This includes making use of a userspace path manager, which has an alpha release available already. In future kernel releases we will make use of additional TCP features and optimize MPTCP performance as we get more feedback from kernel users.

        Both the communication and the code are public and open. You can find us at and

      • 12:45
        Programmable socket lookup with BPF 45m

        At Netconf 2019 we presented a BPF-based alternative to steering
        packets into sockets with iptables and the TPROXY extension, a
        mechanism of interest to us because it allows (1) services to share a
        port number when their IP address ranges don't overlap, and (2)
        reverse proxies to listen on all available port numbers.

        The solution adds a new BPF program type BPF_INET_LOOKUP, which is
        invoked during the socket lookup. The BPF program is able to steer SKBs
        by overwriting the key used for listening socket lookup. The attach
        point is associated with a network namespace.

        Since then, we have been reworking the solution to follow the existing
        pattern of using maps of socket references for redirecting packets, that
        is REUSEPORT_SOCKARRAY, SOCKMAP, or XSKMAP. We expect to publish the
        next version of BPF_INET_LOOKUP RFC patch set, which addresses the
        feedback from Netconf, in August.

        During the LPC 2019 BPF Microconference we would like to briefly recap
        how BPF-driven socket lookup compares to classic bind()-based
        dispatch, TPROXY packet steering, and the socket dispatch on TC
        ingress currently in development by Cilium.

        Next we would like to discuss low-level implementation challenges. How
        do we best ensure that packet delivery to connected UDP sockets
        remains unaffected? Can a BPF_INET_LOOKUP program co-exist with
        reuseport groups? Is there a possibility of code sharing with
        REUSEPORT_SOCKARRAY?

        Following the implementation discussion, we will touch on performance
        aspects: what is the observed cost of running BPF during socket
        lookup, in both SYN flood and UDP flood scenarios?

        Finally, we want to look at the usability of the user-space API.
        Redirection with a BPF map of sockets raises the question of who
        populates the map, and whether existing network applications like
        NGINX need to be modified in any way to receive traffic steered
        with this new mechanism.

        The desired outcome of the discussion is to identify the steps
        needed to graduate the patch set from an RFC series to a
        ready-for-review submission.

      • 13:30
        Lunch 1h 30m
      • 15:00
        XDP bulk packet processing 45m

        It is well known that batching can often improve software performance,
        mainly because it uses the instruction cache more efficiently.
        From the networking perspective, the size of a driver's packet-processing
        pipeline is larger than the size of the instruction cache. Even though NAPI
        batches packets across the full stack and driver execution, they are
        processed one by one by many large subsystems in the processing path. This
        was initially raised by Jesper Brouer, and with Edward Cree's listified-SKB
        idea, the first implementation results look promising. How can we take this
        a step further and apply the technique to the XDP processing pipeline?

        To do that, the proposal is to stop preparing the xdp_buff struct
        one by one, passing it to the XDP program and then acting on it; instead,
        the driver would prepare an array of XDP buffers to be processed. Then
        there would be only a single call to the XDP program per NAPI budget,
        which would return the list of actions the driver needs to take.
        Furthermore, the number of indirect function calls is reduced, as the
        driver reaches the JITed BPF program via an indirect function call.
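
        The shape of that bulk interface can be sketched in userspace C (all names here are made up for illustration; the real work happens in the driver's NAPI poll loop):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the proposed bulking: the driver hands the
 * whole batch to one "program" invocation and gets back an array of
 * verdicts. All names are hypothetical. */
enum xdp_verdict { VERDICT_DROP, VERDICT_PASS };

struct pkt_buf {
	unsigned char data[64];
	size_t len;
};

/* One call per NAPI budget instead of one call per packet. */
static void run_prog_bulk(const struct pkt_buf *pkts, int n,
			  enum xdp_verdict *actions)
{
	for (int i = 0; i < n; i++)
		actions[i] = (pkts[i].len == 0) ? VERDICT_DROP : VERDICT_PASS;
}
```

        The driver could then act on actions[] in batches as well, e.g. performing all drops before any passes.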

        In this talk I would like to present a proof of concept of the described
        idea, which yielded around 20% better XDP performance when dropping packets
        while touching header memory (a modified xdp1 from the Linux kernel's BPF
        samples).

        However, the main focus of this presentation should be a discussion of a
        proper, generic implementation, taking place after the PoC has been shown
        rather than dwelling on the PoC itself. I would like to consider
        implementation details such as:
        - would it be better to provide additional BPF verifier logic that, when
        properly instrumented (making use of prologue/epilogue?), would emit the
        BPF instructions responsible for looping over the XDP program, or should
        we have the loop within the XDP programs themselves?
        - the mentioned PoC has a whole new NAPI clean-Rx interrupt routine; what
        should we do to make it more generic so as to keep driver changes minimal?
        - how about batching the XDP actions? Do all the drops first, then
        Tx/redirect, then the passes. Would that pay off?

      • 15:45
        LAG and hardware offload to support RDMA and IO virtualized interfaces 45m

        Link Aggregation (LAG) is traditionally served by the bonding driver. The Linux bonding driver supports all LAG modes on almost any LAN driver, in software. However, modern hardware features such as SR-IOV-based virtualization and stateful offloads such as RDMA are currently not well supported by this model. One possible option is to implement LAG functionality entirely in the NIC's hardware or firmware. In our presentation we show another approach, where LAG functionality for stateful offloads such as RDMA and IO virtualization is implemented mostly in software, with very limited support from existing hardware and firmware, a concept that should make the solution more generic without complicating the HW any further.

        The presentation focuses on three areas: the implementation of an active-backup mode for RDMA and virtual functions, the use of the RX hash value to implement a flow-based active-active mode, and a new active-active mode for virtual functions.

        The proposed implementation of the active-backup mode for RDMA is done in the RDMA and LAN drivers. An application continues using direct HW support for RDMA. The LAN driver (with the help of the RDMA driver) observes notifications from the bonding driver and accordingly controls low-level TX scheduling and RX rules for the RDMA queues. The same mechanism can be used to transparently redirect network virtual functions from active to backup. We further explore the use of the RX hash to implement an active-active mode.

      • 16:30
        Break 30m
      • 17:00
        netfilter hardware offloads 45m

        With the advent of the flow rule and flow block API, ethtool_rx, netfilter, and tc can share the same infrastructure to represent hardware offloads.

        This presentation discusses the reuse of the existing infrastructure originally implemented by tc, such as the netdev_ops->ndo_setup_tc() interface and the TC_SETUP_CLSFLOWER classifier.

    • 10:00 13:30
      Testing and Fuzzing MC

      The Linux Plumbers 2019 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel.

      Potential topics:

      Defragmentation of testing infrastructure: how can we combine testing infrastructure to avoid duplication.
      Better sanitizers: Tag-based KASAN, making KTSAN usable, etc.
      Better hardware testing, hardware sanitizers.
      Are fuzzers "solved"?
      Improving real-time testing.
      Using Clang for better testing coverage.
      Unit test framework. Content will most likely depend on the state of the patch series closer to the event.
      Future improvement for KernelCI. Bringing in functional tests? Improving the underlying infrastructure?
      Making KMSAN/KTSAN more usable.
      KASAN work in progress
      Syzkaller (+ fuzzing hardware interfaces)
      Stable tree (functional) testing
      KernelCI (autobisect + new testing suites + functional testing)
      Kernel selftests
      Our objective is to gather the leading developers of the kernel and its related testing infrastructure and utilities in an attempt to advance the state of the various utilities in use (and possibly unify some of them), and the overall testing infrastructure of the kernel. We are hopeful that we can build on the experience of the participants of this MC to create solid plans for the upcoming year.

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Sasha Levin and Dhaval Giani

      • 10:00
        kernelCI: testing a broad variety of hardware 35m

        The Linux kernel runs on an extremely wide range of hardware, but
        with the rapid pace of kernel development, it's difficult to ensure
        the full range of supported hardware is adequately tested.

        The kernelCI project is a small but growing project focused on
        testing the core kernel on a diverse set of architectures, boards and
        compilers, using distributed labs to test hardware anywhere in the
        world.
        The goal of this presentation is to give a very brief overview of the
        project, and discuss the near-term future goals and plans.

        Recently added:
        - support for clang-build kernels
        - more arches: ARC, RISC-V, MIPS

        The future:
        - official Linux Foundation project launching
        - more tests: subsystem-focused test suites
        - more labs with more hardware
        - scaling of infrastructure
        - better reporting

        Speakers: Kevin Hilman (BayLibre), Guillaume Tucker (Collabora Limited)
      • 10:35
        Dealing with complex test suites 20m

        Boot testing is already hard to do well on a wide variety of
        hardware. However it is only scratching the surface of the
        kernel code base. To take projects such as Kernel CI to the next
        level and increase coverage, functional tests are becoming the
        next big thing on the list. Large test suites that run close to
        the hardware are very hard to tame. Some projects such as
        ezbench could become very helpful outside of its initial
        territory that is Intel graphics. But to start with, let us try
        to define the problem space and take a look at the state of the
        art in this area to then come up with ideas that apply to
        upstream kernel functional testing.

        Speaker: Guillaume Tucker (Collabora Limited)
      • 10:55
        GWP-ASAN 20m

        In this talk Dmitry will introduce the idea of GWP-ASan, a sampling tool that finds use-after-free and heap-buffer-overflow bugs in production environments. GWP-ASan supplements the normal slab allocator and chooses random allocations to 'sample'. These sampled allocations are placed into a special guarded pool, based on the traditional 'Electric Fence Malloc Debugger' idea. Dmitry will share experiences of using such a tool in user space and speculate about how useful such a tool would be for the kernel.

        Speaker: Dmitry Vyukov (Google)
      • 11:15
        Fighting uninitialized memory in the kernel 15m

        During the last two years, KMSAN (a detector of uses of uninitialized
        memory based on compiler instrumentation) has found more than a
        hundred bugs in the upstream kernel.
        We'll discuss the current status of the tool, some of its findings and
        implementation challenges. Ideally, I'd like to get more people to
        look at the code, as finding bugs in particular subsystems may require
        deeper knowledge of those subsystems.
        Another thing that'll be covered is the new stack and heap
        initialization features that will hopefully prevent most of the bugs
        related to uninitialized memory in the kernel.

        Speaker: Alexander Potapenko (Google)
      • 11:30
        Break 30m
      • 12:00
        syzbot: update and open problems 20m

        In this talk, Dmitry will share updates on syzkaller/syzbot since last year: USB fuzzing, bisection, memory leaks. He will also talk about open problems: testability of kernel components, test coverage, and the syzbot process.

        Speaker: Dmitry Vyukov (Google)
      • 12:20
        Collaboration/unification around unit testing frameworks 30m

        From the initial reactions and interest I have seen regarding KTF
        and the discussions on LKML around KUnit,
        it seems there is a general belief that some form of unit test framework
        like these can be a good addition to the tools and infrastructure already
        available in the kernel.

        It seems, however, that different people have different notions about
        what such a framework should ideally look like and which features belong
        in it. I'd like to see if we can bring that discussion forward by
        focusing on some of these items, where people seem to have quite
        differing views depending on where they come from. Here is a
        non-exhaustive list of topics that seem to pop up when this gets
        discussed:

        • "Purity" of unit testing - what constitutes a "unit" in the kernel?
        • Testing kernel code - user space vs kernel space? (both useful)
        • Immediate development/debugging requirements vs longer term needs
        • Driver/hardware interaction testing?
        • "Neat"-factor
        • ease of use
        • Network testing (more than 1 kernel involved)
        • How to best integrate with existing test infrastructure in the
        • Unification and simpliciation options

        I'd like to make a short intro into this, and hopefully we can have some
        good exchange based on that.

        Speaker: Dr Knut Omang (Oracle)
      • 12:50
        All about Kselftest 40m

        Kselftest started out as an effort to enable a developer-focused regression test framework in the kernel to ensure the quality of new kernel releases. Today it is an integral part of the Linux Kernel development process to qualify Linux mainline and stable release candidates.

        Shuah will go over the Kselftest framework, how to write tests that work well with the framework for effective reporting of results. In addition, Shuah will discuss how the framework is tailored for developers as well as users to serve their individual and unique needs and discuss future plans.
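
        Kselftest reports results in TAP form; as a rough sketch of what a framework-friendly test emits (using plain snprintf here rather than the ksft_* helpers from the kselftest headers):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the TAP-style output Kselftest expects. The real
 * framework provides helpers (ksft_set_plan(), ksft_test_result_pass()
 * and friends); plain snprintf keeps this sketch self-contained.
 * Returns 1 if the single check passed. */
static int emit_tap(char *buf, size_t len)
{
	int pass = (2 + 2 == 4); /* stand-in for a real check */

	snprintf(buf, len,
		 "TAP version 13\n"
		 "1..1\n" /* plan: one test */
		 "%s 1 sample_check\n",
		 pass ? "ok" : "not ok");
	return pass;
}
```

        Emitting the plan and one "ok"/"not ok" line per test is what lets CI systems aggregate results across the whole selftest suite.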

        Speakers: Shuah Khan (The Linux Foundation), Anders Roxell, Dan Rue
    • 10:00 13:30
      Toolchains MC

      The goal of the Toolchains Microconference is to focus on specific topics related to the GNU Toolchain and Clang/LLVM that have a direct impact in the development of the Linux kernel.

      The intention is to have a very practical MC, where toolchain and kernel hackers can engage and, together:

      Identify problems, needs and challenges.
      Propose, discuss and agree on solutions for these specific problems.
      Coordinate on how to implement the solutions, in terms of interfaces, patch submissions, etc. in both kernel and toolchain components.

      Consequently, we will discourage vague and general "presentations" in favor of concreteness and to-the-point discussions, encouraging the participation of everyone present.

      Examples of topics to cover:

      Header harmonization between kernel and glibc.
      Wrapping syscalls in glibc.
      eBPF support in toolchains.
      Potential impact/benefit/detriment of recently developed GCC optimizations on the kernel.
      Kernel hot-patching and GCC.
      Online debugging information: CTF and BTF

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Jose E. Marchesi and Elena Zannoni

    • 15:00 18:30
      Scheduler MC

      The Linux Plumbers 2019 Scheduler Microconference is about all scheduler topics that are not real-time.

      Potential topics:
      - Load Balancer Rework - prototype
      - Idle Balance optimizations
      - Flattening the group scheduling hierarchy
      - Core scheduling
      - Proxy Execution for CFS
      - Improving scheduling latency with SCHED_IDLE task
      - Scheduler tunables - Mobile vs Server
      - nohz
      - LISA for scheduler verification

      We plan to continue the discussions that started at OSPM in May'19 and get a wider audience outside the core scheduler developers at LPC.

      Potential attendees:
      Juri Lelli
      Vincent Guittot
      Subhra Mazumdar
      Daniel Bristot
      Dhaval Giani
      Paul Turner
      Rik van Riel
      Patrick Bellasi
      Morten Rasmussen
      Dietmar Eggman
      Steven Rostedt
      Thomas Gleixner
      Viresh Kumar
      Phil Auld
      Waiman Long
      Josef Bacik
      Joel Fernandes
      Paul McKenney
      Alessio Balsini
      Frederic Weisbecker

      This microconference covers scheduler topics that are not RT, and should take place either immediately before or after the Real-Time MC.

      MC leads:
      Juri Lelli, Vincent Guittot, Daniel Bristot de Oliveira, Subhra Mazumdar, Dhaval Giani

      • 15:00
        Core scheduling 45m

        There have been two different approaches to core scheduling proposed on the LKML over the past year. One was the coscheduling approach by Jan Schönherr, posted in an initial and a follow-up version.

        Upstream chose a different route and decided to modify CFS and only do "core scheduling". Vineeth picked up the patches from Peter Zijlstra. This is a discussion on how we can further that work, especially given security implications such as L1TF and MDS, which make it important for this work to go upstream.

        Aubrey Li will talk about Core scheduling: Fixing when fast instructions go slow

        Keeping system utilization high is important both to keep costs down and to keep energy efficiency up. That often means tightly packing compute jobs and using the latest processor features. However, these approaches can be at odds when a new processor feature like AVX512 is used. The performance of latency critical jobs can be reduced by 10% if co-located with deep learning training jobs. These jobs use AVX512 instructions to accelerate wide vector operations. Whenever a core executes AVX512 instructions, the core automatically reduces its frequency. This can lead to a significant overall performance loss for a non-AVX512 job on the same core. In this presentation, we will discuss how to preserve performance while still allowing AVX512-based acceleration.

        AVX512 task detection
        - From user space, PMU events can be used, but this is expensive.
        - In the kernel, I proposed exposing the elapsed time since a process last used AVX512 as a heuristic hint.
        - Discuss an interface for tasks in a cgroup.
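
        Assuming the hint is exposed as a text field (the field name below follows the proposal and may change), a userspace consumer could parse it like this:

```c
#include <assert.h>
#include <stdio.h>

/* Sketch of consuming the proposed hint from /proc/<pid>/arch_status.
 * The field name is taken from the proposal and may change; parsing is
 * shown against a sample line rather than live procfs. Returns the
 * milliseconds since the task last used AVX512, or -1 on no match. */
static long parse_avx512_elapsed(const char *line)
{
	long ms;

	if (sscanf(line, "AVX512_elapsed_ms: %ld", &ms) != 1)
		return -1;
	return ms;
}
```

        A job scheduler could treat a small value as "recently used AVX512" and avoid co-locating that task with latency-critical work.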

        AVX512 task isolation
        - Discuss a kernel-space solution: whether the recently proposed core scheduling can be leveraged for isolation.
        - Discuss a user-space solution: whether a user-space job scheduler is better than the kernel scheduler.

      • 15:45
        Proxy Execution 15m

        Proxy execution can be considered a generalization of the real-time priority-inheritance mechanism. With proxy execution, a task can run using the context of some other task that is "willing" to let the first task run, as this improves performance for both. With this topic I'd like to detail the progress that has been made since the initial RFC posting on LKML and discuss open problems and questions.

        Speaker: Juri Lelli (Red Hat)
      • 16:00
        Making SCHED_DEADLINE safe for kernel kthreads 30m

        Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr() that can result in SCHED_DEADLINE tasks starving RCU's kthreads for extended time periods: not milliseconds, not seconds, not minutes, not even hours, but days. Given that RCU CPU stall warnings are issued whenever an RCU grace period fails to complete within a few tens of seconds, the system did not suffer silently. Although one could argue that people should avoid abusing sched_setattr(), people are human and humans make mistakes. Responding to simple mistakes with RCU CPU stall warnings is all well and good, but a more severe case could OOM the system, which is a particularly unhelpful error message.

        It would be better if the system were capable of operating reasonably despite such abuse. Several approaches have been suggested.

        First, sched_setattr() could recognize parameter settings that put kthreads at risk and refuse to honor those settings. This approach of course requires that we identify precisely what combinations of sched_setattr() parameters settings are risky, especially given that there are likely to be parameter settings that are both risky and highly useful.
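
        A sketch of such a parameter check, with an assumed 95% cutoff purely for illustration (values in nanoseconds, as in struct sched_attr):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the first approach: reject SCHED_DEADLINE parameters that
 * leave too little CPU time for kthreads. The 95% cutoff is assumed
 * here for illustration; identifying the real risky combinations is
 * the hard part noted above. */
static int deadline_params_risky(uint64_t runtime_ns, uint64_t period_ns)
{
	if (period_ns == 0 || runtime_ns > period_ns)
		return 1; /* malformed: treat as risky */

	/* Risky if runtime/period > 0.95, i.e. runtime*100 > period*95. */
	return runtime_ns * 100 > period_ns * 95;
}
```

        The open question is precisely where this threshold should sit, given that some legitimate workloads genuinely need near-total CPU reservations.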

        Second, in theory, RCU could detect this situation and take the "dueling banjos" approach of increasing its priority as needed to get the CPU time that its kthreads need to operate correctly. However, the required amount of CPU time can vary greatly depending on the workload. Furthermore, non-RCU kthreads also need some amount of CPU time, and replicating "dueling banjos" across all such Linux-kernel subsystems seems both wasteful and error-prone. Finally, experience has shown that setting RCU's kthreads to real-time priorities significantly harms performance by increasing context-switch rates.

        Third, stress testing could be limited to non-risky regimes, such that kthreads get CPU time every 5-40 seconds, depending on configuration and experience. People needing risky parameter settings could then test the settings that they actually need, and also take responsibility for ensuring that kthreads get the CPU time that they need. (This of course includes per-CPU kthreads!)

        Fourth, bandwidth throttling could treat tasks in other scheduling classes as an aggregate group having a reasonable aggregate deadline and CPU budget. This has the advantage of allowing "abusive" testing to proceed, which allows people requiring risky parameter settings to rely on this testing. Additionally, it avoids complex progress checking and priority setting on the part of many kthreads throughout the system. However, if this were an easy choice, the SCHED_DEADLINE developers would likely have selected it. For example, it is necessary to determine what might be a "reasonable" aggregate deadline and CPU budget. Reserving 5% seems quite generous, and RCU's grace-period kthread would optimally like a deadline in the milliseconds, but would do reasonably well with many tens of milliseconds, and absolutely needs a few seconds. However, for CONFIG_RCU_NOCB_CPU=y, RCU's callback-offload kthreads might well need a full CPU each! (This happens when the CPU being offloaded generates a high rate of callbacks.)

        The goal of this proposal is therefore to generate face-to-face discussion, hopefully resulting in a good and sufficient solution to this problem.

        Speaker: Paul McKenney (IBM Linux Technology Center)
      • 16:30
        Break 30m

        Coffee, Tea and Snacks

      • 17:00
        CFS load balance rework 30m

        The CFS load balancer has become more and more complex over the years and has reached the point where its policy sometimes cannot be explained. Furthermore, the available metrics have evolved, and load balancing does not always take full advantage of them when calculating the imbalance. It is probably a good time to rework the load-balance code, as proposed in a recently posted patchset.
        In addition to that patchset, we could discuss the next evolutions of load_balance.

        Speaker: Vincent Guittot (Linaro)
      • 17:30
        Flattening the hierarchy discussion 15m

        There is a presentation in the refereed track on flattening the CPU controller runqueue hierarchy, but it may be useful to have a discussion on the same topic in the scheduler microconference.

        Speaker: Rik van Riel (Facebook)
      • 17:45
        Scheduler domains and cache bandwidth 15m

        The Linux Kernel scheduler represents a system's topology by the means of
        scheduler domains. In the common case, these domains map to the cache topology
        of the system.

        The Cavium ThunderX is an ARMv8-A 2-node NUMA system, each node containing
        48 CPUs (no hyperthreading). Each CPU has its own L1 cache, and CPUs within
        the same node share the same L2 cache.

        Running some memory-intensive tasks on this system shows that, within a
        given NUMA node, there are "socklets" of CPUs. Executing those tasks
        (which exercise the L2 cache) on CPUs of the same "socklet" leads to a
        reduction of per-task memory bandwidth.
        On the other hand, running those same tasks on CPUs of different "socklets"
        (but still within the same node) does not lead to such a memory bandwidth
        reduction.
        While not truly equivalent to sub-NUMA clustering, such a system could benefit
        from a more fragmented scheduler domain representation, i.e. grouping these
        "socklets" in different domains.

        This talk will be an opportunity to discuss ways for the scheduler to leverage
        this topology characteristic and potentially change the way scheduler domains
        are built.

        Speaker: Valentin Schneider (Arm Ltd)
      • 18:00
        TurboSched: Core capacity Computation and other challenges 15m

        TurboSched is a proposed scheduler enhancement that aims to sustain turbo frequencies for a longer duration by explicitly marking small tasks that are known to be jitters and packing them onto a smaller number of cores. This ensures that the other cores remain idle, and the energy thus saved can be used by CPU-intensive tasks to sustain higher frequencies for a longer duration.

        The current TurboSched RFCv4 has some challenges:

        • Core Capacity Computation: Spare core capacity defines the upper bound for task packing, above which jitter tasks should not be packed further into a core lest they hurt the performance of the other tasks running on that core. To achieve this we need a mechanism to compute the capacity of a core in terms of its active SMT threads. But the computation of CPU capacity is itself arguable and unreliable in the case of CPU hotplug events. This can make TurboSched behave unexpectedly across hotplug events or in the presence of asymmetric CPU capacities. The discussion also involves using other parameters, such as nr_running together with utilization, to decide the upper bound for task packing.

        • Interface: There are multiple approaches to marking a small task as a jitter. A cgroup-based approach is favorable to the distros, as it is a well-understood interface requiring minimal modification of existing tools. However, the kernel community has objected to this interface, since whether a task is a jitter is a task attribute, not a task-group attribute. Further, a task being a jitter is not a resource-partitioning problem, which is what cgroups aim to solve. The other approach would be to define this via a sched attribute that can be updated via an existing syscall. Finally, we could support both approaches, as discussed on LWN.

        • Limiting the Search Domain for packing: On systems with a large number of CPUs, searching all the CPUs onto which small tasks could be packed can be expensive in the task-wakeup path. Hence we should limit the
          domain of CPUs over which the search is conducted. In the current implementation, TurboSched uses the DIE domain to pack tasks on PowerPC, but certain architectures might prefer the LLC or NUMA domains. Thus we need to discuss a unified way of describing the search domain that works across all architectures.

        This topic is a continuation of the OSPM talk and aims to mitigate these problems generically across architectures.

        Speaker: Parth Shah
      • 18:15
        Task latency-nice 15m

        Currently there is no user control over how much time the scheduler should spend searching for CPUs when scheduling a task. It is hardcoded logic based on heuristics that do not work well in many cases, e.g. for very short-running tasks. The proposal is a new latency-nice property that users can set on a task (similar to the nice value) to control the search time and potentially also the preemption logic. We will also discuss the best interfaces for this (potentially cgroups).
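
        A hypothetical shape for the knob, purely illustrative (no such field exists in struct sched_attr today):

```c
#include <assert.h>

/* Hypothetical per-task latency-nice knob, mirroring the nice range
 * (-20..19): lower values ask for lower wakeup latency (a cheaper,
 * shorter idle-CPU search). Names and semantics are illustrative. */
struct task_hint {
	int latency_nice; /* -20 (latency sensitive) .. 19 (don't care) */
};

static int set_latency_nice(struct task_hint *t, int val)
{
	if (val < -20 || val > 19)
		return -1; /* out of range, akin to -EINVAL */
	t->latency_nice = val;
	return 0;
}
```

        Reusing the familiar nice range is one option for the interface discussion; a cgroup attribute is another.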

        Speaker: Subhra Mazumdar
    • 15:00 18:40

      The PCI interconnect specification and the devices implementing it are incorporating more and more features aimed at high performance systems (eg RDMA, peer-to-peer, CCIX, PCI ATS (Address Translation Service)/PRI(Page Request Interface), enabling Shared Virtual Addressing (SVA) between devices and CPUs), that require the kernel to coordinate the PCI devices, the IOMMUs they are connected to and the VFIO layer used to managed them (for userspace access and device passthrough) with related kernel interfaces that have to be designed in-sync for all three subsystems.

      The kernel code that enables these new system features requires coordination between VFIO/IOMMU/PCI subsystems, so that kernel interfaces and userspace APIs can be designed in a clean way.

      Following up the successful LPC 2017 VFIO/IOMMU/PCI microconference, the Linux Plumbers 2019 VFIO/IOMMU/PCI track will therefore focus on promoting discussions on the current kernel patches aimed at VFIO/IOMMU/PCI subsystems with specific sessions targeting discussion for kernel patches that enable technology (eg device/sub-device assignment, peer-to-peer PCI, IOMMU enhancements) requiring the three subsystems coordination; the microconference will also cover VFIO/IOMMU/PCI subsystem specific tracks to debate patches status for the respective subsystems plumbing.

      Tentative topics for discussion:

      Shared Virtual Addressing (SVA) interface
      SRIOV/PASID integration
      Device assignment/sub-assignment
      IOMMU drivers SVA interface consolidation
      IOMMUs virtualization
      IOMMU-API enhancements for mediated devices/SVA
      Possible IOMMU core changes (like splitting up iommu_ops, better integration with device-driver core)
      DMA-API layer interactions and how to get towards generic dma-ops for IOMMU drivers
      Resources claiming/assignment consolidation
      PCI error management
      PCI endpoint subsystem
      prefetchable vs non-prefetchable BAR address mappings (cacheability)
      Kernel NoSnoop TLP attribute handling
      CCIX and accelerators management
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Bjorn Helgaas, Lorenzo Pieralisi, Joerg Roedel, and Alex Williamson

      • 15:00
        User interfaces for per-group default domain type 25m

        This topic will discuss 1) why we need a per-group default domain type, 2) how it solves problems in real IOMMU drivers, and 3) the user interfaces.

        Speaker: Baolu Lu
      • 15:25
        VFIO/IOMMU/PCI speaker change 5m
      • 15:30
        Status of Dual Stage SMMUv3 integration 25m

        Since August 2018 I have been working on SMMUv3 nested stage integration
        at IOMMU/VFIO levels, to allow virtual SMMUv3/VFIO integration.

        This shares some APIs with the Intel and ARM SVA series (cache invalidation,
        fault reporting) but also introduces some specific ones to pass information
        about guest stage 1 configuration and MSI bindings.

        In this session I would like to discuss the upstream status and get a chance
        to clarify open points. This is also an opportunity to synchronize about the VFIO fault reporting requirements for recoverable errors.

        Speaker: Eric Auger (Red Hat)
      • 15:55
        VFIO/IOMMU/PCI speaker change 5m
      • 16:00
        PASID Management in Linux 25m

        PASID (Process Address Space ID) is a PCIe capability that enables sharing of a single device across multiple isolated address domains. It has become a hot topic in I/O technology evolution; e.g., it is the foundation of SVM and SIOV. Combined with the varied usages of PASID and the configuration differences across vendor architectures, this raises an interesting topic of PASID management in Linux, especially with regard to the software complexity of VM live-migration support in the cloud. This talk will review PASID usages and configuration methods, then elaborate on the gaps in PASID management, and finally propose a solution and start the discussion with peers.

        Speaker: Mr Pan Jacob (Intel)
      • 16:25
        VFIO/IOMMU/PCI speaker change 5m
      • 16:30
        Architecture considerations for vfio/iommu handling 15m

        While x86 is probably the most prominent platform for vfio/iommu development and usage, other architectures also see quite a bit of movement. These architectures are similar to x86 in some parts and quite different in others; therefore, sometimes issues come up that may be surprising to folks mostly working on more common platforms.

        For example, PCI on s390 uses special instructions. QEMU needs to fill in 'real' values for some memory-layout parameters for devices passed via vfio and needs a way to retrieve them.

        Other architectures (e.g. ARM) may also have some unusual requirements not obvious to people not working on those platforms. It seems beneficial to at least raise awareness of those issues so that we don't end up with interfaces/designs that are hard to implement or not sufficient on less common platforms.

        Speaker: Cornelia Huck
      • 16:45
        VFIO/IOMMU/PCI main break 20m
      • 17:05
        Optional or reduced PCI BARs 25m

        Modern PCI graphics devices may contain several gigabytes of memory mapped in their BARs. This trend is continuing into storage with NVMe devices containing large Controller Memory Buffers and Persistent Memory Regions.

        Some PCI hierarchies are resource constrained and cannot fit as many devices as desired. In NVMe's case, it's preferable to enumerate and attach all devices rather than use the entire memory window for one or two devices with large, optional BARs.

        The current PCI core will prevent a PCI device from being enabled if any of its BARs are unset. This proposal is about a way to hint to the PCI layer that some BARs are optional and could be omitted or reduced (by limiting them at the bridge window) in order to keep such devices enabled.

        Speaker: Jonathan Derrick
      • 17:30
        VFIO/IOMMU/PCI speaker change 5m
      • 17:35
        PCI Resources assignment policies 25m

        This is meant to be a rather open discussion on PCI resource assignment policies. I plan to discuss a bit what the different arch/platforms do today, how I've tried to consolidate it, then we can debate the pro/cons of the different approaches and decide where to go from there.

        Speaker: Benjamin Herrenschmidt (Amazon AWS)
      • 18:00
        VFIO/IOMMU/PCI speaker change 5m
      • 18:05
        Use IOMMU to prevent DMA attacks from Thunderbolt devices 15m

        The Thunderbolt vulnerabilities are public and nowadays have a nice name: Thunderclap. This talk will introduce the vulnerabilities we have identified in Linux and how we are fixing them.

        Speaker: Baolu Lu
      • 18:20
        VFIO/IOMMU/PCI speaker change 5m
      • 18:25
        Implementing NTB controller using PCIe endpoint 15m

        A PCI-Express non-transparent bridge (NTB) is a point-to-point PCIe bus
        connecting 2 host systems. NTB functionality can be achieved on a platform
        having 2 endpoint instances. Here each of the endpoint instances will be
        connected to an independent host, and the hosts can communicate with each
        other using the endpoint as a bridge. The endpoint framework and the "new"
        NTB EP function driver should configure the endpoint instances in such a way
        that transactions from one endpoint are routed to the other endpoint
        instance. The host will see the connected endpoint as an NTB port, and the
        existing NTB tools (ntb_pingpong, ntb_perf) in the Linux kernel could be
        used.

        Speaker: Mr Kishon Vijay Abraham I
    • 15:00 18:30
      You, Me, and IoT MC

      The Internet of Things (IoT) has been growing at an incredible pace as of late.

      Some IoT application frameworks expose a model-based view of endpoints, such as

      on-off switches
      dimmable switches
      temperature controls
      door and window sensors

      Other IoT application frameworks provide direct device access, by creating real and virtual device pairs that communicate over the network. In those cases, writing to the virtual /dev node on a client affects the real /dev node on the server. Examples are

      GPIO (/dev/gpiochipN)
      I2C (/dev/i2cN)
      SPI (/dev/spiN)
      UART (/dev/ttySN)

      Interoperability (e.g. ZigBee to Thread) has been a large focus of many vendors due to the surge in popularity of voice-recognition in smart devices and the markets that they are driving. Corporate heavyweights are in full force in those economies. OpenHAB, on the other hand, has become relatively mature as a technology and vendor agnostic open-source front-end for interacting with multiple different IoT frameworks.

      The Linux Foundation has made excellent progress bringing together the business community around the Zephyr RTOS, although there are also plenty of other open-source RTOS solutions available. The linux-wpan developers have brought 6LowPan to the community, which works over 802.15.4 and Bluetooth, and that has paved the way for Thread, LoRa, and others. However, some closed or quasi-closed standards must rely on bridging techniques mainly due to license incompatibility. For that reason, it is helpful for the kernel community to preemptively start working on application layer frameworks and bridges, both community-driven and business-driven.

      For completely open-source implementations, experiments have shown promising results with Greybus, with a significant amount of code already in staging. The immediate benefits to the community in that case are clear. There are a variety of key subjects below the application layer that come into play for Greybus and other frameworks that are actively under development, such as

      Device Management
      - are devices abstracted through an API, or is a virtual /dev node provided?
      - unique ID / management of possibly many virtual /dev nodes and connection info
      Network Management
      - standards are nice (e.g. 802.15.4) and help to streamline in-tree support
      - is non-standard tech best kept out of tree?
      - userspace utilities beyond the command line (e.g. NetworkManager, NetLink extensions)
      Network Authentication
      - re-use machinery for e.g. 802.11 / 802.15.4?
      - a generic approach for other MAC layers?
      - in userspace via e.g. SSL, /dev/crypto
      Firmware Updates
      - generally a different protocol for each IoT framework / application layer
      - Linux solutions should re-use components, e.g. SWUpdate

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      This Microconference will be a meeting ground for industry and hobbyist contributors alike and promises to shed some light on what is yet to come. There might even be a sneak peek at some new OSHW IoT developer kits.

      The hope is that some of the more experienced maintainers in linux-wpan, LoRa and OpenHAB can provide feedback and suggestions for those who are actively developing open-source IoT frameworks, protocols, and hardware.

      MC leads
      Christopher Friedt, Jason Kridner, and Drew Fustini

    • 10:00 13:30
      Databases MC

      Databases utilize and depend on a variety of kernel interfaces and are critically dependent on their specification, conformance to that specification, and performance. Failure in any of these results in data loss, lost revenue, or a degraded experience, or, if discovered early, software debt. Specific interfaces can also replace small or large parts of user-space code, creating greater efficiencies.

      This microconference will get a group of database developers together to talk about how their databases work, along with kernel developers currently developing a particular database-focused technology to talk about its interfaces and intended use.

      Database developers are expected to cover:

      The architecture of their database;
      The kernel interfaces utilized, particularly those critical to performance and integrity;
      What is a general performance profile of their database with respect to kernel interfaces;
      What kernel difficulties they have experienced;
      What kernel interfaces are particularly useful;
      What kernel interfaces would have been nice to use, but were discounted for a particular reason;
      Particular pieces of their codebase that have convoluted implementations due to missing syscalls; and
      The direction of database development and what interfaces to newer hardware, like NVDIMM or atomic-write storage, would be desirable.
      The aim for kernel developers attending is to:

      Gain a relationship with database developers;
      Understand where in-development kernel code will need additional input from database developers;
      Gain an understanding on how to run database performance tests (or at least who to ask);
      Gain appreciation for previous work that has been useful; and
      Gain an understanding of what would be useful aspects to improve.
      The aim for database developers attending is to:

      Gain an understanding of who is implementing the functionality they need;
      Gain an understanding of kernel development;
      Learn about kernel features that exist, and how they can be incorporated into their implementation; and
      Learn how to run a test on a new kernel feature.
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC lead
      Daniel Black

      • 10:00
        Open Session 5m

        Quick introduction of people. Frame discussion. Will be quick I promise.

        Speaker: Daniel Black (IBM)
      • 10:05
        io_uring - excitement - looking for feedback & potential issues 15m

        Many developers are excited about the progress reported on this new interface, but is it being followed / considered by kernel developers? What kind of gains should we expect? Are there any potential issues or feedback to share?

        Speaker: Dimitri KRAVTCHUK
      • 10:20
        disk write barriers 20m

        For example, for write-ahead logging one needs to guarantee that writes to the log are completed before the corresponding data pages are written. fsync() on the log file does this, but it is overkill for the purpose.
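        The ordering constraint in question can be sketched as follows (a minimal illustration, not how any particular database implements it; file names are hypothetical, and fsync() here plays the role of the barrier that a cheaper primitive could replace):

```python
import os

def wal_commit(log_path, data_path, log_record, page):
    """Write-ahead logging: the log record must be durable before
    the corresponding data page is written in place."""
    # 1. Append the log record and force it to stable storage.
    with open(log_path, "ab") as log:
        log.write(log_record)
        log.flush()
        os.fsync(log.fileno())  # the "barrier": log first, always
    # 2. Only now may the data page be overwritten; durability of
    #    the page itself can be deferred (e.g. to a checkpoint).
    with open(data_path, "r+b") as data:
        data.write(page)
```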

        Speaker: Sergei Golubchik
      • 10:40
        Filesystem atomic writes / O_ATOMIC 15m

        The patches proposed by Fusion-io developers for general O_ATOMIC support in the Linux kernel seem to have been in stand-by for six years -- are there any plans to address this? What is the main reason not to guarantee atomicity of O_DIRECT writes on flash drives? Most flash storage vendors are able to provide atomic write support at the hardware level; only the software level (kernel/FS/etc.) is missing. The main benefit for MySQL/InnoDB would be to get rid of the "double write" buffer that protects against data corruption (partially written pages): today every page is written twice, increasing IO write traffic, doubling page write latency, and halving flash drive life expectancy.
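        For context, the doublewrite scheme that hardware atomic writes would make redundant looks roughly like this (a simplified sketch; real InnoDB batches pages and adds checksums):

```python
import os

PAGE_SIZE = 16384  # InnoDB's default page size

def doublewrite(dwb_path, data_path, page_no, page):
    """Write a page twice so a torn (partial) write can be recovered."""
    assert len(page) == PAGE_SIZE
    # 1. Write the page to the doublewrite buffer and make it durable.
    with open(dwb_path, "r+b") as dwb:
        dwb.write(page)
        os.fsync(dwb.fileno())
    # 2. Write the page in place. If a crash tears this write, recovery
    #    restores the intact copy from the doublewrite buffer.
    with open(data_path, "r+b") as data:
        data.seek(page_no * PAGE_SIZE)
        data.write(page)
        os.fsync(data.fileno())
```

Every committed page thus costs two writes and two fsyncs, which is exactly the overhead an O_ATOMIC guarantee would eliminate.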

        Speaker: Dimitri KRAVTCHUK
      • 10:55
        MySQL @EXT4 performance impacts with latest Linux kernels 20m

        Since newer kernels (4.14, 5.1, ...) we have been observing a 50% regression on MySQL IO-bound workloads using EXT4, compared to results on the same hardware running kernel 3.x or 4.1. Unfortunately we have absolutely no explanation for this regression right now, and we are looking for any available FS-layer instrumentation/visibility to understand the root cause of the regression and how it can be bypassed from MySQL code (or fixed, if the problem is in EXT4).

        (more details are expected up to conference date)

        Speaker: Dimitri KRAVTCHUK
      • 11:15
        MySQL @XFS 15m

        Historically, XFS always showed lower performance than EXT4 on most of the IO-bound workloads used for MySQL/InnoDB benchmark testing. However, with recent kernels and XFS we now observe significantly better results on XFS vs. EXT4, particularly when the InnoDB "double write" is enabled. On the other hand, to our big surprise, XFS did worse when "double write" was disabled (which seems like nonsense: how can overall performance be worse when we do half as many IO writes on the same IO-bound workload?). Fortunately we found a workaround to bypass this issue, but we still lack a deep understanding of the problem, and of observation/visibility details from the XFS layer. It all looks like a kind of IO starvation, but how can it be detected in time, or ahead of time?

        (more details are expected up to conference date)

        Speaker: Dimitri KRAVTCHUK
      • 11:30
        Break 30m
      • 12:00
        What SQLite Devs Wish Linux Filesystem Devs Knew About SQLite 7m

        (1) SQLite is the most widely used database in the world. There are probably in excess of 300 billion active SQLite databases on Linux devices. SQLite is a significant client of the Linux filesystem - perhaps the largest single non-streaming client, especially on small devices such as phones.

        (2) Unlike other relational database engines, SQLite tends to live out on the edge of the network, not in the datacenter.

        (3) An SQLite database is a single ordinary file in the filesystem. The database file format is well-defined and stable. The US Library of Congress designates SQLite database files as a recommended format for long-term archive storage of structured data.

        (4) SQLite is not a client/server database. SQLite is a library. The application makes a function call that contains SQL text and SQLite translates that SQL into a sequence of filesystem operations that implement the desired operation, all within the same thread. There is no messaging and no IPC. There is no server process that hangs around to coordinate access to the database file.

        (5) SQLite does not get to choose a filesystem type or mount options. It has to make do with whatever is at hand. Therefore, SQLite really wants to be able to discover filesystem properties at run-time, so that it can tune its behavior for maximum performance and reliability.

        (6) Diagrams showing how SQLite creates the illusion of atomic commit on a non-atomic filesystem.
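        The rollback-journal technique behind point (6), reduced to its essence (a toy sketch; real SQLite journals individual pages and adds headers, checksums, and locking):

```python
import os

def atomic_update(db_path, new_content):
    """Overwrite db_path so that a crash at any point is recoverable."""
    journal = db_path + "-journal"
    # 1. Save the original content in a journal file and make it durable.
    with open(db_path, "rb") as db:
        original = db.read()
    with open(journal, "wb") as j:
        j.write(original)
        os.fsync(j.fileno())
    # 2. Overwrite the database; a crash here leaves the journal behind.
    with open(db_path, "wb") as db:
        db.write(new_content)
        os.fsync(db.fileno())
    # 3. Deleting the journal is the atomic commit point.
    os.remove(journal)

def recover(db_path):
    """On startup: a leftover journal means the update did not commit."""
    journal = db_path + "-journal"
    if os.path.exists(journal):
        with open(journal, "rb") as j, open(db_path, "wb") as db:
            db.write(j.read())
            os.fsync(db.fileno())
        os.remove(journal)
```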

        Speaker: Dr Richard Hipp (SQLite)
      • 12:07
        IO: Durability, Errors and Documentation 20m

        Postgres (and many other databases) had, until fairly recently, assumed that IO errors would (a) be reliably signalled by fsync/fdatasync/..., and (b) that repeating an fsync after a failure would either result in another failure or the IO operations would succeed.

        That turned out not to be true.

        While a few improvements have been made, both in postgres and linux, the situation is still pretty bad.

        From my point of view, a large part of the problem is that linux does not document what error and durability behaviour userspace can expect from certain operations.

        Problematic areas for the kernel:
        - The regular behaviour of durability-related fs syscalls is not documented. One extreme example of that is sync_file_range (look at the warning section of the manpage)
        - FS behaviour when encountering IO errors is poorly documented, if at all. For example, there still is no documentation of fsync's error behaviour, and ext4's errors= option reads as if it applied to all IO errors, but it only applies to metadata errors.
        - There is very little consistency in error behaviour between filesystems, to the degree that XFS will return different data after a writeback failure than ext4.
        - There is no usable interface to query / be notified of IO errors
        - the rapid development of thin provisioned storage has increased the likelihood of IO errors drastically, as large parts of the IO stack treat out-of-space on the block level as an IO error
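        The retry hazard behind assumption (b) can be made concrete with a sketch (the key point is in the comment):

```python
import os

def durable_write(path, data):
    """Write data and force it to stable storage, failing loudly."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        try:
            os.fsync(fd)
        except OSError as exc:
            # Do NOT just retry fsync(): after a writeback failure the
            # kernel may have dropped the dirty pages and cleared the
            # error state, so a second fsync() can "succeed" without the
            # data ever reaching disk. The only safe reaction is to treat
            # the data as lost (PostgreSQL now PANICs and replays WAL).
            raise RuntimeError(f"write to {path} is not durable") from exc
    finally:
        os.close(fd)
```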

        It seems worthwhile to work together to at least partially clean this up.

        Speakers: Andres Freund (EnterpriseDB / PostgreSQL), Mr Tomas Vondra (Postgresql)
      • 12:27
        Time series of thread profiles in production 15m

        At MongoDB, we implemented an eBPF tool to collect and display a complete time-series view of information about all threads, whether they are on- or off-CPU. This allows us to inspect where the database server spends its time, both in userspace and in the kernel. Its minimal overhead allows us to deploy it in production.

        This can be an effective method to collect diagnostic information in the field and surface a specific workload which is bound by a syscall. It would be interesting to hear what solution other vendors use to profile in production.

        Speaker: Josef Ahmad (MongoDB Inc.)
      • 12:42
        New InnoDB REDO log design and MT sync challenges 15m

        Since MySQL 8.0 we have a newly redesigned lock-free REDO log implementation. However, this development raised several questions about the overall efficiency of MT communication and synchronization. Curiously, spinning on the CPU turned out to be the most efficient approach under low load. Are there any plans to implement a "generic" MT framework for more efficient execution of any MT application?

        Speaker: Mr Pawel OLCHAWA
      • 12:57
        IP / UNIX Socket Backlog 15m

        There is a "backlog" option used in MySQL for both IP and UNIX sockets, but it seems to have a significant overhead on workloads with heavy connect/disconnect activity (e.g. most web apps, which do "connect; SQL query; disconnect"). Is there any explanation/reason for this? Can it be improved?
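        For reference, the option in question is the backlog argument of listen(2), which caps the queue of completed-but-not-yet-accepted connections (the kernel silently truncates it to net.core.somaxconn):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
# MySQL exposes this value as its back_log setting; under heavy
# connect/disconnect churn the accept queue fills up and further
# incoming SYNs are dropped or refused.
srv.listen(128)
print(srv.getsockname())
srv.close()
```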

        Speaker: Dimitri KRAVTCHUK
      • 12:57
        IP port vs. UNIX socket: the IP stack is 20-30% slower on MySQL 15m

        MySQL allows user session connections via an IP port or a UNIX socket on Linux systems. Curiously, however, connecting via UNIX socket delivers up to 30% higher performance compared to a local IP port (loopback). Is there any reason for this? Can the "loopback" code be improved to match the efficiency of UNIX sockets, and can the same improvements make the whole IP stack more efficient?
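        The gap can be reproduced with a trivial echo round-trip test along these lines (a rough sketch, not a rigorous benchmark; payload size and iteration count are arbitrary):

```python
import os, socket, tempfile, threading, time

def echo_server(listener):
    # Accept one connection and echo everything back until EOF.
    conn, _ = listener.accept()
    while (chunk := conn.recv(4096)):
        conn.sendall(chunk)
    conn.close()

def round_trips(listener, addr, family, n=1000):
    threading.Thread(target=echo_server, args=(listener,), daemon=True).start()
    c = socket.socket(family, socket.SOCK_STREAM)
    c.connect(addr)
    payload = b"x" * 64
    t0 = time.perf_counter()
    for _ in range(n):
        c.sendall(payload)
        c.recv(4096)
    elapsed = time.perf_counter() - t0
    c.close()
    return elapsed

# TCP over the loopback interface
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.bind(("127.0.0.1", 0)); tcp.listen(1)
t_tcp = round_trips(tcp, tcp.getsockname(), socket.AF_INET)

# UNIX domain socket
path = os.path.join(tempfile.mkdtemp(), "bench.sock")
uds = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
uds.bind(path); uds.listen(1)
t_uds = round_trips(uds, path, socket.AF_UNIX)

print(f"tcp loopback: {t_tcp:.3f}s  unix socket: {t_uds:.3f}s")
```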

        Speaker: Dimitri KRAVTCHUK
      • 13:12
        Regressions due to CPU cache issues and missing visibility in Linux/kernel instrumentation 10m

        All MT apps are extremely sensitive to CPU cache issues, and MySQL/InnoDB is no exception. Several times we have observed significant regressions (up to 40% and more) due to CPU cache misses or simply cache-line synchronization caused by concurrent access to the same variable from several threads, while all "perf" CPU-related stats showed no difference. Are there any plans to address this with deeper CPU stats instrumentation?

        Speaker: Mr Pawel OLCHAWA
      • 13:12
        Syscall overhead from Spectre/Meltdown fixes 10m

        Users are very worried about any kind of overhead due to the kernel patches applied to mitigate Intel CPU issues (Spectre/Meltdown/etc.). What are others observing? What kind of workloads / test cases do you use for evaluation?

        Speaker: Dimitri KRAVTCHUK
      • 13:22
        Conclusion 8m

        From discussions to code. Where does it go from here?

        Speaker: Daniel Black (IBM)
    • 10:00 18:30
      Kernel Summit Track
      • 11:30
        Break 30m
      • 13:30
        Lunch 1h 30m
      • 16:30
        Break 30m
    • 10:00 18:30
      LPC Refereed Track
      • 10:00
        BPF is eating the world, don't you see? 45m

        The BPF VM in the kernel is being used in ever more scenarios where running a restricted, validated program in kernel space provides a super-powerful mix of flexibility and performance that is transforming how the kernel works.

        That creates challenges for developers, sysadmins and support engineers; having tools for observing what BPF programs are doing in the system is critical.

        A lot has been done recently in improving tooling such as perf and bpftool to help with that, trying to make BPF fully supported for profiling, annotating, tracing, debugging.

        But not all perf tools can be used with JITed BPF programs right now. Areas that need work, such as placing probes and collecting variable contents, as well as further utilizing BTF for annotation, require interaction with developers to gather insights for further improvements, so as to have the full perf tool chest available for use with BPF programs.

        These recent advances and this quest for feedback about what to do next should be the topic of this talk.

        Speaker: Arnaldo Carvalho de Melo (Red Hat Inc.)
      • 10:45
        oomd2 and beyond: a year of improvements 45m

        Running out of memory on a host is a particularly nasty scenario. In the Linux kernel, if memory is being overcommitted, it results in the kernel out-of-memory (OOM) killer kicking in. Perhaps surprisingly, the kernel does not often handle this well. oomd builds on top of recent kernel development to effectively implement OOM killing in userspace. This results in a faster, more predictable, and more accurate handling of OOM scenarios.

        oomd has gained a number of new features and interesting deployments in the last year. The most notable feature is a complete redesign of the control plane which enables arbitrary but "gotcha"-free configurations. In this talk, Daniel Xu will cover past, present, future, and path-not-taken development plans along with experiences gained from overseeing large deployments of oomd.

        Speaker: Daniel Xu (Facebook)
      • 11:30
        Break 30m
      • 12:00
        Integration of PM-runtime with System-wide Power Management 45m

        There are two flavors of power management supported by the Linux kernel: system-wide PM based on transitions of the entire system into sleep states and working-state PM focused on controlling individual components when the system as a whole is working. PM-runtime is part of working-state PM concerned about the opportunity to put devices into low-power states when they are not in use.

        Since both PM-runtime and system-wide PM act on devices in a similar way (that is, they both put devices into low-power states and possibly enable them to generate wakeup signals), optimizations related to the handling of already suspended devices can be made, at least in principle. In particular:
        - It should be possible to avoid resuming devices already suspended by runtime PM during system-wide PM transitions to sleep states.
        - It should be possible to leave devices suspended during system-wide PM transitions to sleep states in PM-runtime suspend while resuming the system from those states.
        - It should be possible to re-use PM-runtime callbacks in device drivers for the handling of system-wide PM.

        These optimizations are done by some drivers, but making them work in general turns out to be a hard problem. They are achieved in different ways by different drivers and some of them are in effect only in specific platform configurations. Moreover, there are no general guidelines or recipes that driver writers can follow in order to arrange for these optimizations to take place. In an attempt to start a discussion on approaching this problem space more consistently, I will give an overview of it, describe the solutions proposed and used so far and suggest some changes that may help to improve the situation.

        Speaker: Rafael Wysocki (Intel Open Source Technology Center)
      • 12:45
        Kernel Address Space Isolation 45m

        Recent vulnerabilities like L1 Terminal Fault (L1TF) and Microarchitectural Data Sampling (MDS) have shown that the cpu hyper-threading architecture is very prone to leaking data with speculative execution attacks.

        Address space separation is a proven technology to prevent side channel vulnerabilities when speculative execution attacks are used. It has, in particular, been successfully used to fix the Meltdown vulnerability with the implementation of Kernel Page Table Isolation (KPTI).

        Kernel Address Space Isolation aims to use address spaces to isolate some parts of the kernel to prevent leaking sensitive data under speculative execution attacks.

        A particularly good example is KVM. When running KVM, a guest VM can use speculative execution attacks to leak data from the sibling hyper-thread, thus potentially accessing data from the host kernel, from the hypervisor or from another VM, as soon as they run on the same hyper-thread.

        If KVM can be run in an address space containing no sensitive data, and separated from the full kernel address space, then KVM would be immune from leaking secrets no matter on which cpu it is running, and no matter what is running on the sibling hyper-threads.

        A first proposal to implement KVM Address Space Isolation has recently been submitted and received good feedback and discussion.

        This presentation will show the progress and challenges faced while implementing KVM Address Space Isolation. It also looks forward to discussing the possibility of a more generic kernel address space isolation framework (not limited to KVM), and how it can be interfaced with the current memory management subsystem in particular.

        MERGED with:

        Address space isolation has been used to protect the kernel from the
        userspace and userspace programs from each other since the invention of
        the virtual memory.

        Assuming that kernel bugs and therefore vulnerabilities are inevitable
        it might be worth isolating parts of the kernel to minimize damage
        that these vulnerabilities can cause.

        Recently we've implemented a proof-of-concept for "system call
        isolation (SCI)" mechanism that allows running a system call with
        significantly reduced page tables. In our model, the accesses to a
        significant part of the kernel memory generate page faults, thus
        giving the "core kernel" an opportunity to inspect the access and
        refuse it on a pre-defined policy.

        Our first target for the system call isolation was an attempt to
        prevent ROP gadget execution [1], and despite its weakness it makes a
        ROP attack harder to execute and as a nice side effect SCI can be used
        as Spectre mitigation.

        Another topic of interest is a marriage between namespaces and address
        spaces. For instance, the kernel objects that belong to a particular
        network namespace can be considered as private data and they should
        not be mapped in other network namespaces.

        This data separation greatly reduces the ability of a tenant in one
        namespace to exfiltrate data from a tenant in a different namespace
        via a kernel exploit because the data is no longer mapped in the
        global shared kernel address space.

        We believe it would be helpful to discuss the general idea of address
        space isolation inside the kernel, both from the technical aspect of
        how it can be achieved simply and efficiently and from the isolation
        aspect of what actual security guarantees it usefully provides.


        Speakers: Alexandre Chartre (Oracle), James Bottomley (IBM), Mike Rapoport (IBM), Joel Nider (IBM Research)
      • 13:30
        Lunch 1h 30m
      • 15:00
        Enabling TPM based system security features 45m

        Nowadays all consumer PC/laptop devices contain a TPM 2.0 security chip (due to Windows hardware requirements). Servers and embedded devices also increasingly carry these TPMs. The TPM provides several security functions to the system and the user, such as a smartcard-like secure keystore and key operations, secure secret storage, bruteforce-protected access control, etc.

        These capabilities can be used in a multitude of scenarios and use cases, including disk encryption, device authentication, user authentication, network authentication, etc. of desktops/laptops, servers, IoTs, mobiles, etc.
        Utilizing the TPM requires several layers of software: the driver (inside the kernel), TPM middleware (a TSS implementation), security middleware (e.g. PKCS#11), and applications (e.g. ssh).

        This talk first gives an architectural overview of the hardware and software components involved in typical use cases. Then we will dive into a set of concrete use cases and the different ways in which they can be built up; these use cases will relate to device/user authentication around PKCS#11 and OpenSSL implementations.

        The talk will end with a list of software and works in progress for introducing TPM functionality to core applications. Finally, a list of potential projects for extending the utilization of the TPM in core software is presented. This latter list shall then drive the discussion of which software is missing from the list, or which software has contributors attending who would like to include such features. The current lists of core software are available and updated online.

        Keywords: core libraries, device support, security, tpm, tss

        Speaker: Mr Andreas Fuchs (Fraunhofer SIT)
      • 15:45
        Utilizing tools made for "Big Data" to analyse Ftrace data - making it fast and easy 45m

        Tools based on low-level tracing tend to generate large amounts of data, typically output in some kind of text or binary format. On the other hand, the predefined data analysis features of those tools are often useless when it comes to solving a nontrivial or very user-specific problem. This is when the possibility of making sophisticated analyses via scripting can be extremely useful.

        Fast and easy scripting over the tracing data is possible if we take advantage of already existing infrastructure, originally developed for the purposes of the "Big Data" and ML industries. A PoC interface for accessing Ftrace data in Python (via NumPy arrays) will be demonstrated, together with a few examples of analysis scripts. Currently the prototype of the interface is implemented as an extension of KernelShark. This is a work in progress, and we hope to receive advice from experts in the field to make sure the end result works seamlessly for them.
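        As an illustration of the kind of analysis such an interface enables, a NumPy-style filter over per-event arrays might look like this (the field names and data are invented for the example; the actual PoC API may differ):

```python
import numpy as np

# Hypothetical per-event arrays, one entry per Ftrace event,
# as the PoC exposes them to Python.
ts  = np.array([100, 105, 112, 130, 131, 150])   # timestamps (us)
pid = np.array([ 17,  42,  17,  17,  42,  17])   # task PIDs
cpu = np.array([  0,   1,   0,   0,   1,   0])   # CPU ids

# Select all events of PID 17 on CPU 0 and compute inter-event gaps.
mask = (pid == 17) & (cpu == 0)
gaps = np.diff(ts[mask])
print(gaps)  # -> [12 18 20]
```

The point is that once the trace is in NumPy arrays, slicing, masking, and aggregation over millions of events run at C speed with one-line expressions.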

        Speaker: Yordan Karadzhov (VMware)
      • 16:30
        Break 30m
      • 17:00
        CPU controller on a single runqueue 45m

        The cgroups CPU controller in the Linux scheduler is implemented using hierarchical runqueues, which introduces a lot of complexity, and incurs a large overhead with frequently scheduling workloads. This presentation is about a new design for the cgroups CPU controller, which uses just one runqueue, and instead scales the vruntime by the inverse of the task priority. The goal is to make people familiar with the new design, so they know what is going on, and do not need to spend a month examining kernel/sched/fair.c to figure things out.
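        The vruntime scaling mentioned above follows CFS's existing weighting rule; a sketch of the arithmetic (integer math simplified; the weights come from the kernel's sched_prio_to_weight table):

```python
NICE_0_LOAD = 1024  # load weight of a nice-0 task

def calc_delta_fair(delta_exec_ns, weight):
    # vruntime advances inversely to the task's weight: heavier
    # (higher-priority) tasks accrue vruntime more slowly and so
    # get a proportionally larger share of the CPU.
    return delta_exec_ns * NICE_0_LOAD // weight

# Both tasks run for 10ms of wall time:
print(calc_delta_fair(10_000_000, 1024))  # nice 0  -> 10000000
print(calc_delta_fair(10_000_000, 3121))  # nice -5 -> 3280999
```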

        Speaker: Rik van Riel (Facebook)
      • 17:45
        Formal verification made easy (and fast)! 45m

        Linux is complex, and formal verification has been gaining more and more attention because independent "asserts" in the code can be ambiguous and not cover all the desired points. Formal models aim to avoid such problems of natural language, but the problem is that "formal modeling and verification" sound complex. Things have been changing.

        What if I say it is possible to verify Linux behavior using a formal method?

        • Yes! We already have some models; people have been talking about it, but they seem to be very specific (Memory, Real-time...).

        What if I say it is possible to model many Linux subsystems, to auto-generate code from the model, to run the model on-the-fly, and that this can be as efficient as just tracing?

        • No way!

        Yes! It is! It is hard to believe, I know.

        In this talk, the author will present a methodology based on events and state (automata), and how to model Linux' complex behaviors with small and intuitive models. Then, how to transform the model into efficient C code, that can be loaded into the kernel on-the-fly to verify Linux! Experiments have also shown that this can be as efficient as tracing (sometimes even better)!
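        A toy version of the events-and-states idea (the model and event names here are invented for illustration, not the author's actual tooling): the model is a deterministic automaton, and every trace event must follow an allowed transition or verification fails.

```python
# Invented model: "preemption must be re-enabled before a task switch".
MODEL = {
    ("preempt_enabled",  "preempt_disable"): "preempt_disabled",
    ("preempt_disabled", "preempt_enable"):  "preempt_enabled",
    ("preempt_enabled",  "sched_switch"):    "preempt_enabled",
}

def verify(trace, state="preempt_enabled"):
    """Run the trace through the automaton; report the first bad event."""
    for event in trace:
        nxt = MODEL.get((state, event))
        if nxt is None:
            return f"violation: {event} not allowed in state {state}"
        state = nxt
    return "ok"

print(verify(["preempt_disable", "preempt_enable", "sched_switch"]))
print(verify(["preempt_disable", "sched_switch"]))
```

In the methodology presented, code generated from such a model is attached to kernel tracepoints and plays the role of verify(), checking transitions on-the-fly instead of over a recorded list.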

        This methodology can be applied to many of the kernel subsystems, and the idea of this talk is also to discuss how to proceed towards a more formally verified Linux!

        Speaker: Daniel Bristot de Oliveira (Red Hat, Inc.)
    • 10:00 18:30
      Networking Summit Track
      • 10:45
        Linux Kernel VxLan with Multicast Routing for flood handling 45m

        The Linux kernel VXLAN driver supports two ways of handling flooded traffic to multiple remote VXLAN termination end points (VTEPs):
        (a) Head-end replication: the VXLAN driver sends a copy of the packet to each participating remote VTEP
        (b) Use of multicast routing to forward to participating remote VTEPs

        (b) is generally preferred for both hardware and software VTEP deployments because it scales better. The kernel VXLAN driver supports (b) with static configuration today: one has to specify the multicast group along with the outgoing uplink interface for VXLAN multicast replication to work. This is mostly OK for deployments where VTEPs are deployed on the host/hypervisor. When deploying Linux VTEPs on the top-of-rack (TOR) switches in a data center CLOS network, it is impossible to configure the outgoing interface statically. Typically a multicast routing protocol like PIM is used to dynamically calculate multicast trees and install forwarding paths for multicast traffic.

        In this talk we will cover:
        - VXLAN multicast deployment scenarios with VXLAN VTEPs at the TOR switches
        - Current challenges with integrating VXLAN multicast replication in a dynamic multicast routing environment
        - Solutions to these challenges: (a) patches to fix routing of locally generated multicast packets (the need for ip_mr_output) (b) patches to the VXLAN driver to allow multicast replication without a static outgoing interface
        - Scaling
        - Future directions for VXLAN deployments in multicast environments

      • 11:30
        Break 30m
      • 12:00
        SwitchDev offload optimizations 45m

        Linux has a nice SW bridge implementation which provides most of the classic
        Ethernet switching features. DSA and SwitchDev frameworks allow us to
        represent HW switch devices in Linux and potentially offload the SW forwarding
        to HW.

        But the offloading facilities are not perfect, and there seems to be room for
        further improvements:

        • Limiting the flooding of L2 multicast traffic. IGMP snooping can limit the
          flooding of L3 traffic, but L2 multicast traffic is always flooded.

        • Today all bridge slave interfaces are put into promiscuous mode to allow
          learning/flooding. But if the bridge is offloaded to HW capable of doing the
          learning/flooding, then this should not be necessary.

        • When not put into promiscuous mode, the struct net_device structure has a
          list of multicast addresses which should be received by the interface. But
          when VLAN sub-interfaces are created, the VLAN information is lost when
          addresses are installed in the mc list.

        • The assumption in the bridge code is that all multicast frames go to the
          CPU. But what would it actually take to forward only the needed multicast
          frames to the CPU?

        • Challenges in adding new redundancy and protection protocols to the kernel,
          and how to offload such protocols to HW.

        The intent of this talk is to present some of the issues we are facing in
        adding DSA/SwitchDev drivers for existing and near-future HW. I will have few solutions to present, but will give our thoughts on how the issues may be solved. Hopefully this will result in good discussions and input from the audience.

        Background information: I'm working on a SwitchDev driver for a yet-to-be-released
        HW Ethernet switch. It will be a TSN switch targeting industrial
        networks, with HW accelerators to implement redundancy protocols. CPU power is very limited and latency is extremely important, which is why it is important for us to improve the HW offload facilities.

      • 12:45
        Future ipv4 unicast extensions 45m

        IPv4's success story was in carrying unicast packets.
        Service sites still need IPv4 addresses for everything,
        since the majority of Internet client nodes don't yet
        have IPv6 addresses. IPv4 addresses now cost 15 to 20
        dollars apiece (times the size of your network!) and
        the price is rising.

        The IPv4 address space includes hundreds of millions of
        addresses reserved for obscure reasons (the ranges 0/8
        and 127/16), for obsolete reasons (225/8-231/8), or for
        "future use" (240/4, otherwise known as class E).
        Instead of leaving these IP addresses unused, we have
        started an effort to make them usable, generally. This
        work stalled out 10 years ago, because IPv6 was going
        to be universally deployed by now, and reliance on IPv4
        was expected to be much lower than it in fact still is.

        We have been reporting bugs and sending patches to
        various vendors. For Linux, we have patches accepted
        in the kernel and patches pending for the
        distributions, routing daemons, and userland tools.
        Slowly but surely, we are decontaminating these IP
        addresses so they can be used in the near future.

        Many routers already handle many of these addresses,
        or can easily be configured to do so, and so we are
        working to expand unicast treatment of these addresses
        in routers and other OSes. We plan an authorized
        experiment to route some of these addresses globally,
        monitor their reachability from different parts of the
        Internet, and talk to ISPs who are not yet treating
        them as unicast to update their networks.

        Wouldn't it be a better world with a few hundred
        million more IPv4 addresses in it?

      • 13:30
        Lunch 1h 30m
      • 15:00
        Making the Kubernetes Service Abstraction Scale using eBPF 45m

        In this talk, we will present a scalable re-implementation of the Kubernetes service abstraction with the help of eBPF. We will discuss recent changes in the kernel which made the implementation possible, and some changes in the future which would simplify the implementation.

        Kubernetes is an open-source container orchestration multi-component distributed system. It provides mechanisms for deploying, maintaining, and scaling applications running in containers across a multi-host cluster. Its smallest scheduling unit is called a pod. A pod consists of multiple co-located containers. Each pod has its own network namespace and is addressed by a unique IP address in a cluster. Network connectivity to and among pods is handled by an external plugin.

        Multiple pods which provide the same functionality can be grouped into services. Each service is reachable within a cluster via its virtual IP address allocated by Kubernetes. A service can also be exposed outside of a cluster via the public IP address of a cluster host and a port allocated by Kubernetes. Each request sent to a service is load-balanced to any of its pods.

        Kube-proxy is a Kubernetes component which is responsible for the service abstraction implementation. The default implementation is based on Netfilter's iptables. For each service and its pods, it creates a couple of rules in the nat table which load-balance across the pods. For example, for an "nginx" service with a given virtual IP address and two running pods, the following relevant iptables rules are created:

        -A KUBE-SERVICES -d -p tcp -m comment --comment "default/nginx: cluster IP" -m tcp --dport 80 -j KUBE-SVC-253L2MOZ6TC5FE7P
        -A KUBE-SVC-253L2MOZ6TC5FE7P -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-PCCJCD7AQBIZDZ2N
        -A KUBE-SEP-PCCJCD7AQBIZDZ2N -p tcp -m tcp -j DNAT --to-destination
        -A KUBE-SEP-UFVSO22B5A7KHVMO -p tcp -m tcp -j DNAT --to-destination

        It has been demonstrated [1][2][3] that kube-proxy, due to its foundational technologies (Netfilter, iptables), is one of the major pain points when running Kubernetes at large scale, from a performance, reliability, and operations perspective.

        Cilium is an open-source networking and security plugin for container orchestration systems such as Kubernetes. Unlike the majority of such networking plugins, it heavily relies on eBPF technology, which lets one dynamically reprogram the kernel.

        The most recent Cilium v1.6 release brings an eBPF implementation of the Kubernetes service abstraction. This allows one to run a Kubernetes cluster without kube-proxy, making Kubernetes no longer dependent on Netfilter/iptables, and improves the scalability and reliability of a Kubernetes cluster.

        No Kubernetes knowledge is required. The talk might be relevant for those who are interested in container networking with eBPF (load balancing, NAT).


      • 15:45
        Making Networking Queues a First Class Citizen in the Kernel 45m

        XDP (the eXpress Data Path) is a new method in Linux to process
        packets at L2 and L3 with really high performance. Use cases
        involving ingress packet filtering, or transmission back through
        the ingress interface, are already deployed and well supported
        today. However, as we expand the use cases that involve the
        XDP_REDIRECT action, e.g., to send packets to other devices, or
        zero-copy them to userspace sockets, it becomes challenging to retain
        the high performance of the simpler operating modes.

        One of the keys to getting good performance for these advanced use
        cases is effective use of dedicated hardware queues (on both Rx and
        Tx), as this makes it possible to split traffic over multiple CPUs,
        with no synchronization overhead in the fast path. The problem with
        using hardware queues like this is that they are a constrained
        resource, yet hidden from the rest of the kernel: currently, each
        driver allocates queues according to its own whims, and users have
        little or no control over how the queues are used or configured.

        In this presentation we discuss an abstraction that makes it possible
        to keep track of queues in a vendor-neutral way: We implement a new
        submodule in the Linux networking core that drivers can register their
        queues to. Other pieces of code can then allocate and free individual
        queues (or sets of them) satisfying certain properties (e.g., "a Tx/Rx
        pair", or "one queue per core"). This submodule also makes sure that
        the queues get IDs that are hardware independent, so that they can
        easily be used by other components. We show how this could be exposed
        to userspace, and how it can interact with the existing REDIRECT
        primitives, such as device maps.

        Finally, if there is time, we would like to discuss a related problem:
        often a userspace program wants to express its configuration not in
        terms of queue IDs, but in terms of a set of packets it wants to
        process (e.g., by specifying an IP address). So how do we change user
        space APIs that use queue IDs to be able to use something more
        meaningful such as properties of the packet flow that a user wants? To
        solve this second problem, we propose to introduce a new bind option
        in AF_XDP that takes a simple description of the traffic that is
        desired (e.g. "VLAN ID 2", "IP address fc00:dead:cafe::1", or "all
        traffic on a netdev"). This hides queue IDs from userspace, but will
        use the new queue logic internally to allocate and configure an
        appropriate queue.

      • 16:30
        Break 30m
      • 17:00
        Seamless transparent encryption with BPF and Cilium 45m

        Providing encryption in dynamic environments such as Kubernetes, where nodes are added and removed on-the-fly and services spin up and are torn down frequently, has numerous challenges. Cilium, an open source software package for providing and transparently securing network connectivity, leverages BPF and the Linux encryption capabilities to provide L3/L7 encryption and authentication at the node and service layers, giving users the ability to apply encryption either to entire nodes or to specified services. Once configured through a high-level feature flag (--enable-encrypt-l3, --enable-encrypt-l7), the management is transparent to the user. Cilium will manage and ensure traffic is encrypted, allowing for auditing of encrypted/unencrypted flows via a monitoring interface to ensure compliance.

        In this talk we will show how Cilium accomplishes this in the Linux datapath and control plane, and discuss how Cilium with Linux and BPF fits into evolving encryption standards and frameworks such as IPsec, mTLS, the Secure Production Identity Framework For Everyone (SPIFFE), and Istio. Looking forward, we propose a set of extensions to the Linux kernel, specifically to the BPF infrastructure, to ease the adoption and improve the efficiency of these protocols. Specifically, we will look at a series of BPF helpers, possible hardware support, scaling to thousands of nodes, and transparently enforcing policy on encrypted sessions.

        Finally, to show this is not mere slide-ware, we will give a demo of Cilium implementing transparent encryption.

    • 10:00 13:30
      Open Printing MC

      The Open Printing (OP) organisation works on the development of new printing architectures, technologies, printing infrastructure, and interface standards for Linux and Unix-style operating systems. OP collaborates with the IEEE-ISTO Printer Working Group (PWG) on IPP projects.

      We maintain cups-filters which allows CUPS to be used on any Unix-based (non-macOS) system. Open Printing also maintains the Foomatic database which is a database-driven system for integrating free software printer drivers with CUPS under Unix. It supports every free software printer driver known to us and every printer known to work with these drivers.

      Today it is very hard to think about printing in UNIX-based OSes without the involvement of Open Printing. Open Printing has also been successful in implementing driverless printing following the IPP standards proposed by the PWG.

      Proposed Topics:

      Working with SANE to make IPP scanning a reality. We need to make scanning work without device drivers similar to driverless printing.
      Common Print Dialog Backends.
      Printer/Scanner Applications - The new format for printer and scanner drivers. A simple daemon emulating a driverless IPP printer and/or scanner.
      The Future of Printer Setup Tools - IPP Driverless Printing and IPP System Service. Controlling tools like cups-browsed (or perhaps also the print dialog backends?) to make the user's print dialogs show only the relevant printers, or to create printer clusters.
      3D Printing without the use of any slicer. A filter that can convert an STL file to G-code.
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Till Kamppeter or Aveek Basu

      • 10:00
        Printing in Linux as of today 20m

        Today it is hard to imagine being without a mobile phone, a laptop, or a tablet. With the progress of technology and all these handheld devices, we have been able to get many of our documents digitized. However, whatever advancements we see in this space, it is still very hard to find someone who has never had the need to print or scan a hard copy. Even today a critical agreement gets signed over a hard copy, as do most of our banking documents or promo advertisements in a supermarket.

        The OpenPrinting (OP) organization works on the development of new printing architectures, technologies, printing infrastructure, and interface standards for Linux and UNIX-style operating systems. OP collaborates with the IEEE-ISTO Printer Working Group (PWG) on IPP projects.
        We maintain cups-filters, which allows CUPS to be used on any Unix-based (non-macOS) system. OpenPrinting also maintains the Foomatic database, a database-driven system for integrating free software printer drivers with CUPS under Unix. It supports every free software printer driver known to us and every printer known to work with these drivers.

        OpenPrinting has been doing a commendable job in improving the way the world prints on UNIX-based systems. The projects that we maintain are taken up by almost all the Linux distributions and, most recently, Google Chrome OS. They are also used by most of the printer manufacturers to support printing. Today it is very hard to think about printing in these OSes without the involvement of OpenPrinting. We have been successful in implementing driverless printing following the IPP standards proposed by the PWG. Because of that, today someone can print from a Linux box by just connecting a printer over the network or USB. Using a printer has become as simple as using a thumb drive.

        A short showcase on printing in Linux.

        Speakers: Aveek Basu, Till Kamppeter
      • 10:20
        Common Print Dialog Backends 40m

        The OpenPrinting project “Common Print Dialog Backends” provides a D-Bus interface to separate the print dialog GUI from the communication with the actual printing system (CUPS, Google Cloud Print, etc.): each printing system is supported by a backend, and these GUI-independent backends work with all print dialogs (GTK/GNOME, Qt/KDE, LibreOffice, etc.). This allows for easily updating all print dialogs when something in a print technology changes, as only the appropriate backend needs to get updated. New print technologies can also be introduced easily by adding a new backend.
        To quickly get this concept into the Linux distributions, we need these important tasks to be done:
        The CUPS backend tells the print dialog only about printer-specific user-settable options, not about general options implemented in CUPS or cups-filters and thus available for all print queues. These are options like N-up, reverse order, selected pages, etc. As they are specific to CUPS and not necessarily available with other print technologies like Google Cloud Print, they should get reported to the print dialog by the CUPS backend.
        A print dialog should allow printing into a (PDF) file. This should be implemented in a new print dialog backend. [DONE:]
        As it will take time until GTK4 with its new print dialog is out, we should get support for the new Common Print Dialog Backends concept into the current GTK3 dialog. As this dialog has its own backend concept, one would simply need an “adapter” backend to get from the old concept to the new, common concept.
        [Qt print dialog integration]

        Speakers: Rithvik Patibandla, Till Kamppeter
      • 11:00
        Working with SANE to make IPP scanning a reality 30m

        Printing has progressed a lot, and the world is already utilising the benefits of driverless printing. Nowadays it is very hard to think of a printer without a scanner, but unfortunately a technology like driverless scanning has yet to see the light of day: today you cannot use a scanner without a scanner driver. We want to discuss this and what needs to be done to get rid of this problem.

        Version 2.0 and newer of the Internet Printing Protocol (IPP) supports polling the full set of capabilities of a printer, and if the printer supports a known Page Description Language (PDL), like PWG Raster, Apple Raster, PCLm, or PDF, it is possible to print without printer-model-specific software (a driver) or data (a PPD file), so-called “driverless” printing. This concept was introduced for printing from smartphones and IoT devices, which do not hold a large collection of printer drivers. Driverless printing is already fully supported under Linux. Standards following this scheme are IPP Everywhere, Apple AirPrint, Mopria, and Wi-Fi Direct Print. As there are many multi-function devices (printer/scanner/copier all-in-one) which use IPP, the Printer Working Group (PWG) has also worked out a standard for IPP-based, “driverless” scanning, to also allow scanning from a wide range of client devices, independent of which operating systems they are running. Conventional scanners are supported under Linux via the SANE (Scanner Access Now Easy) system and require drivers specific to the different scanner models, most of them written based on reverse-engineering due to lack of support by the scanner manufacturers. To get driverless scanning working with the software users are used to, the best solution is to write a SANE module for driverless IPP scanning. This module will then automatically support all IPP scanners: thousands of scanners, many of which do not even exist yet.
        Another application for driverless IPP scanning is sharing local scanners which are accessed with SANE. Instead of being a UI, either command-line or graphical, the SANE frontend could be a daemon which emulates an IPP scanner on the network, executing the clients' scan requests on the local scanner.
        This way the client only needs to support IPP scanning, no driver for the actual scanner is needed, and the client can be of any operating system or device type, including mobile phones, tablets, IoT, etc.

        Speaker: Aveek Basu
      • 11:30
        Break 30m


      • 12:00
        Printer/Scanner Applications - The new format for printer and scanner drivers 30m

        The upstream author of CUPS has deprecated the classic way to implement printer drivers: describing the printer's capabilities in PPD (PostScript Printer Description) files and providing filters to convert standard PDLs (Page Description Languages) into the printer's own, often proprietary data format. With PostScript no longer being the standard PDL, most modern (even the cheapest) printers being IPP driverless printers (using standard PDLs, with the printer's capabilities pollable via IPP), and modern systems using sandboxed application packaging (Snappy, Flatpak, etc.), the new Printer Application concept got introduced.
        A Printer Application is a (simple) daemon emulating a driverless IPP printer (it can be in the local network but also simply on localhost). Like a physical printer, this daemon advertises itself via DNS-SD, takes get-printer-attributes IPP requests and answers with printer capability info so that the client can create a local print queue pointing to it, takes print jobs, converts them to the physical printer's data format, and sends them off to the printer.
        This way the client "sees" a driverless IPP printer, and the Printer Application is the printer driver (the printer-model-specific software that makes the printer work). With the driver connected to the system's printing stack only via IP, and not consisting of files spread across directories of the printing stack, both the printing stack and the driver can be separate, sandboxed applications, provided as sandboxed packages in the app stores of the appropriate packaging systems (Snappy, Flatpak, etc.). This also means the driver no longer depends on a specific operating system distribution: a printer manufacturer only needs to make a driver "for Snappy", not for Ubuntu Desktop/Server, Ubuntu Core, Red Hat, SUSE, etc., making development and testing much easier and cheaper.
        And one can even go further: as the Printer Working Group (PWG) has also created an IPP driverless scanning standard, we can create Scanner Applications emulating a driverless IPP scanner and internally using scanner drivers, like SANE, to communicate with the scanner, allowing the same form of OS-distribution-independent sandboxed driver packages for ANY scanner, especially also stand-alone scanners without a printing engine.
        For multi-function printers one could also have a combined Printer/Scanner Application. Any such Printer and/or Scanner Application can even provide an IPP System Service interface to allow configuring the driver without the need for specialized GUI applications on the client.
        We have a Google Summer of Code student working on a framework for Printer Applications, to convert classic printer drivers into Printer Applications and kick off the new standard.
        In this session we will present the new format, its integration into real-life systems, problems we ran into during the work with our student, and how to present it to hardware manufacturers as the new way to go.

        Speaker: Till Kamppeter
      • 12:30
        The Future of Printer Setup Tools - IPP Driverless Printing and IPP System Service 30m

        Printer setup tools are very common in the daily life of computer users: GUI applications where you configure a queue for a new printer which you want to use. You select the printer from the auto-detected ones and choose a driver for it; nowadays it is getting rather common that the driver is selected automatically. You also set option defaults, like Letter/A4, print quality, …
        With the advent of driverless IPP printers and the automatic setup of network printers, the classic printer setup tool gets less important. One sees this especially on smartphones and tablets, which do not even have a printer setup tool; option settings and default printers are selected in the print dialogs.
        But this does not mean that the time of printer setup tools is over. Especially in larger networks, they can help to get an overview of the available printers, to control tools like cups-browsed (or perhaps also the print dialog backends?) so that the user's print dialogs show only the relevant printers, or to create printer clusters.
        Also, the printers themselves could be configured with a printer setup tool when they support the new IPP System Service standard, an interface which allows remote administration of IPP network printers, similar to what you can do with the printer's web interface but with a standardized client GUI.
        In this session we will talk about new possibilities in printer setup tools and their implementation. Ideas are:
        Client GUI for IPP System Service - Administration of network printers
        Configuring cups-browsed - GUI for printer list filtering, printer clustering, …
        Configuring Common Print Dialog Backends
        More ideas are naturally welcome.

        Speaker: Till Kamppeter
      • 13:00
        3D Printing without the use of any slicer. 30m

        Currently, to print an STL model on a 3D printer, the model first needs to be sliced into G-code so that the printing software can understand it. In Linux we do not have any filter that can convert an STL file to G-code. We plan to first discuss the current scenario, and then what we can do to fit this into Linux.

        Speaker: Aveek Basu
    • 10:00 13:30
      Real Time MC

      Since 2004 a project has improved the real-time and low-latency features of Linux. This project has become known as PREEMPT_RT, formerly the real-time patch. Over the past decade, many parts of PREEMPT_RT became part of the official Linux code base. Examples of what came from PREEMPT_RT include: real-time mutexes, high-resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers, and more. The number of patches that need integration has been reduced from previous years, and the pieces left are now mature enough to make their way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged (tm)!

      In the final lap of this race, the last patches are on the way to be merged, but there are still some pieces missing. When the merge occurs, PREEMPT_RT will start to follow a new pace: that of Linus's tree. So, it is possible to raise the following discussions:

      The status of the merge, and how we can resolve the last issues that block it;
      How we can improve the testing of -rt, to follow the problems raised as Linus's tree advances;
      What's next?
      Proposed topics:

      Real-time Containers
      Proxy execution discussion
      Merge - what is missing and who can help?
      Rework of softirq - what is needed for the -rt merge
      An in-kernel view of Latency
      Ongoing work on RCU that impacts per-cpu threads
      How BPF can influence the PREEMPT_RT kernel latency
      Core-schedule and the RT schedulers
      Stable maintainers tools discussion & improvements.
      Improvements on full CPU isolation
      What tools can we add into tools/ that other kernel developers can use to test and learn about PREEMPT_RT?
      What tests can we add to tools/testing/selftests?
      New tools for timing regression test, e.g. locking, overheads...
      What kernel boot self-tests can be added?
      Discuss various types of failures that can happen with PREEMPT_RT that normally would not happen in the vanilla kernel, e.g., with lockdep, or the preemption model.
      The continuation of the discussion of topics from last year's microconference, including the development done during this (almost) year, are also welcome!

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC lead
      Daniel Bristot de Oliveira

    • 15:00 18:30
      Android MC

      Building on the Treble and Generic System Image work, Android is
      further pushing the boundaries of upgradability and modularization with
      a fairly ambitious goal: Generic Kernel Image (GKI). With GKI, Android
      enablement by silicon vendors would become independent of the Linux
      kernel running on a device. As such, kernels could easily be upgraded
      without requiring any rework of the initial hardware porting efforts.
      Accomplishing this requires several important changes and some of the
      major topics of this year's Android MC at LPC will cover the work
      involved. The Android MC will also cover other topics that had been the
      subject of ongoing conversations in past MCs such as: memory, graphics,
      storage and virtualization.

      Proposed topics include:

      Generic Kernel Image
      ABI Testing Tools
      Android usage of memory pressure signals in userspace low memory killer
      Testing: general issues, frameworks, devices, power, performance, etc.
      DRM/KMS for Android, adoption and upstreaming
      dmabuf heaps upstreaming
      dmabuf cache management optimizations
      kernel graphics buffer (dmabuf based)
      uid stats
      vma naming
      virtualization/virtio devices (camera/drm)
      libcamera unification
      These talks build on the continuation of the work done last year, as reported in the Android MC 2018 Progress report. Specifically:

      Symbol namespaces have gone ahead
      There is continued work on using memory pressure signals for userspace low memory killing
      Userfs checkpointing has gone ahead with an Android-specific solution
      The work continues on common graphics infrastructure
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Karim Yaghmour, Todd Kjos, Sandeep Patil, and John Stultz

      • 15:00
        Generic Kernel Image (GKI) progress 15m

        A year ago at Linux Plumbers, we talked about a generic Android kernel that boots
        and runs reasonably well on any Android device. This talk shares the progress we've made so far on many fronts: a summary of those work streams, the problems we discovered along the way, and our plans for them. We will talk about our short-term goals and long-term vision to get Android device kernels as close to mainline as possible.

        Speaker: Sandeep Patil (Google)
      • 15:15
        Monitoring and Stabilizing the In-Kernel ABI 15m

        The Kernel's API and ABI exposed to Kernel modules is deliberately
        not maintained upstream. In fact, the ability to break APIs and
        ABIs can greatly benefit development, and good reasons for this
        have been stated multiple times.
        The reality for distributions might look different though. Especially
        - but not exclusively - enterprise distributions aim to guarantee ABI
        stability for the lifetime of their released kernels while constantly
        consuming upstream patches to improve stability and security for said
        kernels. Their customers rely on both: upstream fixes and the ability
        to use the released kernels with out-of-tree modules that are compiled
        and linked against the stable ABI.

        In this talk I will give a brief overview about how this very same
        requirement applies to the Kernels that are part of the Android
        distribution. The methods presented here are reasonable measures to
        reduce the complexity of the problem by addressing issues introduced
        by ABI influencing factors like build toolchain, configurations, etc.

        While we focus on Android Kernels, the tools and mechanisms are
        generally useful for Kernel distributors that aim for a similar level
        of stability. I will talk about the tools we use (like e.g.
        libabigail), how we automate compliance checking and eventually
        enforce ABI stability.

        Speaker: Matthias Männich (Google)
      • 15:30
        Solving issues associated with modules and supplier-consumer dependencies 15m

        GKI, like any ARM64 Linux distro, needs a single ARM64 kernel that works across all SoCs, and getting there has a lot of hurdles. One of them is getting all the SoC-specific devices to be handed off cleanly from the bootloader to the kernel even when all their drivers are loaded as modules. Getting this to work correctly involves proper ordering of events like module loading, device initialization and device boot-state clean-up. This discussion is about the work that's being done in the upstream kernel to automate and facilitate the proper ordering of these events.

        Speaker: Saravana Kannan (Google)
      • 15:45
        Android Virtualization (esp. Camera, DRM) 15m

        An update on how we plan to enable multimedia testing on our 'cuttlefish' virtual platform. Overview of missing components for graphics virtualization.

        Speaker: Alistair Delva (Google)
      • 16:00
        libcamera: Unifying camera support on all Linux systems 15m

        The libcamera project was started at the end of 2018 to unify camera support on all Linux systems (regular Linux distributions, Chrome OS and Android). In 9 months it has produced an Android Camera HAL implementing the LIMITED profile for Chrome OS, and work is in progress to implement the FULL profile. Two platforms are currently supported (Intel IPU3 and Rockchip ISP), with work on additional platforms ongoing.

        First-class Android support doesn't only depend on the effort put on libcamera, but requires cooperation with the Android community and industry. In particular, libcamera has reached a point where it needs to discuss the following topics:

        • Feedback from the Android community on the overall architecture
        • Feedback from SoC vendors on the device-specific interfaces and device support in general
        • Next development steps for libcamera to support the LEVEL 3 profile
        • Contribution of libcamera to Project Treble and integration in AOSP
        • Future of the Android Camera HAL API and feedback from libcamera team

        Discussions regarding the shortcomings of the Linux kernel APIs for Android camera support, and how to address them, is also on-topic as libcamera suffers from the same issues.

        As the Linux Plumbers Conference will gather developers from the Google Android teams, from the Android community, from the Linux kernel media community and from the libcamera project, we strongly believe this is a unique occasion to design the future of camera support in Linux systems all together.

        Speaker: Laurent Pinchart (Ideas on Board Oy)
      • 16:15
        Emulated storage features (eg sdcardfs) 15m

        Update and discussion of emulated storage on Android

        Speaker: Daniel Rosenberg (Google)
      • 16:30
        Eliminating WrapFS hackery in Android with ExtFUSE (eBPF/FUSE) 15m

        This work proposes adopting the Extended FUSE (ExtFUSE) framework to improve the performance of the Android SDCard FUSE daemon, thereby eliminating the need for out-of-tree WrapFS hackery in the Android kernel.

        ExtFUSE leverages the eBPF framework for developing extensible FUSE file systems. It allows the FUSE daemon in Android to register “thin” eBPF handlers that can serve metadata as well as data I/O file system requests right in the kernel to improve performance. Our evaluation of Android SDCardFS under ExtFUSE shows about 90% improvement in app launch latency with less than a thousand lines of eBPF code in the kernel. In the presentation, I will share my findings and the progress made, to get feedback from the Android kernel developers.

        Overall, this work benefits millions of Android devices that are currently running out-of-tree WrapFS-based code in the kernel for emulating FAT functionality and enforcing custom security checks.

      • 16:45
        Break 15m
      • 17:00
        Linaro Kernel Functional Testing (LKFT): functional testing of android common kernels 15m

        As part of the Android Microconference:

        Linaro Kernel Functional Testing (LKFT) is a system to detect kernel regressions across the range of mainline, LTS and Android Common kernels. It is able to run a variety of operating systems, from Linux to Android, across an array of systems under test. You're probably thinking in terms of standard test suites like CTS, VTS, LTP, kselftest and so on, and you'd be right. We'll talk about how things have been going over the past year and some of the challenges faced when testing at scale.

        The 'F' in LKFT is for Functional, and during this interactive session we will explore how to continue to make strides beyond pass/fail tests. Kernel regressions aren't just something that once worked and now fails; they also include degradation in performance. The session will explore the recent addition to LKFT involving the Energy Aware Scheduler (EAS) with boards that have hardware power probes. Lastly, we'll talk about audio and some things we've been exploring with testing the audio stack on Android.

        Speaker: Tom Gall (Linaro)
      • 17:15
        How we're using ebpf in Android networking 15m

        A short update on eBPF in Android networking:

        • how we're using ebpf in Android P on 4.9+ for statistics collection
          and Q on 4.9+ for xlat464 offload, with a focus on the sorts of
          problems we've run into
        • where we'd like to go, i.e. future plans with regard to xlat464/forwarding/NAT
          offload and XDP.
      • 17:30
        Handling memory pressure on Android 15m

        Topic will discuss how Android framework utilizes new kernel features
        to better handle memory pressure. This includes app compaction, new
        kill strategies and improved process tracking using pidfds.

        Speaker: Suren Baghdasaryan (Google)
      • 17:45
        DMABUF Developments 15m

        To discuss recent developments and directions with DMABUF:
        DMABUF Heaps/ION destaging
        Better DMABUF ownership state machine documentation
        DMABUF cache maintenance optimizations
        Kernel graphics buffer idea

        Speakers: Sumit Semwal, John Stultz (in absentia)
      • 18:00
        DRM/KMS for Android, adoption and upstreaming 15m

        A short update on the status of DRM/KMS ecosystem adoption and how Google is improving verification of the DRM display drivers in Android devices.

        Speaker: Alistair Delva (Google)
      • 18:15
        scheduler: uclamp usage on Android 15m

        Android has been using an out-of-tree schedtune cgroup controller for
        task performance boosting of time-sensitive processes. Introduction of
        utilization clamping (uclamp) feature in the Linux kernel opens up an opportunity to adopt an upstream mechanism for achieving this goal. The talk will present our plans on adopting uclamp in Android.

        Speaker: Suren Baghdasaryan (Google)
    • 15:00 20:00
      Containers and Checkpoint/Restore MC

      The Containers and Checkpoint/Restore MC at Linux Plumbers is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.

      Last year's edition covered a range of subjects and a lot of progress has been made on all of them. There is a working prototype for an id shifting filesystem some distributions already choose to include, proper support for running Android in containers via binderfs, seccomp-based syscall interception and improved container migration through the userfaultfd patchsets.

      Last year's success has prompted us to reprise the microconference this year. Topics we would like to cover include:

      Android containers
      Agree on an upstreamable approach to shiftfs
      Securing containers by rethinking parts of ptrace access permissions, restricting or removing the ability to re-open file descriptors through procfs with higher permissions than they were originally created with, and in general making procfs more secure or restricted.
      Adoption and transition of cgroup v2 in container workloads
      Upstreaming the time namespace patchset
      Adding a new clone syscall
      Adoption and improvement of the new mount and pidfd APIs
      Improving the state of userfaultfd and its adoption in container runtimes
      Speeding up container live migration
      Address space separation for containers
      More to be added based on CfP for this microconference

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Stéphane Graber, Christian Brauner, and Mike Rapoport

      • 15:00
        Opening session 10m
        Speaker: Stéphane Graber (Canonical Ltd.)
      • 15:10
        CRIU and the PID dance 20m

        CRIU only restores processes with the same PID the processes had during checkpointing. As there is no interface to create a process with a certain PID, like a fork_with_pid(), CRIU does the PID dance to restore the process with the same PID as before checkpointing.

        The PID dance consists of open()ing /proc/sys/kernel/ns_last_pid, write()ing PID-1 to /proc/sys/kernel/ns_last_pid and close()ing it. Then CRIU does a clone() and a getpid() to see if the clone() resulted in the desired PID. If the PID does not match, CRIU aborts the restore.
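        The dance described above can be sketched in a few lines of Python. This is an illustrative sketch only: it needs CAP_SYS_ADMIN in the target PID namespace to actually run, and fork_with_pid() is the hypothetical helper the abstract mentions:

        ```python
        import os

        def ns_last_pid_value(target_pid):
            # To make the next fork()/clone() return target_pid, write
            # target_pid - 1 to /proc/sys/kernel/ns_last_pid.
            return str(target_pid - 1)

        def fork_with_pid(target_pid):
            """The racy PID dance as CRIU performs it (needs CAP_SYS_ADMIN)."""
            with open("/proc/sys/kernel/ns_last_pid", "w") as f:
                f.write(ns_last_pid_value(target_pid))
            # Race window: any other fork() in this PID namespace between
            # the write() above and our fork() steals the desired PID.
            pid = os.fork()
            if pid != 0 and pid != target_pid:
                raise RuntimeError("lost the PID race, restore must abort")
            return pid
        ```

        The clone3() extension discussed in this abstract would replace this write-then-check dance with a single race-free system call.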

        This PID dance is slow, racy and requires CAP_SYS_ADMIN.

        Fortunately, the newly introduced clone3() offers the possibility of being extended to create a process with a certain, desired PID. There are currently (July 2019) discussions about how to extend clone3() in this way. By the time LPC starts, these patches will probably have been posted already. With these patches it should be possible to solve the problems of the PID dance being slow and racy.

        Which leaves the problem of CAP_SYS_ADMIN. This is a problem for CRIU because it is the major reason why CRIU needs to run as root during restore. If the root and CAP_SYS_ADMIN requirements could somehow be relaxed, it would solve the problems of people running CRIU as non-root for container migration, as reported during last year's LPC, and it would also open up easy CRIU usage in areas like HPC, with MPI-based checkpointing and restoring running as non-root.

        In this talk we want to give some background on how and why CRIU does the PID dance, and present our clone3()-based changes that make it possible to create a process with a certain PID. Then we would like feedback from the community on whether a rootless restore is important, how the CAP_SYS_ADMIN requirement could be relaxed, and how this relaxation could be implemented.

        Speaker: Adrian Reber (Red Hat)
      • 15:30
        Address Space Isolation for Container Security 15m

        Containers are generally perceived as less secure than virtual
        machines. Without going into a theological argument about the actual
        state of affairs, we suggest exploring the possibility of using
        address space isolation inside the kernel to make containers even more
        secure.

        Assuming that kernel bugs and therefore vulnerabilities are inevitable
        it is worth isolating parts of the kernel to minimize damage that
        these vulnerabilities can cause.

        One way to create such isolation is to assign an address space to the
        Linux namespaces, so that tasks running in namespace A have different
        view of kernel memory mappings than the tasks running in namespace B.

        For instance, by keeping all the objects in a network namespace
        private, we can achieve levels of isolation equivalent to running a
        separated network stack.

        Another possible usecase is isolating address spaces for different
        user namespaces.

        Besides marrying namespaces with address spaces, we are also
        considering an implementation of isolated memory mappings using
        mmap()/madvise(), so that a region of the caller's memory would be
        hidden from the rest of the system.

        We are going to give a short update on current status of our research
        and we are going to discuss implications of the address space
        isolation and possible future directions:

        • What are the trade-offs between letting user-space control the
          isolation and keeping the control completely in-kernel?

        • What should the user-visible interface for address space management
          be? Does it need to be an on/off switch on the kernel command line,
          or do we need runtime knobs for that? Or maybe even an "address
          space namespace" or an "address space cgroup"?

        • How can we evaluate the security improvements beyond the empirical
          observation that when less code and data are mapped, fewer
          vulnerabilities are exposed?

        Speakers: Mike Rapoport, James Bottomley (IBM)
      • 15:45
        Seccomp Syscall Interception 15m

        Recently the kernel landed seccomp support for SECCOMP_RET_USER_NOTIF, which enables a process (the watchee) to retrieve an fd for its seccomp filter. This fd can then be handed to another (usually more privileged) process (the watcher). The watcher will then be able to receive seccomp messages about the syscalls performed by the watchee.

        We have integrated this feature into userspace and currently make heavy use of it to intercept mknod() syscalls in user namespaces, i.e. in containers.
        If the mknod() syscall matches a device in a pre-determined whitelist, the privileged watcher will perform the mknod() syscall in lieu of the unprivileged watchee and report back to the watchee on the success or failure of its attempt. If the syscall does not match a device in the whitelist, we simply report an error.
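        The watcher's decision step can be sketched as a pure function. The whitelist contents and the helper name are hypothetical, and the actual plumbing (receiving the notification on the seccomp fd, performing the mknod() on the watchee's behalf, returning the result) is omitted:

        ```python
        import stat

        # Hypothetical whitelist of device nodes an unprivileged container
        # may create: path -> (file type bits, major, minor).
        DEVICE_WHITELIST = {
            "/dev/null": (stat.S_IFCHR, 1, 3),
            "/dev/zero": (stat.S_IFCHR, 1, 5),
            "/dev/full": (stat.S_IFCHR, 1, 7),
        }

        def allow_mknod(path, mode, major, minor):
            """Decision the privileged watcher makes for an intercepted mknod()."""
            entry = DEVICE_WHITELIST.get(path)
            if entry is None:
                return False  # not whitelisted: report an error to the watchee
            wtype, wmajor, wminor = entry
            return (mode & stat.S_IFMT) == wtype and (major, minor) == (wmajor, wminor)
        ```

        On a True result the watcher performs the mknod() itself with its own privileges; on False it injects an error into the watchee's intercepted syscall.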

        This talk is going to show how this works, what limitations we ran into, and what future improvements we plan to make in the kernel.

        Speaker: Mr Christian Brauner
      • 16:00
        Update on Task Migration at Google Using CRIU 30m

        Over the last year we have worked on expanding task migration using CRIU at Google. The talk will discuss how in some cases the kernel interfaces are lacking for the purposes of migration:

        • Lack of support for reading rseq configuration which means that it requires userspace support to migrate users of rseq properly.
        • Lack of support for reading what cgroup events the users have registered for.
        • Many kernel C/R interfaces are protected by CAP_SYS_ADMIN which we deemed unsafe to have for the migrator agent - CAP_RESTORE could be the solution.

        We will discuss new challenges which we have encountered while developing the migration technology further:

        • The lack of clean error classification in CRIU forced us to parse the migration logs.
        • Lack of support for some less often used kernel features in CRIU (e.g. O_PATH, PR_SET_CHILD_SUBREAPER).
        • Migrating containers while also changing the IP of the container is hard but in many cases could be done with little effort on the library or user side.
        • We have finalized streaming migration support on our side and in the process we have realized that the hitless migration is infeasible for our latency sensitive users.
        Speaker: Kamil Yurtsever (Google)
      • 16:30
        Break 15m
      • 16:45
        Secure Image-less Container Migration 15m

        Container runtimes, engines and orchestrators provide a production-grade, robust, high-performing, but also relatively self-managing, self-healing infrastructure using innovative open-source technologies.

        CRIU allows the running state of containerised applications to be preserved as a collection of files that can be used to create an equivalent copy of the applications at a later time, and possibly on a different system.

        However, for a live migration mechanism to be effective it is very important to minimize the down-time of these applications without compromising security. Therefore, in this talk we discuss new features of CRIU that enable seamless live migration based on direct communication mechanism between source and destination nodes, in order to avoid the generation of intermediate image files and to keep only necessary state information cached in memory.

        Speakers: Mr Radostin Stoyanov (University of Aberdeen), Dr Martin Kollingbaum (University of Aberdeen)
      • 17:00
        Using kernel keyrings with containers 30m

        The kernel contains a keyrings facility for handling tokens for filesystems and other kernel services to use. These are frequently disabled in container environments, however, because they were never made namespace-aware along with the user namespace and other namespaces.
        Unfortunately, this lack prevents various things from working inside containers. To get around this, keys are now being tagged with a namespace tag that allows keys operating in different namespaces to coexist in the same keyring and restrictions have been placed on joining session keyrings across namespaces.

        This still isn't sufficient to make them truly useful here. Intended future developments include: granting a permit to use a key to a container; adding per-container keyrings; request-key upcall namespacing.

        Speaker: Mr David Howells (Red Hat)
      • 17:30
        Can we agree on what needs to happen to get shiftfs upstream 30m

        Since Canonical is now shipping it, I think we can all agree it solves a problem and we just need to get the patches into shape for upstream submission. Can we discuss a pathway for doing that?

        Speakers: James Bottomley (IBM), Christian Brauner, Mr Seth Forshee (Canonical)
      • 18:00
        Securing Container Runtimes with openat2 and libpathrs 30m

        Userspace has (for a long time) needed a mechanism to restrict path resolution. Obvious examples are FTP servers, web servers, archiving utilities, and now container runtimes. While the fundamental issue of privileged container runtimes opening paths within an untrusted rootfs was known about for many years, the recent CVEs to that effect (CVE-2018-15664 and CVE-2019-10152 being the most recent) have brought more light to the issue.

        This is an update on the work briefly discussed during LPC 2018, complete with redesigned patches and a new userspace library that will allow for backwards-compatibility on older kernels that don't have openat2(2) support. In addition, the patchset now has new semantics for "magic links" (nd_jump_link-style "symlinks") that will protect against several file descriptor re-opening attacks (such as CVE-2016-9962 and CVE-2019-5736) that have affected all sorts of container runtimes and other programs. It also provides the ability for userspace to further restrict the re-opening capabilities of O_PATH descriptors.

        In order to facilitate easier (safe) use of this interface, a new userspace library (libpathrs) has been developed which makes use of the new openat2(2) interfaces while also having userspace emulation of openat2(RESOLVE_IN_ROOT) for older kernels. The long-term goal is to switch the vast majority of userspace programs that deal with potentially-untrusted directory trees to use libpathrs and thus avoid all of these potential attacks.
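        As a rough illustration of the guarantee, here is a purely lexical sketch of RESOLVE_IN_ROOT-style containment. The real openat2(2) implementation (and libpathrs' userspace emulation) must walk component by component and re-check after every symlink, which this sketch deliberately omits:

        ```python
        import posixpath

        def resolve_in_root(root, unsafe_path):
            """Lexically resolve unsafe_path as if root were '/', refusing to
            escape it -- a much-simplified model of openat2(RESOLVE_IN_ROOT).
            Symlink handling, the hard part, is intentionally left out."""
            parts = []
            for comp in unsafe_path.split("/"):
                if comp in ("", "."):
                    continue
                if comp == "..":
                    if parts:
                        parts.pop()
                    # ".." at the root is clamped to the root, as with chroot(2)
                    continue
                parts.append(comp)
            return posixpath.join(root, *parts) if parts else root
        ```

        The point of the kernel flag is precisely that userspace cannot do this safely with string manipulation alone: a symlink swapped in mid-walk defeats any lexical check, which is why the resolution has to happen atomically in the kernel (or via careful O_PATH walks in the emulation).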

        The important parts of this work (and its upstream status) will be outlined and then discussion will open up on what outstanding issues might remain.

      • 18:30
        Break 10m
      • 18:40
        Using the new mount API with containers 30m

        The Linux kernel has recently acquired a new API for creating mounts. This allows a greater range of parameters and parameter values to be specified, including, in the future, container-relevant information such as the namespaces that a mount should use.

        Future developments of this API also need to work out how to deal with upcalling from the kernel to gain parameters not directly supplied, such as DNS records, automount configurations or configuration overrides, whilst preventing namespacing violations through the upcall.

        Speaker: Mr David Howells (Red Hat)
      • 19:10
        Cgroup v1/v2 Abstraction Layer 20m


        We have cgroup v1 users who want to switch to cgroup v2, but there
        currently isn't an upstream migration story for them. (Previous
        LPC talks have focused on the issues of migrating from v1 to v2, but
        no substantial upstream solution has come to fruition.)

        The goal of this talk is to discuss the cgroup v1 to v2 migration
        path and gauge community interest in a cgroup v1/v2 abstraction
        layer.

        Problem Statement

        Several Oracle products have very, very long product lifetimes and
        are designed to run on a wide range of Linux kernels and systemd
        versions. These products are encountering difficulties as cgroups
        continues to grow and change. Older kernels only support v1, but v2
        is the future in newer kernels with v1 effectively in maintenance mode.
        Newer versions of systemd have started to abstract the cgroup interface,
        but upgrading older systems to newer versions of systemd is often not
        feasible. Ultimately, long-lifespan products are spending an increasing
        and inordinate amount of time and effort managing their cgroup interfaces.

        There is interest within Oracle to create a cgroup abstraction layer
        that will allow long-lived products to utilize the most advanced
        cgroups features available on every supported system. Ideally these
        products will be able to rely upon a library to abstract away the
        low-level cgroup implementation details on that system.


        Intended Audience

        Anyone interested in cgroups

        Why Should the Audience Attend and/or Care

        • We would like to develop a cgroups abstraction layer in the next year
          or so. We would love to collaborate with others to design and build a
          solution that can help the entire community.

        • Do other people/companies have an interest in an abstraction layer? We
          want to hear other use cases and needs, to better serve as many people
          as possible.

        • Is there already something out there that we can utilize and build on?

        • Given the wide array of users and use cases, the library will likely
          need to have bindings for today's most popular languages: Python, Go,
          Java, etc.

        • There are a multitude of API possibilities. What level(s) of abstraction
          are of interest to the community? e.g.
          GiveMeCpus(cgname=foo, cpu_count=2, exclusive=True, numa_aligned=True, ...)
          CgroupCreate(cgname=foo, secure_from_sidechannel=True, ...)
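        A sketch of what the lowest layer of such a library might look like, hiding the v1/v2 filesystem layout behind one call. The function names and API shape are hypothetical (the GiveMeCpus()/CgroupCreate() calls above are the abstract's own sketches); the paths and the cgroup.controllers detection rule follow common convention:

        ```python
        import os

        def cgroup_version(mountpoint="/sys/fs/cgroup"):
            # cgroup v2 exposes cgroup.controllers at the hierarchy root;
            # a v1 mount does not have this file.
            if os.path.exists(os.path.join(mountpoint, "cgroup.controllers")):
                return 2
            return 1

        def cpuset_cpus_file(cgname, version, mountpoint="/sys/fs/cgroup"):
            """Where to write a CPU list for cgroup cgname, on either hierarchy."""
            if version == 2:
                # v2: unified hierarchy, controller files live in the cgroup dir
                return f"{mountpoint}/{cgname}/cpuset.cpus"
            # v1: each controller has its own hierarchy under the mountpoint
            return f"{mountpoint}/cpuset/{cgname}/cpuset.cpus"
        ```

        A GiveMeCpus()-style call would then resolve the right file via cgroup_version() and write the CPU list, without the caller ever knowing which hierarchy the running system uses.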

      • 19:30
        CRIU: Reworking vDSO proxification, syscall restart 20m

        We have a number of unsolved time and vdso related issues in CRIU.

        • Syscall restart: if a task was interrupted in a syscall at checkpoint time, on restore CRIU blindly starts the syscall again (executing the SYSCALL/SYSENTER/INT80/etc. instruction with the original regset). This works OK-ish, but not with time-blocking syscalls, i.e. poll(), nanosleep(), futex(), etc. For those, glibc and the vDSO use restart_syscall(), which won't work in CRIU as the kernel is not aware of the interrupted syscall. To solve this I suggest extending PTRACE_GET_SYSCALL_INFO with information from task_struct->restart_block. This way CRIU will be able to adjust syscall arguments on application restore.
        • vDSO proxification: there is a chance that the vDSO code changes between checkpoint and restore, for example due to migration to another node or updating the kernel on the very same node. The old vDSO code can't be used anymore, as the vvar physical page can be missing [migration to an older kernel] or may have different offsets. CRIU deals with that by mmap()ing the old vDSO code and patching its entries with jumps to the new vDSO. That's far from perfect: the original application could have been checkpointed while executing vDSO code, but luckily we haven't got any reports about crashes on restore so far! To address this problem, we could add a symbol table to vvar and GOT/PLT tables to the vDSO, allowing CRIU to do the linker's job on restore by patching relocations from the older vDSO to the newer vvar. The other approach would be making the proxification process more correct: we could single-step the application at checkpoint time out of the bytes that might be patched on restore (JUMP_PATCH_SIZE). An additional problem would then be signals that may have been delivered while the application was executing those very bytes; that can probably be solved by hijacking SA_RESTORER.
        Speakers: Dmitry Safonov, Andrei Vagin
      • 19:50
        Closing session 10m
        Speaker: Stéphane Graber (Canonical Ltd.)
    • 15:00 20:00
      Power Management and Thermal Control MC

      The focus of this MC will be on power-management and thermal-control frameworks, task scheduling in relation to power/energy optimizations and thermal control, platform power-management mechanisms, and thermal-control methods. The goal is to facilitate cross-framework and cross-platform discussions that can help improve power and energy-awareness and thermal control in Linux.

      Prospective topics:

      CPU idle-time management improvements
      Device power management based on platform firmware
      DVFS in Linux
      Energy-aware and thermal-aware scheduling
      Consumer-producer workloads, power distribution
      Thermal-control methods
      Thermal-control frameworks
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Rafael J. Wysocki and Eduardo Valentin

      • 15:00
        Multiple thermal zones representation 25m

        The current design of the thermal framework forces the usage of a governor per thermal zone, thus limiting the scope of the decisions.
        The question of how multiple thermal zones should be represented and handled by a governor has been put on the table several times, but without a clear consensus.
        In order to go forward in this area, this MC topic proposes a simple design with a hierarchical thermal zones representation and how they can be managed by a governor. The design keeps the compatibility with the current flat representation.

      • 15:25
        Performance guarantees under thermal pressure 25m

        Performance capping due to thermal limitations is a common scenario, particularly in mobile systems. Today user-space has no information about what level of performance can be expected in the worst case, and SCHED_DEADLINE can admit reservations which are impossible to fulfill.
        The purpose of this topic is to discuss what level of guarantees the kernel should provide. Should the kernel have a platform-specific or tunable sustained performance level?

        Speaker: Morten Rasmussen (Arm)
      • 15:50
        Task-centric thermal management 25m

        Thermally unsustainable compute demand is in most systems controlled by reducing performance through disabling performance states on specific CPUs or other devices in the system. It provides an efficient method to ensure the system doesn't overheat, however, it doesn't take the actual workload into account which could be better served if the performance caps were applied differently.
        The intention with this topic is to discuss the idea of controlling tasks, i.e. compute demand (potentially from user-space), instead of controlling devices directly.

        Speaker: Morten Rasmussen (Arm)
      • 16:15
        Improving producer-consumer type workload performance 25m

        When each CPU core can independently control its performance states, there is performance loss on some benchmarks compared to the case where there are no independent performance states. There are a couple of options for indicating to cpufreq drivers when a producer thread wakes a consumer thread: sending a hint, as we do for I/O boost, or boosting the consumer's PELT utilization. But there is a challenge in cleanly identifying a producer/consumer relationship in scheduler code: there are several ways a thread can wait and get signaled to wake in Linux,
        and they don't end up in one place in the scheduler code where this could be cleanly implemented. I experimented with the case where futexes are used between producer and consumers, and a hint is passed to the cpufreq driver to give a small boost.
        The idea here is to discuss:
        - Shall we solve this problem?
        - How to unify wait and wake up functions?
        - Is it better to give a hint or boost PELT utilization of the consumer?

        Speaker: Srinivas Pandruvada
      • 16:40
        Break 20m
      • 17:00
        Device power management based on platform firmware 25m

        Continuing the attempts to reduce fragmentation in power management on ARM platforms, there are discussions about whether something similar to ACPI can be done, i.e. device-centric power management.

        Currently, a device has power, performance, reset, and clock domains associated with it, and SCMI provides an interface to deal with these domains directly. This was the simpler approach to start the SCMI specification with, keeping the OSPM-related changes minimal. So for a given device, its power, performance, reset, clock, etc. domains need to be known, and appropriate requests must be made on those domains when needed. Since this list seems to be ever growing on ARM platforms (pinmux, GPIO, iomux, etc.), the current approach is not sustainable in the long run.

        Instead of this, there is a thought of making power management device-centric and driving it that way, so that OSPM need not care which power/perf/reset/clock domain a device belongs to. All the details are abstracted from OSPM completely.

        This talk is to discuss and understand how to drive this platform-firmware-based device power management from the Linux kernel, and which existing subsystem to reuse.

        Speaker: Mr Sudeep Holla (ARM)
      • 17:25
        Taking suspend/resume validation to the next level 25m

        At LPC 2015, we introduced analyze_suspend, a new open source tool to show where the time goes during Linux suspend/resume. Now called "sleepgraph", it has evolved in a number of ways over the last four years. Most importantly, it is now the core of a framework that we use for suspend/resume endurance testing.

        Endurance testing has allowed us to identify, track, report and sometimes fix issues that developers used to dismiss as "unreproducible".

        But to improve Linux suspend/resume quality further, we need more people testing different machines and reporting bugs. This is an appeal for ideas on how the power of the broader open source community can be harnessed to improve Linux suspend/resume quality.

      • 17:50
        C-state latency measurement infrastructure 25m

        We at Intel have developed instrumentation for measuring C-state wake-up latency. The instrumentation, which we call "waltr" (WAke up Latency TraceR), consists of user-space and kernel-module parts.

        In principle, waltr works by scheduling delayed interrupts and measuring the wake-up latency close to the x86 'mwait' instruction. This requires an external device equipped with a high-precision clock and capable of delayed interrupts. We have been using the Intel i210 Ethernet adapter for this purpose, but in theory it could be a completely different device, e.g., a graphics card.

        The C-state latency measurement instrumentation should be very useful for the open-source community, and we would like to upstream the kernel parts of it. We are seeking feedback on how to properly modify the kernel in a maintainable and reusable way, to benefit everyone.

        Here are a few examples of the dilemmas we have:
        * How do we design a framework for compliant devices like the i210 adapter?
        * What would be the right user-space API for the delayed-interrupts provider?
        * How do we take snapshots of C-state counters and deliver them to user space?

        I am asking for a 20-30 minute time slot, and I am hoping to talk to people more about this in hallway discussions.

      • 18:15
        CPU Idle Time Management Improvements 25m

        There are some improvements to be made in CPU idle time management, such as switching over to using time in nanoseconds (64-bit), reducing overhead, and some governor modifications (including possible deprecation of the menu governor), which need to be discussed.

        Speaker: Rafael Wysocki (Intel Open Source Technology Center)
      • 18:40
        Break 20m
      • 19:00
        Power Management and Thermal Control BoF Sessions 1h
    • 10:00 17:45
      Kernel Summit Track
      • 11:30
        Break 30m
      • 13:30
        Lunch 1h 30m
      • 16:30
        Break 30m
    • 10:00 17:45
      LPC Refereed Track
      • 10:00
        Finding more DRAM 45m

        The demand for DRAM across different platforms is increasing but the cost is not decreasing, making DRAM a major factor in the total cost of all kinds of devices: mobile, desktop, and servers. In this talk we will present the work we are doing at Google, applicable to Android, Chrome OS, and data center servers, on extracting more memory out of running applications without impacting performance.

        The key is to proactively reclaim idle memory from running applications. For Android and Chrome OS, a user-space controller can provide hints about idle memory at the application level, while for servers running multiple workloads, an idle-memory-tracking mechanism is needed. With such hints, the kernel can proactively reclaim memory, provided the estimated refault cost is not high. Using in-memory compression or second-tier memory, the refault cost can be reduced drastically.

        We have developed and deployed proactive reclaim and idle memory tracking across Google data centers [1]. Defining idle memory as memory not accessed in the last 2 minutes, we found 32% idle memory across data centers and were able to reclaim 30% of it without impacting performance. This results in 3x cheaper memory for our data centers. 98% of applications spend only around 0.1% of their CPU on memory compression and decompression, and idle memory tracking on average takes less than 11% of a single logical CPU.

        The cost of proactive reclaim and idle memory tracking is reasonable relative to the data centers' cost of memory ownership; however, it poses challenges for power-constrained devices based on Android and Chrome OS. These devices run diverse applications, e.g. Chrome OS can run Android and Linux in a VM. To that end, we are working on making idle memory tracking and proactive reclaim feasible for such devices. We are also interested in initiating a discussion on making proactive reclaim useful for other use cases as well.

        [1] Software-Defined Far Memory in Warehouse-Scale Computers, ACM ASPLOS 2019.

        Speakers: Shakeel Butt (Google), Suren Baghdasaryan (Google), Yu Zhao (Google)
      • 10:45
        Linux Gen-Z Sub-system 45m

        Discuss design choices for a Gen-Z kernel sub-system and the challenges of supporting the Gen-Z interconnect in Linux.

        Gen-Z is a fabric interconnect that connects a broad range of devices from CPUs, memory, I/O, and switches to other computers and all of their devices. It scales from two components in an enclosure to an exascale mesh. The Gen-Z consortium has over 70 member companies, and the first version of the specification was published in 2018. History with new interconnects suggests we will see actual hardware products two years after the first specification, i.e. in 2020. We propose to add support for a Gen-Z kernel sub-system, a Gen-Z component device driver environment, and user space management applications.

        A Gen-Z sub-system needs support for these Gen-Z features:

        • Registration and enumeration services that are similar to existing
          sub-systems like PCI.
        • The Gen-Z Memory Management Unit (ZMMU) provides memory mapping and access to fabric addresses. The Gen-Z sub-system can provide services to track PTEs for the two types of ZMMUs in the specification: page-grid and page-table based.
        • Region Keys (R-Keys) - Each ZMMU page can have R-Keys used to validate page access authorization. The Gen-Z sub-system needs to provide APIs for tracking, freeing, and validating R-Keys.
        • Process Address Space Identifier (PASID) - ZMMU requester and responder Page Table Entries (PTEs) contain a PASID. The Gen-Z sub-system needs to provide APIs for tracking PASIDs.
        • Data mover - Transmit and receive data movers are optional elements in bridges and other Gen-Z components. The Gen-Z sub-system can provide a user space interface to an RDMA driver that uses a Gen-Z data mover. For example, a libfabric Gen-Z provider implementation can use an RDMA driver to access data mover queues.
        • UUIDs - Components are identified by UUIDs. The Gen-Z sub-system provides interfaces for tracking UUIDs of local and remote components. A Gen-Z driver binds to a UUID similarly to how a PCI driver binds to a vendor/device id.
        • Interrupt handling - Interrupt request packets in Gen-Z trigger local interrupts. Local components such as bridges and data movers can also be sources of interrupts.
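        By analogy with PCI id tables, the UUID-based driver binding mentioned above could look roughly like the following sketch; all type and function names here are hypothetical, invented for illustration, and nothing in it is an existing kernel API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of UUID-based driver matching, by analogy with
 * pci_device_id tables; none of these names exist in the kernel. */
struct genz_device_id {
        uint8_t uuid[16];
};

struct genz_driver {
        const char *name;
        const struct genz_device_id *id_table;
        size_t id_count;
};

/* Return 1 if the driver's id table contains the component's UUID. */
int genz_driver_matches(const struct genz_driver *drv, const uint8_t uuid[16])
{
        for (size_t i = 0; i < drv->id_count; i++)
                if (memcmp(drv->id_table[i].uuid, uuid, 16) == 0)
                        return 1;
        return 0;
}
```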

        We will discuss our proposed design for the Gen-Z sub-system illustrated in the following block diagram:

        Gen-Z Sub-system Block Diagram

        Gen-Z fabric management is global to the fabric. The operating system may not know what components on the fabric are assigned to it; the fabric manager decides which components belong to the operating system. Although user space discovery/management is unusual for Linux, it will allow the Gen-Z sub-system to focus on the mechanism of component management rather than the policy choices a fabric manager must make.

        To support user space discovery/management, the Gen-Z sub-system needs interfaces for management services:

        • Fabric managers need read/write access to component control space in order to do fabric discovery and configuration. We propose using /sys files for each control structure and table.
        • User space Gen-Z managers need notification of management events/interrupts from the Gen-Z fabric. We propose using poll on the bridges' device files to communicate events.
        • Local management services pass fabric discovery events from user space to the kernel. Our proposed design uses generic Netlink messages for communication of these component add/remove/modify events.

        We are leveraging our experience with writing Linux bridge drivers for three different Gen-Z hardware bridges in the design of the Gen-Z Linux sub-system. Most recently, we wrote the DOE Exa-scale PathForward project's bridge driver with data movers. We wrote drivers for the Gen-Z Consortium's demonstration card that supports a block device and a NIC, as well as a driver for the bridge in HPE's "The Machine", a precursor to Gen-Z.

        From our work so far, here are questions we would like feedback on:

        • We intend to expose control space in /sys so that user space fabric managers can work. We ask for feedback on the proposed hierarchy and mechanisms.
        • Gen-Z uses PASIDs and the sub-system could use generic PASID
          interfaces. Any interest in this elsewhere in the kernel?
        • We have need of generic IOMMU interfaces since Gen-Z ZMMU needs to interface with the IOMMU in a platform independent way. Any interest in this elsewhere in the kernel? We saw some patch sets along these lines.
        • We intend to use generic NetLink for communication between user space and the kernel. Any thoughts on that decision?
        • Gen-Z maps huge address spaces from remote components, and to get good performance those mappings need huge pages. Currently, the kernel does not support this use case. We would like to discuss how best to handle these huge mappings.
        • We wrote a parser for the Gen-Z specification's control structure that generates C structures with bitfields. In general, we know the Linux kernel frowns on bitfields. Are bitfields ok in this context?
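        To make the bitfield question concrete, here is a hypothetical example of the style of structure such a parser might generate (field names and widths invented, not taken from the Gen-Z specification), next to the mask-and-shift style the kernel usually prefers because C bitfield layout is implementation-defined.

```c
#include <stdint.h>

/* Hypothetical generated control-structure word; the real structures
 * and field names come from the Gen-Z specification.  Bitfield layout
 * is implementation-defined in C, which is one reason the kernel
 * usually prefers explicit masks and shifts. */
struct genz_ctl_word {
        uint32_t version : 4;
        uint32_t type    : 8;
        uint32_t size    : 12;
        uint32_t rsvd    : 8;
};

/* Equivalent mask/shift accessors, the style the kernel normally favors:
 * the layout is fixed by the code, not by the compiler's ABI. */
#define CTL_VERSION(w)  ((w) & 0xf)
#define CTL_TYPE(w)     (((w) >> 4) & 0xff)
#define CTL_SIZE(w)     (((w) >> 12) & 0xfff)
```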
        Speakers: Jim Hull (Hewlett Packard Enterprise), Betty Dall (HPE), Keith Packard (Hewlett Packard Enterprise)
      • 11:30
        Break 30m
      • 12:00
        Efficient Userspace Optimistic Spinning Locks 45m

        The most commonly used simple locking functions provided by the pthread library are pthread_mutex and pthread_rwlock. They are sleeping locks, and so suffer from unpredictable wakeup latencies that limit locking throughput.

        Userspace spinning locks can potentially offer better locking throughput, but they suffer from other drawbacks, such as lock-holder preemption, which wastes valuable CPU time on the spinning CPUs. Another spinning-lock problem is contention on the lock cacheline when a large number of CPUs are spinning on it.

        This talk presents a hybrid spinning/sleeping lock where a lock waiter can choose to spin in userspace or in the kernel waiting for the lock holder to release the lock. While spinning in the kernel, the lock waiters will queue up so that only the one at the queue head is spinning on the lock, reducing lock cacheline contention. If the lock holder is not running, the kernel lock waiters will go to sleep too, so as not to waste valuable CPU cycles. The state of kernel lock spinners is reflected in the value of the lock, so userspace spinners can monitor the lock state and determine the best way forward.

        This new type of hybrid spinning/sleeping lock combines the best attributes of sleeping and spinning locks. It is especially useful for applications that need to run on large NUMA systems, where a potentially large number of CPUs may be pounding on a given lock.
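        The spin-then-sleep idea can be sketched in userspace with a classic futex-based lock extended with a bounded spin phase. This is a simplified model, not the proposed implementation: the kernel-side waiter queuing and the lock-holder-running check described in the abstract are omitted, and the spin limit is arbitrary.

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

#define SPIN_LIMIT 1000   /* arbitrary; a real policy would be adaptive */

static long lock_futex(atomic_int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Lock states: 0 = unlocked, 1 = locked, 2 = locked with possible sleepers */
void hybrid_lock(atomic_int *lock)
{
        int expected;

        for (int i = 0; i < SPIN_LIMIT; i++) {
                expected = 0;
                if (atomic_compare_exchange_weak(lock, &expected, 1))
                        return;                 /* fast path: acquired while spinning */
        }
        /* Slow path: mark the lock contended and sleep in the kernel. */
        while (atomic_exchange(lock, 2) != 0)
                lock_futex(lock, FUTEX_WAIT, 2);
}

void hybrid_unlock(atomic_int *lock)
{
        if (atomic_exchange(lock, 0) == 2)      /* someone may be sleeping */
                lock_futex(lock, FUTEX_WAKE, 1);
}
```

        The encoded "contended" state (2) is the simplest form of the idea that the lock value itself tells userspace spinners what the kernel-side waiters are doing.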

        Speaker: Mr Waiman Long (Red Hat)
      • 12:45
        Malloc for everyone and beyond NUMA 45m

        With heterogeneous computing, a program's data (ranges of virtual addresses) has to move between different physical memories during the lifetime of an application to keep it local to the compute unit using it (CPU, GPU, FPGA, ...). NUMA has been the model used so far, but it makes assumptions that do not hold for all the memory types we now have. This presentation will explore the various types of memory and how we can expose and use them through a unified API.

        Speaker: Jerome Glisse (Red Hat)
      • 13:30
        Lunch 1h 30m
      • 15:00
        Writing A Kernel Driver in Rust 45m

        In recent years, Rust has become a serious candidate for various
        projects. Given its strong typing and memory model, it lends itself
        to software that would usually have been written in C.
        Linux kernel drivers have traditionally been written in C as well.
        In contrast to the core kernel, they are usually less strictly
        reviewed and may have been written by people who do not necessarily
        have the required expertise to interface with the kernel.
        While Rust may not be the best choice for the core kernel, it may
        provide a useful alternative for kernel drivers.
        In this talk I will present my efforts to port to Rust a small
        filesystem I wrote and upstreamed last year. This is very much a
        work in progress, so failure is very much an option.

        Speaker: Mr Christian Brauner
      • 16:30
        Break 30m
    • 10:00 17:45
      Networking Summit Track
      • 10:00
        Scaling container policy management with kernel features 45m

        Cilium is an open source project which implements the Container Network
        Interface (CNI) to provide networking and security functions in modern
        application environments. The primary focus of the Cilium community recently
        has been on scaling these functions to support thousands of nodes and hundreds
        of thousands of containers. Such environments impose a high rate of churn as
        containers and nodes appear and leave the cluster. For each change, the
        networking plugin needs to handle the incoming events and ensure that policy is
        in sync with network configuration state. This creates a strong incentive to
        efficiently interpret and map down cluster events into the required Linux
        networking configuration to minimize the window during which there are
        discrepancies between the desired and realized state in the cluster---something
        that is made possible through eBPF and other kernel features.

        Cilium realizes these policy and container events through the use of many
        aspects of the networking stack, from rules to routes, tc to socket hooks,
        skb->mark to skb->cb. Modelling the changes to datapath state involves a
        non-trivial amount of work in the userspace daemon to structure the desired
        state from external entities and allow incremental adjustments to be made,
        keeping the amount of work required to handle an event proportional to its
        impact on the kernel configuration. Some aspects of datapath configuration such
        as the implementation of L7 policy have gone through multiple iterations, which
        provides a window for us to explore the past, present and future of transparent proxying.

        This talk will discuss the container policy model used by Cilium to apply
        whitelist filtering of requests at layers 3, 4 and 7; memoization techniques
        used to cache intermediate policy computation artifacts; and impacts on
        dataplane design and kernel features when considering large container based
        deployments with high rates of change in cluster state.

      • 10:45
        Traffic footprint characterization of workloads using BPF 45m

        Application workloads are becoming increasingly diverse in terms of their network resource requirements and performance characteristics. As opposed to long-running monoliths deployed in virtual machines, containerized workloads can be as short-lived as a few seconds. Today, container orchestrators that schedule these workloads primarily consider their CPU and memory resource requirements, since those can easily be quantified. However, network resource characterization isn't as straightforward. Ineffective scheduling of containerized workloads, which can be throughput intensive or latency sensitive, can lead to adverse network performance. Hence, I propose characterizing and learning the network footprints of applications running in a cluster, which can then be used when scheduling them in containers/VMs so that their network performance improves.

        There is a well-known network issue: achieving low latency for mice flows (those that send relatively small amounts of data) by separating them from elephant flows (those that send a lot of data). I've written an eBPF program in C that runs at various hook points in the Linux connection-tracking (aka conntrack) kernel functions in order to detect elephant flows and attribute them to the container or VM where the flows ingress or egress. The agent that loads this eBPF program from user space runs on every host in a cluster. It then feeds the learnt information to a container (or VM) scheduling system so that it can be used proactively when scheduling workloads with a light network footprint (e.g., microservices, functions) and a heavy network footprint (e.g., data analytics, data computational applications) on the same cluster, in order to improve their latency and throughput, respectively.
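        As a rough model of the accounting such a detector performs, the sketch below classifies a flow as an elephant once its byte count crosses a threshold. This is a simplified userspace model, not the actual eBPF program: the flow key, table layout, and the 1 MB threshold are all invented for illustration, and the real program would use BPF maps in the conntrack hooks.

```c
#include <stdint.h>

#define MAX_FLOWS          64
#define ELEPHANT_THRESHOLD (1024 * 1024)   /* bytes; invented for illustration */

struct flow_stat {
        uint64_t key;       /* stand-in for the real 5-tuple flow key */
        uint64_t bytes;     /* bytes accounted to this flow so far */
};

static struct flow_stat flows[MAX_FLOWS];

/* Account len bytes to a flow; return 1 if the flow is now an elephant. */
int flow_account(uint64_t key, uint64_t len)
{
        for (int i = 0; i < MAX_FLOWS; i++) {
                if (flows[i].key == key || flows[i].key == 0) {
                        flows[i].key = key;
                        flows[i].bytes += len;
                        return flows[i].bytes >= ELEPHANT_THRESHOLD;
                }
        }
        return 0;   /* table full: drop the sample */
}
```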

        eBPF facilitates running the programs with minimal CPU overhead, in a pluggable, tunable and safe manner, and without having to change any kernel code. It’s also worthwhile to discuss how the workload’s learnt network footprint can be used for dynamically allocating or tuning Linux network resources like bandwidth, vcpu/vhost-net allocation, receive-side scaling (RSS) queue mappings, etc.
        I'll submit a paper with the (working) source code snippets and details if the talk is accepted.

      • 11:30
        Break 30m
      • 12:00
        XDP: the Distro View 45m

        It goes without saying that XDP is wanted more and more by everyone, and of course the Linux distributions want to bring users what they want and need. Even better if it can be delivered in a polished package with as few surprises as possible: bug reports stemming from users' misunderstandings and wrong expectations do not make for a good experience, either for the users or for the distro developers.

        XDP presents interesting challenges to distros: from the initial enablement (what config options to choose) and security considerations, through user supportability (packets "mysteriously" disappearing, tcpdump not seeing everything) and future extension (what happens after XDP is embraced by different tools, some of them part of the distro; how should that interact with users' XDP programs?), to more high-level questions such as user perception ("how come my super-important use case cannot be implemented using XDP?").

        Some of those challenges are long solved, some are in progress or have good workarounds, and some are still unsolved. Some are solely the distro's responsibility, while others need to be addressed upstream. The talk will present the challenges of enabling XDP in a distro. While it will also mention the solved ones, its main focus is on the problems currently unsolved or in progress. We'll present some ideas and welcome discussion about possible solutions using the current infrastructure and about future directions.

      • 12:45
        An Evaluation of Host Bandwidth Manager 45m

        Host Bandwidth Manager (HBM) is a BPF based framework for managing per-cgroupv2 egress and ingress bandwidths in order to provide a better experience to workloads/services coexisting within a host. In particular, HBM allows us to divide a host's egress and ingress bandwidth among workloads residing in different v2 cgroups. Note that although sample BPF programs are included in the BPF patches, one can easily use different algorithms for managing bandwidth.

        This talk presents an evaluation of HBM and associated BPF programs. It explores the performance of various approaches to bandwidth management for TCP flows that use Cubic, Cubic with ECN or DCTCP for their congestion control. For evaluating performance, we consider how well flows can utilize the allocated bandwidth, how many packets are dropped by HBM, increases to RTTs due to queueing, RPC size fairness, as well as RPC latencies. This evaluation is done independently for egress and ingress. In addition, we explore the use of HBM for protecting against incast congestion by also using HBM on the root v2 cgroup.

        Our testing shows that HBM, with the appropriate BPF program, is very effective at managing egress bandwidths regardless of which TCP congestion control algorithm is used, preventing flows from exceeding the allocated bandwidth while allowing them to use most of their allocation. Not surprisingly, effectively managing ingress bandwidth requires ECN, and preferably DCTCP. Finally, we show that using HBM is very effective at preventing packet losses due to incast congestion, as long as we are willing to sacrifice some ingress bandwidth.

      • 13:30
        Lunch 1h 30m
      • 15:00
        Improving Route Scalability with Nexthop Objects 45m

        Route entries in a FIB tend to be very redundant with respect to nexthop configuration with many routes using the same gateway, device and potentially encapsulations such as MPLS. The legacy API for inserting routes into the kernel requires the nexthop data to be included with each route specification leading to duplicate processing verifying the nexthop data, an effect that is magnified as the number of paths in the route increases (e.g., ECMP).

        A new API was recently committed to the kernel for managing nexthops as separate objects from routes. The nexthop API allows nexthops to be created first and then routes can be added referencing the nexthop object. This API allows routes to be managed with less overhead (e.g., dramatically reducing the time to insert routes) and enables new capabilities such as atomically updating a nexthop configuration without touching the route entries using it.

        This talk will discuss the nexthop feature touching on the kernel side implementation, reviewing the userspace API and what to expect for notifications, performance improvements and potential follow on features. While the nexthop API is motivated by Linux as a NOS, it is useful for other networking deployments as well such as routing on the host and XDP.

      • 15:45
        Life at a Networking Vendor -- Keeping up with the Joneses 45m

        Working for a networking hardware vendor can be an extremely rewarding experience for a kernel developer. The rate at which new features are accepted in the kernel also provides lots of motivation to develop new features that showcase hardware capabilities. This could be done by adding new support for dataplane offloads via cls flower, netfilter, or switchdev (if we still think it exists!). In-driver support for pre-SKB packet processing via XDP and AF_XDP also provide a chance for developers to search for new software optimizations in their driver receive and transmit path.

        In addition to thinking about what is happening upstream, developers at hardware vendors regularly find themselves managing internal and external expectations from those responsible for developing features that are not always exclusive to the Linux kernel. This could range from frameworks like DPDK and VPP that run on Linux or completely different OSes/stacks to functionality that is available without software interaction.

        There is no quicker way to develop new features and resolve issues than to have direct contact with hardware and firmware developers. The goal of this talk will be to share some experiences balancing the expectations of customers and partners along with those of the community.

      • 16:30
        Break 30m
      • 17:00
        Ethernet Cable Diagnostic using Netlink Ethtool API 45m

        Many Ethernet PHYs contain hardware to perform diagnostics of the
        Ethernet cable. Breaks in the cable and shorts within a twisted pair
        or to other pairs can be detected, and an estimate can be made of the
        distance along the cable to the fault. The talk will explain, at a high
        level, how such diagnostics work, sending pulses down the cables and
        looking for reflections. There is no standardization on such
        diagnostics, and what information the PHY reports varies between
        vendors. The ongoing work to allow ethtool to make use of a netlink
        socket makes the ethtool API much more flexible. This flexibility has
        been used to provide a generic API to request that a PHY perform
        diagnostic tests and to report the results. Some aspects of this API
        will be discussed, using the Marvell PHYs as examples. The talk aims
        to spread knowledge on this work and encourage driver writers to
        implement diagnostics for other PHYs.

    • 10:00 13:30
      RDMA MC

      Following the success of the past 3 years at LPC, we would like to see a 4th RDMA (Remote Direct Memory Access networking) microconference this year. The meetings at the last conferences have resulted in significant improvements to the RDMA subsystem being merged over the years: a new user API, container support, testability/syzkaller, system bootup, Soft iWARP, etc.

      In Vancouver, the RDMA track hosted some core kernel discussions on get_user_pages, which is now starting to see its solution merged. We expect that RDMA will again be the natural microconf for holding these quasi-mm discussions at LPC.

      This year there remain difficult open issues that need resolution:

      RDMA and PCI peer to peer for GPU and NVMe applications, including HMM and DMABUF topics
      RDMA and DAX (carry over from LSF/MM)
      Final pieces to complete the container work
      Contiguous system memory allocations for userspace (unresolved from 2017)
      Shared protection domains and memory registrations
      NVMe offload
      Integration of HMM and ODP
      And several new developing areas of interest:

      Multi-vendor virtualized 'virtio' RDMA
      Non-standard driver features and their impact on the design of the subsystem
      Encrypted RDMA traffic
      Rework and simplification of the driver API
      Previous years:
      2018, 2017: 2nd RDMA mini-summit summary, and 2016: 1st RDMA mini-summit summary

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Leon Romanovsky, Jason Gunthorpe

      • 10:00
        HMM 30m
        Speaker: John Hubbard (NVIDIA)
      • 10:30
        GUP for P2P 30m
      • 11:00
        RDMA, File Systems, and DAX 30m

        For almost 2 years now, the use of RDMA with DAX filesystems has been disabled due to incompatibilities between RDMA and file-system page handling.

        A general consensus has emerged from many conferences and email threads on a path to support RDMA directly to persistent memory which is managed by a filesystem.

        This talk will present the work done since LSFmm to support RDMA and FS DAX.

        Specifically, this work requires exclusive layout lease grants to obtain pins, fails truncate operations on file pages which have been pinned, and supports recovery by allowing admins to identify the offending processes holding these pins.

        Speaker: Mr Ira Weiny
      • 11:30
        Break 30m
      • 12:00
        Discussion about IBNBD/IBTRS Upstreaming: Action Items. 30m

        We are going through the 5th iteration of upstreaming IBNBD/IBTRS; the latest effort is here:

        We would like to discuss in an open round the unique features of the driver and the library, whether and how they benefit the RDMA ecosystem, and what the next steps should be to get them upstream.

        A face to face discussion about action items will smooth the path.

        Speakers: Mr Jinpu Wang (1 & 1 IONOS Cloud GmbH), Mr Danil Kipnis (1 & 1 IONOS Cloud GmbH)
      • 12:30
        Shared IB Objects 30m

        Consider the case of a server with a huge amount of memory that thousands of processes are using to serve client requests.

        In such a case, the HCA will have to manage thousands of MRs, which will compete for caches and address-translation entries.

        The way to improve performance is to allow sharing of IB objects between processes: one process creates several MRs and shares them.

        This will dramatically reduce the number of address-translation entries and cache misses.

        This talk will cover the implementation of a Shared Object mechanism.

        Speaker: Yuval Shaia (Oracle)
      • 13:00
        Improving RDMA performance through the use of contiguous memory and larger pages for files. 30m

        As memory sizes grow, so do the sizes of the data transferred between RDMA devices. Generally, the operating system needs to keep track of the state of each piece of its memory, and on Intel x86 that piece is a 4 KB page. This is also tied to hardware memory-management features such as the processor page tables and the MMU features of the RDMA NIC.

        The overhead of the operating system increases as the number of these pages reaches ever higher orders of magnitude: for 4 GB of data one needs a million of these page descriptors. Each page descriptor is a 64-byte cache line, and thus a 4 GB operation requires 64 MB of cache lines to be managed.
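        The arithmetic above can be checked directly; the only assumption carried over from the abstract is that one 64-byte descriptor tracks each page.

```c
#include <stdint.h>

/* Number of page descriptors needed to track a region of the given size. */
uint64_t page_descriptors(uint64_t bytes, uint64_t page_size)
{
        return bytes / page_size;
}

/* Bytes of descriptor cache lines touched, at 64 bytes per descriptor. */
uint64_t descriptor_bytes(uint64_t bytes, uint64_t page_size)
{
        return page_descriptors(bytes, page_size) * 64;
}
```

        For 4 GB of data in 4 KB pages this gives 2^20 (about a million) descriptors and 64 MB of cache lines, matching the figures in the abstract.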

        A lot of I/O optimization effort focuses on avoiding touching these page descriptors through the use of larger contiguous memory or larger page sizes. This talk gives an overview of the current methods used to avoid these slowdowns, and of the work in progress to improve the situation and make it less of an effort to avoid these issues.

        Speaker: Christopher Lameter (Jump Trading LLC)
    • 10:00 13:30
      RISC-V MC

      The Linux Plumbers 2019 RISC-V MC will continue the trend established in 2018 [2] to address different relevant problems in RISC-V Linux land.

      The overall progress in the RISC-V software ecosystem since last year has been really impressive. To continue similar growth, the RISC-V track at Plumbers will focus on finding solutions and discussing ideas that require kernel changes. This should also result in a significant increase in active developer participation in code review and patch submission, which will lead to a better and more stable kernel for RISC-V.

      Expected topics
      RISC-V Platform Specification Progress, including some extensions such as power management - Palmer Dabbelt
      Fixing the Linux boot process in RISC-V (RISC-V now has better support for open source boot loaders like U-Boot and coreboot compared to last year. As a result of this developers can use the same boot loaders to boot Linux on RISC-V as they do in other architectures, but there's more work to be done) - Atish Patra
      RISC-V hypervisor emulation [5] - Alistair Francis
      RISC-V hypervisor implementation - Anup Patel
      NOMMU Linux for RISC-V - Damien Le Moal
      More to be added based on CfP for this microconference

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Atish Patra or Palmer Dabbelt

    • 10:00 13:30
      Tracing MC

      Linux Plumbers 2019 is pleased to welcome the Tracing microconference again this year. Tracing is once again picking up in activity, and new and exciting topics are emerging.

      There is a broad range of ways to perform tracing in Linux: from the original mainline Linux tracer, ftrace, to profiling tools like perf, more complex customized tracing like BPF, and out-of-tree tracers like LTTng, SystemTap, and DTrace. Come and join us and not only learn but help direct the future progress of tracing inside the Linux kernel and beyond!

      Expected topics
      bpf tracing – Anything to do with BPF and tracing combined
      libtrace – Making libraries from our tools
      Packaging – Packaging these libraries
      babeltrace – Anything that we need to do to get all tracers talking to each other
      Those pesky tracepoints – How to get what we want from places where trace events are taboo
      Changing tracepoints – Without breaking userspace
      Function tracing – Modification of current implementation
      Rewriting of the Function Graph tracer – Can kretprobes and function graph tracer merge as one
      Histogram and synthetic tracepoints – Making a better interface that is more intuitive to use
      More to be added based on CfP for this microconference

      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC lead
      Steven Rostedt

      • 10:00
        drgn: Programmable Debugging 22m

        drgn is a programmable debugger that makes it easy to introspect and debug state in the kernel. With drgn, it's possible to explore and analyze data structures with the full power of Python; see the LWN coverage of the drgn presentation at LSF/MM. This presentation will demonstrate the capabilities of drgn, discuss future plans, and explore ways that the kernel and surrounding ecosystem can make introspection easier and more powerful.

        Speaker: Omar Sandoval
      • 10:22
        Kernel Boot Time Tracing 22m

        Tracing the kernel boot is useful when chasing bugs in device and machine initialization, boot performance issues, etc. Ftrace already supports enabling basic tracing features from the kernel cmdline. However, since the cmdline is very limited and too simple, it is hard to enable the complex features introduced recently, e.g. multiple kprobe events, trigger actions, and event histograms.
        To overcome this limitation, I introduce a boot-time tracing feature based on a new structured kernel cmdline, which allows us to describe complex tracing features in a tree'd key-value style text file.
        In this talk, I would like to discuss how this solves boot-time tracing, and the syntax the tracing subsystem uses for this structured kernel cmdline.
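
        To give a flavor of the tree'd key-value style mentioned above, here is a hypothetical fragment (the exact keys and syntax were still under discussion at the time of this talk, so the event names and options below are illustrative assumptions only):

        ```
        # Hypothetical boot-time tracing fragment: define a kprobe event and
        # enable a filtered trace event, in tree'd key-value form.
        ftrace.event.kprobes.vfs_read_probe {
                probes = "vfs_read $arg1 $arg2"
                enable
        }
        ftrace.event.sched.sched_switch {
                filter = "prev_pid < 128"
                enable
        }
        ```

        The point of the structure is that nested keys and quoted values can express multi-event, multi-option setups that a flat cmdline cannot.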

        Speaker: Masami Hiramatsu (Linaro Ltd.)
      • 10:44
        Sharing PMU counters across compatible perf events 22m

        Hardware PMU counters are limited resources. When there are more perf events than available hardware counters, time multiplexing becomes necessary, and the perf events cannot run 100% of the time.

        On the other hand, different perf events may measure the same metric, e.g., instructions. We call these perf events "compatible perf events". Technically, one hardware counter could serve multiple compatible events at the same time. However, current perf implementation doesn't allow compatible events to share hardware counters.

        There have been efforts to enable sharing among compatible perf events. Unfortunately, not much progress has been made on this front.

        At Facebook we are investing in user-space sharing of compatible performance counters to reduce the need for time multiplexing and the cost of context switching when monitoring the same events across several threads and cgroups. A kernel solution would be preferable.

        In the Tracing MC, we would like to discuss how we can enable PMU-counter sharing among compatible perf events. This topic may open other discussions in the perf subsystem. We think this would be a fun session.
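
        As background for the discussion: when events are multiplexed, perf exposes how long each event was enabled versus actually scheduled on a counter, and tools extrapolate the count from that ratio. A minimal sketch of this standard scaling (the function name is illustrative, not a kernel API):

        ```python
        def scale_count(raw_count, time_enabled, time_running):
            """Extrapolate a multiplexed counter the way perf tools do:
            final = raw * time_enabled / time_running."""
            if time_running == 0:
                return 0  # event never got scheduled on a counter
            return int(raw_count * time_enabled / time_running)

        # An event that counted 1e6 instructions while scheduled for only half
        # of its enabled time is reported as ~2e6 -- an estimate, not a
        # measurement. Counter sharing would remove this imprecision.
        print(scale_count(1_000_000, time_enabled=200_000, time_running=100_000))
        ```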

        Speakers: Song Liu, David Carrillo Cisneros (Facebook)
      • 11:06
        A trace-cmd front end interface to ftrace histogram, triggers and synthetic events. 22m

        Ftrace histograms, based on triggers and synthetic events, were implemented a few years ago by Tom Zanussi. They are a very powerful instrument for analyzing kernel internals using ftrace events, but their user interface is complex and hard to use. This proposal is to discuss possible ways to define an easier-to-use, more intuitive interface to this feature using the trace-cmd application.

        Speaker: Tzvetomir Stoyanov
      • 11:30
        Coffee and Tea 30m
      • 12:00
        Unifying trace processing ecosystems with Babeltrace 22m

        Babeltrace started out as the reference implementation of a Common
        Trace Format (CTF) reader. As the project evolved, many
        trace manipulation use-cases (merging, trimming, filtering,
        conversion, analysis, etc.) emerged and were implemented either
        as part of the Babeltrace project, on top of its APIs or through
        custom tools.

        Today, as more tracers have emerged, each using its own trace
        format, the tracing ecosystem has become fragmented, making tools
        exclusive to certain tracers. The newest version of Babeltrace
        aims at bridging the gap between the various tracing ecosystems
        by making it easy to implement trace processing tools over an
        agnostic trace IR.

        The discussion will aim at identifying the work needed to accommodate
        the various tracers and their associated tooling (scripts, graphical
        viewers, etc.) over the next releases.

        Speaker: Jérémie Galarneau (EfficiOS/LTTng/Babeltrace)
      • 12:22
        libtrace - making libraries of our tracing tools 22m

        I would like to discuss how to implement a series of libraries for all the tracing tools that are out there, and have a repository that at least points to them: from libftrace, libperf and libdtrace to liblttng and libbabeltrace.

        Speaker: Steven Rostedt
      • 12:44
        bpftrace 22m

        bpftrace is a high-level tracing language running on top of BPF.

        We'll talk about important updates from the past year, including improved tracing providers and new language features, and we'll also discuss future plans for the project.

        Speaker: Mr Alastair Robertson (Yellowbrick)
      • 13:06
        BPF Tracing Tools: New Observability for Performance Analysis 22m

        Many new BPF tracing tools are about to be published, deepening our view of kernel internals on production systems. This session will summarize what has been done and what will be next with BPF tracing, discussing the challenges with taking kernel and application analysis further, and the potential kernel changes needed.

        Speaker: Brendan Gregg (Netflix)
    • 15:00 18:40
      BPF MC

      A BPF Microconference will be featured at this year's Linux Plumbers Conference (LPC) in Lisbon, Portugal.

      The goal of the BPF Microconference is to bring BPF developers together to discuss and hash out unresolved issues and to move new ideas forward. The focus of this year's event is on the core BPF infrastructure as well as its many subsystems and related user space tooling.

      The BPF Microconference will be open to all LPC attendees. There is no additional registration required. This is also a great occasion for BPF users and developers to meet face to face and to exchange and discuss developments.

      Similar to last year's BPF Microconference the main focus will be on discussion rather than pure presentation style.

      Therefore, each accepted topic will provide introductory slides with subsequent discussion as the main part for the rest of the allocated time slot. The expected time for one discussion slot is approximately 20 min.

      The MC is led by both BPF kernel maintainers:

      Alexei Starovoitov and Daniel Borkmann

      • 15:00
        Bringing BPF developer experience to the next level 23m

        The way BPF application developers build applications is constantly improving. Still, there are rough corners, as well as (as of yet) fundamentally inconvenient developer workflows (e.g., on-the-fly compilation). The ultimate goal of BPF application development is to provide an experience as straightforward and simple as that of a typical user-land application.

        We'll discuss major pain points with BPF developer experience today and present motivation for solving them. Libbpf and BTF type info integration are at the center of the puzzle that's being put together to provide a powerful and yet less error-prone solution:
        - BPF CO-RE and how it addresses adapting to an ever-changing kernel and facilitates safe and efficient kernel introspection;
        - consistent and safer APIs to load/attach/work with BPF programs;
        - declarative and more powerful ways to define and initialize BPF maps;
        - providing and standardizing BPF-side helper library for all BPF code needs.

        Speaker: Andrii Nakryiko (Facebook)
      • 15:23
        BPF Debugging 22m

        Debugging BPF program logic is hard these days.
        Developers typically write their programs and
        then check whether map values or perf_event
        outputs make sense or not. For tricky issues,
        temporary maps or bpf_trace_printk are used so
        the developer can get more insight into what
        happens. But this possibly requires multiple
        rounds of modifying sources, recompilation,
        redeployment, etc.

        This discussion is about creating a BPF debugging
        tool, bdb (BPF debugger), named similarly to gdb/lldb.
        This tool should try to do for BPF programs what gdb
        does for ELF executables:
        - specify breakpoints at the source/xlated/jitted level
        - retrieve data for registers, stacks and globals (maps),
        presented at both the register and variable level
        - retrieve data under different conditions, e.g.,
        after running 100 times, or only if a given variable == 1;
        this will require the kernel to live-patch BPF code
        - modify data (registers, stack slots, globals)?
        how does this interact with the verifier to ensure safety?
        - leverage BTF and the existing test_run framework
        - production debugging vs. qemu debugging;
        qemu debugging may allow true single-stepping

        Speaker: Yonghong Song
      • 15:45
        A pure Go BPF library 22m

        At the LSF/MM eBPF track, we discussed the necessity of a common Go
        library to interact with BPF. Since then, Cilium and Cloudflare have
        worked out a proposal to upstream parts of their respective libraries
        into a new common library.

        Our goal is to create a native Go library instead of a CGO wrapper
        of C libbpf. This provides superior performance, debuggability and
        ease of deployment. The focus will be on supporting long-running
        daemons interacting with the kernel, such as Cilium or Cloudflare's
        L4 load balancer.

        We’d like to present this proposal to the wider BPF community and
        solicit feedback. We’ll cover the goals and guiding principles we’ve
        set ourselves and our initial roadmap.

        Speakers: Joe Stringer (Isovalent / Cilium), Lorenz Bauer (Cloudflare), Martynas Pumputis
      • 16:07
        Do we need CAP_BPF_ADMIN? 23m

        Currently, most BPF functionality requires CAP_SYS_ADMIN or CAP_NET_ADMIN. However, in many cases, CAP_SYS_ADMIN/CAP_NET_ADMIN gives the user more than enough permissions. For example, tracing users need to load BPF programs and access BPF maps, so they need CAP_SYS_ADMIN. However, they don't need to modify the system, so CAP_SYS_ADMIN adds significant risk.

        To better control BPF functionality, it is time to think about CAP_BPF_ADMIN (or even multiple CAP_BPF_*s). In this BPF MC, we would like to discuss whether we need CAP_BPF_ADMIN, and what CAP_BPF_ADMIN would look like. We will present a survey of major BPF use cases and identify those that may benefit from a new CAP. Then, we will discuss which syscalls/commands should be gated by the new CAP. We expect constructive discussions between the BPF folks and security folks.

        Speaker: Song Liu
      • 16:30
        Coffee and Tea Break 30m
      • 17:00
        Reuse host JIT back-end as offload back-end 20m

        eBPF offload is a powerful feature on modern SmartNICs used to accelerate
        XDP or TC based BPF. The current kernel eBPF offload infrastructure was
        introduced for the Netronome NFP based SmartNICs, which are built around
        a proprietary ISA and have some specific verifier requirements.

        In the near future these may be joined by SmartNICs using public ISAs such
        as RISC-V and Arm, which also happen to be used as host CPUs. This talk will
        discuss the implications of reusing these ISAs and other back-end features
        for offload to a sea of cores, as well as how much of a host CPU back-end
        can be reused and what additional infrastructure may be needed. As an
        example we will use the ongoing work on a many-core RISC-V processor.

        Speaker: Mr Jiong Wang (Netronome Systems)
      • 17:20
        Using SCEV to establish pre and post-conditions over BPF code 20m

        Currently, the BPF verifier has to "execute" code at least once, and only then can it prune branches when it detects that the state is the same. In this session we would like to cover a technique called Scalar Evolution (SCEV), which is used by LLVM and GCC to perform optimization passes such as identifying and promoting induction variables and doing worst-case trip-count analysis of loops. At its most basic, SCEV finds the start value of a variable, the variable's stride, and its ending value over a block of code. Building a SCEV pass into the BPF verifier would allow us to create a set of pre- and post-conditions over blocks of BPF code.
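
        As an illustration of that "most basic usage": once a SCEV-style pass has recovered a variable's (start, stride, end) triple, the loop's worst-case trip count follows directly, without executing the loop. A toy sketch (not the verifier's code):

        ```python
        def trip_count(start, end, stride):
            """Worst-case iterations of `for (i = start; i < end; i += stride)`,
            given the (start, stride, end) triple a SCEV pass would recover."""
            if stride <= 0 or start >= end:
                return 0
            # Ceiling division without floating point.
            return (end - start + stride - 1) // stride

        print(trip_count(0, 100, 1))   # 100
        print(trip_count(10, 100, 8))  # 12
        ```

        A verifier armed with such bounds could check a loop's pre-conditions once instead of simulating every iteration.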

        We see this as potentially useful for avoiding "executing" loops in the verifier, instead allowing the verifier to check pre-conditions before entering the loop, and additionally for establishing pre- and post-conditions on function calls to avoid having to run the verifier over functions repeatedly. We suspect this will likely be necessary to support shared libraries, for example.

        The goal of the session is to give a brief introduction to SCEV, provide a demonstration of some early prototype work that can build pre- and post-conditions over blocks of BPF code, and then discuss next steps for possible inclusion.

        Speaker: Mr John Fastabend (Isovalent)
      • 17:40
        Beyond per-CPU atomics and rseq syscall: subset of eBPF bytecode for the do_on_cpu syscall 20m

        The Restartable Sequences system call [1,2,3,4] introduced in Linux 4.18 has limitations which can be solved by introducing a bytecode interpreter running in inter-processor interrupt context which accesses user-space data.

        This discussion is about the subset of the eBPF bytecode and context needed by this interpreter, and extensions of that bytecode to cover load-acquire and store-conditional memory accesses, as well as memory-barrier instructions. The fact that the interpreter needs to allow loading data from userspace (tainted data), which can then be used as the address for loads and stores, as well as the source register of conditional branches, will also be discussed.

        [1] "PerCpu Atomics"
        [2] "Restartable sequences"
        [3] "Restartable sequences restarted"
        [4] "Restartable sequences and ops vectors"

        Speaker: Mathieu Desnoyers (EfficiOS Inc.)
      • 18:00
        Kernel Runtime Security Instrumentation (KRSI) 20m

        Existing Linux Security Modules can only be extended by modifying and rebuilding the kernel, making it difficult to react to new threats. The Kernel Runtime Security Instrumentation (KRSI) project, for which prototype code exists, aims to help by providing an LSM that allows eBPF programs to be attached to security hooks.

        The talk discusses the need for such an LSM (with representative use cases) and compares it to some existing alternatives, such as Landlock, a separate custom LSM, kprobes+eBPF etc. The second half of the talk outlines the proposed design and interfaces, and includes a live demo.

        KRSI is an LSM that:

        • Allows the attachment of eBPF programs to security hooks.
        • Provides a good ecosystem of safe eBPF helper functions specifically written with security and auditing features in mind.

        This enables the development of a new class of userspace security products that:

        • Reduce the overhead of building and updating the kernel/LSM when a new security vulnerability is discovered.
        • Allow the system owners to choose the format in which the data is audit-logged.
        • Provide flexibility w.r.t. the granularity of auditing needed, and add new auditing without needing to rebuild or update the LSM/kernel (in contrast to the existing audit framework).

        The intended audience for this talk would be:

        • Security-focused kernel engineers
        • Engineers building user-space security products on Linux.
        • Security Engineers and Admins who care about the time required to deploy security software to detect and prevent a new class of malicious activity.
        Speaker: Mr KP Singh
      • 18:20
        Map batch processing 20m

        The bcc community has long discussed that batched
        dump, lookup and delete would help its typical
        use case: periodically retrieving and deleting
        all samples in the kernel. Without batch APIs,
        bcc typically does the following:
        iterate through all keys (the get_next_key API)
        get the (key, value) pairs
        iterate through all keys again to delete them
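
        The three-pass pattern above can be sketched as follows; a plain dict stands in for a BPF map, and get_next_key mimics BPF_MAP_GET_NEXT_KEY semantics. This is an illustration of the user-space pattern only, not libbpf code:

        ```python
        def get_next_key(bpf_map, key):
            """Mimic BPF_MAP_GET_NEXT_KEY: None -> first key, else the key after."""
            keys = sorted(bpf_map)
            if key is None:
                return keys[0] if keys else None
            idx = keys.index(key) + 1
            return keys[idx] if idx < len(keys) else None

        def drain(bpf_map):
            """The non-batched bcc pattern: walk keys, read values, then delete."""
            samples, key = [], None
            while (key := get_next_key(bpf_map, key)) is not None:
                samples.append((key, bpf_map[key]))   # one lookup "syscall" each
            for key, _ in samples:
                del bpf_map[key]                      # one delete "syscall" each
            return samples

        m = {1: "a", 2: "b", 3: "c"}
        print(drain(m))   # [(1, 'a'), (2, 'b'), (3, 'c')]
        print(m)          # {} -- drained
        ```

        Every element costs several syscalls here; batch subcommands would collapse each pass into one call per chunk of entries.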

        Also, Brian Vazquez has proposed a BPF_MAP_DUMP
        command to dump more than one entry per syscall.

        This discussion will propose new bpf subcommands
        for map batch processing and discuss their pros
        and cons.

        The subject has been actively discussed on the mailing list.
        If that discussion reaches maturity, we may not need to revisit
        it at the conference.

        Speaker: Yonghong Song
    • 15:00 18:30
      Live Patching MC

      The main purpose of the Linux Plumbers 2019 Live Patching microconference is to involve all stakeholders in open discussion about the remaining issues that need to be solved in order to make live patching of the Linux kernel, and live patching of Linux userspace, feature-complete.

      The intention is to mainly focus on the features that have been proposed (some even with a preliminary implementation), but not yet finished, with the ultimate goal of sorting out the remaining issues.

      This proposal follows up on the history of past LPC live patching microconferences that have been very useful and pushed the development forward a lot.

      Currently proposed discussion/presentation topics (we've not yet gone through the internal selection process) with tentatively confirmed attendance:

      5 min Intro - What happened in kernel live patching over the last year
      API for state changes made by callbacks [1][2]
      source-based livepatch creation tooling [3][4]
      klp-convert [5][6]
      livepatch developers guide
      userspace live patching
      If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

      MC leads
      Jiri Kosina and Josh Poimboeuf

    • 15:00 18:30
      System Boot and Security MC
    • 18:45 19:45
      Closing Plenary 1h
    • 20:00 23:00
      Closing Party 3h