The Linux Plumbers Conference is the premier event for developers working at all levels of the plumbing layer and beyond. LPC 2018 will be held November 13-15 in Vancouver, BC, Canada. We are looking forward to seeing you there!
The Containers micro-conference at LPC is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.
Interactive applications, which includes everything from real time
games through flight simulators and virtual reality environments,
place strong real-time requirements on the whole computing environment
to ensure that the correct data are presented to the user at the
correct time. This requires two things; the first is that the time
when the information will be displayed be known to the application so
that the correct contents can be computed, and second that the frame
actually be displayed at that time.
These two pieces of information are managed inconsistently through the
graphics stack, making it difficult for applications to provide a
smooth animation experience to users. And because of the many APIs
which lie between the application rendering using OpenGL or Vulkan and
the underlying hardware, a failure to correctly handle things at
any point along the chain will result in stuttering results.
Fixing this requires changes throughout the system, from making the
kernel provide better control and information about the queuing and
presentation of images through changes in composited window systems to
ensure that images are displayed at the target time, and that
the actual time of presentation is reported back to applications and
finally additions to rendering APIs like Vulkan to expose control
over image presentation times and feedback about when images ended up
being visible to the user.
This presentation will first demonstrate the effects of poor display
timing support inherent in the current graphics stack, talk about the
different solutions required at each level of the system and finally
show the working system.
Software that uses a 32-bit integer to represent seconds since the Unix epoch of Jan 1 1970 is affected by that variable overflowing on Jan 19 2038, often in a catastrophic way. Aside from most 32-bit binaries that use timestamps, this includes file systems (e.g. ext3 or xfs), file formats (e.g. cpio, utmp, core dumps), network protocols (e.g. nfs) and even hardware (e.g. real-time clocks or SCSI adapters).
Work has been going on to avoid that overflow in the Linux kernel, with hundreds of patches reworking drivers, file systems and the user space interfaces including over 50 affected system calls.
With much of this activity getting done during 2018, it's time to give an update on what has been achieved in the kernel, what parts still remain to be solved, and how we will proceed to solve this in user space, and how to use the work in long-living product deployments.
It is common to see Linux being used on real-time research projects. However, the assumptions made in papers are very often unrealistic. In contrast, researchers argue that the main metric used on PREEMPT RT, although useful, is an oversimplification of the problem.
It is a consensus that the academic research helps to improve Linux’s state-of-art, and vice-versa. So how can we reduce the gap between these task forces? The real-time researchers start papers with a clear definition of the task model. But we do not have a task model for Linux: this is where the gap is.
This talk presents effort on establishing the task model for the PREEMPT RT Linux. Starting with the description of the operations that influence the timing behavior of tasks, passing by the definition of the relationships of the operations. Finally, the outcomes for Linux, like new metrics for the PREEMPT RT and a model validator (a lockdep like verificator, but for preemption) for the kernel, are discussed.
The SCHED_DEADLINE scheduling policy is all but done. Even though it existed in mainline for several years, many features are yet to be implemented; some are already available as immature code, some others only exist as wishes.
In this talk Juri Lelli and Daniel Bristot De Oliveira will give the audience in-depth details of what’s missing, what’s under development and what might be desirable to have. The intent is to provide as much information as possible to people attending, so that a fruitful discussion might be held later on during hallway and micro conference sessions.
Examples of what is going to be presented are:
A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David S. Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”.
Official Networking Track website: http://vger.kernel.org/lpc-networking.html
Openning welcome, announcements, etc.
XDP already offers rich facilities for high performance packet
processing, and has seen deployment in several production systems.
However, this does not mean that XDP is a finished system; on the
contrary, improvements are being added in every release of Linux, and
rough edges are constantly being filed down. The purpose of this talk is
to discuss some of these possibilities for future improvements,
including how to address some of the known limitations of the system. We
are especially interested in soliciting feedback and ideas from the
community on the best way forward.
The issues we are planning to discuss include, but are not limited to:
User experience and debugging tools: How do we make it easier for
people who are not familiar with the kernel or XDP to get to grips
with the system and be productive when writing XDP programs?
Driver support: How do we get to full support for XDP in all drivers?
Is this even a goal we should be striving for?
Performance: At high packet rates, every micro-optimisation counts.
Things like inlining function calls in drivers are important, but also
batching to amortise fixed costs such as DMA mapping. What are the
known bottlenecks, and how do we address them?
QoS and rate transitions: How should we do QoS in XDP? In particular,
rate transitions (where a faster link feeds into a slower) are
currently hard to deal with from XDP, and would benefit from, e.g.,
Active Queue Management (AQM). Can we adapt some of the AQM and QoS
facilities in the regular networking stack to work with XDP? Or should
we do something different?
Accelerating other parts of the stack: Tom Herbert started the
discussion on accelerating transport protocols with XDP back in 2016.
How do we make progress on this? Or should we be doing something
different? Are there other areas where we can extend XDPs processing
model to provide useful accelerations?
XDP is a framework for running BPF programs in the NIC driver to allow
decisions about the fate of a received packet at the earliest point in
the Linux networking stack. For the most part the BPF programs rely on
maps to drive packet decisions, maps that are managed for example by a
userspace agent. This architecture has implications on how the system is
configured, monitored and debugged.
An alternative approach is to make the kernel networking tables
accessible by BPF programs. This approach allows the use of standard
Linux APIs and tools to manage networking configuration and state while
still achieving the higher performance provided by XDP. An example of
providing access to kernel tables is the recently added helper to allow
IPv4 and IPv6 FIB (and nexthop) lookups in XDP programs. Routing suites
such as FRR manage the FIB tables, and the XDP packet path benefits by
automatically adapting to the FIB updates in real time. While a huge
first step, a FIB lookup alone is not sufficient for general networking
This talk discusses the advantages of making kernel tables available to
XDP programs to create a programmable packet pipeline, what features
have been implemented as of October 2018, key missing features, and
Over the past several years, BPF has steadily become more powerful in multiple
ways: Through building more intelligence into the verifier which allows more
complex programs to be loaded, and through extension of the API such as by
adding new map types and new native BPF function calls. While BPF has its roots
in applying filters at the socket layer, the ability to introspect the sockets
relating to traffic being filtered has been limited.
To build such awareness into a BPF helper, the verifier needs the ability to
track the safety of the calls, including appropriate reference counting upon
the underlying socket. This talk walks through extensions to the verifier to
perform tracking of references in a BPF program. This allows BPF developers to
extend the UAPI with functions that allocate and release resources within the
execution lifetime of a BPF program, and the verifier will validate that the
resources are released exactly once prior to program completion.
Using this new reference tracking ability in the verifier, we add socket lookup
and release function calls to the BPF API, allowing BPF programs to safely find
a socket and build logic upon the presence or attributes of a socket. This can
be used to load-balance traffic based on the presence of a listening
application, or to implement stateful firewalling primitives to understand
whether traffic for this connection has been seen before. With this new
functionality, BPF programs can integrate more closely with the networking
stack's understanding of the traffic transiting the kernel.
In this talk we describe our experiences in evaluating DC-TCP. Preliminary testing with Netesto uncovered issues with our NIC that affected fairness between flows, as well as bugs in the DC-TCP code path in Linux that resulted in RPC tail latencies of up to 200ms. Once we fixed those issues, we proceeded to test in a 6 rack mini cluster running some of our production applications. This testing demonstrated very large decreases in packet discards (12 to 1000x) at a cost of larger CPU utilization. In addition to describing the issues and fixes, we provide detailed experimental results and explore the causes of the larger CPU utilization as well as discuss partial solutions to this issue.
Note: We plan to test on a much larger cluster and have those results available before the conference.
Linux bridge is deployed on Hosts, Hypervisors, Container OS's and in most recent years on data center switches. It is complete in its feature set with forwarding, learning, proxy and snooping functions. It can bridge Layer-2 domains between VM's, Containers, Racks, POD's and between data centers as seen with Ethernet-Virtual Private networks [1, 2]. With Linux bridge deployments moving up the rack, it is now bridging Larger Layer-2 domains bringing in scale challenges. The bridge forwarding database can scale to thousands of entries on a data center switch with hardware acceleration support.
In this paper we discuss performance and operational challenges with large scale bridge fdb database and solutions to address them. We will discuss solutions like fdb dst port failover for faster convergence, faster API for fdb updates from control plane and reducing number of fdb dst ports with Light weight tunnel endpoints for bridging over a tunneling solution (eg vxlan).
Most solutions though discussed around the below deployment scenarios are generic and can be applied to all bridge use-cases:
The eXpress Data Path (XDP) is a new kernel-feature, intended to provide
fast packet processing as close as possible to device hardware. XDP
builds on top of the extended Berkely Packet Filter (eBPF) and allows
users to write a C-like packet processing program, which can be attached
to the device driver’s receiving queue. When the device observes an
incoming packet, the user-defined XDP program is triggered to execute on
the packet payload, making the decision as early as possible before
handing the packet down the processing pipeline.
P4 is a domain-specific language describing how packets are processed by
the data plane of a programmable network elements, including network
interface cards, appliances, and virtual switches. It provides an
abstraction that allows programmers to express existing and future
protocol format without coupling it to any data plane specific
knowledge. The language is explicitly designed to be protocol-agnostic.
A P4 programmer can write their own protocols and load the P4 program
into P4-capable network elements.
As high-level networking language, P4 supports a diverse set of compiler
backends and also possesses the capability to express eBPF and XDP programs.
We present P4C-XDP, a new backend for the P4 compiler. P4C-XDP leverages
XDP to aim for a high performance software data plane. The backend
generates a eBPF-compliant C representation from a given P4 program
which is passed to clang and llvm to produce the bytecode. Using
conventional eBPF kernel hooks the program can then be loaded into the
eBPF virtual machine in the device driver. The kernel verifier
guarantees the safety of the generated code. Any packets
received/transmitted from/to this device driver now trigger the
execution of the loaded P4 program.
The P4C-XDP project is an open source project hosted at
https://github.com/vmware/p4c-xdp/. We provide prove-of-concept sample
code under the tests directory, which contains a couple of examples such
as basic protocol parsing, checksum recalculation, multiple tables
lookups, and tunnel protocol en-/decapsulation.
Port mirroring is one of the most common network troubleshooting
techniques. SPAN (Switch Port Analyzer) allows a user to send a copy
of the monitored traffic to a local or remote device using a sniffer
or packet analyzer. RSPAN is similar, but sends and received traffic
on a VLAN. ERSPAN extends the port mirroring capability from Layer 2
to Layer 3, allowing the mirrored traffic to be encapsulated in an
extension of the GRE (Generic Routing Encapsulation) protocol and sent
through an IP network. In addition, ERSPAN carries configurable
metadatas (e.g., session ID, timestamps), so that the packet analyzer
has better understanding of the packets.
ERSPAN for IPv4 was added into Linux kernel in 4.14, and for IPv6 in
4.16. The implementation includes both transmission and reception and
is based on the existing ip_gre and ip6_gre kernel module. As a
result, Linux today can act as an ERSPAN traffic source sending the
ERSPAN mirrored traffic to the remote host, or an ERSPAN destination
which receives and parses the ERSPAN packets generated from Cisco or
other ERSPAN-capable switches.
We’ve added both the native tunnel support and metadata-mode tunnel
support. In this paper, we demonstrate three ways to use the ERSPAN
protocol. First, for Linux users, using iproute2 to create native
tunnel net device. Traffic sent to the net device will be
encapsulated with the protocol header accordingly and traffic matching
the protocol configuration will be received from the net device.
Second, for eBPF users, using iproute2 to create metadata-mode ERSPAN
tunnel. With eBPF TC hook and eBPF tunnel helper functions, users can
read/write ERSPAN protocol’s fields in finer granularity. Finally,
for Open vSwitch users, using the netlink interface to create a switch
and programmatically parse, lookup, and forward the ERSPAN packets
based on flows installed from the userspace.
AF_XDP is a new socket type for raw frames to be introduced in 4.18
(in linux-next at the time of writing). The current code base offers
throughput numbers north of 20 Mpps per application core for 64-byte
packets on our system, however there are a lot of optimizations that
could be performed in order to increase this even further. The focus
of this paper is the performance optimizations we need to make in
AF_XDP to get it to perform as fast as DPDK.
We present optimization that fall into two broad categories: ones that
are seamless to the application and ones that requires additions to
the uapi. In the first category we examine the following:
Loosen the requirement for having an XDP program. If the user does
not need an XDP program and there is only one AF_XDP socket bound to
a particular queue, we do not need an XDP program. This should cut
out quite a number of cycles from the RX path.
Wire up busy poll from user space. If the application writer is
using epoll() and friends, this has the potential benefit of
removing the coherency communication between the RX (NAPI) core and
the application core as everything is now done on a single
core. Should improve performance for a number of use cases. Maybe it
is worth revisiting the old idea of threaded NAPI in this context
Optimize for high instruction cache usage through batching as has
been explored in for example Cisco's VPP stack and Edward Cree in
his net-next RFC "Handle multiple received packets at each stage".
In the uapi extensions category we examine the following
Support a new mode for NICs with in-order TX completions. In this
mode, the completion queue would not be used. Instead the
application would simply look at the pointer in the TX queue to see
if a packet has been completed. In this mode, we do not need any
backpreassure between the completion queue and the TX queue and we
do not need to populate or publish anything in the completion queue
as it is not used. Should improve the performance of TX for in-order
Introduce the "type-writer" model where each chunk can contain
multiple packets. This is the model that e.g., Chelsio has in its
NICs. But experiments show that this mode also can provide better
performance for regular NICs as there are fewer transactions on the
queues. Requires a new flag to be introduced in the options field of
With these optimization, we believe we can reach our goal of close to
40 Mpps of throughput for 64-byte packets in zero-copy mode. Full
analysis with performance numbers will be presented in the final
iptables have been the typical tool to create firewall for linux hosts. We have used them at Facebook for setting up host firewalls on our servers across a variety of tiers. In this proposal, we introduce a eBPF / XDP based firewall solution which we use for packet filtering and has parity to our iptables implementation. We discuss various aspects of this. Following is a brief summary of these aspects, which we will detail further in the paper / presentation.
Design and Implementation:
Performance benefits and comparisons with iptables
Ease of policy / config updates and maintenance
BPF Program array
Proposal for a completely generic firewall solution to migrate existing iptables rules to eBPF / XDP based filtering
This talk is a continuation of the initial XDP HW-based hints work presented at NetDev 2.1 in Seoul, South Korea.
It will start with focus on showcasing new prototypes to allow an XDP program to request required HW-generated metadata hints from a NIC. The talk will show how the hints are generated by the NIC and what are the performance characteristics for various XDP applications. We also want to demonstrate how such a metadata can be helpful for applications that use AF_XDP sockets.
The talk with then discuss planned upstreaming thoughts, and look to generate more discussion around implementation details, programming flows, etc., with the larger audience from the community.
SCTP is a transport protocol, like TCP and UDP, originating from SIGTRAN
IETF Working Group in the early 2000's with the initial objective of
supporting the transport of PSTN signalling over IP networks. It featured
multi-homing and multi-stream from the beginning, and since then there
have been a number of improvements that help it serve other purposes too,
such as support for Partial Reliability and Stream Scheduling.
Linux SCTP arrived late and was stuck. It wasn't as up to date as the
released RFCs, while it was also far behind other systems such as BSD,
and also suffered from performance problems. In the past 2 years, we
were dedicated to ensuring that these features were addressed and
focused on making many improvements. Now all the features from released
RFCs have been fully supported in Linux, and some from draft RFCs are
already ongoing. Besides, we've seen an obvious improvement in performance
in various scenarios.
In this talk we will first do a quick review on SCTP basics, including:
Then go through the improvements that were made in the past 2 years,
We will finish by reviewing a list of what is on our radar as well as next
For its powerfulness and complexity, SCTP is destined to face many challenges
and threats, but we believe that we have already and will continue to make it
better than that on other systems, but also than other transport protocols.
Please join us, Linux SCTP needs your help too!
The Linux Plumbers 2018 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel.
Our objective is to gather leading developers of the kernel and it’s related testing infrastructure and utilities in an attempt to advance the state of the various utilities in use (and possibly unify some of them), and the overall testing infrastructure of the kernel. We are hopeful that we could build on the experience of the participants of this MC to create solid plans for the upcoming year.
Plans for participants and talks will be similar to last year's (https://blog.linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/641).
This proposal is to gather hackers interested in improving the thermal subsystem in the Linux Kernel and its interaction with hardware and userspace based policies. Nowadays, given the nature of the workloads and the wide spectrum of devices that Linux is used into, the peers interested in improving the thermal subsystem come from different backgrounds and with use cases of diverse thermal constrained systems, ranging from embedded devices to systems with high computing power. Despite the heterogeneity of software solutions to control thermals, the thermal subsystem is still core of many of them, including policies that rely on hardware configured thresholds, or interactions with firmware based control loops, or even policies that rely on userspace daemons. Therefore, this micro conference aims on gathering the thermal interested developers of the community to discuss further improvements of the Linux thermal subsystem.
The Containers micro-conference at LPC is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.
Historically, kernels that ran on Android devices have typically been 2+ years old compared to mainline (this year's flagship devices are shipping with 4.9 kernels) and because of the challenges associated with updating, most devices in the field are far behind the latest long-term stable (LTS) releases. The Android team has been gradually putting in place the necessary processes and enhancements to permanently bridge this gap. Much of the work on the Android kernel in 2018 focused on improving the ability to update the kernel -- at least to recent LTS levels. This work comprises a significant testing effort to ensure downstream partners that updating to new LTS levels is safe, as well as process work to convince partners that the security benefits of taking LTS patches far outweigh the risk of new bugs. The testing also focuses on ABI consistency (within LTS releases) for interfaces relied upon by userspace and kernel modules. This has resulted in enhancements to the LTP suite and a new proposal to the mailing list for "kernel namespaces".
Additionally, the Android kernel testing benefits from additional tools developed by Google that are enabled via the Clang compiler. Google's devices have been shipping kernels built via Clang for 2 years. The Android team tests and assists in maintaining arm and arm64 kernel builds with clang.
The talk will also cover some of the key features being developed for Android and introduce topics that will be discussed during the Android Micro-Conference.
Heterogeneous computing use massively parallel devices, such as GPU, to crunch through huge data-set. This talks intends to present the issues, challenges and problems related to memory management and heterogeneous computing. Issues and problems from one address space per device which makes exchanging or sharing data-set between devices and CPUs hard, complex and error prone.
Solutions involve a unified address space between devices and CPU often call SVM (Share Virtual Memory) or SVA (Share Virtual Address). In those unified address space a virtual address valid on CPUs is also valid on the devices. Talk will address both hardware and software solutions to this problem. Moreover it will consider ways to preserve the ability to use the device memory in those scheme.
Ultimately this talks is an opportunity to discuss memory placement, like for NUMA architecture, in a world where we not only have to worry about CPU but also about devices like GPU and their associated memory.
If it were not enough, we now also have to worry about memory hierarchy for each CPU or device. Memory hierarchy going from fast High Bandwidth Memory (HBM) to main memory (DDR DIMM) which can be order of magnitude slower, and finally to persistent memory which is large in size but slower and with higher latency.
It is well known that developers do not like writing documentation. But although documenting the code may seem dull and unrewarding, it has definite value for the writer.
When you write the documentation you gain an insight into the algorithms, design (or lack of such), and implementation details. Sometimes you see neat code and say "Hey, that's genius!". But sometimes you discover small bugs or chunks of code that beg for refactoring. In any case, your understanding of the system significantly improves.
I'd like to share the experience I had with Linux memory management documentation, what was it's state a few months ago, what have been done and where are we now.
The work on the memory management documentation is in progress and the question "Where do we want to be?" is definitely a topic for discussion and debate.
The first rule of kernel maintenance is that there are no hard and fast rules. While there are several documents and guidelines on patch contribution, advice on how to serve in a maintainer role has historically been tribal knowledge. This organically grown state of affairs is both a source strength and a source of friction. It has served the community well to be adaptable to the different personalities and technical problem spaces that inhabit the kernel community. However, that variability also leads to inconsistent experiences for contributors across subsystems, insufficient guidance for new maintainers, and excess stress on current maintainers. As the Linux kernel project expects to continue its rate of growth it needs to be able both scale the maintainers it has and ramp new ones without necessarily requiring them to make a decade's worth of mistakes to become proficient.
The presentation makes the case for why a maintainer handbook is needed, including frequently committed mistakes and commonly encountered pain points. It broaches the "whys" and "hows" of contributors having significantly different experiences with the Linux Kernel project depending on what subsystem they are contributing. The talk is an opportunity to solicit maintainers in the audience on the guidelines they would reference on an ongoing basis, and it is an opportunity for contributors to voice wish list items when working with upstream. Finally, it will be a call to action to expand the document with subsystem-local rules of the road where those local rules differ, or go beyond the common recommendations.
Since 2004 a project has been going on trying to make the Linux kernel into a true hard Real-Time operating system. This project has become know as PREEMPT_RT (formally the "real-time patch"). Over the past decade, there was a running joke that this year PREEMPT_RT would be merged into the mainline kernel, but that has never happened. In actuality, it has been merged in pieces. Examples of what came from PREEMPT_RT include: mutexes, high resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers and more. The only thing left is turning spin_locks into mutexes, and that is now mature enough to make its way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged!
Getting PREEMPT_RT into the kernel was a battle, but it is not the end of the war. Once PREEMPT_RT is in mainline, there's still much more work to be done. The RT developers have been so focused on getting RT into mainline, that little has been thought about what to do when it is finally merged. There is a lot to discuss about what to do after RT is in mainline. The game is not over yet.
Binding and Devicetree Source/DTB Validation: update and next steps
Binding specification format
Validation Process and Process
How to validate overlays
Devicetree Specification: update and next steps
Reducing devicetree memory and storage size
Bootloader and Linux kernel implementation update
Remaining blockers and issues
Devicetree compiler (dtc)
Next version of DTB/FDT format
Motivated by desire to replace metadata being encoded as normal data (metadata for overlays)
Other desired changes should be considered
Boot and Run-time Configuration
Pain points and needs
Feedback from the trenches
how DTOs are used in embedded devices in practice
in U-Boot and Linux
in systems with FPGAs
Use of devicetrees in small code/data space (e.g. U-Boot SPL)
Connector node bindings
Containers (or Operating System based Virtualization) are an old
technology; however, the current excitement (and consequent
investment) around containers provides interesting avenues for
research on updating the way we build and manage container technology.
The most active area of research today, thanks to concerns raised by
groups supporting other types of virtualization, is in improving the
security properties of containers.
The first step in improving security is actually being able to measure
it in the first place, so the initial goal of a research programme for
container security involves finding that measure. In this talk I'll
outline one such measure (attack profiles) developed by IBM research,
the useful results that can be derived from it, the problems it has
and the avenues that can be explored to refine future measurements of
Contrary to popular belief, a "container" doesn't describe one fixed
thing, but instead is a collective noun for a group of isolation and
resource control primitives (in Linux terminology called namespaces
and cgroups) the composition of which can be independently varied. In
the second half of this talk, we'll explore how containment can be
improved by replacing some of the isolation primitives with local
system call emulation sandboxes, a promising technique used by both
the Google gVisor and the IBM Nabla secure container systems. We'll
also explore the question of whether sandboxes are the end point of
container security research or merely point the way to the next
Frontier for container abstraction.
Using graphics cards for compute acceleration has been a major shift in technology lately, especially around AI/ML and HPC.
Until now the clear market leader has been the CUDA stack from NVIDIA, which is a closed source solution that runs on Linux. Open source applications like tensorflow (AI/ML) rely on this closed stack to utilise GPUs for acceleration.
Vendor aligned stacks such as AMD's ROCm and Intel's OpenCL NEO are emerging that try to fill the gap for their specific hardware platforms. These stacks are very large, and don't share much if any code. There are also efforts inside groups like Khronos with their OpenCL, SPIR-V and SYCL standards being made to produce something that can work as a useful standardised alternative.
This talk will discuss the possibility of creating a vendor neutral reference compute stack based around open source technologies and open source development models that could execute compute tasks across multiple vendor GPUs. Using SYCL/OpenCL/Vulkan and the open-source Mesa stack, as the basis for a future task that development of tools and features on top of as part of a desktop OS.
This talk doesn't have all the answers, but it wants to get people considering what we can produce in the area.
Side channel attacks are here to stay. What can we do inside the operating system to proactively defend against them? This talk will walk through a few of the ideas that Intel’s Open Source Technology Center are developing to improve our resistance to side channel attacks as part of our new side channel defense project. We would also like to gather ideas from the rest of the community on what our top priorities for side channel defense for the Linux kernel should be.
Plugging in USB sticks, building VM images, and unprivileged containers all give rise to a situation where users are mounting and dealing with filesystem images they have not built themselves, and don't necessarily want to trust.
This leads to the problem of how to mount and read/write those filesystems without opening yourself up to more risk than visiting a web page.
I will survey what has been built already, describe what the technical challenges and describe the problems ahead.
With this talk I hope to unite the various groups across the linux ecosystem that care about this problem and get the discussion started on how we can move forward.
A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David S. Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”.
Official Networking Track website: http://vger.kernel.org/lpc-networking.html
This talk is divided into two parts, first we present on kTLS, the current kernel's
sockmap BPF architecture for L7 policy enforcement, as well as the kernel's ULP and
strparser framework which is utilized by both in order to hook into socket callbacks
and determine message boundaries for subsequent processing.
We further elaborate on the challenges we face when trying to combine kTLS with the
power of BPF for the eventual goal of allowing in-kernel introspection and policy
enforcement of application data before encryption. Besides others, this includes a
discussion on various approaches to address the shortcomings of the current ULP layer,
optimizations for strparser, and the consolidation of scatter/gather processing for
kTLS and sockmap as well as future work on top of that.
UDP is a popular foundation for new protocols. It is available across
operating systems without superuser privileges and widely supported
by middleboxes. Shipping protocols in userspace on top of
a robust UDP stack allows for rapid deployment, experimentation
and innovation of network protocols.
But implementing protocols in userspace has limitations. The
environment lacks access to features like high resolution timers
and hardware offload. Transport cost can be high. Cycle count of
transferring large payloads with UDP can be up to 3x that of TCP.
In this talk we present recent and ongoing work, both by the authors
and others, at improving UDP for content delivery.
UDP Segmentation offload amortizes transmit stack traversal by
sending as many as 64 segments as one large fused large packet.
The kernel passes this through the stack as one datagram, then
splits it into multiple packets and replicates their network and
transport headers just before handing to the network device.
Some devices can offload segmentation for exact multiples of
segment size. We discuss how partial GSO support combines the
best of software and hardware offload and evaluate the benefits of
segmentation offload over standard UDP.
With these large buffers, MSG_ZEROCOPY becomes effective at
removing the cost of copying in sendmsg, often the largest
single line item in these workloads. We extend this to UDP and
evaluate it on top of GSO.
Bursting too many segments at once can cause drops and retransmits.
SO_TXTIME adds a release time interface which allows offloading of
pacing to the kernel, where it is both more accurate and cheaper.
We will look at this interface and how it is supported by queuing
disciplines and hardware devices.
Finally, we look at how these transmit savings can be extended to
the forwarding and receive paths through the complement of GSO,
GRO, and local delivery of fused packets.
Among the various ways of using eBPF, OVS has been exploring the power
of eBPF in three: (1) attaching eBPF to TC, (2) offloading a subset of
processing to XDP, and (3) by-passing the kernel using AF_XDP.
Unfortunately, as of today, none of the three approaches satisfies the
requirements of OVS. In this presentation, we’d like to share the
challenges we faced, experience learned, and seek for feedbacks from
the community for future direction.
Attaching eBPF to TC started first with the most aggressive goal: we
planned to re-implement the entire features of OVS kernel datapath
under net/openvswitch/* into eBPF code. We worked around a couple of
limitations, for example, the lack of TLV support led us to redefine a
binary kernel-user API using a fixed-length array; and without a
dedicated way to execute a packet, we created a dedicated device for
user to kernel packet transmission, with a different BPF program
attached to handle packet execute logic. Currently, we are working on
connection tracking. Although a simple eBPF map can achieve basic
operations of conntrack table lookup and commit, how to handle NAT,
(de)fragmentation, and ALG are still under discussion.
Moving one layer below TC is called XDP (eXpress Data Path), a much
faster layer for packet processing, but with almost no extra packet
metadata and limited BPF helpers support. Depending on the complexity
of flows, OVS can offload a subset of its flow processing to XDP when
feasible. However, the fact that XDP has fewer helper function support
implies that either 1) only very limited number of flows are eligible
for offload, or 2) more flow processing logic needed to be done in
AF_XDP represents another form of XDP, with a socket interface for
control plane and a shared memory API for accessing packets from
userspace applications. OVS today has another full-fledged datapath
implementation in userspace, called dpif-netdev, used by DPDK
community. By treating the AF_XDP as a fast packet-I/O channel, the
OVS dpif-netdev can satisfy almost all existing features. We are
working on building the prototype and evaluating its performance.
OVS eBPF datapath.
Over the last 10 years the world has seen NICs go from single port,
single netdev devices, to multi-port, hardware switching, CPU/NFP
having, FPGA carrying, hundreds of attached netdev providing,
behemoths. This presentation will begin with an overview of the
current state of filtering and scheduling, and the evolution of the
kernel and networking hardware interfaces. (HINT: it’s a bit of a
jungle we’ve helped grow!) We’ll summarize the different kinds of
networking products available from different vendors, and show the
workflows of how a user can use the network hardware
offloads/accelerations available and where there are still gaps. Of
particular interest to us is how to have a useful, generic hardware
offload supporting infrastructure (with seamless software fallback!)
within the kernel, and we’ll explain the differences between deploying
an eBPF program that can run in software, and one that can be
offloaded by a programmable ASIC based NIC. We will discuss our
analysis of the cost of an offload, and when it may not be a great
idea to do so, as hardware offload is most useful when it achieves the
desired speed and requires no special software (kernel changes). Some
other topics we will touch on: the programmability exposed by smart
NICs is more than that of a data plane packet processing engine and
hence any packet processing programming language such as eBPF or P4
will require certain extensions to take advantage of the device
capabilities in a holistic way. We’ll provide a look into the future
and how we think our customers will use the interfaces we want to
provide both from our hardware, and from the kernel. We will also go
over the matrix of most important parameters that are shaping our HW
designs and why.
Today every packet which is reaching Facebook’s network is being processed by XDP enabled application. We have been using it for more then 1.5 years and this talk is about evolution of XDP and BPF which has been driven by our production needs. I’m going to talk about history of changes in core BPF components as well as will show why and how it was done. What performance improvements did we get (with synthetics and real world data) and how it was implemented. Also I’m going to talk about issues and shortcoming of BPF/XDP which we have found during our operations, as well as some gotchas and corner cases. In the end we are going to discuss on what is still missing and which part could be improved.
Topics and areas of existing BPF/XDP infrastructure which are going to be covered in this talk:
Lessons which we have learned during operation of XDP:
Missing parts: what and why could be added:
Currently the Linux kernel implements two distinct datapaths for Open
vSwitch: the ovskdp and the TC DP. The latter has been added recently
mainly to allow HW offload, while the former is usually preferred for
SW based forwarding due to functional and performance reasons.
We evaluate both datapaths in a typical forwarding scenario - the PVP
test - using the perf tool to identify bottlenecks in the TC SW dp.
While similar steps usually incur in similar costs, the TC SW DP
requires an additional, per packet, skb_clone, due to a TC actions
We propose to extend the existing act infrastructure, leveraging the
ACT_REDIRECT action and the bpf redirect code, to allow clone-free
forwarding from the mirred action and then re-evaluate the datapaths
performances: the gap is then almost already closed.
Nevertheless, TC SW performance can be further improved by completing
the RCU-ification of the TC actions and expanding the recent
listification infrastructure to the TC (ingress) hook. We plan also to
compare the TC/SW datapath with an custom eBPF program implementing the
equivalent flow set to tag a reference value for the target
eBPF (extended Berkeley Packet Filter) has been shown to be a flexible
kernel construct used for a variety of use cases, such as load balancing,
intrusion detection systems (IDS), tracing and many others. One such
emerging use case revolves around the proposal made by William Tu for
the use of eBPF as a data path for Open vSwitch. However, there are
broader switching use cases developing around the use of eBPF capable
hardware. This talk is designed to explore the bottlenecks that exist in
generalising the application of eBPF further to both container switching as
well as physical switching.
Topics that will be covered include proposals for container isolation through
the use of features such as programmable RSS, the viability of physical
switching using eBPF capable hardware as well as integrations with other
subsystems or additional helper functions which may improve the possible
Linux currently provides mechanisms for managing and allocating many of the system resources such as CPU, Memory, etc. Network resource management is more complicated since networking deals not only with a local resource, such as CPU management does, but can also deal with a global resource. The goal is not only to provide a mechanism for allocating the local network resource (NIC bandwidth), but also to support management of network resources external to the host, such as link and switch bandwidths.
For networking, the primary mechanism for allocating and managing bandwidth has been the traffic control (tc) subsystem. While tc allows for shaping of outgoing traffic and policing of incoming traffic, it suffers from some drawbacks. The first drawback is a history of performance issues when using the Hierarchical Queuing Discipline (htb) which is usually required for anything other than simple shaping needs. A second drawback is the lack of flexibility usually provided by general programming constructs.
We are in the process of designing and implementing a BPF based framework for efficiently supporting shaping of both egress and ingress traffic based on both local and global network allocations.
Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. Which is not a bad thing, since it has been fifteen years since the ``free lunch'' of exponential CPU-clock frequency increases came to an abrupt end. During that time, the number of hardware threads per socket has risen sharply, approaching 100 for some high-end implementations. In addition, there is much more to scaling than simply larger numbers of CPUs.
Proposed topics for this microconference include optimizations for mmap_sem range locking; clearly defining what mmap_sem protects; scalability of page allocation, zone->lock, and lru_lock; swap scalability; variable hotpatching (self-modifying code!); multithreading kernel work; improved workqueue interaction with CPU hotplug events; proper (and optimized) cgroup accounting for workqueue threads; and automatically scaling the threshold values for per-CPU counters.
We are also accepting additional topics. In particular, we are curious to hear about real-world bottlenecks that people are running into, as well as scalability work-in-progress that needs face-to-face discussion.
Android continues to find interesting new applications and problems to solve, both within and outside the mobile arena. Mainlining continues to be an area of focus, as do a number of areas of core Android functionality, including the kernel. Other areas where there is ongoing work include the low memory killer, dynamically-allocated Binder devices, kernel namespaces, EAS, userdata FS checkpointing and DT.
The working planning document is here:
Being a traditional tool with a long history, strace has been making every effort to overcome various deficiencies in the kernel API. Unfortunately, some of these workarounds are fragile, and in some cases no workaround is possible. In this talk maintainers of strace will describe these deficiencies and propose extensions to the kernel API so that tools like strace could work in a more reliable way.
Problem: there is no kernel API to find out whether the tracee is entering or exiting syscall.
Current workarounds: strace does its best to sort out and track ptrace events, this works in most cases, but in case of strace attaching to a tracee being inside exec when its first syscall stop is syscall-exit-stop instead of syscall-enter-stop, the workaround is fragile, and in infamous case of int 0x80 on x86_64 there is no reliable workaround.
Proposed solution: extend the ptrace API with
Problem: there is no kernel API to invoke
wait4 syscall with changed signal mask.
Current workarounds: strace does its best to implement a race-free workaround, but it is way too complex and hard to maintain.
Proposed solution: add
wait6 syscall which is
wait4 with additional signal mask arguments, like
Problem: time precision provided by
struct rusage is too low for
strace -c nowadays.
Current workarounds: none.
Proposed solution: when adding
wait6 syscall, change
struct rusage argument to a different structure with fields of type
struct timespec instead of
Problem: PID namespaces have been introduced without a proper kernel API to translate between tracer and tracee views of pids. This causes confusion among strace users, e.g. https://bugzilla.redhat.com/1035433
Current workarounds: none.
Proposed solution: add
translate_pid syscall, e.g. https://lkml.org/lkml/2018/7/3/589
Problem: there are no consistent declarative syscall descriptions, this forces every user to reinvent its own wheel and catch up with the kernel.
Current workarounds: a lot of manual work has been done in strace to implement parsers of all syscalls. Some of these parsers are quite complex and hard to test. Other projects, e.g. syzkaller, implement their own representation of syscall ABI.
Proposed solution: provide declarative descriptions for all syscalls consistently.
Formal methods have a reputation of being difficult, accessible mostly to academics and of little use to the typical kernel hacker. This talk aims to show how, without "formal" training, one can use such tools for the benefit of the Linux kernel. It will introduce a few formal models that helped find actual bugs in the Linux kernel and start a discussion around future uses from modelling existing kernel implementation (e.g. cpu hotplug, page cache states, mm_user/mm_count) to formally specifying new design choices. The introductory examples are written in PlusCal (an algorithm language based on TLA+) but no prior knowledge is required.
Providing a consistent and predictable performance experience for applications is an important goal for cloud providers. Creating isolated job domains in a multi-tenant shared environment can be extremely challenging. At Google, performance isolation challenges due to memory bandwidth has been on the rise with newer workloads. This talk covers our attempt to understand and mitigate isolation issues caused by memory bandwidth saturation.
The recent Intel RDT support in Linux helps us both monitor and manage memory bandwidth use on newer platforms. However, it still leaves a large chunk of our fleet at risk of memory bandwidth issues. The talk covers three aspects of our isolation attempts:
We believe the problems and trends we have observed are universally applicable. We hope to inform and initiate discussion around common solutions across the community.
Running out of memory on a host is a particularly nasty scenario. In the Linux kernel, if memory is being overcommitted, it results in the kernel out-of-memory (OOM) killer kicking in. In this talk, Daniel Xu will cover why the Linux kernel OOM killer is surprisingly ineffective and how oomd, a newly opensourced userspace OOM killer, does a more effective and reliable job. Not only does the switch from kernel space to userspace result a more flexible solution, but it also directly translates to better resource utilization. His talk will also do a deep dive into the Linux kernel changes and improvements necessary for oomd to operate.
The GNU Toolchain and Clang/LLVM play a critical role at the nexus of the Linux Kernel, the Open Source Software ecosystem, and computer hardware. The rapid innovation and progress in each of these components requires greater cooperation and coordination. This Toolchain Microconference will explore recent developments in the toolchain projects, the roadmaps, and how to address the challenges and opportunities ahead as the pace of change continues to accelerate.
BPF is one of the fastest emerging technologies of the Linux kernel and plays a major role in networking (XDP, tc/BPF, etc), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) thanks to its versatility and efficiency.
BPF has seen a lot of progress since last year's Plumbers conference and many of the discussed BPF tracing Microconference improvements have been tackled since then such as the introduction of BPF type format (BTF) to name one.
This year's BPF Microconference event focuses on the core BPF infrastructure as well as its subsystems, therefore topics proposed for this year's event include improving verifier scalability, next steps on BPF type format, dynamic tracing without on the fly compilation, string and loop support, reuse of host JITs for offloads, LRU heuristics and timers, syscall interception, microkernels, and many more.
Official BPF MC website: http://vger.kernel.org/lpc-bpf.html
Google servers classify, measure, and shape their outgoing traffic. The original implementation is based on Linux kernel traffic control (TC). As server platforms scale so does their network bandwidth and number of classified flows, exposing scalability limits in the TC system - specifically contention on the root qdisc lock.
Mechanisms like selective qdisc bypass, sharded qdisc hierarchies, and low-overhead prequeue ameliorate the contention up to a point. But they cannot fully resolve it. Recent changes to the Linux kernel make it possible to move classification, measurement, and packet mangling outside this critical section, potentially scaling to much higher rates while simultaneously shaping more flows and applying more flexible policies.
By moving classification and measurement to BPF at the new TC egress hook, servers avoid taking a lock millions of times per second. Running BPF programs at socket connect time with TCP_BPF converts overhead from per-packet to per-flow. The programmability of BPF also allows us to implement entirely new functions, such as runtime configurable congestion control, first-packet classification and socket-based QoS policies. It enables faster deployment cycles and as this business logic can be updated dynamically from a user agent. The discussion will focus on our experience converting an existing traffic shaping system to a solution based on BPF, and the issues we’ve encountered during testing and debugging.
Compile-once and run-everywhere can make deployment simpler and may consume less resources on the target host, e.g., without llvm compiler and kernel devel package. Currently bpf programs for networking can compile once and run over different kernel versions. But bpf programs for tracing cannot since it accesses kernel internal headers and these headers are subject to change between kernel versions.
But compile-once run-everywhere for tracing is not easy. BPF programs could access anything in the kernel headers, including data structures, macros and inline functions. To achieve this goal, we need (1) preserving header-level accesses for the bpf program, and (2) abstracting header info of vmlinux. Right before program load on the target host, some kind of resolution is done for bpf program against the running kernel so the resulted program is just like to that compiled against host kernel headers.
In this talk, we will explore how BTF could be used by both bpf program and vmlinux to advance the possibility of bpf program compile-once and run-everywhere.
BPF program writers today who build and distribute their programs as ELF objects typically write their programs using one of a small set of (mostly) similar headers that establish norms around ELF section definitions. One such norm is the presence of a "maps" section which allows maps to be referenced within BPF instructions using virtual file descriptors. When a BPF loader (eg, iproute2) opens the ELF, it loads each map referred in this section, creates a real file descriptor for that map, then updates all BPF instructions which refer to the same map to specify the real file descriptor. This allows symbolic referencing to maps without requiring writers to implement their own loaders or recompile their programs every time they create a map.
This discussion will take a look at how to provide similar symbolic referencing for static data. Existing implementations already templatize information such as MAC or IP addresses using C macros, then invoke a compiler to replace such static data at load time, at a cost of one compilation per load. By extending the support for static variables into ELF sections, programs could be written and compiled once then reloaded many times with different static data.
Currently, BPF can not support basic loops such as for, while, do/while, etc. Users work around this by forcing the compiler to "unroll" these control flow constructs in the LLVM backend. However, this only works up to a point. Unrolling increases instruction count and complexity on the verifier and further LLVM can not easily unroll all loops. The result is developers end up writing code that is unnatural, iterating until they find a version that LLVM will compile into a form the verifier backend will support.
We developed a verifier extension to detect bounded loops here,
This requires building a DOM tree (computationally expensive) and then matching loop patterns to find loop invariants to verify loops terminate. In this discussion we would like to cover the pros and cons of this approach. As well as discuss another proposal to use explicit control flow instructions to simplify this task.
The goal of this discussion would be to come to a consensus on how to proceed to make progress on supporting bounded loops.
eBPF has 64-bit general purpose registers, therefore 32-bit architectures normally need to use register pair to model them and need to generate extra instructions to manipulate the high 32-bit in the pair. Some of these overheads incurred could be eliminated if JIT compiler knows only the low 32-bit of a register is interested. This could be known through data flow (DF) analysis techniques. Either the classic iterative DF analysis or "path-sensitive" version based on verifier's code path walker.
In this talk, implementations for both versions of DF analyser will be presented. We will see how a def-use chain based classic eBPF DF analyser looks first, and will see the possibility to integrate it with previous proposed eBPF control flow graph framework to make a stand-alone eBPF global DF analyser which could potentially serve as a library. Then, another "path-sensitive" DF analyser based on the existing verifier code path walker will be presented. We will discuss how function calls, path prune, path switch affect the implementation. Finally, we will summarize pros and cons for each, and will see how could each of them be adapted to 64-bit and 32-bit architecture back-ends.
Also, eBPF has 32-bit sub-register and ALU32 instructions associated, enable them (-mattr=+alu32) in LLVM code-gen could let the generated eBPF sequences carry more 32-bit information which could potentially easy flow analyser. This will be briefly discussed in the talk as well.
eBPF (extended Berkeley Packet Filter), in particular with its driver-level hook XDP (eXpress Data Path), has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This talk will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the use of disassembly to inspect generated assembly code and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.
The talk will also explore where the current gaps in debugging infrastructure are and suggest some of the next steps to improve this, for example, integrations with tools such as strace, valgrind or even the LLDB debugger.
Complex software usually depends on many different components, which sometimes perform background tasks with side effects not directly visible to their users. Without proper tools it can be hard to identify which component is responsible for performance hits or undesired behaviors.
We were challenged to implement D-Bus observability tools in embedded, ARM32 or ARM64 kernel based environments, both with 32-bit userspace. While we found bcc-tools, an open source compiler set useful, it appeared that it lacks support for 32-bit environments. We extended bcc-tools with support for 32-bit architectures. Using bcc-tools we created Linux eBPF programs – small programs written in a subset of C language, loaded from user-space and executed in kernel context. We attached them to uprobes and kprobes - user and kernel space special kinds of breakpoints. While it worked on ARM32 kernel based system, we faced another problem - ARM64 kernel lacked support for uprobes set in 32-bit binaries. The 64-bit ARM Linux kernel was extended with the ability to probe 32-bit binaries.
We propose to discuss challenges we faced trying to implement bcc-tools based tracing tools on ARM devices. We present a working solution to overcome lack of support for 32-bit architectures in bcc-tools, leaving space for discussion about other ways to achieve the same result. We also introduce 32-bit instruction probing in ARM64 kernel - a solution that we found very useful in our case. As a proof of concept we present tools that monitor D-Bus usage in ARM32 or ARM64 kernel based system with 32-bit userspace. We list what needs to be done for complete eBPF-based tools to be fully usable on ARM.
eBPF (extended Berkeley Packet Filter) is an in-kernel generic virtual machine, which can be used to execute simple programs injected by the user at various hooks in the kernel, on the occurrences of events such as incoming packets. eBPF was designed to simplify the work of in-kernel just-in-time compilers, i.e. translation of eBPF intermediate representation to CPU machine code. Upstream Linux kernel currently contains JITs for all major 64-bit instruction set architectures (ISAs) (x86, AArch64, MIPS, PowerPC, SPARC, s390) as well as some 32-bit translators (ARM, x86, also NFP - Netronome Flow Processor).
The eBPF generic virtual machine with clearly defined semantics makes it a very good vehicle for enabling programming of custom hardware. From storage devices to networking processors most host I/O controllers today are built based on or with accompaniment with general purpose processing cores, e.g. ARM. As vendors try to expose more and more capabilities of their hardware, using a general purpose machine definition like eBPF to inject code into hardware directly allows us to avoid creation of vendor specific APIs.
In this talk I will describe the eBPF offload mechanism which exists today in the Linux kernel and how they compare to other offloading stacks e.g. for compute or graphics. I will present a proof-of-concept work on of reusing existing eBPF JITs for non-host architecture (e.g. ARM JIT on x86) to program a emulated device, followed by a short description of the eBPF offload for NFP hardware as an example of a real-life offload.
eBPF-based traffic policer as a replacement* of Hierarchical Token Bucket queuing discipline.
The key idea is two rate three color marker (rfc2698) algorithm, which inputs are committed and peak rates with the corresponding burst sizes and the output is a color or category assigned to a packet. There are conforming, exceeding, violating categories. An action is applied to violating category - either drop or dscp remark. Another action may optionally be applied to exceeding category.
Close-up of eBPF implementation**. Write intensiveness is a cornerstone: an update of available tokens is required on each packet, as well as tracking of time. Naive implementation and its exposure to data races on multi-core processors system. A problem of updating both timestamp and the number of available tokens atomically. Slicing the timeline into chunks of the size of burst duration as a solution for races, mapping each packet into its chunk, so there is no need in updating global timestamp. Two approaches of storing timeline chunks: bpf LRU hash map and a block of timeline chunks in bpf array. Circulating over a block of timeline chunks. Proc and cons of the latter approach: lock-free with bpf array as the only data structure used vs. increased amount of locked memory.
Combining several policers. Linear chain of policers instead of hierarchy. Passing a packet over the chain. Dealing with bandwidth underutilization when first K policers in a chain conform a packet and K+1 rejects. Commutative property of chained policers. Interaction with UDP and TCP. TCP reacting on drop by changing congestion window which affects the actual rate.
Deep packet inspection seems to be a largely unexplored area of BPF use cases. The 4096 instruction limit and the lack of loops make such implementations non-straightforward for many protocols. Using XDP and socket filters, at Red Sift, we implemented DNS and TLS handshake detection to provide better monitoring for our clusters. We learned that while the protocol implementation is not necessarily straightforward, the BPF VM provides a reasonably safe environment for DPI-style parsing. When coupled with our Rust userspace implementation, it can provide information and functionality that previously would have required userspace intercepting proxies or middleboxes, at a comparable performance to iptables-style packet filters. Further work is needed to explore how we can turn this into a more comprehensive, active component, mainly due to the BPF VM restrictions around 4096 instruction programs.
BPF trace tools such as bcc/trace and bpftrace can attach to Systemtap USDT (user application statically defined tracepoints) probes. These probes can be created by a macro imported from "sys/sdt.h" or by a provider file. Either way, Systemtap will register those probes as entries in the note section of the ELF file with the name of the probe, its address and the arguments as assembly locations. This approach is fairly simple, easy to parse and non-intrusive. Unfortunately, it is also obsolete and lacks features such as typed arguments and built-in dynamic instrumentation. Since BPF tools are growing in popularity, it makes sense to create a new enhanced format to fix these shortcomings.
We can discuss and make decisions about the future of USDT probes used by BPF trace tools. Some possible alternatives are: extend Systemtap USDT to introduce these new features or extend kernel tracepoints so that user applications can also register them.
The 'perf trace' tool uses the syscall tracepoints to provide a !ptrace based 'strace' like tool, augmenting the syscall arguments provided by the tracepoints with integer->strings tables automatically generated from the kernel headers, showing the paths associated with fds, pid COMMs, etc.
That is enough for integer arguments, pointer arguments needs either kprobes put in special locations, which is fragile and has been so far implemented only for getname_flags (open, etc filenames), or using eBPF to hook into the syscall enter/exit tracepoints to collect pointer contents right after the existing tracepoint payload.
This has been done to some extent and is present in the kernel sources in the tools/perf/examples/bpf/augmented_syscalls.c, using the pre-existing support in perf to use BPF C programs as event names, automagically using clang/llvm to build and load it via sys_bpf(), 'perf trace' hooks this to the existing beautifiers that seeing that extra data use it to get the filename, struct sockaddr_in, etc.
This was done for a bunch of syscalls, what is left is to get this all automated using BTF, allow passing filters attached to the syscalls, select which syscalls should be traced, use a pre-compiled augmented_syscalls.c just selecting what bits of the obj should be used, etc, i.e. the open issues about this streamlining process to avoid requiring the clang toolchain, etc will be the matter of this discussion.
The physical memory management in the Linux kernel is mostly based on single page allocations, but there are many situations where a larger physically continuous memory needs to be allocated. Some are for the benefit of userspace (e.g. huge pages), others for better performance in the kernel (SLAB/SLUB, networking, and others).
Making sure that contiguous physical memory is available for allocation is far from trivial, as pages are reclaimed for reuse roughly in last-recently-used (LRU) order, which is typically different from their physical placement. The freed memory is thus fragmented. The kernel has two complementary mechanisms to defragment the free memory. One is memory compaction, which migrates used pages to make the free pages contiguous. The other is page grouping by mobility, which tries to make sure that pages that cannot be migrated are grouped together, so the rest of pages can be effectively compacted. Both mechanisms employ various heuristics to balance the success of large allocations, and their overhead in terms of latencies due to processor and lock usage.
The talk will discuss the two mechanisms, focusing on the known problems and their possible solutions, that have been proposed by several memory management developers.
WireGuard   is a new network tunneling mechanism written for
Linux, which, after three years of development, is nearly ready for
upstream. It uses a formally proven cryptographic protocol, custom
tailored for the Linux kernel, and has already seen very widespread
deployment, in everything from smart phones to massive data center
clusters. WireGuard uses a novel timer mechanism to hide state from
userspace, and in general presents userspace with a "stateless" and
"declarative" system of establishing secure tunnels. The codebase is
also remarkably small and has been written with a number of defense in
depth techniques. Integration into the larger Linux ecosystem is
advancing at a health rate, with recent patches for systemd and
NetworkManager merged. There is also ongoing work into combining
WireGuard with automatic configuration and mesh routing daemons on
Linux. This talk will focus on a wide variety of WireGuard’s innards
and tentacles onto other projects. The presentation will walk through
WireGuard's integration into the netdev subsystem, its unique use of
network namespaces, why kernel space is necessary is necessary, the
various hurdles that have gone into designing a cryptographic protocol
specifically with kernel constraints in mind. It will also examine a
practical approach to formal verification, suitable for kernel
engineers and not just academics, and connect the ideas of that with
our extensive continuous integration testing framework across multiple
kernel architectures and versions. As if that was not already enough,
we will also take a close look at the interesting performance aspects
of doing high throughput CPU-bound computations in kernel space while
still keeping latency to a minimum. On the topic of smartphones, the
talk will examine power efficiency techniques of both the
implementation and of the protocol design, our experience in
integrating this into Android kernels, and the relationship between
cryptographic secrets and smartphones suspend cycles. Finally we will
look carefully at the WireGuard userspace API and its usage in various
daemons and managers. In short, this presentation will examine the
networking and cryptography design, the kernel engineering, and the
userspace integration considerations of WireGuard.
Lockdep (the deadlock detector in the Linux kernel) is a powerful tool to detect deadlocks, and has been used for a long time by kernel developers. However, when comes to read/write lock deadlock detections, lockdep only has limited support. Another thing makes this limited support worse is some major architectures (x86 and arm64) has switched or is trying to switch its rwlock implementation to queued rwlock. One example is we found some deadlock cases that happened in kernel but we could not detect it with lockdep.
To improve this situation, a patchset to support read/write deadlock detection in lockdep has been post to lkml and got to its v6. Althrough it got several positive feedbacks, some details about the reasoning of the correctness and other things still need more discussion.
This topic will give a brief introduction on rwlock related deadlocks (recursive read deadlocks) and how we can tweak lockdep to detect them. It will focus on the detection algorithm and its correctness, but also some implementation details.
This topic will provide the opportunity to discuss the reasoning and the overall design with some core lock developers, along with the opportunity to discuss the usage scenarios with potential users. The expected result is we have a cleaner plan on upstreaming this and more developers get educated on how to use this to help their work.
Most modern microprocessors employ complex instruction execution pipelines such that many instructions can be 'in flight' at any given point in time. The efficiency of this pipelining is typically measured in how many instructions get completed per CPU cycle and the metric gets variously called as Instructions Per Cycle (IPC) or the inverse metric Cycles Per Instruction (CPI). Various factors affect this metric and hazards are the primary among them. Different types of hazards exist - Data hazards, Structural hazards and Control hazards. Data hazard is the case where data dependencies exist between instructions in different stages in the pipeline. Structural hazard is when the same processor hardware is needed by more than one instruction in flight at the same time. Control hazards are more the branch misprediction kinds. Information about these hazards are critical towards analyzing performance issues and also to tune software to overcome such issues. Modern processors export such hazard data in Performance Monitoring Unit (PMU) registers. In this talk, we propose an arch neutral extension to perf to export the hazard data presented in different ways by different architectures. We also present how this extension has been applied to the IBM Power processor, the APIs and example output.
The focus will be on power management frameworks, task scheduling in relation to power/energy optimization, and platform power management mechanisms. The goal is to facilitate cross framework and cross platform discussions that can help improve power and energy-awareness in Linux.
Remote DMA Microconference
Opening RDMA session with agenda, announcements and some statistics from last year.
RDMA, DAX and persistant memory co-existence.
Explore the limits of what is possible without using On
Demand Paging Memory Registration. Discuss 'shootdown'
of userspace MRs
Dirtying pages obtained with get_user_pages() can oops ext4
discuss open solutions.
The momentum behind RISC-V ecosystem is really commendable and its open nature has a large role in its growth. It allowed contributions from both academic and industry community leading to an unprecedented number of hardware designs proposals in a very short span of time. Soon, a wider variety of RISC-V based hardware boards and extensions
will be available, allowing a larger choice of applications not limited to embedded micro-controllers. RISC-V software ecosystem also need to grow across the stack so that RISC-V can be a true alternative to existing ISA. Linux kernel support holds the key in this.
The primary objective of the RISC-V track at Plumbers to initiate a community wide discussion about the design problems/ideas for different Linux kernel features implemented or to be implemented. We believe that this will also result in significant increase in active developer participation in code review/patch submissions which will definitely lead to a better & stable kernel for RISC-V.
The eXpress Data Path (XDP) has been gradually integrated into the Linux kernel over several releases. XDP offers fast and programmable packet processing in kernel context. The operating system kernel itself provides a safe execution environment for custom packet processing applications, in form of eBPF programs, executed in device driver context. XDP provides a fully integrated solution working in concert with the kernel's networking stack. Applications are written in higher level languages such as C and compiled via LLVM into eBPF bytecode which the kernel statically analyses for safety, and JIT translates into native instructions. This is an alternative approach, compared to kernel bypass mechanisms (like DPDK and netmap).
This talk gives a practical focused introduction to XDP. Describing and giving code examples for the programming environment provided to the XDP developer. The programmer need to change their mindeset a bit, when coding for this XDP/eBPF execution environment. XDP programs are often split between eBPF-code running kernel side and userspace control plane. The control plane API not predefined, and is up to the programmer, through userspace manipulating shared eBPF maps.
The Google computing infrastructure uses containers to manage millions of simultaneously running jobs in data centers worldwide. Although the applications are container aware and are designed to be resilient to failures, evictions due to resource contention and scheduled maintenance events can reduce overall efficiency due to the time required to rebuild complex application state. This talk discusses the ongoing use of the open source Checkpoint/Restore in Userspace (CRIU) software to migrate container workloads between machines without loss of application state, allowing improvements in efficiency and utilization. We’ll present our experiences with using CRIU at Google, including ongoing challenges supporting production workloads, current state of the project, changes required to integrate with our existing container infrastructure, new requirements from running CRIU at scale, and lessons learned from managing and supporting migratable containers. We hope to start a discussion around the future direction of CRIU as well as task migration in Linux as a whole.
Over the past few years the graphics subsystem has been spearheading experiments in running things differently: Pre-merge CI wrapped around mailing lists using patchwork, committer model as a form of group maintainership on steroids, and other things. As a result the graphics people have run into some interesting new corner cases of the kernel's "patches carved on stone tablets" process.
On the other hand the freedesktop.org project, which provides all the server infrastracture for the graphics subsystem, is undergoing a big reorganization of how they provide their services. The biggest change is migrating all source hosting over to a gitlab instance.
This talk will go into the why of these changes and detail what is definitely going to change, and what is being looked into more as experiments with open outcomes.
The main purpose of the Linux Plumbers 2018 Live kernel patching miniconference is to involve all stakeholders in open discussion about remaining issues that need to be solved in order to make Live patching of the Linux Kernel (more or less) feature complete.
The main purpose of the proposed miniconference is focusing on the features that have been proposed (some even with a preliminary implementation), but not yet finished, with the ultimate goal of sorting out the remaining issues.
Remote DMA Microconference