- Real Time Microconference Notes

Welcome to Linux Plumbers Conference 2014. Please use this etherpad to take notes. Microconf leaders will be giving a summary of their microconference during the Friday afternoon closing session. Please remember there is no video this year, so your notes are the only record of your microconference.

Topic 1: Virtualization and Real-Time Linux
Speaker: Jan Kiszka (Siemens)
http://www.linuxplumbersconf.org/2014/ocw/sessions/1935

Use cases:
- Paul McKenney - using a real-time host and non-real-time guests
- Nicholas Mc Guire - achievable jitter with a hypervisor is, in his experience, not that different than without a hypervisor
  - real-time hypervisors have been around for a long time
  - there seems to be a need for this kind of thing, but not necessarily a good solution yet
- audience member describes a use case with a real-time host and a real-time guest, measuring the results using cyclictest - would like to improve the cyclictest results
- audience member - 1000 microsecond deadline use case
- why use a VM at all? - requested by the speaker's customers, possibly for isolation
- Peter Zijlstra points out that we sometimes need to tell customers that they are wrong in their choices

Configuration:
- Jan starts to talk about how to configure systems for virtualization
  - a number of threads need to be configured correctly (with various priorities)
  - mentions some difficulties where he needed to properly prioritize RCU threads
- Paul mentions the need to confine them to CPU cores
- Jan asks how to figure out which threads require which priorities, what is involved, and so on - basically saying that we require proper tooling
- Thomas, with his typical sense of humour, says that real-time is hard, of course, and it should be - otherwise everyone would use it :)
- Thomas points out that you require not just the typical kernel hacking, but also QEMU hacking.
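Jan's point about configuring per-thread priorities and affinities can be sketched with the standard Linux scheduling syscalls. This is only an illustration of the mechanism, not the tooling being asked for: the CPU number and the SCHED_FIFO priority value are assumptions, and choosing the right relative priorities for vCPU, IRQ, and RCU threads is exactly the hard part discussed above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>

/* Pin a thread (tid 0 = calling thread) to a single CPU, as one might
 * do for each KVM vCPU thread or IRQ thread on a PREEMPT_RT host. */
static int pin_to_cpu(pid_t tid, int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

/* Give a thread a SCHED_FIFO priority (1..99).  The value used is
 * purely illustrative; requires CAP_SYS_NICE/root to succeed. */
static int set_fifo_priority(pid_t tid, int prio)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = prio;
    return sched_setscheduler(tid, SCHED_FIFO, &sp);
}
```

In practice the same effect is usually achieved from the shell with chrt and taskset on the thread IDs that QEMU/KVM creates.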
- Nicholas asks what size of systems would be targeted for this virtualization approach
- Thomas talks about difficult use cases in which people want both fully isolated systems and still the ability to do things like add CPUs on the fly
  - one of the use cases that people dream of is reliability, for, say, cloud computing
- Jan points out that the real-time aspects are usually local, since we don't have real-time network responses
- Thomas asks the audience what their target is for virtualization
  - do you want a more dynamic type of configuration, in which you can use different numbers of CPUs and so on each day, or more isolated configurations?
- one of the smaller problems is to improve KVM (rt) performance
- Nicholas asks what is going on with the tooling right now
  - many of the current tools are not aimed at the rt use cases
- Jan - any kind of tracing; configuring interrupt management properly
  - proper priorities and affinities for interrupts are important; just making these kinds of things more configurable is important
- both Jan and Thomas emphasize the need for good documentation
- Thomas says you need to understand the whole software stack in order to make good use of the tools
- there is probably a need at some point for tools targeted at users who are newer to these topics (real-time and virtualization) and lack the deeper understanding

Topic 2 - Long-Term Testing
Speaker: Carsten Emde - Manager of the Open Source Automation Development Lab (OSADL)
- the scope of OSADL has been extended to embedded systems and more general systems
- (note that Carsten has some slides available that you should refer to before the discussion portion of the talk)
- discusses testing with identical hardware and applications, and various kernel versions, down to particular patches
- uptime recordings, leading to bugfixes
- Carsten asks how many of us have visited the OSADL website and had a look at the QA farm, with its measurements.
- note: there is a large variety of processors, old and new, and various architectures that are tested; single outliers can be detected in billions of operations
- Thomas points out that we don't have the capacity to examine the large amount of data that is produced by OSADL - analysing the crashes and the rt-performance outliers. We require someone to analyse them and produce bug reports. We don't necessarily require a kernel expert, but someone who is fairly knowledgeable, of course.
- Carsten says that anyone can access the data and analyse it
  - perhaps in the future OSADL could hire a technician to analyse the data, but there is no particular person who currently does this (other than Carsten himself, who spends up to two hours a day doing so)
  - one or two hours a day is enough to maintain the farm, but not for the deeper analysis. Not an insurmountable problem.
- Paul McKenney asks what the farm's favourite kernel version is - by looking at the worst outliers, for example
- Carsten says that they monitor systems running various kernels: 1/3 of the systems run kernels requested by customers, 2/3 run kernels chosen by OSADL
  - v3.12-rt is highly used, some v3.14-rt
  - Carsten finds it interesting to examine both current kernels and older versions - not too far back, v3.x-rt
- also in the slides, Carsten explains how to find the "hall of shame" on the OSADL website - information on kernels that have crashed. An important question is to discover which bugs are rt bugs and which are vanilla upstream bugs uncovered by rt testing.
- Thomas asks how to connect all of this testing in the long run to other testing efforts
- audience member points out the need to do things such as bisect problems
- Thomas asks what to do with LONG-TERM testing, though, as opposed to a simple kernel crash
- Paul asks if there are other efforts that do long-term testing lasting, say, a year or more.
- Carsten points out that we see very few bug reports on the rt-linux mailing list; we don't have a tradition of people doing quality bug-report testing. Should we use bugzilla?
- Thomas is of the opinion that we should not - it would just become a dumping ground of not-necessarily-useful information
- Paul points out that rt often makes mainline bugs more obvious
- Carsten gives an example where a BeagleBoard crashed with rt every 4 to 6 hours; he then tried to reproduce the bug with a mainline kernel and was able to do so, but it took more like 48 hours to reproduce
- Thomas is surprised that enterprise doesn't do more long-term testing
- (John) - our testing tends to be more like 24- or 48-hour testing; we actually do week-long testing as well
- Carsten points out that certain problems are not uncovered unless you do this long-term kind of testing
- Carsten points out that these are sometimes economic decisions

Topic 3: Current state of full dynticks (nohz_full)
Speaker: Frederic Weisbecker
Note: slides are also available (refer to them for the pre-discussion portion)

Discussion:
- [fill in here, Thomas' use case for rcu_nocb]
- HPC people care about keeping the cache busy - workload dependent
- bmouring: one use case (similar to the bare-metal networking case) is a hard-polling, 100%-use rtprio task preventing a core from taking care of housekeeping tasks
  - results from customers of ours who grew accustomed to being able to do this on a single-mode RTOS - basically from-app (userspace on Linux) polling
- Paul says that HPC people often use 10 HZ
- the run-queue lock serializes the world
- the work to isolate CPUs for NO_HZ has also been useful for isolating CPUs for power management
- Peter Zijlstra points out there is high overhead for this, for example on system calls
  - some of the overhead is in the measurement - is it possible to have less accurate measurements to gain speed?
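For reference, the CPU isolation being discussed is configured at boot time. A typical kernel command-line fragment for a 4-CPU machine that keeps CPU 0 for housekeeping might look like the following (the CPU numbers are illustrative, and CONFIG_NO_HZ_FULL must be enabled in the kernel):

```
isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3
```

nohz_full= stops the tick on the listed CPUs when they run a single task, and rcu_nocbs= offloads their RCU callbacks to housekeeping CPUs.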
- Thomas points out that you can update the information in the scheduler, avoiding the system-call overhead of doing this in user space
- Kevin Hilman comments on the power-management use case:
  - all of the work for NOHZ_FULL has made NOHZ_IDLE even better for PM use cases, without having to turn on NOHZ_FULL and incur the overhead
  - the overhead of NOHZ_FULL is the accounting that is needed for the context switch between userspace and the kernel
  - the idle case (simple NOHZ) does not require this, but the offloading from the CPU that is used by NOHZ_FULL allows for the same offloading from idle CPUs
  - because of all the isolation work, CPUs can stay idle for even longer
- could SMT be used to record the housekeeping?
- use case: running networking on bare metal
  - e.g. some CPUs running Linux, other CPUs dedicated to a bare-metal networking task, polling
  - goal: use full NO_HZ to be able to run those as userspace threads
- detecting idle states in some power domain (a system in a cluster, for example)
- desire to use NO_HZ with virtualization, but this requires some work, for example KVM patches
- Jan talks about exposing the native APIC to the virtual machine
- Thomas says that the HPET should not be used for anything critical if not necessary
- Amit Kucheria: could we have a kernel API requesting that a CPU not be used?
  - useful to wean mobile vendors off CPU hotplug
  - potentially useful for thermal management
  - currently we don't solve this - timers and workqueues might still disturb the CPU

Topic 4: Following mainline with PREEMPT_RT
Speaker: Sebastian Siewior (slides available as usual)
- Nicholas asks whether some of the well-defined scenarios in mainline code that are problematic for rt could be written as Coccinelle scanners
  - Thomas has tried this, but found that it was not possible in practice; e.g. trylocks need to be examined on a case-by-case basis
- Thomas says that locks such as local_irq_save act as BKL-style locks, in that they protect "random crap".
RT changes this by naming the locks, offering a named scope. This annotation mechanism could be mainlined.
- what if we wanted to protect two data structures? We don't want to call local_irq_save twice. Answer: we could come up with some kind of nested locking scheme.
- Thomas asks Sebastian if there is anything he would do differently - for example in mainline - that would make the rt effort easier
  - answer: consistent annotation and notation in the mainline kernel
- Carsten asks how much kernel testing is done with mainline preemption
  - would it help if more mainline kernel testers / distros tested with mainline preemption enabled?
  - enterprise usually chooses voluntary preemption
- do mainline kernel developers cooperate or provide resistance when rt programmers point out code that causes problems for rt?
  - mostly they are cooperative, but sometimes there are conflicting goals - scratching out the last bit of performance vs. latency
  - often there are mutual benefits, increased annotation, etc.

Topic: Read-write semaphore bottlenecks
Speaker: Steven Rostedt (only the 5 slides are reproduced below)

Read-write semaphores - slide 1
- allow multiple readers but only one writer
- they are fair locks: new readers will block if a writer is blocked

Read-write semaphores - slide 2
- real-time converts them to a simple mutex
  - serializes readers; mainline can run them in parallel
  - affects various workloads drastically
- note, mainline can be forced to serialize readers if a writer is blocked - remember, they are fair locks

Read-write semaphores - slide 3
- biggest culprit for performance issues - mmap_sem
  - page faults
  - lots of threads (Java!)
- Peter Zijlstra has worked to avoid taking mmap_sem on page faults
- there may be other areas where rwsems are bad

Read-write semaphores - slide 4
- priority inheritance is hard
  - doing PI for multiple tasks is even harder - it was done before and was really complex
- tried to keep the fast path - use cmpxchg() to grab the lock quickly when uncontended
  - "train wreck!" (Thomas Gleixner quote)

Read-write semaphores - slide 5
- revisit priority inheritance
- forget the fast path (rwsems suck anyway)
  - greatly simplifies the algorithm
  - all must take the internal spinlock before taking the lock
- still complex, but reasonable

Discussion:
- Thomas: if you boost multiple readers, then this is not a limited mechanism
  - tree-like fanout in the priority-boosting path (which can be deep)
  - notes that the patch isn't as bad as it was before
- Steve says you could limit it
- Peter says you could add a WARN_ON for recursive cases
- Paul says you could boost serially - that is, one reader at a time instead of all readers at the same time
  - Steve says that is already the case - one writer has to get all of the readers out of the way first
  - Thomas elaborates on the serial version, where you just take one reader at a time from the reader list
  - this is the case on a single processor anyway; potentially on SMP, if you're lucky, you might get the multiple readers running on individual CPUs
- the performance decreases we see with rt are usually in the non-rt workloads
- Steve asks if anyone knows of scenarios in which non-rt workloads decrease dramatically with an rt kernel
  - Carsten says multiple-channel I/O slows down applications on multicore machines

Topic: SIL2LinuxMP - GNU/Linux multicore platform for safety-related systems
Speaker: Nicholas Mc Guire

Discussion:
- Nicholas makes the point that what we are checking is not the final code, but the process that creates the code
  - this keeps "random crap" out of the code
- Nicholas calls the process of discussing what works and what doesn't "safety culture" -
which Linux does in effect have
- Paul: what would be required to make this happen (safety certification)?
- Nicholas: we need to convince a certification authority about our process, select a version, and certify it

Topic: Avoiding some of the mmap_sem usage
Speaker: Peter Zijlstra
- mmap_sem is mostly gone from various places such as futexes
- currently a lot of work is being done on page faults to remove mmap_sem usage
- idea: do an RCU-like lookup on the VMA tree to avoid use of the semaphore
- Paul mentions that there has been some work showing substantial speed-ups, even in mainline, from removing this
- question: are there any problems with speculative lookup?
  - Peter: he maintains a counter to check whether to invalidate the lookups; the wrap-around occurs relatively infrequently, on the order of every 4 hours
- need to look at removing or reducing mmap_sem for memory compaction during a THP page fault (non-trivial)
- Peter measured a 124% performance increase with a scientific benchmark (probably from SGI), in mainline (non-rt)

Topic: RT status report
Speaker: Thomas Gleixner
- discussed ways to reduce complexity in the rt patch by pushing various parts upstream
- Thomas tried to do the simple thing of doing one thing at a time, but ended up with more than ten parts with interdependencies, which makes it hard to push various components separately upstream
- locking is difficult for rt due to the unnamed locking mentioned earlier, which acts somewhat like the BKL
  - hard to understand what locks are protecting
- people were asking how they can help
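Peter's counter-based invalidation for speculative VMA lookup resembles the kernel's seqcount pattern. A hypothetical, heavily simplified single-writer sketch (all names invented for illustration; the real work involves the VMA tree, not a single value):

```c
#include <stdatomic.h>

/* Readers sample a generation counter, do the unlocked lookup, and
 * retry if a writer bumped the counter in the meantime.  An odd
 * counter value means a write is in progress.  Single writer assumed. */

static atomic_uint gen;
static int protected_value;      /* stands in for the VMA tree */

static void writer_update(int v)
{
    atomic_fetch_add(&gen, 1);   /* counter now odd: writer active */
    protected_value = v;
    atomic_fetch_add(&gen, 1);   /* counter even again: stable */
}

static int speculative_read(void)
{
    unsigned start;
    int v;

    do {
        /* spin past an in-progress write, then sample the generation */
        while ((start = atomic_load(&gen)) & 1)
            ;
        v = protected_value;
    } while (atomic_load(&gen) != start);   /* writer raced us: retry */
    return v;
}
```

The "wrap-around every 4 hours or so" mentioned above corresponds to the generation counter overflowing, which is when stale speculative state must be discarded wholesale.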
- difficult to assign jobs to people; that's not the way kernel work is done
  - find something you find interesting, something that's small; send patches; listen to reviews; etc.
  - the important thing is to read the rt patch and try to understand it
- a few things were disabled due to complex locking; the normal approach is to disable a feature, apply the patch, enable small pieces, and watch what explodes
- Thomas gave a status report on Monday at the RTLWS
- the successful part of rt: it is possible to turn Linux (a general-purpose OS) into a real-time OS
  - won out over other approaches
  - various questions and attempts in the past to do things like budgeting
  - long list of successful code that got into the mainline kernel, such as high-resolution timers, priority inheritance, generic IRQs
  - the lock dependency validator removed many bugs in locking in the mainline kernel
- the disaster side: the US Navy [which contributed a lot of the initial money] had a clause in their contract that there would be a best effort to move the work upstream (good), but not everything went upstream
- the monetary part of rt has dried up
  - rt is in use in various products, but there doesn't seem to be industry interest in contributing funding for further development
  - non-rt users - large companies like Apache, and companies that provide services - understand the value of supporting Linux
  - rt use cases - embedded spaces, automobiles, automation - place more value on the hardware they are building
  - nobody expects a single company to pay for everything, but it would be nice if a lot of small companies would come together to contribute funding
  - Thomas describes this as a Mikado situation, where whoever moves first loses (they pay)
- Thomas has put the project on "hobby status"
- there are efforts at organizations such as the Linux Foundation and OSADL to continue the work, find funding, etc.
- The longer this drags out the harder it will be to restart it because the gap between mainline and rt will widen. - New mainline technologies are often harmful to rt, and when the patch is out of tree, it's more difficult to identify these. - Linus has no problem to integrate various bits and pieces if they are well done. Needs to see long term interest in maintaining this though. Q: is the hard part just finding maintainence? A: No, the hard part is getting it integrated in the first place, the 2nd hard part is finding long term maintainence. Getting it integrated requires clean-ups of both the rt-patch, and of the mainline code. Thomas: typically how we've gotten rt code in the mainline previously was to give Linus something he wanted, along with our code. Carsten: step one is to find funding to get to the level of being able to just create the next kernel versions (typically even numbered kernels). - points out that the more we upstream code, the lower the maintaince effort is, but never zero Q: is CII (Core Infrastructure Initiative) something we could join? A: possibly - How many members do we need to sustain this? Carsten (OSADL), was hoping to have 100 members, and 2 dedicated engineers. - the number is not insurmountable. - what kind of benefit can people get by giving money to OSADL such as training? - Carsten answers, legal advice, access to QA farm, free entrance to conferences, marketing services, networking of employees, etc - Peter Z: the more you contribute to these "gimmicks", the less money there is for funding rt, development - Carsten: not necessarily, because we pay once for the services, but the more people contribute the easier this is, every new member benefits from work that they provide once