- Real Time Microconference Notes

Welcome to Linux Plumbers Conference 2014. Please use this etherpad to take notes. Microconf leaders will be giving a summary of their microconference during the Friday afternoon closing session. Please remember there is no video this year, so your notes are the only record of your microconference.

Topic 1: Virtualization and Real-Time Linux
Speaker: Jan Kiszka (Siemens)
http://www.linuxplumbersconf.org/2014/ocw/sessions/1935

Use cases:
- Paul McKenney - using a real-time host and non-real-time guests
- Nicholas Mc Guire - achievable jitter with a hypervisor is, in his experience, not that different than without a hypervisor
  - real-time hypervisors have been around for a long time
  - there seems to be a need for this kind of thing, but not necessarily a good solution yet
- audience member describes a use case with a real-time host and a real-time guest, measuring the results using cyclictest - would like to improve the cyclictest results
- audience member - 1000 microsecond deadline use case
- why use a VM at all? - requested by the speaker's customers, possibly for isolation
- Peter Zijlstra points out that we sometimes need to tell customers that they are wrong in their choices

Configuration:
- Jan starts to talk about how to configure systems for virtualization
  - a number of threads need to be configured correctly (with various priorities)
  - mentions some difficulties where he needed to properly prioritize RCU threads
- Paul mentions the need to confine them to CPU cores
- Jan asks how to figure out which threads require which priorities, what is involved, and so on - basically saying that we require proper tooling
- Thomas, with his typical sense of humour, says that real-time is hard, of course, and it should be - otherwise everyone would use it :)
- Thomas points out that you require not just the typical kernel hacking, but also QEMU hacking.
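Jan's point about configuring per-thread priorities and affinities can be sketched with the standard Linux scheduling syscalls. This is only an illustration of the mechanism, not the tooling being asked for: the CPU number and the SCHED_FIFO priority value are assumptions, and choosing the right relative priorities for vCPU, IRQ, and RCU threads is exactly the hard part discussed above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>

/* Pin a thread (tid 0 = calling thread) to a single CPU, as one might
 * do for each KVM vCPU thread or IRQ thread on a PREEMPT_RT host. */
static int pin_to_cpu(pid_t tid, int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

/* Give a thread a SCHED_FIFO priority (1..99).  The value used is
 * purely illustrative; requires CAP_SYS_NICE/root to succeed. */
static int set_fifo_priority(pid_t tid, int prio)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = prio;
    return sched_setscheduler(tid, SCHED_FIFO, &sp);
}
```

In practice the same effect is usually achieved from the shell with chrt and taskset on the thread IDs that QEMU/KVM creates.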
- Nicholas asks what size of systems would be targeted for this virtualization approach
- Thomas talks about difficult use cases in which people want both fully isolated systems and still the ability to do things like add CPUs on the fly
  - one of the use cases that people dream of is reliability, for, say, cloud computing
- Jan points out that the real-time aspects are usually local, since we don't have real-time network responses
- Thomas asks the audience what their target is for virtualization
  - do you want a more dynamic type of configuration, in which you can use different numbers of CPUs and so on each day, or more isolated configurations?
- one of the smaller problems is to improve KVM (rt) performance
- Nicholas asks what is going on with the tooling right now
  - many of the current tools are not aimed at the rt use cases
- Jan - any kind of tracing; configuring interrupt management properly
  - proper priorities and affinities for interrupts are important; just making these kinds of things more configurable is important
- both Jan and Thomas emphasize the need for good documentation
- Thomas says you need to understand the whole software stack in order to make good use of the tools
- there is probably a need at some point for tools targeted at users who are newer to these topics (real-time and virtualization) and lack the deeper understanding

Topic 2 - Long-Term Testing
Speaker: Carsten Emde - Manager of the Open Source Automation Development Lab (OSADL)
- the scope of OSADL has been extended to embedded systems and more general systems
- (note that Carsten has some slides available that you should refer to before the discussion portion of the talk)
- discusses testing with identical hardware and applications, and various kernel versions, down to particular patches
- uptime recordings, leading to bugfixes
- Carsten asks how many of us have visited the OSADL website and had a look at the QA farm, with its measurements.
- note: there is a large variety of processors, old and new, and various architectures that are tested; single outliers can be detected in billions of operations
- Thomas points out that we don't have the capacity to examine the large amount of data that is produced by OSADL - analysing the crashes and the rt-performance outliers. We require someone to analyse them and produce bug reports. We don't necessarily require a kernel expert, but someone who is fairly knowledgeable, of course.
- Carsten says that anyone can access the data and analyse it
  - perhaps in the future OSADL could hire a technician to analyse the data, but there is no particular person who currently does this (other than Carsten himself, who spends up to two hours a day doing so)
  - one or two hours a day is enough to maintain the farm, but not for the deeper analysis. Not an insurmountable problem.
- Paul McKenney asks what the farm's favourite kernel version is - by looking at the worst outliers, for example
- Carsten says that they monitor systems running various kernels: 1/3 of the systems run kernels requested by customers, 2/3 run kernels chosen by OSADL
  - v3.12-rt is highly used, some v3.14-rt
  - Carsten finds it interesting to examine both current kernels and older versions - not too far back, v3.x-rt
- also in the slides, Carsten explains how to find the "hall of shame" on the OSADL website - information on kernels that have crashed. An important question is to discover which bugs are rt bugs and which are vanilla upstream bugs uncovered by rt testing.
- Thomas asks how to connect all of this testing in the long run to other testing efforts
- audience member points out the need to do things such as bisect problems
- Thomas asks what to do with LONG-TERM testing, though, as opposed to a simple kernel crash
- Paul asks if there are other efforts that do long-term testing lasting, say, a year or more.
- Carsten points out that we see very few bug reports on the rt-linux mailing list; we don't have a tradition of people doing quality bug-report testing. Should we use bugzilla?
- Thomas is of the opinion that we should not - it would just become a dumping ground of not-necessarily-useful information
- Paul points out that rt often makes mainline bugs more obvious
- Carsten gives an example where a BeagleBoard crashed with rt every 4 to 6 hours; he then tried to reproduce the bug with a mainline kernel and was able to do so, but it took more like 48 hours to reproduce
- Thomas is surprised that enterprise doesn't do more long-term testing
- (John) - our testing tends to be more like 24- or 48-hour testing; we actually do week-long testing as well
- Carsten points out that certain problems are not uncovered unless you do this long-term kind of testing
- Carsten points out that these are sometimes economic decisions

Topic 3: Current state of full dynticks (nohz_full)
Speaker: Frederic Weisbecker
Note: slides are also available (refer to them for the pre-discussion portion)

Discussion:
- [fill in here, Thomas' use case for rcu_nocb]
- HPC people care about keeping the cache busy - workload dependent
- bmouring: one use case (similar to the bare-metal networking case) is a hard-polling, 100%-use rtprio task preventing a core from taking care of housekeeping tasks
  - results from customers of ours who grew accustomed to being able to do this on a single-mode RTOS - basically from-app (userspace on Linux) polling
- Paul says that HPC people often use 10 HZ
- the run-queue lock serializes the world
- the work to isolate CPUs for NO_HZ has also been useful for isolating CPUs for power management
- Peter Zijlstra points out there is high overhead for this, for example on system calls
  - some of the overhead is in the measurement - is it possible to have less accurate measurements to gain speed?
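For reference, the CPU isolation being discussed is configured at boot time. A typical kernel command-line fragment for a 4-CPU machine that keeps CPU 0 for housekeeping might look like the following (the CPU numbers are illustrative, and CONFIG_NO_HZ_FULL must be enabled in the kernel):

```
isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3
```

nohz_full= stops the tick on the listed CPUs when they run a single task, and rcu_nocbs= offloads their RCU callbacks to housekeeping CPUs.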
- Thomas points out that you can update the information in the scheduler, avoiding the system-call overhead of doing this in user space
- Kevin Hilman comments on the power-management use case:
  - all of the work for NOHZ_FULL has made NOHZ_IDLE even better for PM use cases, without having to turn on NOHZ_FULL and incur the overhead
  - the overhead of NOHZ_FULL is the accounting that is needed for the context switch between userspace and the kernel
  - the idle case (simple NOHZ) does not require this, but the offloading from the CPU that is used by NOHZ_FULL allows for the same offloading from idle CPUs
  - because of all the isolation work, CPUs can stay idle for even longer
- could SMT be used to record the housekeeping?
- use case: running networking on bare metal
  - e.g. some CPUs running Linux, other CPUs dedicated to a bare-metal networking task, polling
  - goal: use full NO_HZ to be able to run those as userspace threads
- detecting idle states in some power domain (a system in a cluster, for example)
- desire to use NO_HZ with virtualization, but this requires some work, for example KVM patches
- Jan talks about exposing the native APIC to the virtual machine
- Thomas says that the HPET should not be used for anything critical if not necessary
- Amit Kucheria: could we have a kernel API requesting that a CPU not be used?
  - useful to wean mobile vendors off CPU hotplug
  - potentially useful for thermal management
  - currently we don't solve this - timers and workqueues might still disturb the CPU

Topic 4: Following mainline with PREEMPT_RT
Speaker: Sebastian Siewior (slides available as usual)
- Nicholas asks whether some of the well-defined scenarios in mainline code that are problematic for rt could be written as Coccinelle scanners
  - Thomas has tried this, but found that it was not possible in practice; e.g. trylocks need to be examined on a case-by-case basis
- Thomas says that locks such as local_irq_save act as BKL-style locks, in that they protect "random crap".
RT changes this by naming the locks, offering a named scope. This annotation mechanism could be mainlined.
- what if we wanted to protect two data structures? We don't want to call local_irq_save twice. Answer: we could come up with some kind of nested locking scheme.
- Thomas asks Sebastian if there is anything he would do differently - for example in mainline - that would make the rt effort easier
  - answer: consistent annotation and notation in the mainline kernel
- Carsten asks how much kernel testing is done with mainline preemption
  - would it help if more mainline kernel testers / distros tested with mainline preemption enabled?
  - enterprise usually chooses voluntary preemption
- do mainline kernel developers cooperate or provide resistance when rt programmers point out code that causes problems for rt?
  - mostly they are cooperative, but sometimes there are conflicting goals - scratching out the last bit of performance vs. latency
  - often there are mutual benefits, increased annotation, etc.

Topic: Read-write semaphore bottlenecks
Speaker: Steven Rostedt (only the 5 slides are reproduced below)

Read-write semaphores - slide 1
- allow multiple readers but only one writer
- they are fair locks: new readers will block if a writer is blocked

Read-write semaphores - slide 2
- real-time converts them to a simple mutex
  - serializes readers; mainline can run them in parallel
  - affects various workloads drastically
- note, mainline can be forced to serialize readers if a writer is blocked - remember, they are fair locks

Read-write semaphores - slide 3
- biggest culprit for performance issues - mmap_sem
  - page faults
  - lots of threads (Java!)
- Peter Zijlstra has worked to avoid taking mmap_sem on page faults
- there may be other areas where rwsems are bad

Read-write semaphores - slide 4
- priority inheritance is hard
  - doing PI for multiple tasks is even harder - it was done before and was really complex
- tried to keep the fast path - use cmpxchg() to grab the lock quickly when uncontended
  - "train wreck!" (Thomas Gleixner quote)

Read-write semaphores - slide 5
- revisit priority inheritance
- forget the fast path (rwsems suck anyway)
  - greatly simplifies the algorithm
  - all must take the internal spinlock before taking the lock
- still complex, but reasonable

Discussion:
- Thomas: if you boost multiple readers, then this is not a limited mechanism
  - tree-like fanout in the priority-boosting path (which can be deep)
  - notes that the patch isn't as bad as it was before
- Steve says you could limit it
- Peter says you could add a WARN_ON for recursive cases
- Paul says you could boost serially - that is, one reader at a time instead of all readers at the same time
  - Steve says that is already the case - one writer has to get all of the readers out of the way first
  - Thomas elaborates on the serial version, where you just take one reader at a time from the reader list
  - this is the case on a single processor anyway; potentially on SMP, if you're lucky, you might get the multiple readers running on individual CPUs
- the performance decreases we see with rt are usually in the non-rt workloads
- Steve asks if anyone knows of scenarios in which non-rt workloads decrease dramatically with an rt kernel
  - Carsten says multiple-channel I/O slows down applications on multicore machines

Topic: SIL2LinuxMP - GNU/Linux multicore platform for safety-related systems
Speaker: Nicholas Mc Guire

Discussion:
- Nicholas makes the point that what we are checking is not the final code, but the process that creates the code
  - this keeps "random crap" out of the code
- Nicholas calls the process of discussing what works and what doesn't "safety culture" -
which Linux does in effect have
- Paul: what would be required to make this happen (safety certification)?
- Nicholas: we need to convince a certification authority about our process, select a version, and certify it

Topic: Avoiding some of the mmap_sem usage
Speaker: Peter Zijlstra
- mmap_sem is mostly gone from various places such as futexes
- currently a lot of work is being done on page faults to remove mmap_sem usage
- idea: do an RCU-like lookup on the VMA tree to avoid use of the semaphore
- Paul mentions that there has been some work showing substantial speed-ups, even in mainline, from removing this
- question: are there any problems with speculative lookup?
  - Peter: he maintains a counter to check whether to invalidate the lookups; the wrap-around occurs relatively infrequently, on the order of every 4 hours
- need to look at removing or reducing mmap_sem for memory compaction during a THP page fault (non-trivial)
- Peter measured a 124% performance increase with a scientific benchmark (probably from SGI), in mainline (non-rt)

Topic: RT status report
Speaker: Thomas Gleixner
- discussed ways to reduce complexity in the rt patch by pushing various parts upstream
- Thomas tried to do the simple thing of doing one thing at a time, but ended up with more than ten parts with interdependencies, which makes it hard to push various components separately upstream
- locking is difficult for rt due to the unnamed locking mentioned earlier, which acts somewhat like the BKL
  - hard to understand what locks are protecting
- people were asking how they can help
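Peter's counter-based invalidation for speculative VMA lookup resembles the kernel's seqcount pattern. A hypothetical, heavily simplified single-writer sketch (all names invented for illustration; the real work involves the VMA tree, not a single value):

```c
#include <stdatomic.h>

/* Readers sample a generation counter, do the unlocked lookup, and
 * retry if a writer bumped the counter in the meantime.  An odd
 * counter value means a write is in progress.  Single writer assumed. */

static atomic_uint gen;
static int protected_value;      /* stands in for the VMA tree */

static void writer_update(int v)
{
    atomic_fetch_add(&gen, 1);   /* counter now odd: writer active */
    protected_value = v;
    atomic_fetch_add(&gen, 1);   /* counter even again: stable */
}

static int speculative_read(void)
{
    unsigned start;
    int v;

    do {
        /* spin past an in-progress write, then sample the generation */
        while ((start = atomic_load(&gen)) & 1)
            ;
        v = protected_value;
    } while (atomic_load(&gen) != start);   /* writer raced us: retry */
    return v;
}
```

The "wrap-around every 4 hours or so" mentioned above corresponds to the generation counter overflowing, which is when stale speculative state must be discarded wholesale.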
- difficult to assign jobs to people; that's not the way kernel work is done
  - find something you find interesting, something that's small; send patches; listen to reviews; etc.
  - the important thing is to read the rt patch and try to understand it
- a few things were disabled due to complex locking; the normal approach is to disable a feature, apply the patch, enable small pieces, and watch what explodes
- Thomas gave a status report on Monday at the RTLWS
- the successful part of rt: it is possible to turn Linux (a general-purpose OS) into a real-time OS
  - won out over other approaches
  - various questions and attempts in the past to do things like budgeting
  - long list of successful code that got into the mainline kernel, such as high-resolution timers, priority inheritance, generic IRQs
  - the lock dependency validator removed many bugs in locking in the mainline kernel
- the disaster side: the US Navy [which contributed a lot of the initial money] had a clause in their contract that there would be a best effort to move the work upstream (good), but not everything went upstream
- the monetary part of rt has dried up
  - rt is in use in various products, but there doesn't seem to be industry interest in contributing funding for further development
  - non-rt users - large companies like Apache, and companies that provide services - understand the value of supporting Linux
  - rt use cases - embedded spaces, automobiles, automation - place more value on the hardware they are building
  - nobody expects a single company to pay for everything, but it would be nice if a lot of small companies would come together to contribute funding
  - Thomas describes this as a Mikado situation, where whoever moves first loses (they pay)
- Thomas has put the project on "hobby status"
- there are efforts at organizations such as the Linux Foundation and OSADL to continue the work, find funding, etc.
- The longer this drags out the harder it will be to restart it because the gap between mainline and rt will widen. - New mainline technologies are often harmful to rt, and when the patch is out of tree, it's more difficult to identify these. - Linus has no problem to integrate various bits and pieces if they are well done. Needs to see long term interest in maintaining this though. Q: is the hard part just finding maintainence? A: No, the hard part is getting it integrated in the first place, the 2nd hard part is finding long term maintainence. Getting it integrated requires clean-ups of both the rt-patch, and of the mainline code. Thomas: typically how we've gotten rt code in the mainline previously was to give Linus something he wanted, along with our code. Carsten: step one is to find funding to get to the level of being able to just create the next kernel versions (typically even numbered kernels). - points out that the more we upstream code, the lower the maintaince effort is, but never zero Q: is CII (Core Infrastructure Initiative) something we could join? A: possibly - How many members do we need to sustain this? Carsten (OSADL), was hoping to have 100 members, and 2 dedicated engineers. - the number is not insurmountable. - what kind of benefit can people get by giving money to OSADL such as training? - Carsten answers, legal advice, access to QA farm, free entrance to conferences, marketing services, networking of employees, etc - Peter Z: the more you contribute to these "gimmicks", the less money there is for funding rt, development - Carsten: not necessarily, because we pay once for the services, but the more people contribute the easier this is, every new member benefits from work that they provide once