IOMMU and VFIO Microconference Notes

Welcome to Linux Plumbers Conference 2014. Please use this etherpad to take notes. Microconf leaders will be giving a summary of their microconference during the Friday afternoon closing session. Please remember there is no video this year, so your notes are the only record of your microconference.

Train strike info: Please note that Deutsche Bahn trains are going on strike starting at 2am Saturday, October 18th and will remain on strike through Monday, October 20th. Should this affect your travel plans, below please find alternate transportation options. For more information, please visit: http://www.spiegel.de/reise/aktuell/bahnstreik-am-wochenende-das-muessen-sie-wissen-a-997663.html

Speaker 1: Consolidating IOMMU Drivers (Joerg Roedel)
- Differences are visible through the IOMMU-API
  - Semantics of assigning & deassigning devices differ: on deassign, some drivers leave the device connected to no domain, some connect it to a default domain
  - Introduce a 'default domain' (one for each group of devices)
  - Audience: different system implementations may require each IOMMU to behave differently
- Common DMA-API implementation
  - Current x86 drivers have their own implementations
  - WillD: working on a common DMA-API implementation for ARM64
    - There is a common one for ARM32, but it is not suitable as-is for x86 IOMMUs
  - Want a single implementation for all IOMMU drivers
    - Requires an extension to the IOMMU-API (iommu_commit: IOTLB flush)
    - Should suit both ARM & x86
    - Requires consistent behavior across IOMMU drivers
- Factor out common code
  - device->IOMMU relationship [...]
- Doing it step by step
  - Idea is to handle the transition of drivers to the new IOMMU core incrementally
  - Add feature reporting to IOMMU drivers
    - By reporting features (or the lack of them), drivers can delegate functionality to the core
    - Allows introducing the core changes first and then adapting drivers step by step

Discussion:
WillD: what tracks the dirty parts of the page tables?
Joerg: the drivers
WillD: isn't that a bit messy? [...]
Joerg: with AMD we don't have this problem since it's all flushed
WillD: the reason this exists is that iommu_map() tends to be called in a loop, so some context has to be preserved across calls to iommu_map(). But perhaps it is better to simply create an iommu_map() variant that takes an entire set of areas to map, so it can all be done in one go?
Laura: QCOM hardware has a requirement to do an extra TLB flush on both map and unmap - how will this be supported in the refactored common code?
Joerg: AMD hardware has to support flush on unmap anyway - so yes
DavidW: yes
WillD: has an implementation of some of this with callback hooks - will send it to the list
Joerg: will it work cross-architecture?
WillD: yes
Audience 3: some hardware supports multiple address spaces for one device
Joerg: this needs an extension of the IOMMU API, which we will cover in a later talk
Q: Is it possible to declare that any driver that is not converted to the new core within two years will be dropped?
A: would prefer not to

Speaker 2: IOMMU Fault Handling (David Woodhouse)
(no slides, just discussion)
We have some PCI error handling implemented for Power by Linas Vepstas et al - for electrical errors, bus errors, etc.
Proposal is to make this generic (in struct device) -- not necessarily page fault handling, but the two may have common parts
Q: What does the user interface look like to the e1000 driver?
A: e1000 doesn't really have this
Q: what about platform_bus and other non-PCI buses? Expose it through VFIO
Plan is not to do what Power does - i.e., Power isolates the device on the first offense
Right now it's possible to die in interrupt storms from faults on devices which we cannot reset
- discussion at the previous day's BoF mentioned extending/adding a flag to the DMA interface so a driver can indicate to the IOMMU how the device can be stopped (Master Enable, reset, ...)
Q: there are just so many things that can go wrong (e.g., bit flips), isn't it best to just reset the device when a fault occurs?
-- device reset errors are the road block here... :(
Alex Graf: On Power, errors are just forwarded to the guest - could do this through VFIO - but how do you reset the devices via platform_bus?
WillD: how does the guest know that you need to do this?
Alex Graf: same way as on real hardware
dwmw2: How many IOMMU implementations have the ability to shut up about a device?
WillD: ARM has this
Stuart Yoder: so does Freescale
Alex Graf: Option: map the device to a single page when faulting, to avoid a fault flood
Arnd: what happens if the page table is shared by multiple devices? - then one cannot do this
dwmw2: If a device was disabled because there was no driver for it, perhaps it can be re-enabled on the first dma_map*() call for that device
WillD: an IOMMU could detect an ECC error - or an error between the device and the memory - one could isolate the device, but ideally one would log it and tell the guest
Alex Graf: this boils down to having something in the guest that would receive those errors - so why not just implement this in an SMMU driver in the guest that receives those errors? - Power has pv_iommu that does this

Speaker 3: Exposing a virtual IOMMU interface to KVM guests (Will Deacon)
http://www.willdeacon.ukfsn.org/bitbucket/lpc-14/vsmmu-lpc14.pdf
The problem:
- a KVM host can use an IOMMU for device passthrough to a guest
- but the guest would also like to use the IOMMU for DMA and userspace I/O
- how to do this without paravirtualisation, i.e., to allow unmodified guests?
ARM SMMUv2 supports two translation stages:
1. VA (guest addr) -> IPA (host VA)
2. IPA -> physical address (host PA)
Implementation in KVM desired
Discussion points:
- Do other IOMMUs support this?
- Can the kvm-vfio code be made reusable?
- What about non-PCI devices? e.g. platform devices, non-discoverable buses
  - due to KVM PCI emulation
- What about I/O page faults?
- Error handling
- Sharing CPU page tables at both stages
- RID/SID/DID mapping
- Complex I/O topologies
Alex Graf: on Power, we have nesting, but everything just goes to the host
WillD: so it detects that it's being virtualised, and asks for a virtualized SID
Audience 2: what about x86, does it have a second stage?
JesseB: it does in the latest VT-d spec for SVM, but it may not be possible to use without paravirtualization
dwmw2: currently we don't handle the VM nesting case, due to ambiguity about which level should do the gpa->hpa translation
LaurentP: Renesas hardware has a 2nd stage also
WillD: would be nice if we shared the ioctls at least
Joerg: how would the guest set this up?
WillD: done in KVM; stage 2 is not shared there; there is a VFIO ioctl also
WillD: ARM was not designed with nested virtualization as a use case
WillD: If one wants to do this in userspace, there can be issues - it needs hardware access - and would be a lot slower without hardware access
WillD: the number of contexts is determined at KVM runtime by counting the number of devices
Arnd: but what about PCI buses?
WillD: simply have a domain for the host controller
Arnd: need to be able to give one PCI device to a guest
WillD: that would work at the moment as long as they are in different groups - I don't see the point of having more context banks than groups
Arnd: you need one vSMMU register set per guest
Stuart Yoder: we want to pass SMMU access into userspace
WillD: stage 2 provides per-VM isolation - stage 1 potentially could put them into the same address space if it wanted to
WillD: the guest can see a very simple I/O topology, but to the host it can be quite complex

Speaker 4: VFIO and non-DMA devices (Stuart Yoder, Bharat Bhushan)
How to assign devices that do not do DMA to user space?
Not yet implemented - just for discussion
Why?
- some embedded SoC virtualization setups need hardware assigned to KVM VMs
  - GPIO
  - Flash
  - UART
  - [...]
- These are platform_devices; none do DMA, but they are otherwise normal devices
Arnd: vfio-platform was introduced to support platform devices doing DMA
Stuart: it will ignore any device that does not route to an IOMMU
Stuart: would a UART PCI card with no DMA work fine?
Audience 5: yes
Audience 5: on x86, we have the same thing: HPET, ...
Why VFIO?
- the alternative is UIO
  - requires a custom kernel driver for each device
    Arnd: isn't this needed for VFIO also?
  - User space: QEMU would need new infrastructure
  - Different workflow for unbinding devices from the host and binding them to UIO
- The needed infrastructure already exists for vfio-platform
Audience 6: Couldn't you just convert one into the other?
Alex Graf: how would you do this?
- why not say this is a device that is not wired to the memory bus at all?
Audience 6: Couldn't you convert UIO devices into VFIO?
Alex Graf: yes you can, but not all of them
- VFIO requires being able to disable a device interrupt via the interrupt controller
- UIO requires something else (driver ISR?)
Proposal:
- VFIO needs to know if it can safely allow a device to bind to it
- When a device is bound to vfio, if it has the no-dma property, then it is 'safe' and a new file is created in /dev/vfio
    echo serial8250.0 > /sys/bus/platform/driver/vfio/bind
- the device name is used for the vfio "group" file
- Only use the "group" ioctls in the app:
    open("/dev/vfio/serial8250.0", O_RDWR)
Arnd: the problem is that VFIO is built around PCI - can't assign platform devices - platform_devices don't have any structure - the vfio-platform patch sets won't be merged - the IOMMU is not your problem - the question is, what kind of devices do you add, and how do you solve the hard problems? - e.g., power management
Alex Graf: talked about this at last year's Plumbers - need a host-level driver that handles this
WillD: what if you need to drain transactions from the controller?
- on PCI, this is done with a magic config space write
AlexW: Looks like going with Jan's proposal to use UIO (cleaned up/modified/improved) for this case would be a better solution
- the alternative is mangling VFIO to remove the IOMMU restrictions (which UIO doesn't have); the main purpose of VFIO is IOMMU management

Speaker 5: IOMMU Page Faulting and Linux MM Integration (Joerg Roedel)
- hardware available in the AMD IOMMU -- IOMMUv2
  - newer Radeon GPUs in APUs
- extension module built into the AMD IOMMU driver
  - implements the page-fault loop for devices
- pending MMU notifier extension (fixes an outstanding issue)
- share MMU page tables with PCI devices
  -- PPC has this support & a synchronization interface with the IOMMU (a different, arch-specific MMU interface)
- new version when 3.18-rc1 is out
- flushing the IOTLB with respect to MMU notifiers needs improvement
  -- Joerg has a patch to add invalidate_range_notifier()
  -- WillD: ARM has support in hardware for this situation
  -- AMD hardware maintains the write/dirty bits once shared
Future:
- Intel SVM is spec'd in VT-d
- other architectures?
- the current AMD-only code has to be made generic for other users
- Revisit PASID allocation/handling?

Speaker 6: Handling device identity mappings in the IOMMU API (Alex Williamson)
- Intel VT-d: RMRRs (via ACPI tables)
  - supposed to be of limited use, for legacy purposes
  - but an RMRR dump on boot on some systems shows they are heavily used by PCI devices (Smart Array, BCM5719, ...)
- we exclude RMRR devices from device assignment
  - but users want to assign these devices to guests while part of the device is using the RMRR
DavidW: Don't support device assignment for this configuration... it's just broken architecture
Alex, et al.: there is an easy way for the guest to write into RMRR space & corrupt the system
Some exceptions exist today because we know how they work: USB
Expected solution: whitelist devices (beyond the well-known USB case) that are known to stop using RMRR space after the OS takes control of the device.
Soln: for the bad cases, use a separate PCI ID & use that ID in the RMRR

Speaker 7: Device Page Faults vs IOMMU Page Faults (Jerome Glisse)
Mirror addresses with software or hardware:
- hw: AMD IOMMUv2 with PASID & ATS
- sw: Jerome is working on it
HMM (heterogeneous memory management): aims to provide a single API, with core code tied to the MM
New DMA paradigm for devices mirroring a process address space:
- DMA mappings refer to an address range rather than the page containing the data
- An address range lasts longer than the pages backing it
- [alloc|free]_directory(), update_directory()
- looking for a single flush after mapping/unmapping
- only for cache-coherent mappings
Beyond: share the IOMMU page table directory across devices; report errors back to HMM and to the device; more?

(pwsan) (ddutile)