IOMMU and VFIO Microconference Notes

Welcome to Linux Plumbers Conference 2014. Please use this etherpad to take notes. Microconf leaders will be giving a summary of their microconference during the Friday afternoon closing session. Please remember there is no video this year, so your notes are the only record of your microconference.

Train strike info: Please note that Deutsche Bahn trains are going on strike starting at 2am Saturday, October 18th and will remain on strike through Monday, October 20th. Should this affect your travel plans, below please find alternate transportation options. For more information, please visit: http://www.spiegel.de/reise/aktuell/bahnstreik-am-wochenende-das-muessen-sie-wissen-a-997663.html

Speaker 1: Consolidating IOMMU Drivers (Joerg Roedel)
- Differences are visible through the IOMMU-API
  - Semantics of assigning & deassigning devices differ: on deassign, some drivers leave the device connected to no domain, some connect it to a default domain
  - Introduce a 'default domain' (one for each group of devices)
  - Audience: different system implementations may require each IOMMU to behave differently
- Common DMA-API implementation
  - Current x86 drivers have their own implementations
  - WillD: working on a common DMA-API implementation for ARM64
    - There is a common one for ARM32, but it is not suitable as-is for x86 IOMMUs
  - Want a single implementation for all IOMMU drivers
    - Requires an extension to the IOMMU-API (iommu_commit: IOTLB flush)
    - Should suit both ARM & x86
    - Requires consistent behavior across IOMMU drivers
- Factor out common code
  - device->IOMMU relationship [...]
- Doing it step by step
  - Idea is to handle the transition of drivers to the new IOMMU core incrementally
  - Add feature reporting to IOMMU drivers
    - By reporting features (or the lack of them), drivers can delegate functionality to the core
    - Allows introducing the core changes first and then adapting drivers step by step

Discussion:
WillD: what tracks the dirty parts of the page tables?
Joerg: the drivers
WillD: isn't that a bit messy? [...]
Joerg: with AMD we don't have this problem since it's all flushed
WillD: the reason this exists is that iommu_map() tends to be called in a loop, so some context has to be preserved across calls to iommu_map(). But perhaps it is better to simply create an iommu_map() variant that takes an entire set of areas to map, so it can all be done in one go?
Laura: QCOM hardware has a requirement to do an extra TLB flush on both map and unmap - how will this be supported in the refactored common code?
Joerg: AMD hardware has to support flush on unmap anyway - so yes
DavidW: yes
WillD: has an implementation of some of this with callback hooks - will send it to the list
Joerg: will it work cross-architecture?
WillD: yes
Audience 3: some hardware supports multiple address spaces for one device
Joerg: this needs an extension of the IOMMU API, which we will cover in a later talk
Q: Is it possible to declare that any driver that is not converted to the new core within two years will be dropped?
A: would prefer not to

Speaker 2: IOMMU Fault Handling (David Woodhouse)
(no slides, just discussion)
We have some PCI error handling implemented for Power by Linas Vepstas et al - for electrical errors, bus errors, etc.
Proposal is to make this generic (in struct device) -- not necessarily page fault handling, but the two may have common parts
Q: What does the user interface look like to the e1000 driver?
A: e1000 doesn't really have this
Q: what about platform_bus and other non-PCI buses? Expose it through VFIO
Plan is not to do what Power does - i.e., Power isolates the device on the first offense
Right now it's possible to die in interrupt storms from faults on devices which we cannot reset
- discussion at the previous day's BoF mentioned extending/adding a flag to the DMA interface so a driver can indicate to the IOMMU how the device can be stopped (Master Enable, reset, ...)
Q: there are just so many things that can go wrong (e.g., bit flips), isn't it best to just reset the device when a fault occurs?
-- device reset errors are the road block here... :(
Alex Graf: On Power, errors are just forwarded to the guest - could do this through VFIO - but how do you reset the devices via platform_bus?
WillD: how does the guest know that you need to do this?
Alex Graf: same way as on real hardware
dwmw2: How many IOMMU implementations have the ability to shut up about a device?
WillD: ARM has this
Stuart Yoder: so does Freescale
Alex Graf: Option: map the device to a single page when faulting, to avoid a fault flood
Arnd: what happens if the page table is shared by multiple devices? - then one cannot do this
dwmw2: If a device was disabled because there was no driver for it, perhaps it can be re-enabled on the first dma_map*() call for that device
WillD: an IOMMU could detect an ECC error - or an error between the device and the memory - one could isolate the device, but ideally one would log it and tell the guest
Alex Graf: this boils down to having something in the guest that would receive those errors - so why not just implement this in an SMMU driver in the guest that receives those errors? - Power has pv_iommu that does this

Speaker 3: Exposing a virtual IOMMU interface to KVM guests (Will Deacon)
http://www.willdeacon.ukfsn.org/bitbucket/lpc-14/vsmmu-lpc14.pdf
The problem:
- a KVM host can use an IOMMU for device passthrough to a guest
- but the guest would also like to use the IOMMU for DMA and userspace I/O
- how to do this without paravirtualisation, i.e., to allow unmodified guests?
ARM SMMUv2 supports two translation stages:
1. VA (guest addr) -> IPA (host VA)
2. IPA -> physical address (host PA)
Implementation in KVM desired
Discussion points:
- Do other IOMMUs support this?
- Can the kvm-vfio code be made reusable?
- What about non-PCI devices? e.g. platform devices, non-discoverable buses
  - due to KVM PCI emulation
- What about I/O page faults?
- Error handling
- Sharing CPU page tables at both stages
- RID/SID/DID mapping
- Complex I/O topologies
Alex Graf: on Power, we have nesting, but everything just goes to the host
WillD: so it detects that it's being virtualised, and asks for a virtualized SID
Audience 2: what about x86, does it have a second stage?
JesseB: it does in the latest VT-d spec for SVM, but it may not be possible to use without paravirtualization
dwmw2: currently we don't handle the VM nesting case, due to ambiguity about which level should do the gpa->hpa translation
LaurentP: Renesas hardware has a 2nd stage also
WillD: would be nice if we shared the ioctls at least
Joerg: how would the guest set this up?
WillD: done in KVM; stage 2 is not shared there; there is a VFIO ioctl also
WillD: ARM was not designed with nested virtualization as a use case
WillD: If one wants to do this in userspace, there can be issues - it needs hardware access - and would be a lot slower without hardware access
WillD: the number of contexts is determined at KVM runtime by counting the number of devices
Arnd: but what about PCI buses?
WillD: simply have a domain for the host controller
Arnd: need to be able to give one PCI device to a guest
WillD: that would work at the moment as long as they are in different groups - I don't see the point of having more context banks than groups
Arnd: you need one vSMMU register set per guest
Stuart Yoder: we want to pass SMMU access into userspace
WillD: stage 2 provides per-VM isolation - stage 1 potentially could put them into the same address space if it wanted to
WillD: the guest can see a very simple I/O topology, but to the host it can be quite complex

Speaker 4: VFIO and non-DMA devices (Stuart Yoder, Bharat Bhushan)
How to assign devices that do not do DMA to user space?
Not yet implemented - just for discussion
Why?
- some embedded SoC virtualization setups need hardware assigned to KVM VMs
  - GPIO
  - Flash
  - UART
  - [...]
- These are platform_devices; none do DMA, but they are otherwise normal devices
Arnd: vfio-platform was introduced to support platform devices doing DMA
Stuart: it will ignore any device that does not route to an IOMMU
Stuart: would a UART PCI card with no DMA work fine?
Audience 5: yes
Audience 5: on x86, we have the same thing: HPET, ...
Why VFIO?
- the alternative is UIO
  - requires a custom kernel driver for each device
    Arnd: isn't this needed for VFIO also?
  - User space: QEMU would need new infrastructure
  - Different workflow for unbinding devices from the host and binding them to UIO
- The needed infrastructure already exists for vfio-platform
Audience 6: Couldn't you just convert one into the other?
Alex Graf: how would you do this?
- why not say this is a device that is not wired to the memory bus at all?
Audience 6: Couldn't you convert UIO devices into VFIO?
Alex Graf: yes you can, but not all of them
- VFIO requires being able to disable a device interrupt via the interrupt controller
- UIO requires something else (driver ISR?)
Proposal:
- VFIO needs to know if it can safely allow a device to bind to it
- When a device is bound to vfio, if it has the no-dma property, then it is 'safe' and a new file is created in /dev/vfio
    echo serial8250.0 > /sys/bus/platform/driver/vfio/bind
- the device name is used for the vfio "group" file
- Only use the "group" ioctls in the app:
    open("/dev/vfio/serial8250.0", O_RDWR)
Arnd: the problem is that VFIO is built around PCI - can't assign platform devices - platform_devices don't have any structure - the vfio-platform patch sets won't be merged - the IOMMU is not your problem - the question is, what kind of devices do you add, and how do you solve the hard problems? - e.g., power management
Alex Graf: talked about this at last year's Plumbers - need a host-level driver that handles this
WillD: what if you need to drain transactions from the controller?
- on PCI, this is done with a magic config space write
AlexW: Looks like going with Jan's proposal to use UIO (cleaned up/modified/improved) for this case would be a better solution
- the alternative is mangling VFIO to remove the IOMMU restrictions (which UIO doesn't have); the main purpose of VFIO is IOMMU management

Speaker 5: IOMMU Page Faulting and Linux MM Integration (Joerg Roedel)
- hardware available in the AMD IOMMU -- IOMMUv2
  - newer Radeon GPUs in APUs
- extension module built into the AMD IOMMU driver
  - implements the page-fault loop for devices
- pending MMU notifier extension (fixes an outstanding issue)
- share MMU page tables with PCI devices
  -- PPC has this support & a synchronization interface with the IOMMU (a different, arch-specific MMU interface)
- new version when 3.18-rc1 is out
- flushing the IOTLB with respect to MMU notifiers needs improvement
  -- Joerg has a patch to add invalidate_range_notifier()
  -- WillD: ARM has support in hardware for this situation
  -- AMD hardware maintains the write/dirty bits once shared
Future:
- Intel SVM is spec'd in VT-d
- other architectures?
- the current AMD-only code has to be made generic for other users
- Revisit PASID allocation/handling?

Speaker 6: Handling device identity mappings in the IOMMU API (Alex Williamson)
- Intel VT-d: RMRRs (via ACPI tables)
  - supposed to be of limited use, for legacy purposes
  - but an RMRR dump on boot on some systems shows they are heavily used by PCI devices (Smart Array, BCM5719, ...)
- we exclude RMRR devices from device assignment
  - but users want to assign these devices to guests while part of the device is using the RMRR
DavidW: Don't support device assignment for this configuration... it's just broken architecture
Alex, et al.: there is an easy way for the guest to write into RMRR space & corrupt the system
Some exceptions exist today because we know how they work: USB
Expected solution: whitelist devices (beyond the well-known USB case) that are known to stop using RMRR space after the OS takes control of the device.
Soln: for the bad cases, use a separate PCI ID & use that ID in the RMRR

Speaker 7: Device Page Faults vs IOMMU Page Faults (Jerome Glisse)
Mirror addresses with software or hardware:
- hw: AMD IOMMUv2 with PASID & ATS
- sw: Jerome is working on it
HMM (heterogeneous memory management): aims to provide a single API, with core code tied to the MM
New DMA paradigm for devices mirroring a process address space:
- DMA mappings refer to an address range rather than the page containing the data
- An address range lasts longer than the pages backing it
- [alloc|free]_directory(), update_directory()
- looking for a single flush after mapping/unmapping
- only for cache-coherent mappings
Beyond: share the IOMMU page table directory across devices; report errors back to HMM and to the device; more?

(pwsan) (ddutile)