File and Storage Systems Microconference Notes

Welcome to Linux Plumbers Conference 2014. Shameless plug for upcoming storage conferences:
Vault: first ever open source storage conference, in Boston in March - http://events.linuxfoundation.org/events/vault
Linux Kernel File, Storage and Memory Management Summit in Boston that same week - invitation only, watch the upstream lists for email about submitting a request to attend / the call for ideas.

Chris Mason: Filesystems status and update

Btrfs status update: heavy development, lots of code is getting in. Stability enhancements, especially in the utilities. Important features in development: sub-pagesize support in btrfs and online deduplication from Oracle. Out-of-band deduplication is already possible with existing tools. Memory requirements of online dedup: the lookup table needs to be kept in main memory; Jens Axboe did some experimenting here with Bloom filters (a sketch of the idea follows this session's notes). Chris is working on RAID4/5/6 improvements. Persistent memory is being kept in mind, optimizing btrfs for it.

ext4 improvements: bigalloc updates, allowing allocation in clusters of several blocks at once. Includes inline data (storing small files' data inside the inode, together with the metadata). Also implementing data checksumming and data block encryption; the patches are in the unstable branch of the ext4 tree. Work is going on to support reflinks and block allocation optimized for specific types of devices (SSD, SMR, etc.). Encryption will be per file, with a per-directory policy.

Filesystems are more and more becoming a bottleneck on fast devices, especially since the blk-mq/scsi-mq changes. This is especially visible with large filesystems and lots of files. The problem is that most filesystem code was written for single-spindle devices, with no real care or updates for modern, fast devices. Matthew Wilcox complained about DAX missing from the ext4 feature list, but that is really updated infrastructure. (DAX is a port of the old XIP code to directly addressable persistent memory, bypassing the block layer altogether; see the mmap() sketch below.)

Specific use-cases for btrfs: using btrfs on embedded devices, with the possibility of using snapshots for rollbacks. Exposing btrfs features via Samba; the btrfs header file is still GPLv2 but is now being re-licensed to LGPL courtesy of Oracle. For use-cases where filesystem access goes over the network (SMB/CIFS, NFS), exposing reflinks, compression and snapshots brings a huge benefit to SMB/CIFS, closing the feature gap to Microsoft. As most filesystems have checksumming, is enabling DIX/DIF still useful? (Stable pages have _not_ been mentioned.)

Overlayfs is actively being worked on, so it might get merged eventually. One problem is that pages will be shared in the page cache. That is a problem with O_DIRECT, as source and target could then be the same page. Also, O_DIRECT and mmap() do not play well together, tending to return random data or simply deadlock (illustrated in the last sketch below).
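For readers unfamiliar with the Bloom-filter idea mentioned in the dedup discussion: a small bit array plus a few hash functions can answer "definitely not seen before" without touching the full lookup table, so only possible duplicates pay for the expensive table lookup. Below is a minimal illustrative sketch, not Jens Axboe's actual experiment; the filter size, hash function and checksum input are arbitrary assumptions.

```c
/*
 * Toy Bloom filter guarding a (not shown) dedup lookup table.
 * "false" from bloom_maybe_contains() means the block checksum was
 * certainly never added, so the full table lookup can be skipped.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOOM_BITS (1u << 20)            /* 1 Mbit filter, arbitrary */

static uint8_t bloom[BLOOM_BITS / 8];

/* FNV-1a with a per-call seed; two hash functions suffice for a sketch */
static uint32_t hash(const void *data, size_t len, uint32_t seed)
{
        const uint8_t *p = data;
        uint32_t h = 2166136261u ^ seed;

        while (len--)
                h = (h ^ *p++) * 16777619u;
        return h % BLOOM_BITS;
}

static void bloom_add(const void *csum, size_t len)
{
        uint32_t h1 = hash(csum, len, 0), h2 = hash(csum, len, 0x9e3779b9);

        bloom[h1 / 8] |= 1 << (h1 % 8);
        bloom[h2 / 8] |= 1 << (h2 % 8);
}

static int bloom_maybe_contains(const void *csum, size_t len)
{
        uint32_t h1 = hash(csum, len, 0), h2 = hash(csum, len, 0x9e3779b9);

        return ((bloom[h1 / 8] >> (h1 % 8)) & 1) &&
               ((bloom[h2 / 8] >> (h2 % 8)) & 1);
}

int main(void)
{
        const char *csum = "example-block-checksum";

        if (!bloom_maybe_contains(csum, strlen(csum)))
                printf("new block: write it, remember its checksum\n");
        bloom_add(csum, strlen(csum));
        if (bloom_maybe_contains(csum, strlen(csum)))
                printf("possible duplicate: verify against dedup table\n");
        return 0;
}
```

The payoff is that the common case (a brand-new block) is decided from a compact in-memory structure, which is exactly the memory-requirement problem mentioned above.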
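For context on the DAX remark: from userspace, DAX looks like an ordinary mmap(); the difference is that loads and stores then hit persistent memory directly, with no page cache copy and no block layer in between. A minimal sketch, assuming a DAX-capable filesystem mounted at the hypothetical path /mnt/pmem:

```c
/*
 * Plain POSIX code that runs on any filesystem; only the (assumed)
 * DAX mount makes the accesses "direct".
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/pmem/data", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, 4096) < 0)
                return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* On a DAX mount this store goes straight to the medium... */
        strcpy(p, "hello, persistent memory");

        /* ...and msync() flushes CPU caches rather than dirty pages. */
        msync(p, 4096, MS_SYNC);

        munmap(p, 4096);
        close(fd);
        return 0;
}
```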
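And a concrete illustration of the O_DIRECT/mmap problem noted for overlayfs: here the destination buffer of a direct read is a mapping of the very file region being read, so source and target are the same page. This is a deliberately pathological sketch (the file name is made up); depending on kernel and filesystem it can return random data or deadlock.

```c
#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_RDWR | O_DIRECT);
        if (fd < 0)
                return 1;

        /* Map the first page of the file, shared and writable. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /*
         * O_DIRECT read into the mapping of the same file region:
         * the page is both the I/O target and the data source.
         */
        read(fd, p, 4096);
        return 0;
}
```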
Matias Bjorling, Jesper Madsen, Javier Gonzalez: Support for Open-Channel SSDs and flash-agnostic APIs

Introducing a new type of SSD: everyone is using SSDs, but it's hard to get predictable performance out of them. The idea is to remove or disable the FTL, so that the hardware details can be used by the host directly. They are implementing a library to manage the FTL functions; a nice side effect is that SMR changes will directly apply here too. As this will be a library, it can also be exposed to applications, giving them a convenient way of using SSDs directly. Vendors are beginning to allow host integration; Samsung is working towards host-assisted garbage collection. Customers include Baidu, who are building a key-value store on top of flash, and IIT Madras, who are building a supercomputer. Initially it'll be firmware-based, which voids the vendor warranty. The e-MMC drivers have the same issues, and so far nobody has managed to get generic garbage collection upstream. There is a tool for measurement of latency (Insert name here, poke Bart). The idea here is not to replace the entire FTL, but just to tweak the existing FTL to run better. Flash wear is hard to manage; a soft ECC error should give the upper layers a hint, but it's very vendor-specific. The NVMe standard allows piggy-backing additional information onto I/O, which would allow attaching wear information directly. OSD would be an idea (and in fact a prototype has been implemented), but filesystems have a better-understood usage model.

Lukas Czerner: How to evaluate block allocator changes

xfstests is being used as a testbed; it was originally a regression test suite, typically used to implement test cases based on existing customer issues. It is now being updated to implement I/O performance regression tests, but it's unclear which test cases / loads should be run or implemented. How can we evaluate changes in the block allocator? This affects not only new files, but also file aging. Ric Wheeler wrote fs_mark some time ago, but that only creates new files, so the typical filesystem aging from creating, updating and deleting files is not exercised (a minimal aging-workload sketch follows these notes). Someone at 1&1 (a German hosting company) wrote a blk-capture tool that allows replaying an existing I/O workload; however, this would just test the block layer, not the filesystem itself. Ted mentioned that SNIA has plenty of filesystem traces, which should be usable for this.

But what are the figures of merit by which we decide that one layout is actually 'better' than another? There is no single 'best' layout for filesystems, and fragmentation does not necessarily incur a performance overhead. There is also the risk of optimizing for a specific workload instead of improving overall performance, so it would be best to come up with a comprehensive set of workloads covering a large enough user base. Testing against a RAM-backed device wouldn't help, as that would not take into account the latencies real hardware has. Facebook actually has several servers set apart to run long-term tests exercising the block allocator over time. Discussion of the interference between the FTL and the filesystem or RAID set. Ted ventured that it would be good to come up with a set of abstract measures to evaluate the state of a filesystem in a filesystem-independent manner, but that requires quite some research. An artificially aged filesystem might not represent the true state an aged filesystem would have in the real world. An alternative approach would be intercepting the I/O stream, but that means storing the data twice, and the capture store has to be faster than the original store. Just storing metadata wouldn't be enough, because the data could be compressible. Hannes asks for a general test set covering not just block access but also cache effects.
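To make the create/update/delete aging concrete, here is a minimal sketch of such a workload (the directory name, mix ratios and I/O sizes are arbitrary assumptions). After a run, the fragmentation of the surviving files can be inspected, e.g. with filefrag, to compare allocator changes:

```c
/*
 * Unlike a pure create benchmark, this mixes create/append, delete
 * and truncate, so the allocator has to cope with a churning
 * free-space map.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NFILES 1000
#define ROUNDS 100000

int main(void)
{
        char path[64], buf[4096] = { 0 };

        mkdir("aged", 0755);             /* working directory for the churn */
        srand(42);                       /* fixed seed: reproducible aging */

        for (long i = 0; i < ROUNDS; i++) {
                snprintf(path, sizeof(path), "aged/f%d", rand() % NFILES);
                int r = rand() % 10;

                if (r < 6) {             /* 60%: create file or append 4 KiB */
                        int fd = open(path,
                                      O_WRONLY | O_CREAT | O_APPEND, 0644);
                        if (fd >= 0) {
                                write(fd, buf, sizeof(buf));
                                close(fd);
                        }
                } else if (r < 9) {      /* 30%: delete */
                        unlink(path);
                } else {                 /* 10%: rewrite from scratch */
                        truncate(path, 0);
                }
        }
        return 0;
}
```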
Niels de Vos: FS-Cache for filesystems in userspace

Improving the performance of Gluster by running FS-Cache on the client, storing pages on local disk. The problem is how the cache should be associated with the volume. It depends on the kind of backend; one has to define the mapping, as FUSE needs information from the backing store, so extending FUSE might be necessary. This is under active development and experimentation; not all backends would benefit from FS-Cache. It's a partial implementation, not ready for prime time as of now. There is interest from Red Hat to implement it on modern / recent hardware. Chris Mason offers to help with testing. Question: read-write or read-only? Actual testing has been on heavy-read workloads. Question: is re-exporting the cache feasible? No! There is only little advantage if the re-exporting server sits close to the cluster server. Question: what is the actual difference between an in-kernel filesystem and FUSE here? It's a question of uniquely indexing the files; FUSE needs to pass a volume ID to FS-Cache as an index level.

Hannes Reinecke: SMR integration and how to overcome layering issues

Hannes refers to his presentation at LinuxCon and shows I/O patterns of btrfs in SSD mode while unpacking and patching a kernel tree. Chris wonders why allocation happens backwards; Hannes supposes that backward allocation happens before the first superblock. The characteristic of SMR drives is that the write head is broader than the track: overwriting needs empty space because the head overwrites neighbouring tracks; the net effect is denser recording. The drives are zoned, with zones marked as to whether sequential writing is preferred or required. This means more information is needed in the kernel: a new RB-tree in the SCSI stack, with access needed at various levels. How to handle this - another layer, or extending the request queue? Suggestion: attaching the RB-tree to the block device. Hannes prefers to have it in the request queue, as the chunk sectors are already in there. Ted recommends not exposing the RB-tree itself but an abstract interface, to retain flexibility and to optionally keep the zone information in memory (a hypothetical sketch of such an interface follows these notes). Ric opposes, because it reinvents the device mapper; optimally the filesystem would handle it. Question: is this comparable to an FTL? No: comparable to erase blocks. Chris refers to an intelligent device-mapper target to handle SMR drives; btrfs can't handle it all. Optimizing for edge cases: low latency. Ted is interested in exposing this information to userspace to enable SMR-aware databases. Ric reminds that the higher up the awareness is placed, the more infrastructure has to be changed. Question: do consumer devices see visible latencies? Yes: video recording. Ted refers to multiple video streams recorded in parallel with MythTV.
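To make Ted's "abstract interface instead of the raw RB-tree" suggestion concrete, here is a purely hypothetical sketch; none of these names existed in the kernel at the time, and the toy zone array stands in for whatever backing store (RB-tree, or an on-demand REPORT ZONES query) an implementation might choose:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;

enum zone_type {
        ZONE_CONVENTIONAL,       /* random writes allowed */
        ZONE_SEQ_PREFERRED,      /* sequential writes preferred */
        ZONE_SEQ_REQUIRED,       /* sequential writes required */
};

struct zone_info {
        sector_t start;          /* first sector of the zone */
        sector_t len;            /* zone length in sectors */
        sector_t write_pointer;  /* next sector a sequential zone accepts */
        enum zone_type type;
};

/* Toy backing store: two zones of 256 MiB (524288 sectors) each. */
static struct zone_info zones[] = {
        { 0,      524288, 0,      ZONE_CONVENTIONAL },
        { 524288, 524288, 524288, ZONE_SEQ_REQUIRED },
};

/* The abstract query: callers never learn how zones are stored. */
static bool query_zone(sector_t sector, struct zone_info *zi)
{
        for (size_t i = 0; i < sizeof(zones) / sizeof(zones[0]); i++) {
                if (sector >= zones[i].start &&
                    sector < zones[i].start + zones[i].len) {
                        *zi = zones[i];
                        return true;
                }
        }
        return false;
}

int main(void)
{
        struct zone_info zi;

        /* A writer checks whether an I/O at sector 600000 must land
         * exactly on the zone's write pointer. */
        if (query_zone(600000, &zi) && zi.type == ZONE_SEQ_REQUIRED)
                printf("sequential-only zone, write pointer at %llu\n",
                       (unsigned long long)zi.write_pointer);
        return 0;
}
```

The point is that filesystems or the I/O scheduler would program against the query call and never see the representation, which is the flexibility (and optional in-memory caching) Ted is after.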
Philipp Reisner: DRBD9 and drbdmanage

Philipp introduces DRBD: RAID1 over the network, with version 8 in the kernel. The new version 9 is approaching RC. New features: up to 31 mirrors and automatic promotion (just umount and remount is necessary), plus InfiniBand/RDMA transport. Another new feature: drbdmanage. Ric asks if libraries are available; Philipp agrees, but asks how to integrate with OpenStack, libvirt, OS installers and libstoragemgmt. Ric refers to twelve different storage management teams. drbdmanage abstracts allocating LVs and communicates between nodes via DRBD-shared block devices; the primary target is a Cinder driver for OpenStack. Question: is promotion triggered by every open of the device? No: multiple read-only openers are allowed. Hannes has audited udev and recommends tracing which udev rules open devices. Lukas advises not to integrate with Blivet; Ric recommends libstoragemgmt.

Arnd Bergmann: 64-bit time_t support in filesystems

Arnd reminds us that in 2038 any 32-bit system will break because time_t (seconds since 1970-01-01) wraps. He lists filesystems and their expiry dates, and points out odd ones like cifs, isofs, cramfs, etc. Signed 32-bit values will wrap in 2038, unsigned ones in 2106 (see the sketch at the end of these notes). Ted asks about time_t in syscalls. Arnd: in-kernel users are mostly fixed; the long-term aim is making 32-bit time_t optional; the hard part is converting libc without a flag day. Arnd refers to Debian. Ric questions whether new 32-bit hardware will still be shipped; Arnd refers to automobiles living more than 20 years.

Marian Marinov works with containers and capabilities; he asks for namespace-aware capabilities, especially for the immutable bit. James questions the use of capabilities in containers at all.

Benjamin LaHaise, multipath: the path checker does not detect lossy / noisy links. Hannes recommends switching the path-selection algorithm and using an advanced I/O scheduler, and reminds that in the end the I/O needs to be aborted, because error handling is crappy.

Calvin Owens has a patch to avoid taking space in the page cache for holes in sparse files, and asks for use-cases. (Matthew Wilcox: I had to step out for this session; the archetype use-case is sparse matrix manipulations where the entire matrix is mapped into memory. I'll contact Calvin about this.) Philipp refers to archiving tools using heuristics to recognise holes (a SEEK_HOLE/SEEK_DATA sketch follows at the end of these notes).

Michael Adam introduces new features in SMB/CIFS. Microsoft has released a new protocol version because of new use-cases: putting Hyper-V and SQL Server on an SMB share. New features: active-active clustering (as already existing in Samba, but implemented differently) and RDMA (a small protocol wrapping SMB, called SMB Direct); Samba might need support from the kernel. Michael asks for collaboration/advice regarding RDMA, because the libraries/drivers are not fork-safe and, more importantly, do not support fd-passing (see the SCM_RIGHTS sketch below for what fd-passing means here). Chris refers to Mellanox. Michael explains the current design with an RDMA/SMB-Direct proxy daemon (at first in user space, later possibly in-kernel), using a shared memory area with the responsible smbd. Hannes refers to Kay Sievers working on memory-passing for kdbus, but shared memory should be better than memory-passing, says Michael. Hannes proposes multi-threading instead of multi-process; Michael comments that changing all of Samba from the fork model to multi-threaded is not an easily feasible task.

Matthew Wilcox comments on persistent memory and experiments on different filesystems; btrfs is regarded as unsuitable. He refers to problems like memory mapping, RDMA and truncate: RDMA pins pages into hardware, but truncating changes addresses, and the same holds for O_DIRECT, splice, etc. Solution: disallow truncating for files under RDMA. How to know? Matthew refers to VMAs or counters. When will it be shipped? Matthew can't tell. James questions error reporting; Matthew refers to DIMM vendors, and errors possibly have to be handled! Chris refers to the kernel already doing memory remapping.
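The wrap dates from Arnd's session follow directly from the epoch arithmetic; a minimal sketch (run on a system whose native time_t is 64-bit, so both limits are representable):

```c
/*
 * A signed 32-bit time_t overflows 2^31 - 1 seconds after the 1970
 * epoch, an unsigned one 2^32 - 1 seconds after it.
 */
#include <stdio.h>
#include <time.h>

int main(void)
{
        time_t s32_max = 0x7fffffff;     /* signed 32-bit limit   */
        time_t u32_max = 0xffffffffUL;   /* unsigned 32-bit limit */
        char buf[64];

        strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", gmtime(&s32_max));
        printf("signed 32-bit time_t wraps after:   %s UTC\n", buf);

        strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", gmtime(&u32_max));
        printf("unsigned 32-bit time_t wraps after: %s UTC\n", buf);
        return 0;
}
```

This prints 2038-01-19 03:14:07 and 2106-02-07 06:28:15 as the last representable seconds.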
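On the hole-detection heuristics Philipp mentioned: since Linux 3.1, lseek(2) supports SEEK_DATA and SEEK_HOLE, which report a sparse file's layout directly, so archivers need no heuristics. A minimal sketch (the file name is an assumption) that prints a file's data extents:

```c
#define _GNU_SOURCE              /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("sparse.img", O_RDONLY);
        if (fd < 0)
                return 1;

        off_t end = lseek(fd, 0, SEEK_END);
        off_t pos = 0;

        while (pos < end) {
                /* Next byte that is actually backed by data... */
                off_t data = lseek(fd, pos, SEEK_DATA);
                if (data < 0)
                        break;           /* nothing but holes remain */
                /* ...and the hole that ends that data extent. */
                off_t hole = lseek(fd, data, SEEK_HOLE);
                printf("data extent: %lld..%lld\n",
                       (long long)data, (long long)hole);
                pos = hole;
        }
        close(fd);
        return 0;
}
```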
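For readers unfamiliar with the fd-passing Michael misses in the RDMA libraries: ordinary file descriptors can be handed from one process to another over an AF_UNIX socket using SCM_RIGHTS ancillary data, as sketched below. RDMA resources cannot be passed this way, which is what pushes the design towards a proxy daemon sharing memory with the responsible smbd. This is generic Linux socket code, not Samba's implementation:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Send an open fd over a connected AF_UNIX socket. The receiver gets
 * a new descriptor for the same open file description via the
 * mirror-image recvmsg()/CMSG_DATA() sequence.
 */
int send_fd(int sock, int fd)
{
        char dummy = 'F';                /* must transfer >= 1 data byte */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = cbuf,
                .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_RIGHTS;      /* ancillary payload: an fd */
        cm->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```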