On predicting predictors: hacking archive formats for fun and prophecy
We aim to inform you about the archive formats you use every day. We will include an in-depth look at the tar, ar, cpio, gzip, bzip2, and deb formats, as well as the internals of the Git object store. Armed with this information, we will show you a practical application: removing the redundancy between files in version control and distributions of source and binaries.
Existing projects like pristine-tar focus on finding the right options to the compression code to reproduce the file from the uncompressed data (“gzip -9 --rsyncable”), treating the file formats as magic black boxes. Our in-depth analysis of archive formats lets us record just enough information to reproduce any archive regardless of the tool used to produce it.
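As a small illustration of why the format details matter, the sketch below decodes the fixed 10-byte gzip member header described in RFC 1952 and shows that the same input can yield different compressed bytes when only a header field changes. It is a minimal example, not the method presented in the talk: the function name parse_gzip_header and the sample payload are hypothetical, and it assumes Python 3.8 or later for the mtime argument to gzip.compress.

```python
import gzip
import struct


def parse_gzip_header(blob: bytes) -> dict:
    """Decode the fixed 10-byte gzip member header (RFC 1952).

    Hypothetical helper for illustration only; it ignores the optional
    fields (extra, name, comment, header CRC) that may follow.
    """
    if len(blob) < 10:
        raise ValueError("too short for a gzip header")
    # Layout: ID1, ID2, CM, FLG (1 byte each), MTIME (4 bytes, little-endian),
    # XFL, OS (1 byte each).
    id1, id2, method, flags, mtime, xfl, os_code = struct.unpack(
        "<BBBBIBB", blob[:10]
    )
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    return {
        "method": method,
        "flags": flags,
        "mtime": mtime,
        "xfl": xfl,
        "os": os_code,
    }


# Sample data, chosen arbitrarily for the demonstration.
payload = b"hello, archive\n" * 100

# Same input and compression level; only the recorded timestamp differs.
a = gzip.compress(payload, compresslevel=9, mtime=0)
b = gzip.compress(payload, compresslevel=9, mtime=1234567890)

if __name__ == "__main__":
    print(parse_gzip_header(a))
    print(parse_gzip_header(b))
    print("identical bytes:", a == b)
    print("identical contents:", gzip.decompress(a) == gzip.decompress(b))
```

Both streams decompress to the same payload, yet they are not byte-identical, so reproducing an archive exactly requires recording header fields such as the timestamp in addition to the compression settings.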
archive, file formats, compression, pristine, tar, cpio, gzip, bzip2, ar, deb, git
Josh Triplett is a PhD student at Portland State University and a Free and Open Source Software hacker. Josh is involved in research on relativistic programming and advanced synchronization techniques for highly parallel systems. Josh builds and launches Linux-powered rockets with the Portland State Aerospace Society, and hacks on numerous other projects. Lately, Josh does a lot of his hacking in Haskell.