Restricting path name lookup with openat2()
Looking up a file given a path name seems like a straightforward task, but it turns out to be one of the more complex things the kernel does. Things get more complicated if one is trying to write robust (user-space) code that can do the right thing with paths that are controlled by a potentially hostile user. Attempts to make the open() and openat() system calls safer date back at least to an attempt to add O_BENEATH in 2014, but numerous problems remain. Aleksa Sarai, who has been working in this area for a while, has now concluded that a new version of openat(), naturally called openat2(), is required to truly solve this problem.
The immediate purpose behind openat2() is to allow a program to safely open a path that is possibly under the control of an attacker; in practice, that means placing restrictions on how the lookup process will be carried out. Past attempts have centered around adding new flags to openat(), but there are a couple of problems with that approach: openat() doesn't check for unknown flags, and the number of available bits for new flags is not large. The failure to check for unknown flags is a well-known antipattern. A program using a path-restricting flag needs to know whether the requested behavior is understood by the kernel or not; the alternative is to accept security vulnerabilities on kernels that do not implement those flags.
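The difference between ignoring and rejecting unknown flags can be modeled in a few lines of user-space C. This is a toy sketch, not kernel code: the TOY_* flag names and values, the TOY_KNOWN_FLAGS mask, and the toy_open_lenient(), toy_open_strict(), and toy_flag_accepted() functions are all hypothetical stand-ins invented for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; these are not real O_* values. */
#define TOY_FLAG_A       (1u << 0)
#define TOY_FLAG_B       (1u << 1)
#define TOY_FLAG_BENEATH (1u << 2)  /* a "new" flag this toy kernel predates */
#define TOY_KNOWN_FLAGS  (TOY_FLAG_A | TOY_FLAG_B)

typedef int (*toy_open_fn)(uint32_t flags);

/* openat()-style behavior: unknown bits are silently ignored. */
static int toy_open_lenient(uint32_t flags)
{
    (void)flags;
    return 0;
}

/* openat2()-style behavior: any unknown bit is an error. */
static int toy_open_strict(uint32_t flags)
{
    if (flags & ~TOY_KNOWN_FLAGS)
        return -EINVAL;
    return 0;
}

/* Probe: does the call appear to accept this flag? Returns 1 or 0. */
static int toy_flag_accepted(toy_open_fn open_fn, uint32_t flag)
{
    return open_fn(flag) != -EINVAL;
}
```

In this model, probing the lenient call with TOY_FLAG_BENEATH reports that the flag was accepted even though nothing enforces it, which is exactly the situation a security-sensitive caller cannot tolerate. The strict call lets the caller detect the missing feature and fall back or refuse to proceed.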
Changing openat() to return errors when passed unknown flags is not an option; that would almost certainly break existing code. So the only alternative is to create a new system call that does check its flags; thus openat2():
    struct open_how {
        __u32 flags;
        union {
            __u16 mode;
            __u16 upgrade_mask;
        };
        __u16 resolve;
        __u64 reserved[7]; /* must be zeroed */
    };

    int openat2(int dfd, const char *filename, const struct open_how *how);
(Note that, as always, this is how the system call is presented at the user-space boundary; C libraries could expose something different.)
Rather than add another argument for the new flags, Sarai opted to move all of the behavior-affecting information into a separate structure. Among other things, this means that, unlike open() and openat(), the new system call always has the same number of arguments. The flags field holds the same O_ flags understood by open() and openat(), but unknown flags in this field will result in an error. Similarly, the mode field contains the permission bits used when a new file is created.
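As a rough user-space illustration of what the structure looks like to a caller, the sketch below mirrors the layout shown above using fixed-width types from <stdint.h>. The names open_how_sketch and check_reserved_sketch() are hypothetical, and the zero check is only a model of the "must be zeroed" requirement noted in the structure's comment, not the kernel's implementation. The layout is that of the patch set as posted and may differ from whatever is eventually merged.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the structure from the patch set; the _sketch name is illustrative. */
struct open_how_sketch {
    uint32_t flags;
    union {
        uint16_t mode;
        uint16_t upgrade_mask;
    };
    uint16_t resolve;
    uint64_t reserved[7]; /* must be zeroed */
};

/* 4 + 2 + 2 + 7 * 8 = 64 bytes, with no padding between members. */
_Static_assert(sizeof(struct open_how_sketch) == 64,
               "unexpected open_how_sketch size");

/* Model of the zero check: 0 if all reserved words are zero, else -EINVAL. */
static int check_reserved_sketch(const struct open_how_sketch *how)
{
    size_t n = sizeof(how->reserved) / sizeof(how->reserved[0]);

    for (size_t i = 0; i < n; i++)
        if (how->reserved[i] != 0)
            return -EINVAL;
    return 0;
}
```

A caller that fills the structure with a designated initializer, such as `struct open_how_sketch how = { .flags = 0, .mode = 0644 };`, gets every member it did not name set to zero by the C language rules, so the reserved words pass the check without an explicit memset().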
The resolve field, instead, contains a new set of flags that control how pathname lookup will be performed. The flags implemented in the patch set are:
- RESOLVE_NO_XDEV: The lookup process will not be allowed to cross any mount points (including bind mounts). In other words, the file to be opened must reside on the same mount as the dfd descriptor (or the current working directory if dfd is passed as AT_FDCWD).
- RESOLVE_NO_MAGICLINKS: There are relatively few types of objects that can be found in a filesystem directory; they include regular files, directories, devices, FIFOs, and symbolic links. With this option, Sarai is in essence acknowledging that there is another type that has been lurking in plain sight for years: the "magic link". Examples include the links found in /proc/PID/fd directories; they are implemented by the kernel and have some possibly surprising properties.
  The presence of this flag will prevent a path lookup operation from traversing through one of these magic links, thus blocking (for example) attempts to escape from a container via a /proc entry for an open file descriptor.
- RESOLVE_NO_SYMLINKS: Blocks any traversal through symbolic links, including magic links. This option differs from the O_NOFOLLOW flag in that it prevents following a link at any point in the lookup process, while O_NOFOLLOW only applies to the last component in the path.
- RESOLVE_BENEATH: The lookup process is contained within the directory tree below the starting point; attempts to use components like "../" to escape that tree will generate an error.
- RESOLVE_IN_ROOT: This flag causes the lookup process to behave as if a chroot() to the starting point had been performed. Absolute paths will begin relative to the starting directory, and "../" will not proceed above that directory. Some work has been done to make RESOLVE_IN_ROOT free of some of the race conditions that plague chroot(); see this changelog for some details.
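The contrast between RESOLVE_BENEATH and RESOLVE_IN_ROOT in the list above can be sketched with a purely lexical toy model. It is only an approximation: the kernel applies these rules during real path resolution, where symbolic links, mounts, and concurrent renames matter, and none of that is modeled here. The toy_mode enum and toy_walk() function are hypothetical names, and rejecting absolute paths in the "beneath" mode is an assumption of this sketch (the text above only mentions "../" components).

```c
#include <string.h>

enum toy_mode { TOY_BENEATH, TOY_IN_ROOT };

/*
 * Walk `path` one component at a time, tracking the depth below the
 * starting directory. Returns the final depth (0 means the starting
 * directory itself), or -1 if TOY_BENEATH would reject the path.
 * Purely lexical: symbolic links and mount points are not modeled.
 */
static int toy_walk(const char *path, enum toy_mode mode)
{
    int depth = 0;
    const char *p = path;

    /* Assumption: an absolute path counts as an escape in "beneath" mode.
     * In "in root" mode, "/" simply means the starting directory. */
    if (*p == '/' && mode == TOY_BENEATH)
        return -1;

    while (*p) {
        while (*p == '/')           /* skip separators */
            p++;
        if (!*p)
            break;

        size_t len = strcspn(p, "/");

        if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (depth > 0)
                depth--;
            else if (mode == TOY_BENEATH)
                return -1;          /* would climb above the start */
            /* TOY_IN_ROOT: ".." at the top stays at the top */
        } else if (!(len == 1 && p[0] == '.')) {
            depth++;                /* ordinary component */
        }
        p += len;
    }
    return depth;
}
```

Under this model, "a/../b" is accepted in both modes, while "../x" is an error for TOY_BENEATH but resolves to a depth-one entry for TOY_IN_ROOT, matching the chroot()-like behavior described above.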
The issue of magic links appears a few times in this patch set. The RESOLVE_NO_MAGICLINKS option prevents them from being traversed (or opened), but it turns out that there are numerous cases where it is indeed useful to open such links. The problem is that allowing that to happen can be dangerous; the runc container breakout vulnerability reported in February was the result of hostile code using the /proc/PID/exe link to open the runc binary for write access. That can make for leaky containers, to put it mildly.
The first patch in the series changes the semantics of open() (in all variants) when magic links are involved: while previously the permission bits on the magic link itself were ignored, now they are taken into account. Then, for example, the permissions for /proc/PID/exe are changed in the kernel to disallow opening for write access, blocking the runc breakout attack.
This change enables one other feature provided by openat2() (and, indeed, openat() as well). There is a new flag (O_EMPTYPATH) that causes the path argument to be ignored; instead, the call will simply reopen the file descriptor passed as the dfd argument using the new mode provided. A common use case is to reopen a file descriptor initially opened with O_PATH to gain access to file contents or metadata — access which is otherwise not possible with an O_PATH descriptor (see the man page for details on O_PATH). Programs have typically done this sort of reopening using a path into /proc/PID/fd, but O_EMPTYPATH will work even if /proc is not available.
Finally, the new API also allows the placement of limits on how a file descriptor created with O_PATH can be "upgraded" as described above. When openat2() is used to open an O_PATH file descriptor, the upgrade_mask field in the how structure can be used to limit the access that can be obtained by reopening in the future. Specifying UPGRADE_NOREAD will prevent reopening with read access, and UPGRADE_NOWRITE will prevent the acquisition of write access. This restriction can limit the damage should a hostile program obtain access to an O_PATH file descriptor.
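The effect of the upgrade mask can be modeled as a simple check applied when a descriptor is reopened. The sketch below is a user-space toy: the article names UPGRADE_NOREAD and UPGRADE_NOWRITE but does not give their values, so the TOY_UPGRADE_* constants are made-up placeholders, and toy_reopen_allowed() is a hypothetical helper rather than anything in the patch set.

```c
#include <fcntl.h>

/* Placeholder bit values; the real constants' values are not given above. */
#define TOY_UPGRADE_NOREAD  0x1u
#define TOY_UPGRADE_NOWRITE 0x2u

/*
 * Returns 1 if reopening with the access mode in `flags` is permitted
 * under `upgrade_mask`, 0 otherwise.
 */
static int toy_reopen_allowed(unsigned int upgrade_mask, int flags)
{
    int acc = flags & O_ACCMODE;
    int wants_read  = (acc == O_RDONLY || acc == O_RDWR);
    int wants_write = (acc == O_WRONLY || acc == O_RDWR);

    if (wants_read && (upgrade_mask & TOY_UPGRADE_NOREAD))
        return 0;
    if (wants_write && (upgrade_mask & TOY_UPGRADE_NOWRITE))
        return 0;
    return 1;
}
```

With a mask of TOY_UPGRADE_NOWRITE, for example, a later reopen with O_RDONLY passes the check while O_WRONLY and O_RDWR do not.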
Previous versions of this patch set have generated a fair amount of
discussion. The relative quiet after this posting may reflect the fact
that most of the concerns raised have been addressed over time — or
possibly just that the people who would comment on it are attending the
Linux Security Summit and falling behind on email. Either way, there is a
clear demand for the ability to restrict how file path names are traversed
so, sooner or later, some version of this patch set seems likely to find
its way into the mainline.
Restricting path name lookup with openat2()
Posted Aug 22, 2019 21:10 UTC (Thu) by brauner (subscriber, #109349) [Link]
I'm very happy that the precedent we set with clone3() in using dedicated and more easily extensible structures for syscall arguments is seeing more acceptance. The flag argument would usually have been placed in a register, making the syscall hard to extend once we're out of flags. Yes, there's a problem for seccomp with these syscalls since it can't currently filter pointer arguments, but hopefully we will enable seccomp to do this after we have had time to discuss this at KSummit:
https://lists.linuxfoundation.org/pipermail/ksummit-discu...
Restricting path name lookup with openat2()
Posted Aug 22, 2019 22:21 UTC (Thu) by roc (subscriber, #30627) [Link]
> With a solution looming, now my mind turns to "how do we write filters
that check argument data?" Can this be done sanely with cBPF or are we
finally to requiring eBPF?
If unprivileged eBPF is dead then it would be hugely problematic to require eBPF for seccomp filters that check argument data.
Restricting path name lookup with openat2()
Posted Aug 22, 2019 23:53 UTC (Thu) by sbaugh (subscriber, #103291) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 5:29 UTC (Fri) by epa (subscriber, #39769) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 13:17 UTC (Fri) by sbaugh (subscriber, #103291) [Link]
Restricting path name lookup with openat2()
Posted Aug 24, 2019 14:53 UTC (Sat) by dezgeg (subscriber, #92243) [Link]
If you are injecting a syscall via ptrace(), don't you need to allocate memory for the syscall instruction itself anyway? Not to mention that for openat() you will need memory just for the path string itself...
Restricting path name lookup with openat2()
Posted Aug 26, 2019 19:29 UTC (Mon) by sbaugh (subscriber, #103291) [Link]
Most trivially, when the stack hasn't been allocated yet. Or when it's fixed in size. Of course you wouldn't establish a signal handler in this environment anyway.
> If you are injecting a syscall via ptrace(), don't you need to allocate memory for the syscall instruction itself anyway? Not to mention that for openat() you will need memory just for the path string itself...
As one example, you could be using the existing glibc syscall() function, and use a path string that already exists in the target program's address space. That's a somewhat unusual example, but it's a situation where this requirement for allocating more memory for the argument struct is clearly limiting.
Restricting path name lookup with openat2()
Posted Sep 6, 2019 7:45 UTC (Fri) by polyp (guest, #53146) [Link]
Restricting path name lookup with openat2()
Posted Aug 26, 2019 11:26 UTC (Mon) by scientes (guest, #83068) [Link]
Ummmm, you can just use the last flag as a "look in this other register for more flags", kind of the way fcntl works.......
How disappointing!
Posted Aug 23, 2019 4:16 UTC (Fri) by felixfix (subscriber, #242) [Link]
How disappointing!
Posted Aug 23, 2019 19:15 UTC (Fri) by k8to (guest, #15413) [Link]
(For others not aware, May fourth is a Star Wars pun/joke date ("May the Force...").)
How disappointing!
Posted Aug 23, 2019 22:56 UTC (Fri) by Beolach (guest, #77384) [Link]
How disappointing!
Posted Aug 23, 2019 23:28 UTC (Fri) by felixfix (subscriber, #242) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 4:55 UTC (Fri) by epa (subscriber, #39769) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 12:00 UTC (Fri) by cyphar (subscriber, #110703) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 12:38 UTC (Fri) by Paf (subscriber, #91811) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 16:50 UTC (Fri) by NYKevin (subscriber, #129325) [Link]
The only way an attacker can go after that is to 1) ptrace you or 2) already be running code in your process. If either of those is the case, then you've already lost.
Restricting path name lookup with openat2()
Posted Aug 23, 2019 20:05 UTC (Fri) by epa (subscriber, #39769) [Link]
Restricting path name lookup with openat2()
Posted Aug 24, 2019 5:40 UTC (Sat) by cyphar (subscriber, #110703) [Link]
This isn't all a hypothetical -- my first draft of the syscall did just add a new argument, and I discovered pretty quickly (while writing the selftests) that it was abysmal to actually use that interface. The fact that C zeroes out structs when you do designated initialisation makes using structs so much more straightforward here. All of that being said, I'm not married to the current interface at all. If the only concern people have with the patches is what the syscall looks like, I'm more than happy to change it.
Restricting path name lookup with openat2()
Posted Aug 24, 2019 8:00 UTC (Sat) by epa (subscriber, #39769) [Link]
Restricting path name lookup with openat2()
Posted Aug 24, 2019 8:26 UTC (Sat) by cyphar (subscriber, #110703) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 6:21 UTC (Fri) by pr1268 (subscriber, #24648) [Link]
__u64 reserved[7]; /* must be zeroed */
Why so much empty reserved space? Requiring the use of a struct with 56 empty bytes simply for "reserve" seems unusual.
I'm sure there's a good reason, not just for having the reserve space, but also for having so much of it, but I didn't glean that from the article.
Restricting path name lookup with openat2()
Posted Aug 23, 2019 11:57 UTC (Fri) by cyphar (subscriber, #110703) [Link]
* Though, the struct being 64-bytes overall does mean it fits in one cache-line. That's not a good argument for it to be that big, but it does mean that making it any bigger than 64 bytes would be a much worse idea. I figured that I might as well make it as big as reasonable and see what other ideas people came up with.
Restricting path name lookup with openat2()
Posted Aug 23, 2019 22:21 UTC (Fri) by josh (subscriber, #17465) [Link]
Restricting path name lookup with openat2()
Posted Aug 24, 2019 5:54 UTC (Sat) by buck (subscriber, #55985) [Link]
Restricting path name lookup with openat2()
Posted Aug 24, 2019 20:05 UTC (Sat) by quotemstr (subscriber, #45331) [Link]
Restricting path name lookup with openat2()
Posted Jan 9, 2021 7:55 UTC (Sat) by Serentty (guest, #132335) [Link]
Restricting path name lookup with openat2()
Posted Feb 3, 2021 5:50 UTC (Wed) by cyphar (subscriber, #110703) [Link]
clone3 and openat2 use the same design for extension, and Christian and I gave a talk about this design at Linux Plumbers last year. You're commenting on a thread which is almost 2 years old. :P
Restricting path name lookup with openat2()
Posted Aug 23, 2019 12:05 UTC (Fri) by cyphar (subscriber, #110703) [Link]
Restricting path name lookup with openat2()
Posted Aug 28, 2019 18:56 UTC (Wed) by nix (subscriber, #2304) [Link]
(In times past, I would have hoped that breaking real use cases could be fixed on a case-by-case basis by raising a bug and fixing the software, but in the current environment I just bet some people would say "Linux is not about choice" and demand that users stop using mount points in ways that break their software instead: after all, their laptop has only one big mount under / so your system should too. It seems best to me to try to stop this sort of thing from happening in the first place.)
Restricting path name lookup with openat2()
Posted Aug 29, 2019 3:44 UTC (Thu) by cyphar (subscriber, #110703) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 13:17 UTC (Fri) by walters (subscriber, #7396) [Link]
Restricting path name lookup with openat2()
Posted Aug 23, 2019 19:22 UTC (Fri) by k8to (guest, #15413) [Link]
I'm also kind of a newb on system call versioning. I've written code that tests for availability of system calls to decide what path to implement (which was a bit scary across the various architectures, but I got it right in the end), but I haven't written code to try to decide if a particular system call has a particular feature. Is there a better way than kernel version numbers?
Restricting path name lookup with openat2()
Posted Aug 24, 2019 4:46 UTC (Sat) by pr1268 (subscriber, #24648) [Link]
I was curious about that, too, but I didn't ask that in my previous question.
It would seem that (1) callers ensuring that reserved is zeroed, and (2) openat2 enforcing this zeroing would be really expensive.
Restricting path name lookup with openat2()
Posted Aug 24, 2019 5:23 UTC (Sat) by cyphar (subscriber, #110703) [Link]
Yes, openat2(2) checks that it's zeroed. It's not particularly expensive; there's an optimised function for doing checks like that in lib/ (memchr_inv). Zeroing structures and flag bits before passing them to the kernel is already a very common practice (as discussed in this LWN article).
Restricting path name lookup with openat2()
Posted Aug 24, 2019 5:28 UTC (Sat) by cyphar (subscriber, #110703) [Link]
Yes, it is checked (you get -EINVAL if it's non-zero). As for figuring out whether syscalls have particular features, the best way is to pass the flag and see if you get -EINVAL -- this is why checking whether there are unknown flags present and returning -EINVAL is important in syscall design. If you don't check whether unknown flags are passed, you end up with situations where userspace cannot easily figure out whether the flag is actually supported. open(2) doesn't do this check (which makes it significantly more complicated to figure out whether your kernel supports a particular open(2) feature), but in openat2(2) we do check whether there are unknown O_* flags present.
Restricting path name lookup with openat2()
Posted Aug 24, 2019 20:14 UTC (Sat) by quotemstr (subscriber, #45331) [Link]
Restricting path name lookup with openat2()
Posted Aug 25, 2019 14:51 UTC (Sun) by cyphar (subscriber, #110703) [Link]
All of that being said, I am gravitating towards not having reserved space. I don't have particularly strong opinions either way.
Restricting path name lookup with openat2()
Posted Aug 26, 2019 19:18 UTC (Mon) by k8to (guest, #15413) [Link]
Restricting path name lookup with openat2()
Posted Mar 30, 2020 16:09 UTC (Mon) by gb (subscriber, #58328) [Link]
Why not introduce flag 'generate error on unknown flags' and introduce new syscall once you really run out of flags not in advance?
Restricting path name lookup with openat2()
Posted Mar 30, 2020 23:14 UTC (Mon) by nybble41 (subscriber, #55106) [Link]
If you don't make unknown flags an error up front then the unused bits basically become permanently reserved. You can never safely *stop* ignoring them.
Restricting path name lookup with openat2()
Posted Mar 31, 2020 7:07 UTC (Tue) by scientes (guest, #83068) [Link]
Because that unknown flag could have been passed by a program before it was introduced....