Restricting path name lookup with openat2()

By Jonathan Corbet
August 22, 2019

Looking up a file given a path name seems like a straightforward task, but it turns out to be one of the more complex things the kernel does. Things get more complicated if one is trying to write robust (user-space) code that can do the right thing with paths that are controlled by a potentially hostile user. Attempts to make the open() and openat() system calls safer date back at least to an attempt to add O_BENEATH in 2014, but numerous problems remain. Aleksa Sarai, who has been working in this area for a while, has now concluded that a new version of openat(), naturally called openat2(), is required to truly solve this problem.

The immediate purpose behind openat2() is to allow a program to safely open a path that is possibly under the control of an attacker; in practice, that means placing restrictions on how the lookup process will be carried out. Past attempts have centered around adding new flags to openat(), but there are a couple of problems with that approach: openat() doesn't check for unknown flags, and the number of available bits for new flags is not large. The failure to check for unknown flags is a well-known antipattern. A program using a path-restricting flag needs to know whether the requested behavior is understood by the kernel or not; the alternative is to accept security vulnerabilities on kernels that do not implement those flags.
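The difference between ignoring and rejecting unknown flags can be sketched in user space. What follows is a toy model, not kernel code; the FLAG_* constants, KNOWN_FLAGS, lenient_open(), and strict_open() are all hypothetical names invented for the illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits understood by this toy "kernel". */
#define FLAG_READ    0x1u
#define FLAG_WRITE   0x2u
#define KNOWN_FLAGS  (FLAG_READ | FLAG_WRITE)

/* A newer, security-relevant flag that this toy "kernel" predates. */
#define FLAG_BENEATH 0x4u

/* openat()-style behavior: unknown bits are silently ignored. */
static int lenient_open(uint32_t flags)
{
    (void)(flags & KNOWN_FLAGS);  /* the requested restriction is simply lost */
    return 0;                     /* "success", without the restriction */
}

/* openat2()-style behavior: unknown bits are an error. */
static int strict_open(uint32_t flags)
{
    if (flags & ~KNOWN_FLAGS)
        return -EINVAL;           /* the caller learns the flag is unsupported */
    return 0;
}
```

With the lenient variant, a caller passing FLAG_BENEATH gets a successful return and no protection; with the strict variant, the same call fails and the caller can fall back or refuse to proceed.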

Changing openat() to return errors when passed unknown flags is not an option; that would almost certainly break existing code. So the only alternative is to create a new system call that does check its flags; thus openat2():

    struct open_how {
        __u32 flags;
        union {
            __u16 mode;
            __u16 upgrade_mask;
        };
        __u16 resolve;
        __u64 reserved[7]; /* must be zeroed */
    };

    int openat2(int dfd, const char *filename, const struct open_how *how);

(Note that, as always, this is how the system call is presented at the user-space boundary; C libraries could expose something different.)

Rather than add another argument for the new flags, Sarai opted to move all of the behavior-affecting information into a separate structure. Among other things, this means that, unlike open() and openat(), the new system call always has the same number of arguments. The flags field holds the same O_ flags understood by open() and openat(), but unknown flags in this field will result in an error. Similarly, the mode field contains the permission bits used when a new file is created.
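For readers who want to experiment, note that the interface eventually merged (in Linux 5.6) differs from the one proposed here: its structure has three 64-bit fields (flags, mode, and resolve), drops upgrade_mask and the reserved array, and the system call takes the structure size as a fourth argument. A minimal sketch of calling that merged interface through syscall() might look like the following. The names struct open_how_v0, SKETCH_RESOLVE_BENEATH, and openat2_sketch() are made up for this example, and the fallback syscall number 437 is an assumption that holds on most, but not all, architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437  /* assumption: the number used on most architectures */
#endif

/* Local copy of the merged structure layout, so that no newer headers are needed. */
struct open_how_v0 {
    uint64_t flags;    /* O_* flags; unknown bits are rejected */
    uint64_t mode;     /* permission bits, only valid with O_CREAT/O_TMPFILE */
    uint64_t resolve;  /* RESOLVE_* flags */
};

/* Value of RESOLVE_BENEATH in the merged kernel ABI. */
#define SKETCH_RESOLVE_BENEATH 0x08

/* Hypothetical wrapper: returns a descriptor, or -1 with errno set. */
static int openat2_sketch(int dfd, const char *path,
                          uint64_t flags, uint64_t mode, uint64_t resolve)
{
    struct open_how_v0 how;

    memset(&how, 0, sizeof(how));  /* no stray bits anywhere in the struct */
    how.flags = flags;
    how.mode = mode;
    how.resolve = resolve;
    return (int)syscall(SYS_openat2, dfd, path, &how, sizeof(how));
}
```

A call such as openat2_sketch(AT_FDCWD, "..", O_RDONLY | O_DIRECTORY, 0, SKETCH_RESOLVE_BENEATH) is expected to fail on kernels that implement the call, since ".." would leave the starting directory; on older kernels the call fails with ENOSYS instead.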

The resolve field, instead, contains a new set of flags that control how pathname lookup will be performed. The flags implemented in the patch set are:

RESOLVE_NO_XDEV

The lookup process will not be allowed to cross any mount points (including bind mounts). In other words, the file to be opened must reside on the same mount as the dfd descriptor (or the current working directory if dfd is passed as AT_FDCWD).

RESOLVE_NO_MAGICLINKS

There are relatively few types of objects that can be found in a filesystem directory; they include regular files, directories, devices, FIFOs, and symbolic links. With this option, Sarai is in essence acknowledging that there is another type that has been lurking in plain sight for years: the "magic link". Examples include the links found in /proc/PID/fd directories; they are implemented by the kernel and have some possibly surprising properties.
The presence of this flag will prevent a path lookup operation from traversing through one of these magic links, thus blocking (for example) attempts to escape from a container via a /proc entry for an open file descriptor.

RESOLVE_NO_SYMLINKS

Blocks any traversal through symbolic links, including magic links. This option differs from the O_NOFOLLOW flag in that it prevents following a link at any point in the lookup process, while O_NOFOLLOW only applies to the last component in the path.

RESOLVE_BENEATH

The lookup process is contained within the directory tree below the starting point; attempts to use components like "../" to escape that tree will generate an error.

RESOLVE_IN_ROOT

This flag causes the lookup process to behave as if a chroot() to the starting point had been performed. Absolute paths will begin relative to the starting directory, and "../" will not proceed above that directory. Some work has been done to make RESOLVE_IN_ROOT free of some of the race conditions that plague chroot(); the changelog for that patch has some details.
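The limitation of O_NOFOLLOW mentioned under RESOLVE_NO_SYMLINKS can be observed with nothing but the existing open() call. The sketch below (open_nofollow_demo() is a name invented here, and the use of /tmp is an assumption about the environment) builds a small tree containing a directory symlink and then opens a path either through that symlink or ending at it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates  <tmp>/real/file  and the symlink  <tmp>/link -> real,
 * then opens, with O_NOFOLLOW, either
 *   <tmp>/link/file  (symlink_is_last == 0), or
 *   <tmp>/link       (symlink_is_last != 0).
 * Returns the open() result, or -2 if the setup failed.
 */
static int open_nofollow_demo(int symlink_is_last)
{
    char tmpl[] = "/tmp/nofollow-demo-XXXXXX";
    char real[PATH_MAX], file[PATH_MAX], lnk[PATH_MAX], target[PATH_MAX];
    int fd, saved;

    if (!mkdtemp(tmpl))
        return -2;
    snprintf(real, sizeof(real), "%s/real", tmpl);
    snprintf(file, sizeof(file), "%s/real/file", tmpl);
    snprintf(lnk, sizeof(lnk), "%s/link", tmpl);
    snprintf(target, sizeof(target), "%s/link%s", tmpl,
             symlink_is_last ? "" : "/file");

    if (mkdir(real, 0700) != 0)
        return -2;
    fd = open(file, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -2;
    close(fd);
    if (symlink("real", lnk) != 0)
        return -2;

    fd = open(target, O_RDONLY | O_NOFOLLOW);
    saved = errno;

    /* Clean up the temporary tree; an open descriptor stays valid. */
    unlink(file);
    unlink(lnk);
    rmdir(real);
    rmdir(tmpl);
    errno = saved;
    return fd;
}
```

On Linux, the first form succeeds even though the path passes through a symbolic link, while the second fails with ELOOP. Closing that gap is what RESOLVE_NO_SYMLINKS is for.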

The issue of magic links appears a few times in this patch set. The RESOLVE_NO_MAGICLINKS option prevents them from being traversed (or opened), but it turns out that there are numerous cases where it is indeed useful to open such links. The problem is that allowing that to happen can be dangerous; the runc container breakout vulnerability reported in February was the result of hostile code using the /proc/PID/exe link to open the runc binary for write access. That can make for leaky containers, to put it mildly.

The first patch in the series changes the semantics of open() (in all variants) when magic links are involved: while previously the permission bits on the magic link itself were ignored, now they are taken into account. Then, for example, the permissions for /proc/PID/exe are changed in the kernel to disallow opening for write access, blocking the runc breakout attack.

This change enables one other feature provided by openat2() (and, indeed, openat() as well). There is a new flag (O_EMPTYPATH) that causes the path argument to be ignored; instead, the call will simply reopen the file descriptor passed as the dfd argument using the new mode provided. A common use case is to reopen a file descriptor initially opened with O_PATH to gain access to file contents or metadata, access which is otherwise not possible with an O_PATH descriptor (see the open(2) man page for details on O_PATH). Programs have typically done this sort of reopening using a path into /proc/PID/fd, but O_EMPTYPATH will work even if /proc is not available.
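The traditional /proc-based reopening mentioned above can be sketched as follows. This is only an illustration of the existing technique, not of O_EMPTYPATH itself; reopen_via_proc() is a hypothetical helper name, and the code assumes that /proc is mounted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for O_PATH */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Reopen an existing descriptor (typically one opened with O_PATH) with
 * real access flags by going through its magic link in /proc/self/fd.
 * Returns a new descriptor, or -1 with errno set.
 */
static int reopen_via_proc(int path_fd, int flags)
{
    char procpath[64];

    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", path_fd);
    return open(procpath, flags);
}
```

A caller might obtain a handle with open("/", O_PATH | O_DIRECTORY) and later upgrade it with reopen_via_proc(fd, O_RDONLY | O_DIRECTORY). Since this path goes through a magic link, it is exactly the kind of lookup that the permission changes described above are meant to constrain.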

Finally, the new API also allows the placement of limits on how a file descriptor created with O_PATH can be "upgraded" as described above. When openat2() is used to open an O_PATH file descriptor, the upgrade_mask field in the how structure can be used to limit the access that can be obtained by reopening in the future. Specifying UPGRADE_NOREAD will prevent reopening with read access, and UPGRADE_NOWRITE will prevent the acquisition of write access. This restriction can limit the damage should a hostile program obtain access to an O_PATH file descriptor.

Previous versions of this patch set have generated a fair amount of discussion. The relative quiet after this posting may reflect the fact that most of the concerns raised have been addressed over time — or possibly just that the people who would comment on it are attending the Linux Security Summit and falling behind on email. Either way, there is a clear demand for the ability to restrict how file path names are traversed so, sooner or later, some version of this patch set seems likely to find its way into the mainline.

Restricting path name lookup with openat2()

Posted Aug 22, 2019 21:10 UTC (Thu) by brauner (subscriber, #109349) [Link]

I just mentioned this patchset in the "Making Containers Safer" talk we gave at LSS.

I'm very happy that the precedent we set with clone3() in using dedicated and more easily extensible structures in syscall arguments is seeing more acceptance. The flag argument would usually have been placed in a register, making the syscall hard to extend once we're out of flags. Yes, there's a problem for seccomp with these syscalls since it can't filter pointer arguments currently, but hopefully we will enable seccomp to do this after we have had time to discuss this at KSummit:
https://lists.linuxfoundation.org/pipermail/ksummit-discu...

Restricting path name lookup with openat2()

Posted Aug 22, 2019 22:21 UTC (Thu) by roc (subscriber, #30627) [Link]

Hmm, that led me to Kees Cook's message: https://lists.linuxfoundation.org/pipermail/ksummit-discu...
> With a solution looming, now my mind turns to "how do we write filters
> that check argument data?" Can this be done sanely with cBPF or are we
> finally to requiring eBPF?

If unprivileged eBPF is dead then it would be hugely problematic to require eBPF for seccomp filters that check argument data.

Restricting path name lookup with openat2()

Posted Aug 22, 2019 23:53 UTC (Thu) by sbaugh (subscriber, #103291) [Link]

I'm not a fan of using a structure in memory to pass arguments that would fit in registers. Besides the very important seccomp issues, I also find it distasteful to have to allocate memory (stack allocation is allocation!) just to pass arguments to the kernel. That makes it more difficult and less efficient to use these syscalls in any case where memory is not so easily allocatable, including in early program startup, assembly, or when ptracing.

Restricting path name lookup with openat2()

Posted Aug 23, 2019 5:29 UTC (Fri) by epa (subscriber, #39769) [Link]

In single-threaded code can’t you have a single statically allocated struct you use for every openat2()?

Restricting path name lookup with openat2()

Posted Aug 23, 2019 13:17 UTC (Fri) by sbaugh (subscriber, #103291) [Link]

A statically allocated region for arguments works in some single threaded environments, but not in early program start, when ptracing, or, of course, any multi-threaded environments; that last one is enough to rule it out IMO.

Restricting path name lookup with openat2()

Posted Aug 24, 2019 14:53 UTC (Sat) by dezgeg (subscriber, #92243) [Link]

How is stack memory not 'easily allocatable' in early program startup or assembly? Unless you have called sigaltstack() or blocked all signals, a signal sent to the process can cause stack allocation for the signal handler at any time.

If you are injecting a syscall via ptrace(), don't you need to allocate memory for the syscall instruction itself anyway? Not to mention for openat() you will need memory just for the path string itself...

Restricting path name lookup with openat2()

Posted Aug 26, 2019 19:29 UTC (Mon) by sbaugh (subscriber, #103291) [Link]

>How is stack memory not 'easily allocatable' in early program startup or assembly?

Most trivially, when the stack hasn't been allocated yet. Or when it's fixed in size. Of course you wouldn't establish a signal handler in this environment anyway.

>If you are injecting a syscall via ptrace(), don't you need to allocate memory for the syscall instruction itself anyway? Not to mention for openat() you will need memory just for the path string itself...

As one example, you could be using the existing glibc syscall() function, and use a path string that already exists in the target program's address space. That's a somewhat unusual example, but it's a situation where this requirement for allocating more memory for the argument struct is clearly limiting.

Restricting path name lookup with openat2()

Posted Sep 6, 2019 7:45 UTC (Fri) by polyp (guest, #53146) [Link]

The system call overhead is probably hundreds of CPU cycles so why worry about maybe 5 cycles for allocating and populating a small constant-size structure on the stack? IMHO the gains in extensibility clearly outweigh the performance hit.

Restricting path name lookup with openat2()

Posted Aug 26, 2019 11:26 UTC (Mon) by scientes (guest, #83068) [Link]

> The flag argument would usually have been placed in a register making the syscall hard to extend once we're out of flags.

Ummmm, you can just use the last flag as a "look in this other register for more flags", kind of the way fcntl works.......

How disappointing!

Posted Aug 23, 2019 4:16 UTC (Fri) by felixfix (subscriber, #242) [Link]

Why not call it "openatat" and release it on May 4th?

How disappointing!

Posted Aug 23, 2019 19:15 UTC (Fri) by k8to (guest, #15413) [Link]

I googled this and now am in pain.

(For others not aware, May fourth is a Star Wars pun/joke date. (May the force..))

How disappointing!

Posted Aug 23, 2019 22:56 UTC (Fri) by Beolach (guest, #77384) [Link]

I got it & groaned... the other half of the pun: https://starwars.fandom.com/wiki/All_Terrain_Armored_Tran...

How disappointing!

Posted Aug 23, 2019 23:28 UTC (Fri) by felixfix (subscriber, #242) [Link]

Sorry (but not much :-) I come from a lot of assembler coding where labels and variable names were pretty short and always was on the lookout for bad jokes. I don't know any good ones :-)

Restricting path name lookup with openat2()

Posted Aug 23, 2019 4:55 UTC (Fri) by epa (subscriber, #39769) [Link]

It might be useful to have a flag which suppresses directory traversal altogether, that is, only lets you open something directly inside the current working directory.

Restricting path name lookup with openat2()

Posted Aug 23, 2019 12:00 UTC (Fri) by cyphar (subscriber, #110703) [Link]

That might be useful, though it could be done (without a hypothetical RESOLVE_NO_SUBDIRS) by checking whether the path contains a '/' in userspace and using O_NOFOLLOW (you wouldn't even need RESOLVE_NO_SYMLINKS).

Restricting path name lookup with openat2()

Posted Aug 23, 2019 12:38 UTC (Fri) by Paf (subscriber, #91811) [Link]

I think the idea of attacker controlled path is supposed to extend to full control over that blob of memory, so any user space checks are subject to races.

Restricting path name lookup with openat2()

Posted Aug 23, 2019 16:50 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Copy the shared buffer into an area of private memory, and thereafter only use the private buffer.

The only way an attacker can go after that is to 1) ptrace you or 2) already be running code in your process. If either of those is the case, then you've already lost.

Restricting path name lookup with openat2()

Posted Aug 23, 2019 20:05 UTC (Fri) by epa (subscriber, #39769) [Link]

Right, there are safe ways to do it, but a flag would be convenient, obvious, and relatively foolproof.

Restricting path name lookup with openat2()

Posted Aug 24, 2019 5:40 UTC (Sat) by cyphar (subscriber, #110703) [Link]

With the downside that using the syscall would be more frustrating -- users that don't care about resolution would need to pass an extra 0 all the time (and the upgrade_mask union would be quite ugly to use too). Not to mention that there is a limit on the number of syscall arguments (6), so in the future we would probably need to switch to a struct-based argument anyway. Linus also mentioned in an earlier review that he wasn't a fan of the variadic open(2) semantics -- so I would like to avoid carrying that syscall design forward as well.

This isn't all a hypothetical -- my first draft of the syscall did just add a new argument, and I discovered pretty quickly (while writing the selftests) that it was abysmal to actually use that interface. The fact that C zeroes out structs when you do designated initialisation makes using structs so much more straightforward here. All of that being said, I'm not married to the current interface at all. If the only concern people have with the patches is what the syscall looks like, I'm more than happy to change it.

Restricting path name lookup with openat2()

Posted Aug 24, 2019 8:00 UTC (Sat) by epa (subscriber, #39769) [Link]

When I said a ‘flag’ I didn’t mean it had to be an extra argument. A Boolean flag can be passed in lots of ways, including a bitmask of flags, and could be part of a struct. What I meant was, in addition to all the funky options mentioned in the article to stop following ‘magic links’ and so on, there could be one more to stop following any directory traversal whatever.

Restricting path name lookup with openat2()

Posted Aug 24, 2019 8:26 UTC (Sat) by cyphar (subscriber, #110703) [Link]

Oh, I'm really sorry -- I completely misread the rest of the comment thread I was responding to. I thought you were arguing for not having a struct *at all* (as someone else has suggested in a separate thread) and that's what I was talking about. Yes, a RESOLVE_NO_SUBDIRS (or whatever) could be useful -- though I'd prefer to land openat2() first and then we can work on extensions like that (I'm already worried enough that the patch touches too many things).

Restricting path name lookup with openat2()

Posted Aug 23, 2019 6:21 UTC (Fri) by pr1268 (subscriber, #24648) [Link]

__u64 reserved[7]; /* must be zeroed */

Why so much empty reserved space? Requiring the use of a struct with 56 empty bytes simply for "reserve" seems unusual.

I'm sure there's a good reason, not just for having the reserve space, but also for having so much of it, but I didn't glean that from the article.

Restricting path name lookup with openat2()

Posted Aug 23, 2019 11:57 UTC (Fri) by cyphar (subscriber, #110703) [Link]

I don't have a particularly strong justification for the size, it was just an arbitrary* (and probably wrong) choice. I don't really mind how we handle the extensibility of open_how, so long as it means we can expand openat2(2) in the future sanely. One other idea I had was a version or size field that would allow for the struct to be resized (rather than taking up a block of space needlessly), but I have a feeling that would be even less acceptable to most people.

* Though, the struct being 64-bytes overall does mean it fits in one cache-line. That's not a good argument for it to be that big, but it does mean that making it any bigger than 64 bytes would be a much worse idea. I figured that I might as well make it as big as reasonable and see what other ideas people came up with.

Restricting path name lookup with openat2()

Posted Aug 23, 2019 22:21 UTC (Fri) by josh (subscriber, #17465) [Link]

You don't need to have a separate size field, just reserve one of the flags in the flags field to indicate future structure growth. Then, some future version that needs to add more data to the structure can set that flag.

Restricting path name lookup with openat2()

Posted Aug 24, 2019 5:54 UTC (Sat) by buck (subscriber, #55985) [Link]

forgive me if this is covered on the kernel developers list or is a common antipattern. (if so, please feel free to dismiss this with an informative link in reply). but why not start the structure with a protocol version number, like network packets (e.g., IP first nibble, kerberos protocol messages' and SNMP PDUs' first integer)? then you pass in whatever struct is appropriate to the protocol version, with its size (a la socklen_t of bind(2)), and the kernel checks the protocol version at the top to figure out which structure it's been passed and how to interpret

Restricting path name lookup with openat2()

Posted Aug 24, 2019 20:05 UTC (Sat) by quotemstr (subscriber, #45331) [Link]

The size *is* a protocol version number, essentially.

Restricting path name lookup with openat2()

Posted Jan 9, 2021 7:55 UTC (Sat) by Serentty (guest, #132335) [Link]

Why reserve any space at all instead of passing in the size like clone3() does?

Restricting path name lookup with openat2()

Posted Feb 3, 2021 5:50 UTC (Wed) by cyphar (subscriber, #110703) [Link]

clone3 and openat2 use the same design for extension, and Christian and I gave a talk about this design at Linux Plumbers last year. You're commenting on a thread which is almost 2 years old. :P

Restricting path name lookup with openat2()

Posted Aug 23, 2019 12:05 UTC (Fri) by cyphar (subscriber, #110703) [Link]

Great article! The one thing I'd add is that in addition to the kernel work, we need to switch many userspace programs to start handling paths more carefully (it's very hard to get right). To that end, I've been working on a helper library called libpathrs[1] that effectively acts as a much easier-to-use wrapper around RESOLVE_IN_ROOT (with userspace emulation for older kernels). I am quite hopeful that we can eliminate a whole class of path-related attacks by having a more ergonomic library to use.

[1]: https://github.com/openSUSE/libpathrs

Restricting path name lookup with openat2()

Posted Aug 28, 2019 18:56 UTC (Wed) by nix (subscriber, #2304) [Link]

My worry about this stuff is that people will start to use it unthinkingly, breaking real use cases: and with bind-mounts becoming more and more common as a result of containerization and other things, use of RESOLVE_NO_XDEV in particular seems like a disaster waiting to happen unless it is done in specific response to user request (in response to something like an -xdev flag, for instance). At the very least, this should come with some big warnings about careless use and note that users use mount points for all sorts of things, and refusing to traverse them without providing a way to change that decision is at the very least rude.

(In times past, I would have hoped that breaking real use cases could be fixed on a case-by-case basis by raising a bug and fixing the software, but in the current environment I just bet some people would say "Linux is not about choice" and demand that users stop using mount points in ways that break their software instead: after all, their laptop has only one big mount under / so your system should too. It seems best to me to try to stop this sort of thing from happening in the first place.)

Restricting path name lookup with openat2()

Posted Aug 29, 2019 3:44 UTC (Thu) by cyphar (subscriber, #110703) [Link]

We can definitely include a warning to that effect in the man page -- though I would hope that it would be obvious (as it is with RESOLVE_NO_SYMLINKS) that you shouldn't use it everywhere unless you specifically need it for some reason. However, libpathrs doesn't use RESOLVE_NO_XDEV -- only RESOLVE_IN_ROOT (which at the moment implies RESOLVE_NO_MAGICLINKS).

Restricting path name lookup with openat2()

Posted Aug 23, 2019 13:17 UTC (Fri) by walters (subscriber, #7396) [Link]

This looks great! I love the fact that it comes with selftests that try to exploit the race conditions it fixes.

Restricting path name lookup with openat2()

Posted Aug 23, 2019 19:22 UTC (Fri) by k8to (guest, #15413) [Link]

I have a small and foolish question about the buffer slice that must be zeroed. Is this just a prescriptive statement, or will it be checked?

I'm also kind of a newb on system call versioning. I've written code that tests for availability of system calls to decide what path to implement (which was a bit scary across the various architectures, but I got it right in the end), but I haven't written code to try to decide if a particular system call has a particular feature. Is there a better way than kernel version numbers?

Restricting path name lookup with openat2()

Posted Aug 24, 2019 4:46 UTC (Sat) by pr1268 (subscriber, #24648) [Link]

I was curious about that, too, but I didn't ask that in my previous question.

It would seem that (1) callers ensuring that reserved is zeroed, and (2) openat2 enforcing this zeroing would be really expensive.

Restricting path name lookup with openat2()

Posted Aug 24, 2019 5:23 UTC (Sat) by cyphar (subscriber, #110703) [Link]

Yes, openat2(2) checks that it's zeroed. It's not particularly expensive, there's an optimised function for doing checks like that in lib/ (memchr_inv). Zeroing structures and flag bits before passing them to the kernel is already a very common practice (as discussed in this LWN article).
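As an illustration only, the check can be modeled in user space with a mirror of the structure from the article. The names struct open_how_draft, reserved_is_zero(), and validate_how() are invented for this sketch, and the simple loop merely stands in for the kernel's optimised helper.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the structure proposed in the article. */
struct open_how_draft {
    uint32_t flags;
    union {
        uint16_t mode;
        uint16_t upgrade_mask;
    };
    uint16_t resolve;
    uint64_t reserved[7];  /* must be zeroed */
};

/* Plain-C stand-in for the kernel's zero check. */
static int reserved_is_zero(const struct open_how_draft *how)
{
    size_t i;

    for (i = 0; i < sizeof(how->reserved) / sizeof(how->reserved[0]); i++)
        if (how->reserved[i] != 0)
            return 0;
    return 1;
}

/* Hypothetical validation step: reject any use of the reserved space. */
static int validate_how(const struct open_how_draft *how)
{
    return reserved_is_zero(how) ? 0 : -EINVAL;
}
```

On the caller's side, a designated initializer such as `struct open_how_draft how = { .resolve = 0x08 };` leaves every unnamed member, including the reserved array, set to zero, so the zeroing requirement costs nothing in source code.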

Restricting path name lookup with openat2()

Posted Aug 24, 2019 5:28 UTC (Sat) by cyphar (subscriber, #110703) [Link]

Yes, it is checked (you get -EINVAL if it's non-zero).

As for figuring out whether syscalls have particular features, the best way is to pass the flag and see if you get -EINVAL -- this is why checking whether there are unknown flags present and returning -EINVAL is important in syscall design. If you don't check whether unknown flags are passed, you end up with situations where userspace cannot easily figure out whether the flag is actually supported. open(2) doesn't do this check (which makes it significantly more complicated to figure out whether your kernel supports a particular open(2) feature), but in openat2(2) we do check whether there are unknown O_* flags present.
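A sketch of such a probe, written against the interface as it was eventually merged (three 64-bit fields plus an explicit size argument), could look like this. The names struct open_how_v0 and probe_resolve_flag() are hypothetical, and the fallback syscall number 437 is an assumption that holds on most architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437  /* assumption: the number used on most architectures */
#endif

/* Local copy of the merged structure layout. */
struct open_how_v0 {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/*
 * Probe whether the running kernel accepts a given RESOLVE_* bit.
 * Returns 1 if the call succeeded, 0 if the kernel rejected it with
 * EINVAL, and -1 if no conclusion can be drawn (for example ENOSYS on
 * a kernel without openat2(), or a sandbox that blocks the call).
 */
static int probe_resolve_flag(uint64_t resolve_flag)
{
    struct open_how_v0 how;
    long fd;

    memset(&how, 0, sizeof(how));
    how.flags = O_RDONLY | O_DIRECTORY | O_CLOEXEC;
    how.resolve = resolve_flag;

    fd = syscall(SYS_openat2, AT_FDCWD, ".", &how, sizeof(how));
    if (fd >= 0) {
        close((int)fd);
        return 1;
    }
    return errno == EINVAL ? 0 : -1;
}
```

Probing a bit that no kernel defines, such as 1ULL << 63, should yield 0 wherever openat2() exists; that predictable failure is what makes this style of feature detection reliable.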

Restricting path name lookup with openat2()

Posted Aug 24, 2019 20:14 UTC (Sat) by quotemstr (subscriber, #45331) [Link]

This check makes a lot of sense when we're talking about *flags*. But why also check reserved fields when any future use of a reserved field could be signaled through the use of a flag? Can you elaborate on what concrete problems this zero check on the reserved field is supposed to address when any new functionality will get a new flag anyway?

Restricting path name lookup with openat2()

Posted Aug 25, 2019 14:51 UTC (Sun) by cyphar (subscriber, #110703) [Link]

It wouldn't be necessary to add a new flag to represent entirely new features if you have a reserved (must be zeroed) block in the struct. As an example, imagine if open_how->resolve wasn't included in this version of the series (and open_how->reserved was one u16 bigger). In a later kernel version we could introduce open_how->resolve as a field and shrink open_how->reserved -- as long as the zero-value had identical semantics there wouldn't be a break in backwards-compatibility and new programs could use the new field (getting -EINVAL if they run on older kernels).

All of that being said, I am gravitating towards not having reserved space. I don't have particularly strong opinions either way.

Restricting path name lookup with openat2()

Posted Aug 26, 2019 19:18 UTC (Mon) by k8to (guest, #15413) [Link]

Thanks. That makes sense. Probably the manpage will tell me to do just that, and I would have figured it out upon reading it.

Restricting path name lookup with openat2()

Posted Mar 30, 2020 16:09 UTC (Mon) by gb (subscriber, #58328) [Link]

Very strange approach - let's introduce my own function to send params via struct.

Why not introduce a 'generate error on unknown flags' flag, and introduce a new syscall once you really run out of flags, not in advance?

Restricting path name lookup with openat2()

Posted Mar 30, 2020 23:14 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

Because if you didn't have that flag before (so unknown flags were ignored) then introducing a flag to make unknown flags generate an error would be a breaking change: programs that worked before while setting that bit along with one or more other unknown flag bits would suddenly stop working.

If you don't make unknown flags an error up front then the unused bits basically become permanently reserved. You can never safely *stop* ignoring them.

Restricting path name lookup with openat2()

Posted Mar 31, 2020 7:07 UTC (Tue) by scientes (guest, #83068) [Link]

> Why not introduce flag 'generate error on unknown flags'

Because that unknown flag could have been passed by a program before it was introduced....