Qualys Security Advisory

System Down: A systemd-journald exploit


========================================================================
Contents
========================================================================

Summary
CVE-2018-16864
- Analysis
- Exploitation
CVE-2018-16865
- Analysis
- Exploitation
CVE-2018-16866
- Analysis
- Exploitation
Combined Exploitation of CVE-2018-16865 and CVE-2018-16866
- amd64 Exploitation
- i386 Exploitation
Acknowledgments
Timeline

    Conversion, software version 7.0
        -- System of a Down, "Toxicity"


========================================================================
Summary
========================================================================

We discovered three vulnerabilities in systemd-journald
(https://en.wikipedia.org/wiki/Systemd):

- CVE-2018-16864 and CVE-2018-16865, two memory corruptions
  (attacker-controlled alloca()s);

- CVE-2018-16866, an information leak (an out-of-bounds read).

CVE-2018-16864 was introduced in April 2013 (systemd v203) and became
exploitable in February 2016 (systemd v230). We developed a proof of
concept for CVE-2018-16864 that gains eip control on i386.

CVE-2018-16865 was introduced in December 2011 (systemd v38) and became
exploitable in April 2013 (systemd v201). CVE-2018-16866 was introduced
in June 2015 (systemd v221) and was inadvertently fixed in August 2018.

We developed an exploit for CVE-2018-16865 and CVE-2018-16866 that
obtains a local root shell in 10 minutes on i386 and 70 minutes on
amd64, on average. We will publish our exploit in the near future.

To the best of our knowledge, all systemd-based Linux distributions are
vulnerable, but SUSE Linux Enterprise 15, openSUSE Leap 15.0, and Fedora
28 and 29 are not exploitable because their user space is compiled with
GCC's -fstack-clash-protection.

This confirms https://grsecurity.net/an_ancient_kernel_hole_is_not_closed.php:
"It should be clear that kernel-only attempts to solve [the Stack Clash]
will necessarily always be incomplete, as the real issue lies in the
lack of stack probing."


========================================================================
CVE-2018-16864
========================================================================

------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------

    The waves all keep on crashing by
        -- System of a Down, "Suggestions"

We accidentally discovered CVE-2018-16864 while working on the exploit
for Mutagen Astronomy (CVE-2018-14634); if we pass several megabytes of
command-line arguments to a program that calls syslog(), then journald
crashes:

systemd-journal[472]: segfault at 7ffe9a077420 ip 00007f45f6174877 sp 00007ffe9a0773f0 error 6 in systemd-journald[7f45f6169000+3f000]

(gdb) disassemble 0x7f45f6174877 - 0x7f45f6169000
Dump of assembler code for function dispatch_message_real.4064:
   ...
   0x000000000000b82c <+988>:   callq  0x2bd10 <get_process_cmdline.constprop.96>
   0x000000000000b831 <+993>:   test   %eax,%eax
   0x000000000000b833 <+995>:   js     0xb8ea <dispatch_message_real.4064+1178>
   0x000000000000b839 <+1001>:  mov    -0x218(%rbp),%rbx
   0x000000000000b840 <+1008>:  test   %rbx,%rbx
   0x000000000000b843 <+1011>:  je     0xd31b <dispatch_message_real.4064+7883>
   0x000000000000b849 <+1017>:  mov    %rbx,%rdi
   0x000000000000b84c <+1020>:  callq  0x5360 <strlen@plt>
   0x000000000000b851 <+1025>:  add    $0xa,%eax
   0x000000000000b854 <+1028>:  cltq
   0x000000000000b856 <+1030>:  add    $0x1e,%rax
   0x000000000000b85a <+1034>:  and    $0xfffffffffffffff0,%rax
   0x000000000000b85e <+1038>:  sub    %rax,%rsp
   0x000000000000b861 <+1041>:  movabs $0x454e494c444d435f,%rax
   0x000000000000b86b <+1051>:  lea    0x37(%rsp),%r15
   0x000000000000b870 <+1056>:  and    $0xfffffffffffffff0,%r15
   0x000000000000b874 <+1060>:  test   %rbx,%rbx
   0x000000000000b877 <+1063>:  mov    %rax,(%r15)
   0x000000000000b87a <+1066>:  mov    $0x3d,%eax
   0x000000000000b87f <+1071>:  mov    %ax,0x8(%r15)
   0x000000000000b884 <+1076>:  lea    0x9(%r15),%rax
   0x000000000000b888 <+1080>:  je     0xb895 <dispatch_message_real.4064+1093>
   0x000000000000b88a <+1082>:  mov    %rbx,%rsi
   0x000000000000b88d <+1085>:  mov    %rax,%rdi
   0x000000000000b890 <+1088>:  callq  0x5370 <stpcpy@plt>

538 static void dispatch_message_real(
...
604                 r = get_process_cmdline(ucred->pid, 0, false, &t);
605                 if (r >= 0) {
606                         x = strjoina("_CMDLINE=", t);

919 #define strjoina(a, ...)                                                \
920         ({                                                              \
921                 const char *_appendees_[] = { a, __VA_ARGS__ };         \
922                 char *_d_, *_p_;                                        \
923                 int _len_ = 0;                                          \
924                 unsigned _i_;                                           \
925                 for (_i_ = 0; _i_ < ELEMENTSOF(_appendees_) && _appendees_[_i_]; _i_++) \
926                         _len_ += strlen(_appendees_[_i_]);              \
927                 _p_ = _d_ = alloca(_len_ + 1);                          \
928                 for (_i_ = 0; _i_ < ELEMENTSOF(_appendees_) && _appendees_[_i_]; _i_++) \
929                         _p_ = stpcpy(_p_, _appendees_[_i_]);            \
930                 *_p_ = 0;                                               \
931                 _d_;                                                    \
932         })
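
The stack adjustment at instructions 0xb851-0xb85e can be reproduced in
a few lines of C. This is a minimal sketch of that arithmetic only: the
helper name strjoina_stack_bytes() is ours (it is not part of systemd),
and the 0x1e / 0xf rounding constants are read off this particular
build's disassembly, so other compilers may round differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes subtracted from %rsp for strjoina("_CMDLINE=", t), following
 * the disassembly above. Hypothetical helper, not part of systemd. */
uint64_t strjoina_stack_bytes(size_t cmdline_len)
{
        /* add $0xa,%eax: strlen("_CMDLINE=") == 9, plus 1 for the '\0';
         * the sum is truncated to 32 bits (the macro's "int _len_"). */
        int32_t len = (int32_t)(uint32_t)(cmdline_len + 0xa);

        /* cltq ; add $0x1e,%rax ; and $0xfffffffffffffff0,%rax */
        uint64_t bytes = (uint64_t)(int64_t)len + 0x1e;

        return bytes & ~(uint64_t)0xf;
}
```

The result grows linearly with the length of the sender's cmdline, and
nothing bounds it before the "sub %rax,%rsp" at 0xb85e.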

This vulnerability, an attacker-controlled alloca()
(https://wiki.sei.cmu.edu/confluence/display/c/MEM05-C.+Avoid+large+stack+allocations)
at instruction 0xb85e and line 927, was introduced in systemd v203:

commit ae018d9bc900d6355dea4af05119b49c67945184
Date:   Mon Apr 22 23:10:13 2013 -0300
...
                 r = get_process_cmdline(ucred->pid, 0, false, &t);
                 if (r >= 0) {
-                        cmdline = strappend("_CMDLINE=", t);
+                        cmdline = strappenda("_CMDLINE=", t);

(strappenda() was renamed strjoina() in systemd v219) and became
exploitable in systemd v230:

commit ac2e41f5103ce2c679089c4f8fb6be61d7caec07
Date:   Fri Feb 12 04:59:57 2016 -0800
...
    This adds a wait flag to journal_file_set_offline(), when false the offline is
    performed asynchronously in a separate thread.

------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------

    ... it's the race
    Can you break out?
        -- System of a Down, "36"

CVE-2018-16864 is similar to a Stack Clash vulnerability
(https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt), but:

- Steps 1 (Clash the stack with another memory region) and 2 (Run the
  stack pointer to the start of the stack) are not needed, because the
  attacker-controlled alloca() can be very large (several megabytes of
  command-line arguments); only Steps 3 (Jump over the stack guard page,
  into another memory region) and 4 (Smash the stack, or another memory
  region) are needed.

- In Step 4 (Smash), the alloca() is fully written to (the vulnerability
  is essentially a stpcpy(alloca(strlen(cmdline) + 1), cmdline)), and
  the stpcpy() (a "wild copy") will therefore always crash into a
  read-only or unmapped memory region:

  https://googleprojectzero.blogspot.com/2015/03/taming-wild-copy-parallel-thread.html
  https://cansecwest.com/slides/2015/Taming%20wild%20copies%20-%20Chris%20evans.pdf
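
For contrast, a bounded, heap-based join never moves the stack pointer
by an attacker-controlled amount. The following is a minimal sketch of
that general mitigation, not systemd's actual fix; join_bounded() and
its max_len parameter are hypothetical names chosen for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a malloc()ed "prefix" + "value", or NULL if the result would
 * exceed max_len bytes (excluding the '\0') or if allocation fails.
 * Hypothetical helper, shown only to contrast with alloca(). */
char *join_bounded(const char *prefix, const char *value, size_t max_len)
{
        size_t plen = strlen(prefix);
        size_t vlen = strlen(value);
        char *out;

        /* Reject oversized input before allocating anything. */
        if (plen > max_len || vlen > max_len - plen)
                return NULL;

        out = malloc(plen + vlen + 1);
        if (!out)
                return NULL;

        memcpy(out, prefix, plen);
        memcpy(out + plen, value, vlen + 1); /* copies the '\0' too */
        return out;
}
```

The caller must free() the result. An oversized cmdline then yields a
NULL return that can be handled, instead of a wild copy.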

We tried to asynchronously interrupt this stpcpy() before it crashes,
with a signal or a timer, but we failed because journald uses signalfd()
and timerfd_create() to handle these events synchronously.

We eventually gained control of eip (i386's instruction pointer) by
jumping into and smashing the stack of a concurrent thread (a "Parallel
Thread Corruption"):

- First, we send a large, high-priority message (LOG_CRIT or higher) to
  journald, from a process whose cmdline is small; this message forces a
  large write() (between 1MB and 2MB) to /var/log/journal/ and forces
  the creation of a short-lived thread that fsync()s the journal (the
  stack of this thread is allocated in the mmap region).

- Next, we create several processes (between 32 and 64) that write() and
  fsync() large files (between 1MB and 8MB) to /var/tmp/ (for example);
  these processes stall journald's fsync() thread and will allow us to
  win a tight race: exploit the "wild copy" before it crashes.

- Last, we send a small, low-priority message to journald, from a
  process whose cmdline is very large (roughly 128MB, the distance
  between the main stack and the mmap region); this message forces a
  very large alloca() that jumps from journald's main stack into the
  stack of the fsync() thread, and smashes a saved eip before fsync()
  returns from kernel space.

On a Debian stable (9.5), our proof of concept wins this race and gains
eip control after a dozen tries (systemd automatically restarts journald
after each crash):

systemd-journal[2195]: segfault at 41414141 ip 41414141 sp b5f3d22c error 14

Despite this initial success, we abandoned the exploitation of
CVE-2018-16864: while working on our proof of concept, we discovered two
different vulnerabilities (CVE-2018-16865, another attacker-controlled
alloca(), and CVE-2018-16866, an information leak) that are reliably
exploitable on both i386 and amd64.


========================================================================
CVE-2018-16865
========================================================================

------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------

    Can you feel their haunting presence?
        -- System of a Down, "Holy Mountains"

Surprised by the heavy usage of alloca() in journald, we searched for
another attacker-controlled alloca() and found CVE-2018-16865:

1963 int journal_file_append_entry(JournalFile *f, const dual_timestamp *ts, const struct iovec iovec[], unsigned n_iovec, uint64_t *seqnum, Object **ret, uint64_t *offset) {
....
1986         items = alloca(sizeof(EntryItem) * MAX(1u, n_iovec));
1987
1988         for (i = 0; i < n_iovec; i++) {
1989                 uint64_t p;
1990                 Object *o;
1991
1992                 r = journal_file_append_data(f, iovec[i].iov_base, iovec[i].iov_len, &o, &p);
1993                 if (r < 0)
1994                         return r;
1995
1996                 xor_hash ^= le64toh(o->data.hash);
1997                 items[i].object_offset = htole64(p);
1998                 items[i].hash = o->data.hash;
1999         }

This vulnerability was introduced in systemd v38:

commit cf244689e9d1ab50082c9ddd0f3c4d1eb982badc
Date:   Thu Dec 29 15:00:57 2011 +0100
...
-        items = new(EntryItem, n_iovec);
-        if (!items)
-                return -ENOMEM;
+        items = alloca(sizeof(EntryItem) * n_iovec);

and became exploitable in systemd v201:

commit c4aa09b06f835c91cea9e021df4c3605cff2318d
Date:   Mon Apr 8 20:32:03 2013 +0200
...
-#define ENTRY_SIZE_MAX (1024*1024*64)
-#define DATA_SIZE_MAX (1024*1024*64)
...
+#define ENTRY_SIZE_MAX (1024*1024*768)
+#define DATA_SIZE_MAX (1024*1024*768)

Consider a large "native" message sent to /run/systemd/journal/socket:
since the maximum size of a "native" entry is 768MB, the minimum length
of a "native" item is 3 ("A=\n"), and the size of an EntryItem structure
is 16 (a 64-bit offset and a 64-bit hash), the maximum size of the
attacker-controlled alloca() in journal_file_append_entry() is 768MB / 3
* 16 = 4GB, large enough to jump from journald's main stack into the
mmap region, even on amd64.
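
The 4GB bound can be checked mechanically. This is a small sketch of the
arithmetic above; the macro and function names are ours, and only the
three constants (768MB, 3, 16) come from the analysis.

```c
#include <stdint.h>

#define ENTRY_SIZE_MAX  (UINT64_C(1024) * 1024 * 768) /* 768MB */
#define MIN_ITEM_LEN    UINT64_C(3)                   /* "A=\n" */
#define ENTRY_ITEM_SIZE UINT64_C(16)  /* 64-bit offset + 64-bit hash */

/* Upper bound, in bytes, on alloca(sizeof(EntryItem) * n_iovec) for a
 * given maximum entry size. Hypothetical helper for illustration. */
uint64_t max_items_alloca_bytes(uint64_t entry_size_max)
{
        uint64_t max_items = entry_size_max / MIN_ITEM_LEN;

        return max_items * ENTRY_ITEM_SIZE;
}
```

max_items_alloca_bytes(ENTRY_SIZE_MAX) evaluates to 4294967296, that is,
exactly 4GB.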

On amd64, as described in the "64-bit exploitation" section of our Stack
Clash advisory, the randomized distance between the main stack and the
mmap region is shorter than 4GB with a probability of (approximately):

SUM(d = 0; d < 4GB; d++) d / (16GB * 1TB) ~= 1 / 2048
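
The sum has a closed form, limit * (limit - 1) / 2, so the estimate can
be evaluated directly. A minimal sketch, assuming the formula above is
taken literally; the function name and parameter names are ours.

```c
#include <stdint.h>

/* Evaluates SUM(d = 0; d < limit; d++) d / (range_a * range_b) using
 * the closed form of the sum. With limit = 4GB, range_a = 16GB and
 * range_b = 1TB, every intermediate value is exact in a double. */
double jumpable_probability(uint64_t limit, uint64_t range_a,
                            uint64_t range_b)
{
        double sum;

        if (limit == 0)
                return 0.0;

        sum = (double)limit * (double)(limit - 1) / 2.0;
        return sum / ((double)range_a * (double)range_b);
}
```

With limit = 2^32, range_a = 2^34 and range_b = 2^40, the numerator is
just under 2^63 and the denominator is 2^74, which gives a value just
under 2^-11 = 1 / 2048.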

------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------

    Jump (pogo, pogo, pogo, pogo, pogo, pogo, pogo)
        -- System of a Down, "Bounce"

CVE-2018-16865 is basically a simplified Stack Clash vulnerability:

- Steps 1 (Clash) and 2 (Run) of the Stack Clash are not needed, since
  the largest attacker-controlled alloca() is 4GB; only Steps 3 (Jump)
  and 4 (Smash) are needed.

- In Step 4 (Smash), the alloca() is not necessarily fully written to:
  if the size of an item is larger than 128MB (DEFAULT_MAX_SIZE_UPPER),
  then journal_file_append_data() returns an error that breaks the "for"
  loop in journal_file_append_entry() (at lines 1992-1994) and avoids a
  crash into a read-only or unmapped memory region.
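
The early exit described in the second point can be modeled as follows.
This is a simplified sketch, not journald code: it only models the size
check (the real journal_file_append_data() can fail for other reasons),
and the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ITEM_SIZE_LIMIT (UINT64_C(128) * 1024 * 1024) /* 128MB */

/* Returns how many leading EntryItem slots the "for" loop fills before
 * the first oversized item makes it return early. */
size_t items_written_before_error(const uint64_t *item_len,
                                  size_t n_iovec)
{
        size_t i;

        for (i = 0; i < n_iovec; i++) {
                if (item_len[i] > ITEM_SIZE_LIMIT)
                        break; /* models "if (r < 0) return r;" */
        }
        return i;
}
```

Only the first items_written_before_error() slots of the alloca()ted
array are written; the rest of the array, however large, is never
touched.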

We eventually transformed this vulnerability into a crude
"write-what-where" (https://cwe.mitre.org/data/definitions/123.html):

- "write-where": We jump into and smash libc's read-write segment, and
  thereby overwrite a function pointer. Unfortunately this "write-where"
  is not surgical: the stack frames of the functions called from within
  the "for" loop (in journal_file_append_entry()) smash a few kilobytes
  below our target function pointer, and therefore overwrite vital libc
  variables that may crash or deadlock journald. Consequently, we must
  sometimes shift our alloca() jump slightly, to avoid overwriting such
  vital variables.

- "write-what": We want to overwrite our target function pointer with
  the address of another function or ROP chain, but unfortunately the
  stack frames of the functions called from within the "for" loop (in
  journal_file_append_entry()) do not contain any data that we control.
  However, the 64-bit "hash" values that are written to the alloca()ted
  "items" are produced by jenkins_hashlittle2(), a noncryptographic hash
  function: we can easily find a short string (a preimage) that hashes
  to a given value (the address that will overwrite our target function
  pointer) and is also a valid_user_field() (or journal_field_valid()).

  This "write-what" restricts our "write-where" to function pointers
  whose address modulo 16 is equal to 8 (the offset of "hash" in the
  EntryItem structure).
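
The alignment restriction follows from the layout of the structure. A
minimal sketch, assuming that alloca() returns a 16-byte-aligned
pointer (as it does in the amd64 disassembly shown earlier); struct
entry_item is a simplified stand-in for journald's EntryItem, and
hash_field_can_cover() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for EntryItem: 16 bytes, per the analysis. */
struct entry_item {
        uint64_t object_offset;
        uint64_t hash;
};

/* True if the "hash" field of some element of a 16-byte-aligned
 * items[] array can land exactly on target_addr. */
bool hash_field_can_cover(uint64_t target_addr)
{
        return target_addr % sizeof(struct entry_item)
                == offsetof(struct entry_item, hash);
}
```

An address ending in 0x8 qualifies; an address ending in 0x0 would
receive the "object_offset" field instead, which we do not control in
the same way.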

To complete our exploit, we need the address of journald's stack pointer
before the alloca() jump, and the address of our target function pointer
in libc's read-write segment -- we need an information leak.


========================================================================
CVE-2018-16866
========================================================================

------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------

    When they speak, we can peek from the windows of their mouths
        -- System of a Down, "Know"

We discovered an out-of-bounds read in journald (CVE-2018-16866), and
transformed it into an information leak:

 31 #define WHITESPACE        " \t\n\r"
...
194 size_t syslog_parse_identifier(const char **buf, char **identifier, char **pid) {
195         const char *p;
...
197         size_t l, e;
...
203         p = *buf;
204
205         p += strspn(p, WHITESPACE);
206         l = strcspn(p, WHITESPACE);
207
208         if (l <= 0 ||
209             p[l-1] != ':')
210                 return 0;
211
212         e = l;
...
240         if (strchr(WHITESPACE, p[e]))
241                 e++;
242         *buf = p + e;
243         return e;
244 }

If we send a syslog message to journald (in *buf), and if the last
character of this message is a ':' (before the '\0' terminator), then:

- at line 240, p[e] is the '\0' terminator of our message;

- at line 240, strchr(WHITESPACE, p[e]) returns a pointer to the '\0'
  terminator of the WHITESPACE string (as mentioned in man strchr: "The
  terminating null byte is considered part of the string, so that if c
  is specified as '\0', these functions return a pointer to the
  terminator.");

- at line 241, e is incremented;

- at line 242, *buf points out-of-bounds, to the first character after
  the '\0' terminator of our message;

- later, the out-of-bounds string at *buf (supposedly the body of our
  syslog message) is written (leaked) to the journal.
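
The strchr() behavior at line 240 is easy to reproduce in isolation. The
sketch below extracts only the separator-skipping step; the two function
names are ours, and the second variant mirrors the explicit '\0' check
of the later upstream code.

```c
#include <stddef.h>
#include <string.h>

#define WHITESPACE " \t\n\r"

/* Separator skip as in lines 240-241 above. */
size_t skip_separator_v221(const char *p, size_t e)
{
        if (strchr(WHITESPACE, p[e])) /* also true when p[e] == '\0' */
                e++;
        return e;
}

/* Same step with an explicit terminator check. */
size_t skip_separator_checked(const char *p, size_t e)
{
        if (p[e] != '\0' && strchr(WHITESPACE, p[e]))
                e++;
        return e;
}
```

For the buffer "id:\0next" and e == 3, the first variant returns 4 (one
byte past the terminator) and the second returns 3.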

Consequently, we can read this out-of-bounds string:

- either directly from the journal (if journald's "Storage" is
  "persistent", or "auto" and /var/log/journal/ exists), because
  journald supports extended file ACLs (Access Control Lists):

  $ id
  uid=1000(john) gid=1000(john) groups=1000(john) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

  $ ls -l /var/log/journal/*/user-$UID.journal
  -rw-r-----+ 1 root systemd-journal 8388608 Nov 20 09:35 /var/log/journal/2562d1eced654f44a3d3a217d66b9ff3/user-1000.journal

  $ getfacl /var/log/journal/*/user-$UID.journal
  ...
  user:john:r--

  $ ./infoleak

  $ journalctl --all --user --lines=1 --identifier=infoleak | hexdump -C
  ...
  00000050  2e 20 2d 2d 0a 4e 6f 76  20 32 30 20 31 36 3a 30  |. --.Nov 20 16:0|
  00000060  30 3a 33 36 20 6c 6f 63  61 6c 68 6f 73 74 2e 6c  |0:36 localhost.l|
  00000070  6f 63 61 6c 64 6f 6d 61  69 6e 20 69 6e 66 6f 6c  |ocaldomain infol|
  00000080  65 61 6b 5b 33 35 34 38  5d 3a 20 78 fb 1e 78 54  |eak[3548]: x..xT|
  00000090  7f 0a                                             |..|

- or (if journald's "Storage" is "volatile", or "auto" and
  /var/log/journal/ does not exist) from a tty that we recorded to
  /var/run/utmp, because journald writes ("walls") emergency messages
  (LOG_EMERG) to the tty of every logged-in user; our exploit records a
  tty to /var/run/utmp via an ssh connection to localhost, but other
  methods exist (for example, utempter and gnome-pty-helper):

  $ ./infoleak
  ...
  00003510  0a 07 0d 0d 0a 42 72 6f  61 64 63 61 73 74 20 6d  |.....Broadcast m|
  00003520  65 73 73 61 67 65 20 66  72 6f 6d 20 73 79 73 74  |essage from syst|
  00003530  65 6d 64 2d 6a 6f 75 72  6e 61 6c 64 40 6c 6f 63  |emd-journald@loc|
  00003540  61 6c 68 6f 73 74 2e 6c  6f 63 61 6c 64 6f 6d 61  |alhost.localdoma|
  00003550  69 6e 20 28 54 75 65 20  32 30 31 38 2d 31 31 2d  |in (Tue 2018-11-|
  00003560  32 30 20 31 36 3a 32 35  3a 34 36 20 43 53 54 29  |20 16:25:46 CST)|
  00003570  3a 0d 0d 0a 0d 0d 0a 69  6e 66 6f 6c 65 61 6b 5b  |:......infoleak[|
  00003580  33 38 37 32 5d 3a 20 78  6b a2 e1 2f 7f 0d 0d 0a  |3872]: xk../....|
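
The leaked bytes in the two dumps above can be turned back into an
address. A minimal sketch, assuming that the six bytes following "]: "
are the low bytes of a little-endian amd64 pointer; the function name is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assembles up to 8 leaked bytes, least significant first, into an
 * address. The leak is a C string, so it stops at the first zero byte
 * and the missing high bytes are taken to be zero. */
uint64_t decode_leaked_pointer(const unsigned char *bytes, size_t n)
{
        uint64_t addr = 0;
        size_t i;

        for (i = 0; i < n && i < sizeof(addr); i++)
                addr |= (uint64_t)bytes[i] << (8 * i);
        return addr;
}
```

Under that assumption, the bytes 78 fb 1e 78 54 7f from the first dump
decode to 0x7f54781efb78, and the bytes 78 6b a2 e1 2f 7f from the
second dump decode to 0x7f2fe1a26b78.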

This vulnerability was introduced in systemd v221:

commit ec5ff4445cca6a1d786b8da36cf6fe0acc0b94c8
Date:   Wed Jun 10 22:33:44 2015 -0700
...
-        e += strspn(p + e, WHITESPACE);
+        if (strchr(WHITESPACE, p[e]))
+                e++;

and was inadvertently fixed in August 2018:

commit a6aadf4ae0bae185dc4c414d492a4a781c80ffe5
Date:   Wed Aug 8 15:06:36 2018 +0900
...
-        if (strchr(WHITESPACE, p[e]))
-                e++;
+        e += strspn(p + e, WHITESPACE);

commit 8595102d3ddde6d25c282f965573a6de34ab4421
Date:   Fri Aug 10 11:07:54 2018 +0900
...
-        e += strspn(p + e, WHITESPACE);
+        /* Single space is used as separator */
+        if (p[e] != '\0' && strchr(WHITESPACE, p[e]))
+                e++;

------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------

    For today we will take the body parts and put them on the wall
        -- System of a Down, "Dreaming"

To leak a stack address or an mmap address from journald:

- First, we send a large native message to /run/systemd/journal/socket;
  journald mmap()s our message, and malloc()ates a large array of iovec
  structures: most of these structures point into our mmap()ed message,
  but some of them point to the stack (in dispatch_message_real()). The
  contents of this iovec array (especially the mmap and stack pointers)
  are preserved in a heap hole after free() (after journald finishes
  processing our message).

- Next, we send a large syslog message to /run/systemd/journal/dev-log;
  to receive our large message (in server_process_datagram()), journald
  realloc()ates its server buffer into the heap hole that previously
  contained the iovec array (and still contains remains of mmap and
  stack pointers).

- Last, we send a large syslog message that exploits CVE-2018-16866;
  journald receives our large message in its server buffer (in the heap
  chunk that previously contained the iovec array), and if we carefully
  choose the size of our message and position its terminating ":" in
  front of a remaining mmap or stack pointer, then we can leak this
  pointer (it is mistakenly read out-of-bounds as the body of our
  message).

From this leaked stack pointer we easily deduce journald's stack pointer
before the alloca() jump, because the distance between the two depends
only on journald's executable.

From the leaked mmap address we can deduce libc's address, but chunks of
unknown sizes are mmap()ed between the two, and we must therefore adopt
different strategies based on our target architecture (i386 or amd64).


========================================================================
Combined Exploitation of CVE-2018-16865 and CVE-2018-16866
========================================================================

    Don't leave your seats now
    Popcorn everywhere ...
        -- System of a Down, "CUBErt"

------------------------------------------------------------------------
amd64 Exploitation
------------------------------------------------------------------------

- To deduce libc's address from the leaked mmap address of our native
  message, we arrange for this message to be mmap()ed into the 2MB hole
  between ld.so's read-execute and read-only segments: from this hole's
  address we deduce ld.so's address, and hence libc's address (with help
  from ldd's output).

- If the resulting stack-to-libc distance is jumpable (if it is shorter
  than 4GB), then we proceed with our "write-what-where"; otherwise, we
  restart journald (we crash it with an alloca() of RLIMIT_STACK -- 8MB
  by default) and try again.

  We have a good chance of obtaining a jumpable stack-to-libc distance
  (and hence a root shell) after 2048 tries * 2 seconds ~= 68 minutes
  (by default, if journald crashes fewer than 5 times within 10 seconds,
  it is restarted automatically by systemd).

- For the "write-where" part of our "write-what-where", we overwrite
  libc's __free_hook function pointer, whose address modulo 16 is always
  equal to 8 (on every amd64 distribution that we exploited).

- For the "write-what" part of our "write-what-where", we overwrite
  __free_hook with the address of libc's system() function: whenever
  journald free()s data that we control, we achieve arbitrary command
  execution.
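
The "jumpable" test and the time estimate in the second point above can
be sketched as follows. Both helpers are hypothetical; the 4GB limit is
the maximum alloca() size derived in the analysis of CVE-2018-16865, and
the addresses passed in are assumed to come from the information leak.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ALLOCA_JUMP (UINT64_C(4) << 30) /* 4GB */

/* The main stack sits above the mmap region, so the jump distance is
 * stack_ptr - target. Returns false if the target is not below the
 * stack pointer or if it is 4GB or more away. */
bool distance_is_jumpable(uint64_t stack_ptr, uint64_t target)
{
        if (target >= stack_ptr)
                return false;
        return stack_ptr - target < MAX_ALLOCA_JUMP;
}

/* Wall-clock minutes for a given number of paced attempts. */
double paced_attempts_minutes(uint64_t tries, double seconds_per_try)
{
        return (double)tries * seconds_per_try / 60.0;
}
```

paced_attempts_minutes(2048, 2.0) is 4096 seconds expressed in minutes,
slightly more than 68.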

Last-minute note: on CentOS 7, the usual function pointers in libc's
read-write segment (__free_hook, __malloc_hook, etc.) are not located at
multiples of 16 plus 8. To circumvent this problem:

- First, we overwrite the "_chain" pointer of stderr's FILE structure
  with the address of our own fake FILE structure (this "_chain" pointer
  is located at a multiple of 16 plus 8, in libc's read-write segment).

- Next, we corrupt one of malloc's internal variables (also in libc's
  read-write segment).

- Last, we force a call to malloc() or free(), which detects the
  corruption of its internal variable and calls abort(), which calls
  _IO_flush_all_lockp(), which follows stderr's overwritten "_chain"
  pointer to our fake FILE structure; we eventually achieve arbitrary
  command execution by calling libc's system() via one of the function
  pointers in our fake FILE structure.

------------------------------------------------------------------------
i386 Exploitation
------------------------------------------------------------------------

Our i386 exploit is very similar to the amd64 exploit, but:

- The stack-to-libc distance is always jumpable (it is roughly 128MB).

- There is no hole between ld.so's read-execute and read-only segments.
  However, libc's address is randomized in a narrow range of 1MB and is
  therefore brute-forceable: we have a good chance of correctly guessing
  libc's address after 1MB / 4KB = 256 tries * 2 seconds ~= 8 minutes.

- For the "write-where" part of our "write-what-where", we overwrite
  libc's __malloc_hook function pointer (__free_hook was never located
  at a multiple of 16 plus 8 or 12 on the i386 distributions that we
  exploited, but __malloc_hook always is).

- For the "write-what" part of our "write-what-where", we overwrite
  __malloc_hook with the address of a "mov esp, 0x89fffa5d ; ret" gadget
  (or equivalent stack pivot): since our native message can be as large
  as 768MB, we can mmap() it at 0x89fffa5d, take control of the stack,
  and return into libc's execve().
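
The size of the i386 search space in the second point above is a single
division. A minimal sketch, assuming page-aligned candidates; the names
are ours.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES UINT64_C(4096) /* 4KB */

/* Number of page-aligned candidate addresses in a randomization range
 * of range_bytes bytes. Hypothetical helper for illustration. */
uint64_t candidate_count(uint64_t range_bytes)
{
        return range_bytes / PAGE_SIZE_BYTES;
}
```

For a 1MB range this yields 256 candidates; at one attempt every 2
seconds, trying all of them takes 512 seconds.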


========================================================================
Acknowledgments
========================================================================

We thank systemd's developers, Red Hat Product Security, and the members
of linux-distros@openwall.


========================================================================
Timeline
========================================================================

2018-11-26: Advisory sent to Red Hat Product Security (as recommended by
https://github.com/systemd/systemd/blob/master/docs/CONTRIBUTING.md#security-vulnerability-reports).

2018-12-26: Advisory and patches sent to linux-distros@openwall.

2019-01-09: Coordinated Release Date (6:00 PM UTC).