Regular expressions (regexes) are widely used across computer science applications and software components. However, not all programming languages are immune to regular expression denial-of-service (ReDoS) attacks. These algorithmic complexity attacks target the core of the regex engines by chewing up the resources with regex evaluations that take super-linear time to operate, thereby denying invaluable resources to legitimate clients. The paper discusses the addition of a caching mechanism to Spencer’s regex engine implementation that practically reduces most super-linear regex evaluations to linear evaluations. This drastically reduces the attack surface area for ReDoS and also improves performance for most evaluations. Software designers delivering modules that rely heavily on regex evaluations should find these conclusions interesting.
Popular programming languages like Python, JavaScript, and Ruby use regex engines that implement Spencer’s algorithm. The worst-case time complexity for Spencer’s is super-linear for 19 to 38 percent of regular expressions. Nevertheless, 95 percent of these super-linear regexes can be evaluated linearly. On the one hand, language developers prefer Spencer’s for extensibility; software engineers, on the other hand, rely on heavyweight regexes. Hence there is a need to make these regex engines safe from ReDoS.
The paper describes the method and prototype results for adding a cache “in a simple backtracking regex engine implementation published by Cox.” While Spencer’s method blows up with super-linear regexes, the cache-based implementation performs linearly in the worst case. Though a significant number of real regexes depict worst-case super-linear behavior, it is entirely possible to support these at linear time with a modest space cost.