Appendix B
Examples catastrophic backtracking

Introduction

In this appendix the evolvement of execution steps of some examples of catastrophic backtracking of regular expressions are visualized in a table. The described steps are how a human could understand the search iteration process. A regex debugger and/or regex engine may internally break down these steps in smaller steps and use intermediate steps that are not essential to understand the process.

Example (x+x+)+y

An often quoted example of catastrophic backtracking[1] involving regular expression (x+x+)+y applied to the string "xxxxxxxxxx" (10 x's).


const regex = /(x+x+)+y/;
const string = "xxxxxxxxxx"; // 10 x's
(x+x+) (x+x+) ... y explanation
x+ x+ x+ x+ ...
xxxxxxxxxx fail The first x+ matches all 10 x's. The second x+ fails. The first x+ backtracks one x.
xxxxxxxxx x fail fail The first x+ matches 9 x's. The second x+ matches one x. The group succeeds the first try (which is sufficient), the second try fails, since all characters have been matched already. The y fails, since all characters have been matched already. The second x+ cannot backtrack, because it matched only one x. The first x+ backtracks one x again.
xxxxxxxx xx fail fail The first x+ matches 8 x's. The second x+ matches xx. The group succeeds the first try (which is sufficient), the second group try fails. The y fails. The second x+ backtracks one x.
xxxxxxxx x x fail fail The second x+ now matches one x. The group matches and tries a second match. The first x+ of the second group try matches the leftover x, but the second x+ makes this second group try fail. The first group succeeded, which was sufficient, but the y fails (no characters left). The first x+ backtracks one x again, since the other 2 only matched one x.
xxxxxxx xxx fail fail The first x+ now matches 7 x's. The remaining x's are matched by the second x+. The group matches the first time, but a second try fails (no more characters left). The y fails as well. The second x+ backtracks one x.
xxxxxxx xx x fail fail The second x+ now matches xx. The group matches and tries a second match. The first x+ of the second group try matches the leftover x, but the second x+ makes this second group try fail (no characters left). The y fails as well. The second x+ backtracks one x.
xxxxxxx x xx fail fail The second x+ now matches x. The group matches and tries a second match. The first x+ of the second group try matches the leftover xx, but the second x+ makes this second group try fail (no characters left). The y fails as well. The first x+ of the second group backtracks one x.
xxxxxxx x x x fail Now the group succeeded twice for the first time; (7x,1x)(1x,1x). But still the y fails; no more characters left to match. The first x+ backtracks to 6 x's.
xxxxxx xxxx fail fail I think you've caught the drift by now...
xxxxxx xxx x fail fail
xxxxxx xx xx fail fail
xxxxxx xx x x fail (6x,2x)(1x,1x)
xxxxxx x xxx fail fail
xxxxxx x xx x fail (6x,1x)(2x,1x)
xxxxxx x x xx fail (6x,1x)(1x,2x)
etc. etc. etc.

After 2558 steps the engine finally figures out that no match can be found. Add one character "x" to the string and the engine needs 5118 steps, 12 x's need 10238 steps etc. So, lengthen the string by one character doubles the number of needed steps. The relation between the string length and the number of needed steps is exponential. With 19 x's, more than a million steps are needed.

Example ^(\d+)*$

An example of backtracking[2] involving regular expression ^(\d+)*$ applied to the strings "123x" and "1234x".


const regex = /^(\d+)*$/;
const string = "123x";

console.log(regex.exec(string));  // logs: null
step (\d+) (\d+) (\d+) (\d+) ... $ Explanation
1 123 fail fail \d+ matches "123". The group succeeds the first try (which is sufficient), the second try fails, since all digits have been matched already. The "x" fails, since it is not the end of the string ($). It backtracks into the group and subsequently one step into \d+ releasing "3".
2 12 3 fail fail The first match of the group is still intact (although "3" is out). The engine emerges to the second try of the group. The second try of the group matches the "3", the third try fails, since all digits have been matched. The "x" fails, since it is not the end of the string ($). The engine backtracks to the second group match, but there is only one character to give up, so it backtracks further to the first group match and then one step into \d+ to release "2".
3 1 23 fail fail The first match of the group is still a match. The engine emerges to the second try of the group. The second try of the group matches the "2" and "3", the third try fails, since all digits have been matched. The "x" fails, since it is not the end of the string ($). The engine backtracks to the second group match and then one step into \d+ to release "3".
4 1 2 3 fail fail The first match of the group is still a match. The second match of the group is still a match. The engine emerges to the third try of the group. The third try of the group matches the "3", the fourth try fails, since all digits have been matched. The "x" fails, since it is not the end of the string ($). The engine backtracks to the third group match and gives up this entire group match.
5 1 2 fail The first match of the group is still a match. The second match of the group is still a match. The "3" fails, since it is not the end of the string ($). The engine backtracks to the second group match and gives up this entire group match.
6 1 fail The first match of the group is still a match. The "2" fails, since it is not the end of the string ($). The engine backtracks to the first group match and gives up this entire group match.
7 fail The "1" fails, since it is not the end of the string ($). Now all the backtracking led to a fail. The engine returns to the beginning of the regex again and starts with the second character in the string all over again. However, the first element in the regex (^) only matches the first character in the string. Starting with a character other than the first one will always fail.

const regex = /^(\d+)*$/;
const string = "1234x";

console.log(regex.exec(string));  // logs: null
step (\d+) (\d+) (\d+) (\d+) ... $ Explanation
1 1234 fail fail
2 123 4 fail fail
3 12 34 fail fail
4 12 3 4 fail fail
5 1 234 fail fail
6 1 23 4 fail fail
7 1 2 34 fail fail
8 1 2 3 4 fail
9 1 2 3 fail
10 1 2 fail
11 1 fail
12 fail

Matching "123x" took 7 steps, matching "1234x" took 12 steps. The number of steps roughly doubles when one digit is added to the string. Following the above scheme, matching takes 2(n-1)+n steps, where n represents the number of digits in the string. The relation between the string length and the number of needed steps is exponential.

With short strings as in the above examples, the backtracking does not become catastrophic. But with only 21 digits, the number of steps is already more than a million. With 31 digits the number of steps exceeds the thousand million (109).

Example (\d+)*$

The same as the former example, but now without the "match the beginning of input" ^.


const regex = /(\d+)*$/;
const string = "123x";

console.log(regex.exec(string));  // logs: [ "" ] // empty string
step (\d+) (\d+) (\d+) (\d+) ... $ Explanation
1 123 fail fail \d+ matches "123". The group succeeds the first try (which is sufficient), the second try fails, since all digits have been matched already. The "x" fails, since it is not the end of the string ($). It backtracks into the group and subsequently one step into \d+ releasing "3".
2 12 3 fail fail The first match of the group is still intact (although "3" is out). The engine emerges to the second try of the group. The second try of the group matches the "3", the third try fails, since all digits have been matched. The "x" fails, since it is not the end of the string ($). The engine backtracks to the second group match, but there is only one character to give up, so it backtracks further to the first group match and then one step into \d+ to release "2".
3 1 23 fail fail The first match of the group is still a match. The engine emerges to the second try of the group. The second try of the group matches the "2" and "3", the third try fails, since all digits have been matched. The "x" fails, since it is not the end of the string ($). The engine backtracks to the second group match and then one step into \d+ to release "3".
4 1 2 3 fail fail The first match of the group is still a match. The second match of the group is still a match. The engine emerges to the third try of the group. The third try of the group matches the "3", the fourth try fails, since all digits have been matched. The "x" fails, since it is not the end of the string ($). The engine backtracks to the third group match and gives up this entire group match.
5 1 2 fail The first match of the group is still a match. The second match of the group is still a match. The "3" fails, since it is not the end of the string ($). The engine backtracks to the second group match and gives up this entire group match.
6 1 fail The first match of the group is still a match. The "2" fails, since it is not the end of the string ($). The engine backtracks to the first group match and gives up this entire group match.
7 fail The "1" fails, since it is not the end of the string ($). Now all the backtracking led to a fail. The engine returns to the beginning of the regex again and starts with the second character in the string all over again.
8 23 fail fail \d+ matches "23". The group succeeds the first try (which is sufficient), the second try fails, since all digits have been matched already. The "x" fails, since it is not the end of the string ($). It backtracks into the group and subsequently one step into \d+ releasing "3".
9 2 3 fail fail The first match of the group is still intact (although "3" is out). The engine emerges to the second try of the group. The second try of the group matches the "3", the third try fails, since all digits have been matched. The "x" fails, since it is not the end of the string ($). The engine backtracks to the second group match, but there is only one character to give up, so it backtracks further to the first group match which also have only one character to give up. The engine returns to the beginning of the regex again and starts with the third character in the string.
10 3 fail fail \d+ matches "3". The group succeeds the first try (which is sufficient), the second try fails, since all digits have been matched already. The "x" fails, since it is not the end of the string ($). It backtracks into the group and subsequently one step into \d+ where there is only one character to give up. The engine returns to the beginning of the regex again and starts with the last character in the string.
11 fail The group fails to match, but it passes, since zero matches is sufficient with *. Then it tries to match "z" with "end of string" $, which fails. The engine returns to the beginning of the regex again and starts at the end of the string.
12 The group fails to match, but it passes, since zero matches is sufficient with *. Then it tries to match the end of the string with "end of string" $, which succeeds. The engine returns a match; the end of the string. Some engines will report the match as an empty string, some as a null.

const regex = /(\d+)*$/;
const string = "1234x";

console.log(regex.exec(string));  // logs: [ "" ] // empty string
step (\d+) (\d+) (\d+) (\d+) ... $ Explanation
1 1234 fail fail
2 123 4 fail fail
3 12 34 fail fail
4 12 3 4 fail fail
5 1 234 fail fail
6 1 23 4 fail fail
7 1 2 34 fail fail
8 1 2 3 4 fail
9 1 2 3 fail
10 1 2 fail
11 1 fail
12 fail
13 234 fail fail
14 23 4 fail fail
15 2 34 fail fail
16 2 3 4 fail
17 34 fail fail
18 3 4 fail fail
19 4 fail fail
20 fail
21

Note that when the regex would have been (\d+)+$, the match would have failed.