Basics of regular expressions
What are Regular expressions?
In this example of appendix A a regular expression is used to check if a password, typed in by the user, contains at least one number, one uppercase letter, one lowercase letter, one special character and at least 8 characters. Later this chapter this example will be explained. Let's first look at a simpler example to get some idea of what regular expressions are.
The JavaScript wrapper object String
provides methods to manipulate a string.
Manipulating a string may require searching for some combination(s) of characters, within that string.
You may, for instance, want all instances of a specific combination of characters in a text to be replaced by an other combination of characters.
const myString = `I'm nobody! Who are you? Are you nobody, too?`;
const regEx = /nobody/gi;
console.log(myString.replace(regEx, 'somebody'));
// logs: "I'm somebody! Who are you? Are you somebody, too?"
In this example a new string was created with all the words "nobody" in the original string replaced by "somebody" in the new string.
This needed a search for all words "nobody" in the string.
To tell the method replace
what to search for, a regular expression was provided.
This is the expression /nobody/gi
in the example.
This regular expression specifies the search pattern, i.e., what character combinations to search for in the string.
Regular expressions are widely supported in text processing applications and in the more advanced text editors. Support of regular expressions is also implemented in most high level programming languages. Regular expressions follow the rules of a formalized standardized syntax, using constructions easy to type on a standard ASCII keyboard. JavaScript regular expressions follow a widely used Perl-like syntax. More about the syntax in the next paragraphs.
JavaScript objects
In JavaScript regular expressions are objects.
const regEx = /nobody/gi;
console.log(typeof regEx); // logs: "object"
There are two ways to create a regular expression object: by using a literal notation and by using an object constructor function.
- Regular expression literal: The parameters are enclosed between forward slashes, optionally followed by flags, and no quotation marks are used (it is not a string).
-
RegExp()
constructor function: The first argument includes the parameters, denoted as either a string literal or as a regular expression literal (without flags). The second argument includes the flags, denoted as a string literal.
const regExLit = /nobody/gi; // Regular expression literal
let regExFuncStr = new RegExp("nobody", "gi"); // Object constructor function
let regExFuncRe = new RegExp(/nobody/, "gi"); // Object constructor function
The literal notation results in compilation of the regular expression when the script is loaded. Use literal notation when the regular expression will remain constant.
The constructor results in runtime compilation of the regular expression. Use the constructor function only when you know the regular expression pattern will be changing. Using the constructor function when the regular expression will remain constant may unnecessarily affect performance in a negative way. For instance, constructing a constant regular expression in a loop, causes the constant regular expression to be needlessly recompiled on each iteration.
The basic comparison process
Building an effective and efficient regular expression (regex) requires a good understanding of how a regex engine handles the comparison. What are the steps the regex engine takes in the comparison process? A lot of this chapter will be devoted to answering that question. A lot of the regex syntax will be introduced and explained in the many examples. I recommend Regular Expressions 101 as an online tool to test, explore and debug your regular expressions.
Regex engines evaluate text and regular expression from left to right. Some programming languages also allow regex engines to evaluate text from right to left, but not in JavaScript.
In examples below methods exec()
and match()
are used.
In methods using regular expressions
these methods are explained in more detail. They are basically used to retrieve the matching result.
As mentioned, regular expressions are used to describe a search pattern. Therefore, a regular expression generally contains literal characters to search for,
often referred to as "literals".
Regular expression /now/
, for instance, matches the exact sequence of characters "now" when it occurs in a string:
no match will be found in the string "no way!" ("no w" does not equal "now") but it will find a match in
the string "What do you know about regular expressions?". Literal characters include spaces and newline characters!
First let's have a look at how a regular expression engine processes a regular expression without backtracking. Backtracking occurs when a regular expression contains quantifiers. Later more about this. Describing the process of comparing step by step may seem rather trivial and verbose, but it is actually important in understanding how regular expressions work and how they should be constructed.
When applying a regex to a string, a regex engine tries (from left to right) to match the first element in line in the regex with the first character in line in the string. When there is a match, the engine advances to the next element in the regex and to the next character in the string. When there is no match, the regex engine returns to the beginning of the regex and goes (back) to the character in the string where the next possible match may be found.
This process continues until all the elements of the regex have found a match or failed to find a match. If all the elements of the regex have found a match, the engine reports success, the match itself and where the match was found in the string (the index). The regex engine is lazy; it will stop immediately after the regex matched, even if there may be more matches in the rest of the string (unless the global flag is set, later more about this). If the last character in the string have been tried and the rexex never matched, the engine stops and reports failure.
const text = 'Supercalifragilisticexpialidocious';
console.log(/a/.exec(text)); // logs: [ "a" ]
console.log(/pa/.exec(text)); // logs: null
console.log(/perl/.exec(text)); // logs: null
In the example above, with the first regex /a/
, the engine tries to match the first element a
with the first character in the string "S".
This fails, so the engine returns to the beginning of the regex and compares again the a
, now with the next character in the string "u".
This fails again. This continues until a
is compared with the first "a" in the string, which is a match. There are no more elements in the
regex, so the entire rexex found a match and the engine stops and reports success.
With regular expression /pa/
, the first element p
is compared with the first character in the string "S", which fails to match.
This continues in the same way as in the first rexex until p
matches the first "p" in the string. Then the engine advances one step in both the
regex and the string. The a
from the regex is compared with the "e" in the string, which fails to be a match.
The engine returns to the beginning of the regex again and steps to the character where the next possible match may be found.
The engine "knows" now that /pa/
cannot find a match starting at the first "p" in the string, so it steps to the "e" right after "p".
The p
does not match the "e", so the engine advances character by character through the string until p
matches the second "p" in the string. But then the a
does not match the "i" and the p
again is compared with the
"i", "a", "l" etc. until all characters have been tried, without success. No match found.
With regular expression /perl/
, after failing "S" and "u", p
finds a match with the first "p" in the string and
subsequently e
matches "e" and r
matches "r". Then l
fails to match "c". The engine returns to the beginning
of the regex again but to what character in the string does it step?
The engine "knows" now that /perl/
cannot find a match starting at the first "p" in the string,
so it steps back in the string to the "e" after the first "p". Failing to find a match continues to the end of the string.
No match found.
Special characters and escaping
To be able to write more advanced regular expressions, special characters can be combined with literal characters to search for.
For instance, in /no*/
the *
is a special character. It matches the preceding character zero or more times.
So, /no*/
matches "noo" in "noob" and "n" in "The poor dirt farmer's old nag".
If you want to include a special character literally as the character itself, you can escape it using a backslash, much like it is done in escape sequences in a string.
const myString = "Use the * to call out a footnote.";
const regEx = /\*/;
console.log(myString.replace(regEx, 'asterisk'));
// logs: "Use the asterisk to call out a footnote."
The backslash itself is also a special character and so it also needs to be escaped if it needs to be taken literally.
For example const myRegExpr = /\\/g;
matches all instances of a backslash in a string.
A forward slash ("/") is not a special character, but in a regular expression literal it still needs to be escaped
if it needs to be taken literally, otherwise it terminates the pattern.
For instance, to search for all instances of two slashes ("//") followed by any text to the end of the line,
you can use regular expression const myRegExpr = /\/\/.*/g;
.
Both slashes are escaped ("\/\/"). The .*
matches zero or more times "wildcard" .
, i.e., any character until a line terminator.
The g
after the last slash is a flag,
in this case indicating that the regular expression should be tested against all possible matches in the string.
When using the RegExp()
constructor function, the first argument can be a string.
In his case backslashes need to be escaped with a second backslash because a backslash also acts as an escape in a string.
To avoid an abundance of slashes, always use a regex literal as first argument of a RegExp()
constructor.
Next to special characters, there are also compound regular expression syntax constructions,
constructed with characters that may individually have different meaning than when used in such a construction.
Even non-special characters may be part of these constructions in which they do have special meaning.
For instance, construction \w
matches any alphanumeric character from the basic Latin script, but the
w
on its own is just a literal character "w" in a regular expression.
console.log("We have a wish".match(/w\w*/gi)); // logs: [ "We", "wish" ]
A list of special characters and regular expression syntax constructions can be found in MDN's Regular expression syntax cheatsheet.
Searching with flags
As we have seen, regular expressions have optional flags.
Multiple flags in any order can be used in an expression. Flags are part of the regular expression.
Flags allow for functionality like global searching (g
flag) or case-insensitive searching (i
flag).
A list of flags can be found in MDN's
Regular expressions -- Advanced searching with flags.
The s
, u
and y
flags will be covered later.
The m
flag enables a multi-line search. It affects the behavior of ^
and $
.
The caret ^
matches at the beginning of the string, and the dollar $
matches at the end.
Both expressions do not return a match, they only return whether a match failed or succeeded.
With the m
flag enabled, ^
matches immediately after a line terminator
and at the beginning of the string.
With the m
flag enabled, $
matches immediately before a line terminator
and at the end of the string.
Line terminators are \n
, \r
\u2028
and \u2029
.
Pressing the ↵ Enter key inserts a newline (\n
) in the code.
const str = `tree
rose\nfish\rfire
bird`;
console.log(str.match(/^fish/)); // logs: null
console.log(str.match(/^fish/m)); // logs: [ "fish" ]
console.log(str.match(/fish$/m)); // logs: [ "fish" ]
console.log(str.match(/^\w/gm)); // logs: [ "t", "r", "f", "f", "b" ]
console.log(str.match(/\w$/gm)); // logs: [ "e", "e", "h", "e", "d" ]
Repetition
Quantifiers or repetition operators are constructs that attempt to match the preceding regex element a certain number of times.
{}
-
Matches the preceding element a minimum and maximum number of times, specified by integers enclosed by curly brackets.
For example:
x{2}
matches exactly 2 occurrences of the precedingx
,
x{2,}
matches at least 2 occurrences of the precedingx
,
x{2,10}
matches at least 2 and at most 10 occurrences of the precedingx
. *
- Matches the preceding element 0 or more times. Equivalent to
{0,}
. +
- Matches the preceding element 1 or more times. Equivalent to
{1,}
. ?
-
Matches the preceding element 0 or 1 times. Equivalent to
{0,1}
. This makes the preceding element optional:neighbou?r
matches both neighbor and neighbour.
const text = 'Oooh nooo, one Oompa-Loompa!';
console.log(text.match(/no*/gi)); // logs: [ "nooo", "n" ]
console.log(text.match(/no?/gi)); // logs: [ "no", "n" ]
console.log(text.match(/o+/gi)); // logs: [ "Ooo", "ooo", "o", "Oo", "oo" ]
console.log(text.match(/o{3}/gi)); // logs: [ "Ooo", "ooo" ]
// /no*/gi is the same as: /no{0,}/gi
// /no?/gi is the same as: /no{0,1}/gi
// /o+/gi is the same as: /o{1,}/gi
Consider the next code sample. The steps the engine takes in the comparison will be explained underneath. Again, this is important for understanding regular expressions.
const text = '#Incomprehensibilities';
console.log(/Ic*/.exec(text)); // logs: [ "I" ]
console.log(/\w*/.exec(text)); // logs: [ "" ] // this is NOT a fail!
console.log(/Ic+/.exec(text)); // logs: null // this regex returned a fail
console.log(/I\w*/.exec(text)); // logs: [ "Incomprehensibilities" ]
console.log(/\w+/.exec(text)); // logs: [ "Incomprehensibilities" ]
console.log(/I\w?/.exec(text)); // logs: [ "In" ]
In the above example /Ic*/
first tries to match I
with the first character (index 0, "#") in the string,
which fails. Then the engine steps back to the beginning of the regex and tries to match I
with character "I" on index 1 in the string.
This match succeeds. The engine advances to the next element in the regex c*
and tries to match this with character "n" on index 2 in the string.
This is a pass! Although c
does not match "n", zero matches is a pass for the *
quantifier!
The comparing stops because even though it was a pass, it was not a match. The next step is not comparing c
with the "c" (index 3) in the string.
All the elements in the regex have succeeded.
The engine returns a match for the entire regex: an "I" for the first regex element followed by an empty string
for the "zero matches" of the second regex element.
In the above example /\w*/
first tries the first character (index 0, "#") in the string, which is a pass, but not a match!
Although "#" is not an alphanumeric character from the basic Latin script, zero matches is a pass for the *
quantifier!
The comparing stops and the JavaScript regex engine returns an empty string for zero matches
(although regex engines other than the one in JavaScript may return null
for a match).
The index of where the regex succeeded in the string is 0 (although no match was found here).
Note that regex /\w/
(without *
) would have resulted in "I".
In the above example /Ic+/
finds a match for the first element in the regex I
and the second character "I" (index 1) in the string.
Then the second element in the regex c+
fails to match character "n" (index 2) in the string.
Quantifier +
demands at least one match between c
and the string character, which is not the case.
The engine returns to the beginning of the regex and tries to match I
with "n" (index 2), which fails. Comparing
I
with all the rest of the characters also fails to find a match. The engine reports a fail for the entire regex; no match found.
In the above example /I\w*/
finds a match for the first element in the regex I
and the second character "I" (index 1) in the string.
Then the second element in the regex \w*
gets compared with character "n" (index 2) in the string, which succeeds.
Not only does it succeed because zero matches is already sufficient with the *
quantifier, \w
also actually matches "n".
Now it does not stop the comparing. It continues to compare \w*
with the rest of the characters in the string
until it fails to match, which, again, does not fail the regex element, but it ends the comparing.
Backtracking, greedy and lazy
The default behavior of quantifiers is to repeat the preceding character as many times as it can. Quantifiers are said to be greedy by default. Suppose we have a text and we want to match all HTML tags in it using a regex.
const text = 'This is <em>not</em> what I expected!';
const regex = /<.+>/g;
console.log(text.match(regex)); // logs: [ "<em>not</em>" ]
What happened here? After the match with the first "<" the regex engine starts repeatedly matching any next character
(including > and <) until there are no more characters left, as .+
demands.
This is due to the greediness of +
. So now we have the result: "<em>not</em> what I expected!"
Then the engine takes the next (last) regex element >
of the regular expression into consideration,
but there are no more characters left to test against. The engine realizes that it probably matched too many characters and starts to
backtrack.
It takes one step back in the .+
iteration and tries to compare >
with the last character in the string "!", which fails.
It again takes one step back in the iteration and tries to compare >
with the next character "d" "freed" from the iteration,
which fails again.
This continues until >
matches the second ">" in the string. This results into the match "<em>not</em>".
As mentioned before, the regex engine will report the first valid match it finds.
It will not continue backtracking further to see if there is another possible match.
Quantifiers can be set to be lazy by putting a question mark directly after the quantifier: +?
, *?
, ??
,
{2,}?
. So, a question mark in a position like this has a different meaning than the quantifier ?
.
Now it repeats the preceding character the minimal number of times, taking note that it should backtrack in case the next regex element fails.
const text = 'This is <em>not</em> what I expected!';
const regex = /<.+?>/g;
console.log(text.match(regex)); // logs: [ "<em>", "</em>" ]
Now, after the match with the first "<" the .
matches the next character "e",
because the +
quantifier requires at least one match. This is the minimum number of matches.
The result so far is "<e".
The engine takes note that it should backtrack in case the next regex element fails.
Then the engine advances to the next (last) regex element >
and tries to match it with
the next character in the string "m", which fails. The engine backtracks to the dot .
in the regex again,
and tries to match it with "m", which succeeds.
Again the engine steps to the next (last) regex element >
and tries to match it with the next character in the string ">",
which now succeeds. The engine has now completed the regex without ending in a insurmountable failure and found "<em>" as the result.
The g
flag makes the engine then continue with the next character in the string and starting the regex from the beginning again.
In many (but not all) cases an alternative to using a lazy quantifier with a character class is to use a
negated character set [^x]
, meaning it matches anything but x
(see next section "Character sets").
In this case backtracking is not needed, which may make this alternative algorithm noticeable faster.
const text = 'This is <em>not</em> what I expected!';
const regex = /<[^>]+>/g;
console.log(text.match(regex)); // logs: [ "<em>", "</em>" ]
Character sets
A character to match any of a set of specific characters enclosed in square brackets. Such set is a character set.
For example, character set [aeiou]
matches a character that is any of the enclosed vowels.
A range of characters can be specified by using a hyphen. For example, [abcd]
is the same as [a-d]
.
As we have seen, character class \w
matches any alphanumeric character in the basic Latin script, including the underscore.
We can also write this as [A-Za-z0-9_]
.
A negated or complemented character set matches anything that is not enclosed in the square brackets.
For example, [^aeiou]
matches a character that is not any of the enclosed vowels.
Character class \W
is the same as [^A-Za-z0-9_]
.
const text = 'The discount is 50%.';
const regex = /[^A-Za-z0-9_]/g;
console.log(text.match(regex)); // logs: [ " ", " ", " ", "%", "." ] // 3 spaces, the "%" and the full stop.
Special characters like .
or ?
,
inside a character set, lose their special meaning and match the literal character.
Special characters part of a set usually do not need to be escaped, except:
- A hyphen
-
not at the beginning or end:[a\-b]
matches "a", "-" or "b". - A caret
^
only at the beginning:[\^ab]
matches "^", "a" or "b". - A closing square bracket
]
:[a\]b]
matches "a", "]" or "b". - A backslash
\
:[a\\b]
matches "a", "\" or "b".
Character class escape sequences, like \w
or \d
, remain having their special meaning when part of a set.
const text = '[2^(-2)]+1.2 = 1.45';
const regex = /[-^[\]().+=\d]/g;
console.log(text.match(regex)); // logs: [ "[", "2", "^", "(", "-", "2", ")", "]", "+", "1", ".", "2", "=", "1", ".", "4", "5" ]
const text = 'This \\ is a backslash.'; // The backslash in a STRING literal needs to be escaped.
const regex = /[\\]/g;
console.log(text); // logs: "This \ is a backslash."
console.log(text.match(regex)); // logs: [ "\\" ] // It logs an ESCAPED backslash!
Note that nesting sets, like [\w[^_]]
, is not possible.
The s
flag Allows .
to match newline characters.
Expression /./s
is equivalent to /[^]/
, since the latter means matching any character (including newline characters) except nothing.
Alternation
Alternation constructs match either one or the other, like in a|b
a character matches either a
or b
.
Set [ab]
matches exactly the same as a|b
, [abc]
matches exactly the same as a|b|c
.
The difference is that character sets only allow characters or character classes (\w
, \d
etc.).
Alternation constructs allow any regex on either side.
const text = 'Is it "color gray" or "colour grey"?';
const regex = /colou?r gray|grey/g;
console.log(text.match(regex)); // logs: [ "color gray", "grey" ]
The regex in the example above matches colou?r gray
or grey
. To get a correct pattern, parentheses need to be used
to group the alternation construct (see next section "Grouping and capturing").
const text = 'Is it "color gray" or "colour grey"?';
const regex = /colou?r (gray|grey)/g;
console.log(text.match(regex)); // logs: [ "color gray", "colour grey" ]
/* Next 3 regexes all do exactly the same:
/colou?r (gray|grey)/g
/colou?r gr(a|e)y/g
/colou?r gr[ae]y/g
*/
Grouping and capturing
Parts of a regular expression can be grouped together by placing these parts inside parentheses (()
).
This way you can, for instance, apply a quantifier to an entire group.
let text = "Did you say boo, hoo or boohoo?";
console.log(text.match(/boo(hoo)?/g)); // logs: "boo", "boohoo" ]
text = "The URLs http://subdomain.somewebsite.com/examples and ftp://somewebsite.com/"
console.log(text.match(/(https?|ftp):\/\/[^\s]+/g)); // logs: [ "http://subdomain.somewebsite.com/examples", "ftp://somewebsite.com/" ]
The grouped alternation construct https?|ftp
in the example above matches either
"http(s)" or "ftp". In this particular regex it needs to be grouped , otherwise
the rest of the regex would be part of the expression after the |
.
A group is per default a capturing group.
Capturing groups match what is between parentheses and remember the match.
However, groups are not necessarily intended to create a capture;
the use of parentheses can also serve to apply repetition or alternation to a group, as just showed.
If no capture is needed, you can optimize a group by explicitly marking it as a non-capturing group ((?:x)
),
like in /boo(?:hoo)?/
.
The captured matches of capturing groups can, for instance, be used in a method.
In the code sample below $1
and $2
are special replacement patterns.
They insert the captured matches of respectively the first and second parenthesized capturing groups.
(see MDN Web Docs -- String.prototype.replace() -- Specifying a string as a parameter).
const regex = /(\w+)\s(\w+)/; // regular expression with two capturing groups, matching successively the first and second word in the string.
let name = 'John Doe';
let nameReversed = name.replace(regex, '$2, $1');
console.log(nameReversed); // logs: "Doe, John"
The captured matches of capturing groups can be used in the result or in a method (example above), but they can also be used in the pattern itself.
It is possible to back reference to the nth (\n
) or a named (\k<Name>
) capturing group in the regular expression.
const text = `He said: "don't quote me on that".`;
const regex = /(['"]).*?\1/;
console.log(regex.exec(text)); // logs: [ '"don't quote me on that"' ]
The first found quote sign is captured as a "
. It is back referenced in the regex by using \1
(a reference to the first capturing group). Therefore the match does not stop at the apostrophe quote in "don't".
The same result can be achieved by using named capture groups and named back references.
const text = `He said: "don't quote me on that".`;
const regex = /(?<myQuote>['"]).*?\k<myQuote>/;
console.log(regex.exec(text)); // logs: [ '"don't quote me on that"' ]
Capturing groups and back references cannot be used inside character sets
(like [^\1]
or [^(.*)]
). The number 1 and the parentheses are just treated as literal characters.
Let's apply the regex in the last code sample to an other string and follow the steps the engine makes.
const text = `Descartes's dictum "I think therefore I am" from "Discourse on Method".`
const regex = /(['"]).*?\1/;
console.log(regex.exec(text)); // logs: [ '"I think therefore I am"' ]
The first match the engine finds is the "'" (index 9), matched by ['"]
in the regex. This element is in a capturing group, so
it remembers the "'", which can be back referenced to as group 1. Then the engine steps to regex element .*?
, which is immediately
passed over, since the minimum requirement of *
is zero occurrences and the quantifier is set to act lazy.
The engine though, takes note that it should backtrack in case the next regex element fails.
Then the next regex element \1
, which represents the captured "'", is compared with "s" (index 10), which fails to match.
The engine backtracks to the dot .
, which matches with the "s" (index 10).
The next character is compared with \1
and fails, the engine backtracks and find a match
and this continues all the way to the last character in the string.
Then It tries to match the \1
again because this element has not found a match yet, but no characters are left to test against.
This path failed, so an other path needs to be tried. The engine backtracks even further to the captured group. It tries
to match ['"]
again, with the character in the string where the next possible match may be found. The last failed path
started with the "'" character, so now it tries the next character "s". It fails, but now the engine has not encountered the quantifier yet;
it does not backtrack. The engine returns to the beginning of the regex again and tries the next character. This continues until
the quote """ (index 19) is matched. The capture gets a different value! The engine steps to the next regex element .*?
,
which is immediately passed over. The next regex element \1
,
which now represents the captured """, is compared with "s" (index 10), which fails to match.
The engine backtracks to the dot .
, which matches with the "s" (index 10).
The next character is compared with \1
and fails, the engine backtracks and find a match
and this continues until \1
matches the next """ (index 42). Now the entire regex found a match:
"I think therefore I am".
Note that with a g
flag, it would also have matched "Discourse on Method".
Catastrophic backtracking
Backtracking ensures that all potential matches are tried before the engine concludes that no match can be found. The danger is that when quantifiers are nested, the number of potential paths and steps the engine follow may become catastrophically large. Catastrophic in the sense that it causes stalling or even crashing the application.
An example of catastrophic backtracking
is the often quoted regex of the form
(x+x+)+y
applied to a string "xxxxxxxxxx" (10 x's).
The evolvement of steps taken by the engine are visualized in
Appendix B, along with an other example of catastrophic backtracking.
The relation between the string length and the number of needed steps is exponential in these examples.
With a string length of about 20 characters already more than a million steps are needed.
The examples are simplified forms of more complex regex constructions.
The x+x+
in regex (x+x+)+y
may stand for more complex regex constructs where for instance "anything"
(like .+
) is followed by "something" (like [^,]+
).
In general, a regex containing nested quantifiers, almost always has the potential of running a catastrophic course.
Nested quantifiers are quantifiers (repetition constructs) in a group that is itself repeated again.
Also nested alternation constructs (x|y
) or alternation constructs nested with repetition constructs (quantifiers)
should be considered suspicious.
Of course, in a "quick and dirty" solution you may escape by making sure that a string remains short or always meet a certain requirement.
Like in the example of (x+x+)+y
, things only go south when the string has a length
of some significance and ends with any character different than a "y".
Testing a regex in a debugger (with a maximized number of steps) before implementing is always recommendable.
But then, what to do if it predicts a disaster?
Preventing catastrophic backtracking
Sometimes rewriting a rexex into an equivalent expression solves the problem. The expression (x+x+)+y
, for instance,
can easily be transformed into x{2,}y
. Both match the same pattern, if the former would not be inclined to lose itself in endless backtracking.
Unfortunately, rewriting a regex is not always this easy or it is even impossible.
Making the quantifier(s) lazy provides no solution; it will still frenetically start backtracking, only in a different order.
Fortunately, there are options that may prevent catastrophic backtracking.
An option is to use quantifiers that do not backtrack. Quantifiers can be turned into possessive quantifiers by suffixing
a +
. For instance ^(\d+)*+$
or (x+x+)++y
. The engine will find a match if there is one, and if
the last character spoils a match, the engine will fail on the first try, with no need to backtrack.
The repeated group can also be made "atomic". Atomic groups (e.g. (?>(x+x+)+)y
)
do not remember any position to possibly backtrack to, hence, backtracking into such a group is impossible.
Unfortunately, JavaScript does not support possessive quantifiers nor does it support general atomic groups.
However, JavaScript does support lookaround assertions, which are a specific kind of atomic groups.
With a lookahead assertion ((?=x)
) and back-referencing to a captured group,
a possessive quantifier (e.g. *+
) can be simulated (see next paragraph).
Lookaround assertions
Lookaround assertions are lookahead and lookbehind assertions that are used to see if something is (not) directly followed or preceded by something else.
The lookaround expression does not return a match, it only returns whether a match failed or succeeded.
The lookaround assertions are the lookahead assertion ((?=x)
), the lookbehind assertion ((?<=x)
),
the negative lookahead assertion ((?!x)
) and the negative lookbehind assertion ((?<!x)
).
Before using lookbehind assertions ((?<=x)
and (?<!x)
),
check their browser support:
at the time of writing this, Safari does not support them.
Lookahead assertion (?=x)
passes only if the current position in the string is directly followed by "x".
const text = "lookaround";
const regex = /o(?=u)/;
console.log(regex.exec(text)); // logs: [ "o" ]
In the next example the lookahead assertion looks for a "u". It fails to find one at the first character.
The engine returns to the beginning of the regex again and to the next character in the string.
The engine repeats this procedure until it finds a match at index 7. The lookahead assertion succeeds, but the match is not stored.
The engine is still before the first character in the string. The regex finished and succeeded!
There was a match, but the match is empty since it was not stored, or "given up" if you like.
The engine returns an empty string (although regex engines other than the one in JavaScript may return null
as a "match").
const text = "lookaround";
const regex = /(?=u)/;
console.log(regex.exec(text)); // logs: [ "" ] // an empty string
In the next example the "o" right before the "u" is matched. The lookahead match "u" following "o" is passed over. The next character to try is the "u" again.
Then the n
tries to match the "u", which fails. No other "o" followed by "u" is found: the total match fails.
const text = "lookaround";
const regex = /o(?=u)nd/;
console.log(regex.exec(text)); // logs: null
In the next example the engine returns a fail after 10 steps.
The group ((x+x+)+)
finds a match with the 10 x's and captures this.
The Lookahead assertion succeeds but passes over its match and the engine starts to look for
\1
, which is the back-reference to the captured group "xxxxxxxxxx", which the engine finds.
Then the engine tries to match the "y" but fails because all characters have been matched already.
Now the engine does not backtrack into the group (atomic group), because the group has been given up
and \1
has no positions to backtrack into! The engine returns to the beginning of the regex and
to the next character of the string. This repeats until the end of the 10 x's. No match found.
const text = "xxxxxxxxxx";
const regex = /(?=((x+x+)+))\1y/; // equivalent to (x+x+)++y, which is not supported in JS.
console.log(regex.exec(text)); // logs: null
Thus (?=(a+))\1
, or with a named back reference (?=(?<quant>a+))\k<quant>
,
simulates the possessive quantifier a++
, which is not supported in JavaScript.
In this example of appendix A
a user input string is tested for matching a pattern defined by regular expression
/(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}/
. This pattern matches a string containing
at least one number, one uppercase letter, one lowercase letter, one special character (!@#$%^&) and at least 8 characters.
This regular expression contains 4 lookahead assertions. How does this regex work? First a simpler example.
In the next example the inner lookahead regex .*\d
matches "ABCabc4".
The lookahead assertion succeeded and erases the match from memory.
The fact that the lookahead assertion succeeded means that the string contains at least one digit.
If the string would not contain a digit, the lookahead assertion would fail and consequently the whole regex would fail.
The regex engine is still at index 0 in the string and the next element in the regex B
is tried, which fails to match "A".
The regex engine returns to the beginning of the regex again and to the next character in the string.
The lookahead assertion succeeds again and now the B
matches "B" in the string. The whole regex succeeded, finding the match "B".
Note that a lookaround assertion is atomic, meaning that no backtracking into the group occurs.
However, when the inner lookahead regex .*\d
tries to find a match, backtracking does occur, since
*
is a greedy quantifier. It backtracks from "ABCabc4efg" to the match "ABCabc4".
const regex = /(?=.*\d)B/;
const str = 'ABCabc4efg';
const match = regex.exec(str);
console.log(match); // logs: [ "B" ]
So, in the original regular expression /(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}/
if none of the 4 lookahead assertions fails (at least one number, one uppercase letter, one lowercase letter and one special character),
it tries matching at least 8 successive occurrences of any characters (except line terminators) from the beginning of the string.
By the way: the HTML input
element provides a pattern
attribute to
control user input by giving it a regular expression value. No JavaScript is needed.
<form action="#" onSubmit="return false">
<label>Password:
<input type="password" pattern="(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}" required />
</label>
<input type="submit" value="Submit" />
</form>