Repetitions
On this page
When we want to match an expression multiple times, it would be cumbersome to repeat our expression. Instead, we can specify how often the expression should occur:
('r' | 'w' | 'x' | '-'){9}
This matches an r
, w
, x
or -
character 9 times. For example, it would match the string
rwxr-xr--
, or xxrr-xr-w
.
What if we want to match strings of different lengths? Repetitions are quite flexible, so we can specify a lower and upper bound for the number of repetitions:
('r' | 'w' | 'x' | '-'){3,9}
This matches between 3 and 9 characters. It’s also possible to omit the upper bound:
'la'{1,}
This matches the text la
at least 1 time. So it matches la
, but also lala
, lalala
,
lalalala
, and so on.
Abbreviations
Wanting to match someting at least once is so common that there’s a special syntax for it: The
+
.
'la'+
And there are two more special cases: {0,}
(repeated zero or more times) can be
written as *
, and {0,1}
can be written as ?
.
This leaves us with the following options:
Expression | Meaning |
---|---|
'la'{5,9} | Between 5 and 9 la s |
'la'{5,} | At least 5 la s |
'la'* | Any number of la s (including 0) |
'la'+ | At least 1 la |
'la'? | Maybe a la , maybe not |
Matching behavior
Pomsky expressions are often used to search for substrings in a text matching a particular pattern. For this, a regex engine is used; the Pomsky expression is first transformed into a regex (short for “regular expression”) by the Pomsky compiler, and then given to a regex engine. The regex engine then performs the search by walking over the text, until it finds a substring matching the regex.
But when the expression is repeated, how often should the regex engine attempt to repeat it? For
example, 'la'{2,4}
could repeat 2, 3 or 4 times. When searching the text
My favourite song goes like lalala la
, should it stop as soon as it found lala
, or should
it check if there is a third and fourth la
?
By default, regex engines are greedy: They try to repeat an expression as often as possible. Only
if that fails will they check if the expression matches with fewer repetitions. So in the above
example, the regex engine will give you the match lalala
. Since it is followed by a space, it
can’t match a fourth time.
It gets more interesting when a repetition is followed by another expression:
'la'{2,4} 'li'
Let’s see what happens when searching the string lalalalalali
for this pattern: The regex engine
first detects the string la
at the very start.
lalalalalali
^
It greedily attempts to repeat it 4 times, and succeeds. Now it is at the 9th character:
lalalalalali
^
Now it attempts to match the 'li'
part, but fails: There is no li
at the current
position. So the regex engine gives up the last repetition and tries again:
lalalalalali
^
This is called backtracking; think of it like wandering through a maze and trying out every path. Whenever we reach a dead end, we return to the previous junction and try the next path.
Unfortunately, the 'li'
part doesn’t match after the third repetition either, or
the second one. Now the regex engine has no more paths to explore, and gives up.
But it isn’t finished yet, maybe there is a match somewhere else in the string! So it returns to the start:
lalalalalali
^
Since the regex engine already tried at this position and failed, it skips to the next
occurrence of the substring la
, which at the 3rd character:
lalalalalali
^
Again, it tries to greedily repeat it four times, and succeeds!
lalalalalali
^
The next step is matching the 'li'
part, which succeeds, and the regex engine is
done. The matched substring is:
lalalalalali
^^^^^^^^^^
Not all regex engines use backtracking; a notable exception is Rust’s regex
library, which can
convert an expression to a lazy deterministic finite automaton (lazy DFA), a special kind of
state machine that never needs to backtrack.
However, “backtracking” is still a good mental model to understand what a regex engine does. Even
though Rust’s regex
library never backtracks, it always returns the same matches as a backtracking
regex engine would. It might just do it faster.