Repetitions

When we want to match an expression multiple times, it would be cumbersome to repeat our expression. Instead, we can specify how often the expression should occur:

('r' | 'w' | 'x' | '-'){9}

This matches an r, w, x or - character 9 times. For example, it would match the string rwxr-xr--, or xxrr-xr-w.

What if we want to match strings of different lengths? Repetitions are quite flexible, so we can specify a lower and upper bound for the number of repetitions:

('r' | 'w' | 'x' | '-'){3,9}

This matches between 3 and 9 characters. It’s also possible to omit the upper bound:

'la'{1,}

This matches the text la at least 1 time. So it matches la, but also lala, lalala, lalalala, and so on.

Abbreviations

Wanting to match someting at least once is so common that there’s a special syntax for it: The +.

'la'+

And there are two more special cases: {0,} (repeated zero or more times) can be written as *, and {0,1} can be written as ?.

This leaves us with the following options:

ExpressionMeaning
'la'{5,9}Between 5 and 9 las
'la'{5,}At least 5 las
'la'*Any number of las (including 0)
'la'+At least 1 la
'la'?Maybe a la, maybe not

Matching behavior

Pomsky expressions are often used to search for substrings in a text matching a particular pattern. For this, a regex engine is used; the Pomsky expression is first transformed into a regex (short for “regular expression”) by the Pomsky compiler, and then given to a regex engine. The regex engine then performs the search by walking over the text, until it finds a substring matching the regex.

But when the expression is repeated, how often should the regex engine attempt to repeat it? For example, 'la'{2,4} could repeat 2, 3 or 4 times. When searching the text My favourite song goes like lalala la, should it stop as soon as it found lala, or should it check if there is a third and fourth la?

By default, regex engines are greedy: They try to repeat an expression as often as possible. Only if that fails will they check if the expression matches with fewer repetitions. So in the above example, the regex engine will give you the match lalala. Since it is followed by a space, it can’t match a fourth time.

It gets more interesting when a repetition is followed by another expression:

'la'{2,4} 'li'

Let’s see what happens when searching the string lalalalalali for this pattern: The regex engine first detects the string la at the very start.

lalalalalali
^

It greedily attempts to repeat it 4 times, and succeeds. Now it is at the 9th character:

lalalalalali
        ^

Now it attempts to match the 'li' part, but fails: There is no li at the current position. So the regex engine gives up the last repetition and tries again:

lalalalalali
      ^

This is called backtracking; think of it like wandering through a maze and trying out every path. Whenever we reach a dead end, we return to the previous junction and try the next path.

Unfortunately, the 'li' part doesn’t match after the third repetition either, or the second one. Now the regex engine has no more paths to explore, and gives up.

But it isn’t finished yet, maybe there is a match somewhere else in the string! So it returns to the start:

lalalalalali
^

Since the regex engine already tried at this position and failed, it skips to the next occurrence of the substring la, which at the 3rd character:

lalalalalali
  ^

Again, it tries to greedily repeat it four times, and succeeds!

lalalalalali
          ^

The next step is matching the 'li' part, which succeeds, and the regex engine is done. The matched substring is:

lalalalalali
  ^^^^^^^^^^

Not all regex engines use backtracking; a notable exception is Rust’s regex library, which can convert an expression to a lazy deterministic finite automaton (lazy DFA), a special kind of state machine that never needs to backtrack.

However, “backtracking” is still a good mental model to understand what a regex engine does. Even though Rust’s regex library never backtracks, it always returns the same matches as a backtracking regex engine would. It might just do it faster.