Repetition

Repetitions allow matching an expression multiple times.

Syntax

let Repetition = AtomExpression RepetitionSuffix*;

let RepetitionSuffix = RepetitionCount RepetitionMode?;

let RepetitionCount =
    | '*'
    | '+'
    | '?'
    | RepetitionBraces;

let RepetitionBraces =
    | '{' Number '}'
    | '{' Number? ',' Number? '}';

let RepetitionMode =
    | 'greedy'
    | 'lazy';

See AtomExpression.

There is a restriction that a ? and + repetitions may not appear immediately after another repetition (unless the first repetition is followed by lazy or greedy). This is to prevent confusion, since .*? means a lazy repetition and .*+ a possessive repetition in regular expressions.

Example

.{4,12}
.+ lazy
('test'{3,})?

Support

Repetition is supported in all flavors. In some flavors, repetition is not supported within a lookbehind assertion.

Behavior

Every repetition has a lower bound and an optional upper bound. The braces ({lower,upper}) are the canonical way to represent a repetition.

SyntaxLower, upper bound
.{,}0, infinity
.{4,}4, infinity
.{,10}0, 10
.{4,10}4, 10
.{5}5, 5
.?0, 1
.*0, infinity
.+1, infinity

There are two repetition modes, greedy and lazy repetition. In greedy mode (the default), the regex engine tries to match the repeated expression as often as possible, whereas in lazy mode, the regex engines tries to match it as few times as possible.

The default repetition mode can be changed with enable lazy; or disable lazy;.

Compilation

Pomsky first determines the lower and upper bound of each repetition. After variable and range expansion, it may simplify nested repetitions using rules like the following:

  • x{1} = x
  • ''{a,b} = ''
  • x{a,b}{c} = x{a·c,b·c}
  • x{a}{b,c} = x{a·b,a·c}
  • (x{1,a})? = x{0,a}
  • x{1,a}? = x{0,a}
  • x*{a,b} = x*
  • x{a,b}{c,d} = x{a·c,b·d} if both a and c are 0 or 1

Note that most of these optimizations are only valid if either both repetitions are lazy or both are greedy.

Pomsky will then produce a maximally compact repetition, using ?, + or ? if possible, or using {n}, {n,} or {m,n} otherwise. If the repetition is lazy, another ? is added. For example, 'x'{1,} lazy compiles to x+?.

Sometimes the repeated expression must be wrapped in a non-capturing group, e.g. 'test'_ is compiled to (?:test)_.

Security concerns

Repetition (especially when nested) can be extremely slow, exhibiting exponential runtime, when executing the regex in a backtracking regex engine. Most regex engines use backtracking.

From the regex flavors supported by Pomsky, only Rust never uses backtracking, so it can guarantee linear runtime performance with respect to the haystack × regex length.

History

  • Implemented basic optimizations im Pomsky 0.8
  • Made + following a repetition illegal in Pomsky 0.6
  • Made ? following a repetition illegal in Pomsky 0.3
  • Changed the default repetition mode from lazy to greedy in Pomsky 0.3
  • Implemented in Pomsky 0.1