Formal grammar

Summary

This document uses Pomsky syntax to describe Pomsky’s syntax. Here’s an incomplete summary, which should be enough to read the grammar:

  • Variables are declared as let var_name = expression;. This assigns expression to the variable var_name.

  • Verbatim text is wrapped in double quotes ("") or single quotes ('').

  • A * after a rule indicates that it repeats 0 or more times.

  • A + after a rule indicates that it repeats 1 or more times.

  • A ? after a rule indicates that the rule is optional.

  • Rules can be grouped together by wrapping them in parentheses (()).

  • Alternative rules are each preceded by a vertical bar (|).

Formal grammar

Comments start with # and end at the end of the same line. Comments and whitespace are ignored; they can be added anywhere.

Expression

let Expression = Statement* OrExpression;

let Statement =
    | LetDeclaration
    | Modifier;

let LetDeclaration = 'let' Name '=' OrExpression ';';

let Modifier = ModifierKeyword BooleanSetting ';';

let ModifierKeyword =
    | 'enable'
    | 'disable';

let BooleanSetting =
    | 'lazy'
    | 'unicode';

OrExpression

let OrExpression = ('|'? Alternatives)?;

let Alternatives = Alternative ('|' Alternative)*;

let Alternative = FixExpression+;

FixExpression

An expression which can have a prefix or suffix.

let FixExpression =
    | LookaroundPrefix Expression
    | AtomExpression RepetitionSuffix?;

Lookaround

let LookaroundPrefix =
    | '!'? '<<'
    | '!'? '>>';

Repetitions

let RepetitionSuffix = RepetitionCount Quantifier?;

let RepetitionCount =
    | '*'
    | '+'
    | '?'
    | RepetitionBraces;

let RepetitionBraces =
    | '{' Number '}'
    | '{' Number? ',' Number? '}';

let Quantifier =
    | 'greedy'
    | 'lazy';
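
As an illustration of the RepetitionBraces rule, here is a minimal Python sketch that turns a brace expression into a pair of bounds. The function name parse_repetition_braces is hypothetical. Treating a missing lower bound as 0 and a missing upper bound as unbounded is an assumption that the grammar itself does not state. The Number pattern follows the Number token described under Tokens.

```python
import re

# Number token: no sign, no leading zeros.
NUMBER = r"(?:0|[1-9][0-9]*)"

# '{' Number '}'  or  '{' Number? ',' Number? '}'
# Whitespace between tokens is skipped, since the grammar ignores it.
BRACES = re.compile(
    rf"\{{\s*(?:(?P<exact>{NUMBER})|(?P<lo>{NUMBER})?\s*,\s*(?P<hi>{NUMBER})?)\s*\}}"
)

def parse_repetition_braces(text):
    """Return (min, max) for a brace repetition, or None if it is malformed.

    max is None when there is no upper bound (assumption, see lead-in).
    """
    m = BRACES.fullmatch(text.strip())
    if m is None:
        return None
    if m.group("exact") is not None:
        n = int(m.group("exact"))
        return (n, n)
    lo = int(m.group("lo")) if m.group("lo") is not None else 0
    hi = int(m.group("hi")) if m.group("hi") is not None else None
    return (lo, hi)
```

The sketch only covers the braces. The optional Quantifier (greedy or lazy) that may follow is not handled.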

AtomExpression

let AtomExpression =
    | Group
    | String
    | CharacterSet
    | InlineRegex
    | Boundary
    | Reference
    | NumberRange
    | CodePoint
    | Name
    | '.';

Group

let Group = GroupKind? '(' Expression ')';

let GroupKind =
    | ':' Name?
    | 'atomic';

Note: A group name must be ASCII-only and may not contain underscores. Furthermore, a group name must be no longer than 32 characters. For example:

# invalid group names
:underscores_are_invalid()  :äöéŧûøIsInvalid()
:thisGroupNameIsTooLongUnfortunately()

# valid group name
:thisIsAllowed()

These restrictions exist because of Java. To make Pomsky behave consistently across regex flavors, we have to use the most restrictive rules for all flavors.
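
The group name restrictions can be checked mechanically. The following Python sketch assumes that a group name is an ASCII letter followed by ASCII letters or digits, with at most 32 characters in total. The helper name is_valid_group_name is hypothetical and not part of Pomsky.

```python
import re

# Assumption: ASCII letter first, then ASCII letters or digits,
# no underscores, at most 32 characters in total.
GROUP_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,31}")

def is_valid_group_name(name):
    """Check a group name against the restrictions described above."""
    return GROUP_NAME.fullmatch(name) is not None
```

Applied to the examples above, the first three names are rejected and thisIsAllowed is accepted.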

CharacterSet

let CharacterSet =
    | '!'? '[' '.' ']' # deprecated!
    | '!'? '[' CharacterSetInner+ ']';

let CharacterSetInner =
    | Range
    | String
    | CodePoint
    | NonPrintable
    | Shorthand
    | UnicodeProperty
    | PosixClass;

let Range = SingleChar '-' SingleChar;

let SingleChar =
    | StringOneChar
    | CodePoint
    | NonPrintable;

let NonPrintable =
    | 'n' | 'r' | 't'
    | 'a' | 'e' | 'f';

let Shorthand = '!'? ShorthandIdent;

let ShorthandIdent =
    | 'w' | 'word'
    | 'd' | 'digit'
    | 's' | 'space'
    | 'h' | 'horiz_space'
    | 'v' | 'vert_space'
    | 'l' | 'line_break';

let PosixClass =
    | 'ascii'
    | 'ascii_alpha'
    | 'ascii_alnum'
    | 'ascii_blank'
    | 'ascii_cntrl'
    | 'ascii_digit'
    | 'ascii_graph'
    | 'ascii_lower'
    | 'ascii_print'
    | 'ascii_punct'
    | 'ascii_space'
    | 'ascii_upper'
    | 'ascii_word'
    | 'ascii_xdigit';

UnicodeProperty

let UnicodeProperty = '!'? Name;

Details about supported Unicode properties are covered in the Unicode properties documentation.

InlineRegex

let InlineRegex = 'regex' String;

Boundary

let Boundary =
    | '^'
    | '$'
    | '!'? '%';

Reference

let Reference =
    | '::' Name
    | '::' Sign? Number;

let Sign =
    | '+'
    | '-';

Note that reference names must be ASCII-only, so the allowed characters are a-z, A-Z, _, and 0-9. Digits may not appear at the start of the name.

NumberRange

let NumberRange = 'range' String '-' String Base?;

let Base = 'base' Number;

Note that the strings must contain digits or ASCII letters in the supported range. For example, in base 16, the characters 0123456789abcdefABCDEF are allowed. The base must be between 2 and 36.
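
To make the digit rule concrete, here is a small Python sketch that checks whether a range bound only uses characters valid in a given base. The function name is_valid_range_bound is hypothetical, and the default of base 10 when no Base is given is an assumption.

```python
import string

# "0".."9" followed by "a".."z": the digit alphabet for bases up to 36.
DIGITS = string.digits + string.ascii_lowercase

def is_valid_range_bound(text, base=10):
    """Return True if text is non-empty and every character is a digit in base."""
    if not 2 <= base <= 36:
        return False
    allowed = set(DIGITS[:base])
    # Letters are accepted in either case; non-ASCII characters are rejected.
    return len(text) > 0 and all(c.isascii() and c.lower() in allowed for c in text)
```

For example, is_valid_range_bound("fF", 16) holds, while "g" is rejected in base 16 and "2" is rejected in base 2.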

Tokens

Tokens (also called terminals) cannot be further divided. The grammar above uses the following token types:

Name

Names (or identifiers) consist of a letter or underscore (_), followed by any number of letters, digits and underscores. For example:

# valid identifiers
hello  i18n  _foo_  Gänsefüßchen

# invalid identifiers
kebab-case  42  👍‍

A letter is any code point with the Alphabetic property, which can be matched in most regex flavors with \p{Alpha}. A digit is any code point from the Number general categories, which can be matched in most regex flavors with \pN.

Note that group names have more restrictions than variable names; see the Group section above.

Identifiers may not be one of the following reserved words:

  • U
  • let
  • lazy
  • greedy
  • range
  • base
  • atomic
  • enable
  • disable
  • if
  • else
  • recursion
  • regex
  • test

Number

A whole number without a sign and without leading zeros. For example:

# valid numbers
0  1  42  10000

# invalid numbers
042  -30  +30  30.1  10_000  10,000
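
The Number rule maps directly onto a short regular expression. This Python sketch is only an illustration, and the helper name is_number is hypothetical.

```python
import re

# Either a single 0, or a non-zero digit followed by any number of digits.
NUMBER = re.compile(r"0|[1-9][0-9]*")

def is_number(token):
    """Return True if token is a whole number without sign or leading zeros."""
    return NUMBER.fullmatch(token) is not None
```

It accepts the valid examples above and rejects all of the invalid ones.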

String

A string is a sequence of code points surrounded by single or double quotes. In double-quoted strings, double quotes and backslashes are escaped by preceding them with a backslash. No other escapes are supported, and single-quoted strings have no escapes at all. For example:

# valid strings
'test'  "test"  "C:\\User\\Dwayne \"The Rock\" Johnson"  'C:\User\Dwayne "The Rock" Johnson'

'this is a
multiline string'

"this is a
multiline string"

# invalid strings
"\n"  "\uFFFF"  '\''

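The escaping rules for strings can be sketched as a small decoder. The Python function below is an illustration under the rules stated above, not Pomsky's actual lexer, and the name decode_string_literal is hypothetical. It takes a complete literal, including its quotes, and returns the string content or None if the literal is invalid.

```python
def decode_string_literal(literal):
    """Decode a single- or double-quoted string literal, or return None."""
    if len(literal) < 2 or literal[0] != literal[-1] or literal[0] not in "'\"":
        return None
    quote, body = literal[0], literal[1:-1]

    if quote == "'":
        # Single-quoted: no escapes, so the body cannot contain a single quote.
        return None if "'" in body else body

    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # Only \\ and \" are valid escapes in double-quoted strings.
            if i + 1 < len(body) and body[i + 1] in '\\"':
                out.append(body[i + 1])
                i += 2
                continue
            return None
        if ch == '"':
            # An unescaped double quote would end the string early.
            return None
        out.append(ch)
        i += 1
    return "".join(out)
```

Line breaks inside the body pass through unchanged, which matches the multiline examples above.
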
StringOneChar

Same as String, with the limitation that the string must contain exactly one code point or grapheme.

CodePoint

A codepoint consists of U, +, and 1 to 6 hexadecimal digits (0-9, a-f, A-F). It must represent a valid Unicode scalar value. That is, it must be no greater than U+10FFFF and must not be a UTF-16 surrogate (U+D800 to U+DFFF). For example:

# valid codepoints
U+0  U+10  U+FFF  U+10FFFF  U + FF

# invalid codepoints
U+300000  U+00000001  U+D800  U+FGHI

Note that the + may be surrounded by spaces.
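
The CodePoint rules can be illustrated with a short Python sketch. The function name parse_code_point is hypothetical. The sketch returns the numeric value of a valid codepoint token and None otherwise.

```python
import re

# U, optional whitespace, +, optional whitespace, then 1 to 6 hex digits.
CODE_POINT = re.compile(r"U\s*\+\s*([0-9a-fA-F]{1,6})")

def parse_code_point(token):
    """Return the scalar value of a codepoint token, or None if invalid."""
    m = CODE_POINT.fullmatch(token)
    if m is None:
        return None
    value = int(m.group(1), 16)
    # Unicode scalar values: at most 0x10FFFF, excluding UTF-16 surrogates.
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        return None
    return value
```

With the examples above, U+300000 is out of range, U+00000001 has too many digits, U+D800 is a surrogate, and U+FGHI contains non-hexadecimal characters.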

Note about this grammar

Even though this grammar is written using Pomsky syntax, it isn’t actually accepted by the Pomsky compiler, because it uses cyclic variables: for example, Expression eventually refers to Group, which refers back to Expression.