Formal grammar
Summary
This document uses Pomsky syntax to describe Pomsky’s syntax. Here is an incomplete summary, which should be enough to read the grammar:
Variables are declared as let var_name = expression; which assigns expression to the variable var_name.
Verbatim text is wrapped in double quotes ("") or single quotes ('').
A * after a rule indicates that it repeats 0 or more times.
A + after a rule indicates that it repeats 1 or more times.
A ? after a rule indicates that the rule is optional.
Rules can be grouped together by wrapping them in parentheses (()).
Alternative rules are each preceded by a vertical bar (|).
Formal grammar
Comments start with # and end at the end of the same line. Comments and whitespace are ignored; they can be added anywhere.
Expression
let Expression = Statement* OrExpression;
let Statement =
| LetDeclaration
| Modifier;
let LetDeclaration = 'let' Name '=' OrExpression ';';
let Modifier = ModifierKeyword BooleanSetting ';';
let ModifierKeyword =
| 'enable'
| 'disable';
let BooleanSetting =
| 'lazy'
| 'unicode';
OrExpression
let OrExpression = ('|'? Alternatives)?;
let Alternatives = Alternative ('|' Alternative)*;
let Alternative = FixExpression+;
FixExpression
An expression which can have a prefix or suffix.
let FixExpression =
| LookaroundPrefix Expression
| AtomExpression RepetitionSuffix?;
Lookaround
let LookaroundPrefix =
| '!'? '<<'
| '!'? '>>';
Repetitions
let RepetitionSuffix = RepetitionCount Quantifier?;
let RepetitionCount =
| '*'
| '+'
| '?'
| RepetitionBraces;
let RepetitionBraces =
| '{' Number '}'
| '{' Number? ',' Number? '}';
let Quantifier =
| 'greedy'
| 'lazy';
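As a rough illustration of the RepetitionBraces rule, the Python sketch below turns a braces token into a (min, max) pair. The function name parse_repetition_braces is hypothetical. Treating a missing lower bound as 0 and a missing upper bound as unbounded is an assumption, since the grammar only describes the syntax. The sketch tolerates whitespace between tokens but not comments.

```python
import re

# Number token: no sign and no leading zeros.
_NUMBER = r"(?:0|[1-9][0-9]*)"
_BRACES = re.compile(
    rf"\{{\s*(?:(?P<exact>{_NUMBER})"
    rf"|(?P<lo>{_NUMBER})?\s*,\s*(?P<hi>{_NUMBER})?)\s*\}}"
)


def parse_repetition_braces(text):
    """Return (min, max) for a braces token; max is None when unbounded."""
    m = _BRACES.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid repetition: {text!r}")
    if m.group("exact") is not None:
        n = int(m.group("exact"))
        return (n, n)
    # Assumed defaults: lower bound 0, upper bound unbounded.
    lo = int(m.group("lo")) if m.group("lo") is not None else 0
    hi = int(m.group("hi")) if m.group("hi") is not None else None
    return (lo, hi)
```

For instance, parse_repetition_braces("{2,5}") yields (2, 5), while "{03}" is rejected because of the leading zero.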
AtomExpression
let AtomExpression =
| Group
| String
| CharacterSet
| InlineRegex
| Boundary
| Reference
| NumberRange
| CodePoint
| Name
| '.';
Group
let Group = GroupKind? '(' Expression ')';
let GroupKind =
| ':' Name?
| 'atomic';
Note: A group name must be ASCII-only and may not contain underscores. Furthermore, a group name must be no longer than 32 characters. For example:
:underscores_are_invalid()
:äöéŧûøIsInvalid()
:thisGroupNameIsTooLongUnfortunately()
:thisIsAllowed()
These restrictions exist because of Java. To make Pomsky behave consistently across regex flavors, we have to use the most restrictive rules for all flavors.
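The group name restrictions can be sketched as a small Python check. The function name is_valid_group_name is hypothetical, and requiring a leading letter is an assumption derived from the Name token (a name starts with a letter or underscore, and underscores are excluded here).

```python
import re

# ASCII letters and digits only, no underscores, at most 32 characters.
# The leading-letter requirement is an assumption based on the Name token.
_GROUP_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,31}")


def is_valid_group_name(name):
    """Check a group name against the restrictions described above."""
    return _GROUP_NAME.fullmatch(name) is not None
```

Applied to the examples above, only thisIsAllowed passes the check.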
CharacterSet
let CharacterSet =
| '!'? '[' '.' ']' # deprecated!
| '!'? '[' CharacterSetInner+ ']';
let CharacterSetInner =
| Range
| String
| CodePoint
| NonPrintable
| Shorthand
| UnicodeProperty
| PosixClass;
let Range = SingleChar '-' SingleChar;
let SingleChar =
| StringOneChar
| CodePoint
| NonPrintable;
let NonPrintable =
| 'n' | 'r' | 't'
| 'a' | 'e' | 'f';
let Shorthand = '!'? ShorthandIdent;
let ShorthandIdent =
| 'w' | 'word'
| 'd' | 'digit'
| 's' | 'space'
| 'h' | 'horiz_space'
| 'v' | 'vert_space'
| 'l' | 'line_break';
let PosixClass =
| 'ascii'
| 'ascii_alpha'
| 'ascii_alnum'
| 'ascii_blank'
| 'ascii_cntrl'
| 'ascii_digit'
| 'ascii_graph'
| 'ascii_lower'
| 'ascii_print'
| 'ascii_punct'
| 'ascii_space'
| 'ascii_upper'
| 'ascii_word'
| 'ascii_xdigit';
UnicodeProperty
let UnicodeProperty = '!'? Name;
Details about supported Unicode properties can be found here.
InlineRegex
let InlineRegex = 'regex' String;
Boundary
let Boundary =
| '^'
| '$'
| '!'? '%';
Reference
let Reference =
| '::' Name
| '::' Sign? Number;
let Sign =
| '+'
| '-';
Note that references must be ASCII-only, so the allowed characters are a-z, A-Z, _, and 0-9. Digits may not appear at the start of the name.
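To make the three reference forms concrete, here is a minimal Python sketch that classifies a reference token. The function name classify_reference and the labels "named", "relative", and "absolute" are hypothetical and chosen for illustration. Whitespace between tokens is not handled.

```python
import re

_REFERENCE = re.compile(
    r"::(?:(?P<name>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<sign>[+-])?(?P<number>0|[1-9][0-9]*))"
)


def classify_reference(text):
    """Return ('named', name), ('relative', offset) or ('absolute', index)."""
    m = _REFERENCE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid reference: {text!r}")
    if m.group("name") is not None:
        return ("named", m.group("name"))
    value = int(m.group("number"))
    if m.group("sign") is not None:
        # A signed number is interpreted here as an offset (illustrative label).
        return ("relative", -value if m.group("sign") == "-" else value)
    return ("absolute", value)
```

A token such as ::1abc is rejected because a name may not start with a digit and a number may not be followed by letters.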
NumberRange
let NumberRange = 'range' String '-' String Base?;
let Base = 'base' Number;
Note that the strings must contain digits or ASCII letters in the supported range. For example, in base 16, the characters 0123456789abcdefABCDEF are allowed. The base must be between 2 and 36.
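The digit rule for range bounds can be illustrated with a short Python sketch. The function name is_valid_in_base is hypothetical, and defaulting to base 10 when no Base is given is an assumption; the grammar only says that Base is optional.

```python
import string

# "0123456789abcdefghijklmnopqrstuvwxyz": one symbol per digit value, 36 total.
_DIGITS = string.digits + string.ascii_lowercase


def is_valid_in_base(text, base=10):
    """Check that every character of text is a digit in the given base."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    allowed = set(_DIGITS[:base])
    # Letters are accepted in either case, but only ASCII characters count.
    return len(text) > 0 and all(
        ch.isascii() and ch.lower() in allowed for ch in text
    )
```

With base 16, both ff and FF are accepted, while fg is not.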
Tokens
Tokens (also called terminals) cannot be further divided. The grammar above uses the following token types:
Name
Names (or identifiers) consist of a letter or underscore (_), followed by any number of letters, digits and underscores. For example:
# valid identifiers
hello i18n _foo_ Gänsefüßchen
# invalid identifiers
kebab-case 42 👍
A letter is any code point with the Alphabetic property, which can be matched in most regex flavors with \p{Alpha}. A digit is any code point from the Number general categories, which can be matched in most regex flavors with \pN.
Note that group names have more restrictions than variable names; see the Group section above.
Identifiers may not be one of the following reserved words:
U
let
lazy
greedy
range
base
atomic
enable
disable
if
else
recursion
regex
test
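The identifier rules, including the reserved words, can be approximated in Python as follows. The function name is_valid_name is hypothetical. Note the stated approximation: the sketch treats code points in the Letter general categories as letters, which is close to but not identical to the Alphabetic property.

```python
import unicodedata

RESERVED_WORDS = {
    "U", "let", "lazy", "greedy", "range", "base", "atomic",
    "enable", "disable", "if", "else", "recursion", "regex", "test",
}


def _is_letter(ch):
    # Approximation: general category L*, not the full Alphabetic property.
    return unicodedata.category(ch).startswith("L")


def _is_digit(ch):
    # Any code point in a Number general category (Nd, Nl, No).
    return unicodedata.category(ch).startswith("N")


def is_valid_name(text):
    """Check a name against the identifier rules and reserved words."""
    if not text or text in RESERVED_WORDS:
        return False
    first, rest = text[0], text[1:]
    if not (first == "_" or _is_letter(first)):
        return False
    return all(ch == "_" or _is_letter(ch) or _is_digit(ch) for ch in rest)
```

The check accepts hello, i18n, _foo_ and Gänsefüßchen, and rejects kebab-case, 42, 👍 and reserved words such as let.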
Number
A whole number without a sign and without leading zeros. For example:
# valid numbers
0 1 42 10000
# invalid numbers
042 -30 +30 30.1 10_000 10,000
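The Number rule is small enough to express as a single regular expression. The Python sketch below uses a hypothetical helper name, is_valid_number, and restricts digits to ASCII 0-9, which is an assumption consistent with the examples above.

```python
import re

# Either a lone zero, or a nonzero digit followed by any ASCII digits.
_NUMBER = re.compile(r"0|[1-9][0-9]*")


def is_valid_number(text):
    """Check for a whole number without sign or leading zeros."""
    return _NUMBER.fullmatch(text) is not None
```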
String
A string is a sequence of code points surrounded by single or double quotes. In double-quoted strings, double quotes and backslashes are escaped by preceding them with a backslash. No other escapes are supported. Single-quoted strings have no escapes at all, so they cannot contain a single quote. For example:
# valid strings
'test' "test" "C:\\User\\Dwayne \"The Rock\" Johnson" 'C:\User\Dwayne "The Rock" Johnson'
'this is a
multiline string'
"this is a
multiline string"
# invalid strings
"\n" "\uFFFF" '\''
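The escaping rules can be sketched as a small Python function that returns the text a string literal denotes. The name parse_string_literal is hypothetical, and rejecting a single quote inside a single-quoted string is inferred from the invalid example '\'' above.

```python
def parse_string_literal(literal):
    """Return the text denoted by a quoted literal, or raise ValueError."""
    if len(literal) < 2 or literal[0] not in "'\"" or literal[-1] != literal[0]:
        raise ValueError("string must be wrapped in matching quotes")
    quote, body = literal[0], literal[1:-1]
    if quote == "'":
        # Single-quoted strings have no escapes; backslashes are literal.
        if "'" in body:
            raise ValueError("single quote inside single-quoted string")
        return body
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # Only \" and \\ are valid escapes in double-quoted strings.
            if i + 1 >= len(body) or body[i + 1] not in '"\\':
                raise ValueError("unsupported escape")
            out.append(body[i + 1])
            i += 2
        elif ch == '"':
            raise ValueError("unescaped double quote")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Line breaks pass through unchanged, which matches the multiline examples above.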
StringOneChar
Same as String, with the limitation that the string must contain exactly one code point or grapheme.
CodePoint
A codepoint consists of U, +, and 1 to 6 hexadecimal digits (0-9, a-f, A-F). It must represent a valid Unicode scalar value. This means that it must be a valid codepoint, but not a UTF-16 surrogate. For example:
# valid codepoints
U+0 U+10 U+FFF U+10FFFF U + FF
# invalid codepoints
U+300000 U+00000001 U+D800 U+FGHI
Note that the + may be surrounded by spaces.
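The codepoint rules translate into a short Python check. The function name parse_code_point is hypothetical, and the sketch only allows plain spaces around the +, which is a narrow reading of the note above.

```python
import re

# "U", "+", then 1 to 6 hexadecimal digits; spaces may surround the "+".
_CODE_POINT = re.compile(r"U *\+ *([0-9a-fA-F]{1,6})")


def parse_code_point(text):
    """Return the character for a codepoint token, or raise ValueError."""
    m = _CODE_POINT.fullmatch(text)
    if m is None:
        raise ValueError(f"not a codepoint token: {text!r}")
    value = int(m.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError("beyond the Unicode range (U+10FFFF)")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError("UTF-16 surrogates are not scalar values")
    return chr(value)
```

U+00000001 is rejected for having more than 6 digits, U+300000 for exceeding U+10FFFF, and U+D800 for being a surrogate.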
Note about this grammar
Even though this grammar is written using Pomsky syntax, it isn’t actually accepted by the Pomsky compiler, because it uses cyclic variables. For example, Expression refers to OrExpression, which leads back to Expression through Group.