Skip to content

Tokens

Tokens (also called terminals) cannot be further divided. There are the following token types used in the grammar:

Names (or identifiers) consist of a letter or underscore (_), followed by any number of letters, digits and underscores. For example:

# valid identifiers
hello i18n _foo_ Gänsefüßchen
# invalid identifiers
kebab-case 42 👍‍

A letter is any code point with the Alphabetic property, which can be matched in most regex flavors with \p{Alpha}. A digit is any code point from the Number general categories, which can be matched in most regex flavors with \pN.

Note that group names have more restrictions than variable names: They must be ASCII-only and may not contain underscores.

Identifiers may not be one of the following reserved words:

  • U
  • let
  • lazy
  • greedy
  • range
  • base
  • atomic
  • enable
  • disable
  • if
  • else
  • recursion
  • regex
  • test

There are some contextual keywords that have a special meaning only in a certain context:

  • match
  • reject
  • as
  • in
  • unicode

Contextual keywords can be used as variable and group names without issues.

A whole number without a sign and without leading zeros. For example:

# valid numbers
0 1 42 10000
# invalid numbers
042 -30 +30 30.1 10_000 10,000

A string is a sequence of code points surrounded by single or double quotes. In double quoted strings, double quotes and backslashes are escaped by preceding them with a backslash. No other escapes are supported. Single quoted strings don’t support any escaping:

# valid strings
'test' "test" "C:\\User\\Dwayne \"The Rock\" Johnson" 'C:\User\Dwayne "The Rock" Johnson'
'this is a
multiline string'
"this is a
multiline string"
# invalid strings
"\n" "\uFFFF" '\''

Within string literals, \r\n (CRLF) sequences are replaced with a single \n (LF). This is because text editors do not display the type of line ending, so users might save a Pomsky file with the wrong file ending by accident. In most regex engines, \n matches a line break regardless of the platform convention used.

Same as String, with the limitation that the string must contain exactly one code point. Example:

'a' 'ŧ' "\\"

A codepoint consists of U, +, and 1 to 6 hexadecimal digits (0-9, a-f, A-F). It must represent a valid Unicode scalar value. This means that it must be a valid codepoint, but not a UTF-16 surrogate. For example:

# valid codepoints
U+0 U+10 U+FFF U+10FFFF U + FF
# invalid codepoints
U+300000 U+00000001 U+D800 U+FGHI

The code point token is ‘special’ in that the + may be surrounded by spaces.

Punctuation tokens consist of visible ASCII characters. Most punctuation tokens are exactly one character, except for <<, >>, and ::. The full list of supported punctuation tokens is

>> << :: ^ $ < > % * + ? | : ( ) { } , ! [ - ] . ; =

Pomsky’s lexer can also lex a variety of illegal constructs, e.g. backslash escapes like \g<0> and groups such as (:?), in order to show more useful error messages.