There are a few shorthand character classes: word, digit, space, horiz_space and vert_space. They can be abbreviated with their first letter: w, d, s, h and v. Like Unicode properties, they must appear in square brackets.
word matches a word character, i.e. a letter, digit or underscore. It’s equivalent to [Alphabetic Mark Decimal_Number Connector_Punctuation Join_Control].
digit matches a digit. It’s equivalent to [Decimal_Number].
space matches whitespace. It’s equivalent to [White_Space].
horiz_space matches horizontal whitespace (tabs and spaces). It’s equivalent to [U+09 Space_Separator].
vert_space matches vertical whitespace. It’s equivalent to [U+0A-U+0D U+85 U+2028 U+2029].
Note that word, digit and space only match ASCII characters if the regex engine isn’t configured to be Unicode-aware. How to enable Unicode support depends on the regex engine.
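The effect of Unicode awareness can be sketched with Python's `re` module standing in for a regex engine. This is only an illustration, not Pomsky output: `\w`, `\d` and `\s` are the regex shorthands that play the role of word, digit and space, and the `sample` string and `matches` helper are hypothetical names chosen for this example.

```python
import re

# Hypothetical sample: a non-ASCII letter, a non-ASCII digit, a non-ASCII space.
sample = "é٣\u2003"  # é, ARABIC-INDIC DIGIT THREE, EM SPACE


def matches(pattern, text, flags=0):
    """Return every single-character match of `pattern` in `text`."""
    return re.findall(pattern, text, flags)


shorthands = (r"\w", r"\d", r"\s")

# Python's str patterns are Unicode-aware by default.
unicode_hits = {p: matches(p, sample) for p in shorthands}

# re.ASCII restricts the same shorthands to ASCII characters.
ascii_hits = {p: matches(p, sample, re.ASCII) for p in shorthands}

print(unicode_hits)
print(ascii_hits)
```

With Unicode awareness each shorthand finds something in the sample; with `re.ASCII` none of them do, because the sample contains no ASCII characters.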
If you want to match any code point, you can use Codepoint, or C for short. This does not require brackets, because it is a built-in variable.
For example, this matches a double-quoted string:
'"' Codepoint* lazy '"'
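To see why the lazy quantifier matters here, the following sketch uses a hand-written Python regex with roughly the same meaning as the expression above. The regex is an assumption made for illustration, not necessarily what Pomsky emits; `[\s\S]` is used as a common way to say "any code point, including line breaks".

```python
import re

# Hand-written approximation: quote, any code points (lazy), quote.
LAZY = re.compile(r'"[\s\S]*?"')
# Same pattern with a greedy repetition, for comparison.
GREEDY = re.compile(r'"[\s\S]*"')

text = 'say "hello" and "good\nbye"'

lazy_matches = LAZY.findall(text)
greedy_matches = GREEDY.findall(text)

print(lazy_matches)    # each quoted string separately
print(greedy_matches)  # one match from the first quote to the last
```

The lazy version stops at the nearest closing quote, so each string is matched on its own. The greedy version runs to the last quote in the input and swallows everything in between.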
What if I don’t need Unicode?
You don’t have to use Unicode-aware character classes such as [word] if you know that the input is only ASCII. Unicode-aware matching can be considerably slower. For example, the [word] character class includes more than 100,000 code points, so matching [ascii_word], which includes only 63 code points, is faster.
Pomsky supports a number of ASCII-only shorthands, such as [ascii_word].
Using them can improve performance, but be careful when you use them. If you aren’t sure if the input will ever contain non-ASCII characters, it’s better to err on the side of correctness, and use Unicode-aware character classes.
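The size difference between the two kinds of word class can be checked with a rough sketch in Python. It counts how many code points `\w` matches with and without `re.ASCII`. Python's definition of `\w` is not identical to Pomsky's [word], so the Unicode count is only indicative; `count_matching` is a hypothetical helper.

```python
import re
import sys

UNICODE_WORD = re.compile(r"\w")          # Unicode-aware word character
ASCII_WORD = re.compile(r"\w", re.ASCII)  # ASCII-only: letters, digits, underscore


def count_matching(pattern):
    """Count the code points that `pattern` matches as a single character."""
    return sum(
        1 for cp in range(sys.maxunicode + 1) if pattern.fullmatch(chr(cp))
    )


ascii_count = count_matching(ASCII_WORD)
unicode_count = count_matching(UNICODE_WORD)

print(ascii_count, unicode_count)
```

The ASCII count is 26 + 26 + 10 + 1 = 63, while the Unicode-aware count is in the six-figure range, which is why an engine has more work to do for the Unicode-aware class.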
Characters that can’t be printed should be replaced with their hexadecimal Unicode code point. For example, you may write U+FEFF to match the Zero Width No-Break Space.
There are also 6 non-printable characters with a name:
[n] is equivalent to [U+0A], the line feed.
[r] is equivalent to [U+0D], the carriage return.
[f] is equivalent to [U+0C], the form feed.
[a] is equivalent to [U+07], the “alert” or “bell” control character.
[e] is equivalent to [U+1B], the “escape” control character.
Other characters have to be written in their hexadecimal form. Note that you don’t need to write leading zeroes, i.e. U+0 is just as ok as U+0000. However, it is conventional to write ASCII characters with two digits and non-ASCII characters with 4, 5 or 6 digits depending on their length.
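The notation and the digit convention can be illustrated with a small Python sketch. Both functions are hypothetical helpers written for this example and are not part of Pomsky: `parse_code_point` reads the U+ notation and `format_code_point` writes it back using the convention described above.

```python
import string


def parse_code_point(text):
    """Turn 'U+FEFF'-style notation into the character it names."""
    digits = text[2:]
    if not text.startswith("U+") or not digits:
        raise ValueError(f"expected 'U+' followed by hex digits: {text!r}")
    if any(c not in string.hexdigits for c in digits):
        raise ValueError(f"not a hexadecimal code point: {text!r}")
    return chr(int(digits, 16))


def format_code_point(char):
    """Write a character as U+XX for ASCII, and with at least 4 digits otherwise."""
    value = ord(char)
    width = 2 if value < 0x80 else 4
    return f"U+{value:0{width}X}"


# Leading zeroes do not change the meaning.
same = parse_code_point("U+0") == parse_code_point("U+0000")

print(same)
print(format_code_point("\n"))
print(format_code_point("\ufeff"))
print(format_code_point(chr(0x1F600)))
```

Code points above U+FFFF need five or six digits, so the formatter only enforces a minimum width and lets longer values grow naturally.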