There are abbreviations, called shorthands, for frequently needed character sets:
[d] matches a decimal digit. It is similar to
['0'-'9'], except that it is Unicode aware.
[w] matches a word character, i.e. a letter, digit or underscore. It’s similar to
['0'-'9' 'a'-'z' 'A'-'Z' '_'], except that it is Unicode aware. It matches all codepoints in the Alphabetic, Mark, Decimal_Number, Connector_Punctuation, and Join_Control Unicode categories.
[s] matches whitespace. It is equivalent to the White_Space category.
[h] matches horizontal whitespace, e.g. tabs and spaces.
[v] matches vertical whitespace, e.g. line breaks.
These can be combined as well:
[d s '.'] # match digits, spaces, and dots
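As a further illustration (a minimal sketch, not taken from the original text), a combined set can be repeated like any other expression. The `+` below is assumed to mean "one or more":

```pomsky
# one or more word characters or dots, e.g. "v1.2_beta"
[w '.']+
```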
Note that shorthands such as [space] only match ASCII characters if the regex engine isn’t
configured to be Unicode-aware. How to enable Unicode support is
What if I don’t need Unicode?
You don’t have to use Unicode-aware character sets such as
[digit] if you
know that the input is only ASCII. Unicode-aware matching can be considerably slower. For example, the
[word] character class includes more than 100,000 code points, so
[ascii_word] (which includes only 63 code points) is faster.
Pomsky supports a number of ASCII-only shorthands:
Using them can improve performance, but use them with care. If you aren’t sure whether the input will ever contain non-ASCII characters, it’s better to err on the side of correctness and use Unicode-aware character classes.
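A minimal sketch of an ASCII-only pattern, assuming the input is known to contain only ASCII. The identifier rule itself is hypothetical and chosen only for illustration:

```pomsky
# an ASCII letter or underscore, followed by any number of ASCII word characters
['a'-'z' 'A'-'Z' '_'] [ascii_word]*
```

Replacing [ascii_word] with [word] would make the pattern accept non-ASCII letters and digits after the first character as well, at the cost of a much larger character class.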
Characters that can’t be printed should be replaced with their hexadecimal Unicode code point. For
example, you may write
U+FEFF to match the
Zero Width No-Break Space.
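For instance, a code point can be repeated or combined like any other expression. The sketch below assumes a hypothetical input that may start with a byte order mark, and assumes `?` marks the preceding expression as optional:

```pomsky
# optional U+FEFF, followed by one or more word characters
U+FEFF? [w]+
```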
There are also 6 non-printable characters with a name:
[a] matches the “alert” or “bell” control character.
[e] matches the “escape” control character.
Other characters have to be written in their hexadecimal form:
Note that you don’t need to write leading zeroes, i.e.
U+0 is just as valid as
U+0000. However, it is conventional to write ASCII characters with two digits and
non-ASCII characters with 4, 5 or 6 digits, depending on the length of their code point.
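The digit convention can be sketched as follows, assuming code points may also appear inside a character set and as the bounds of a range:

```pomsky
# ASCII control characters written with two digits, a non-ASCII character with four
[U+00-U+1F U+7F U+FEFF]
```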