Shorthands

There are abbreviations, called shorthands, for often needed character sets:

[digit] or [d] matches a decimal digit. It is similar to ['0'-'9'], except that it is Unicode aware.
[word] or [w] matches a word character, i.e. a letter, digit or underscore. It’s similar to ['0'-'9' 'a'-'z' 'A'-'Z' '_'], except that it is Unicode aware. It matches all codepoints in the Alphabetic, Mark, Decimal_Number, Connector_Punctuation, and Join_Control Unicode categories.
[space] or [s] matches whitespace. It is equivalent to the White_Space category.
[horiz_space] or [h] matches horizontal whitespace, e.g. tabs and spaces.
[vert_space] or [v] matches vertical whitespace, e.g. line breaks.

These can be combined as well:

[d s '.']   # match digits, spaces, and dots
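To make the combined set concrete, here is a rough Python counterpart. It assumes that [d s '.'] corresponds to the regex character class `[\d\s.]`; that mapping is an assumption for illustration, not something this page states, and the helper name `is_digit_space_or_dot` is made up.

```python
import re

# Assumed regex counterpart of the Pomsky set [d s '.']:
# digits, whitespace, and a literal dot in one character class.
DIGIT_SPACE_DOT = re.compile(r"[\d\s.]")


def is_digit_space_or_dot(ch: str) -> bool:
    """Return True if the single character ch belongs to the combined set."""
    return DIGIT_SPACE_DOT.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in ["7", " ", ".", "x"]:
        print(repr(ch), is_digit_space_or_dot(ch))
```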

Note that word, digit and space match only ASCII characters if the regex engine isn’t configured to be Unicode-aware. How to enable Unicode support is described here.
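The difference between the two modes can be observed in any engine that offers both. The sketch below uses Python's `re` module as a stand-in: `\d` and `\w` are Unicode-aware by default for `str` patterns, and the `re.ASCII` flag restricts them to ASCII. The function name `match_modes` is hypothetical, and Python's classes are only an approximation of what other engines do.

```python
import re


def match_modes(pattern: str, ch: str) -> tuple[bool, bool]:
    """Return (matches in Unicode mode, matches in ASCII-only mode)."""
    unicode_hit = re.fullmatch(pattern, ch) is not None
    ascii_hit = re.fullmatch(pattern, ch, re.ASCII) is not None
    return unicode_hit, ascii_hit


if __name__ == "__main__":
    # U+0663 is ARABIC-INDIC DIGIT THREE, U+00E9 is "e" with acute accent.
    for pattern, ch in [(r"\d", "5"), (r"\d", "\u0663"), (r"\w", "\u00e9")]:
        print(pattern, repr(ch), match_modes(pattern, ch))
```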

What if I don’t need Unicode?

You don’t have to use Unicode-aware character sets such as [digit] if you know that the input is only ASCII. Unicode-aware matching can be considerably slower. For example, the [word] character class includes more than 100,000 code points, so matching [ascii_word] (which includes only 63 code points) is faster.
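The size difference between the two classes can be checked by brute force. This sketch counts how many code points Python's `\w` matches with and without the `re.ASCII` flag. Python's definition of `\w` is not identical to the [word] definition given above, so the Unicode count is only a rough indication; the names `ALL_CHARS` and `count_matches` are made up for this example.

```python
import re
import sys

# One string containing every code point from U+0000 to U+10FFFF.
ALL_CHARS = "".join(map(chr, range(sys.maxunicode + 1)))


def count_matches(pattern: str, flags: int = 0) -> int:
    """Count the code points that the single-character pattern matches."""
    return len(re.findall(pattern, ALL_CHARS, flags))


unicode_word_count = count_matches(r"\w")
ascii_word_count = count_matches(r"\w", re.ASCII)

if __name__ == "__main__":
    print("Unicode-aware word characters:", unicode_word_count)
    print("ASCII-only word characters:", ascii_word_count)
```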

Pomsky supports a number of ASCII-only shorthands:

Character class	Equivalent
`[ascii]`	`[U+00-U+7F]`
`[ascii_alpha]`	`['a'-'z' 'A'-'Z']`
`[ascii_alnum]`	`['0'-'9' 'a'-'z' 'A'-'Z']`
`[ascii_blank]`	`[' ' U+09]`
`[ascii_cntrl]`	`[U+00-U+1F U+7F]`
`[ascii_digit]`	`['0'-'9']`
`[ascii_graph]`	`['!'-'~']`
`[ascii_lower]`	`['a'-'z']`
`[ascii_print]`	`[' '-'~']`
`[ascii_punct]`	`` ['!'-'/' ':'-'@' '['-'`' '{'-'~'] ``
`[ascii_space]`	`[' ' U+09-U+0D]`
`[ascii_upper]`	`['A'-'Z']`
`[ascii_word]`	`['0'-'9' 'a'-'z' 'A'-'Z' '_']`
`[ascii_xdigit]`	`['0'-'9' 'a'-'f' 'A'-'F']`

Using them can improve performance, but be careful when you use them. If you aren’t sure whether the input will ever contain non-ASCII characters, it’s better to err on the side of correctness and use Unicode-aware character classes.
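As a sanity check on the table, a few of the listed ranges can be expanded into plain sets and compared with the constants in Python's `string` module. This is only an illustrative sketch: the `expand` helper and the `ASCII_CLASSES` dictionary are hypothetical and not part of Pomsky.

```python
import string


def expand(*parts):
    """Build a set of characters from single characters and (first, last) ranges."""
    chars = set()
    for part in parts:
        if isinstance(part, tuple):
            first, last = part
            chars.update(chr(cp) for cp in range(ord(first), ord(last) + 1))
        else:
            chars.add(part)
    return chars


# A subset of the table above, written as explicit ranges.
ASCII_CLASSES = {
    "ascii_digit": expand(("0", "9")),
    "ascii_xdigit": expand(("0", "9"), ("a", "f"), ("A", "F")),
    "ascii_punct": expand(("!", "/"), (":", "@"), ("[", "`"), ("{", "~")),
    "ascii_space": expand(" ", ("\t", "\r")),  # U+09 through U+0D
    "ascii_word": expand(("0", "9"), ("a", "z"), ("A", "Z"), "_"),
}

if __name__ == "__main__":
    print(ASCII_CLASSES["ascii_punct"] == set(string.punctuation))
    print(ASCII_CLASSES["ascii_space"] == set(string.whitespace))
    print(ASCII_CLASSES["ascii_xdigit"] == set(string.hexdigits))
```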

Non-printable characters

Characters that can’t be printed should be replaced with their hexadecimal Unicode code point. For example, you may write U+FEFF to match the Zero Width No-Break Space.

There are also 6 non-printable characters with a name:

[n] matches the \n line feed.
[r] matches the \r carriage return.
[t] matches the \t tab.
[f] matches the \f form feed.
[a] matches the “alert” or “bell” control character.
[e] matches the “escape” control character.
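For reference, some of these named characters can be mapped to their code points with a small lookup table. The sketch below is in Python; the dictionary `NAMED_CONTROL_CHARS` and the function `code_point_label` are hypothetical names chosen for this example.

```python
# Some of the named control characters and the characters they stand for.
NAMED_CONTROL_CHARS = {
    "n": "\n",    # line feed
    "r": "\r",    # carriage return
    "f": "\f",    # form feed
    "a": "\a",    # alert / bell
    "e": "\x1b",  # escape
}


def code_point_label(name: str) -> str:
    """Return the U+XX notation for a named control character."""
    return f"U+{ord(NAMED_CONTROL_CHARS[name]):02X}"


if __name__ == "__main__":
    for name in NAMED_CONTROL_CHARS:
        print(f"[{name}]", code_point_label(name))
```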

Other characters have to be written in their hexadecimal form:

[U+10-U+30 U+FEFF]

Note that you don’t need to write leading zeroes, i.e. U+0 is just as valid as U+0000. However, it is conventional to write ASCII characters with two digits and non-ASCII characters with 4, 5 or 6 digits, depending on the size of the code point.
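The digit convention is easy to apply mechanically. The following Python sketch formats an integer code point accordingly: two hex digits for ASCII, at least four otherwise. The function name `format_code_point` is an assumption made for illustration.

```python
def format_code_point(cp: int) -> str:
    """Format a code point as U+XX (ASCII) or U+XXXX and longer (non-ASCII)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp}")
    # ASCII gets two digits; everything else is padded to at least four,
    # so larger code points naturally come out with five or six digits.
    width = 2 if cp < 0x80 else 4
    return f"U+{cp:0{width}X}"


if __name__ == "__main__":
    for cp in (0x9, 0xE9, 0xFEFF, 0x1F600):
        print(format_code_point(cp))
```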
