Shorthands

There are a few shorthand character classes: word, digit, space, horiz_space and vert_space. They can be abbreviated with their first letter: w, d, s, h and v. Like Unicode properties, they must appear in square brackets.

  • word matches a word character, i.e. a letter, digit or underscore. It’s equivalent to [Alphabetic Mark Decimal_Number Connector_Punctuation Join_Control].
  • digit matches a digit. It’s equivalent to Decimal_Number.
  • space matches whitespace. It’s equivalent to White_Space.
  • horiz_space matches horizontal whitespace (tabs and spaces). It’s equivalent to [U+09 Space_Separator].
  • vert_space matches vertical whitespace. It’s equivalent to [U+0A-U+0D U+85 U+2028 U+2029].
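
As a minimal sketch of how these shorthands combine, the following expression matches a hypothetical "key = 123" line. The line format is an assumption made for illustration, not something Pomsky prescribes. It uses the one-letter abbreviations; each can be swapped for its long name without changing the meaning.

```pomsky
# Hypothetical input format: key = 123
# [w] is [word], [h] is [horiz_space], [d] is [digit]
[w]+ [h]* '=' [h]* [d]+
```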

Note that word, digit and space match only ASCII characters if the regex engine isn’t configured to be Unicode-aware. See the documentation on enabling Unicode support for how to turn it on.

If you want to match any code point, you can use Codepoint, or C for short. This does not require brackets, because it is a built-in variable. For example, this matches a double-quoted string:

'"' Codepoint* lazy '"'
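
If the string is known not to contain an embedded double quote, a negated character class is one possible alternative to the lazy quantifier. This is only a sketch, and it assumes there are no escape sequences inside the string:

```pomsky
# Opening quote, any code points other than '"', closing quote
'"' !['"']* '"'
```

Because the negated class can never consume the closing quote, no lazy quantifier is needed here.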

What if I don’t need Unicode?

You don’t have to use Unicode-aware character classes such as [word] if you know that the input is only ASCII. Unicode-aware matching can be considerably slower. For example, the [word] character class includes more than 100,000 code points, so matching [ascii_word], which includes only 63 code points, is faster.

Pomsky supports a number of ASCII-only shorthands:

Character class   Equivalent
[ascii]           [U+00-U+7F]
[ascii_alpha]     ['a'-'z' 'A'-'Z']
[ascii_alnum]     ['0'-'9' 'a'-'z' 'A'-'Z']
[ascii_blank]     [' ' U+09]
[ascii_cntrl]     [U+00-U+1F U+7F]
[ascii_digit]     ['0'-'9']
[ascii_graph]     ['!'-'~']
[ascii_lower]     ['a'-'z']
[ascii_print]     [' '-'~']
[ascii_punct]     ['!'-'/' ':'-'@' '['-'`' '{'-'~']
[ascii_space]     [' ' U+09-U+0D]
[ascii_upper]     ['A'-'Z']
[ascii_word]      ['0'-'9' 'a'-'z' 'A'-'Z' '_']
[ascii_xdigit]    ['0'-'9' 'a'-'f' 'A'-'F']

Using them can improve performance, but use them with care. If you aren’t sure whether the input will ever contain non-ASCII characters, err on the side of correctness and use Unicode-aware character classes.
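
For instance, an ASCII-only identifier could be sketched as follows. The identifier rule itself (a letter or underscore, then word characters) is an assumption chosen for illustration:

```pomsky
# Hypothetical identifier rule, ASCII only:
# a letter or underscore, followed by any number of ASCII word characters
[ascii_alpha '_'] [ascii_word]*
```

If the input may later contain identifiers such as "größe", this expression would stop matching at the first non-ASCII letter, which is the correctness trade-off described above.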

Non-printable characters

Characters that can’t be printed should be replaced with their hexadecimal Unicode code point. For example, you may write U+FEFF to match the Zero Width No-Break Space.

There are also 6 non-printable characters with a name:

  • [n] is equivalent to [U+0A], the \n line feed.
  • [r] is equivalent to [U+0D], the \r carriage return.
  • [t] is equivalent to [U+09], the \t tab.
  • [f] is equivalent to [U+0C], the \f form feed.
  • [a] is equivalent to [U+07], the “alert” or “bell” control character.
  • [e] is equivalent to [U+1B], the “escape” control character.

Other characters have to be written in their hexadecimal form. Note that you don’t need to write leading zeroes, i.e. U+0 is just as valid as U+0000. However, it is conventional to write ASCII characters with two digits and non-ASCII characters with 4, 5 or 6 digits, depending on how many the code point needs.
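
As a small sketch of these notations used together, the expression below matches an optional byte order mark followed by a line ending. The scenario is an assumption chosen for illustration:

```pomsky
# Optional U+FEFF (Zero Width No-Break Space, often used as a byte order mark),
# then a line ending: either LF alone or CR followed by LF.
# [r] and [n] could also be written as U+0D and U+0A.
U+FEFF? [r]? [n]
```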