Shorthands
There are a few shorthand character classes: word, digit, space, horiz_space and vert_space. They can be abbreviated with their first letter: w, d, s, h and v. Like Unicode properties, they must appear in square brackets.
- word matches a word character, i.e. a letter, digit or underscore. It’s equivalent to [Alphabetic Mark Decimal_Number Connector_Punctuation Join_Control].
- digit matches a digit. It’s equivalent to Decimal_Number.
- space matches whitespace. It’s equivalent to White_Space.
- horiz_space matches horizontal whitespace (tabs and spaces). It’s equivalent to [U+09 Space_Separator].
- vert_space matches vertical whitespace. It’s equivalent to [U+0A-U+0D U+85 U+2028 U+2029].
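The two whitespace classes can be spelled out as explicit sets. The following Python sketch builds them from the definitions above using the standard `unicodedata` module. The names `HORIZ_SPACE` and `VERT_SPACE` are illustrative and not part of Pomsky, and the exact contents of the Space_Separator category depend on the Unicode version shipped with the Python interpreter.

```python
import sys
import unicodedata

# horiz_space: U+09 plus every code point in the Space_Separator (Zs) category.
HORIZ_SPACE = {"\t"} | {
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
}

# vert_space: U+0A-U+0D, U+85, U+2028 and U+2029.
VERT_SPACE = {chr(cp) for cp in range(0x0A, 0x0E)} | {"\x85", "\u2028", "\u2029"}

# The two classes have no code point in common.
print(HORIZ_SPACE.isdisjoint(VERT_SPACE))  # True
```

Note that the no-break space U+A0 belongs to the horizontal set, because it is a Space_Separator.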
Note that word, digit and space only match ASCII characters if the regex engine isn’t configured to be Unicode-aware. How to enable Unicode support is described elsewhere in this documentation.
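As an illustration of the difference, Python's `re` module is Unicode-aware by default for `str` patterns and can be switched to ASCII-only matching with the `re.ASCII` flag. The sketch below uses the engine's own `\w` and `\d` rather than Pomsky output, so it only demonstrates the engine behavior this note refers to.

```python
import re

unicode_word = re.compile(r"\w+")           # Unicode-aware (default for str patterns)
ascii_word = re.compile(r"\w+", re.ASCII)   # ASCII-only

unicode_digit = re.compile(r"\d")
ascii_digit = re.compile(r"\d", re.ASCII)

def fully_matches(pattern, text):
    """Return True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

print(fully_matches(unicode_word, "naïve"))    # True
print(fully_matches(ascii_word, "naïve"))      # False: 'ï' is not ASCII
print(fully_matches(unicode_digit, "\u0663"))  # True: ARABIC-INDIC DIGIT THREE
print(fully_matches(ascii_digit, "\u0663"))    # False
```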
If you want to match any code point, you can use Codepoint, or C for short. This does not require brackets, because it is a built-in variable.
For example, this matches a double-quoted string:
'"' Codepoint* lazy '"'
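To see why the `lazy` modifier matters here, the sketch below uses Python's `re` module with hand-written regexes that approximate the expression above. The pattern `"[\s\S]*?"` is an assumption about what an equivalent regex looks like, not necessarily the exact output of the Pomsky compiler.

```python
import re

LAZY = re.compile(r'"[\s\S]*?"')    # stops at the first closing quote
GREEDY = re.compile(r'"[\s\S]*"')   # runs on to the last closing quote

text = 'say "hello" and "world"'

lazy_matches = LAZY.findall(text)
greedy_matches = GREEDY.findall(text)

print(lazy_matches)    # ['"hello"', '"world"']
print(greedy_matches)  # ['"hello" and "world"']
```

Without lazy repetition, the match would swallow everything between the first and the last quote on the line.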
What if I don’t need Unicode?
You don’t have to use Unicode-aware character classes such as [word] if you know that the input is only ASCII. Unicode-aware matching can be considerably slower. For example, the [word] character class includes more than 100,000 code points, so matching [ascii_word], which includes only 63 code points, is faster.
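The size difference can be checked empirically. The sketch below counts how many code points Python's `\w` matches in Unicode and in ASCII mode. Python's definition of a word character is not guaranteed to coincide exactly with Pomsky's [word], so the Unicode figure is only indicative; `count_matches` is an illustrative helper name.

```python
import re
import sys

def count_matches(pattern):
    """Count the code points that a compiled single-character pattern matches."""
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if pattern.fullmatch(chr(cp))
    )

unicode_count = count_matches(re.compile(r"\w"))
ascii_count = count_matches(re.compile(r"\w", re.ASCII))

# The first number varies with the Unicode version; the second is 63.
print(unicode_count, ascii_count)
```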
Pomsky supports a number of ASCII-only shorthands:
Character class | Equivalent |
---|---|
[ascii] | [U+00-U+7F] |
[ascii_alpha] | ['a'-'z' 'A'-'Z'] |
[ascii_alnum] | ['0'-'9' 'a'-'z' 'A'-'Z'] |
[ascii_blank] | [' ' U+09] |
[ascii_cntrl] | [U+00-U+1F U+7F] |
[ascii_digit] | ['0'-'9'] |
[ascii_graph] | ['!'-'~'] |
[ascii_lower] | ['a'-'z'] |
[ascii_print] | [' '-'~'] |
[ascii_punct] | ['!'-'/' ':'-'@' '['-'`' '{'-'~'] |
[ascii_space] | [' ' U+09-U+0D] |
[ascii_upper] | ['A'-'Z'] |
[ascii_word] | ['0'-'9' 'a'-'z' 'A'-'Z' '_'] |
[ascii_xdigit] | ['0'-'9' 'a'-'f' 'A'-'F'] |
Using them can improve performance, but use them with care. If you aren’t sure whether the input will ever contain non-ASCII characters, it’s better to err on the side of correctness and use Unicode-aware character classes.
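A few rows of the table can be cross-checked against the constants in Python's `string` module, and the caveat about non-ASCII input is easy to reproduce. In the sketch below the regexes are hand-written translations of the table entries, and `ASCII_CLASSES` and `members` are illustrative names.

```python
import re
import string

# Hand-written regex translations of some rows from the table above.
ASCII_CLASSES = {
    "ascii_word": r"[0-9a-zA-Z_]",
    "ascii_punct": r"[!-/:-@\[-`{-~]",
    "ascii_space": r"[ \t-\r]",
    "ascii_xdigit": r"[0-9a-fA-F]",
}

def members(name):
    """Return the set of ASCII characters matched by the named class."""
    rx = re.compile(ASCII_CLASSES[name])
    return {chr(cp) for cp in range(0x80) if rx.fullmatch(chr(cp))}

print(len(members("ascii_word")))                          # 63
print(members("ascii_punct") == set(string.punctuation))   # True
print(members("ascii_space") == set(string.whitespace))    # True
print(members("ascii_xdigit") == set(string.hexdigits))    # True

# The pitfall: an ASCII-only class silently rejects accented letters.
print(re.fullmatch(ASCII_CLASSES["ascii_word"] + "+", "café") is None)  # True
```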
Non-printable characters
Characters that can’t be printed should be replaced with their hexadecimal Unicode code point. For example, you may write U+FEFF to match the Zero Width No-Break Space.
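For instance, a leading U+FEFF (often present as a byte order mark) can be stripped with a regex that refers to the code point by its hexadecimal value. This Python sketch writes the escape by hand as `\uFEFF`, which is how Python's `re` module spells the code point; `strip_bom` is a hypothetical helper name.

```python
import re

# The regex engine resolves \uFEFF to the code point U+FEFF.
BOM = re.compile(r"^\uFEFF")

def strip_bom(text):
    """Remove a single leading Zero Width No-Break Space, if present."""
    return BOM.sub("", text)

print(repr(strip_bom("\ufeffhello")))  # 'hello'
```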
There are also 6 non-printable characters with a name:
- [n] is equivalent to [U+0A], the \n line feed.
- [r] is equivalent to [U+0D], the \r carriage return.
- [t] is equivalent to [U+09], the \t tab.
- [f] is equivalent to [U+0C], the \f form feed.
- [a] is equivalent to [U+07], the “alert” or “bell” control character.
- [e] is equivalent to [U+1B], the “escape” control character.
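The code points of named control characters can be double-checked against Python's string escapes. Python has no `\e` escape, so the escape character (U+1B in ASCII) is written as `\x1b` below. `NAMED` and `PYTHON_ESCAPES` are illustrative names.

```python
# Short name -> code point, for some of the named control characters.
NAMED = {
    "n": 0x0A,  # line feed
    "r": 0x0D,  # carriage return
    "f": 0x0C,  # form feed
    "a": 0x07,  # alert / bell
    "e": 0x1B,  # escape
}

# The same characters written with Python's own escape sequences.
PYTHON_ESCAPES = {"n": "\n", "r": "\r", "f": "\f", "a": "\a", "e": "\x1b"}

for name, cp in NAMED.items():
    # Prints the bracketed name, its code point, and whether Python agrees.
    print(f"[{name}] = U+{cp:02X}", chr(cp) == PYTHON_ESCAPES[name])
```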
Other characters have to be written in their hexadecimal form. Note that you don’t need to write leading zeroes, i.e. U+0 is just as valid as U+0000. However, it is conventional to write ASCII characters with two digits and non-ASCII characters with 4, 5 or 6 digits, depending on how many digits the code point needs.
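The convention can be captured in a small formatter. The Python sketch below uses two hypothetical helpers, `parse_code_point` and `format_code_point`, which are not part of Pomsky and only illustrate the notation.

```python
def parse_code_point(text):
    """Parse 'U+XXXX' notation; leading zeroes do not change the value."""
    if not text.startswith("U+"):
        raise ValueError(f"expected U+ prefix: {text!r}")
    return int(text[2:], 16)

def format_code_point(cp):
    """Two hex digits for ASCII, at least four for everything else."""
    width = 2 if cp < 0x80 else 4
    return f"U+{cp:0{width}X}"

print(parse_code_point("U+0") == parse_code_point("U+0000"))  # True
print(format_code_point(0x09))     # U+09
print(format_code_point(0xFEFF))   # U+FEFF
print(format_code_point(0x1F600))  # U+1F600
```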