Character Sets
What if we want to match an arbitrary word? Enumerating every single word is obviously not feasible, so what can we do instead? We can enumerate all letters and repeat them:
(
| 'a' | 'b' | 'c' | 'd'
| 'e' | 'f' | 'g' | 'h'
| 'i' | 'j' | 'k' | 'l'
| 'm' | 'n' | 'o' | 'p'
| 'q' | 'r' | 's' | 't'
| 'u' | 'v' | 'w' | 'x'
| 'y' | 'z'
)+
But this is very verbose and still only matches lowercase letters. We programmers tend to be lazy, so there must be a more convenient solution!
Character ranges
This expression matches words that can contain English lowercase and uppercase letters:
['a'-'z' 'A'-'Z']+
The square brackets indicate that this is a character set. A character set always matches exactly 1 character (more precisely, a Unicode codepoint). This character set contains two ranges, one for lowercase letters and one for uppercase letters. Together, this matches any character that is either an English lowercase or uppercase letter.
It’s also possible to add single characters to the set, for example:
['$' '_' 'a'-'z' 'A'-'Z']
Multiple characters can be put in the same quotes:
['$_' 'a'-'z' 'A'-'Z']
This is equivalent to ('$' | '_' | ['a'-'z' 'A'-'Z']), but it’s shorter.
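To make the equivalence concrete, here is a small sketch in Python rather than Pomsky. The two patterns are hand-written regexes that are assumed to correspond to the character set and to the alternation above; they are not taken from Pomsky's actual output. The check only covers the 128 ASCII characters.

```python
import re

# Hand-written Python regexes, assumed to correspond to the two Pomsky expressions.
as_set = re.compile(r"[$_a-zA-Z]")
as_alternation = re.compile(r"(?:\$|_|[a-zA-Z])")

# Compare both patterns on every ASCII character.
ascii_chars = [chr(c) for c in range(128)]
disagreements = [
    ch for ch in ascii_chars
    if (as_set.fullmatch(ch) is None) != (as_alternation.fullmatch(ch) is None)
]

# All ASCII characters accepted by the set, in codepoint order.
accepted = "".join(ch for ch in ascii_chars if as_set.fullmatch(ch))

print(disagreements)
print(accepted)
```

Both forms accept the same characters; the set is simply the more compact way to write it.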
Character ranges and Unicode
What is a range, exactly? Let’s see with an example:
['0'-'z']
This doesn’t seem to make sense, but it works. If you try it out, you’ll notice that it matches digits, lowercase and uppercase letters. However, it also matches a few other characters, e.g. the question mark ?.
The reason is that Pomsky uses Unicode, a standard that assigns every character a numeric value. When we write '0'-'z', Pomsky assumes that we want to match any character whose numeric value lies between the value of '0' and the value of 'z', inclusive. This works well for letters (e.g. 'a'-'z') and digits ('0'-'9'), because these have consecutive values in Unicode.
However, there are some special characters between digits, uppercase letters and lowercase letters:
Character | Unicode value |
---|---|
'0' | 48 |
'1' | 49 |
'2' | 50 |
… | |
'9' | 57 |
':' | 58 |
';' | 59 |
'<' | 60 |
'=' | 61 |
'>' | 62 |
'?' | 63 |
'@' | 64 |
'A' | 65 |
'B' | 66 |
… | |
'Z' | 90 |
'[' | 91 |
'\' | 92 |
']' | 93 |
'^' | 94 |
'_' | 95 |
'`' | 96 |
'a' | 97 |
… | |
'z' | 122 |
Why, you might ask? This is for historical reasons: the first 128 Unicode codepoints are identical to ASCII, which already used this layout.
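The gaps in the table can be checked directly. The following Python sketch lists the characters whose codepoints fall between the digits, the uppercase letters and the lowercase letters. The regex `[0-z]` is a hand-written pattern assumed to behave like ['0'-'z']; it is used here for illustration only.

```python
import re

# Characters whose codepoints lie strictly between '9' and 'A',
# and strictly between 'Z' and 'a'.
between_digits_and_upper = [chr(c) for c in range(ord("9") + 1, ord("A"))]
between_upper_and_lower = [chr(c) for c in range(ord("Z") + 1, ord("a"))]

# Hand-written regex, assumed to behave like the Pomsky set ['0'-'z'].
wide_range = re.compile(r"[0-z]")
matches_question_mark = wide_range.fullmatch("?") is not None
matches_space = wide_range.fullmatch(" ") is not None  # ' ' is 32, below '0'

print(between_digits_and_upper)
print(between_upper_and_lower)
print(matches_question_mark, matches_space)
```

Seven characters sit between the digits and the uppercase letters, and six more between the uppercase and lowercase letters, so a range from '0' to 'z' picks up 13 characters that are neither digits nor letters.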
Unicode properties
The reason why Unicode was invented is that most people in the world don’t speak English, and many of them use languages with different alphabets. To support them, Unicode includes 149,813 codepoints covering 161 different scripts. Since we have a widely supported standard for supporting different languages, let’s use it!
The character class ['a'-'z' 'A'-'Z'] only recognizes the basic Latin letters used in English. What should we do instead? We should use a general category. In this case, Letter seems like a good choice. Pomsky makes it easy to use Unicode categories:
[Letter]
That’s it. This matches any letter from all 161 scripts! It’s also possible to match any codepoint in a certain script:
[Cyrillic Hebrew]
This matches a Cyrillic or Hebrew codepoint.
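To see what the Letter category buys us compared to an ASCII range, here is a Python sketch. Python's built-in `re` module has no syntax for Unicode categories, so the helper `is_letter` (a hypothetical name chosen for this example) looks up the general category with the standard `unicodedata` module instead. The ASCII regex is a hand-written stand-in for ['a'-'z' 'A'-'Z'].

```python
import re
import unicodedata


def is_letter(ch: str) -> bool:
    """True if the general category of ch is a Letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


# Hand-written regex, assumed to behave like ['a'-'z' 'A'-'Z'].
ascii_letters = re.compile(r"[a-zA-Z]")

# Latin, Cyrillic and Hebrew letters, plus a digit and an underscore.
samples = ["a", "Z", "я", "א", "5", "_"]
letter_results = {ch: is_letter(ch) for ch in samples}
ascii_results = {ch: ascii_letters.fullmatch(ch) is not None for ch in samples}

print(letter_results)
print(ascii_results)
```

The category lookup accepts the Cyrillic and Hebrew letters that the ASCII range rejects, while both reject the digit and the underscore.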
Some regex engines can also match Unicode properties other than categories and scripts. Useful properties include:
- Alpha (letters and marks that can appear in a word)
- Upper, Lower (uppercase or lowercase letters)
- Emoji
- Math (mathematical symbols)
You can see the full list of Unicode properties here.
Negation
A character class is negated by putting a ! in front of it. For example, !['a'-'f'] matches any codepoint except the lowercase letters a through f.
It’s also possible to negate Unicode properties individually. For example, [Latin !Alpha] matches a codepoint that is either in the Latin script, or is not alphabetic.
Remember the example from the previous page? We repeated the dot to match a pair of curly braces and everything between them:
'{' .* '}'
But it didn’t work correctly because the dot is greedily repeated, so it can consume curly braces:
{foo} {bar}
^^^^^^^^^^
We can fix this by using a character class that doesn’t match curly braces:
'{' !['{}']* '}'
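The difference between the two expressions can be reproduced with Python's `re` module. The patterns below are hand-written regexes assumed to correspond to the greedy dot version and to the negated character class version; they are a sketch, not Pomsky's actual output.

```python
import re

text = "{foo} {bar}"

# Greedy dot: runs from the first '{' to the last '}'.
greedy = re.findall(r"\{.*\}", text)

# Negated character class: cannot cross a curly brace, so each group is matched separately.
negated = re.findall(r"\{[^{}]*\}", text)

print(greedy)
print(negated)
```

With the greedy dot the whole input is one match, whereas the negated class yields one match per pair of braces.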