What if we want to match an arbitrary word? Enumerating every single word is obviously not feasible, so what to do instead? We can simply enumerate the characters and repeat them:
( | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' )+
But this is very verbose and still only matches lowercase letters. We programmers tend to be lazy, so there must be a more convenient solution!
This expression matches words that can contain uppercase and lowercase letters:
['a'-'z' 'A'-'Z']+
What is this? The square brackets indicate that this is a character class. A character class always matches exactly 1 character (more precisely, a Unicode code point). This character class contains two ranges, one for lowercase letters and one for uppercase letters. Together, this matches any character that is either a lowercase or uppercase letter.
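To try this out, here is a minimal Python sketch. It assumes that a class with a lowercase and an uppercase range, repeated one or more times, corresponds to the regex `[a-zA-Z]+`; the names `WORD` and `find_words` are made up for this illustration.

```python
import re

# Assumed regex counterpart of a character class with the two ranges
# 'a'-'z' and 'A'-'Z', repeated one or more times.
WORD = re.compile(r"[a-zA-Z]+")


def find_words(text: str) -> list[str]:
    """Return every maximal run of ASCII letters found in `text`."""
    return WORD.findall(text)


print(find_words("Hello, wide World 42"))
```

Each single match of the class consumes exactly one character; it is the repetition that turns it into a word matcher.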
It’s also possible to add single characters, for example:
['$' '_' 'a'-'z' 'A'-'Z']
When we have several characters in a character class that aren’t part of a range, we can simply put them into the same quotes:
['$_' 'a'-'z' 'A'-'Z']
This is equivalent to ('$' | '_' | ['a'-'z' 'A'-'Z']), but it’s shorter and may be more efficient.
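The equivalence can be checked with a small Python sketch. The two regexes below are assumed counterparts of the character class and the alternation, written by hand for this illustration rather than taken from the compiler's output.

```python
import re

# Hand-written regex counterparts (assumptions, not compiler output).
CLASS_FORM = re.compile(r"[$_a-zA-Z]")
ALTERNATION_FORM = re.compile(r"(?:\$|_|[a-zA-Z])")


def same_matches(chars: str) -> bool:
    """True if both forms accept exactly the same single characters."""
    return all(
        bool(CLASS_FORM.fullmatch(c)) == bool(ALTERNATION_FORM.fullmatch(c))
        for c in chars
    )


ascii_chars = "".join(chr(c) for c in range(128))
print(same_matches(ascii_chars))
```

Both forms accept the same characters; the class simply says it in one construct instead of three alternatives.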
Character ranges and Unicode
What is a range, exactly? Let’s see with an example:
['0'-'z']
This doesn’t seem to make sense, but does work. If you compile it to a regex and try it out, you’ll notice that it matches numbers, lowercase and uppercase letters. However, it also matches a few other characters, e.g. the question mark ?.
The reason is that Pomsky uses Unicode, a standard that assigns every character a numeric value. When we write '0'-'z', Pomsky assumes that we want to match any character whose numeric value is somewhere between the value of '0' and the value of 'z'. This works well for letters (e.g. 'a'-'z') and numbers ('0'-'9'), because these have consecutive numbers in Unicode.
However, there are some special characters between digits, uppercase letters and lowercase letters:
Characters    Unicode values
'0'-'9'       48-57
':'-'@'       58-64
'A'-'Z'       65-90
'['-'`'       91-96
'a'-'z'       97-122
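To see exactly which characters sit in those gaps, a short Python sketch can walk through the code points from '0' to 'z' and keep the ones that are neither digits nor letters. The function name `non_alphanumeric_between` is made up for this illustration.

```python
def non_alphanumeric_between(first: str, last: str) -> list[tuple[int, str]]:
    """List (code point, character) pairs in the inclusive range
    first..last that are neither letters nor digits."""
    return [
        (cp, chr(cp))
        for cp in range(ord(first), ord(last) + 1)
        if not chr(cp).isalnum()
    ]


# Print every "unexpected" character matched by the range '0'-'z'.
for code_point, char in non_alphanumeric_between("0", "z"):
    print(code_point, char)
```

The question mark (63) shows up in the first gap, which is why the range matches it.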
The reason why Unicode was invented is that most people in the world don’t speak English, and many of them use languages with different alphabets. To support them, Unicode (as of version 14.0) includes 144,697 characters covering 159 different scripts. Since we have a standard that makes it really easy to support different languages, there’s no excuse for not using it.
The character class ['a'-'z' 'A'-'Z'] only recognizes Latin characters.
What should we do instead? We should use a Unicode general category. In this case, Letter seems like an obvious candidate. Pomsky makes it very easy to use Unicode categories:
[Letter]
That’s it. This matches any letter from all 159 scripts! It’s also possible to match any character in a specific script:
[Cyrillic Hebrew]
This matches a Cyrillic or Hebrew character. Not sure why you’d want to do that.
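The difference between an ASCII letter range and the Unicode Letter category can be illustrated in Python. This is only a sketch: Python's built-in `re` module has no syntax for Unicode categories, so the check below uses `unicodedata.category` instead, and the helper name `is_letter` is made up for this example.

```python
import re
import unicodedata

# Regex counterpart of a class with only the ranges a-z and A-Z.
ASCII_LETTER = re.compile(r"[a-zA-Z]")


def is_letter(ch: str) -> bool:
    """True if the general category of `ch` is a Letter category
    (Lu, Ll, Lt, Lm or Lo)."""
    return unicodedata.category(ch).startswith("L")


samples = [
    "a",       # Latin small letter a
    "\u00e9",  # Latin small letter e with acute
    "\u044f",  # Cyrillic small letter ya
    "\u05d0",  # Hebrew letter alef
    "7",       # digit
    "?",       # punctuation
]

for ch in samples:
    print(repr(ch), is_letter(ch), bool(ASCII_LETTER.fullmatch(ch)))
```

Only the first sample is accepted by the ASCII range, while the category check accepts all four letters and rejects the digit and the question mark.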
Some regex engines can also match Unicode properties other than categories and scripts. Among the most useful is Alphabetic, which includes letters and marks that can appear in a word.
You can see the full list of Unicode properties here.
Character classes are negated by putting a ! in front of them. For example, !['a'-'f'] matches anything except a letter in the range from 'a' to 'f'.
It’s also possible to negate Unicode properties individually. For example, [Latin !Alphabetic] matches a code point that is either in the Latin script, or is not alphabetic.
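Both kinds of negation have rough analogs in ordinary regex syntax, which the following Python sketch uses for illustration. The patterns are assumptions chosen for this example: `[^a-f]` stands in for a negated class, and `[0-4\D]` mimics an individually negated item, since Python's `re` module cannot express script or Alphabetic properties.

```python
import re

# Whole class negated: anything except the lowercase letters a to f.
NOT_A_TO_F = re.compile(r"[^a-f]")

# One item negated inside the class: a digit from 0 to 4,
# OR any character that is not a digit at all.
LOW_DIGIT_OR_NON_DIGIT = re.compile(r"[0-4\D]")


def matches(pattern: re.Pattern, ch: str) -> bool:
    """True if the single character `ch` is accepted by `pattern`."""
    return pattern.fullmatch(ch) is not None


for ch in ["c", "g", "A", "3", "7", "x"]:
    print(ch, matches(NOT_A_TO_F, ch), matches(LOW_DIGIT_OR_NON_DIGIT, ch))
```

The second pattern rejects only the digits 5 to 9, because a character is accepted as soon as it satisfies either item. In the same way, [Latin !Alphabetic] is a union of its two items, not an intersection.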