Character Classes

What if we want to match an arbitrary word? Enumerating every single word is obviously not feasible, so what can we do instead? We can simply enumerate the characters and repeat them:

(
  | 'a' | 'b' | 'c' | 'd'
  | 'e' | 'f' | 'g' | 'h'
  | 'i' | 'j' | 'k' | 'l'
  | 'm' | 'n' | 'o' | 'p'
  | 'q' | 'r' | 's' | 't'
  | 'u' | 'v' | 'w' | 'x'
  | 'y' | 'z'
)+

But this is very verbose and still only matches lowercase letters. We programmers tend to be lazy, so there must be a more convenient solution!

Character ranges

This expression matches words that can contain uppercase and lowercase letters:

['a'-'z' 'A'-'Z']+

What is this? The square brackets indicate that this is a character class. A character class always matches exactly 1 character (more precisely, a Unicode code point). This character class contains two ranges, one for lowercase letters and one for uppercase letters. Together, this matches any character that is either a lowercase or uppercase letter.
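To try this out, here is a minimal sketch in Python. It assumes that the Pomsky expression above compiles to roughly the regex `[a-zA-Z]+`; the helper name `is_word` is made up for this illustration.

```python
import re

# Assumed regex counterpart of the Pomsky expression ['a'-'z' 'A'-'Z']+
WORD = re.compile(r"[a-zA-Z]+")


def is_word(text: str) -> bool:
    """Return True if text consists of one or more ASCII letters."""
    return WORD.fullmatch(text) is not None


print(is_word("Hello"))    # True
print(is_word("hello42"))  # False: digits are not part of either range
print(is_word(""))         # False: + requires at least one character
```

Because the class matches exactly one character, it is the `+` that lets the expression cover words of any length.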

It’s also possible to add single characters, for example:

['$' '_' 'a'-'z' 'A'-'Z']

When we have several characters in a character class that aren’t part of a range, we can simply put them into the same quotes:

['$_' 'a'-'z' 'A'-'Z']

This is equivalent to ('$' | '_' | ['a'-'z' 'A'-'Z']), but it’s shorter and may be more efficient.
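The equivalence can be checked with a short Python sketch. The two regexes below are assumed counterparts of the character class and the alternation; they are written by hand here, not taken from Pomsky's output.

```python
import re

# Assumed regex counterparts of the two Pomsky expressions
CLASS = re.compile(r"[$_a-zA-Z]")
ALTERNATION = re.compile(r"(?:\$|_|[a-zA-Z])")


def same_result(ch: str) -> bool:
    """True if both patterns agree on whether ch matches."""
    return (CLASS.fullmatch(ch) is None) == (ALTERNATION.fullmatch(ch) is None)


# Compare the two patterns on every ASCII character
all_agree = all(same_result(chr(code)) for code in range(128))
print(all_agree)  # True
```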

Character ranges and Unicode

What is a range, exactly? Let’s see with an example:

['0'-'z']

This doesn’t seem to make sense, but it does work. If you compile it to a regex and try it out, you’ll notice that it matches digits, lowercase letters and uppercase letters. However, it also matches a few other characters, e.g. the question mark ?.

The reason is that Pomsky uses Unicode, a standard that assigns every character a numeric value. When we write '0'-'z', Pomsky assumes that we want to match any character whose numeric value is somewhere between the value of '0' and the value of 'z'. This works well for letters (e.g. 'a'-'z') and digits ('0'-'9'), because these have consecutive numbers in Unicode. However, there are some special characters between digits, uppercase letters and lowercase letters:

Character   Unicode value
'0'         48
'1'         49
'2'         50
...
'9'         57
':'         58
';'         59
'<'         60
'='         61
'>'         62
'?'         63
'@'         64
'A'         65
'B'         66
...
'Z'         90
'['         91
'\'         92
']'         93
'^'         94
'_'         95
'`'         96
'a'         97
...
'z'         122

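Why, you might ask? This is for historical reasons: Unicode took over the order of its first 128 code points from the older ASCII standard, which placed punctuation between these groups.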
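The values in the table can be reproduced with Python's built-in `ord` and `chr`. This sketch lists every character between '0' and 'z' that is neither a digit nor a letter; the variable names are chosen for this illustration only.

```python
# Numeric values of the two range boundaries
start, end = ord("0"), ord("z")
print(start, end)  # 48 122

# Every character whose value lies in the range, boundaries included
in_range = [chr(code) for code in range(start, end + 1)]

# The characters in the range that are neither digits nor letters
extras = "".join(ch for ch in in_range if not ch.isalnum())
print(extras)  # :;<=>?@[\]^_`
```

These 13 extra characters are why a range like '0'-'z' matches more than it appears to.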

Unicode properties

Unicode was invented because most people in the world don’t speak English, and many of them use languages with different alphabets. To support them, Unicode includes 144,697 characters covering 159 different scripts (as of Unicode 14.0). Since we have a standard that makes it really easy to support different languages, there’s no excuse for not using it.

The character class ['a'-'z' 'A'-'Z'] only recognizes the basic Latin letters; it misses accented letters such as é as well as every other alphabet. What should we do instead? We should use a Unicode category. In this case, Letter seems like an obvious candidate. Pomsky makes it very easy to use Unicode categories:

[Letter]

That’s it. This matches any letter from all 159 scripts! It’s also possible to match any character in a specific script:

[Cyrillic Hebrew]

This matches a Cyrillic or Hebrew character. Not sure why you’d want to do that.
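To see what the Letter category adds over a Latin-only range, here is a small Python sketch. Python's built-in `re` module has no syntax for Unicode categories, so the hypothetical helper `is_letter` uses the standard `unicodedata` module as a stand-in; it illustrates the category itself, not how Pomsky implements it.

```python
import re
import unicodedata

# Assumed regex counterpart of ['a'-'z' 'A'-'Z']
LATIN_ONLY = re.compile(r"[a-zA-Z]")


def is_letter(ch: str) -> bool:
    """True if the general category of ch is a letter (Lu, Ll, Lt, Lm or Lo)."""
    return unicodedata.category(ch).startswith("L")


# Print each sample with the result of the range check and the category check
for ch in ["a", "é", "я", "ש", "5"]:
    print(ch, LATIN_ONLY.fullmatch(ch) is not None, is_letter(ch))
```

The range only accepts "a", while the category check accepts all four letters and rejects the digit.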

Some regex engines can also match Unicode properties other than categories and scripts. Probably the most useful ones are

  • Alphabetic (includes letters and marks that can appear in a word)
  • White_Space
  • Uppercase, Lowercase
  • Emoji

You can see the full list of Unicode properties here.

Negation

A character class is negated by putting a ! in front of it. For example, !['a'-'f'] matches any character except a letter in the range from a to f.

It’s also possible to negate Unicode properties individually. For example, [Latin !Alphabetic] matches a code point that is either in the Latin script, or is not alphabetic.
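Both kinds of negation can be sketched in Python. The first pattern assumes that !['a'-'f'] corresponds to the regex `[^a-f]`. Python's `re` module does not support script or property names, so the second pattern uses a stand-in with the same shape: `[a-z\W]` matches a character that is either a lowercase ASCII letter or not a word character, just as [Latin !Alphabetic] combines a positive and a negated item.

```python
import re

# Negating a whole class: anything except a to f
NOT_A_TO_F = re.compile(r"[^a-f]")

# A negated item inside a class (stand-in for [Latin !Alphabetic]):
# a lowercase ASCII letter OR a non-word character
LOWER_OR_NON_WORD = re.compile(r"[a-z\W]")


def matches(pattern: re.Pattern, ch: str) -> bool:
    """True if the single character ch matches the pattern."""
    return pattern.fullmatch(ch) is not None


print(matches(NOT_A_TO_F, "g"))         # True
print(matches(NOT_A_TO_F, "c"))         # False
print(matches(LOWER_OR_NON_WORD, "a"))  # True: lowercase letter
print(matches(LOWER_OR_NON_WORD, "!"))  # True: not a word character
print(matches(LOWER_OR_NON_WORD, "A"))  # False: neither condition holds
```

Note that the items inside a class are combined with "or", so a negated item widens the class rather than narrowing it.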