Character Classes

What if we want to match an arbitrary word? Enumerating every possible word is not feasible. Instead, we can enumerate the characters and repeat them:

(
  | 'a' | 'b' | 'c' | 'd'
  | 'e' | 'f' | 'g' | 'h'
  | 'i' | 'j' | 'k' | 'l'
  | 'm' | 'n' | 'o' | 'p'
  | 'q' | 'r' | 's' | 't'
  | 'u' | 'v' | 'w' | 'x'
  | 'y' | 'z'
)+

But this is very verbose and still only matches lowercase letters. We programmers tend to be lazy, so there must be a more convenient solution!

Character ranges

This expression matches words that can contain uppercase and lowercase letters:

['a'-'z' 'A'-'Z']+

The square brackets indicate that this is a character class. A character class always matches exactly one character (more precisely, one Unicode codepoint). This character class contains two ranges, one for lowercase letters and one for uppercase letters. Together, they match any character that is either a lowercase or an uppercase letter.

It’s also possible to add single characters, for example:

['$' '_' 'a'-'z' 'A'-'Z']

When we have several characters in a character class that aren’t part of a range, we can simply put them into the same quotes:

['$_' 'a'-'z' 'A'-'Z']

This is equivalent to ('$' | '_' | ['a'-'z' 'A'-'Z']), but it’s shorter and may be more efficient.
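To check the claimed equivalence outside of Pomsky, here is a minimal Python sketch. The two regex strings are hand-written translations of the Pomsky expressions above (an assumption; they are not necessarily what the Pomsky compiler emits), and the sample characters are arbitrary.

```python
import re

# Hand-written regex translations of the two Pomsky expressions (assumed, not compiler output).
char_class = re.compile(r"[$_a-zA-Z]")            # ['$_' 'a'-'z' 'A'-'Z']
alternation = re.compile(r"(?:\$|_|[a-zA-Z])")    # ('$' | '_' | ['a'-'z' 'A'-'Z'])

samples = ["$", "_", "q", "Q", "7", "-", "é"]

# For each sample, record whether each pattern matches the whole string.
results = [
    (s, bool(char_class.fullmatch(s)), bool(alternation.fullmatch(s)))
    for s in samples
]

# The two patterns should agree on every sample.
agree = all(a == b for _, a, b in results)
print(agree)
```

Both patterns accept the same single characters; the character class just says it in one step.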

Character ranges and Unicode

What is a range, exactly? Let’s see with an example:

['0'-'z']

This doesn’t seem to make sense, but it does work. If you try it out, you’ll notice that it matches digits as well as lowercase and uppercase letters. However, it also matches a few other characters, e.g. the question mark ?.

The reason is that Pomsky uses Unicode, a standard that assigns every character a numeric value. When we write '0'-'z', Pomsky assumes that we want to match any character whose numeric value is somewhere between the value of '0' and the value of 'z'. This works well for letters (e.g. 'a'-'z') and digits ('0'-'9'), because these have consecutive values in Unicode. However, there are some special characters between digits, uppercase letters and lowercase letters:

Character   Unicode value
'0'         48
'1'         49
'2'         50
'9'         57
':'         58
';'         59
'<'         60
'='         61
'>'         62
'?'         63
'@'         64
'A'         65
'B'         66
'Z'         90
'['         91
'\'         92
']'         93
'^'         94
'_'         95
'`'         96
'a'         97
'z'         122

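Why, you might ask? This is for historical reasons: the first 128 Unicode codepoints are inherited from ASCII, which laid out the characters in this order.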
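The gaps in the range can be listed programmatically. The following Python sketch uses only the built-in ord and chr functions to collect every codepoint between '0' and 'z' that is neither a digit nor a letter; the variable names are illustrative.

```python
# Numeric values of the range endpoints: 48 and 122.
start, end = ord("0"), ord("z")

# Every codepoint in the range that is not a digit or a letter.
extras = [chr(cp) for cp in range(start, end + 1) if not chr(cp).isalnum()]

print("".join(extras))  # :;<=>?@[\]^_`
print(len(extras))      # 13
```

These 13 characters are exactly what ['0'-'z'] matches in addition to digits and letters.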

Unicode properties

Unicode was invented because most people in the world don’t speak English, and many of them use languages with different alphabets. To support them, Unicode (as of version 14.0) defines 144,697 characters covering 159 different scripts. Since we have a standard that makes it really easy to support different languages, there’s no excuse for not using it.

The character class ['a'-'z' 'A'-'Z'] only recognizes Latin characters. What should we do instead? We should use a general category. In this case, Letter seems like a good candidate. Pomsky makes it easy to use Unicode categories:

[Letter]

That’s it. This matches any letter from all 159 scripts! It’s also possible to match any codepoint in a certain script:

[Cyrillic Hebrew]

This matches a Cyrillic or Hebrew codepoint.

Most regex engines can also match Unicode properties other than categories and scripts. Useful properties include

  • Alpha (letters and marks that can appear in a word)
  • Upper, Lower (uppercase or lowercase letters)
  • Emoji
  • Math (mathematical symbols)

You can see the full list of Unicode properties here.
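To illustrate the difference between a Latin-only class and the Letter category, here is a small Python sketch. Python's standard re module does not support Unicode property escapes, so this sketch looks up the general category with the standard unicodedata module instead. The helper name is_letter is made up for this example.

```python
import re
import unicodedata


def is_letter(ch: str) -> bool:
    """Return True if the codepoint's general category is a Letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


# Regex counterpart of ['a'-'z' 'A'-'Z']; only covers unaccented Latin letters.
latin_only = re.compile(r"[a-zA-Z]")

for ch in ["a", "Z", "я", "ש", "é", "5", "_"]:
    # Columns: character, matched by the Latin-only class, is a Unicode letter.
    print(ch, bool(latin_only.fullmatch(ch)), is_letter(ch))
```

The Cyrillic, Hebrew and accented letters are rejected by the Latin-only class but count as letters by general category, while the digit and the underscore are rejected by both.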

What’s a codepoint?

A Unicode codepoint usually, but not always, represents a character. Exceptions are composite characters like ć (which may consist of a ´ and a c when it isn’t normalized). Composite characters are common in many scripts, including Japanese, Indian and Arabic scripts. Also, an emoji can consist of multiple codepoints, e.g. when it has a gender or skin tone modifier.

Most regex engines look at one codepoint at a time. This means that [Letter] matches exactly one codepoint. The exception is .NET, which does not properly support Unicode: character classes in .NET can only match codepoints from the Basic Multilingual Plane (BMP).
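The difference between characters, codepoints and UTF-16 code units can be observed directly. This Python sketch uses the standard unicodedata module; it assumes the accent in the decomposed form is the combining acute accent U+0301, and the emoji U+1F600 is just an arbitrary codepoint outside the Basic Multilingual Plane.

```python
import unicodedata

# The same visible character, in composed (NFC) and decomposed (NFD) form.
composed = unicodedata.normalize("NFC", "c\u0301")   # single codepoint U+0107
decomposed = unicodedata.normalize("NFD", "\u0107")  # 'c' followed by U+0301

print(len(composed), len(decomposed))  # 1 2

# A codepoint outside the BMP needs two UTF-16 code units (a surrogate pair).
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2

print(len(emoji), utf16_units)  # 1 2
```

An engine that works on codepoints needs two steps to consume the decomposed form, and an engine that works on UTF-16 code units needs two steps to consume the emoji.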

Negation

A character class is negated by putting a ! in front of it. For example, !['a'-'f'] matches anything except a letter in the range from a to f.

It’s also possible to negate Unicode properties individually. For example, [Latin !Alpha] matches a codepoint that is either in the Latin script, or is not alphabetic.
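As a rough illustration of a negated class, here is the conventional regex counterpart of !['a'-'f'] tried out in Python. The translation to [^a-f] is an assumption made for this sketch.

```python
import re

# Assumed regex counterpart of the Pomsky expression !['a'-'f'].
not_a_to_f = re.compile(r"[^a-f]")

for ch in ["a", "f", "g", "F", "1", "\n"]:
    # repr() keeps the line break visible in the output.
    print(repr(ch), bool(not_a_to_f.fullmatch(ch)))
```

Note that the negated class is case-sensitive and, unlike the dot, also matches a line break.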

Dot

You can use the dot (.) to match any codepoint, except line breaks. For example:

...  # 3 codepoints (except line breaks)

By default, the dot does not match \n (line feed) and, depending on the regex flavor, possibly other line break characters. Most regex engines have a “singleline” option that changes this: when enabled, the dot matches everything, even line breaks.

If you want to match any character, without having to enable the “singleline” option, Pomsky also offers the variable C (or Codepoint):

C C C  # 3 codepoints

Note that the number of codepoints is not always the number of visible characters. Also note that .NET does not properly support Unicode, and matches UTF-16 code units instead of codepoints. This means that when encountering a codepoint outside of the BMP, .NET matches each UTF-16 surrogate individually, so one . or C may match only half a codepoint in .NET.
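The effect of the “singleline” option can be tried in Python, where it is called re.DOTALL. The sketch below also uses [\s\S], a common regex idiom for “any codepoint” that needs no flag; whether Pomsky emits exactly this for C is not claimed here.

```python
import re

text = "a\nb"  # three codepoints, the middle one is a line feed

# Without flags, '.' refuses to match the line feed.
default_dot = re.fullmatch(r"...", text)

# With DOTALL (Python's "singleline" option), '.' matches everything.
singleline_dot = re.fullmatch(r"...", text, re.DOTALL)

# A class containing whitespace and non-whitespace matches any codepoint.
any_codepoint = re.fullmatch(r"[\s\S]{3}", text)

print(default_dot is None)         # True
print(singleline_dot is not None)  # True
print(any_codepoint is not None)   # True
```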

Repeating the dot

Be careful when repeating C or the dot. My personal recommendation is to never repeat them. Let’s see why:

'{' .* '}'

This matches any content surrounded by curly braces. Why is this bad? Because .* will greedily consume anything, even curly braces, so looking for matches in the string {ab} de {fg} will return the whole string, but we probably expected to get the two matches {ab} and {fg}.

We can fix this by making the repetition lazy:

'{' .* lazy '}'

However, if the expression is followed by anything else, the dot may still consume curly braces. For example:

'{' .* lazy '};'

This expression will match the text {foo}}}};, which may not be desired. So it is usually better to restrict which characters can be repeated:

'{' !['{}']* '};'

Now the curly braces can contain anything except { and }, so we know that it will stop repeating when a brace is encountered, and fail if there’s no };.
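The three variants can be compared side by side in Python. The regex strings are hand-written counterparts of the Pomsky expressions in this section (an assumption for this sketch), run against the example strings used above.

```python
import re

text = "{ab} de {fg}"

# Greedy: the dot runs to the last closing brace in the string.
greedy = re.findall(r"\{.*\}", text)
print(greedy)  # ['{ab} de {fg}']

# Lazy: the dot stops at the first closing brace.
lazy = re.findall(r"\{.*?\}", text)
print(lazy)    # ['{ab}', '{fg}']

tricky = "{foo}}}};"

# Lazy, but followed by '};': the dot still swallows the extra braces.
lazy_suffix = re.search(r"\{.*?\};", tricky)
print(lazy_suffix.group(0))  # {foo}}}};

# Restricted class: braces cannot be repeated, so there is no match here.
restricted = re.compile(r"\{[^{}]*\};")
print(restricted.search(tricky))              # None
print(restricted.search("{foo};").group(0))   # {foo};
```

The lazy quantifier only changes which match is tried first; the restricted class changes which strings can match at all.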