A character set allows matching one of several code points.
let CharacterSet = '[' CharacterSetInner+ ']';

let CharacterSetInner =
    | Range
    | String
    | CodePoint
    | NonPrintable
    | Shorthand
    | UnicodeProperty
    | AsciiShorthand;

let Range = SingleChar '-' SingleChar;

let SingleChar =
    | StringOneChar
    | CodePoint
    | NonPrintable;  # deprecated!

let NonPrintable =
    | 'n' | 'r' | 't'
    | 'a' | 'e' | 'f';

let Shorthand = '!'? ShorthandIdent;

let ShorthandIdent =
    | 'w' | 'word'
    | 'd' | 'digit'
    | 's' | 'space'
    | 'h' | 'horiz_space'
    | 'v' | 'vert_space';

let AsciiShorthand =
    | 'ascii'
    | 'ascii_alpha'
    | 'ascii_alnum'
    | 'ascii_blank'
    | 'ascii_cntrl'
    | 'ascii_digit'
    | 'ascii_graph'
    | 'ascii_lower'
    | 'ascii_print'
    | 'ascii_punct'
    | 'ascii_space'
    | 'ascii_upper'
    | 'ascii_word'
    | 'ascii_xdigit';

let UnicodeProperty = '!'? Name;
['ad' 'f'-'x' Greek digit n U+FEFF]
Character sets are supported in all flavors. However, not all Unicode properties are supported in all flavors.
Furthermore, in .NET, character sets incorrectly match UTF-16 code units rather than code points. This means that a character set cannot be used for characters outside the Basic Multilingual Plane (BMP) in .NET.
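The difference between code units and code points can be illustrated outside of Pomsky. The following Python sketch counts the UTF-16 code units needed for one character inside the BMP and one outside it. The helper name `utf16_code_units` is made up for this illustration and is not part of any Pomsky or .NET API.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs to encode `ch`."""
    # Each UTF-16 code unit occupies two bytes.
    return len(ch.encode("utf-16-le")) // 2


# U+FEFF lies inside the BMP and fits into a single code unit.
print(utf16_code_units("\uFEFF"))

# U+1F600 lies outside the BMP and is stored as a surrogate pair.
print(utf16_code_units("\U0001F600"))
```

An engine that matches one code unit at a time sees the second character as two separate units, which is why a single character set cannot match it.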
word cannot be negated if the character set contains other items as well. For example,
[!word s] does not work. The reason is that
\w is polyfilled in
A character set matches a single Unicode code point. It is surrounded by square brackets and can contain an arbitrary number of characters, code points, character ranges, non-printable characters, shorthand character classes, and Unicode properties.
A character set is a set in the mathematical sense, matching the union of everything written in the square brackets.
Code Points and Characters
Character sets can contain code points such as U+FEFF, and strings, which are treated as the set of their code points. For example, ['ace'] is equivalent to ['a' 'c' 'e'], or [U+61 U+63 U+65].
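The equivalence above follows from treating a string as the set of its code points. A minimal Python sketch of that idea is shown here; it models the semantics only and is not how Pomsky is implemented. The function name `code_points` is hypothetical.

```python
def code_points(*strings: str) -> set[int]:
    """Collect the code points of all given strings into one set."""
    return {ord(ch) for s in strings for ch in s}


as_one_string = code_points("ace")              # models ['ace']
as_three_strings = code_points("a", "c", "e")   # models ['a' 'c' 'e']
as_literals = {0x61, 0x63, 0x65}                # models [U+61 U+63 U+65]

print(as_one_string == as_three_strings == as_literals)
```

Because sets ignore order and duplicates, the three spellings describe the same character set.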
In .NET, only code points in the BMP are allowed.
Ranges of code points can be specified like [U+40-U+50]. Ranges must be ascending and non-empty: the first code point must be lower than the second code point, so they constitute a lower and upper bound. Both bounds are included in the set. Each bound can be either a string containing exactly one code point, a code point literal, or a non-printable character (see below). Non-printable characters in ranges are deprecated.
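The range rules can be sketched in Python as follows. This is an illustration of the rule as worded in the text (strictly ascending, both bounds included), not Pomsky's actual validation code; the name `inclusive_range` is an assumption made for this example.

```python
def inclusive_range(lower: int, upper: int) -> range:
    """Return the code points from `lower` to `upper`, both included.

    Raises ValueError unless `lower` is strictly less than `upper`,
    following the wording of the rule in the text.
    """
    if lower >= upper:
        raise ValueError(
            f"range U+{lower:04X}-U+{upper:04X} is not ascending"
        )
    # range() excludes its end, so add 1 to include the upper bound.
    return range(lower, upper + 1)


r = inclusive_range(0x40, 0x50)   # models [U+40-U+50]
print(0x40 in r, 0x50 in r)       # both bounds belong to the set
print(len(r))                     # number of code points covered
```

Calling `inclusive_range(0x50, 0x40)` raises `ValueError`, which corresponds to a range that is written in descending order.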
There are 6 non-printable ASCII characters with a special syntax:
a is equivalent to
t is equivalent to
n is equivalent to
e is equivalent to
f is equivalent to
r is equivalent to
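As a rough orientation, the six names can be written down as a lookup table in Python. The code points below are an assumption: they are the values of the conventional escapes with the same letters (`\n`, `\r`, `\t`, `\a`, `\e`, `\f`) in common regex engines, and should be checked against the Pomsky reference before being relied on.

```python
# Assumed mapping, based on the conventional meaning of the
# same-named escape sequences; not taken from Pomsky's source.
NON_PRINTABLE = {
    "n": 0x0A,  # line feed
    "r": 0x0D,  # carriage return
    "t": 0x09,  # horizontal tab
    "a": 0x07,  # bell (alert)
    "e": 0x1B,  # escape
    "f": 0x0C,  # form feed
}

for name, cp in NON_PRINTABLE.items():
    print(f"{name} -> U+{cp:02X}")
```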
There exist a variety of shorthands that can be used in a character set.
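The following general shorthands exist, each of which has a full name and a single-character alias:
- word (alias w)
- digit (alias d)
- space (alias s)
- horiz_space (alias h)
- vert_space (alias v)
The following ASCII shorthands exist:
- ascii
- ascii_alpha
- ascii_alnum
- ascii_blank
- ascii_cntrl
- ascii_digit
- ascii_graph
- ascii_lower
- ascii_print
- ascii_punct
- ascii_space
- ascii_upper
- ascii_word
- ascii_xdigit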
Details about supported Unicode properties can be found here.
Negation of shorthands
Shorthands (except for ASCII shorthands) are special in that they can be negated. However, only a single exclamation mark is allowed in front of shorthands, so no double negation is possible.
There are some exceptions though:
h can’t be negated.
w can’t be negated when targeting
/v flag becomes widely supported.
Usually, compiling character sets is straightforward, but there are some edge cases. Character sets translate to brackets ([···]), usually called “character classes” in regex lingo. Negated character sets translate to negative character classes ([^···]). Negating a single-character string also produces a character class, whereas a non-negated character class with only a single element is unwrapped:
['ad']     # [ad]
!['ad']    # [^ad]
!'a'       # [^a]
['a']      # a
Pomsky removes duplicate items and eliminates double negation where possible:
['test']    # [tes]
![!word]    # \w
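The wrapping, unwrapping and deduplication rules can be mimicked with a few lines of Python. This is a simplified sketch that handles only plain characters needing no escaping; it is not Pomsky's compiler, and the function name `compile_chars` is hypothetical.

```python
def compile_chars(chars: str, negated: bool = False) -> str:
    """Compile a set of plain characters into a regex fragment.

    Simplified: assumes no character in `chars` needs escaping.
    """
    # Drop duplicates while keeping first-seen order.
    unique = "".join(dict.fromkeys(chars))
    if negated:
        # Negation always yields a negative character class.
        return f"[^{unique}]"
    if len(unique) == 1:
        # A non-negated set with a single element is unwrapped.
        return unique
    return f"[{unique}]"


print(compile_chars("ad"))                # models ['ad']
print(compile_chars("ad", negated=True))  # models !['ad']
print(compile_chars("a", negated=True))   # models !'a'
print(compile_chars("a"))                 # models ['a']
print(compile_chars("test"))              # models ['test']
```

Keeping first-seen order when removing duplicates matches the `[tes]` output shown for `['test']`.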
Special characters are escaped when needed, but ^ is only escaped if it is the first character:
['[]-^&\']    # [\[\]\-^\&\\]
['^']         # [\^]
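The escaping rule can be sketched in Python. The set of always-escaped characters below is an assumption inferred from the output shown above (`[`, `]`, `-`, `&` and `\`); the real list used by Pomsky may differ per flavor. The names `ALWAYS_ESCAPED` and `escape_class_body` are made up for this example.

```python
# Assumption: characters escaped regardless of position,
# inferred from the example output in the text.
ALWAYS_ESCAPED = set("[]-&\\")


def escape_class_body(chars: str) -> str:
    """Escape the characters that go between the square brackets."""
    out = []
    for index, ch in enumerate(chars):
        # '^' only negates a class when it comes first,
        # so it needs a backslash only in that position.
        if ch in ALWAYS_ESCAPED or (ch == "^" and index == 0):
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


print("[" + escape_class_body("[]-^&\\") + "]")
print("[" + escape_class_body("^") + "]")
```

The two printed lines reproduce the compiled outputs `[\[\]\-^\&\\]` and `[\^]` from the example.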
h are polyfilled in all flavors except PCRE and
Java. ASCII shorthands are polyfilled everywhere, even though they are supported in PCRE as “POSIX
Behavior is incorrect in .NET (see above).
Union and intersection of sets is not yet implemented.
['&' '&'-'Z'] miscompiles in JS with the /v flag because & is not escaped.
- Deprecated shorthands in character ranges in Pomsky 0.11
- Extended set of supported Unicode properties in Pomsky 0.10
- Added support for Unicode blocks and boolean properties in Pomsky 0.8
[cp] syntax in Pomsky 0.6
- Added shorthand aliases
vert_space in Pomsky 0.3
- ASCII shorthands renamed to begin with ascii_ in Pomsky 0.3
- Initial implementation in Pomsky 0.1