Enable Unicode Support
Pomsky has good Unicode support, but you might still have to enable Unicode support in your regex engine. This document explains how to do that for various regex engines.
If some information here is missing, outdated or needs clarification, I would greatly appreciate your help! You can edit this file on GitHub.
Rust
The Rust regex crate is Unicode-aware by default. There’s nothing you need to do.
JavaScript
In JavaScript, set the u flag, for example /[\w\s]/u. This makes it possible to use Unicode properties (\p{...}) and code points outside of the BMP (\u{...}).
Since \w and \d are not Unicode-aware even when the u flag is enabled, Pomsky polyfills them. Word boundaries, however, aren’t Unicode-aware either, so to use them you need to either disable Unicode or use lookarounds instead.
disable unicode;
<'test'>
If you need Unicode-aware word boundaries, you can use the following instead of the < and > word boundaries:
let wstart = (!<< [w]) (>> [w]); # start of a word
let wend = (<< [w]) (!>> [w]); # end of a word
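To see why these lookaround definitions behave like word boundaries, it can help to try the same idea in an engine whose \w is already Unicode-aware. The following is a minimal sketch in Python, not Pomsky's actual output: it assumes a word start is a position not preceded by a word character and followed by one, and a word end the reverse. The names WORD_START, WORD_END and find_whole_word are made up for illustration.

```python
import re

# Lookaround equivalents of the wstart / wend definitions above.
# Python's \w is Unicode-aware for str patterns, so no polyfill is needed here.
WORD_START = r"(?<!\w)(?=\w)"  # not preceded by a word char, followed by one
WORD_END = r"(?<=\w)(?!\w)"    # preceded by a word char, not followed by one


def find_whole_word(word, text):
    """Return the start offsets where `word` occurs as a whole word in `text`."""
    pattern = re.compile(WORD_START + re.escape(word) + WORD_END)
    return [m.start() for m in pattern.finditer(text)]


# "\u00fc" is the letter u with diaeresis; only the first occurrence
# below is a standalone word, the other two are parts of longer words.
print(find_whole_word("\u00fcber", "\u00fcber \u00fcberall dr\u00fcber"))
```

An ASCII-only \b would see a boundary between the non-ASCII letter and the rest of the word, which is why the lookaround form is the safer choice when the input contains non-ASCII letters.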
PHP
PHP is Unicode-aware if the u flag is set, and this also applies to \w, \d, \s and \b. For example, '/\w+/u' matches a word in any script.
Java, Kotlin, Scala
Add (?U) in front of the regex to make it Unicode-aware. For example, "(?U)\\w+" matches a word in any script.
Ruby
In Ruby, add (?u) in front of the regex to make it Unicode-aware. For example, /(?u)\w+/ matches a word in any script.
Python
In the Python re module, \w, \d, \s and \b are Unicode-aware since Python 3.
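The default behavior can be checked with a short experiment. This is a minimal sketch, assuming Python 3 and str (not bytes) patterns; the helper name words is hypothetical. It contrasts the default Unicode-aware matching with the ASCII-only matching you get from the re.ASCII flag.

```python
import re


def words(text, ascii_only=False):
    """Split `text` into runs of word characters."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\w+", text, flags)


sample = "na\u00efve caf\u00e9"  # "naive cafe" with the accented letters escaped

# Default in Python 3: accented letters count as word characters.
print(words(sample))

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the words are split
# at the accented letters.
print(words(sample, ascii_only=True))
```

The second call shows roughly what an engine without Unicode support would do with the same pattern, which is the behavior the other sections on this page help you avoid.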
If you’re still using Python 2, you can use the November 2021 release of the regex module; newer releases don’t support Python 2.
Elixir
Regexes in Elixir are Unicode-aware if the u flag is added. For example, ~r/\w+/u matches a word in any script.
Erlang
You need to set the unicode and ucp options to make regexes Unicode-aware. For example, re:compile("\\w+", [unicode, ucp]) matches a word in any script.
PCRE
PCRE supports Unicode, but to make \w, \d, \s and \b Unicode-aware, you need to enable both PCRE_UTF8 and PCRE_UCP. In PCRE2, the corresponding options are PCRE2_UTF and PCRE2_UCP.