Enable Unicode Support

Pomsky has good Unicode support, but you might still have to enable Unicode support in your regex engine. This document explains how to do that for various regex engines.

If some information here is missing, outdated or needs clarification, I would greatly appreciate your help! You can edit this file on GitHub.

Rust

The Rust regex crate is Unicode-aware by default. There’s nothing you need to do.

JavaScript

In JavaScript, set the u flag, for example /[\w\s]/u. This makes it possible to use Unicode properties (\p{...}) and code points outside of the BMP (\u{...}).
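As a rough illustration of what the u flag changes, here is a minimal sketch in plain JavaScript. The variable names are made up for this example and have nothing to do with Pomsky itself.

```javascript
// \p{...} property escapes only work with the u flag.
const greekOnly = /^\p{Script=Greek}+$/u;
const matchesGreek = greekOnly.test("αβγ");

// \u{...} can name a code point outside the BMP, here U+1F600.
const matchesEmoji = /^\u{1F600}$/u.test("😀");

// Without u, "." consumes a single UTF-16 code unit, so one astral
// character (two code units) does not fit ^.$; with u it does.
const dotWithoutU = /^.$/.test("😀");
const dotWithU = /^.$/u.test("😀");

console.log({ matchesGreek, matchesEmoji, dotWithoutU, dotWithU });
```

Only the last pair differs in outcome: the same pattern fails without the flag and succeeds with it.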

In JavaScript, \w and \d are not Unicode-aware even when the u flag is enabled, so Pomsky polyfills them. Word boundaries aren’t Unicode-aware either, so to use them you need to either disable Unicode or replace them with lookarounds.

disable unicode;
<'test'>
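To see why this matters, the following sketch checks how JavaScript's built-in \w and \b treat a non-ASCII letter. It uses hand-written regexes rather than Pomsky output, and the variable names are only illustrative.

```javascript
// \w stays ASCII-only even with the u flag, so "ü" is not a word character.
const allWordChars = /^\w+$/u.test("über");

// \b is defined in terms of \w. There is no boundary before "ü"
// (start of input followed by a non-word character) ...
const boundaryAtStart = /\büber/u.test("über");

// ... but there is a spurious one between "ü" and "b".
const boundaryInside = /\bber/u.test("über");

console.log({ allWordChars, boundaryAtStart, boundaryInside });
```

So a pattern that relies on \b can miss a word that starts with a non-ASCII letter and find a "word start" in the middle of it.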

If you need Unicode-aware word boundaries, you can use the following instead of the < and > word boundaries:

let wstart = (!<< [w]) (>> [w]);  # start of a word
let wend   = (<< [w]) (!>> [w]);  # end of a word
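For comparison, here is roughly what the two variables above could look like when written by hand as a JavaScript regex. This is only a sketch: the character class W is an assumed Unicode-aware stand-in for a word character, and the regex Pomsky actually generates may differ.

```javascript
// Assumed Unicode-aware word character class; not necessarily Pomsky's exact output.
const W = String.raw`[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]`;

// Hand-written counterparts of wstart and wend from the Pomsky snippet above.
const wstart = `(?<!${W})(?=${W})`; // not preceded by, but followed by a word character
const wend = `(?<=${W})(?!${W})`;   // preceded by, but not followed by a word character

// The property escapes inside W require the u flag.
const wholeWord = new RegExp(`${wstart}über${wend}`, "u");

const hitAlone = wholeWord.test("ganz über alles"); // "über" stands on its own
const hitInside = wholeWord.test("darüber");        // "über" is part of a longer word

console.log({ hitAlone, hitInside });
```

Lookbehind needs a reasonably modern JavaScript engine, which is worth checking if you target older browsers.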

PHP

PHP is Unicode-aware if the u flag is set, and this also applies to \w, \d, \s and \b. For example, '/\w+/u' matches a word in any script.

Java, Kotlin, Scala

Add (?U) in front of the regex to make it Unicode-aware. For example, "(?U)\\w+" matches a word in any script.

Ruby

In Ruby, add (?u) in front of the regex to make it Unicode-aware. For example, /(?u)\w+/ matches a word in any script.

Python

In the Python re module, \w, \d, \s and \b are Unicode-aware by default for str patterns since Python 3.

If you’re still using Python 2, you can use the third-party regex module, but only up to the release from November 2021; newer releases no longer support Python 2.

Elixir

Regexes in Elixir are Unicode-aware if the u flag is added. For example, ~r/\w+/u matches a word in any script.

Erlang

You need to set both the unicode and ucp options to make regexes Unicode-aware. For example, re:compile("\\w+", [unicode, ucp]) compiles a regex that matches a word in any script.

PCRE

PCRE supports Unicode, but to make \w, \d, \s and \b Unicode-aware, you need to enable both PCRE_UTF8 and PCRE_UCP. In PCRE2, the corresponding options are PCRE2_UTF and PCRE2_UCP.