Word Boundaries
Word boundaries match a position where a word starts or ends. Like anchors, they do not consume any characters – they have a length of 0. Expressions like this are called assertions.
There are three kinds of word boundaries:
<
to match at the start of a word>
to match at the end of a word%
to match either at the start or at the end of a word.
For example, if you want to find occurrences of the word test
, but do not want to match
substrings in words like testament
or detests
, you need to add word boundaries:
<'test'>
To match multiple words, wrap an alternation in a group:
<('if' | 'else' | 'for' | 'while')>
What is a word boundary?
A word start boundary is a position followed, but not preceded by a word character. Likewise, a word end boundary is position preceded, but not followed by a word character.
“Word characters” include letters, digits, and underscores. Formally, word characters are the set of the following Unicode properties:
- Alphabetic
- Mark
- Decimal_Number
- Connector_Punctuation
You can match a word character with the [word]
character set.
Note that word boundaries aren’t 100% accurate: For example, the word can't
has 4 word boundaries:
At the start, the end, and around the apostrophe. Some scripts (e.g. Chinese) don’t separate words
by spaces, so no word boundaries can be detected.
Negation
The %
word boundary can be negated as !%
. This matches inside or
outside of a word, but not at a word boundary.
Note about JavaScript
In JavaScript, word boundaries are never Unicode-aware, even when the u
flag is set. That’s why
Unicode must be disabled to use them:
disable unicode;
<'test'>
If you need Unicode-aware word boundaries, you can use the following variables instead of the
<
and >
word boundaries:
let wstart = (!<< [w]) (>> [w]); # start of a word
let wend = (<< [w]) (!>> [w]); # end of a word
wstart 'test' wend