Word boundaries match a position where a word starts or ends. Like anchors, they do not consume any characters – they have a length of 0. Expressions like this are called assertions.
There are three kinds of word boundaries:
< to match at the start of a word
> to match at the end of a word
% to match either at the start or at the end of a word.
For example, if you want to find occurrences of the word test, but do not want to match substrings in words like detests, you need to add word boundaries:
<'test'>
To match multiple words, wrap an alternation in a group:
<('if' | 'else' | 'for' | 'while')>
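The same idea can be tried out in an ordinary regex engine. The following is a minimal sketch using Python's re module, where \b is assumed to play the role of both < and >; it is not the output Pomsky generates, and the names KEYWORDS and find_keywords are made up for this illustration.

```python
import re

# Assumption: `\b` approximates Pomsky's `<` and `>` here.
# The non-capturing group plays the role of the parenthesized alternation.
KEYWORDS = re.compile(r"\b(?:if|else|for|while)\b")


def find_keywords(source: str) -> list[str]:
    """Return every keyword that appears as a whole word in `source`."""
    return KEYWORDS.findall(source)


# "format" and "elsewhere" contain keywords only as substrings,
# so they are skipped.
print(find_keywords("if format else elsewhere while"))
```

Without the group, the boundaries would only apply to the first and last alternative, which is why the alternation is wrapped.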
What is a word boundary?
A word start boundary is a position that is followed, but not preceded, by a word character. Likewise, a word end boundary is a position that is preceded, but not followed, by a word character.
“Word characters” include letters, digits, and underscores. Formally, word characters are the set of the following Unicode properties:
You can match a word character with the [word] character set.
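The definition of word start and word end can be written out with lookarounds. This sketch uses Python's re module, assuming its \w is close enough to the word characters described above; WORD_START, WORD_END and boundaries are hypothetical names chosen for the example.

```python
import re

# Word start: not preceded by a word character, but followed by one.
WORD_START = re.compile(r"(?<!\w)(?=\w)")
# Word end: preceded by a word character, but not followed by one.
WORD_END = re.compile(r"(?<=\w)(?!\w)")


def boundaries(text: str) -> tuple[list[int], list[int]]:
    """Return the indices of all word starts and all word ends in `text`."""
    starts = [m.start() for m in WORD_START.finditer(text)]
    ends = [m.start() for m in WORD_END.finditer(text)]
    return starts, ends


# Underscores and digits count as word characters, so "test_1" is one word.
print(boundaries("a test_1 ok"))
```

Both patterns match an empty string, which reflects that the boundaries have a length of 0.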
Note that word boundaries aren’t 100% accurate. For example, the word can't has 4 word boundaries: at the start, at the end, and on either side of the apostrophe. Some scripts (e.g. Chinese) don’t separate words by spaces, so the boundaries between adjacent words cannot be detected.
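The apostrophe case is easy to check. The sketch below uses Python's \b as a stand-in for the % boundary, which is an assumption about equivalence rather than a statement about Pomsky's output; boundary_positions is a hypothetical helper.

```python
import re


def boundary_positions(text: str) -> list[int]:
    """Return every index in `text` where `\\b` matches."""
    return [m.start() for m in re.finditer(r"\b", text)]


# The apostrophe is not a word character, so it splits "can't" in two.
print(boundary_positions("can't"))
```

The four positions are before c, after n, before t and after t.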
The % word boundary can be negated as !%. This matches inside or outside of a word, but not at a word boundary.
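To see what "inside or outside of a word" means, the following sketch lists the positions matched by Python's \B, assumed here to behave like the negated boundary; non_boundary_positions is a hypothetical helper.

```python
import re


def non_boundary_positions(text: str) -> list[int]:
    """Return every index in `text` where `\\B` matches."""
    return [m.start() for m in re.finditer(r"\B", text)]


# Index 1 lies inside the word "ab"; index 3 lies between two spaces,
# that is, outside of any word.
print(non_boundary_positions("ab  c"))
```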
In JavaScript, word boundaries are not Unicode-aware, even when the u flag is set. That’s why Unicode must be disabled to use them:
disable unicode;
<'test'>
If you need Unicode-aware word boundaries, you can use the following variables instead of the < and > word boundaries:
let wstart = (!<< [w]) (>> [w]);  # start of a word
let wend = (<< [w]) (!>> [w]);  # end of a word

wstart 'test' wend
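The difference between an ASCII-only boundary and a lookaround-based one can be illustrated outside of JavaScript. This sketch uses Python's re.ASCII flag to imitate a boundary that is not Unicode-aware; it is an analogy, not the regex Pomsky emits, and the pattern names are made up for the example.

```python
import re

# ASCII-only `\b`: "é" is not treated as a word character.
ASCII_BOUNDARY = re.compile(r"\btest\b", flags=re.ASCII)

# Lookaround-based boundaries, mirroring wstart and wend.
# Python's `\w` is Unicode-aware by default for str patterns.
UNICODE_BOUNDARY = re.compile(r"(?<!\w)(?=\w)test(?<=\w)(?!\w)")

text = "étest"

# The ASCII variant sees a boundary between "é" and "t" and matches.
print(ASCII_BOUNDARY.search(text) is not None)

# The Unicode-aware variant treats "étest" as a single word and does not.
print(UNICODE_BOUNDARY.search(text) is not None)
```

The lookaround version costs a few extra characters but gives the same result for ASCII input, so it can be used as a drop-in replacement when non-ASCII text is expected.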