Word Boundaries

Word boundaries match a position where a word starts or ends. Like anchors, they do not consume any characters – they have a length of 0. Expressions like this are called assertions.

There are three kinds of word boundaries:

  • < to match at the start of a word
  • > to match at the end of a word
  • % to match either at the start or at the end of a word

For example, if you want to find occurrences of the word test, but do not want to match substrings in words like testament or detests, you need to add word boundaries:

<'test'>
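To see the effect on real input, here is a minimal Python sketch. It assumes that `\btest\b` is a reasonable regex analogue of the expression above; Python's `\b` does not distinguish between the start and the end of a word, so the correspondence is approximate. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text containing "test" as a word and as a substring.
text = "a test, a testament, and he detests it"

# Without boundaries, every occurrence of the substring is found.
unbounded = re.findall(r"test", text)

# With boundaries on both sides, only the standalone word is found.
# \b is used here as an approximate stand-in for < and >.
bounded = re.findall(r"\btest\b", text)

print(len(unbounded))  # 3
print(bounded)         # ['test']
```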

To match multiple words, wrap an alternation in a group:

<('if' | 'else' | 'for' | 'while')>
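The group matters because alternation binds more loosely than the boundaries. The following Python sketch shows the analogous effect in regex syntax; the patterns and the sample text are illustrative assumptions, not output produced by Pomsky.

```python
import re

# Hypothetical sample text: "else" and "for" only occur inside longer words.
text = "elsewhere if format while"

# Boundaries apply to the whole alternation.
grouped = re.compile(r"\b(?:if|else|for|while)\b")

# Without the group, the first \b belongs only to "if" and the last \b
# only to "while"; "else" and "for" are matched anywhere.
ungrouped = re.compile(r"\bif|else|for|while\b")

grouped_hits = grouped.findall(text)
ungrouped_hits = ungrouped.findall(text)

print(grouped_hits)    # ['if', 'while']
print(ungrouped_hits)  # ['else', 'if', 'for', 'while']
```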

What is a word boundary?

A word start boundary is a position that is followed, but not preceded, by a word character. Likewise, a word end boundary is a position that is preceded, but not followed, by a word character.

“Word characters” include letters, digits, and underscores. Formally, a word character is any character that has at least one of the following Unicode properties:

  • Alphabetic
  • Mark
  • Decimal_Number
  • Connector_Punctuation

You can match a word character with the [word] character set, which can be abbreviated as [w].
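The two definitions can be turned into code almost literally. The sketch below is only an illustration: `is_word_char` and `word_boundaries` are hypothetical helper names, and `str.isalnum()` plus the underscore is a rough approximation of the property list above (for example, it does not cover combining marks).

```python
def is_word_char(ch: str) -> bool:
    # Approximation of a word character: letters, digits, underscore.
    return ch.isalnum() or ch == "_"


def word_boundaries(text: str) -> tuple[list[int], list[int]]:
    """Return the positions of all word starts and all word ends."""
    starts, ends = [], []
    # A position lies between two characters, so there are len(text) + 1.
    for pos in range(len(text) + 1):
        before = pos > 0 and is_word_char(text[pos - 1])
        after = pos < len(text) and is_word_char(text[pos])
        if after and not before:
            starts.append(pos)  # followed, but not preceded, by a word char
        if before and not after:
            ends.append(pos)    # preceded, but not followed, by a word char
    return starts, ends


starts, ends = word_boundaries("a test!")
print(starts)  # [0, 2]
print(ends)    # [1, 6]
```

No position can be both a start and an end, because the two conditions exclude each other.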

Note that word boundaries are only a heuristic. For example, the word can't has 4 word boundaries: at the start, at the end, and on both sides of the apostrophe. Some scripts (e.g. Chinese) don’t separate words by spaces, so the boundaries between adjacent words cannot be detected.
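Both limitations are easy to reproduce. The Python sketch below uses `\b`, which matches at word starts and word ends alike, as an assumed analogue of %. Python's notion of a word character is close to, but not identical with, the property list above.

```python
import re


def boundary_positions(text: str) -> list[int]:
    # \b is a zero-length match, so only its position is of interest.
    return [m.start() for m in re.finditer(r"\b", text)]


# The apostrophe is not a word character, so "can't" is split in two.
apostrophe = boundary_positions("can't")

# "你好世界" consists of two words (你好 and 世界), but every character
# is a word character, so no boundary is found between them.
chinese = boundary_positions("你好世界")

print(apostrophe)  # [0, 3, 4, 5]
print(chinese)     # [0, 4]
```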

Negation

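The % word boundary can be negated as !%. This matches any position that is not a word boundary, i.e. a position inside a word (between two word characters) or outside of a word (not adjacent to any word character).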
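As a rough illustration, the regex counterpart of a negated word boundary is `\B`. The Python sketch below lists the positions where it matches in a short, made-up string; treating `\B` as equivalent to !% is an assumption of this example.

```python
import re

# Positions:  0 a 1 b 2 , 3 ' ' 4 c 5 d 6
text = "ab, cd"

boundaries = [m.start() for m in re.finditer(r"\b", text)]
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]

print(boundaries)      # [0, 2, 4, 6]
print(non_boundaries)  # [1, 3, 5]
```

Positions 1 and 5 lie inside a word, while position 3 lies between the comma and the space, outside of any word.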

Note about JavaScript

In JavaScript, word boundaries are never Unicode-aware, even when the u flag is set. That’s why Unicode must be disabled to use them:

disable unicode;

<'test'>

If you need Unicode-aware word boundaries, you can use the following variables instead of the < and > word boundaries:

let wstart = (!<< [w]) (>> [w]);  # start of a word
let wend   = (<< [w]) (!>> [w]);  # end of a word

wstart 'test' wend
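The difference between the two approaches shows up as soon as a word contains non-ASCII letters. The Python sketch below is an approximation under stated assumptions: the `re.ASCII` flag stands in for JavaScript's ASCII-only `\b`, and the `WSTART` and `WEND` strings are hand-written regex lookarounds modeled on the wstart and wend variables above, not output generated by Pomsky.

```python
import re

# ASCII-only boundaries, used here to imitate JavaScript's \b.
ascii_bounded = re.compile(r"\bt\b", re.ASCII)

# Lookaround-based boundaries, modeled on wstart and wend.
WSTART = r"(?<!\w)(?=\w)"  # not preceded, but followed by a word character
WEND = r"(?<=\w)(?!\w)"    # preceded, but not followed by a word character
unicode_bounded = re.compile(WSTART + "t" + WEND)

# In "été", the letter t sits in the middle of a word.
ascii_hit = ascii_bounded.search("été") is not None
unicode_hit = unicode_bounded.search("été") is not None

# A standalone t is still found by the lookaround version.
standalone_hit = unicode_bounded.search("é t é") is not None

print(ascii_hit)       # True: é is not an ASCII word character
print(unicode_hit)     # False: é counts as a word character
print(standalone_hit)  # True
```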