Lookaround

Lookarounds let you check whether the characters before or after the current position match a certain expression. Lookarounds are assertions: they match without consuming any characters, so they have a length of 0, much like anchors and word boundaries.

Example

Let’s say we want to match all keys in a JSON file. JSON is a simple, structured text format, e.g.

{
  "languages": [
    {
      "name": "Pomsky",
      "proficiency": "expert",
      "open_source": true
    }
  ]
}

To match all keys, we need to look for strings followed by a :. However, we don’t want the colon to be part of our match; we just want to check that it’s there! Here’s a possible solution:

'"' !['"']* '"'  (>> [space]* ':')

The >> introduces a lookahead assertion. It checks that the closing " is followed by a :, possibly with spaces in between. The contents of the lookahead are not included in the match.
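To see the lookahead in action outside of Pomsky, here is a minimal Python sketch. The regex is a hand-written translation that is assumed to be equivalent to the expression above (with [space] rendered as \s); it is not the output of the Pomsky compiler, and the names KEY_PATTERN and find_keys are made up for this illustration.

```python
import re

# Hand-written regex, assumed equivalent to:
#   '"' !['"']* '"'  (>> [space]* ':')
# (?=...) is the lookahead: it must match, but is not part of the result.
KEY_PATTERN = re.compile(r'"[^"]*"(?=\s*:)')

SAMPLE_JSON = '''{
  "languages": [
    {
      "name": "Pomsky",
      "proficiency": "expert",
      "open_source": true
    }
  ]
}'''


def find_keys(text: str) -> list[str]:
    """Return every quoted string followed by optional whitespace and a colon."""
    return KEY_PATTERN.findall(text)


print(find_keys(SAMPLE_JSON))
# ['"languages"', '"name"', '"proficiency"', '"open_source"']
```

The values "Pomsky" and "expert" are skipped because they are followed by a comma, and none of the matches contains the colon.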

But what if there’s a key containing escaped quotes, e.g. "foo \"bar\" baz"? To handle this, we need to allow escape sequences in the string:

'"' (!['\"'] | '\\' | '\"')* '"'  (>> [space]* ':')

There’s just one piece missing: The first quote should not be preceded by a backslash, so we need another assertion:

(!<< '\')  '"' (!['\"'] | '\\' | '\"')* '"'  (>> [space]* ':')

The !<< introduces a negative lookbehind assertion. It asserts that the current position is not preceded by the contained expression, which in this case is a backslash.
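The difference between the first attempt and the final expression can be checked with a short Python sketch. Both regexes are hand-written translations that are assumed to match the Pomsky expressions above; NAIVE_KEY, FULL_KEY and the sample text are hypothetical names and data chosen for this example.

```python
import re

# First attempt: no escape handling, no lookbehind.
NAIVE_KEY = re.compile(r'"[^"]*"(?=\s*:)')

# Final expression, assumed equivalent to:
#   (!<< '\')  '"' (!['\"'] | '\\' | '\"')* '"'  (>> [space]* ':')
# (?<!\\) is the negative lookbehind: no backslash before the opening quote.
FULL_KEY = re.compile(r'(?<!\\)"(?:[^\\"]|\\\\|\\")*"(?=\s*:)')

# Raw string, so the backslashes appear in the text exactly as in a JSON file.
TEXT = r'{"foo \"bar\" baz": 1, "plain": "value"}'

print(NAIVE_KEY.findall(TEXT))  # ['" baz"', '"plain"']
print(FULL_KEY.findall(TEXT))   # ['"foo \\"bar\\" baz"', '"plain"']
```

The first attempt starts its match at an escaped quote in the middle of the key, while the final expression returns the key as a whole.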

In total, there are 4 kinds of lookaround assertions:

  • >> (positive lookahead)
  • << (positive lookbehind)
  • !>> (negative lookahead)
  • !<< (negative lookbehind)
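The four kinds can be compared side by side with Python's re module, which writes them as (?=...), (?<=...), (?!...) and (?<!...). This is only an illustrative sketch: the sample text and the spans helper are made up for this example.

```python
import re

TEXT = "foobar quxbar"


def spans(pattern: str, text: str = TEXT) -> list[tuple[int, int]]:
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]


# Positive lookahead: 'foo' only where 'bar' follows.
print(spans(r"foo(?=bar)"))    # [(0, 3)]
# Positive lookbehind: 'bar' only where 'foo' precedes.
print(spans(r"(?<=foo)bar"))   # [(3, 6)]
# Negative lookahead: a word that does not start with 'foo'.
print(spans(r"\b(?!foo)\w+"))  # [(7, 13)]
# Negative lookbehind: 'bar' only where 'foo' does not precede.
print(spans(r"(?<!foo)bar"))   # [(10, 13)]
# A lookaround on its own has length 0: start and end are equal.
print(spans(r"(?=bar)"))       # [(3, 3), (10, 10)]
```

In every case the span covers only the text outside the assertion, which shows that the lookaround itself contributes nothing to the match.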

Note that lookbehind isn’t supported everywhere. Safari only added support in version 16.4, so older versions of Safari don’t support lookbehind. Rust’s regex crate supports neither lookbehind nor lookahead.

Intersection expressions

Lookaround makes it possible to require that the same text matches several expressions at once. For example:

< (!>> ('_' | 'for' | 'while' | 'if') >) [word]+ >

This matches a whole word consisting of word characters, unless that word is one of the keywords _, for, while, or if.

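Be careful when using this technique, because the lookahead might not match the same length as the expression after it. Here, we ensured that both match until the word end with >. Without the > inside the lookahead, a word such as forest would be rejected as well, because it starts with for.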
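The effect of the word end inside the lookahead can be sketched in Python. Both regexes are hand-written translations; since Python has no separate word start and word end assertions, \b is used for both, which is an assumption that holds here because \w+ sits between them. The names IDENTIFIER, TOO_STRICT and SOURCE are hypothetical.

```python
import re

# Assumed equivalent to:
#   < (!>> ('_' | 'for' | 'while' | 'if') >) [word]+ >
IDENTIFIER = re.compile(r"\b(?!(?:_|for|while|if)\b)\w+\b")

# Same expression without the word end inside the lookahead.
TOO_STRICT = re.compile(r"\b(?!_|for|while|if)\w+\b")

SOURCE = "for i in forest while_ _ if _x"

print(IDENTIFIER.findall(SOURCE))  # ['i', 'in', 'forest', 'while_', '_x']
print(TOO_STRICT.findall(SOURCE))  # ['i', 'in']
```

Without the word end, the lookahead only inspects a prefix of the word, so forest, while_ and _x are rejected even though they are not keywords.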