Pomsky 0.12
I’m happy to announce version 0.12 of Pomsky, the next-level regular expression language! Pomsky makes writing correct and maintainable regular expressions a breeze. Pomsky expressions are converted into regexes, which can be used with many different regex engines.
If you’re not familiar with Pomsky, here is a quick summary of how it compares to regular expressions.
What’s new?
This release comes packed with new features and improvements. Here are the highlights:
- The RE2 flavor, used for Go’s regexp package, is now supported
- Intersection of character sets
- Improved Unicode support with Script Extensions
- A convenient pomsky test subcommand for running unit tests
- Powerful optimizations for repetitions, character sets, and alternatives
- New IDE capabilities for the VSCode extension
- New installers, including an npm package
You might have noticed that Pomsky has a new logo. I’ve also updated the website by switching from Hugo to Starlight, as the Hugo theme we were using was no longer being maintained.
This release took longer than usual because of a few unplanned delays. The last version was released almost two years ago! I had planned to cut a release by the end of 2024, but sometimes things don’t go as planned. The wait is finally over, so let’s take a look at the most exciting new features in this release!
RE2 Support
RE2 is a fast regex engine from Google. Unlike backtracking regex engines such as PCRE2, it is based on finite automata, so it has better worst-case performance. Pomsky now offers an RE2 flavor, which is also compatible with Go’s regexp package. Because the RE2 flavor doesn’t support advanced features such as lookaround (lookahead and lookbehind) and backreferences, Pomsky produces an error when you try to use them:
    > pomsky -f re2 "<< 'test'"
    error P0301(compat):
      × Unsupported feature `lookahead/behind` in the `RE2` regex flavor
       ╭────
     1 │ << 'test'
       · ────┬────
       ·     ╰── error occurred here
       ╰────
With RE2, Pomsky now supports 8 regex flavors, covering most mainstream programming languages:
- PCRE (PHP, R, Erlang, …)
- JavaScript (TypeScript, Dart, …)
- Java (JVM languages)
- Python
- .NET (C#)
- Ruby
- Rust
- RE2 (Go)
Character Set Intersection
Several regex engines support intersecting character sets:
    [\p{Thai}&&\p{Nd}]
The above matches a codepoint that is in the ‘Thai’ script and in the ‘Nd’ (decimal number) category. Pomsky now has an & operator to express this:
    [Thai] & [Nd]
Some regex engines also support subtraction. Pomsky doesn’t offer this feature, but it can be easily emulated using negation:
    [Thai] & ![Nd]  # negating one character set subtracts it
Note that not all flavors support intersection. However, if both character sets are negated, they are merged by applying De Morgan’s first law:
    ![Thai] & ![Nd]  # is turned into...
    ![Thai Nd]
Unicode Script Extensions
Most software has to be able to handle text in different languages and writing systems. This is why I’ve always considered good Unicode support to be one of Pomsky’s strongest features. For example, Pomsky polyfills \w in JavaScript, which is surprisingly not Unicode aware, even with the unicode flag enabled.
Pomsky also makes it easy to match a code point in a particular Unicode script: For example, [Syriac] matches all Syriac characters – at least in theory. But Unicode scripts cannot overlap, so code points that would belong in multiple scripts are assigned to the Common or Inherited script instead.
Script Extensions, which Pomsky now supports, solve this problem:
    # matches all codepoints with a Syriac script extension
    [scx:Syriac]
Because code points can have multiple scripts in their ‘Script Extensions’ property, this is more accurate.
Script Extensions are currently only supported in the PCRE, JavaScript and Rust flavors, but hopefully more regex engines will add support in the future.
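To make the difference between the Script and Script Extensions properties concrete, here is a minimal Python sketch. It is not how Pomsky or any regex engine implements the lookup: the two tables are a tiny hand-copied excerpt of Unicode data for three code points, and the function names are made up for illustration.

```python
# Hand-copied excerpt of Unicode data (an assumption of this sketch; real code
# would read Scripts.txt and ScriptExtensions.txt from the Unicode database).
SCRIPT = {
    0x0710: "Syriac",     # SYRIAC LETTER ALAPH
    0x064B: "Inherited",  # ARABIC FATHATAN, used in Arabic and Syriac text
    0x0061: "Latin",      # LATIN SMALL LETTER A
}
SCRIPT_EXTENSIONS = {
    0x064B: {"Arabic", "Syriac"},
}


def in_script(code_point, script):
    """Roughly what [sc:...] asks: is this the code point's single script?"""
    return SCRIPT.get(code_point) == script


def in_script_extensions(code_point, script):
    """Roughly what [scx:...] asks: is the script among the extensions?"""
    # Code points without an explicit entry fall back to their Script value.
    fallback = {SCRIPT.get(code_point)}
    return script in SCRIPT_EXTENSIONS.get(code_point, fallback)


print(in_script(0x064B, "Syriac"))             # False
print(in_script_extensions(0x064B, "Syriac"))  # True
print(in_script_extensions(0x0710, "Syriac"))  # True
```

The shared diacritic U+064B is invisible to a plain script test because its script is Inherited, but the Script Extensions test finds it.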
In addition to scx:, you can also use the gc:, sc:, and blk: prefixes to match a general category, script, or block. These prefixes are optional, but adding them might help with readability:
    [Letter]     # old
    [gc:Letter]  # new
    [InBasic_Latin]    # old
    [blk:Basic_Latin]  # new
Note that the In prefix of Unicode blocks is omitted when using the blk: prefix. It is recommended to use blk: because we will deprecate the In prefix in the future.
Test subcommand
Pomsky 0.11 added the test construct and a --test flag to run unit tests during compilation. This was a big step towards making Pomsky expressions more correct and maintainable. However, there was no easy way to test all Pomsky expressions in a project during a CI workflow. This has now been addressed with the new pomsky test command:
    > pomsky test --path examples/ -e pcre2
    testing examples/modes.pomsky ... ok
    testing examples/email.pomsky ... ok
    testing examples/repetitions.pomsky ... ok
    testing examples/version.pomsky ... ok
    testing examples/special.pomsky ... ok
    testing examples/groups.pomsky ... ok
    testing examples/strings.pomsky ... ok
    testing examples/capt_groups.pomsky ... ok
    test result: ok, 8 files tested in 1.41ms
pomsky test recursively iterates through the given directory, taking .ignore and .gitignore files into account, and tests all files ending with .pomsky. If at least one file doesn’t compile or contains a failing test, the program exits with an error, so it’s easy to include it in your CI pipeline.
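If your CI steps are driven from a script rather than a plain shell line, the exit status can be checked as in this minimal Python sketch. It assumes the pomsky binary is on the PATH and uses only the --path and -e flags shown in the transcript above; the function names are hypothetical.

```python
import subprocess


def build_command(path, engine="pcre2"):
    """Assemble the command line shown in the transcript above."""
    return ["pomsky", "test", "--path", path, "-e", engine]


def run_pomsky_tests(path, engine="pcre2"):
    """Return True if `pomsky test` exits with status 0."""
    # Assumption: `pomsky` is installed and on the PATH.
    result = subprocess.run(build_command(path, engine))
    return result.returncode == 0
```

A CI script could call run_pomsky_tests("examples/") and fail the job when it returns False.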
Pomsky previously only supported pcre2 for running tests. In this release, we added rust as another option. We want to add more regex engines for unit tests, but this is proving tricky, so it didn’t make it into this release.
Optimizations
Pomsky lets you refactor parts of an expression into variables to improve readability and follow the DRY principle. For example:
    let digit = ['0'-'9'];
    let hex_digit = digit | ['a'-'f'];
But this comes with a trade-off: Sometimes the produced regex is less efficient and slower in some regex engines. Optimizations help with this, so you can write readable, DRY code without worrying too much about performance.
In this release, Pomsky gained some important optimizations:
- Character ranges are merged if they are adjacent or overlap. For instance, 'a' | ['b'-'d' 'c'-'f'] becomes just [a-f].
- Alternations are merged if possible: "case" | "char" | "const" | "continue" compiles to c(?:ase|har|on(?:st|tinue)).
Merging common prefixes can help performance in backtracking regex engines. Note that optimizations are applied after resolving variables, which is important for the hex_digit example above: digit | ['a'-'f'] can only be merged once digit has been replaced by its content.
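The two optimizations described above can be sketched in a few lines of Python. This is an illustration of the general techniques only, not Pomsky's actual implementation (Pomsky is not written in Python). The function names are made up, and merge_prefixes assumes the alternatives are non-empty literal strings.

```python
import os
import re


def merge_ranges(ranges):
    """Merge inclusive (first, last) character ranges that overlap or touch."""
    merged = []
    for start, end in sorted((ord(a), ord(b)) for a, b in ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or directly adjacent: extend the previous range.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(chr(a), chr(b)) for a, b in merged]


def merge_prefixes(words):
    """Build regex source for an alternation of literals, sharing prefixes."""
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    groups = {}
    for word in words:
        # Group alternatives by their first character, in first-seen order.
        groups.setdefault(word[:1], []).append(word)
    branches = []
    for group in groups.values():
        if len(group) == 1:
            branches.append(re.escape(group[0]))
            continue
        # Factor out the longest prefix shared by the whole group, then
        # apply the same procedure to the remaining suffixes.
        prefix = os.path.commonprefix(group)
        rest = merge_prefixes([word[len(prefix):] for word in group])
        branches.append(re.escape(prefix) + "(?:" + rest + ")")
    return "|".join(branches)


print(merge_ranges([("a", "a"), ("b", "d"), ("c", "f")]))
# [('a', 'f')]
print(merge_prefixes(["case", "char", "const", "continue"]))
# c(?:ase|har|on(?:st|tinue))
```

Both calls reproduce the results quoted in the list: the three ranges collapse into a single a-f range, and the four keywords share their c and con prefixes.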
Editor improvements
A key feature of any computer language is its IDE integration. For this release, we’ve added a few new features to the VSCode extension, including:
- Go to definition
- Find usages
- Rename variable
There’s just one caveat: for these features to work, the file has to be syntactically valid. This is because these actions work on the abstract syntax tree (AST), and we currently can’t get the AST if parsing fails.
The next step is therefore to make the parser recoverable, so that it can produce an AST even in the presence of syntax errors. This will also help to improve autocompletion, which currently doesn’t use the AST.
We’ve also added inlay hints to show the index of unnamed capturing groups, meaning you no longer need to count them when writing a replacement pattern.
New installers
For this release, we adopted the wonderful cargo-dist for distributing the pomsky binary, so we could provide more installation options. In addition to the Windows, Linux and macOS binaries, we now have:
- Shell and PowerShell scripts to download and install Pomsky
- An npm package, so you can install Pomsky with npm install -g @pomsky-lang/cli or run it directly with npx @pomsky-lang/cli
- An msi for installing and uninstalling Pomsky on Windows
As before, you can also get Pomsky from the AUR with yay -S pomsky-bin, from crates.io with cargo install pomsky-bin, or from Homebrew.
Breaking change: Fixing hygiene
Pomsky variables work much like macros in Lisp and Rust: When Pomsky encounters a variable, it is substituted with its content. Variables are hygienic, which means they are properly scoped when substituted.
However, we discovered that modifiers are not hygienic. For example:
    let variable = .*;
    (enable lazy; variable)
Is the repetition in line 1 lazy or not? In Pomsky 0.11, it was lazy, which is counterintuitive because the enable lazy; statement is not in scope where the repetition appears, only where the variable is used.
Unfortunately, fixing this required a breaking change. We think that the impact will be minimal, but to be sure, please check if you are relying on this behavior anywhere.
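The difference between the two behaviors can be modeled with a toy Python sketch. It assumes that the fix makes a repetition use the mode in scope where it was written, as the description above implies. The Repetition class and emit function are invented for illustration and have nothing to do with Pomsky's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repetition:
    """A `*` repetition plus the laziness in scope where it was written."""
    body: str
    lazy_where_written: bool


def emit(rep, lazy_where_used, hygienic):
    # Unhygienic: the mode at the use site leaks into the variable's content.
    # Hygienic: only the mode at the definition site counts.
    lazy = rep.lazy_where_written if hygienic else lazy_where_used
    return rep.body + ("*?" if lazy else "*")


# let variable = .*;        (no `enable lazy;` in scope here)
variable = Repetition(".", lazy_where_written=False)

# (enable lazy; variable)   (lazy mode is on where the variable is used)
print(emit(variable, lazy_where_used=True, hygienic=False))  # .*?
print(emit(variable, lazy_where_used=True, hygienic=True))   # .*
```

If an expression relied on the old behavior, moving enable lazy; to where the repetition itself is written should restore the lazy result under this model.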
Other changes
See the full list of changes in the changelog.
Support us!
We have a lot of exciting plans to make Pomsky a success. To realize them, we need your help! But when I say ‘we’, that’s mostly just me, @Aloso, working on Pomsky in my spare time. I’m looking for contributors to implement new features, tooling, and integrations. If you’d like to help (or have questions, or just want to chat), drop by our Discord channel. Also, if you’re using Pomsky, I’d like to hear about it!
Consider sponsoring me to help make my open-source work financially sustainable. Thank you ❤️