Pomsky 0.8 released
Posted on December 22, 2022 by Ludwig Stecher ‐ 6 min read
The new logo: A pomeranian husky wearing orange glasses, created with the help of DALL·E 2
What is pomsky?
Pomsky is a portable, modern syntax for regular expressions. It has powerful features, such as variables, and a much more readable syntax. Check out the language tour to quickly get familiar with pomsky, or the examples to see some real pomsky expressions.
Hi! I’m Ludwig, and I created Pomsky in my spare time. I am pleased to announce that you can now sponsor me for my work! It’s been a lot of fun to develop Pomsky, but as Pomsky is being used by more and more people, I will need more time for maintenance and improving the developer experience. Donations will help me to invest the time needed. If you can spare a few bucks or convince your employer to sponsor me, that would go a long way to make Pomsky sustainable.
All sponsors will receive monthly updates on what I am working on, as I want to be transparent about how your money is used. Find out more on my GitHub sponsors page.
Summary of the changes
This release focuses on stability, but also brings a few new features:
Inline regexes landed, allowing you to use regex features not yet supported by Pomsky
Pomsky now supports the dot, matching any code point
An optimization pass that tries to make the generated regex smaller
Many bugs were fixed
Inline regexes are an escape hatch to include an arbitrary regex, at the cost of fewer compile-time guarantees. They are added to the output without any escaping; you could compare this to inline assembly in higher-level languages like C or Rust. Inline regexes allow you to produce hand-optimized code, or use regex features that aren’t properly supported yet. For example:
let recurse = regex '(?R)'; # inline regex for using recursion let number = '-'? [digit]+; let operator = '+' | '-' | '*' | '/'; let expr = number | '(' recurse ')'; expr (operator expr)* # matches an arbitrarily nested term
Note that recursion using the
(?R) syntax only works in the PCRE flavor.
Inline regexes are very powerful, but use them sparingly: Pomsky doesn’t parse their content, so it cannot warn you from mistakes. Pomsky usually guarantees that the regexes it generates are valid, but this guarantee may not hold if you’re using inline regexes.
You’re probably wondering: The dot is one of the most important regex features, so why was it missing until now? I have been hesitant to add it because it’s so easy to forget that the dot does not match line breaks by default. Also, the dot is not Unicode-aware in all regex variants, and all regex engines treat line breaks slightly differently. That’s why I initially didn’t include the dot, and recommended to use
Codepoint (abbreviated as
Codepoint won’t work in ERE (POSIX-extended regular expressions), which we want to support eventually. Also, I decided that best practices such as avoiding the dot are best enforced by a linter rather than the compiler, so specific lints can be disabled when necessary. Pomsky doesn’t have a linter yet, but it’s still on my todo list.
Pomsky now does an optimization pass right before code generation. Before, optimizations were intermingled with parsing and code generation, which led to some problems: Parsing wasn’t lossless (extra parentheses would get lost), and the code was hard to maintain and iterate on. The new optimization pass solves these problems while significantly improving the generated code in some cases:
- Subsequent repetitions are now merged whenever possible. For example,
- Duplicated items in a character set are removed. For example,
- Empty groups like
(() '')are removed, even when followed by a repetition.
More optimizations are planned: For example, we want to make it possible to combine alternatives with a common prefix, like
'make' | 'maple' | 'maker'. Ideally, this would produce the output
ma(?:ker??|ple), which is much faster in more mainstream1 regex engines that use backtracking instead of a DFA.
Testing improvements and bugfixes
For this release we improved our testing and fuzzing infrastructure. We have a test harness with several hundred hand-written test cases. For each test case, we check if compiling it produces the expected output, or fails with the expected error if the input is malformed. We also have a fuzzer, which generates test cases and systematically mutates them to discover new code paths. While the test harness verifies that the output is correct, the fuzzer can only verify that the program doesn’t crash, since we can’t manually check millions of generated test cases.
So to increase our confidence in Pomsky’s correctness, we started to compile regexes produced by Pomksy with the respective regex engine. For now, we only do this in the Rust and PCRE flavors, but we plan to compile more regex flavors in the future. This still doesn’t guarantee that the output is correct, but at least it checks if the output is syntactically valid. With this change, the fuzzer was able to find a few bugs during code generation.
We also started measuring test coverage and uploading it to coveralls.io. Looking at the reports helped us identify parts that needed more tests, and improving our test coverage. Unfortunately, the data isn’t as reliable as we’d hoped: Coveralls highlights many struct definitions and enum variants as “untested”, which is inaccurate. Measuring test coverage is still a net win, just don’t read too much into the numbers 😉
The binaries published on GitHub and the AUR are now built with
cargo audit, so we can be sure there are no known vulnerabilites when we publish a new version. Additionally, the list of dependencies is embedded in the binaries, so you can check with
cargo audit bin <PATH_TO_BINARY> if any vulnerabilities were found since then. cargo-audit is a tool that scans libraries for advisories reported in the rustsec advisory DB.
You can now disable all warnings with
-W0, compatibility warnings with
-Wcompat=0, and deprecation warnings with
-Wdeprecated=0. Needless to say that doing this is discouraged, except in edge cases. Warnings exist for a reason!
You can specify which features should be allowed with
--allowed-features. For example, with
--allowed-features boundaries,dot,lazy-mode,named-groups,ranges,references,variables, only the specified language features are available. Without this argument, all features can be used.
The help messages have been extended.
pomsky -h now gives a short summary, while
pomsky --help provides more detailed information.
You might remember that I replaced
clap with a more lightweight argument parsing library. The new library we’re using doesn’t support help generation, so I wrote a macro to generate help messages that can contain formatting and color. It is general enough that I might publish it as a separate crate, stay tuned!
Happy Hanukkah! Merry Christmas! Happy Holidays!
An earlier version of this post said that “traditional” regex engines use backtracking. That’s not quite correct; backtracking only became popular with Perl. All regex flavors supported by Pomsky use backtracking, except for Rust, which builds a lazy DFA. ↩︎