Unicode properties

Pomsky supports the following kinds of Unicode properties:

  • General categories
  • Scripts
  • Blocks
  • Other boolean properties

However, not all regex engines support all of them. In particular, blocks and other properties are poorly supported.

Categories

Every Unicode code point is in one of the following General Categories:

  • Letter
  • Mark
  • Number
  • Punctuation
  • Symbol
  • Separator
  • Other

Each of these categories is subdivided into smaller categories. More information on Wikipedia.

In Pomsky, you can match against categories in square brackets:

[Uppercase_Letter Mark]

Using the abbreviations, the same character class can be written as:

[Lu M]
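
To make the matching rule concrete, here is a minimal Python sketch of what a class such as [Lu M] accepts. It is not how Pomsky or any regex engine is implemented; it only looks up each character's General Category with the standard unicodedata module. The helper names in_categories and matches_lu_or_mark are made up for this illustration, and the grouped category LC is not handled.

```python
import unicodedata


def in_categories(ch: str, *wanted: str) -> bool:
    """Return True if the General Category of ch matches any entry in wanted.

    A one-letter entry such as "M" matches every subcategory starting with
    that letter (Mn, Mc, Me); a two-letter entry must match exactly.
    """
    cat = unicodedata.category(ch)  # two-letter abbreviation, e.g. "Lu"
    return any(cat == w or (len(w) == 1 and cat.startswith(w)) for w in wanted)


def matches_lu_or_mark(ch: str) -> bool:
    """Hypothetical stand-in for the Pomsky class [Lu M]."""
    return in_categories(ch, "Lu", "M")


if __name__ == "__main__":
    for ch in ["A", "a", "\u0301", "1"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), matches_lu_or_mark(ch))
```

The one-letter entries work because every subcategory abbreviation starts with the letter of its parent category.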

All 38 categories:

Abbr  Long                   Description
Lu    Uppercase_Letter       an uppercase letter
Ll    Lowercase_Letter       a lowercase letter
Lt    Titlecase_Letter       a digraphic character, with first part uppercase
LC    Cased_Letter           Lu | Ll | Lt
Lm    Modifier_Letter        a modifier letter
Lo    Other_Letter           other letters, including syllables and ideographs
L     Letter                 Lu | Ll | Lt | Lm | Lo
Mn    Nonspacing_Mark        a nonspacing combining mark (zero advance width)
Mc    Spacing_Mark           a spacing combining mark (positive advance width)
Me    Enclosing_Mark         an enclosing combining mark
M     Mark                   Mn | Mc | Me
Nd    Decimal_Number         a decimal digit
Nl    Letter_Number          a letterlike numeric character
No    Other_Number           a numeric character of other type
N     Number                 Nd | Nl | No
Pc    Connector_Punctuation  a connecting punctuation mark, like a tie
Pd    Dash_Punctuation       a dash or hyphen punctuation mark
Ps    Open_Punctuation       an opening punctuation mark (of a pair)
Pe    Close_Punctuation      a closing punctuation mark (of a pair)
Pi    Initial_Punctuation    an initial quotation mark
Pf    Final_Punctuation      a final quotation mark
Po    Other_Punctuation      a punctuation mark of other type
P     Punctuation            Pc | Pd | Ps | Pe | Pi | Pf | Po
Sm    Math_Symbol            a symbol of mathematical use
Sc    Currency_Symbol        a currency sign
Sk    Modifier_Symbol        a non-letterlike modifier symbol
So    Other_Symbol           a symbol of other type
S     Symbol                 Sm | Sc | Sk | So
Zs    Space_Separator        a space character (of various non-zero widths)
Zl    Line_Separator         U+2028 LINE SEPARATOR only
Zp    Paragraph_Separator    U+2029 PARAGRAPH SEPARATOR only
Z     Separator              Zs | Zl | Zp
Cc    Control                a C0 or C1 control code
Cf    Format                 a format control character
Cs    Surrogate              a surrogate code point (⚠️ not supported in Rust)
Co    Private_Use            a private-use character
Cn    Unassigned             a reserved unassigned code point or a noncharacter
C     Other                  Cc | Cf | Cs | Co | Cn

Support

Engines compared: PCRE, JavaScript, Java, Ruby, Rust, .NET, Python

Rust does not support the Surrogate category because it is always Unicode-aware, and UTF-16 surrogates are not valid Unicode scalar values.
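
The difference between a code point and a scalar value can be illustrated with a short Python sketch (Python, unlike Rust, allows lone surrogates in strings). The helper is_scalar_value is a hypothetical name used only here; the range U+D800 to U+DFFF is the surrogate range defined by Unicode.

```python
import unicodedata


def is_scalar_value(ch: str) -> bool:
    """A Unicode scalar value is any code point outside U+D800..U+DFFF."""
    return not (0xD800 <= ord(ch) <= 0xDFFF)


if __name__ == "__main__":
    lone_surrogate = "\ud800"  # first code point of the surrogate range
    # The code point has General Category Cs (Surrogate) ...
    print(unicodedata.category(lone_surrogate))
    # ... but it is not a scalar value, so a Rust string could never hold it.
    print(is_scalar_value(lone_surrogate))
```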

Scripts

A script is a collection of code points used to represent textual information in one or more writing systems.

As with categories, each code point is assigned to exactly one script. Code points used in multiple scripts are therefore assigned to the special script Common. More information on Wikipedia.

All 164 scripts:

Abbr  Long / Notes
Adlm  Adlam
Aghb  Caucasian_Albanian
Ahom  Ahom
Arab  Arabic
Armi  Imperial_Aramaic
Armn  Armenian
Avst  Avestan
Bali  Balinese
Bamu  Bamum
Bass  Bassa_Vah
Batk  Batak
Beng  Bengali
Bhks  Bhaiksuki
Bopo  Bopomofo
Brah  Brahmi
Brai  Braille
Bugi  Buginese
Buhd  Buhid
Cakm  Chakma
Cans  Canadian_Aboriginal
Cari  Carian
Cham  Cham
Cher  Cherokee
Chrs  Chorasmian
Copt  Coptic, Qaac
Cpmn  Cypro_Minoan
Cprt  Cypriot
Cyrl  Cyrillic
Deva  Devanagari
Diak  Dives_Akuru
Dogr  Dogra
Dsrt  Deseret
Dupl  Duployan
Egyp  Egyptian_Hieroglyphs
Elba  Elbasan
Elym  Elymaic
Ethi  Ethiopic
Geor  Georgian
Glag  Glagolitic
Gong  Gunjala_Gondi
Gonm  Masaram_Gondi
Goth  Gothic
Gran  Grantha
Grek  Greek
Gujr  Gujarati
Guru  Gurmukhi
Hang  Hangul
Hani  Han
Hano  Hanunoo
Hatr  Hatran
Hebr  Hebrew
Hira  Hiragana
Hluw  Anatolian_Hieroglyphs
Hmng  Pahawh_Hmong
Hmnp  Nyiakeng_Puachue_Hmong
Hung  Old_Hungarian
Ital  Old_Italic
Java  Javanese
Kali  Kayah_Li
Kana  Katakana
Kawi  Kawi (⚠️ not supported by PCRE, Java and Ruby)
Khar  Kharoshthi
Khmr  Khmer
Khoj  Khojki
Kits  Khitan_Small_Script
Knda  Kannada
Kthi  Kaithi
Lana  Tai_Tham
Laoo  Lao
Latn  Latin
Lepc  Lepcha
Limb  Limbu
Lina  Linear_A
Linb  Linear_B
Lisu  Lisu
Lyci  Lycian
Lydi  Lydian
Mahj  Mahajani
Maka  Makasar
Mand  Mandaic
Mani  Manichaean
Marc  Marchen
Medf  Medefaidrin
Mend  Mende_Kikakui
Merc  Meroitic_Cursive
Mero  Meroitic_Hieroglyphs
Mlym  Malayalam
Modi  Modi
Mong  Mongolian
Mroo  Mro
Mtei  Meetei_Mayek
Mult  Multani
Mymr  Myanmar
Nagm  Nag_Mundari (⚠️ not supported by PCRE, Java and Ruby)
Nand  Nandinagari
Narb  Old_North_Arabian
Nbat  Nabataean
Newa  Newa
Nkoo  Nko
Nshu  Nushu
Ogam  Ogham
Olck  Ol_Chiki
Orkh  Old_Turkic
Orya  Oriya
Osge  Osage
Osma  Osmanya
Ougr  Old_Uyghur
Palm  Palmyrene
Pauc  Pau_Cin_Hau
Perm  Old_Permic
Phag  Phags_Pa
Phli  Inscriptional_Pahlavi
Phlp  Psalter_Pahlavi
Phnx  Phoenician
Plrd  Miao
Prti  Inscriptional_Parthian
Rjng  Rejang
Rohg  Hanifi_Rohingya
Runr  Runic
Samr  Samaritan
Sarb  Old_South_Arabian
Saur  Saurashtra
Sgnw  SignWriting
Shaw  Shavian
Shrd  Sharada
Sidd  Siddham
Sind  Khudawadi
Sinh  Sinhala
Sogd  Sogdian
Sogo  Old_Sogdian
Sora  Sora_Sompeng
Soyo  Soyombo
Sund  Sundanese
Sylo  Syloti_Nagri
Syrc  Syriac
Tagb  Tagbanwa
Takr  Takri
Tale  Tai_Le
Talu  New_Tai_Lue
Taml  Tamil
Tang  Tangut
Tavt  Tai_Viet
Telu  Telugu
Tfng  Tifinagh
Tglg  Tagalog
Thaa  Thaana
Thai  Thai
Tibt  Tibetan
Tirh  Tirhuta
Tnsa  Tangsa
Toto  Toto
Ugar  Ugaritic
Vaii  Vai
Vith  Vithkuqi
Wara  Warang_Citi
Wcho  Wancho
Xpeo  Old_Persian
Xsux  Cuneiform
Yezi  Yezidi
Yiii  Yi
Zanb  Zanabazar_Square
Zinh  Inherited
Zyyy  Common
Zzzz  Unknown (⚠️ not supported by Rust)

Support

Engines compared: PCRE, JavaScript, Java, Ruby, Rust, .NET, Python

Kawi and Nag_Mundari, added in Unicode 15.0, are not yet supported in PCRE, Java and Ruby.

Zzzz (Unknown) is not supported in Rust.

JavaScript supports all scripts as of Unicode 15.0.

Blocks

The Unicode character set is divided into blocks of consecutive code points that usually belong to the same script or serve a similar purpose.

There are often multiple blocks for a script. For example, there are 10 designated blocks for Latin code points: Basic_Latin, Latin_1_Supplement, Latin_Extended_Additional, and Latin_Extended_A through Latin_Extended_G. Furthermore, many blocks contain two or more scripts, which is not always clear from the name. For example, Latin_Extended_E includes a Greek code point.

It is almost always better to use the script rather than the block, but Pomsky still supports blocks using the In prefix:

# matches code points in the `Basic_Latin` block
[InBasic_Latin]
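
Because a block is a contiguous range of code points, block membership amounts to a range check. The Python sketch below hard-codes two ranges as an assumption taken from the Unicode block list (Basic_Latin is U+0000 to U+007F, Latin_Extended_E is U+AB30 to U+AB6F); BLOCKS and in_block are hypothetical names, not part of Pomsky. It also looks up U+AB65, a Greek code point inside Latin_Extended_E, which illustrates why a block name can be misleading.

```python
import unicodedata

# Only two blocks are listed here; a real table would cover all of them.
BLOCKS = {
    "Basic_Latin": (0x0000, 0x007F),
    "Latin_Extended_E": (0xAB30, 0xAB6F),
}


def in_block(ch: str, block: str) -> bool:
    """Return True if the code point of ch lies inside the named block."""
    first, last = BLOCKS[block]
    return first <= ord(ch) <= last


if __name__ == "__main__":
    print(in_block("a", "Basic_Latin"))   # plain ASCII letter
    print(in_block("é", "Basic_Latin"))   # U+00E9 is in Latin_1_Supplement
    omega = "\uab65"
    print(in_block(omega, "Latin_Extended_E"), unicodedata.name(omega))
```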

All 328 blocks (alternative names for the same block are listed on one line):

Adlam
Aegean_Numbers
Ahom
Alchemical, Alchemical_Symbols
Alphabetic_PF, Alphabetic_Presentation_Forms
Anatolian_Hieroglyphs
Ancient_Greek_Music, Ancient_Greek_Musical_Notation
Ancient_Greek_Numbers
Ancient_Symbols
Arabic
Arabic_Ext_A, Arabic_Extended_A
Arabic_Ext_B, Arabic_Extended_B
Arabic_Ext_C, Arabic_Extended_C
Arabic_Math, Arabic_Mathematical_Alphabetic_Symbols
Arabic_PF_A, Arabic_Presentation_Forms_A
Arabic_PF_B, Arabic_Presentation_Forms_B
Arabic_Sup, Arabic_Supplement
Armenian
Arrows
ASCII, Basic_Latin
Avestan
Balinese
Bamum
Bamum_Sup, Bamum_Supplement
Bassa_Vah
Batak
Bengali
Bhaiksuki
Block_Elements
Bopomofo
Bopomofo_Ext, Bopomofo_Extended
Box_Drawing
Brahmi
Braille, Braille_Patterns
Buginese
Buhid
Byzantine_Music, Byzantine_Musical_Symbols
Carian
Caucasian_Albanian
Chakma
Cham
Cherokee
Cherokee_Sup, Cherokee_Supplement
Chess_Symbols
Chorasmian
CJK, CJK_Unified_Ideographs
CJK_Compat, CJK_Compatibility
CJK_Compat_Forms, CJK_Compatibility_Forms
CJK_Compat_Ideographs, CJK_Compatibility_Ideographs
CJK_Compat_Ideographs_Sup, CJK_Compatibility_Ideographs_Supplement
CJK_Ext_A, CJK_Unified_Ideographs_Extension_A
CJK_Ext_B, CJK_Unified_Ideographs_Extension_B
CJK_Ext_C, CJK_Unified_Ideographs_Extension_C
CJK_Ext_D, CJK_Unified_Ideographs_Extension_D
CJK_Ext_E, CJK_Unified_Ideographs_Extension_E
CJK_Ext_F, CJK_Unified_Ideographs_Extension_F
CJK_Ext_G, CJK_Unified_Ideographs_Extension_G
CJK_Ext_H, CJK_Unified_Ideographs_Extension_H
CJK_Radicals_Sup, CJK_Radicals_Supplement
CJK_Strokes
CJK_Symbols, CJK_Symbols_And_Punctuation
Compat_Jamo, Hangul_Compatibility_Jamo
Control_Pictures
Coptic
Coptic_Epact_Numbers
Counting_Rod, Counting_Rod_Numerals
Cuneiform
Cuneiform_Numbers, Cuneiform_Numbers_And_Punctuation
Currency_Symbols
Cypriot_Syllabary
Cypro_Minoan
Cyrillic
Cyrillic_Ext_A, Cyrillic_Extended_A
Cyrillic_Ext_B, Cyrillic_Extended_B
Cyrillic_Ext_C, Cyrillic_Extended_C
Cyrillic_Ext_D, Cyrillic_Extended_D
Cyrillic_Sup, Cyrillic_Supplement, Cyrillic_Supplementary
Deseret
Devanagari
Devanagari_Ext, Devanagari_Extended
Devanagari_Ext_A, Devanagari_Extended_A
Diacriticals, Combining_Diacritical_Marks
Diacriticals_Ext, Combining_Diacritical_Marks_Extended
Diacriticals_For_Symbols, Combining_Diacritical_Marks_For_Symbols, Combining_Marks_For_Symbols
Diacriticals_Sup, Combining_Diacritical_Marks_Supplement
Dingbats
Dives_Akuru
Dogra
Domino, Domino_Tiles
Duployan
Early_Dynastic_Cuneiform
Egyptian_Hieroglyph_Format_Controls
Egyptian_Hieroglyphs
Elbasan
Elymaic
Emoticons
Enclosed_Alphanum, Enclosed_Alphanumerics
Enclosed_Alphanum_Sup, Enclosed_Alphanumeric_Supplement
Enclosed_CJK, Enclosed_CJK_Letters_And_Months
Enclosed_Ideographic_Sup, Enclosed_Ideographic_Supplement
Ethiopic
Ethiopic_Ext, Ethiopic_Extended
Ethiopic_Ext_A, Ethiopic_Extended_A
Ethiopic_Ext_B, Ethiopic_Extended_B
Ethiopic_Sup, Ethiopic_Supplement
Geometric_Shapes
Geometric_Shapes_Ext, Geometric_Shapes_Extended
Georgian
Georgian_Ext, Georgian_Extended
Georgian_Sup, Georgian_Supplement
Glagolitic
Glagolitic_Sup, Glagolitic_Supplement
Gothic
Grantha
Greek, Greek_And_Coptic
Greek_Ext, Greek_Extended
Gujarati
Gunjala_Gondi
Gurmukhi
Half_And_Full_Forms, Halfwidth_And_Fullwidth_Forms
Half_Marks, Combining_Half_Marks
Hangul, Hangul_Syllables
Hanifi_Rohingya
Hanunoo
Hatran
Hebrew
High_PU_Surrogates, High_Private_Use_Surrogates
High_Surrogates
Hiragana
IDC, Ideographic_Description_Characters
Ideographic_Symbols, Ideographic_Symbols_And_Punctuation
Imperial_Aramaic
Indic_Number_Forms, Common_Indic_Number_Forms
Indic_Siyaq_Numbers
Inscriptional_Pahlavi
Inscriptional_Parthian
IPA_Ext, IPA_Extensions
Jamo, Hangul_Jamo
Jamo_Ext_A, Hangul_Jamo_Extended_A
Jamo_Ext_B, Hangul_Jamo_Extended_B
Javanese
Kaithi
Kaktovik_Numerals
Kana_Ext_A, Kana_Extended_A
Kana_Ext_B, Kana_Extended_B
Kana_Sup, Kana_Supplement
Kanbun
Kangxi, Kangxi_Radicals
Kannada
Katakana
Katakana_Ext, Katakana_Phonetic_Extensions
Kawi
Kayah_Li
Kharoshthi
Khitan_Small_Script
Khmer
Khmer_Symbols
Khojki
Khudawadi
Lao
Latin_1_Sup, Latin_1_Supplement, Latin_1
Latin_Ext_A, Latin_Extended_A
Latin_Ext_Additional, Latin_Extended_Additional
Latin_Ext_B, Latin_Extended_B
Latin_Ext_C, Latin_Extended_C
Latin_Ext_D, Latin_Extended_D
Latin_Ext_E, Latin_Extended_E
Latin_Ext_F, Latin_Extended_F
Latin_Ext_G, Latin_Extended_G
Lepcha
Letterlike_Symbols
Limbu
Linear_A
Linear_B_Ideograms
Linear_B_Syllabary
Lisu
Lisu_Sup, Lisu_Supplement
Low_Surrogates
Lycian
Lydian
Mahajani
Mahjong, Mahjong_Tiles
Makasar
Malayalam
Mandaic
Manichaean
Marchen
Masaram_Gondi
Math_Alphanum, Mathematical_Alphanumeric_Symbols
Math_Operators, Mathematical_Operators
Mayan_Numerals
Medefaidrin
Meetei_Mayek
Meetei_Mayek_Ext, Meetei_Mayek_Extensions
Mende_Kikakui
Meroitic_Cursive
Meroitic_Hieroglyphs
Miao
Misc_Arrows, Miscellaneous_Symbols_And_Arrows
Misc_Math_Symbols_A, Miscellaneous_Mathematical_Symbols_A
Misc_Math_Symbols_B, Miscellaneous_Mathematical_Symbols_B
Misc_Pictographs, Miscellaneous_Symbols_And_Pictographs
Misc_Symbols, Miscellaneous_Symbols
Misc_Technical, Miscellaneous_Technical
Modi
Modifier_Letters, Spacing_Modifier_Letters
Modifier_Tone_Letters
Mongolian
Mongolian_Sup, Mongolian_Supplement
Mro
Multani
Music, Musical_Symbols
Myanmar
Myanmar_Ext_A, Myanmar_Extended_A
Myanmar_Ext_B, Myanmar_Extended_B
Nabataean
Nag_Mundari
Nandinagari
NB, No_Block
New_Tai_Lue
Newa
NKo
Number_Forms
Nushu
Nyiakeng_Puachue_Hmong
OCR, Optical_Character_Recognition
Ogham
Ol_Chiki
Old_Hungarian
Old_Italic
Old_North_Arabian
Old_Permic
Old_Persian
Old_Sogdian
Old_South_Arabian
Old_Turkic
Old_Uyghur
Oriya
Ornamental_Dingbats
Osage
Osmanya
Ottoman_Siyaq_Numbers
Pahawh_Hmong
Palmyrene
Pau_Cin_Hau
Phags_Pa
Phaistos, Phaistos_Disc
Phoenician
Phonetic_Ext, Phonetic_Extensions
Phonetic_Ext_Sup, Phonetic_Extensions_Supplement
Playing_Cards
Psalter_Pahlavi
PUA, Private_Use_Area, Private_Use
Punctuation, General_Punctuation
Rejang
Rumi, Rumi_Numeral_Symbols
Runic
Samaritan
Saurashtra
Sharada
Shavian
Shorthand_Format_Controls
Siddham
Sinhala
Sinhala_Archaic_Numbers
Small_Forms, Small_Form_Variants
Small_Kana_Ext, Small_Kana_Extension
Sogdian
Sora_Sompeng
Soyombo
Specials
Sundanese
Sundanese_Sup, Sundanese_Supplement
Sup_Arrows_A, Supplemental_Arrows_A
Sup_Arrows_B, Supplemental_Arrows_B
Sup_Arrows_C, Supplemental_Arrows_C
Sup_Math_Operators, Supplemental_Mathematical_Operators
Sup_PUA_A, Supplementary_Private_Use_Area_A
Sup_PUA_B, Supplementary_Private_Use_Area_B
Sup_Punctuation, Supplemental_Punctuation
Sup_Symbols_And_Pictographs, Supplemental_Symbols_And_Pictographs
Super_And_Sub, Superscripts_And_Subscripts
Sutton_SignWriting
Syloti_Nagri
Symbols_And_Pictographs_Ext_A, Symbols_And_Pictographs_Extended_A
Symbols_For_Legacy_Computing
Syriac
Syriac_Sup, Syriac_Supplement
Tagalog
Tagbanwa
Tags
Tai_Le
Tai_Tham
Tai_Viet
Tai_Xuan_Jing, Tai_Xuan_Jing_Symbols
Takri
Tamil
Tamil_Sup, Tamil_Supplement
Tangsa
Tangut
Tangut_Components
Tangut_Sup, Tangut_Supplement
Telugu
Thaana
Thai
Tibetan
Tifinagh
Tirhuta
Toto
Transport_And_Map, Transport_And_Map_Symbols
UCAS, Unified_Canadian_Aboriginal_Syllabics, Canadian_Syllabics
UCAS_Ext, Unified_Canadian_Aboriginal_Syllabics_Extended
UCAS_Ext_A, Unified_Canadian_Aboriginal_Syllabics_Extended_A
Ugaritic
Vai
Vedic_Ext, Vedic_Extensions
Vertical_Forms
Vithkuqi
VS, Variation_Selectors
VS_Sup, Variation_Selectors_Supplement
Wancho
Warang_Citi
Yezidi
Yi_Radicals
Yi_Syllables
Yijing, Yijing_Hexagram_Symbols
Zanabazar_Square
Znamenny_Music, Znamenny_Musical_Notation

Support

Engines compared: PCRE, JavaScript, Java, Ruby, Rust, .NET, Python

Java doesn’t support the following blocks:

  • Arabic_Extended_C
  • CJK_Unified_Ideographs_Extension_H
  • Combining_Diacritical_Marks_For_Symbols
  • Cyrillic_Extended_D
  • Cyrillic_Supplementary
  • Devanagari_Extended_A
  • Greek_And_Coptic
  • Kaktovik_Numerals
  • No_Block

Ruby doesn’t support the following blocks:

  • Arabic_Extended_C
  • CJK_Unified_Ideographs_Extension_H
  • Cyrillic_Extended_D
  • Devanagari_Extended_A
  • Kaktovik_Numerals

PCRE and Rust both support all blocks as of Unicode 15.0.

Other properties

There are a number of boolean properties, each of which is either Yes or No for a given code point. To use one in Pomsky, put its name in square brackets:

# match code points with Diacritic=Yes
[Diacritic]

All 53 other properties:

Abbr      Long
ASCII     ASCII
AHex      ASCII_Hex_Digit
Alpha     Alphabetic
Any       Any
Assigned  Assigned
Bidi_C    Bidi_Control
Bidi_M    Bidi_Mirrored
CI        Case_Ignorable
Cased     Cased
CWCF      Changes_When_Casefolded
CWCM      Changes_When_Casemapped
CWL       Changes_When_Lowercased
CWKCF     Changes_When_NFKC_Casefolded
CWT       Changes_When_Titlecased
CWU       Changes_When_Uppercased
Dash      Dash
DI        Default_Ignorable_Code_Point
Dep       Deprecated
Dia       Diacritic
Emoji     Emoji
EComp     Emoji_Component
EMod      Emoji_Modifier
EBase     Emoji_Modifier_Base
EPres     Emoji_Presentation
ExtPict   Extended_Pictographic
Ext       Extender
Gr_Base   Grapheme_Base
Gr_Ext    Grapheme_Extend
Hex       Hex_Digit
IDSB      IDS_Binary_Operator
IDST      IDS_Trinary_Operator
IDC       ID_Continue
IDS       ID_Start
Ideo      Ideographic
Join_C    Join_Control
LOE       Logical_Order_Exception
Lower     Lowercase
Math      Math
NChar     Noncharacter_Code_Point
Pat_Syn   Pattern_Syntax
Pat_WS    Pattern_White_Space
QMark     Quotation_Mark
Radical   Radical
RI        Regional_Indicator
STerm     Sentence_Terminal
SD        Soft_Dotted
Term      Terminal_Punctuation
UIdeo     Unified_Ideograph
Upper     Uppercase
VS        Variation_Selector
space     White_Space
XIDC      XID_Continue
XIDS      XID_Start

Support

Engines compared: PCRE, JavaScript, Java, Ruby, Rust, .NET, Python

PCRE doesn’t support the following properties:

  • Assigned
  • Changes_When_NFKC_Casefolded

Ruby doesn’t support the following properties:

  • Bidi_Mirrored
  • Changes_When_NFKC_Casefolded

Rust doesn’t support the following property:

  • Changes_When_NFKC_Casefolded

JavaScript supports all boolean properties as of Unicode 15.0.