Matching CJK characters (Chinese, Japanese, and Korean) with regular expressions
2021-07-01
mudssky
Chinese character ranges
See this chapter of the Unicode Standard on the official Unicode site:
https://www.unicode.org/versions/Unicode13.0.0/ch18.pdf
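The range of Chinese characters is broad, and many Han characters are shared with Japanese and Korean (The Unicode Standard contains a set of unified Han ideographic characters used in the written Chinese, Japanese, and Korean languages).
CJK stands for Chinese, Japanese, and Korean.
The table below is reproduced from the Han section of that document.
4E00–9FFF is the original CJK Unified Ideographs block and covers most characters in common use, so matching this range is usually enough for Han characters.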
Block | Range | Description |
---|---|---|
CJK Unified Ideographs | 4E00–9FFF | Common |
CJK Unified Ideographs Extension A | 3400–4DBF | Rare |
CJK Unified Ideographs Extension B | 20000–2A6DF | Rare, historic |
CJK Unified Ideographs Extension C | 2A700–2B73F | Rare, historic |
CJK Unified Ideographs Extension D | 2B740–2B81F | Uncommon, some in current use |
CJK Unified Ideographs Extension E | 2B820–2CEAF | Rare, historic |
CJK Unified Ideographs Extension F | 2CEB0–2EBEF | Rare, historic |
CJK Unified Ideographs Extension G | 30000–3134F | Rare, historic |
CJK Compatibility Ideographs | F900–FAFF | Duplicates, unifiable variants, corporate characters |
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants |
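The ranges in the table can be turned into character classes directly. The following is a minimal Python sketch, not a definitive implementation: the names `HAN_BASIC`, `HAN_ALL`, and `contains_han` are made up for illustration, and the block list simply mirrors the table above.

```python
import re

# Basic block only: usually enough for everyday text.
HAN_BASIC = re.compile("[\u4e00-\u9fff]")

# Basic block plus the extension and compatibility blocks from the table.
# Code points above U+FFFF need the 8-digit \UXXXXXXXX escape in Python.
HAN_ALL = re.compile(
    "["
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\u3400-\u4dbf"          # Extension A
    "\U00020000-\U0002a6df"  # Extension B
    "\U0002a700-\U0002b73f"  # Extension C
    "\U0002b740-\U0002b81f"  # Extension D
    "\U0002b820-\U0002ceaf"  # Extension E
    "\U0002ceb0-\U0002ebef"  # Extension F
    "\U00030000-\U0003134f"  # Extension G
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\U0002f800-\U0002fa1f"  # CJK Compatibility Ideographs Supplement
    "]"
)


def contains_han(text, pattern=HAN_BASIC):
    """Return True if text contains at least one character matched by pattern."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    print(contains_han("hello 世界"))
    print(contains_han("\U00020000"))           # Extension B, outside the basic block
    print(contains_han("\U00020000", HAN_ALL))
```

Passing `HAN_ALL` is only worth the extra width when rare or historic characters actually matter. Note that neither pattern can tell Chinese apart from Japanese kanji or Korean hanja, since those share the same code points.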
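Japanese character ranges
Because many kanji share their Unicode code points with Chinese characters, it is more reliable to detect Japanese text by looking for hiragana and katakana.
- Hiragana: 3040–309F
- Katakana: 30A0–30FF
- Katakana Phonetic Extensions: 31F0–31FF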
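As a rough illustration of detecting Japanese through its kana, here is a small Python sketch built from the three ranges above. The names `KANA` and `looks_japanese` are hypothetical, and the check is only a heuristic: a Japanese string written entirely in kanji would not be detected.

```python
import re

# Hiragana, Katakana, and Katakana Phonetic Extensions, as listed above.
KANA = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u31f0-\u31ff]")


def looks_japanese(text):
    """Heuristic: treat text as Japanese if it contains at least one kana."""
    return KANA.search(text) is not None


if __name__ == "__main__":
    print(looks_japanese("これは日本語です"))
    print(looks_japanese("这是中文"))
```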
Korean character ranges
- Hangul Syllables: AC00–D7AF
- Hangul Jamo: 1100–11FF
- Hangul Compatibility Jamo: 3130–318F
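The same approach works for Korean. The sketch below combines the three Hangul ranges into one character class; `HANGUL` and `count_hangul` are illustrative names, not part of any library.

```python
import re

# Hangul Syllables, Hangul Jamo, and Hangul Compatibility Jamo.
HANGUL = re.compile(r"[\uac00-\ud7af\u1100-\u11ff\u3130-\u318f]")


def count_hangul(text):
    """Count the characters in text that fall in one of the Hangul ranges."""
    return len(HANGUL.findall(text))


if __name__ == "__main__":
    print(count_hangul("\ud55c\uad6d\uc5b4 text"))  # three precomposed syllables plus ASCII
```

Text that has been decomposed into individual jamo (for example by NFD normalization) will produce a higher count than precomposed text, because each syllable becomes several code points.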
Below are regular expressions for Japanese text collected from the web, kept here for reference.
Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9faf) ~ The Big Kahuna!
([一-龯])
Regex for matching Hiragana or Katakana
([ぁ-んァ-ン])
Regex for matching Non-Hiragana or Non-Katakana
([^ぁ-んァ-ン])
Regex for matching Hiragana or Katakana or basic punctuation (、。’)
([ぁ-んァ-ン、。’])
Regex for matching Hiragana or Katakana and random other characters
([ぁ-んァ-ン!:/])
Regex for matching Hiragana
([ぁ-ん])
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
Regex for matching half-width Katakana (hankaku 半角)
([ｧ-ﾝﾞﾟ])
Regex for matching full-width Numbers (zenkaku 全角)
([０-９])
Regex for matching full-width Letters (zenkaku 全角)
([Ａ-Ｚａ-ｚ])
Regex for matching Hiragana codespace characters (includes non phonetic characters)
([ぁ-ゞ])
Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters)
([ァ-ヶ])
Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana)
([ｦ-ﾟ])
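Literal half-width and full-width characters are easy to corrupt when a pattern is copied between pages or passed through Unicode normalization, so it can be safer to write the codespace patterns with escapes. The sketch below does that for the three kana codespaces listed above; the function name `kana_kind` and the returned labels are invented for this example.

```python
import re

# The same codespace ranges as above, written as code point escapes.
HIRAGANA = re.compile(r"[\u3041-\u309e]")            # ぁ to ゞ
FULLWIDTH_KATAKANA = re.compile(r"[\u30a1-\u30f6]")  # full-width ァ to ヶ
HALFWIDTH_KATAKANA = re.compile(r"[\uff66-\uff9f]")  # half-width ｦ to ﾟ


def kana_kind(ch):
    """Classify a single character by the kana codespace it belongs to."""
    if HIRAGANA.fullmatch(ch):
        return "hiragana"
    if FULLWIDTH_KATAKANA.fullmatch(ch):
        return "fullwidth katakana"
    if HALFWIDTH_KATAKANA.fullmatch(ch):
        return "halfwidth katakana"
    return "other"


if __name__ == "__main__":
    for ch in ["\u3042", "\u30a2", "\uff71", "A"]:
        print(repr(ch), kana_kind(ch))
```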
Regex for matching Japanese Post Codes
/^\d{3}\-\d{4}$/
/^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
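For illustration, the seven-digit NNN-NNNN postal code form can be checked in Python as sketched below, with the digit class written as backslash-d. Two details are assumptions of this sketch rather than part of the original patterns: `re.ASCII` is added so that the digit class does not also accept full-width digits, and `fullmatch` replaces the `^` and `$` anchors. The name `is_postcode` is hypothetical.

```python
import re

# Three digits, a hyphen, four digits. re.ASCII limits \d to 0-9.
POSTCODE = re.compile(r"\d{3}-\d{4}", re.ASCII)


def is_postcode(text):
    """Return True if text is exactly a NNN-NNNN postal code."""
    return POSTCODE.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_postcode("100-0001"))
    print(is_postcode("1000001"))
```

This only validates the format; it does not check that the code is actually assigned.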
Regex for matching Japanese mobile phone numbers (keitai bangou)
/^\d{3}-\d{4}-\d{4}$|^\d{11}$/
/^0\d0-\d{4}-\d{4}$/
Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
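The phone number patterns can be combined into a small classifier. This is a sketch under stated assumptions: it uses the stricter mobile pattern (0, digit, 0 prefix) and the hyphenated fixed-line pattern, written with backslash-d, adds `re.ASCII` so that only the digits 0-9 are accepted, and introduces the hypothetical name `classify_phone`. The mobile pattern is tested first because a hyphenated mobile number also satisfies the fixed-line pattern.

```python
import re

# Mobile: 0X0-NNNN-NNNN. Fixed line: NNNN-NNNN or area-exchange-NNNN.
MOBILE = re.compile(r"0\d0-\d{4}-\d{4}", re.ASCII)
FIXED = re.compile(r"\d{1,4}-\d{4}|\d{2,5}-\d{1,4}-\d{4}", re.ASCII)


def classify_phone(number):
    """Return 'mobile', 'fixed', or 'unknown' based on the format alone."""
    if MOBILE.fullmatch(number):
        return "mobile"
    if FIXED.fullmatch(number):
        return "fixed"
    return "unknown"


if __name__ == "__main__":
    for n in ["090-1234-5678", "03-1234-5678", "1234-5678", "0312345678"]:
        print(n, classify_phone(n))
```

Unhyphenated numbers fall through to "unknown" here, since neither of the two patterns used accepts them.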