Matching CJK characters (Chinese, Japanese, and Korean) with regular expressions

2021-07-01  mudssky

Chinese character ranges

For details, see this chapter of the Unicode Standard:

https://www.unicode.org/versions/Unicode13.0.0/ch18.pdf

Chinese characters span several code point ranges, and many Han characters are shared with Japanese and Korean (The Unicode Standard contains a set of unified Han ideographic characters used in the written Chinese, Japanese, and Korean languages).

CJK stands for Chinese, Japanese, and Korean.

The table below is copied directly from the Unicode document's section on Han ideographs.

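4E00–9FFF is the original CJK Unified Ideographs block and covers most characters in common use, so matching against this range alone is usually enough for Han characters.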
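As a minimal sketch of that idea in Python, the snippet below checks only the 4E00–9FFF block. The names `HAN_BASIC` and `contains_han` are made up for illustration, and the extension blocks are deliberately left out.

```python
import re

# Basic CJK Unified Ideographs block only (U+4E00 to U+9FFF).
# Rare characters in the extension blocks will not match.
HAN_BASIC = re.compile(r"[\u4e00-\u9fff]")


def contains_han(text: str) -> bool:
    """Return True if text has at least one character in U+4E00..U+9FFF."""
    return HAN_BASIC.search(text) is not None


print(contains_han("正则表达式"))  # True
print(contains_han("regex"))       # False
```

Because Japanese kanji and Korean hanja share these code points, a match here says "Han character", not "Chinese text".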

| Block | Range | Comment |
|---|---|---|
| CJK Unified Ideographs | 4E00–9FFF | Common |
| CJK Unified Ideographs Extension A | 3400–4DBF | Rare |
| CJK Unified Ideographs Extension B | 20000–2A6DF | Rare, historic |
| CJK Unified Ideographs Extension C | 2A700–2B73F | Rare, historic |
| CJK Unified Ideographs Extension D | 2B740–2B81F | Uncommon, some in current use |
| CJK Unified Ideographs Extension E | 2B820–2CEAF | Rare, historic |
| CJK Unified Ideographs Extension F | 2CEB0–2EBEF | Rare, historic |
| CJK Unified Ideographs Extension G | 30000–3134F | Rare, historic |
| CJK Compatibility Ideographs | F900–FAFF | Duplicates, unifiable variants, corporate characters |
| CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants |
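If the rare blocks matter too, one possible approach is to build a single character class from every range in the table. The sketch below does this in Python; `HAN_BLOCKS`, `HAN_ALL`, and `is_han` are illustrative names, and the ranges are the Unicode 13.0 values listed above.

```python
import re

# (start, end) code points taken from the table above (Unicode 13.0).
HAN_BLOCKS = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2B820, 0x2CEAF),  # Extension E
    (0x2CEB0, 0x2EBEF),  # Extension F
    (0x30000, 0x3134F),  # Extension G
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]

# Produces a class like [\U00004e00-\U00009fff\U00003400-\U00004dbf...].
HAN_ALL = re.compile(
    "[" + "".join(f"\\U{a:08x}-\\U{b:08x}" for a, b in HAN_BLOCKS) + "]"
)


def is_han(ch: str) -> bool:
    """Return True if the single character ch falls in any listed block."""
    return HAN_ALL.fullmatch(ch) is not None


print(is_han("中"))           # True
print(is_han("\U00020000"))   # True, first code point of Extension B
print(is_han("a"))            # False
```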

Japanese character ranges

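Many kanji used in Japanese share their Unicode code points with Chinese Han characters, so to detect Japanese text it is better to look for hiragana and katakana.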
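A rough sketch of this heuristic in Python follows. The block ranges (Hiragana U+3040–U+309F, Katakana U+30A0–U+30FF) are not given in this article and are filled in here as an assumption from the Unicode block list; `KANA` and `looks_japanese` are made-up names.

```python
import re

# Hiragana block (U+3040 to U+309F) plus Katakana block (U+30A0 to U+30FF).
KANA = re.compile(r"[\u3040-\u309f\u30a0-\u30ff]")


def looks_japanese(text: str) -> bool:
    """Heuristic: treat text as Japanese if it contains any kana."""
    return KANA.search(text) is not None


print(looks_japanese("日本語のテキスト"))  # True, contains kana
print(looks_japanese("中文文本"))          # False, Han characters only
```

The heuristic has an obvious blind spot: a Japanese string written entirely in kanji contains no kana and will not be detected.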

Korean character ranges
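This article does not list ranges for Korean. As an assumption based on the Unicode block list, Hangul can be matched with the Hangul Syllables block (U+AC00–U+D7AF), the Hangul Jamo block (U+1100–U+11FF), and the Hangul Compatibility Jamo block (U+3130–U+318F). A minimal Python sketch, with `HANGUL` and `contains_hangul` as illustrative names:

```python
import re

# Assumed ranges (not from the original article):
#   U+AC00..U+D7AF  Hangul Syllables
#   U+1100..U+11FF  Hangul Jamo
#   U+3130..U+318F  Hangul Compatibility Jamo
HANGUL = re.compile(r"[\uac00-\ud7af\u1100-\u11ff\u3130-\u318f]")


def contains_hangul(text: str) -> bool:
    """Return True if text contains at least one Hangul character."""
    return HANGUL.search(text) is not None


print(contains_hangul("한국어"))  # True
print(contains_hangul("日本語"))  # False
```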

Below are regular expressions for matching Japanese that I found online, kept here for reference.


Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9faf) ~ The Big Kahuna!
([一-龯])

Regex for matching Hiragana or Katakana
([ぁ-んァ-ン])

Regex for matching characters that are neither Hiragana nor Katakana
([^ぁ-んァ-ン])

Regex for matching Hiragana or Katakana or basic punctuation (、。’)
([ぁ-んァ-ン、。’])

Regex for matching Hiragana or Katakana and a few other characters (!:/)
([ぁ-んァ-ン!:/])

Regex for matching Hiragana
([ぁ-ん])

Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])

Regex for matching half-width Katakana (hankaku 半角)
([ｧ-ﾝﾞﾟ])

Regex for matching full-width Numbers (zenkaku 全角)
([０-９])

Regex for matching full-width Letters (zenkaku 全角)
([Ａ-Ｚａ-ｚ])

Regex for matching Hiragana codespace characters (includes non phonetic characters)
([ぁ-ゞ])

Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters)
([ァ-ヶ])
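To see what the "codespace" variants add over the purely phonetic ranges, here is a small Python check. It is only a sketch: the constant names are made up, and the patterns are written with `\u` escapes so the exact code points are unambiguous.

```python
import re

HIRAGANA_PHONETIC = re.compile(r"[\u3041-\u3093]")   # same range as [ぁ-ん]
HIRAGANA_CODESPACE = re.compile(r"[\u3041-\u309e]")  # U+3041 to U+309E
KATAKANA_PHONETIC = re.compile(r"[\u30a1-\u30f3]")   # same range as [ァ-ン]
KATAKANA_CODESPACE = re.compile(r"[\u30a1-\u30f6]")  # same range as [ァ-ヶ]

VU_HIRAGANA = "\u3094"  # HIRAGANA LETTER VU
VU_KATAKANA = "\u30f4"  # KATAKANA LETTER VU
LONG_VOWEL = "\u30fc"   # prolonged sound mark, as in コーヒー

print(bool(HIRAGANA_PHONETIC.fullmatch(VU_HIRAGANA)))   # False
print(bool(HIRAGANA_CODESPACE.fullmatch(VU_HIRAGANA)))  # True
print(bool(KATAKANA_PHONETIC.fullmatch(VU_KATAKANA)))   # False
print(bool(KATAKANA_CODESPACE.fullmatch(VU_KATAKANA)))  # True
print(bool(KATAKANA_CODESPACE.fullmatch(LONG_VOWEL)))   # False
```

The last line is worth noting: the prolonged sound mark (U+30FC) lies outside both Katakana ranges, so a word such as コーヒー is not fully matched by either pattern unless that character is added to the class.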

Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana)
([ｦ-ﾟ])
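Half-width Katakana live at U+FF66–U+FF9F, separate from the full-width Katakana block, so a full-width pattern will not match them. An alternative to maintaining two patterns, shown here only as a sketch, is to fold the text with Unicode NFKC normalization first. `HALFWIDTH_KATAKANA` and `to_fullwidth` are illustrative names.

```python
import re
import unicodedata

# Half-width Katakana run from U+FF66 to U+FF9F.
HALFWIDTH_KATAKANA = re.compile(r"[\uff66-\uff9f]")


def to_fullwidth(text: str) -> str:
    """Fold half-width kana (and other compatibility forms) using NFKC."""
    return unicodedata.normalize("NFKC", text)


half = "\uff76\uff9e"  # half-width KA followed by the half-width voiced mark
print(bool(HALFWIDTH_KATAKANA.search(half)))  # True
print(to_fullwidth(half) == "\u30ac")         # True, full-width GA
```

Keep in mind that NFKC changes more than kana: full-width digits and letters are folded to their ASCII forms as well, which may or may not be what you want.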

Regex for matching Japanese postal codes
/^\d{3}-\d{4}$/
/^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
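As a usage sketch, the first postal code pattern can be applied in Python as follows. `POSTCODE` and `is_postcode` are made-up names. One detail that is easy to miss: in Python 3, `\d` also matches full-width digits unless `re.ASCII` is set, so the flag is used here on the assumption that only ASCII digits should be accepted.

```python
import re

# Three digits, a hyphen, four digits. re.ASCII restricts \d to 0-9.
POSTCODE = re.compile(r"\d{3}-\d{4}", re.ASCII)


def is_postcode(value: str) -> bool:
    """Return True if value is formatted like NNN-NNNN with ASCII digits."""
    return POSTCODE.fullmatch(value) is not None


print(is_postcode("100-0001"))  # True
print(is_postcode("1000001"))   # False, hyphen missing
```

This checks the format only, not whether the code is actually assigned.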

Regex for matching Japanese mobile phone numbers (keitai bangou)
/^\d{3}-\d{4}-\d{4}$|^\d{11}$/
/^0\d0-\d{4}-\d{4}$/

Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
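The phone number patterns can be combined into a small classifier. The sketch below uses the stricter mobile pattern and the three-part fixed line pattern from the list above; `MOBILE`, `FIXED_LINE`, and `classify` are illustrative names, and `re.ASCII` is again an assumption that only ASCII digits are wanted.

```python
import re

MOBILE = re.compile(r"0\d0-\d{4}-\d{4}", re.ASCII)
FIXED_LINE = re.compile(r"\d{2,5}-\d{1,4}-\d{4}", re.ASCII)


def classify(number: str) -> str:
    """Classify a hyphenated number by format only."""
    # Mobile is tested first because a mobile number also fits the
    # looser fixed line pattern.
    if MOBILE.fullmatch(number):
        return "mobile"
    if FIXED_LINE.fullmatch(number):
        return "fixed"
    return "unknown"


print(classify("090-1234-5678"))  # mobile
print(classify("03-1234-5678"))   # fixed
print(classify("0312345678"))     # unknown, no hyphens
```

These patterns only describe the shape of a number, so any prefix of the form 0N0 is reported as mobile here even if it is used for another service.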