Matching CJK characters (Chinese, Japanese, and Korean) with regular expressions

2021-07-01 mudssky

Chinese character ranges

For details, see this document on the Unicode website:

https://www.unicode.org/versions/Unicode13.0.0/ch18.pdf

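Chinese characters span a wide range of code points, and many Han characters are shared with Japanese and Korean. In the standard's own words: "The Unicode Standard contains a set of unified Han ideographic characters used in the written Chinese, Japanese, and Korean languages."

CJK simply stands for Chinese, Japanese, and Korean.

The table below is copied directly from that document's section on Han ideographs.

4E00–9FFF is the range that was defined first and it covers most characters in common use, so matching against it is usually enough for Chinese text.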

Block	Range	Comment
CJK Unified Ideographs	4E00–9FFF	Common
CJK Unified Ideographs Extension A	3400–4DBF	Rare
CJK Unified Ideographs Extension B	20000–2A6DF	Rare, historic
CJK Unified Ideographs Extension C	2A700–2B73F	Rare, historic
CJK Unified Ideographs Extension D	2B740–2B81F	Uncommon, some in current use
CJK Unified Ideographs Extension E	2B820–2CEAF	Rare, historic
CJK Unified Ideographs Extension F	2CEB0–2EBEF	Rare, historic
CJK Unified Ideographs Extension G	30000–3134F	Rare, historic
CJK Compatibility Ideographs	F900–FAFF	Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement	2F800–2FA1F	Unifiable variants
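
As a rough illustration of how the table translates into a pattern, here is a minimal Python sketch. The names `HAN_COMMON`, `HAN_ALL`, and `contains_han` are made up for this example; the ranges are the ones listed in the table. Blocks above U+FFFF need the eight-digit `\U` escape, and in engines that work on UTF-16 code units (not Python 3) they would need extra care.

```python
import re

# Common case: the original CJK Unified Ideographs block only.
HAN_COMMON = re.compile(r"[\u4E00-\u9FFF]")

# Every block from the table. Plain (non-raw) strings are used here so that
# Python turns the escapes into literal characters inside the class.
HAN_ALL = re.compile(
    "["
    "\u4E00-\u9FFF"          # CJK Unified Ideographs
    "\u3400-\u4DBF"          # Extension A
    "\U00020000-\U0002A6DF"  # Extension B
    "\U0002A700-\U0002B73F"  # Extension C
    "\U0002B740-\U0002B81F"  # Extension D
    "\U0002B820-\U0002CEAF"  # Extension E
    "\U0002CEB0-\U0002EBEF"  # Extension F
    "\U00030000-\U0003134F"  # Extension G
    "\uF900-\uFAFF"          # CJK Compatibility Ideographs
    "\U0002F800-\U0002FA1F"  # CJK Compatibility Ideographs Supplement
    "]"
)


def contains_han(text: str) -> bool:
    """Return True if text has at least one character from any listed block."""
    return HAN_ALL.search(text) is not None


if __name__ == "__main__":
    print(HAN_COMMON.findall("abc中文def"))
    print(contains_han("hello"))
```

For everyday text `HAN_COMMON` is probably all you need; `HAN_ALL` only matters when rare or historic characters may show up.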

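Japanese character ranges

Many Japanese kanji share their Unicode code points with Chinese characters, so to decide whether text is Japanese it is better to look for hiragana and katakana.

Hiragana: 3040–309F
Katakana: 30A0–30FF
Katakana Phonetic Extensions: 31F0–31FF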
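
The kana ranges above can be sketched in Python like this. `HIRAGANA`, `KATAKANA`, and `has_kana` are hypothetical names for this example, and the check is only a heuristic: a string made up entirely of kanji will not be recognized as Japanese.

```python
import re

# Ranges taken from the list above.
HIRAGANA = re.compile(r"[\u3040-\u309F]")
# Katakana block plus the Katakana Phonetic Extensions block.
KATAKANA = re.compile(r"[\u30A0-\u30FF\u31F0-\u31FF]")


def has_kana(text: str) -> bool:
    """Return True if text contains at least one hiragana or katakana character."""
    return bool(HIRAGANA.search(text) or KATAKANA.search(text))


if __name__ == "__main__":
    print(has_kana("日本語のテキスト"))
    print(has_kana("中文文本"))
```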

Korean character ranges

Hangul Syllables: AC00–D7AF
Hangul Jamo: 1100–11FF
Hangul Compatibility Jamo: 3130–318F
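
The same idea works for Korean. A minimal sketch, assuming the three ranges above are all you care about; `HANGUL` and `has_hangul` are names invented for this example.

```python
import re

# Hangul Syllables, Hangul Jamo, and Hangul Compatibility Jamo, as listed above.
HANGUL = re.compile(r"[\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F]")


def has_hangul(text: str) -> bool:
    """Return True if text contains at least one character from the Hangul ranges."""
    return HANGUL.search(text) is not None


if __name__ == "__main__":
    print(has_hangul("한국어"))
    print(has_hangul("中文"))
```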

Below are some regexes for matching Japanese that I found online, kept here for reference.


Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9faf) ~ The Big Kahuna!
([一-龯])

Regex for matching Hiragana or Katakana
([ぁ-んァ-ン])

Regex for matching Non-Hiragana or Non-Katakana
([^ぁ-んァ-ン])

Regex for matching Hiragana or Katakana or word characters (note that \w matches letters, digits, and underscore, not the punctuation 、。’)
([ぁ-んァ-ン\w])

Regex for matching Hiragana or Katakana and random other characters
([ぁ-んァ-ン！：／])

Regex for matching Hiragana
([ぁ-ん])

Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])

Regex for matching half-width Katakana (hankaku 半角)
([ｧ-ﾝﾞﾟ])

Regex for matching full-width Numbers (zenkaku 全角)
([０-９])

Regex for matching full-width Letters (zenkaku 全角); a single Ａ-ｚ range would also match the symbols ［＼］＾＿｀ that sit between Ｚ and ａ
([Ａ-Ｚａ-ｚ])

Regex for matching Hiragana codespace characters (includes non phonetic characters)
([ぁ-ゞ])

Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters)
([ァ-ヶ])

Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana)
([ｦ-ﾟ])

Regex for matching Japanese Post Codes
/^\d{3}\-\d{4}$/
/^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
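
For reference, the seven-digit postcode pattern could be used from Python roughly like this. `POSTCODE` and `is_postcode` are hypothetical names, and the `re.ASCII` flag is my own addition: without it, Python's `\d` also accepts full-width digits such as １２３, which may or may not be what you want.

```python
import re

# Three digits, a hyphen, four digits. re.ASCII limits \d to 0-9.
POSTCODE = re.compile(r"\d{3}-\d{4}", re.ASCII)


def is_postcode(text: str) -> bool:
    """Return True if the whole string looks like NNN-NNNN."""
    return POSTCODE.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_postcode("100-0001"))
    print(is_postcode("1000001"))
```

`fullmatch` takes the place of the `^` and `$` anchors in the original pattern.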

Regex for matching Japanese mobile phone numbers (keitai bangou)
/^\d{3}-\d{4}-\d{4}$|^\d{11}$/
/^0\d0-\d{4}-\d{4}$/
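
A quick sketch of the stricter mobile pattern (the one starting with 0, any digit, then 0) in Python. `MOBILE` and `is_mobile` are names made up for this example, and `re.ASCII` is again my assumption so that only the digits 0-9 are accepted. The pattern only checks the shape of the number, not whether the prefix is actually assigned.

```python
import re

# 0X0-NNNN-NNNN with hyphens required, as in the second pattern above.
MOBILE = re.compile(r"0\d0-\d{4}-\d{4}", re.ASCII)


def is_mobile(text: str) -> bool:
    """Return True if the whole string has the form 0X0-NNNN-NNNN."""
    return MOBILE.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_mobile("090-1234-5678"))
    print(is_mobile("03-1234-5678"))
```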

Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
