Processing Chinese Text with Java Regular Expressions

2018-05-08  李2牛

Compiled and verified from cat_book_milk's blog

  1. Check whether a string consists only of Chinese characters (a regex that matches one or more Chinese characters)
/*************************************************************************
* File Name: TestChineseInJava.java
* Author: Kent Lee
* Mail: kent1411390610@gmail.com 
* Created Time: Mon May  7 20:18:46 2018
************************************************************************/

public class TestChineseInJava{
    public static void main(String[] args){
        String allAscii = "China will win in the war of trade with the U.S.A.";     
        String allChinese = "爱我中华智造国芯";
        String chineseWithComma = "全角逗号能否匹配为汉字,呢"; // contains a full-width comma (U+FF0C)
        String mixed = "芯片是 IT 行业的命脉";
        // One or more characters in U+4E00..U+9FA5; matches() must consume the whole string
        String regex = "[\\u4e00-\\u9fa5]+";
        System.out.println("String: allAscii will be false, actually is: "+allAscii.matches(regex));
        System.out.println("String: allChinese will be true, actually is: "+allChinese.matches(regex));
        System.out.println("String: chineseWithComma will be false, actually is: "+chineseWithComma.matches(regex));
        System.out.println("String: mixed will be false, actually is: "+mixed.matches(regex));
    }
}

Output:

[Figure: program output]
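Commonly used Chinese characters are encoded from \u4e00 to \u9fa5 in Unicode, so [\\u4e00-\\u9fa5]+ matches one or more Chinese characters.
Reference: CJK Unicode code table (font editing)
Full-width characters, such as the full-width comma, fall outside this range and are not matched.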
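The range used here stops at U+9FA5, so ideographs encoded elsewhere (for example in CJK Extension A) are not matched either. As a rough sketch of the difference, the following compares that range with Java's `\p{IsHan}` script class (Java 7 and later). The script class is swapped in here as an alternative and is not used in the original post; the class name `HanRangeDemo` and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class HanRangeDemo {
    // The range used in this article: U+4E00..U+9FA5.
    static final Pattern BASIC_RANGE = Pattern.compile("[\\u4e00-\\u9fa5]+");
    // Alternative (assumes Java 7+): any character whose Unicode script is Han.
    static final Pattern HAN_SCRIPT = Pattern.compile("\\p{IsHan}+");

    // True when the whole string lies inside U+4E00..U+9FA5.
    static boolean isAllBasicRange(String s) {
        return BASIC_RANGE.matcher(s).matches();
    }

    // True when the whole string consists of Han-script characters.
    static boolean isAllHanScript(String s) {
        return HAN_SCRIPT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String basic = "爱我中华";
        String fullWidthComma = "\uFF0C"; // full-width comma
        String extensionA = "\u3400";     // an ideograph from CJK Extension A

        System.out.println(isAllBasicRange(basic));          // true
        System.out.println(isAllBasicRange(fullWidthComma)); // false
        System.out.println(isAllBasicRange(extensionA));     // false
        System.out.println(isAllHanScript(extensionA));      // true
        System.out.println(isAllHanScript(fullWidthComma));  // false
    }
}
```

Which of the two is appropriate depends on whether rare or historical ideographs should count as Chinese characters for the task at hand.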
  2. Extract the Chinese characters from a string (use replaceAll to remove the non-Chinese characters)
/*************************************************************************
* File Name: TestChineseInJava.java
* Author: Kent Lee
* Mail: kent1411390610@gmail.com 
* Created Time: Mon May  7 20:18:46 2018
************************************************************************/

public class TestChineseInJava{
    public static void main(String[] args){
        String allAscii = "China will win in the war of trade with the U.S.A.";     
        String allChinese = "爱我中华智造国芯";
        String chineseWithComma = "全角逗号能否匹配为汉字,呢"; // contains a full-width comma (U+FF0C)
        String mixed = "芯片是 IT 行业的命脉";
        String regex = "[\\u4e00-\\u9fa5]+";
        String regex2 = "[^\\u4e00-\\u9fa5]"; // matches any single non-Chinese character
        System.out.println("String: allAscii will be false, actually is: "+allAscii.matches(regex));
        System.out.println("String: allChinese will be true, actually is: "+allChinese.matches(regex));
        System.out.println("String: chineseWithComma will be false, actually is: "+chineseWithComma.matches(regex));
        System.out.println("String: mixed will be false, actually is: "+mixed.matches(regex));
        System.out.println("retrieve pure Chinese characters: "+mixed.replaceAll(regex2,""));
    }
}

The key is to replace every non-Chinese character with the empty string, which leaves only the Chinese characters.


[Figure: extracting the Chinese characters]
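The same replacement can be wrapped in a small reusable helper. This is only a sketch: the class name `HanExtractDemo` and the method `keepHan` are hypothetical, and the pattern is precompiled on the assumption that the helper is called more than once.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class HanExtractDemo {
    // Any single character outside U+4E00..U+9FA5; compiled once and reused.
    private static final Pattern NON_HAN = Pattern.compile("[^\\u4e00-\\u9fa5]");

    // Returns the input with every non-Chinese character removed.
    static String keepHan(String s) {
        return NON_HAN.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepHan("芯片是 IT 行业的命脉"));         // 芯片是行业的命脉
        System.out.println(keepHan("no Chinese here").isEmpty()); // true
    }
}
```

Note that spaces and punctuation are removed along with letters and digits, so separate runs of Chinese characters are joined together in the result.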
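  3. Check whether a string contains any Chinese characters (using the difference between its character count and its encoded byte length)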
/*************************************************************************
* File Name: TestChineseInJava.java
* Author: Kent Lee
* Mail: kent1411390610@gmail.com 
* Created Time: Mon May  7 20:18:46 2018
************************************************************************/

public class TestChineseInJava{
    public static void main(String[] args){
        String allAscii = "China will win in the war of trade with the U.S.A.";     
        String allChinese = "爱我中华智造国芯";
        String chineseWithComma = "全角逗号能否匹配为汉字,呢"; // contains a full-width comma (U+FF0C)
        String mixed = "芯片是 IT 行业的命脉";
        String regex = "[\\u4e00-\\u9fa5]+";
        String regex2 = "[^\\u4e00-\\u9fa5]";
        System.out.println("String: allAscii will be false, actually is: "+allAscii.matches(regex));
        System.out.println("String: allChinese will be true, actually is: "+allChinese.matches(regex));
        System.out.println("String: chineseWithComma will be false, actually is: "+chineseWithComma.matches(regex));
        System.out.println("String: mixed will be false, actually is: "+mixed.matches(regex));
        System.out.println("retrieve pure Chinese characters: "+mixed.replaceAll(regex2,""));
        // Encode explicitly as UTF-8: getBytes() without an argument uses the platform default charset.
        // Any non-ASCII character, not only a Chinese one, makes the two lengths differ.
        System.out.println("true means mixed has no multi-byte (e.g. Chinese) characters, false means it has some: "+(mixed.length() == mixed.getBytes(java.nio.charset.StandardCharsets.UTF_8).length));

    }
}
[Figure: program output]
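The length comparison only shows that some character takes more than one byte, so accented Latin letters trigger it as well. The sketch below contrasts it with a regex search that looks specifically for a character in U+4E00..U+9FA5. The class name `ContainsHanDemo` and both method names are hypothetical, and UTF-8 is assumed as the encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class ContainsHanDemo {
    private static final Pattern HAN = Pattern.compile("[\\u4e00-\\u9fa5]");

    // True when the UTF-8 byte length differs from the char count,
    // i.e. the string contains something other than plain ASCII.
    static boolean hasMultiByteChar(String s) {
        return s.length() != s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when at least one character lies in U+4E00..U+9FA5.
    static boolean containsHan(String s) {
        return HAN.matcher(s).find();
    }

    public static void main(String[] args) {
        String mixed = "芯片是 IT 行业的命脉";
        String accented = "caf\u00e9"; // Latin text with one accented letter

        System.out.println(hasMultiByteChar(mixed));    // true
        System.out.println(containsHan(mixed));         // true
        System.out.println(hasMultiByteChar(accented)); // true
        System.out.println(containsHan(accented));      // false
    }
}
```

For input that may contain other non-ASCII text, the regex search is the more reliable of the two checks.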
  4. Count the Chinese characters (regex matching)
/*************************************************************************
* File Name: getChineseCharacters.java
* Author: Kent Lee
* Mail: kent1411390610@gmail.com 
* Created Time: Mon May  7 21:23:03 2018
************************************************************************/
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class getChineseCharacters{

    public static void main(String[] args){
        int count = 0;              
        String allAscii = "China will win in the war of trade with the U.S.A.";     
        String allChinese = "爱我中华智造国芯";
        String chineseWithComma = "全角逗号能否匹配为汉字,呢"; // contains a full-width comma (U+FF0C)
        String mixed = "芯片是 IT 行业的命脉,所以我们无论如何都不能放弃自主芯片的研究";
        String motto = "历史告诉我们中国必须走独立自主的道路:赫鲁晓夫曾说苏联拥核可以保护中国,劝中国不要研究核武,但是很快中苏交恶。国与国没有永远的蜜月,可以信任依靠的只有万众一心的人民";
        String regex = "[\\u4e00-\\u9fa5]"; // exactly one Chinese character per match
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(mixed);
        while(matcher.find()){
            // groupCount() is 0 here (no capturing groups), so this loop body runs once per match
            for(int i = 0;i <= matcher.groupCount();i++){
                count++;
            }
        }
        System.out.println("there are "+count+" Chinese characters in mixed: "+mixed);
        count = 0;
        matcher = pattern.matcher(motto);
        while(matcher.find()){
            for(int i = 0;i <= matcher.groupCount();i++){
                count++;
            }
        }
        System.out.println("there are "+count+" Chinese characters in motto: "+motto);

        count = 0;
        matcher = pattern.matcher(allChinese);
        while(matcher.find()){
            for(int i = 0;i <= matcher.groupCount();i++){
                count++;
            }
        }
        System.out.println("there are "+count+" Chinese characters in allChinese: "+allChinese);
    }
}

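One odd observation:
for(int i = 0;i <= matcher.groupCount();i++){ // without the "=", the count comes out wrong
This was left unresolved at first.
Follow-up, the next day:
According to the Javadoc for Matcher.find(), find() looks for the next subsequence of the input that matches the Pattern, somewhat like stepping through input with Scanner.hasNextInt(). groupCount() always returns 0 here, because the pattern has no capturing groups and group zero (the whole match) is not included in the count. With i < matcher.groupCount() the inner loop body never runs, so count stays 0; with <= it runs exactly once per match. See the Matcher.groupCount() documentation for details.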
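To make the groupCount() behaviour concrete, here is a minimal sketch that compares the pattern used in this article with a variant wrapped in one capturing group. The class name `GroupCountDemo` and the helper `groupsIn` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class GroupCountDemo {
    // Number of capturing groups declared in the regex; group zero is not counted.
    static int groupsIn(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupsIn("[\\u4e00-\\u9fa5]"));   // 0
        System.out.println(groupsIn("([\\u4e00-\\u9fa5])")); // 1

        // group(0) is still available after a successful find(): it is the whole match.
        Matcher m = Pattern.compile("[\\u4e00-\\u9fa5]").matcher("中文");
        if (m.find()) {
            System.out.println(m.group(0)); // 中
        }
    }
}
```

groupCount() depends only on the pattern, not on how many matches find() has produced, which is why it cannot serve as a match counter.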
To count the matches, a simpler formulation is:

int count = 0;
while(matcher.find()){
    count++;
}
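Put together as a self-contained program, the simpler loop might look like the sketch below. The class name `HanCountDemo` and the method `countHan` are hypothetical; the sample strings are taken from the listings above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class HanCountDemo {
    // Matches exactly one character in U+4E00..U+9FA5.
    private static final Pattern HAN = Pattern.compile("[\\u4e00-\\u9fa5]");

    // Counts matches by advancing find() until no further match exists.
    static int countHan(String s) {
        Matcher matcher = HAN.matcher(s);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countHan("爱我中华智造国芯"));       // 8
        System.out.println(countHan("芯片是 IT 行业的命脉")); // 8
        System.out.println(countHan("IT"));                 // 0
    }
}
```

Because the pattern has no + quantifier, each find() consumes a single character, so the number of matches equals the number of Chinese characters.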