Intercepting Emoji

2017-12-14  Nagi

start

Let's first look at an encoding:


(Figure: encoding example)

How do we parse a code like this? To answer that, we first need to talk about the code space.

Emoji encoding

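In UTF-16, an emoji takes either 2 bytes (a single code unit, e.g. U+2600) or 4 bytes (a surrogate pair). Only the 4-byte form uses code units from 0xd800 upward: the standard reserves 0xd800~0xdbff for the high surrogate and 0xdc00~0xdfff for the low surrogate. The 2-byte emoji sit below 0xd800.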
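To make the surrogate ranges concrete, here is a minimal sketch of the standard UTF-16 arithmetic in plain C (which also compiles inside an Objective-C file). The function names `encode_surrogates` and `decode_surrogates` are made up for this illustration and do not come from any library; the sketch assumes the input is a valid code point in U+10000..U+10FFFF, or a valid surrogate pair.

```c
#include <stdint.h>

/* Split a supplementary code point (U+10000..U+10FFFF) into a
   UTF-16 surrogate pair. Hypothetical helper, no input validation. */
void encode_surrogates(uint32_t cp, uint16_t *high, uint16_t *low) {
    uint32_t v = cp - 0x10000;                 /* 20-bit offset */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Inverse operation: recombine a surrogate pair into a code point. */
uint32_t decode_surrogates(uint16_t high, uint16_t low) {
    return ((uint32_t)(high - 0xD800) << 10)
         + (uint32_t)(low - 0xDC00)
         + 0x10000;
}
```

For example, U+1F600 encodes to the pair 0xD83D 0xDE00, and decoding that pair gives U+1F600 back.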


Filtering

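So how do we filter emoji? Let's first see how someone else did it:
http://www.jianshu.com/p/2597d4c3a183
The main idea is to check which range the code falls into.
One piece of the detection code is the inverse of the encoding step described above: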

const unichar hs = [substring characterAtIndex:0];
if (0xd800 <= hs && hs <= 0xdbff) {
    if (substring.length > 1) {
        const unichar ls = [substring characterAtIndex:1];
        const int uc = ((hs - 0xd800) * 0x400) + (ls - 0xdc00) + 0x10000;
        if (0x1d000 <= uc && uc <= 0x1f77f) {
            isEomji = YES;
        }
    }
}

A few scattered extra checks are then added on top:

__block BOOL isEomji = NO;
[string enumerateSubstringsInRange:NSMakeRange(0, [string length]) options:NSStringEnumerationByComposedCharacterSequences usingBlock:
 ^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
     const unichar hs = [substring characterAtIndex:0];
     if (0xd800 <= hs && hs <= 0xdbff) {
         // hs is a high surrogate, so the character is a surrogate pair
         if (substring.length > 1) {
             const unichar ls = [substring characterAtIndex:1];
             // Recombine the pair into a single code point
             const int uc = ((hs - 0xd800) * 0x400) + (ls - 0xdc00) + 0x10000;
             if (0x1d000 <= uc && uc <= 0x1f77f) {
                 isEomji = YES;
             }
         }
     } else if (substring.length > 1) {
         // Two or more code units: second unit is U+20E3, the combining enclosing keycap
         const unichar ls = [substring characterAtIndex:1];
         if (ls == 0x20e3) {
             isEomji = YES;
         }
     } else {
         // Single code unit: check a handful of ranges and individual values
         if (0x2100 <= hs && hs <= 0x27ff && hs != 0x263b) {
             isEomji = YES;
         } else if (0x2B05 <= hs && hs <= 0x2b07) {
             isEomji = YES;
         } else if (0x2934 <= hs && hs <= 0x2935) {
             isEomji = YES;
         } else if (0x3297 <= hs && hs <= 0x3299) {
             isEomji = YES;
         } else if (hs == 0xa9 || hs == 0xae || hs == 0x303d || hs == 0x3030 || hs == 0x2b55 || hs == 0x2b1c || hs == 0x2b1b || hs == 0x2b50|| hs == 0x231a ) {
             isEomji = YES;
         }
     }
 }];

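The code is fairly old, so it may miss some emoji. Couldn't we just follow the same idea and fill in the missing range checks? I went to the official tables to find out: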

http://unicode.org/emoji/format.html#col-totals
https://apps.timwhitlock.info/emoji/tables/unicode#block-6c-other-additional-symbols

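The code ranges are scattered all over the place and new emoji keep being added, so a quick set of range checks won't do: it is bound to miss some emoji and misclassify other characters. Maintaining it is a pain as well.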

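One idea: scrape all the codes from the official site, build a database from them, and look each character up when checking. That amounts to shipping a character table with the app, which only increases its size.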
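A minimal sketch of what such a lookup could look like, again in plain C: a sorted table of code point ranges searched with binary search. The names `CodeRange`, `kEmojiRanges` and `is_in_emoji_table` are hypothetical, and the table holds only a few whole Unicode blocks for illustration. It is neither complete nor exact (not every character in these blocks is an emoji); a real table would have to be generated from the official data and regenerated whenever emoji are added.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t first, last; } CodeRange;

/* Illustrative subset only: sorted by `first`, non-overlapping. */
static const CodeRange kEmojiRanges[] = {
    { 0x2600,  0x26FF  },  /* Miscellaneous Symbols block */
    { 0x2700,  0x27BF  },  /* Dingbats block */
    { 0x1F300, 0x1F5FF },  /* Miscellaneous Symbols and Pictographs block */
    { 0x1F600, 0x1F64F },  /* Emoticons block */
    { 0x1F680, 0x1F6FF },  /* Transport and Map Symbols block */
};

/* Binary search: true if cp falls inside any range of the table. */
bool is_in_emoji_table(uint32_t cp) {
    size_t lo = 0;
    size_t hi = sizeof kEmojiRanges / sizeof kEmojiRanges[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < kEmojiRanges[mid].first) {
            hi = mid;            /* cp lies before this range */
        } else if (cp > kEmojiRanges[mid].last) {
            lo = mid + 1;        /* cp lies after this range */
        } else {
            return true;         /* first <= cp <= last */
        }
    }
    return false;
}
```

Storing ranges rather than individual code points keeps the table small, but the maintenance burden is the same: the data has to track every new emoji release.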

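Here is another angle. The real reason we intercept emoji in the first place is that the backend database does not support them by default, not that we want to block them for usability reasons. So the more reliable fix is to make the database support storing emoji.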
