数据蛙数据分析每周作业小秩学数据分析

SQL正则表达式(一)

2019-02-22  本文已影响0人  Lykit01

一、hive正则表达式

三个函数:regexp,regexp_extract,regexp_replace,

1.A regexp B功能与RLIKE相同

select count(*) from olap_b_dw_hotelorder_f where create_date_wid not regexp '\\d{8}'

2.regexp_replace

语法:regexp_replace(target_string,pattern,substitute_string)

select regexp_replace("IloveYou","You","")

out>>>"Ilove"

3.regexp_extract

语法:regexp_extract(target_string,pattern,index)
index是选择哪个分组(即括号括起来的),从1开始。

疑难问题实战:

比如我有像这样的数据:

'[{"houses":0,"States":[{"state":"01","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"},{"state":"01","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"}]},{"houses":1,"States":[{"state":"07","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"},{"state":"03","inventory":0,"houseCode":"xxx","houseName":"xxxx"}]}]'

我要从里面取出每个house的状态,然后综合各个house的状态,按照一定的逻辑去判断整体状态是什么。所以我要用正则去匹配。然后因为house的数目是不定的,有的数据的house可能有1一个,最多的有4个,所以要匹配出所有的state。因为数据库里一条数据只有一个,只能放一条匹配的数据,所以正则匹配的只显示第一次匹配的结果。
本来一筹莫展想额外建函数来写的,但是在公司小哥哥的指导下找到了解决方法。我们可以开多个列来储存多个匹配结果,这里最多有4个,就开4个。因为正则是优先匹配第一个,所以我在第一个表达式里只写一个组,第二表达式写两个组,以此类推。
然后把4个结果按照一定的逻辑判断整体的状态(用case when)。

select
regexp_extract(origin_info,'state":"(.*?)","inventory',1) as h1,
    regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory',2) as h2,
    regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory',3) as h3,
    regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory',4) as h4
from test;
上一篇下一篇

猜你喜欢

热点阅读