SQL正则表达式（一）

2019-02-22 本文已影响0人 Lykit01

一、hive正则表达式

三个函数：regexp,regexp_extract,regexp_replace,

1.A regexp B功能与RLIKE相同

select count(*) from olap_b_dw_hotelorder_f where create_date_wid not regexp '\\d{8}'

2.regexp_replace

语法：regexp_replace(target_string,pattern,substitute_string)

select regexp_replace("IloveYou","You","")

out>>>"Ilove"

3.regexp_extract

语法：regexp_extract(target_string,pattern,index)
index是选择哪个分组（即括号括起来的），从1开始。

疑难问题实战：

比如我有像这样的数据：

'[{"houses":0,"States":[{"state":"01","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"},{"state":"01","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"}]},{"houses":1,"States":[{"state":"07","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"},{"state":"03","inventory":0,"houseCode":"xxx","houseName":"xxxx"}]}]'

我要从里面取出每个house的状态，然后综合各个house的状态，按照一定的逻辑去判断整体状态是什么。所以我要用正则去匹配。然后因为house的数目是不定的，有的数据的house可能有1一个，最多的有4个，所以要匹配出所有的state。因为数据库里一条数据只有一个，只能放一条匹配的数据，所以正则匹配的只显示第一次匹配的结果。
本来一筹莫展想额外建函数来写的，但是在公司小哥哥的指导下找到了解决方法。我们可以开多个列来储存多个匹配结果，这里最多有4个，就开4个。因为正则是优先匹配第一个，所以我在第一个表达式里只写一个组，第二表达式写两个组，以此类推。
然后把4个结果按照一定的逻辑判断整体的状态（用case when）。

select
regexp_extract(origin_info,'state":"(.*?)","inventory',1) as h1,
    regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory',2) as h2,
    regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory',3) as h3,
    regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory',4) as h4
from test;