Apache Griffin 0.6 measure module quickstart
2020-12-24
侧耳倾听y
Apache Griffin is a data quality monitoring solution — the only open-source one I could find on the market — so I'm recording my experience using it here.
Dependencies
- jdk8
- hadoop
- spark 2.4.7
Hadoop and Spark can be installed by following any tutorial online.
Hadoop installation reference: https://www.jianshu.com/p/3859f57aa545
Spark installation reference: https://cloud.tencent.com/developer/article/1020647
I installed both Hadoop and Spark directly on my Mac.
Note that I deliberately did not use the latest Spark release: newer Spark versions are built against Scala 2.12, while most of Apache Griffin's jar dependencies use scala.binary.version 2.11, which causes conflicts. I tried bumping the Spark and Scala versions in Griffin's pom files, but — perhaps because I did it wrong — could not get it to run. In the end I gave up and installed Spark 2.4.7.
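If you are unsure which Scala binary version a given Spark distribution was built against, `spark-submit --version` prints it (this assumes Spark's bin directory is on your PATH):

```shell
# The version banner includes a line like
# "Using Scala version 2.11.12" for Spark 2.4.7 pre-built binaries.
spark-submit --version 2>&1 | grep -i scala
```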
Packaging the measure module
First download the Griffin source from the official site and import it into IDEA (I'll skip the screenshot of the project structure). We'll run the sample configuration that ships with the measure module, with files (avro) as the data source. The pom needs one extra dependency:
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.7</version>
</dependency>
```
Also update the Spark version property:
```xml
<spark.version>2.4.7</spark.version>
```
Then build the jar with Maven.
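For reference, a typical build command, run from the Griffin source root (the flags are my choice, not from the Griffin docs — `-DskipTests` just speeds things up):

```shell
# Build the measure module (and the modules it depends on),
# skipping tests; the jar ends up in measure/target/.
mvn clean package -DskipTests -pl measure -am
```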
Submitting the Spark job
Before submitting, a few configuration files need editing. I use the two files env-batch.json and config-batch.json.
env-batch.json configures Spark and the result sinks:
```json
{
  "spark": {
    "log.level": "WARN",
    "config": {
      "spark.master": "local[*]"
    }
  },
  "sinks": [
    {
      "name": "console",
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "name": "hdfs",
      "type": "HDFS",
      "config": {
        "path": "hdfs://localhost:9000/user/root/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    }
  ],
  "griffin.checkpoint": []
}
```
config-batch.json configures the data sources and the rules:
```json
{
  "name": "accu_batch",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "baseline": true,
      "connector": {
        "type": "avro",
        "version": "1.7",
        "config": {
          "file.name": "hdfs://localhost:9000/user/root/griffin/data/users_info_src.avro"
        }
      }
    },
    {
      "name": "target",
      "connector": {
        "type": "avro",
        "version": "1.7",
        "config": {
          "file.name": "hdfs://localhost:9000/user/root/griffin/data/users_info_target.avro"
        }
      }
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name AND source.address = target.address AND source.email = target.email AND source.phone = target.phone AND source.post_code = target.post_code"
      }
    ]
  },
  "sinks": [
    "console", "hdfs"
  ]
}
```
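The config above points at avro files under hdfs://localhost:9000/user/root/griffin/data/, so those files must be uploaded to HDFS first. The sample files ship with the Griffin source (the path below is from my copy of the source tree — verify it in yours):

```shell
# Create the data directory in HDFS and upload the sample avro files
# referenced by config-batch.json (source path assumed; check your checkout).
hdfs dfs -mkdir -p /user/root/griffin/data
hdfs dfs -put measure/src/test/resources/users_info_src.avro /user/root/griffin/data/
hdfs dfs -put measure/src/test/resources/users_info_target.avro /user/root/griffin/data/
```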
From Spark's bin directory, submit the job with:
```shell
spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
/Users/xxx/Downloads/griffin-0.6.0/measure/target/measure-0.6.0.jar \
/Users/xxx/Downloads/griffin-0.6.0/measure/src/main/resources/env-batch.json /Users/xxx/Downloads/griffin-0.6.0/measure/src/main/resources/config-batch.json
```
After that the job runs normally, and the results are written to the locations configured in the sinks.
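Besides the metrics printed to the console, the HDFS sink writes its output under the path configured in env-batch.json. The exact sub-directory layout is Griffin's own, but the metric records are plain JSON text, so listing the directory and cat-ing the files is enough to inspect them:

```shell
# List everything the HDFS sink wrote; the accuracy metrics
# (total / matched / miss counts) are stored as JSON text files.
hdfs dfs -ls -R hdfs://localhost:9000/user/root/griffin/persist
```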