Spark聚合下推思路以及demo
Spark原本预计在2.3版本实现聚合下推,虽然不知道是何原因最终没有能够在2.3版本最终实现,但是因为工作需要,必须要从聚合函数下手优化Spark SQL,遂思考之实现之。
一篇有意义的参考文章
网上有个牛人想在2.2版本实现聚合下推并提交代码到Spark,结果在pull request里被拒绝了,Spark的人说他们在2.3会实现一套新的DataSource API,即DataSource API v2,所以让他不要这么执着于提交这个代码,我也是很醉。。。这里贴出他的博文:
该大牛是基于物理计划实现的下推,局限性比较大。所以我参考了他的思路,从逻辑计划和物理计划两个方面都做了一些优化。这里只讲逻辑计划的下推。下推,必然最后是推到数据源层,而Spark没有实现DataSource的聚合数据源的接口,这里可以参考下刚刚分享博文实现的AggregatedFilteredScan
接口,我也是基于这个接口的做法实现的。
下推的意义
无论是传统的谓词下推,还是聚合下推,意义都在于将一些操作推到数据源层,这样从数据源里返回的数据就会极大减少。磁盘读写和网络开销都会降低,性能会得到提升。
难点的实现思路
聚合下推的最大难点,我认为是遇到了join,当join的on的两列不属于group列或者aggregate列该肿么办。最开始我认为这种情况可能没有办法下推,因为这样势必要在在group列中加上了原本不属于group的某一join列,这样会影响聚合的结果并且会多一次聚合。但是经过大神提点,其实这样也是可以下推的,原因有二:
- 即使多一个聚合节点,SQL执行的结果也是对的,也就是最终结果来看其实不应该聚合结果。
- 一般来说,join on的两列不可能有相同的行数,如果行数相同,那么按照数据库的设计规范,这两张表就应该Union成一张表。所以多的这个聚合节点,也是会减少数据源的数据传输的。
这两点在后面的例子会有展示。
一个下推到join的简单思路以及结果
数据源准备
使用写一个Spark DataSource的随手笔记的Scott数据源。
初始SQL以及逻辑计划
SQL:
SELECT AVG(salary), deptName
FROM emp
JOIN dept
ON emp.deptNo = dept.deptNo
GROUP BY deptName;
LogicalPlan:
Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
+- Project [salary#6, deptName#8]
+- Join Inner, (deptNo#5 = deptNo#7)
:- Project [deptNo#5, salary#6]
: +- Filter isnotnull(deptNo#5)
: +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
+- Project [deptNo#7, deptName#8]
+- Filter isnotnull(deptNo#7)
+- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
逐步下推
- Step 1,指针在最上层,aggregate节点推到其child节点project的下面,同时将project里的salary#6替换成avg(salary)#19:
Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
+- Join Inner, (deptNo#5 = deptNo#7)
:- Project [deptNo#5, salary#6]
: +- Filter isnotnull(deptNo#5)
: +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
+- Project [deptNo#7, deptName#8]
+- Filter isnotnull(deptNo#7)
+- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
- Step 2,指针在第二层,这次是下推一层后的aggregate节点,搜索join节点下面的左右子project节点,看哪个有salary#6,往salary#6所在的子节点上方添加一个以join字段为group by条件的聚合节点,假设新生成exprId是20:
Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#20 as double)) AS avg(salary)#19, deptName#8]
+- Join Inner, (deptNo#5 = deptNo#7)
:- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#20, deptNo#5]
: +- Project [deptNo#5, salary#6]
: +- Filter isnotnull(deptNo#5)
: +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
+- Project [deptNo#7, deptName#8]
+- Filter isnotnull(deptNo#7)
+- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
- Step 3,指针在第4层的aggregate,与1类似,将aggregate推到project下方,并将project中的salary#6替换成salary#20:
Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#20 as double)) AS avg(salary)#19, deptName#8]
+- Join Inner, (deptNo#5 = deptNo#7)
:- Project [deptNo#5, salary#20]
: +- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#20, deptNo#5]
: +- Filter isnotnull(deptNo#5)
: +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
+- Project [deptNo#7, deptName#8]
+- Filter isnotnull(deptNo#7)
+- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
- Step 2,此时aggregate的子节点是join,将avg(salary)#19的exprId(19)生成新id,假设这里变成avg(salary)#20:
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#20, deptName#8]
+- Join Inner, (deptNo#5 = deptNo#7)
:- Project [deptNo#5, salary#6]
: +- Filter isnotnull(deptNo#5)
: +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
+- Project [deptNo#7, deptName#8]
+- Filter isnotnull(deptNo#7)
+- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
- Step 3,搜索join节点下面的左右子project节点,看哪个有salary#6,往salary#6所在的子节点上方添加一个以join字段为group by条件的聚合节点,假设新生成exprId是21:
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#21 as double)) AS avg(salary)#20, deptName#8]
+- Join Inner, (deptNo#5 = deptNo#7)
:- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
: +- Project [deptNo#5, salary#6]
: +- Filter isnotnull(deptNo#5)
: +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
+- Project [deptNo#7, deptName#8]
+- Filter isnotnull(deptNo#7)
+- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
- Step 4,与1类似,将aggregate推到project下方,并将project中的salary#6替换成salary#21:
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#20, deptName#8]
+- Join Inner, (deptNo#5 = deptNo#7)
:- Project [deptNo#5, salary#21]
: +- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
: +- Filter isnotnull(deptNo#5)
: +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
+- Project [deptNo#7, deptName#8]
+- Filter isnotnull(deptNo#7)
+- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
这时aggregate-filter-relation的组合就会调用到上面提到的AggregatedFilteredScan
接口,调用到数据源的buildscan()方法。
结果查看
这个下推,其实等价于下面的SQL和执行计划:
SQL:
SELECT AVG(salary), deptName
FROM
(SELECT AVG(salary), deptNo FROM emp GROUP BY deptNo) a
JOIN dept
ON a.deptNo = dept.deptNo
GROUP BY deptName
LogicalPlan:
Project [avg(avgsalary)#21, deptName#9]
+- Aggregate [deptName#9], [avg(avgsalary#2) AS avg(avgsalary)#21, deptName#9]
+- Join Inner, (deptNo#6 = deptNo#8)
:- Project [deptNo#6, avgsalary#2]
: +- Aggregate [deptNo#6], [avg(cast(salary#7 as double)) AS avgsalary#2, deptNo#6]
: +- Filter isnotnull(deptNo#6)
: +- Relation[empNo#3,empName#4,mgr#5,deptNo#6,salary#7] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
+- Project [deptNo#8, deptName#9]
+- Filter isnotnull(deptNo#8)
+- Relation[deptNo#8,deptName#9,loc#10] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
在没有聚合下推的情况下,返回的join两侧的数据源是:
data without aggregate function:
List(ListBuffer(20, 800.0), ListBuffer(20, 3000.0), ListBuffer(20, 2975.0), ListBuffer(30, 1600.0), ListBuffer(30, 1250.0), ListBuffer(30, 2950.0), ListBuffer(10, 5000.0))
data without aggregate function:
List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))
而在下推的情况下,返回的join两侧数据源是:
data with aggregate function:
ListBuffer(List(30, 5800.0, 3), List(20, 6775.0, 3), List(10, 5000.0, 1))
data without aggregate function:
List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))
可以看到,下推和没下推,在一侧数据源中拿到的数据有明显减少。这还只是数据量在不到20的情况下。在数据量大的情况下,那么聚合下推效果会更好。
二者查询结果都是:
+------------------+----------+
|avg(salary) |deptName |
+------------------+----------+
|1933.3333333333333|sales |
|5000.0 |accounting|
|2258.3333333333335|research |
+------------------+----------+