大数据大数据点滴玩转大数据

Spark聚合下推思路以及demo

2018-08-13  本文已影响5人  orisonchan

Spark原本预计在2.3版本实现聚合下推,虽然不知道是何原因最终没有能够在2.3版本最终实现,但是因为工作需要,必须要从聚合函数下手优化Spark SQL,遂思考之实现之。

一篇有意义的参考文章

网上有个牛人想在2.2版本实现聚合下推并提交代码到Spark,结果在pull request里被拒绝了,Spark的人说他们在2.3会实现一套新的DataSource API,即DataSource API v2,所以让他不要这么执着于提交这个代码,我也是很醉。。。这里贴出他的博文:

SparkSQL如何实现聚合下推

该大牛是基于物理计划实现的下推,局限性比较大。所以我参考了他的思路,从逻辑计划和物理计划两个方面都做了一些优化。这里只讲逻辑计划的下推。下推,必然最后是推到数据源层,而Spark没有实现DataSource的聚合数据源的接口,这里可以参考下刚刚分享博文实现的AggregatedFilteredScan接口,我也是基于这个接口的做法实现的。

下推的意义

无论是传统的谓词下推,还是聚合下推,意义都在于将一些操作推到数据源层,这样从数据源里返回的数据就会极大减少。磁盘读写和网络开销都会降低,性能会得到提升。

难点的实现思路

聚合下推的最大难点,我认为是遇到了join,当join的on的两列不属于group列或者aggregate列该肿么办。最开始我认为这种情况可能没有办法下推,因为这样势必要在在group列中加上了原本不属于group的某一join列,这样会影响聚合的结果并且会多一次聚合。但是经过大神提点,其实这样也是可以下推的,原因有二:

  1. 即使多一个聚合节点,SQL执行的结果也是对的,也就是最终结果来看其实不应该聚合结果。
  2. 一般来说,join on的两列不可能有相同的行数,如果行数相同,那么按照数据库的设计规范,这两张表就应该Union成一张表。所以多的这个聚合节点,也是会减少数据源的数据传输的。

这两点在后面的例子会有展示。

一个下推到join的简单思路以及结果

数据源准备

使用写一个Spark DataSource的随手笔记的Scott数据源。

初始SQL以及逻辑计划

SQL:

SELECT AVG(salary), deptName 
FROM emp 
JOIN dept 
ON emp.deptNo = dept.deptNo 
GROUP BY deptName;

LogicalPlan:

Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
+- Project [salary#6, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

逐步下推

Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#20 as double)) AS avg(salary)#19, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#20, deptNo#5]
      :  +- Project [deptNo#5, salary#6]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#20 as double)) AS avg(salary)#19, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#20]
      :  +- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#20, deptNo#5]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#21 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
      :  +- Project [deptNo#5, salary#6]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))
Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#21]
      :  +- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

这时aggregate-filter-relation的组合就会调用到上面提到的AggregatedFilteredScan接口,调用到数据源的buildscan()方法。

结果查看

这个下推,其实等价于下面的SQL和执行计划:

SQL:

SELECT AVG(salary), deptName 
FROM 
(SELECT AVG(salary), deptNo FROM emp GROUP BY deptNo) a 
JOIN dept 
ON a.deptNo = dept.deptNo 
GROUP BY deptName

LogicalPlan:

Project [avg(avgsalary)#21, deptName#9]
+- Aggregate [deptName#9], [avg(avgsalary#2) AS avg(avgsalary)#21, deptName#9]
   +- Join Inner, (deptNo#6 = deptNo#8)
      :- Project [deptNo#6, avgsalary#2]
      :  +- Aggregate [deptNo#6], [avg(cast(salary#7 as double)) AS avgsalary#2, deptNo#6]
      :     +- Filter isnotnull(deptNo#6)
      :        +- Relation[empNo#3,empName#4,mgr#5,deptNo#6,salary#7] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#8, deptName#9]
         +- Filter isnotnull(deptNo#8)
            +- Relation[deptNo#8,deptName#9,loc#10] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

在没有聚合下推的情况下,返回的join两侧的数据源是:

data without aggregate function: 
  List(ListBuffer(20, 800.0), ListBuffer(20, 3000.0), ListBuffer(20, 2975.0), ListBuffer(30, 1600.0), ListBuffer(30, 1250.0), ListBuffer(30, 2950.0), ListBuffer(10, 5000.0))
data without aggregate function: 
  List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))

而在下推的情况下,返回的join两侧数据源是:

data with aggregate function: 
  ListBuffer(List(30, 5800.0, 3), List(20, 6775.0, 3), List(10, 5000.0, 1))
data without aggregate function: 
  List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))

可以看到,下推和没下推,在一侧数据源中拿到的数据有明显减少。这还只是数据量在不到20的情况下。在数据量大的情况下,那么聚合下推效果会更好。

二者查询结果都是:

+------------------+----------+
|avg(salary)       |deptName  |
+------------------+----------+
|1933.3333333333333|sales     |
|5000.0            |accounting|
|2258.3333333333335|research  |
+------------------+----------+
上一篇下一篇

猜你喜欢

热点阅读