188. Spark 2.0 Dataset Development in Detail - Basic Operations: Persistence
2019-02-11
ZFH__ZJ
Basic operations
Persistence
cache, persist
If a dataset will be computed over more than once, it is recommended to persist it first and then run the operations, so the dataset is not recomputed from scratch each time.
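A minimal sketch of the difference between the two calls, assuming a local SparkSession and a small in-memory Dataset: cache() marks the data for caching at the default storage level, while persist() lets you choose the storage level explicitly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PersistSketch")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(1, 2, 3, 4).toDS()

    // cache() marks the Dataset for caching at the default storage level
    ds.cache()
    println(ds.count())                   // first action materializes the cache
    println(ds.map(_ * 2).reduce(_ + _))  // reuses the cached data

    // persist() does the same but lets you pick the storage level explicitly
    ds.unpersist()
    ds.persist(StorageLevel.MEMORY_ONLY)

    spark.stop()
  }
}
```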
Creating temporary views
createTempView, createOrReplaceTempView
Creating a temporary view mainly lets you run SQL statements directly against the data.
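A small sketch of the difference between the two methods (the view name `emp` and the sample rows are made up for illustration): createTempView throws an AnalysisException if a view with that name already exists, while createOrReplaceTempView overwrites it.

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TempViewSketch")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Tom", 25L), ("Jack", 35L)).toDF("name", "age")

    df.createTempView("emp")            // ok: the view does not exist yet
    // df.createTempView("emp")         // would throw AnalysisException: view already exists
    df.createOrReplaceTempView("emp")   // ok: replaces the existing view

    // With the view registered, SQL can be run directly against the data
    spark.sql("select name from emp where age < 30").show()

    spark.stop()
  }
}
```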
Getting the execution plan
explain
Gets the Spark SQL execution plan.
A DataFrame/Dataset, for example one obtained by running a SQL statement, internally holds a logical plan, the logical execution plan.
At execution time, the underlying Catalyst optimizer first generates a physical execution plan, applying optimizations along the way, such as predicate pushdown (push filter).
Spark also uses whole-stage code generation to generate code automatically and improve execution performance.
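A short sketch of explain in action (the sample rows are made up for illustration): explain() with no argument prints only the physical plan, while explain(true) also prints the parsed, analyzed, and optimized logical plans, which is where Catalyst transformations such as predicate pushdown become visible.

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExplainSketch")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Tom", 25L), ("Jack", 35L)).toDF("name", "age")
    df.createOrReplaceTempView("people")

    val query = spark.sql("select name from people where age < 30")
    query.explain()      // physical plan only
    query.explain(true)  // parsed, analyzed, and optimized logical plans plus the physical plan

    spark.stop()
  }
}
```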
Viewing the schema
printSchema
Writing data to external storage
write
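Beyond write.json, the writer supports other formats and save modes; a sketch, with a placeholder output directory and made-up sample rows:

```scala
import org.apache.spark.sql.SparkSession

object WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WriteSketch")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Tom", 25L), ("Jack", 35L)).toDF("name", "age")
    val out = "/tmp/write-sketch"  // placeholder output directory

    // mode controls what happens when the path already exists:
    // "overwrite", "append", "ignore", or "error" (the default)
    df.write.mode("overwrite").json(out + "/json")

    // the same data can be written in other formats
    df.write.mode("overwrite").parquet(out + "/parquet")
    df.write.mode("overwrite").format("csv").save(out + "/csv")

    spark.stop()
  }
}
```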
Converting between Dataset and DataFrame
as, toDF
Code
import org.apache.spark.sql.SparkSession

object BasicOperation {

  case class Employee(name: String, age: Long, depId: Long, gender: String, salary: Long)

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("BasicOperation")
      .master("local")
      .getOrCreate()
    import sparkSession.implicits._

    val employeePath = this.getClass.getClassLoader.getResource("employee.json").getPath
    val employeeDF = sparkSession.read.json(employeePath)

    // Without caching, each count() recomputes the DataFrame from the source file
    println(employeeDF.count())
    println(employeeDF.count())

    // After cache(), the first action materializes the data; later actions reuse it
    employeeDF.cache()
    println(employeeDF.count())

    // Register a temporary view so SQL can be run directly against the data
    employeeDF.createOrReplaceTempView("employees")
    sparkSession.sql("select * from employees where age < 30").show()
    sparkSession.sql("select * from employees where age < 30").explain()

    employeeDF.printSchema()

    // The view is named "employees", so the query must use that name
    val employeeWithAgeGreaterThan30DF = sparkSession.sql("select * from employees where age > 30")
    employeeWithAgeGreaterThan30DF.write.json("C:\\Users\\Administrator\\Desktop\\employeeWithAgeGreaterThan30DF.json")

    // DataFrame -> Dataset via as[T]; Dataset -> DataFrame via toDF()
    val employeeDS = employeeDF.as[Employee]
    employeeDS.show()
    employeeDS.printSchema()

    val employeeDF2 = employeeDS.toDF()
    employeeDF2.show()
  }
}