188. Spark 2.0 Dataset Development in Detail - Basic Operations: Persistence

2019-02-11  ZFH__ZJ

Basic Operations

Persistence

cache, persist
If a Dataset will be computed more than once, persist it first and then run your operations on it; subsequent actions then reuse the cached data instead of recomputing the whole lineage from scratch.
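Beyond cache(), persist also accepts an explicit storage level; for a Dataset, cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK). A minimal sketch (object and data names are illustrative, not from the original):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PersistSketch").master("local").getOrCreate()
    import spark.implicits._

    val ds = Seq(1, 2, 3, 4).toDS()

    // choose an explicit storage level instead of the cache() default
    ds.persist(StorageLevel.MEMORY_ONLY)

    println(ds.count())                  // first action materializes the cache
    println(ds.map(_ * 2).reduce(_ + _)) // reuses the cached data

    ds.unpersist() // release the cache when it is no longer needed
    spark.stop()
  }
}
```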

Creating temporary views

createTempView, createOrReplaceTempView
Creating a temporary view registers the Dataset under a name, mainly so that you can run SQL statements directly against its data.
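The difference between the two methods: createTempView throws an AnalysisException if a view with that name already exists, while createOrReplaceTempView overwrites it. A small sketch with made-up sample data:

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TempViewSketch").master("local").getOrCreate()
    import spark.implicits._

    val df = Seq(("Tom", 25), ("Jack", 35)).toDF("name", "age")

    // createTempView would fail if "people" already exists;
    // createOrReplaceTempView silently replaces it
    df.createOrReplaceTempView("people")

    // the view name can now be used directly in SQL
    spark.sql("select name from people where age < 30").show()

    spark.stop()
  }
}
```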

Getting the execution plan

explain
Prints the Spark SQL execution plan.
A DataFrame/Dataset - for example one obtained by running a SQL statement - internally holds a logical plan.
At execution time, the underlying Catalyst optimizer turns that logical plan into a physical plan, applying optimizations such as filter pushdown along the way.
Spark also uses whole-stage code generation to automatically generate code for the plan, improving execution performance.
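The plain explain() call prints only the physical plan; explain(true) additionally prints the parsed, analyzed, and optimized logical plans, which makes the Catalyst stages described above visible. A sketch with illustrative data:

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExplainSketch").master("local").getOrCreate()
    import spark.implicits._

    val df = Seq(("Tom", 25), ("Jack", 35)).toDF("name", "age")

    // explain(true) prints all plan stages: parsed logical plan,
    // analyzed logical plan, optimized logical plan, and physical plan
    df.filter($"age" < 30).explain(true)

    spark.stop()
  }
}
```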

Viewing the schema

printSchema

Writing data to external storage

write
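The write API also takes a save mode ("overwrite", "append", "ignore", or the default "error") and supports multiple built-in formats. A sketch, with placeholder output paths chosen for illustration:

```scala
import org.apache.spark.sql.SparkSession

object WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WriteSketch").master("local").getOrCreate()
    import spark.implicits._

    val df = Seq(("Tom", 25), ("Jack", 35)).toDF("name", "age")

    // mode controls what happens when the target path already exists
    df.write.mode("overwrite").json("/tmp/people_json")

    // other built-in formats work the same way, e.g. parquet
    df.write.mode("overwrite").parquet("/tmp/people_parquet")

    spark.stop()
  }
}
```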

Converting between Dataset and DataFrame

as, toDF

Code

import org.apache.spark.sql.SparkSession

object BasicOperation {

  case class Employee(name: String, age: Long, depId: Long, gender: String, salary: Long)

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("BasicOperation")
      .master("local")
      .getOrCreate()

    import sparkSession.implicits._

    val employeePath = this.getClass.getClassLoader.getResource("employee.json").getPath

    val employeeDF = sparkSession.read.json(employeePath)
    // without caching, each count() recomputes the full read
    println(employeeDF.count())
    println(employeeDF.count())
    employeeDF.cache()
    // after cache(), this count reuses the cached data
    println(employeeDF.count())


    employeeDF.createOrReplaceTempView("employees")
    sparkSession.sql("select * from employees where age < 30").show()

    sparkSession.sql("select * from employees where age < 30").explain()

    employeeDF.printSchema()

    // note: query the registered view name "employees", not "employee"
    val employeeWithAgeGreaterThan30DF = sparkSession.sql("select * from employees where age > 30")
    employeeWithAgeGreaterThan30DF.write.json("C:\\Users\\Administrator\\Desktop\\employeeWithAgeGreaterThan30DF.json")

    val employeeDS = employeeDF.as[Employee]
    employeeDS.show()
    employeeDS.printSchema()

    val employeeDF2 = employeeDS.toDF()
    employeeDF2.show()
  }
}