
Basic DataFrame Operations in Spark

2019-10-07  环境与方法

Preface

When processing data with Spark I used to rely mostly on RDDs, but as tasks have grown more complex, the schema-aware DataFrame has proven simpler and more convenient. This post summarizes the basic DataFrame operations as a reference.

What is a DataFrame?

A DataFrame is a distributed dataset built on top of RDDs in which every column has a name and a data type. It is one of Spark's fundamental data structures: essentially an RDD with a schema attached, which generally makes DataFrame processing faster than the equivalent RDD code. An RDD, by contrast, only exposes its internal partitions; Spark knows nothing about the structure of the elements inside them.

The following walks through the basic operations on a DataFrame (df):

1. Start spark-shell on the server
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
2. Create the DataFrames

First we build a Map from letters to fruit names starting with that letter, and convert it to an Array. We then parallelize the array into an RDD and convert that to a DataFrame with the columns "alphabet" and "fruit".
In the same way we build df1 with the columns "alphabet" and "fruit1", to be used in the join example later.

scala> val test = Map("a" -> "apple", "b" -> "banana", "c" -> "cherry", "d" -> "durian", "g" -> "grape").toArray
test: Array[(String, String)] = Array((a,apple), (b,banana), (g,grape), (c,cherry), (d,durian))

scala> val df = sc.parallelize(test).toDF("alphabet", "fruit")
df: org.apache.spark.sql.DataFrame = [alphabet: string, fruit: string]

scala> val test1 = Map("a" -> "apple1", "b" -> "banana1", "c" -> "cherry1", "d" -> "durian1", "g" -> "grape1").toArray
test1: Array[(String, String)] = Array((a,apple1), (b,banana1), (g,grape1), (c,cherry1), (d,durian1))

scala> val df1 = sc.parallelize(test1).toDF("alphabet", "fruit1")
df1: org.apache.spark.sql.DataFrame = [alphabet: string, fruit1: string]
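As a side note, the transcript shows the array in a different order than the Map literal. Scala's default immutable Map only preserves insertion order for up to four entries; with five it switches to a hash-based implementation. A quick plain-Scala check (no Spark needed):

```scala
// A 5-entry immutable Map is hash-based, so toArray does not
// guarantee the literal's insertion order.
val test = Map("a" -> "apple", "b" -> "banana", "c" -> "cherry",
               "d" -> "durian", "g" -> "grape").toArray

// All five pairs are still present, whatever the iteration order.
val sorted = test.sortBy(_._1)
```

The reordering is harmless here, since the rows of a DataFrame are unordered anyway.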
3. View the current df with show()
scala> df.show()
+--------+------+
|alphabet| fruit|
+--------+------+
|       a| apple|
|       b|banana|
|       g| grape|
|       c|cherry|
|       d|durian|
+--------+------+
4. Add a column fruit_copy with the same values as fruit
scala> df.withColumn("fruit_copy", df("fruit")).show()
+--------+------+----------+
|alphabet| fruit|fruit_copy|
+--------+------+----------+
|       a| apple|     apple|
|       b|banana|    banana|
|       g| grape|     grape|
|       c|cherry|    cherry|
|       d|durian|    durian|
+--------+------+----------+
5. Rename a column: fruit_copy becomes fruit_double
scala> df.withColumn("fruit_copy", df("fruit")).withColumnRenamed("fruit_copy", "fruit_double").show()
+--------+------+------------+
|alphabet| fruit|fruit_double|
+--------+------+------------+
|       a| apple|       apple|
|       b|banana|      banana|
|       g| grape|       grape|
|       c|cherry|      cherry|
|       d|durian|      durian|
+--------+------+------------+
6. Left join df and df1 on the "alphabet" column (the DataFrame counterpart of the RDD leftOuterJoin)
scala> df.join(df1, df("alphabet") === df1("alphabet"), "left").show()
+--------+------+--------+-------+
|alphabet| fruit|alphabet| fruit1|
+--------+------+--------+-------+
|       g| grape|       g| grape1|
|       d|durian|       d|durian1|
|       c|cherry|       c|cherry1|
|       b|banana|       b|banana1|
|       a| apple|       a| apple1|
+--------+------+--------+-------+
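In this example every key of df has a match in df1, so the left join behaves like an inner join. Its defining property only shows up with an unmatched key: the left row is kept and the right-side columns become null. As a plain-Scala analogy of the same semantics (the names here are illustrative, not Spark API), with None playing the role of null:

```scala
// Left-join semantics over plain Scala maps: every left key
// survives; a missing right-hand value becomes None.
val left  = Map("a" -> "apple", "z" -> "zucchini") // "z" has no match
val right = Map("a" -> "apple1")

val joined: Map[String, (String, Option[String])] =
  left.map { case (k, v) => k -> (v, right.get(k)) }
```

In Spark itself, writing the join as `df.join(df1, Seq("alphabet"), "left")` also collapses the duplicated "alphabet" column into a single one, which is usually what you want.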
7. explode: split the "fruit" column on "a". There are two ways to write it:

Method 1:

scala> df.withColumn("split", split(col("fruit"), "a")).withColumn("split", explode(col("split"))).show()
+--------+------+------+
|alphabet| fruit| split|
+--------+------+------+
|       a| apple|      |
|       a| apple|  pple|
|       b|banana|     b|
|       b|banana|     n|
|       b|banana|     n|
|       b|banana|      |
|       g| grape|    gr|
|       g| grape|    pe|
|       c|cherry|cherry|
|       d|durian|  duri|
|       d|durian|     n|
+--------+------+------+

Method 2 (note that Dataset.explode has been deprecated since Spark 2.0 in favor of the functions.explode approach above):

scala> df.explode("fruit", "split") { it: String => it.split("a") }.show()
+--------+------+------+
|alphabet| fruit| split|
+--------+------+------+
|       a| apple|      |
|       a| apple|  pple|
|       b|banana|     b|
|       b|banana|     n|
|       b|banana|     n|
|       g| grape|    gr|
|       g| grape|    pe|
|       c|cherry|cherry|
|       d|durian|  duri|
|       d|durian|     n|
+--------+------+------+

Note that this output differs slightly from method 1: Scala's String.split drops trailing empty strings, so "banana" yields only three pieces here instead of four.
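The two approaches are not exactly equivalent: Spark SQL's split function keeps trailing empty strings (it splits with limit -1), while Scala/Java's String.split with no limit drops them. This can be checked in plain Scala:

```scala
// String.split with no limit (i.e. limit 0) drops trailing empty
// strings; Spark SQL's split behaves like splitting with limit -1,
// which keeps them.
val viaScala = "banana".split("a")     // Array("b", "n", "n")
val viaSpark = "banana".split("a", -1) // Array("b", "n", "n", "")
```

Keep this in mind if the rows produced from trailing separators matter downstream.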