RDD

2019-10-20  自由编程

Terminology

RDD: resilient distributed dataset, Spark's core abstraction — a fault-tolerant collection of elements that can be operated on in parallel.

Environment

Spark 2.4.0 is built and distributed to work with Scala 2.11 by default. Note that the Spark and Scala version numbers must match; otherwise you will hit all kinds of unpredictable errors at runtime. In addition, Spark 2.4.0 works best with JDK 1.8, and if you use it together with Hadoop, Hadoop 2.7 is a suitable choice.

Maven dependencies

# Spark
groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.4.0
# HDFS (only needed if you read from an HDFS cluster)
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
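
If you build with sbt instead of Maven, a minimal build.sbt sketch of the same coordinates might look like the following; the Hadoop version 2.7.7 is just an example 2.7.x release, per the note above, and should match your cluster.

// build.sbt — sbt equivalent of the Maven coordinates above
scalaVersion := "2.11.12" // must match the Scala version Spark was built for

libraryDependencies ++= Seq(
  // %% appends the Scala suffix, yielding spark-core_2.11
  "org.apache.spark" %% "spark-core" % "2.4.0",
  // example 2.7.x version; keep in sync with your HDFS cluster
  "org.apache.hadoop" % "hadoop-client" % "2.7.7"
)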

Initializing the environment

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

// appName is the name shown in the cluster UI; master is a Spark, Mesos
// or YARN cluster URL, or "local[n]" to run locally with n worker threads
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
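
Putting it together, here is a minimal self-contained sketch; the app name rdd-demo and the local[4] master are hypothetical example values, not anything mandated by Spark.

import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    // hypothetical example values: app name and local mode with 4 threads
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 5).count()) // prints 5
    } finally {
      sc.stop() // release the context when done
    }
  }
}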

Initializing from the spark-shell

# local[4] runs Spark locally with 4 worker threads; --packages adds the
# given Maven coordinates to the shell's classpath
./bin/spark-shell --master local[4] --packages "org.example:example:0.1"

Working with data

Creating an RDD: Parallelized Collections

// distribute a local Scala collection to form an RDD
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
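
Once created, the distributed dataset can be operated on in parallel. A minimal example, summing its elements:

// sum the elements of the RDD in parallel
val sum = distData.reduce((a, b) => a + b) // 15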

Loading data: External Datasets

// "data.txt" can be either a local path on the machine, or an hdfs://, s3a://, etc. URI
val distFile = sc.textFile("data.txt")
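The resulting RDD of lines supports the same operations. For instance, a small sketch that adds up the lengths of all lines, assuming data.txt is reachable from every worker:

// add up the lengths of all lines in the file
val totalLength = distFile.map(s => s.length).reduce((a, b) => a + b)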

RDD Operations
...to be continued.
