
A brief look at RDDs in Spark

2019-12-27  小草莓子桑

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed on in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, locality-aware scheduling, and scalability. RDDs also allow users to explicitly cache a working set in memory across multiple queries, so later queries can reuse that working set, which greatly improves query speed. Today we take a quick look at RDDs in Spark; the RDD API itself will be covered in detail in the next post.

A brief introduction to RDDs

An RDD can be thought of as a Spark object that itself lives in memory: reading a file yields an RDD, a computation over that file is an RDD, and the result set is an RDD as well; different partitions, the dependencies between data, and key-value map data can all be viewed as RDDs. (Note: the above comes from Baidu Baike.) That is enough background on RDDs themselves, so let's go straight to the two kinds of RDD operations.
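As a rough, minimal sketch of that idea (my own illustration, not from the original text; it assumes a local Spark 2.x installation and the classic SparkContext-based RDD API, and the app name, master URL, and file path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddIntro {
  def main(args: Array[String]): Unit = {
    // Local driver setup; the app name and master URL are placeholder values.
    val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // An RDD created from an in-memory collection.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // An RDD created from a text file; the path is hypothetical
    // (textFile is lazy, so nothing is actually read here).
    val lines = sc.textFile("hdfs:///tmp/input.txt")

    // Each operation yields a new RDD; the source RDD itself is never modified.
    val doubled = numbers.map(_ * 2)
    println(doubled.collect().mkString(", "))   // 2, 4, 6, 8, 10

    sc.stop()
  }
}
```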

The two kinds of RDD operations (operators): Transformation and Action

Transformation

Literally translated, "transformation" means conversion: its main purpose is to generate a new RDD from an existing one through some operation. Transformation code is not executed immediately; the job only actually runs once the code contains an action-type operation. This lazy design lets Spark run more efficiently.
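A minimal sketch of this lazy behavior, reusing the hypothetical `sc` SparkContext from the sketch above (the log path is again a placeholder):

```scala
// Transformations only record the lineage; no job is submitted here.
val logLines = sc.textFile("hdfs:///tmp/access.log")    // hypothetical path
val errors   = logLines.filter(_.contains("ERROR"))     // transformation: lazy
val messages = errors.map(_.toLowerCase)                // transformation: lazy

// Only when an action appears does Spark build and run the job.
val errorCount = errors.count()                         // action: triggers execution
println(errorCount)
```

Nothing touches the file until count() runs, which is why Spark can plan the whole chain of transformations as a single job.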

Transformation API

The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details. A short usage sketch follows the table.

Transformation Meaning
union(otherDataset) Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset) Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions]) Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numPartitions]) When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD; you can pass an optional numPartitions argument to set a different number of tasks.
reduceByKey(func, [numPartitions]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numPartitions]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numPartitions]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset) When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner) Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
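To make the pair-RDD transformations in the table concrete, here is a small illustrative sketch of my own (not from the original post) chaining reduceByKey, join, and sortByKey; note that nothing executes until the final collect():

```scala
// (K, V) pair RDDs are what the *ByKey and join transformations above operate on.
val sales  = sc.parallelize(Seq(("apple", 3), ("pear", 2), ("apple", 5)))
val prices = sc.parallelize(Seq(("apple", 1.5), ("pear", 2.0)))

val totals  = sales.reduceByKey(_ + _)   // ("apple", 8), ("pear", 2)
val joined  = totals.join(prices)        // ("apple", (8, 1.5)), ("pear", (2, 2.0))
val ordered = joined.sortByKey()         // sorted by the fruit name

// Everything above is a transformation; collect() is the action that runs the job.
println(ordered.collect().mkString(", "))
```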

Action

When the program encounters an action, the code is actually executed; every piece of Spark code needs at least one action operation.

Actions API

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) for details. A short usage sketch follows the table.

Action Meaning
reduce(func) Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect() Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() Return the number of elements in the dataset.
first() Return the first element of the dataset (similar to take(1)).
take(n) Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]) Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]) Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path) Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala) Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala) Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey() Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
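And a matching sketch for the actions in this table (again my own illustration, assuming Spark 2.x; the accumulator name and output path are placeholders). Each call below triggers a job immediately:

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark", "action"))

println(words.count())                        // 4
println(words.first())                        // spark
println(words.take(2).mkString(", "))         // the first two elements
println(words.map(w => (w, 1)).countByKey())  // counts: spark -> 2, rdd -> 1, action -> 1

// foreach is normally used for side effects such as updating an accumulator.
val seen = sc.longAccumulator("seen")         // placeholder accumulator name
words.foreach(_ => seen.add(1))
println(seen.value)                           // 4

// Persist the elements as text; the output directory is a placeholder.
words.saveAsTextFile("hdfs:///tmp/words-out")
```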

That is a quick overview of RDDs in Spark; the RDD API will be covered in detail in the next post. You are welcome to discuss and point out anything I got wrong so I can deepen my own understanding. May you all stay free of bugs. Thanks for reading!
