Spark Notes

2017-03-24, 开水的杯子

Spark was designed to address what MapReduce (MR) could not: performance problems caused by having no way to reuse data between computations. Two kinds of workload suffer most:
- Iterative jobs (popular in Machine Learning algorithms)
- Interactive analytics (ad hoc exploratory queries)
Resilient distributed dataset (RDD): a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. RDDs can be cached in memory and reused in multiple parallel operations.
Fault tolerance is achieved through lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to rebuild just that partition.
- A handle to an RDD contains enough information to compute the RDD starting from data in reliable storage.
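
The lineage idea above can be sketched with a toy model in plain Python. This is only an illustration under assumptions: `ToyRDD`, `from_storage`, `lose`, and the `computed` log are hypothetical names made up for this note, not Spark's API, and "reliable storage" is just an in-memory list.

```python
# Toy model of lineage. All names here are illustrative, not Spark's API.
class ToyRDD:
    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute   # partition index -> list of elements
        self._cache = {}          # partitions currently held in memory
        self.computed = []        # log of partition indices that were (re)computed

    @classmethod
    def from_storage(cls, blocks):
        # Base of the lineage chain: data that lives in reliable storage.
        return cls(len(blocks), lambda i: list(blocks[i]))

    def map(self, f):
        # The child remembers only its parent and the function to apply.
        return ToyRDD(self.num_partitions,
                      lambda i: [f(x) for x in self.partition(i)])

    def partition(self, i):
        if i not in self._cache:
            self.computed.append(i)
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose(self, i):
        # Simulate a node failure that drops one in-memory partition.
        self._cache.pop(i, None)


storage = [[1, 2], [3, 4], [5, 6]]
squares = ToyRDD.from_storage(storage).map(lambda x: x * x)
before = [squares.partition(i) for i in range(squares.num_partitions)]
squares.lose(1)
rebuilt = squares.partition(1)   # recomputed from the parent, other partitions untouched
```

After the simulated failure, only partition 1 is recomputed; partitions 0 and 2 stay as they were.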

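An RDD can be constructed in four ways:
- From a file in HDFS
- By parallelizing a Scala collection
- By transforming an existing RDD
- By changing the persistence of an existing RDD:
  - Cache: the dataset stays lazy, but is kept in memory after it is first computed. This is only a hint: if there is not enough memory, partitions are recomputed when they are used.
  - Save: evaluates the dataset and writes it to a distributed file system such as HDFS.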
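
The "cache is lazy and only a hint" behaviour can be illustrated with a small toy class. This is a sketch under assumptions: `LazyDataset`, `memory_slots`, and `compute_calls` are hypothetical names, and memory is modelled simply as a number of partitions that fit.

```python
# Toy model of a lazy dataset whose cache() is a hint. Names are illustrative only.
class LazyDataset:
    def __init__(self, compute, num_partitions, memory_slots):
        self._compute = compute
        self.num_partitions = num_partitions
        self._memory_slots = memory_slots   # assumed: how many partitions fit in memory
        self._want_cache = False
        self._cached = {}
        self.compute_calls = 0

    def cache(self):
        # Records the hint only; nothing is computed here.
        self._want_cache = True
        return self

    def partition(self, i):
        if i in self._cached:
            return self._cached[i]
        self.compute_calls += 1
        data = self._compute(i)
        # Keep the partition only if it was requested and there is room.
        if self._want_cache and len(self._cached) < self._memory_slots:
            self._cached[i] = data
        return data


ds = LazyDataset(lambda i: [i] * 3, num_partitions=3, memory_slots=2).cache()
calls_after_cache = ds.compute_calls                 # still lazy: nothing computed yet
first_pass = [ds.partition(i) for i in range(3)]     # computes all three partitions
second_pass = [ds.partition(i) for i in range(3)]    # two come from memory, one is recomputed
```

With room for only two of the three partitions, the second pass still returns the same data; the partition that did not fit is simply recomputed.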
Parallel Operations
Reduce: combines dataset elements using an associative function to produce a result at the driver program. Reduce results are collected at only one process (the driver).
Collect: sends all elements of the dataset to the driver program.
Foreach: passes each element through a user-provided function, for its side effects only.
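
A minimal sketch of these three operations over a partitioned dataset, in plain Python. The function names `rdd_reduce`, `rdd_collect`, and `rdd_foreach` are hypothetical helpers for this note, and each inner list stands in for the data held by one worker.

```python
from functools import reduce as fold

partitions = [[1, 2, 3], [4, 5], [6]]   # toy data: one list per worker


def rdd_reduce(parts, f):
    # Each worker folds its own partition; the driver folds the partial results.
    # Grouping the work this way is only valid because f is associative.
    partials = [fold(f, p) for p in parts if p]
    return fold(f, partials)


def rdd_collect(parts):
    # Every element is shipped to the driver.
    return [x for p in parts for x in p]


def rdd_foreach(parts, f):
    # f is called for its side effects; nothing is returned.
    for p in parts:
        for x in p:
            f(x)


total = rdd_reduce(partitions, lambda a, b: a + b)
everything = rdd_collect(partitions)
seen = []
rdd_foreach(partitions, seen.append)
```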

Broadcast variables: distribute a large piece of read-only data to each worker once, instead of packaging it with every closure.
Accumulators: workers can only add to them; only the driver can read them.
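
The intended usage of the two shared-variable types can be sketched as follows. This is a single-process toy, assuming hypothetical `Broadcast` and `Accumulator` classes that only mimic the access rules described above; nothing here enforces who is a worker and who is the driver.

```python
# Toy shared variables. Class names and methods are illustrative, not Spark's API.
class Broadcast:
    """Read-only wrapper: a closure carries the wrapper, not a copy of the data."""

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value


class Accumulator:
    def __init__(self, zero):
        self._total = zero

    def add(self, amount):
        # The only operation a worker is supposed to use.
        self._total += amount

    def value(self):
        # Meant to be called by the driver only.
        return self._total


lookup = Broadcast({"a": 1, "b": 2, "c": 3})
misses = Accumulator(0)


def task(partition):
    # Runs "on a worker": reads the broadcast table, adds to the accumulator.
    out = []
    for key in partition:
        if key in lookup.value:
            out.append(lookup.value[key])
        else:
            misses.add(1)
    return out


results = [task(p) for p in [["a", "x"], ["b", "c", "y"]]]
miss_count = misses.value()   # read at the "driver"
```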

What is Mesos?!
Spark is built on top of Mesos [16, 15], a “cluster operating system” that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster.

Internally, each RDD object implements the same simple interface, which consists of three operations:
- getPartitions: returns a list of partition IDs.
- getIterator(partition): iterates over a partition.
- getPreferredLocations(partition): used for task scheduling to achieve data locality.
Delay scheduling: Spark tries to send each task to one of its preferred locations.
If a node fails, its partitions are re-read from their parent datasets and eventually cached on other nodes.
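
The three-operation interface above can be sketched in Python as an abstract class with two implementations. Everything here is an assumption for illustration: the snake_case method names, `BlockFileRDD`, `MappedRDD`, and `assign_tasks` are hypothetical, and `assign_tasks` is a simple greedy placement that prefers local nodes, not the delay scheduling algorithm itself.

```python
from abc import ABC, abstractmethod


class RDDInterface(ABC):
    @abstractmethod
    def get_partitions(self): ...

    @abstractmethod
    def get_iterator(self, partition): ...

    @abstractmethod
    def get_preferred_locations(self, partition): ...


class BlockFileRDD(RDDInterface):
    """Hypothetical dataset backed by file blocks, each stored on some nodes."""

    def __init__(self, blocks):
        self._blocks = blocks   # list of (records, nodes_holding_the_block)

    def get_partitions(self):
        return list(range(len(self._blocks)))

    def get_iterator(self, partition):
        return iter(self._blocks[partition][0])

    def get_preferred_locations(self, partition):
        return list(self._blocks[partition][1])


class MappedRDD(RDDInterface):
    """Same partitions and locations as the parent; the iterator applies f."""

    def __init__(self, parent, f):
        self._parent, self._f = parent, f

    def get_partitions(self):
        return self._parent.get_partitions()

    def get_iterator(self, partition):
        return map(self._f, self._parent.get_iterator(partition))

    def get_preferred_locations(self, partition):
        return self._parent.get_preferred_locations(partition)


def assign_tasks(rdd, free_nodes):
    # Greedy placement: use a free node that already holds the data if there is one.
    placement = {}
    for p in rdd.get_partitions():
        local = [n for n in rdd.get_preferred_locations(p) if n in free_nodes]
        placement[p] = local[0] if local else sorted(free_nodes)[0]
    return placement


files = BlockFileRDD([(["a", "b"], ["node1", "node2"]), (["c"], ["node3"])])
upper = MappedRDD(files, str.upper)
placement = assign_tasks(upper, free_nodes={"node2", "node3", "node4"})
records = [list(upper.get_iterator(p)) for p in upper.get_partitions()]
```

Because a mapped dataset inherits its parent's preferred locations, the task for each partition can still be placed next to the underlying file block.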

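Broadcast variables and accumulators are implemented using classes with custom serialization formats:
Broadcast variable: the value is saved to a file in a shared file system, and the serialized form is just the path to that file. A worker fetches the value on first use and caches it locally.
Accumulator: the serialized form carries an ID and the zero value for its type. Each worker starts its own copy from zero and sends its updates back to the driver, which applies them to the global value.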
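
A rough sketch of both tricks in plain Python, assuming a temporary directory stands in for the shared file system. `FileBroadcast`, its `create`/`serialize`/`deserialize` methods, and `run_task` are hypothetical names for this note; JSON is used here only to keep the example self-contained.

```python
import json
import os
import tempfile


class FileBroadcast:
    _local_cache = {}   # stands in for a worker's local cache, keyed by file path

    def __init__(self, path):
        self.path = path

    @classmethod
    def create(cls, value, directory):
        # Driver side: save the value once to the (assumed shared) file system.
        fd, path = tempfile.mkstemp(dir=directory, suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(value, f)
        return cls(path)

    def serialize(self):
        # Custom serialized form: only the path travels with a closure.
        return json.dumps({"path": self.path})

    @classmethod
    def deserialize(cls, wire):
        return cls(json.loads(wire)["path"])

    @property
    def value(self):
        # Worker side: check the local cache before reading the file.
        if self.path not in self._local_cache:
            with open(self.path) as f:
                self._local_cache[self.path] = json.load(f)
        return self._local_cache[self.path]


def run_task(partition, zero):
    local = zero          # the worker's copy starts from the zero value
    for x in partition:
        local += x
    return local          # update sent back to the driver


with tempfile.TemporaryDirectory() as shared_dir:
    big = list(range(100000))
    wire = FileBroadcast.create(big, shared_dir).serialize()
    on_worker = FileBroadcast.deserialize(wire)
    first = on_worker.value    # read from the file, then cached
    second = on_worker.value   # served from the local cache

driver_total = 0
for part in [[1, 2], [3], [4, 5]]:
    driver_total += run_task(part, zero=0)   # the driver applies each task's update
```

The serialized broadcast stays small no matter how large the value is, because only the path is shipped.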

The Scala interpreter compiles a class for each line the user types, including a singleton object that holds the variables and functions defined on that line.
Previous lines are referenced via the static getInstance method of their class.
Spark changed this to output the compiled classes to a shared filesystem, where workers can load them, and to have each line's singleton reference the singleton objects of previous lines directly.

Distributed Shared Memory
- Fault tolerance: some DSM systems rely on checkpointing, while Spark relies on lineage. Lineage makes recovery cheaper.
- Lineage: only the lost partitions need to be recomputed, and that can be done in parallel on different nodes, without requiring the program to revert to a checkpoint. There is no overhead if no nodes fail.
Language Integration
- Unlike DryadLINQ, Spark allows RDDs to persist in memory across parallel operations. [What does DryadLINQ do again?]
- In addition, Spark enriches the language integration model by supporting shared variables (broadcast variables and accumulators), implemented using classes with custom serialized forms.

Future Work
Formally characterize the properties of RDDs and Spark’s other abstractions, and their suitability for various classes of applications and workloads.
**Enhance the RDD abstraction to allow programmers to trade between storage cost and re-construction cost.**
Define new operations to transform RDDs, including a “shuffle” operation that repartitions an RDD by a given key. Such an operation would allow us to implement group-bys and joins.
Provide higher-level interactive interfaces on top of the Spark interpreter, such as SQL and R [4] shells.
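
The proposed "shuffle" operation, and how a group-by could sit on top of it, can be sketched in plain Python. This is only one possible reading under assumptions: the hash-partitioning scheme and the `shuffle` and `group_by_key` helpers are hypothetical, since the notes do not say how the operation would be implemented.

```python
from collections import defaultdict


def shuffle(partitions, num_output, key=lambda pair: pair[0]):
    # Repartition (key, value) pairs so equal keys land in the same output partition.
    out = [[] for _ in range(num_output)]
    for part in partitions:
        for pair in part:
            out[hash(key(pair)) % num_output].append(pair)
    return out


def group_by_key(partitions, num_output):
    # After the shuffle, each output partition can group its keys independently.
    grouped = []
    for part in shuffle(partitions, num_output):
        groups = defaultdict(list)
        for k, v in part:
            groups[k].append(v)
        grouped.append(dict(groups))
    return grouped


data = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
result = group_by_key(data, num_output=2)
merged = {k: v for part in result for k, v in part.items()}
```

Which output partition a key lands in depends on the hash, but every key ends up in exactly one of them, which is what a group-by or join needs.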