Spark Terminology
- Application
User program built on Spark (e.g. a Scala object written in IDEA).
Consists of one driver program and multiple executors on the cluster.
- Application jar
A jar containing the user's Spark application.
In some cases users will want to create an "uber jar" containing their application along with its dependencies.
The user's jar should never include Hadoop or Spark libraries; these will be added at runtime.
- Driver program
The process that runs the main() function of the application and creates the SparkContext.
In other words, the program whose main() method creates the SparkContext is the driver program.
- Cluster manager
An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Benefit: during development you do not need to care where the code will run.
The application code is the same no matter which mode it runs in.
- Deploy mode
Distinguishes where the driver process runs.
In "cluster" mode, the framework launches the driver inside the cluster.
In "client" mode, the submitter launches the driver outside the cluster.
- Worker node
Any node that can run application code in the cluster.
- Executor
A process launched for an application on a worker node (e.g. a node running a YARN NodeManager) that runs tasks and keeps data in memory or disk storage across them.
On YARN the executor runs inside a container and executes tasks such as map or filter.
Each application has its own executors.
- Task
A unit of work that will be sent to one executor.
- Job
A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).
Put simply, each call to an action (such as collect) triggers one job.
- Stage
Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce).
One job is split into one or more stages; each shuffle starts a new stage.
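
The relationship between job, stage and task can be sketched with a small toy model. This is plain Python and does not call Spark. The names `WIDE_OPS`, `Stage`, `split_into_stages` and `count_jobs` are made up for illustration, and the set of operators treated as shuffles is an assumption. It also assumes one task per partition in every stage and places the shuffling operator at the start of the new stage, which simplifies what a real scheduler does.

```python
from dataclasses import dataclass

# Assumption: operators treated as "wide" (shuffle-producing) in this toy model.
# The names mirror common RDD operators, but nothing here calls Spark.
WIDE_OPS = {"reduceByKey", "groupByKey", "repartition"}


@dataclass
class Stage:
    ops: list        # transformations pipelined together inside this stage
    num_tasks: int   # simplification: one task per partition


def split_into_stages(ops, num_partitions):
    """Split a chain of transformations into stages at each shuffle."""
    stages = [Stage(ops=[], num_tasks=num_partitions)]
    for op in ops:
        if op in WIDE_OPS:
            # A shuffle closes the current stage and opens a new one.
            stages.append(Stage(ops=[], num_tasks=num_partitions))
        stages[-1].ops.append(op)
    return stages


def count_jobs(calls, actions=("collect", "count", "save")):
    """Each action call triggers one job; transformations trigger none."""
    return sum(1 for call in calls if call in actions)


lineage = ["map", "filter", "reduceByKey", "map"]
stages = split_into_stages(lineage, num_partitions=4)
for i, stage in enumerate(stages):
    print(f"stage {i}: ops={stage.ops}, tasks={stage.num_tasks}")
```

With one shuffle in the lineage the model yields two stages, and with 4 partitions each stage runs 4 tasks. Calling `count_jobs(["map", "collect", "filter", "count"])` returns 2, since only the two actions trigger jobs.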