How Spark Works: Architecture and Terminology

2019-05-12  喵星人ZC

1. Glossary

The following table summarizes terms you'll see used to refer to cluster concepts:

| Term | Meaning |
| --- | --- |
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
| Application jar | A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime. |
| Driver program | The process running the main() function of the application and creating the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). |
| Deploy mode | Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster. |
| Worker node | Any node that can run application code in the cluster. |
| Executor | A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. |
| Task | A unit of work that will be sent to one executor. |
| Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. |

Summary of terms:

The relationship between Spark's Job, Stage, Task, and Partition:

A Job consists of one or more Stages.
A Stage consists of multiple Tasks.
The number of tasks in a Stage equals the number of partitions of the last RDD in that Stage.

1 Job --> N Stages --> N Tasks
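The hierarchy above can be sketched as a small model. This is plain Python, not the Spark API; the `Stage` and `Job` classes here are illustrative, not Spark's internal classes:

```python
# Conceptual model of the Job -> Stage -> Task hierarchy.
# Class names are illustrative only, not Spark internals.

class Stage:
    def __init__(self, last_rdd_partitions):
        # The task count of a stage equals the partition count
        # of the last RDD in that stage.
        self.num_tasks = last_rdd_partitions

class Job:
    def __init__(self, stages):
        self.stages = stages  # one or more stages per job

    def total_tasks(self):
        return sum(s.num_tasks for s in self.stages)

# A job whose two stages end in RDDs with 4 and 2 partitions:
job = Job([Stage(4), Stage(2)])
print(job.total_tasks())  # -> 6
```

The point of the model is only the counting rule: adding a shuffle boundary adds a stage, and repartitioning the final RDD of a stage changes that stage's task count.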

2. Spark Runtime Architecture


[Figure: Spark execution flow diagram]

There are several useful things to note about this architecture:

[Figure: detailed internal flow diagram]

1. The client submits the Spark application via spark-submit. Depending on the deploy mode parameter, the SparkContext (i.e. the Spark runtime environment, living in the driver) is initialized in the corresponding location. The SparkContext creates the DAG Scheduler and the Task Scheduler. As the driver runs the application code, it splits the program into multiple jobs, one per action operator. Within each job a DAG is built; the DAG Scheduler divides the DAG into multiple stages, and each stage into multiple tasks. The DAG Scheduler hands the TaskSets to the Task Scheduler, which is responsible for scheduling tasks on the cluster.
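For reference, a typical spark-submit invocation: the `--deploy-mode` flag decides where the driver (and hence the SparkContext) is launched. The class name, jar name, and resource sizes below are placeholder values:

```shell
# Submit to YARN; "cluster" mode launches the driver inside the cluster,
# "client" mode would launch it on the submitting machine.
# Class, jar, and resource sizes are example values.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my-app.jar
```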

2. Based on the resource requirements in the SparkContext, the driver requests resources from the resource manager, including the number of executors and their memory.

3. Upon receiving the request, the resource manager launches executor processes on worker nodes (NodeManagers, in the YARN case) that satisfy the resource requirements.

4. Once created, each executor registers back with the driver, so that the driver can assign tasks to it.

5. When the program finishes, the driver releases the requested resources back to the resource manager.
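Steps 3-5 can be illustrated with a toy simulation. Again this is plain Python, not Spark's actual RPC protocol; the `Driver` and `Executor` classes and the round-robin assignment are simplifications for illustration:

```python
# Toy model of executor registration and task dispatch.
# Mimics the flow of steps 3-5; not Spark's real scheduling logic.

class Executor:
    def run(self, task):
        return task()  # execute one task and return its result

class Driver:
    def __init__(self):
        self.executors = []

    def register(self, executor):
        # Step 4: executors register back with the driver.
        self.executors.append(executor)

    def run_job(self, tasks):
        # Simplified round-robin assignment of tasks to executors.
        results = []
        for i, task in enumerate(tasks):
            executor = self.executors[i % len(self.executors)]
            results.append(executor.run(task))
        return results

driver = Driver()
for _ in range(2):            # step 3: cluster manager starts executors
    driver.register(Executor())

# Four tasks that each square a number:
tasks = [lambda n=n: n * n for n in range(4)]
print(driver.run_job(tasks))  # -> [0, 1, 4, 9]
```

Note the inversion the registration step enables: the driver never searches for executors; it simply dispatches tasks to whatever has registered with it.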
