Spark core concepts
2019-08-16
shone_shawn
Application: a Spark-based application = 1 driver + a set of executors
User program built on Spark.
Consists of a driver program and executors on the cluster.
Examples: spark0402.py, or the pyspark / spark-shell interactive shells
Driver program
The process running the main() function of the application
and creating the SparkContext
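As a reference, a minimal sketch of such a driver program, in the spirit of the spark0402.py mentioned above (the app name and data below are only placeholders):

# minimal PySpark driver sketch; app name and data are placeholders
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("spark0402")
    sc = SparkContext(conf=conf)              # main() creates the SparkContext (the driver)
    print(sc.parallelize([1, 2, 3]).count())  # the actual work is executed by the executors
    sc.stop()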
Cluster manager
An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
Specified via spark-submit --master: local[2], spark://hadoop000:7077, or yarn
Deploy mode
Distinguishes where the driver process runs.
In "cluster" mode, the framework launches the driver inside of the cluster (on YARN, the driver runs inside the ApplicationMaster).
In "client" mode, the submitter launches the driver outside of the cluster, i.e. on the node where the job is submitted.
Worker node
Any node that can run application code in the cluster
standalone mode: the slave nodes listed in the slaves configuration file
YARN mode: the NodeManager nodes
Executor
A process launched for an application on a worker node, that
runs tasks
keeps data in memory or disk storage across them
Each application has its own executors.
Resources are requested through the cluster manager; the master can be yarn, standalone, local, etc., and the deploy mode can be client (driver on the submitting node) or cluster.
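For example, when submitting to YARN the executor resources can be sized with the standard spark-submit flags (the values below are only illustrative):

spark-submit --master yarn --deploy-mode cluster \
  --num-executors 2 --executor-memory 1g --executor-cores 1 \
  spark0402.py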
Task
A unit of work that will be sent to one executor
Tasks originate from the driver and are sent over the network to executors for execution.
Job
A parallel computation consisting of multiple tasks that
gets spawned in response to a Spark action (e.g. save, collect);
you'll see this term used in the driver's logs.
Transformations are lazy: nothing runs on the cluster until an action is encountered, and each action triggers exactly one job.
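A small PySpark illustration of this laziness, runnable in the pyspark shell where sc is predefined (the numbers are arbitrary):

rdd = sc.parallelize([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)          # transformation: nothing runs yet
total = doubled.reduce(lambda a, b: a + b)  # action: triggers exactly one job on the cluster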
Stage
Each job gets divided into smaller sets of tasks called stages
that depend on each other
(similar to the map and reduce stages in MapReduce);
you'll see this term used in the driver's logs.
A stage's boundary typically starts where the data is fetched and ends at a shuffle.
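For example, in a word count run from the pyspark shell (the input path is a placeholder), the shuffle introduced by reduceByKey splits the job into two stages:

lines = sc.textFile("file:///tmp/input.txt")                         # placeholder path
pairs = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1))  # stage 1 (map side)
counts = pairs.reduceByKey(lambda a, b: a + b)                       # shuffle boundary -> stage 2
counts.collect()   # one action -> one job with two stages, visible in the driver logs / web UI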
To summarize: an application consists of one driver and multiple executors. The executors run on worker nodes and execute tasks that the driver sends to them. Those tasks are created when a job is triggered, and a job is triggered by an action. A job is split into sets of tasks called stages; a task is the smallest unit of execution. At run time you choose the cluster manager (local, standalone, YARN, ...) and the deploy mode (client or cluster).
In short: an action triggers a job, a job contains one or more stages, each stage contains a set of tasks, those tasks run inside executors, and the executors run on worker nodes.
Spark Cache
rdd.cache(): controlled by a StorageLevel
cache, like a transformation, is lazy: no job is submitted to Spark until an action is encountered
If an RDD may be reused in later computations, caching it is recommended.
cache() internally calls persist(), passing StorageLevel.MEMORY_ONLY
cache = persist(StorageLevel.MEMORY_ONLY)
unpersist: takes effect immediately (eager)
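A short pyspark-shell sketch of the behaviour above (the input path is a placeholder):

from pyspark import StorageLevel

rdd = sc.textFile("file:///tmp/input.txt")
rdd.cache()                             # lazy; same as rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()                             # first action materializes and caches the RDD
rdd.map(lambda line: len(line)).sum()   # reuses the cached data
rdd.unpersist()                         # eager: frees the cached blocks immediately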
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD
Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD; this requires a shuffle
Example (word count shuffle):
  map output: (hello,1) (hello,1) (world,1) (hello,1) (world,1)
  after shuffle: hello -> (1,1,1)    world -> (1,1)
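In code, the same word count shows both kinds of dependency (pyspark shell):

pairs = sc.parallelize(["hello", "hello", "world", "hello", "world"]) \
          .map(lambda w: (w, 1))                 # narrow dependency: no shuffle
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide dependency: shuffle
counts.collect()   # [('hello', 3), ('world', 2)] (order may vary)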