Spark documentation notes

2018-08-27  大大大大大大大熊

Receiver reliability

While the received data is being backed up (replicated), if a failure occurs, no ACK is sent for the data still in the buffer, so the source will resend it afterwards.
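This ACK-and-resend behavior can be sketched as a tiny simulation (plain Python, not the Spark receiver API; the `Source` and `Receiver` classes here are hypothetical): the source keeps each batch buffered until the receiver acknowledges it, so a batch lost to a mid-backup failure is simply sent again.

```python
# Minimal simulation of reliable-receiver semantics (not the Spark API):
# the source keeps data buffered until it receives an ACK, so a batch
# that fails before being ACKed is re-sent rather than lost.

class Source:
    def __init__(self, batches):
        self.pending = list(batches)    # buffered until ACKed

    def send(self):
        return self.pending[0] if self.pending else None

    def ack(self):
        self.pending.pop(0)             # safe to drop only after ACK

class Receiver:
    def __init__(self, fail_once=False):
        self.stored = []
        self.fail_once = fail_once

    def receive(self, batch):
        if self.fail_once:              # crash before ACK: no data lost,
            self.fail_once = False      # the source still has it buffered
            return False
        self.stored.append(batch)       # replicate/store, then ACK
        return True

source = Source(["batch-1", "batch-2"])
receiver = Receiver(fail_once=True)

while (batch := source.send()) is not None:
    if receiver.receive(batch):
        source.ack()                    # ACK only after a successful store

print(receiver.stored)                  # both batches arrive exactly once
```

Note the ordering: the receiver stores (replicates) the data first and only then ACKs; acknowledging before the store completes is exactly what would make a failure lose the buffered batch.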

repartition(numPartitions)    # Changes the level of parallelism in this DStream by creating more or fewer partitions.
reduceByKey(func, [numTasks]) # Uses the default parallelism from spark.default.parallelism; the optional numTasks argument sets the number of reduce tasks explicitly.
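The role of numTasks can be sketched in plain Python (this is a simulation of the idea, not the Spark API): keys are hash-partitioned into numTasks buckets, and each bucket can then be reduced by an independent task.

```python
from collections import defaultdict

# Plain-Python sketch (not the Spark API) of how reduceByKey(func, numTasks)
# distributes work: keys are hash-partitioned into num_tasks buckets, and
# each bucket is reduced independently (in Spark, one task per partition).

def reduce_by_key(pairs, func, num_tasks):
    partitions = [defaultdict(list) for _ in range(num_tasks)]
    for key, value in pairs:
        partitions[hash(key) % num_tasks][key].append(value)

    result = {}
    for part in partitions:             # each partition = one reduce task
        for key, values in part.items():
            acc = values[0]
            for v in values[1:]:
                acc = func(acc, v)      # fold the values with func
            result[key] = acc
    return result

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(reduce_by_key(pairs, lambda x, y: x + y, num_tasks=4))
# {'a': 4, 'b': 6} (key order may vary)
```

A larger num_tasks gives more, smaller buckets (more parallelism per stage); the values computed per key are unchanged.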

Note: when running in local mode, do not set the master to local or local[1]. That allocates only one CPU for tasks, and if a receiver is running on it there is no resource left to process the received data. Use at least local[2] so there are cores for both receiving and processing.


Writing the WordCount results to a file:

wordCounts.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // coalesce(1) merges all partitions so a single part file is written
        rdd.coalesce(1).saveAsTextFile("E:\\SS_result"); // output directory
    }
});
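Why coalesce(1) yields a single part file can be sketched in plain Python (a simulation of the idea, not the Spark API; the helper names here are hypothetical): saveAsTextFile writes one part-NNNNN file per partition, so merging all partitions into one first produces exactly one output file.

```python
import os
import tempfile

# Plain-Python sketch (not the Spark API): saveAsTextFile writes one
# part-NNNNN file per partition, so coalescing to 1 partition first
# leaves exactly one part-00000 in the output directory.

def coalesce(partitions, n):
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)      # merge partitions, no reshuffle
    return merged

def save_as_text_file(partitions, path):
    os.makedirs(path)
    for i, part in enumerate(partitions):
        with open(os.path.join(path, f"part-{i:05d}"), "w") as f:
            f.writelines(line + "\n" for line in part)

counts = [["(hello,2)"], ["(spark,1)"], ["(world,1)"]]  # 3 partitions
out = os.path.join(tempfile.mkdtemp(), "SS_result")
save_as_text_file(coalesce(counts, 1), out)
print(sorted(os.listdir(out)))          # only part-00000 is written
```

Without the coalesce(1), each of the three partitions would produce its own part file (part-00000, part-00001, part-00002).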

Contents of the output directory:

[Screenshot: the SS_result folder]

Opening part-00000:

[Screenshot: the part-00000 file]

Using spark-submit

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster (--deploy-mode can also be client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

Some commonly used options:

--class            your application's entry point (e.g. org.apache.spark.examples.SparkPi)
--master           the master URL for the cluster (local[n], spark://host:port, yarn, mesos://host:port)
--deploy-mode      whether to launch the driver locally (client, the default) or on a worker node (cluster)
--conf             arbitrary Spark configuration in key=value format
--executor-memory  memory per executor (e.g. 20G)
--num-executors / --total-executor-cores   resource sizing on YARN / on standalone and Mesos
