Apache Spark 简介

2017-07-13 · 旺达丨

What is Spark

Apache Spark 2.2.0 is a fast and general engine for large-scale data processing.

How fast is Spark

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

(Figure: logistic regression in Hadoop and Spark)

Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
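In-memory computing matters most for iterative algorithms such as the logistic regression benchmark above: Spark loads the dataset once, caches it, and reuses it across iterations, whereas MapReduce re-reads the input from disk on every pass. A rough pure-Python illustration of the two access patterns (a conceptual sketch, not Spark code; the loader just counts simulated disk reads):

```python
def make_loader(counter):
    """Simulated disk read: increments counter each time it is called."""
    def load():
        counter["reads"] += 1
        return list(range(1000))
    return load

iterations = 10

# MapReduce-style: re-read the input on every iteration.
mr_counter = {"reads": 0}
load = make_loader(mr_counter)
for _ in range(iterations):
    total = sum(load())

# Spark-style: load once, cache in memory, iterate over the cached copy.
spark_counter = {"reads": 0}
cached = make_loader(spark_counter)()
for _ in range(iterations):
    total = sum(cached)
```

Ten iterations cost ten simulated reads in the first loop but only one in the second; for a real iterative workload that difference is where the "up to 100x" claim comes from.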

Why is Spark fast

Deployment architecture

(Figure: cluster-overview)

Locating these objects in code


package org.apache.spark.examples

import scala.math.random

import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _) 
    println("Pi is roughly " + 4.0 * count / (n - 1))
    spark.stop()
  }
}
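The driver builds a SparkSession, and `parallelize` splits the sampling across `slices` partitions; the estimate itself is plain Monte Carlo: sample points uniformly in the unit square and count how many fall inside the unit circle. A minimal pure-Python analogue of the computation (without the cluster distribution; the seed is an added assumption for reproducibility):

```python
import random

def estimate_pi(n, seed=42):
    """Monte Carlo pi: the fraction of random points in the unit square
    that land inside the unit circle approximates pi/4."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y < 1:
            count += 1
    return 4.0 * count / n

print(estimate_pi(200_000))
```

In the Spark version each partition runs this sampling loop on an executor and `reduce(_ + _)` sums the per-partition counts back on the driver.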

Object architecture in Standalone mode

(Figure: cluster-Standalone)

Job submission process

(Figure: job submission process)

Deployment architecture on YARN

(Figure: Spark-Architecture-On-YARN)

Computation model

Case study: GroupByTest

package org.apache.spark.examples

import java.util.Random

import org.apache.spark.sql.SparkSession

/**
 * Usage: GroupByTest [numMappers] [numKVPairs] [KeySize] [numReducers]
 *   bin/run-example GroupByTest 100 10000 1000 36
 */
object GroupByTest {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("GroupBy Test")
      .getOrCreate()

    val numMappers = if (args.length > 0) args(0).toInt else 2
    val numKVPairs = if (args.length > 1) args(1).toInt else 1000
    val valSize = if (args.length > 2) args(2).toInt else 1000
    val numReducers = if (args.length > 3) args(3).toInt else numMappers

    val pairs1 = spark.sparkContext.parallelize(0 until numMappers, numMappers).flatMap { p =>
      val ranGen = new Random
      val arr1 = new Array[(Int, Array[Byte])](numKVPairs)
      for (i <- 0 until numKVPairs) {
        val byteArr = new Array[Byte](valSize)
        ranGen.nextBytes(byteArr)
        arr1(i) = (ranGen.nextInt(Int.MaxValue), byteArr)
      }
      arr1
    }.cache()
    // Enforce that everything has been calculated and in cache
    pairs1.count()

    println(pairs1.groupByKey(numReducers).count())

    spark.stop()
  }
}
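GroupByTest builds `numMappers` partitions of random `(Int, Array[Byte])` pairs, caches them, then shuffles them into `numReducers` groups with `groupByKey`. The shuffle step can be sketched in plain Python (a conceptual analogue, not Spark's actual implementation): each map-side record is routed to a reducer partition by hashing its key, and each reducer then groups the values it received by key.

```python
from collections import defaultdict

def group_by_key(pairs, num_reducers):
    """Conceptual shuffle: route each (key, value) to a reducer
    partition by key hash, then group values by key on each reducer."""
    # Map side: bucket records by target reducer partition.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))
    # Reduce side: each reducer groups its own records by key.
    result = []
    for bucket in buckets:
        groups = defaultdict(list)
        for key, value in bucket:
            groups[key].append(value)
        result.extend(groups.items())
    return result

pairs = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
grouped = dict(group_by_key(pairs, num_reducers=2))
```

Because every map-side partition may hold records for every reducer, this step moves data across all executors, which is why `groupByKey` is the expensive operation in this job.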

Data flow

(Figure: group by)

Logical execution graph

The logical execution graph describes a job's data flow: which transformation()s the job goes through, which intermediate RDDs are produced, and the dependencies between those RDDs.

(Figure: job logical execution graph)
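For the GroupByTest job above, the logical graph can be written down as a small dependency list (a hand-built sketch; Spark derives this lineage automatically from the transformations, and the RDD names follow Spark's internal class names):

```python
# Hand-built sketch of the GroupByTest job's logical execution graph.
# Each entry: RDD name -> (parent RDD, dependency kind).
# "narrow": each child partition depends on a fixed, small set of parent
# partitions; "wide": the child needs data from all parent partitions,
# so a shuffle is required.
lineage = {
    "ParallelCollectionRDD": (None, None),                         # parallelize
    "MapPartitionsRDD":      ("ParallelCollectionRDD", "narrow"),  # flatMap
    "ShuffledRDD":           ("MapPartitionsRDD", "wide"),         # groupByKey
}

def wide_dependencies(lineage):
    """Return the (parent, child) edges where a shuffle occurs."""
    return [(parent, child) for child, (parent, kind) in lineage.items()
            if kind == "wide"]
```

The single wide dependency here, `flatMap` output into `groupByKey`, is exactly where the data flow in the figure fans out across partitions.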

Physical execution DAG

(Figure: physical execution DAG)
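The physical plan is derived from the logical graph by cutting it at wide (shuffle) dependencies: every shuffle boundary ends one stage and starts the next, while narrow dependencies are pipelined inside a single stage. A minimal sketch of that splitting rule for a linear lineage (an assumed simplification of what Spark's DAGScheduler does for this job):

```python
def split_stages(chain):
    """Split a linear chain of (rdd_name, dep_kind_to_parent) entries
    into stages, cutting at every wide (shuffle) dependency."""
    stages, current = [], []
    for name, kind in chain:
        if kind == "wide" and current:
            stages.append(current)   # shuffle boundary: close the stage
            current = []
        current.append(name)
    stages.append(current)
    return stages

chain = [("ParallelCollectionRDD", None),
         ("MapPartitionsRDD", "narrow"),   # flatMap: pipelined with parent
         ("ShuffledRDD", "wide")]          # groupByKey: starts a new stage
```

For GroupByTest this yields two stages: a map stage covering `parallelize` and `flatMap`, and a reduce stage for `groupByKey`, matching the two-phase data flow in the figures above.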
