Compiling Spark 2.4.0 from Source
2019-03-17
井地儿
Compiling Spark from Source
Downloading the source
Download the latest Spark source from GitHub:
https://github.com/apache/spark
Apache Maven (building with Maven)
The Maven-based build has the following requirements:
Maven: 3.5.4+
Java: Java 8+
Setting Maven's memory usage
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
Without this setting, the build may fail with an error like:
[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.11/classes...
[ERROR] Java heap space -> [Help 1]
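To avoid re-exporting this in every new terminal, the setting can be persisted in your shell startup file. A minimal sketch, assuming zsh (bash users would use ~/.bashrc or ~/.bash_profile instead):

```shell
# Persist the Maven memory settings so future shells pick them up.
# ~/.zshrc is an assumption; adjust the file for your shell.
echo 'export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"' >> ~/.zshrc
source ~/.zshrc   # reload so the current shell sees the new value
```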
build/mvn
Spark ships a self-contained Maven build script, build/mvn, which automatically downloads and installs the Maven, Scala, and Zinc versions the build requires.
Build command
./build/mvn -DskipTests clean package
On macOS, if you have switched your shell from bash to zsh without configuring the JAVA_HOME environment variable in .zshrc, the build may fail with:
Cannot run program "/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/jre/bin/javac": error=2, No such file or directory
Setting JAVA_HOME in the ~/.zshrc configuration file fixes this.
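A minimal ~/.zshrc entry might look like the following sketch. On macOS, /usr/libexec/java_home resolves the JDK path for a given version; on other systems, set the path to your JDK directly:

```shell
# ~/.zshrc — point JAVA_HOME at a JDK 8 installation.
# /usr/libexec/java_home is macOS-specific; on Linux, hard-code the JDK path.
export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
export PATH="$JAVA_HOME/bin:$PATH"
```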
building...
stefan@localhost ~/Documents/workspace/code/spark master ./build/mvn -DskipTests clean package
Using `mvn` from path: /Users/stefan/Documents/workspace/code/spark/build/apache-maven-3.6.0/bin/mvn
[INFO] Scanning for projects...
...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 4.010 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 7.204 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 6.099 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 3.870 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 8.308 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 3.860 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 6.418 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 5.159 s]
[INFO] Spark Project Core ................................. SUCCESS [02:01 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 5.823 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 8.543 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 21.891 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:15 min]
[INFO] Spark Project SQL .................................. SUCCESS [02:28 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:13 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 1.534 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 56.505 s]
[INFO] Spark Project REPL ................................. SUCCESS [ 5.497 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 4.034 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 6.713 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 2.156 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 9.314 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 14.136 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 3.357 s]
[INFO] Spark Avro ......................................... SUCCESS [ 5.773 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10:09 min
[INFO] Finished at: 2019-03-17T11:11:29+08:00
[INFO] ------------------------------------------------------------------------
Building a Runnable Distribution
Spark provides a script for building binary distributions: ./dev/make-distribution.sh
The meaning of each option can be displayed with:
./dev/make-distribution.sh --help
✘ stefan@localhost ~/Documents/workspace/code/spark master ./dev/make-distribution.sh --help
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/didi/Documents/workspace/code/spark
+ DISTDIR=/Users/didi/Documents/workspace/code/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/didi/Documents/workspace/code/spark/build/mvn
+ (( 1 ))
+ case $1 in
+ exit_with_usage
+ echo 'make-distribution.sh - tool for making binary distributions of Spark'
make-distribution.sh - tool for making binary distributions of Spark
+ echo ''
+ echo usage:
usage:
+ cl_options='[--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>]'
+ echo 'make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>] <maven build options>'
make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>] <maven build options>
+ echo 'See Spark'\''s "Building Spark" doc for correct Maven options.'
See Spark's "Building Spark" doc for correct Maven options.
+ echo ''
+ exit 1
Build command
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
The command above builds the Spark distribution package along with the Python pip and R packages. Make sure R is installed locally before running it.
Specifying the Hadoop Version and Enabling YARN
The Hadoop version to build against can be set with the hadoop.version property; if unspecified, Spark builds against Hadoop 2.6.x by default.
Build commands
# Apache Hadoop 2.6.X
./build/mvn -Pyarn -DskipTests clean package
# Apache Hadoop 2.7.X and later
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.7 -DskipTests clean package
building...
✘ stefan@localhost ~/Documents/workspace/code/spark master ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.7 -DskipTests clean package
Using `mvn` from path: /Users/didi/Documents/workspace/code/spark/build/apache-maven-3.6.0/bin/mvn
[WARNING]
[WARNING] Some problems were encountered while building the effective toolchains
[WARNING] expected START_TAG or END_TAG not TEXT (position: TEXT seen ...</toolchain>\n \n -->z\n\n</... @103:3) @ line 103, column 3
[WARNING]
[INFO] Scanning for projects...
...
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 4.564 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 8.780 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 6.256 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 5.063 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 8.652 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 4.215 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 7.210 s]
[INFO] Spark Project Launcher ............................. SUCCESS [01:07 min]
[INFO] Spark Project Core ................................. SUCCESS [02:13 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 6.008 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 8.864 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 22.931 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:35 min]
[INFO] Spark Project SQL .................................. SUCCESS [02:23 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:17 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 0.616 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:09 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 7.165 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 9.303 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 24.783 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 3.523 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 7.028 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 1.989 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 9.736 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 14.508 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 3.328 s]
[INFO] Spark Avro ......................................... SUCCESS [ 7.217 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:39 min
[INFO] Finished at: 2019-03-17T13:17:52+08:00
[INFO] ------------------------------------------------------------------------
Building With Hive and JDBC Support
To build Spark SQL with Hive integration and the JDBC/Thrift server, add the hive and hive-thriftserver profiles; if not specified otherwise, Spark builds against Hive 1.2.1 by default.
Build command
# With Hive 1.2.1 support
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
building...
stefan@localhost ~/Documents/workspace/code/spark master ./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
Using `mvn` from path: /Users/didi/Documents/workspace/code/spark/build/apache-maven-3.6.0/bin/mvn
[WARNING]
[WARNING] Some problems were encountered while building the effective toolchains
[WARNING] expected START_TAG or END_TAG not TEXT (position: TEXT seen ...</toolchain>\n \n -->z\n\n</... @103:3) @ line 103, column 3
[WARNING]
[INFO] Scanning for projects...
...
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 4.719 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 8.717 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 6.270 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 3.983 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 7.893 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 4.385 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 6.898 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 5.493 s]
[INFO] Spark Project Core ................................. SUCCESS [02:13 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 10.281 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 10.138 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 26.678 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [02:23 min]
[INFO] Spark Project SQL .................................. SUCCESS [04:46 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:21 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 1.319 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:04 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 5.929 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 6.662 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 21.103 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [ 21.623 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 3.794 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 6.660 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 2.034 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 8.895 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 14.781 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 3.565 s]
[INFO] Spark Avro ......................................... SUCCESS [ 5.989 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 15:08 min
[INFO] Finished at: 2019-03-17T13:46:54+08:00
[INFO] ------------------------------------------------------------------------
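With the thrift server compiled in, the JDBC side can be exercised using the bundled scripts. A sketch assuming the default port 10000 and a local server; host and port are assumptions about your setup:

```shell
# Start the Thrift JDBC/ODBC server produced by the -Phive-thriftserver build.
./sbin/start-thriftserver.sh
# Connect to it with the bundled beeline client over JDBC.
./bin/beeline -u jdbc:hive2://localhost:10000
```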
Packaging without Hadoop Dependencies for YARN
Building with the hadoop-provided profile excludes Hadoop's dependencies from the packaged artifacts.
Build command
./build/mvn -Dhadoop-provided -DskipTests clean package
building...
stefan@localhost ~/Documents/workspace/code/spark master ./build/mvn -Dhadoop-provided -DskipTests clean package
Using `mvn` from path: /Users/didi/Documents/workspace/code/spark/build/apache-maven-3.6.0/bin/mvn
[WARNING]
[WARNING] Some problems were encountered while building the effective toolchains
[WARNING] expected START_TAG or END_TAG not TEXT (position: TEXT seen ...</toolchain>\n \n -->z\n\n</... @103:3) @ line 103, column 3
[WARNING]
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
...
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 5.056 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 8.136 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 5.885 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 4.064 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 13.564 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 6.083 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 16.586 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 6.701 s]
[INFO] Spark Project Core ................................. SUCCESS [02:19 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 7.171 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 9.424 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 32.804 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:31 min]
[INFO] Spark Project SQL .................................. SUCCESS [02:52 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:41 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 0.879 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:14 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 4.553 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 4.331 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 10.777 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 2.870 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 26.260 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 25.948 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 4.794 s]
[INFO] Spark Avro ......................................... SUCCESS [ 8.309 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13:03 min
[INFO] Finished at: 2019-03-17T14:13:57+08:00
[INFO] ------------------------------------------------------------------------
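A package built with the hadoop-provided profile needs Hadoop's classes supplied at runtime. One common approach is to point SPARK_DIST_CLASSPATH at an existing Hadoop installation in conf/spark-env.sh; this sketch assumes the target machine has a hadoop CLI on its PATH:

```shell
# conf/spark-env.sh — supply Hadoop classes to a "hadoop-provided" build.
# `hadoop classpath` prints the jar and conf paths of the local Hadoop install.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```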
This covers the most common build configurations.
Verifying the build
Start spark-shell:
✘ didi@localhost ~/Documents/workspace/code/spark master ./bin/spark-shell
19/03/17 13:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://bogon:4040
Spark context available as 'sc' (master = local[*], app id = local-1552805589600).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT
/_/
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
The spark-shell REPL opens successfully, which confirms the build works.
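A quick sanity check inside the shell exercises the runtime end to end; a trivial job summing 1 through 100 should return 5050:

```scala
scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050
```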
In a follow-up post we will cover developing and testing Spark source code locally.
Reference: http://spark.apache.org/docs/latest/building-spark.html