01 - PySpark Installation
2019-10-31
过桥
Problem downloading Spark on Linux
gzip: stdin: not in gzip format
[mongodb@mongodb02 software]$ sudo curl -O https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29675    0 29675    0     0  15922      0 --:--:--  0:00:01 --:--:-- 15920
[mongodb@mongodb02 software]$ ll
total 32
-rw-r--r--. 1 root root 29675 Oct 29 16:51 spark-2.4.4-bin-hadoop2.7.tgz
Note that the downloaded file is only 29675 bytes, far too small for a full Spark distribution; that is an early hint something went wrong. Attempting to extract it fails:
[mongodb@mongodb02 software]$ tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Troubleshooting step 1: confirm the file actually finished downloading.
Troubleshooting step 2: try extracting without the gzip flag.
[mongodb@mongodb02 software]$ tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Troubleshooting step 3: check the file type. It turns out the download URL pointed at a web page (the Apache mirror-selection page behind closer.lua), not at the file itself.
[mongodb@mongodb02 software]$ file spark-2.4.4-bin-hadoop2.7.tgz
spark-2.4.4-bin-hadoop2.7.tgz: HTML document, ASCII text, with very long lines
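One way to catch this kind of bad download programmatically, without relying on file(1), is to check the file's magic bytes: every gzip stream begins with 0x1f 0x8b, while an HTML error page begins with plain text. A small sketch (the filenames here are made up for illustration, not from the session above):

```python
# Check whether a downloaded file is really gzip data or an HTML page
# served in its place. A gzip stream always starts with bytes 0x1f 0x8b.
def looks_like_gzip(path):
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Simulate the failure mode above: a "tarball" that is actually HTML.
with open("fake.tgz", "wb") as f:
    f.write(b"<!DOCTYPE html><html><body>mirror list</body></html>")

print(looks_like_gzip("fake.tgz"))  # False: not a real gzip archive
```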
Re-download the Spark tarball from an actual mirror:
[mongodb@mongodb02 software]$ sudo curl -O http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
Verify the archive format:
[mongodb@mongodb02 software]$ file spark-2.4.4-bin-hadoop2.7.tgz
spark-2.4.4-bin-hadoop2.7.tgz: gzip compressed data, from Unix, last modified: Wed Aug 28 05:30:23 2019
Extract it:
[mongodb@mongodb02 software]$ sudo tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
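The same extraction can be done from Python with the standard library's tarfile module, which is handy when scripting an install. A minimal sketch using a tiny throwaway archive (the demo_pkg/demo.tgz names are invented for illustration; for the real tarball you would open spark-2.4.4-bin-hadoop2.7.tgz the same way):

```python
import os
import tarfile

# Build a tiny .tgz to stand in for the Spark tarball.
os.makedirs("demo_pkg", exist_ok=True)
with open("demo_pkg/hello.txt", "w") as f:
    f.write("hi")
with tarfile.open("demo.tgz", "w:gz") as tar:
    tar.add("demo_pkg")

# Extract it; this is the tarfile equivalent of `tar zxvf demo.tgz`.
with tarfile.open("demo.tgz", "r:gz") as tar:
    tar.extractall(path="out")

print(os.path.exists("out/demo_pkg/hello.txt"))  # True
```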
Testing
Start pyspark:
[mongodb@mongodb02 spark-2.4.4-bin-hadoop2.7]$ ./bin/pyspark
Python 2.7.5 (default, Oct 30 2018, 23:45:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
19/10/30 16:33:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>>
Run some test code:
>>> rdd = sc.parallelize([1,2,3,4,5])
>>> rdd.reduce(lambda x,y:x+y)
15
>>> rdd.map(lambda x:x+1)
PythonRDD[4] at RDD at PythonRDD.scala:53
>>> rdd.reduce(lambda x,y:x+y)
15
>>> rdd.map(lambda x:x+1).reduce(lambda x,y:x+y)
20
>>> exit()
[mongodb@mongodb02 spark-2.4.4-bin-hadoop2.7]$
As the output shows, until an action is executed, Spark does not actually run the transformations on the RDD; it only records the sequence of transformations, and applies them all at once when an action is executed. That is why rdd.map(...) returns a PythonRDD object immediately instead of computed values, while reduce(...) triggers the actual computation.
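The idea of recording transformations and deferring them until an action can be sketched in a few lines of plain Python. This toy LazyRDD class is only an illustration of the lazy-evaluation pattern, not Spark's actual implementation:

```python
# Toy illustration of lazy RDD evaluation (NOT Spark's implementation):
# transformations are recorded; only an action triggers computation.
class LazyRDD:
    def __init__(self, data, transforms=None):
        self.data = data
        # Pending map functions, recorded but not yet applied.
        self.transforms = transforms or []

    def map(self, fn):
        # Transformation: return a new LazyRDD; nothing is computed yet.
        return LazyRDD(self.data, self.transforms + [fn])

    def reduce(self, fn):
        # Action: apply all recorded transformations, then fold the values.
        values = self.data
        for t in self.transforms:
            values = [t(v) for v in values]
        result = values[0]
        for v in values[1:]:
            result = fn(result, v)
        return result

rdd = LazyRDD([1, 2, 3, 4, 5])
mapped = rdd.map(lambda x: x + 1)          # no computation happens here
print(rdd.reduce(lambda x, y: x + y))      # 15, mirrors the pyspark session
print(mapped.reduce(lambda x, y: x + y))   # 20
```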
Installing on Windows
Download the package and extract it to a directory, for example:
D:\S_Software\spark-2.4.4-bin-hadoop2.7
Add the bin directory to the system environment variables:
My Computer -> right-click -> Properties -> Advanced system settings
Path -> Edit -> Add
D:\S_Software\spark-2.4.4-bin-hadoop2.7\bin
Testing
Run -> cmd -> pyspark
Use the same test code as in the Linux section above.