01 - pySpark Installation

2019-10-31  过桥

Problem downloading Spark on Linux

gzip: stdin: not in gzip format

Download URL

[mongodb@mongodb02 software]$ sudo curl -O https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29675    0 29675    0     0  15922      0 --:--:--  0:00:01 --:--:-- 15920
[mongodb@mongodb02 software]$ ll
total 32
-rw-r--r--. 1 root root 29675 Oct 29 16:51 spark-2.4.4-bin-hadoop2.7.tgz

Extracting the file fails

[mongodb@mongodb02 software]$ tar zxvf spark-2.4.4-bin-hadoop2.7.tgz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Troubleshooting step 1: confirm the file actually downloaded. The listing above shows the file is only 29,675 bytes, far too small for a full Spark distribution, which is the first hint that something went wrong.

Troubleshooting step 2: try extracting without the gzip filter

[mongodb@mongodb02 software]$ tar -xvf spark-2.4.4-bin-hadoop2.7.tgz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Troubleshooting step 3: check the file format. It turns out the download URL points to an HTML page (the Apache mirror-selection page), not the archive itself.

[mongodb@mongodb02 software]$ file spark-2.4.4-bin-hadoop2.7.tgz
spark-2.4.4-bin-hadoop2.7.tgz: HTML document, ASCII text, with very long lines

Re-download the Spark archive

Download URL (direct mirror link this time)

[mongodb@mongodb02 software]$ sudo curl -O http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz

Check the archive format

[mongodb@mongodb02 software]$ file spark-2.4.4-bin-hadoop2.7.tgz
spark-2.4.4-bin-hadoop2.7.tgz: gzip compressed data, from Unix, last modified: Wed Aug 28 05:30:23 2019

Extract

[mongodb@mongodb02 software]$ sudo tar zxvf spark-2.4.4-bin-hadoop2.7.tgz

Test

Start pyspark

[mongodb@mongodb02 spark-2.4.4-bin-hadoop2.7]$ ./bin/pyspark
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
19/10/30 16:33:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>> 

Run some test code

>>> rdd = sc.parallelize([1,2,3,4,5])
>>> rdd.reduce(lambda x,y:x+y)
15                                                                              
>>> rdd.map(lambda x:x+1)
PythonRDD[4] at RDD at PythonRDD.scala:53
>>> rdd.reduce(lambda x,y:x+y)
15
>>> rdd.map(lambda x:x+1).reduce(lambda x,y:x+y)
20
>>> exit()
[mongodb@mongodb02 spark-2.4.4-bin-hadoop2.7]$ 

As the session above shows, if no action is executed, Spark does not actually apply the transformations to the RDD. It only records the sequence and method of each transformation (the lineage), and runs them all together when an action is finally executed. That is why rdd.map(lambda x:x+1) immediately returns a PythonRDD object instead of a computed result.

Transformations (for example map, filter, flatMap) are lazy and return a new RDD; actions (for example reduce, collect, count) trigger the actual computation and return a value to the driver.
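A minimal sketch of this lazy behavior, runnable inside the same pyspark shell (sc is the SparkContext the shell provides); toDebugString is an RDD method that prints the lineage recorded so far:

rdd = sc.parallelize([1, 2, 3, 4, 5])
mapped = rdd.map(lambda x: x + 1)         # transformation: only recorded, no job runs
print(mapped.toDebugString())             # inspect the recorded lineage, still no job
print(mapped.reduce(lambda x, y: x + y))  # action: triggers the computation, prints 20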

Windows Installation

Download the package and extract it to a suitable directory

D:\S_Software\spark-2.4.4-bin-hadoop2.7

Add the directory to the system environment variables

My Computer -> right-click Properties -> Advanced system settings

Path -> Edit -> add D:\S_Software\spark-2.4.4-bin-hadoop2.7\bin

Test

Run -> cmd -> pyspark

Use the test code from the Linux section above; a standalone version is sketched below.
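For a non-interactive check, a minimal standalone script (hypothetical file name verify_spark.py) can be run with spark-submit on either platform; unlike the shell, it has to build its own SparkSession:

# verify_spark.py: minimal installation check (hypothetical file name)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("install-check").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x + 1).reduce(lambda x, y: x + y))  # expected: 20

spark.stop()

Run it as: spark-submit verify_spark.py (this works once the bin directory is on Path).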
