Developing PySpark in PyCharm
2018-01-10
wangqiaoshi
Download the Spark package
Configure parameters
Configure Spark parameters
vim ${spark_dir}/conf/spark-env.sh
export SPARK_LOCAL_IP=$(ifconfig | grep -1a en0 | grep netmask | awk '{print $2}')
HADOOP_CONF_DIR=$SPARK_HOME/conf
vim ${spark_dir}/conf/spark-defaults.conf
spark.master local
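A quick way to confirm what spark-defaults.conf ends up setting is to read it back. The sketch below is a minimal, hypothetical helper (read_spark_defaults is not part of Spark) that assumes the simple whitespace-separated "key value" layout used above and skips blank lines and # comments. It does not try to cover every syntax Spark itself accepts.

```python
def read_spark_defaults(text):
    """Parse spark-defaults.conf contents into a dict (hypothetical helper)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace: key, then the rest as the value.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


# Example contents mirroring the single setting shown above.
sample = """
# local development
spark.master    local
"""
print(read_spark_defaults(sample))
```

In practice the text would come from reading the conf/spark-defaults.conf file under the Spark directory.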
Configure the system environment
vim ~/.bash_profile
SPARK_HOME=${spark_dir}
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_PYTHON=python
export SPARK_HOME
PyCharm reads .bash_profile, but when it runs code it overwrites PYTHONPATH.
So PYTHONPATH has to be set inside PyCharm itself.
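Preferences -> Project Interpreter -> Show All -> add the two PYTHONPATH entries above ($SPARK_HOME/python/ and the py4j zip) to the paths of the selected interpreter.
(Screenshots of the PyCharm dialogs: image.png)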
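As an alternative to configuring the paths in the IDE, the same two entries can be added from Python at the top of a script. This is only a sketch: the helper names spark_python_paths and add_spark_to_sys_path are hypothetical, and it assumes SPARK_HOME points at the unpacked Spark directory with the py4j zip under python/lib, as in the PYTHONPATH line above. It globs for the zip instead of hard-coding the py4j version, since that version differs between Spark releases.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the entries PySpark needs on sys.path (hypothetical helper)."""
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j version ships with this Spark download.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home=None):
    """Put the Spark Python paths at the front of sys.path; return what was added."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    added = []
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added
```

Calling add_spark_to_sys_path() before the first pyspark import should then make the import work even when the IDE has replaced PYTHONPATH.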
With this in place, Spark jobs can be developed locally.
from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession.builder.appName("PythonPi").getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
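
To see what the job computes without starting Spark, the same Monte Carlo estimate can be written with the built-in map and functools.reduce. This is a sketch for sanity-checking the logic only. The estimate_pi function and its seed parameter are assumptions added for illustration and are not part of the Spark example.

```python
from functools import reduce
from operator import add
from random import Random


def estimate_pi(n, seed=None):
    """Estimate pi from n random points in the square [-1, 1] x [-1, 1]."""
    rng = Random(seed)

    def inside(_):
        # 1 if the random point falls inside the unit circle, else 0.
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # Same map-then-reduce shape as the Spark job, but on a single machine.
    count = reduce(add, map(inside, range(1, n + 1)))
    return 4.0 * count / n


print("Pi is roughly %f" % estimate_pi(200000, seed=0))
```

The fraction of points inside the circle approximates pi / 4, which is why the count is multiplied by 4.0 and divided by n.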