WMware Spark集群搭建
准备工作
vmware + centos7安装略
hostname 设置
hostnamectl set-hostname "master" // 设置主机名字
hostnamectl status --transient // 查看临时主机名字
hostnamectl status --static // 查看主机名字
hostnamectl status // 查看主机目前基础信息
防火墙
查看防火墙
systemctl status firewalld.service // 查看防火墙状态
systemctl stop firewalld.service // 关闭防火墙
systemctl disable firewalld.service //永久停用防火墙
systemctl status firewalld.service // 再次查看防火墙状态
更新并安装必要的工具
测试是否联网ping www.baidu.com
如果可以上网那么更新yum并安装ifconfig,也可以直接使用ip addr
sudo yum update
sudo yum install -y net-tools // 安装ifconfig 工具
sudo yum install -y vim
配置静态IP
使用ifconfig查看网卡是什么,排除掉lo,我这边是ens33(NAT方式)
首先在虚拟机添加一块bridge网卡,然后直接编辑这个网卡如果没有发现需要复制下ens33(最好先安装一下vmware-tools)
cp /etc/sysconfig/network-scripts/ifcfg-ens33 /etc/sysconfig/network-scripts/ifcfg-ens37
ens37配置
TYPE=Ethernet
iOOTPROTO=static
DEVICE=ens37
NAME=ens37
ONBOOT=yes
IPADDR="192.168.15.100"
NETMASK="255.255.255.0"
完成以后service network restart重启网络,如果显示Ok,那么就表示配置没有问题了。
最后使用ifconfig查看下我们的ip是否已经按照我们设定的配置好了。
修改host
vim /etc/hosts
192.168.15.100 master // 新添加,机器IP 机器名字
192.168.15.101 slave1 // 新添加,ip地址可以在所有机器修改完之后再进行修改
192.168.15.102 slave2 // 新添加,ip地址可以在所有机器修改完之后再进行修改
安装JAVA
Hadoop和spark都运行在java7以上版本,这里下载java8,这里说明下tar.gz,这种类型仅仅需要解压以及配置环境变量即可,便于管理
wget下载jdk,这是一个大坑
wget http://download.oracle.com/otn-pub/java/jdk/8u151-b12/e758a0de34e24606bca991d704f6dcbf/jdk-8u151-linux-x64.tar.gz
tar zxvf jdk-8u151-linux-x64.tar
执行一直报错,而且莫名其妙
tar (child): jdk-8u151-linux-x64.tar: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
查了下文档,是因为我没有同意java license直接下载了,所以不能解压!! 坑爹啊
正解:下载java到本地以后在上传到服务器上去
搭建java
cp -r jdk1.8/ /usr/java/
配置环境变量
vim /etc/profile
文件结尾添加
export JAVA_HOME=/usr/java
export JRE_HOME=/usr/java/jre
export JAVA_BIN=/usr/java/bin
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME JAVA_BIN PATH CLASSPATH
让其立即生效source /etc/profile
,测试java -version
Hadoop安装
下载hadoop,这里选择stable2的,跟java一样还是tar.gz
吸取了上次的经验,现在我都是下载到本地以后在上传到服务器上去
tar zxvf hadoop-2.9.0.tar.gz
将hadoop放入/home/hadoop/hadoop-2.9.0,在/etc/profile中添加
export HADOOP_HOME=/home/hadoop/hadoop-2.9.0
export PATH=.:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
然后让其立即生效 source /etc/profile
Hadoop配置
这里需要注意一点下面所有的配置均在/home/hadoop/hadoop-2.9.0/etc/hadoop中完成
这里我仅仅说明下/home/hadoop/hadoop-2.9.0为hadoop的主目录
-
配置slaves
仅仅有两台slave,配置如下
slave1
slave2
2.修改hadoop-env.sh和yarn-env.sh中java_home
将export JAVA_HOME=${JAVA_HOME}
修改为对应的JAVA_HOME路径
3.修改core-site.xml配置
创建一个目录(/home/hadoop/hadoop-2.9.0/tmp)存放tmp文件
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hadoop/hadoop-2.9.0/tmp</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131702</value>
</property>
</configuration>
备注:前两个设置是必须的,后面可以不加
4.修改hdfs-site.xml配置
在hadoop_home目录下创建hdfs文件夹, 完成后在hdfs目录下创建name以及data文件夹
<property>
<name>dfs.namenode.name.dir</name
> <value>file:/home/hadoop/hadoop-2.9.0/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hadoop-2.9.0/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>master:9001</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
5.修改map-red.xml配置
/etc/hadoop/目录下没有这个文件只有它的模板文件(mapred-site.xml.template),需要复制出来这个文件
cp mapred-site.xml.template mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>master:50030</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
6.修改yarn-site.xml配置
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
目前仅仅完成简单的配置,三台hadoop才可以进行集群测试
Hadoop 集群搭建
slaves创建
copy两个虚拟机(用于slave1 slave2),然后进行修改虚拟机参数
1.修改静态IP
vim /etc/sysconfig/network-scripts/ifcfg-ens37
TYPE=Ethernet
iOOTPROTO=static
DEVICE=ens37
NAME=ens37
ONBOOT=yes
IPADDR="192.168.15.101" # slave1 101 slave2 102
NETMASK="255.255.255.0"
2.修改机器名字
hostnamectl set-hostname "slave1" // 设置主机名字 "slave1"/"slave2"
hostnamectl status --transient // 查看临时主机名字
hostnamectl status --static // 查看主机名字
hostnamectl status // 查看主机目前基础信息
集群测试
1.启动master&slaves集群
cd $HADOOP_HOME
./bin/hdfs namenode -formate // 一次即可 格式化namenode
./sbin/start-all.sh // 启动 dfs 和yarn
紧接着会让你依次输入master root用户密码,以及slave1 & slave2 的root用户密码,这样带来很大的麻烦;
下面来设置免密码登录
master&slave1&slave2均需要设置
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub>> ~/.ssh/authorized_keys // 记录机器key的文件,下次免认证
chmod 600 ~/.ssh/authorized_keys
测试ssh localhost是否可以本地成功(必须先做),下面配置远程连接
vim /etc/ssh/sshd_config
PubkeyAuthentication yes // 取消这行的注释
完成以后重启服务service sshd restart
下面将master中的~/.ssh/id_rsa.pub下载到本地,重命名为id_rsa_master.pub;分别上传到slave1&slave2上。
分别切换到slave1&slave2执行下面命令
cat ~/.ssh/id_rsa_master.pub>> ~/.ssh/authorized_keys
如果没有什么问题的话,显示如下
master: starting namenode, logging to /home/hadoop/hadoop-2.9.0/logs/hadoop-root-namenode-master.out
slave2: starting datanode, logging to /home/hadoop/hadoop-2.9.0/logs/hadoop-root-datanode-slave2.out
slave1: starting datanode, logging to /home/hadoop/hadoop-2.9.0/logs/hadoop-root-datanode-slave1.out
Starting secondary namenodes [master]
master: starting secondarynamenode, logging to /home/hadoop/hadoop-2.9.0/logs/hadoop-root-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.9.0/logs/yarn-root-resourcemanager-master.out
slave1: starting nodemanager, logging to /home/hadoop/hadoop-2.9.0/logs/yarn-root-nodemanager-slave1.out
slave2: starting nodemanager, logging to /home/hadoop/hadoop-2.9.0/logs/yarn-root-nodemanager-slave2.out
2.查看启动情况
查看master运行情况
切换到master输入jps
[root@master hadoop]# jps
2373 SecondaryNameNode
3735 Jps
2526 ResourceManager
切换到slave1输入jps
[root@slave1 ~]# jps
2304 DataNode
2405 NodeManager
2538 Jps
切换到slave2输入jps
[root@slave2 ~]# jps
1987 DataNode
2293 NodeManager
2614 Jps
如果均启动成功那我们宿主机器看下链接,如果可以在Active Nodes看到有2个那么就代表对了
关机前需要在master机器上执行,用于停hadoop
cd $HADOOP_HOME
./sbin/stop-all.sh
spark 配置
下载并安装scala
下载scala,这里我们选择的是scala-2.11.6.tgz。老样子还是下载到本地,然后在上传到服务器。
解压上传文件并一起放入/usr/scala
tar zxvf scala-2.11.6.tgz
mkdir /usr/scala
mv scala-2.11.6 /usr/scala
编辑/etc/profile*,加入scala目录到环境变量中去
export SCALA_HOME=/usr/scala/scala-2.11.6
export PATH=$PATH:$SCALA_HOME/bin
source /etc/profile
scala -version
检查是否scala安装成功
下载并安装spark
1.下载spark
下载spark,这里我们选择
spark-2.2.1-bin-hadoop2.6.tgz。
解压文件放入/home/hadoop/目录下
[root@master hadoop]# ls
hadoop-2.9.0 spark-2.2.1
将spark加入环境变量中
export SPARK_HOME=/home/hadoop/spark-2.2.1
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
2.修改spark参数(集群)
切换到/home/hadoop/spark-2.2.1/conf目录下
修改spark-env.sh
因为spark-env.sh不存在,所以我们需要copy一份(spark-env.sh.template)
cp spark-env.sh.template spark-env.sh
编辑这个文件,依次加入java、Scala、hadoop、spark的环境变量,以使其能够正常到运行
export JAVA_HOME=/usr/java
export SCALA_HOME=/usr/scala/scala-2.11.6
export SPARK_MASTER=192.168.15.100 // master ip address
export SPARK_WORKER_MEMORY=1g
export HADOOP_HOME=/home/hadoop/hadoop-2.9.0
修改slaves
同样slaves文件也不存在,我们需要cp slaves.template slaves
,完成后添加下面内容
master
slave1
slave2
到这里master节点都已经完全配置完成。下面我们需要按照上面配置依次完成slave1 & slave2
3.测试spark集群
切换到master主机,首先启动haddop
$HADOOP_HOME/sbin/start-all.sh
完成后启动spark
$HADOOP_HOME/../spark-2.2.1/sbin/start-all.sh
运行结果
[root@master sbin]# $HADOOP_HOME/../spark-2.2.1/sbin/start-all.sh
rsync from 192.168.15.100
/home/hadoop/spark-2.2.1/sbin/spark-daemon.sh: line 170: rsync: command not found
starting org.apache.spark.deploy.master.Master, logging to /home/hadoop/spark-2.2.1/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
slave2: rsync from 192.168.15.102
slave2: /home/hadoop/spark-2.2.1/sbin/spark-daemon.sh: line 170: rsync: command not found
slave1: rsync from 192.168.15.101
slave1: /home/hadoop/spark-2.2.1/sbin/spark-daemon.sh: line 170: rsync: command not found
slave2: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark-2.2.1/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave2.out
slave1: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark-2.2.1/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave1.out
master: rsync from 192.168.15.100
master: /home/hadoop/spark-2.2.1/sbin/spark-daemon.sh: line 170: rsync: command not found
master: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark-2.2.1/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-master.out
这里有一个小报错,/home/hadoop/spark-2.2.1/sbin/spark-daemon.sh: line 170: rsync: command not found
;这里我们使用yum install rsync,安装下rsync(错误即可消失)。
如果没有问题的话,可以访问hadoop以及spark
关闭服务器先关闭spark, 然后在关闭hadoop
[root@master sbin]# $HADOOP_HOME/../spark-2.2.1/sbin/stop-all.sh
slave1: stopping org.apache.spark.deploy.worker.Worker
slave2: stopping org.apache.spark.deploy.worker.Worker
master: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
[root@master sbin]# $HADOOP_HOME/sbin/stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [master]
master: no namenode to stop
slave1: stopping datanode
slave2: stopping datanode
Stopping secondary namenodes [master]
master: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
slave2: stopping nodemanager
slave1: stopping nodemanager
slave2: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
slave1: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop
搭建jupyter环境
1.下载Anaconda3
下载Anaconda3,懒得在上传了直接wget吧
yum install -y bzip2 // 先安装bunzip2
wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
./Anaconda3-5.0.1-Linux-x86_64.sh // 安装Anaconda3
这里默认目录是/root/anaconda3 我这里调整为/usr/anaconda3
同样编辑/etc/profile添加
export PATH=$PATH:/usr/anaconda3/bin
测试是否可以运行成功jupyter notebook
,如果命令可以运行不报错,那么就表示OK了
下面创建一个密文的密码
from notebook.auth import passwd
passwd()
Enter password: ········
Verify password: ········
Out[1]:
'sha1:3da3aa9aedb0:deb470c78d2857a1f7d1c11138e4d4d8ad5ecbaf'
2.配置jupyter属性
首先配置jupyter可以外网访问
[root@master bin]# jupyter notebook --generate-config --allow-root
Writing default config to: /root/.jupyter/jupyter_notebook_config.py
编辑/root/.jupyter/jupyter_notebook_config.py并释放出如下参数,以及修改如下
vim /root/.jupyter/jupyter_notebook_config.py
小技巧 v编辑的时候可以输入/进行search字符串
c.NotebookApp.open_browser = False # 禁止在运行ipython的同时弹出浏览器
c.NotebookApp.password = u`sha1:3da3aa9aedb0:deb470c78d2857a1f7d1c11138e4d4d8ad5ecbaf` # 如果不配置会出现?token=
c.NotebookApp.port = 5000 # 指定port
c.NotebookApp.ip = '192.168.15.100' # 本机ip
c.NotebookApp.allow_root=True # 允许root否则又一神坑
touch /root/.jupyter/jupyter_notebook_config.py
测试是否安装成功 jupyter notebook
[I 11:23:55.382 NotebookApp] JupyterLab alpha preview extension loaded from /usr/anaconda3/lib/python3.6/site-packages/jupyterlab
JupyterLab v0.27.0
Known labextensions:
[I 11:23:55.383 NotebookApp] Running the core application with no additional extensions or settings
[I 11:23:55.384 NotebookApp] Serving notebooks from local directory: /home/hadoop/spark-2.2.1/bin
[I 11:23:55.384 NotebookApp] 0 active kernels
[I 11:23:55.384 NotebookApp] The Jupyter Notebook is running at: http://192.168.15.100:5000/
[I 11:23:55.384 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
复制http://192.168.15.100:5000/到本机进行测试。
- 配置jupyter启用spark kernel
查看目前启用python用到的kernel
[root@master bin]# jupyter kernelspec list
Available kernels:
python3 /usr/anaconda3/share/jupyter/kernels/python3
配置:在运行pyspark(切换到/home/hadoop/spark-2.2.1/bin)的时候其实是运行jupyter notebook
在./bashrc文件中添加下面设置,设置默认pyspark属性
vim ~/.bashrc
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
完成后pyspark
, 即可启动pyspark
然后开始我们的pyspark之旅吧!!!!
致谢
Hadoop Cluster Setup
Hadoop: Setting up a Single Node Cluster
Spark 开发环境搭建系列
利用docker快速搭建Spark集群
部署Jupyter/IPython Notebook Memo
基于pyspark 和scala spark的jupyter notebook 安装