(I) Setting up a Hadoop development environment on Ubuntu
1 Installing the JDK
(1) Download jdk-8u65-linux-x64.tar.gz
(2) Extract the archive and copy it to the ~/soft/ directory
mkdir ~/soft/
tar -xzvf jdk-8u65-linux-x64.tar.gz
cp -r jdk1.8.0_65 ~/soft/
(3) Create a symbolic link
ln -s ~/soft/jdk1.8.0_65 ~/soft/jdk
(4) Verify that the JDK was installed successfully
cd ~/soft/jdk/bin
./java -version
(5) Configure the JDK environment variables
- (a) Edit /etc/profile
$>sudo vim /etc/profile
export JAVA_HOME=/home/henry/soft/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
- (b) Make the environment variables take effect immediately
source /etc/profile
- (c) Switch to any directory and test that it works
$>cd ~
$>java -version
2 Installing Hadoop
(1) Download hadoop-2.7.2.tar.gz (the version output below shows that 2.7.2 is what was actually installed here; adjust the file names if you use another release)
(2) Extract it and copy it to the ~/soft directory
$>tar -xzvf hadoop-2.7.2.tar.gz
$>cp -r ./hadoop-2.7.2 ~/soft/
(3) Create a symbolic link
$>ln -s ~/soft/hadoop-2.7.2 ~/soft/hadoop
(4) Verify that Hadoop was installed successfully
$>cd ~/soft/hadoop/bin
henry@s201:~/soft/hadoop/bin$ ./hadoop version
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /home/henry/soft/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
(5) Configure the Hadoop environment variables
$>sudo vim /etc/profile
export HADOOP_HOME=/home/henry/soft/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
#make it take effect immediately
$>source /etc/profile
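As with the JDK, switch to any directory and confirm that the variables took effect:
$>cd ~
$>hadoop version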
3 Hadoop's three run modes
3.1 Standalone (local) mode
- In this mode nothing needs to be configured, and no separate Hadoop processes are started;
- local mode uses the local operating system's file system, so the command below simply lists the local /home directory (see also the smoke test after it)
hdfs dfs -ls /home
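A quick way to see standalone mode in action is to run the bundled wordcount example over local files. A minimal sketch (the input/output paths are illustrative; adjust the examples jar name to the version you installed):
$>mkdir ~/wc_in
$>echo "hello hadoop hello world" > ~/wc_in/words.txt
$>hadoop jar ~/soft/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount ~/wc_in ~/wc_out
$>cat ~/wc_out/part-r-00000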
3.2 Pseudo-distributed mode
In pseudo-distributed mode, all of the daemons run on a single machine. Configure it as follows.
(1) Change into the ${HADOOP_HOME}/etc/hadoop directory
(2) Edit the core-site.xml file
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
(3) Edit the hdfs-site.xml file
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
(4) Edit the mapred-site.xml file (the 2.7.x distribution only ships mapred-site.xml.template; copy it to mapred-site.xml first)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
(5) Edit the yarn-site.xml file
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
(6) Configure SSH so that you can log in to the local machine over SSH without a password
[Figure: how passwordless SSH login works]
- (a) Install the SSH packages (on Ubuntu these are openssh-server and openssh-client)
$>sudo apt-get install openssh-server
$>sudo apt-get install openssh-client
- (b) Check that the sshd process is running
henry@s201:~/soft/hadoop/etc/pseudo$ ps -Af | grep ssh
root 983 1 0 12:20 ? 00:00:00 /usr/sbin/sshd -D
henry 1306 1149 0 12:20 ? 00:00:00 gnome-keyring-daemon --start --components ssh
henry 7171 2140 0 13:21 pts/17 00:00:00 grep --color=auto ssh
- (c) Generate a public/private key pair on the client side
$>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
- (d) This creates the ~/.ssh folder, containing id_rsa (the private key) and id_rsa.pub (the public key)
henry@s201:~/.ssh$ ls
authorized_keys id_rsa id_rsa.pub known_hosts
- (e) Append the public key to the ~/.ssh/authorized_keys file (the file name and location are fixed)
$>cd ~/.ssh
$>cat id_rsa.pub >> authorized_keys
- (f) Change the permissions of authorized_keys to 644 (not required on Ubuntu)
$>chmod 644 authorized_keys
- (g) Test it
$>ssh localhost
(7) Format HDFS
$>hadoop namenode -format    #deprecated in 2.x; hdfs namenode -format is the current form
(8) Edit the Hadoop configuration file to set the JAVA_HOME environment variable explicitly
[${hadoop_home}/etc/hadoop/hadoop-env.sh]
...
export JAVA_HOME=/home/henry/soft/jdk
...
(9) Start all of the Hadoop processes
$>start-all.sh
(10) Use jps to list the processes that were started
$>jps
33702 NameNode
33792 DataNode
33954 SecondaryNameNode
29041 ResourceManager
34191 NodeManager
(11) Inspect the HDFS file system
$>hdfs dfs -ls /
(12) Create a directory in HDFS
hdfs dfs -mkdir -p /home/henry/hadoop
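As an optional check (the file name is illustrative), upload a local file into the new directory and read it back:
$>echo "hello hdfs" > /tmp/hello.txt
$>hdfs dfs -put /tmp/hello.txt /home/henry/hadoop/
$>hdfs dfs -cat /home/henry/hadoop/hello.txt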
(13) Inspect the Hadoop file system through the web UI
Open http://localhost:50070/ in a browser to check that the setup succeeded.
(14) Use stop-all.sh to stop the processes
(15) Summary
- Hadoop ports
50070 //namenode http port
50075 //datanode http port
50090 //secondarynamenode http port
8020  //namenode rpc port
50010 //datanode rpc port
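To confirm that the daemons are actually listening on these ports, something like the following can be used (the exact output varies):
$>ss -tln | grep -E '50070|50075|50090|8020|50010'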
- Hadoop's four modules
common
hdfs   //namenode + datanode + secondarynamenode
mapred
yarn   //resourcemanager + nodemanager
- Startup scripts
1.start-all.sh  //start all processes
2.stop-all.sh   //stop all processes
3.start-dfs.sh  //start only the HDFS processes (namenode, datanode, secondarynamenode)
4.start-yarn.sh //start only the YARN processes (resourcemanager, nodemanager)
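In Hadoop 2.x, start-all.sh and stop-all.sh are deprecated and simply delegate to the per-module scripts, so the following is equivalent to start-all.sh:
$>start-dfs.sh
$>start-yarn.sh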
3.3 Fully distributed mode
This section configures four hosts: one namenode and three datanodes. Their IP addresses and host names are:
192.168.2.201 s201 //namenode
192.168.2.202 s202 //datanode
192.168.2.203 s203 //datanode
192.168.2.204 s204 //datanode
3.3.1 Clone three virtual machines, set their IP addresses, and change their host names
(1) Set the IP address
The router's IP address is 192.168.2.1. The systems are Ubuntu 16.04 running in VMware virtual machines with bridged networking.
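Ubuntu 16.04 still uses ifupdown, so the static address can be configured in /etc/network/interfaces. A minimal sketch for s201, assuming the interface is named ens33 (check the real name with ip addr):
auto ens33
iface ens33 inet static
    address 192.168.2.201
    netmask 255.255.255.0
    gateway 192.168.2.1
    dns-nameservers 192.168.2.1
Apply it with a reboot or $>sudo systemctl restart networking.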
(2) Change the host name
$> sudo gedit /etc/hostname
s201
$>sudo gedit /etc/hosts
127.0.0.1 localhost
192.168.2.201 s201
192.168.2.202 s202
192.168.2.203 s203
192.168.2.204 s204
(3) Clone three Ubuntu virtual machines and change their host names
VM -> Manage -> Clone -> Full clone
Change the host names and IP addresses of the three cloned VMs:
s202: 192.168.2.202
s203: 192.168.2.203
s204: 192.168.2.204
3.3.2 Prepare SSH on the fully distributed hosts
(1) Delete /home/henry/.ssh/* on all of the hosts
(2) Generate a key pair on s201
$>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
(3) Copy s201's public key file id_rsa.pub to all four hosts (s201 itself included) as /home/henry/.ssh/authorized_keys
$>scp id_rsa.pub henry@s201:/home/henry/.ssh/authorized_keys
$>scp id_rsa.pub henry@s202:/home/henry/.ssh/authorized_keys
$>scp id_rsa.pub henry@s203:/home/henry/.ssh/authorized_keys
$>scp id_rsa.pub henry@s204:/home/henry/.ssh/authorized_keys
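To verify that passwordless login now works from s201 to every host, each iteration below should print the remote host name without prompting for a password:
$>for h in s201 s202 s203 s204; do ssh $h hostname; done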
3.3.3 Configure the fully distributed configuration files (${hadoop_home}/etc/hadoop/) and start the Hadoop processes
(1) On s201, configure the core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, slaves, and hadoop-env.sh files
[core-site.xml]
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://s201/</value>
</property>
</configuration>
[hdfs-site.xml]
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
[mapred-site.xml]
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
[yarn-site.xml]
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>s201</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
[slaves]
s202
s203
s204
[hadoop-env.sh]
...
export JAVA_HOME=/home/henry/soft/jdk
...
(2) Delete all of the log files under /home/henry/soft/hadoop/logs on s201
$>cd /home/henry/soft/hadoop/logs
$>rm -rf *
(3) Sync the /home/henry/soft/hadoop directory to the s202, s203, and s204 hosts (the trailing slashes make rsync copy the directory's contents, rather than nesting a second hadoop directory inside the destination)
$>rsync -lr /home/henry/soft/hadoop/ henry@s202:/home/henry/soft/hadoop/
$>rsync -lr /home/henry/soft/hadoop/ henry@s203:/home/henry/soft/hadoop/
$>rsync -lr /home/henry/soft/hadoop/ henry@s204:/home/henry/soft/hadoop/
(4) Delete the temporary directory on all four hosts
$>cd /tmp
$>rm -rf hadoop-henry
$>ssh s202 rm -rf /tmp/hadoop-henry
$>ssh s203 rm -rf /tmp/hadoop-henry
$>ssh s204 rm -rf /tmp/hadoop-henry
(5) Format the file system
$>hadoop namenode -format
(6) Start the Hadoop processes
$>start-all.sh
(7) Use jps to check the processes on every host (via the xcall.sh helper script, sketched after the output below)
henry@s201:/usr/local/bin$ xcall.sh jps
============= s201 jps =============
5828 ResourceManager
5462 NameNode
8553 Jps
5676 SecondaryNameNode
============= s202 jps =============
32978 Jps
30546 NodeManager
30415 DataNode
============= s203 jps =============
30812 NodeManager
32927 Jps
30687 DataNode
============= s204 jps =============
30289 DataNode
32833 Jps
30426 NodeManager
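xcall.sh is not part of Hadoop; it is a small helper script (here placed in /usr/local/bin) that runs the same command on every host over SSH. A minimal sketch, assuming the four host names above:
#!/bin/bash
#xcall.sh: run the given command on every cluster host via ssh
for host in s201 s202 s203 s204; do
    echo "============= $host $@ ============="
    ssh $host "$@"
done
If jps is not found in a non-interactive ssh session, invoke it by its full path instead, e.g. ssh $host /home/henry/soft/jdk/bin/jps.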
(8) Check the cluster through the web UI (e.g. http://s201:50070/ for the namenode, as in the pseudo-distributed setup)
4 Changing Hadoop's storage directory
By default, Hadoop stores its data under /tmp/hadoop-username. To change this, add the following property to core-site.xml (the file must be changed on every host) and set its value to the desired directory.
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
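For example, to move the data out of /tmp (the path below is illustrative), set the value to /home/henry/hadoop/tmp in core-site.xml on every host, then stop the cluster and reformat, since the HDFS metadata and block data live under this directory:
$>stop-all.sh
$>scp core-site.xml henry@s202:/home/henry/soft/hadoop/etc/hadoop/   #repeat for s203, s204
$>hdfs namenode -format
$>start-all.sh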