HADOOP
1. Create the hadoop user
useradd -m hadoop -s /bin/bash
Set the password
passwd hadoop (WOaiyuyu123)
Grant the user administrator privileges
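A minimal sketch, assuming CentOS 7 where members of the wheel group can use sudo:
usermod -aG wheel hadoop
Alternatively, add a line like "hadoop ALL=(ALL) ALL" to the sudoers file via visudo.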
2. Hadoop download URL:
https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
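One possible way to fetch and unpack it under /usr/local/hadoop (the install path assumed throughout these notes):
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
mkdir -p /usr/local/hadoop
tar -zxvf hadoop-2.8.5.tar.gz -C /usr/local/hadoop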
3. Configure the Java environment variables:
which java
/usr/bin/java
ls -lrt /usr/bin/java
/usr/bin/java -> /etc/alternatives/java
ls -lrt /etc/alternatives/java
/etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64/jre/bin/java
cd /usr/lib/jvm
vi /etc/profile
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
source /etc/profile
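A quick sanity check that the variables took effect (the exact version string depends on the installed OpenJDK build):
echo $JAVA_HOME
java -version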
4. Check the Hadoop version
bin/hadoop version
Configure the Hadoop environment variables
vim /etc/profile
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.8.5
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
5. Modify the Hadoop configuration files
5.1 Edit core-site.xml
<!-- HDFS (NameNode) RPC address -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<!-- Base directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>4096</value>
</property>
</configuration>
5.2 Edit hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/tmp/dfs/data</value>
</property>
</configuration>
Edit /usr/local/hadoop/hadoop-2.8.5/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
5.3 Format the NameNode
bin/hdfs namenode -format
5.4 Start HDFS
sbin/start-dfs.sh
Start the NameNode and DataNode individually:
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
Startup fails when ssh is not installed:
localhost: /usr/local/hadoop/hadoop-2.8.5/sbin/slaves.sh: line 60: ssh: command not found
localhost: /usr/local/hadoop/hadoop-2.8.5/sbin/slaves.sh: line 60: ssh: command not found
Starting secondary namenodes [0.0.0.0]
0.0.0.0: /usr/local/hadoop/hadoop-2.8.5/sbin/slaves.sh: line 60: ssh: command not found
5.5 Install ssh
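A hedged sketch for CentOS 7; the container hostnames in these notes suggest Docker, so sshd is started directly (on a regular host use systemctl):
yum -y install openssh-server openssh-clients
ssh-keygen -A        # generate host keys if they are missing
/usr/sbin/sshd       # or: systemctl start sshd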
5.6 Set the root password
passwd root (YUyuaiwo123)
5.7 Create HDFS directories
bin/hdfs dfs -mkdir -p /user/wudy/input
Upload the files in the local wcinput directory to /user/wudy/input
bin/hdfs dfs -put wcinput/wc.input /user/wudy/input
- Browse the HDFS path /user/wudy/input in the NameNode web UI:
http://hadoop2:50070/explorer.html#//user/wudy/input
Count the words (run the WordCount example):
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /user/wudy/input /user/wudy/output
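To inspect the result once the job finishes (the reducer writes a part-r-00000 file by default):
bin/hdfs dfs -cat /user/wudy/output/part-r-00000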
Caution when reformatting the NameNode in pseudo-distributed mode: formatting generates a new clusterID, so stop the daemons and delete the old data and logs directories first; otherwise the DataNode keeps the old clusterID, it no longer matches the NameNode's, and the DataNode will not start.
5.8 Start YARN and run a MapReduce job
Analysis:
1> Configure the cluster to run MR on YARN
2> Start the cluster and test add/delete/query operations
3> Run the WordCount example on YARN (see the run command after the jps check below)
Steps:
Configure the cluster:
Configure yarn-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
Configure yarn-site.xml
<!-- How reducers obtain data -->
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Hostname of the YARN ResourceManager (a hostname, not an hdfs:// URI) -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
</configuration>
Configure mapred-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
Copy mapred-site.xml.template to mapred-site.xml and configure it:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Start (or stop) the ResourceManager and NodeManager:
sbin/yarn-daemon.sh start resourcemanager (sbin/yarn-daemon.sh stop resourcemanager)
sbin/yarn-daemon.sh start nodemanager (sbin/yarn-daemon.sh stop nodemanager)
jps
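With both daemons showing up in jps, the WordCount example can be re-run on YARN (remove the previous output directory first, since MapReduce refuses to overwrite it):
bin/hdfs dfs -rm -r /user/wudy/output
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /user/wudy/input /user/wudy/output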
- Hadoop web UI ports:
50070 - HDFS NameNode UI (172.17.0.4:50070/explorer.html#/)
8088 - MapReduce / YARN cluster UI (hadoop4:8088/cluster)
vim yarn-site.xml
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop2:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop2:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop2:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop2:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop2:8088</value>
</property>
6. Configure the JobHistory server:
vim mapred-site.xml
<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop2:10020</value>
</property>
<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop2:19888</value>
</property>
Start the history server: sbin/mr-jobhistory-daemon.sh start historyserver
Stop the history server: sbin/mr-jobhistory-daemon.sh stop historyserver
YARN log aggregation:
Requires restarting the NodeManager, ResourceManager, and JobHistory server
vim yarn-site.xml
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
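Once aggregation is enabled and a job has run, its logs can be pulled from the command line (the application id comes from the 8088 web UI or from `yarn application -list`):
bin/yarn logs -applicationId <application_id>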
- 7. Fully distributed mode (the main mode for development):
- 7.1 Write the cluster distribution script xsync
scp (secure copy)
scp copies data between servers.
Format: scp -r $pdir/$fname user@host:$pdir/$fname (destination user@host:destination path/file name)
scp -r logs root@172.17.0.3:/opt/logs (push the data from hadoop2 to hadoop3)
On hadoop3, pull hadoop2's logs into the current directory (/opt):
scp -r root@hadoop2:/usr/local/hadoop/hadoop-2.8.5/logs ./
If a new machine hadoop5 is added, copy the profile from hadoop3 directly to hadoop5:
scp /etc/profile root@hadoop5:/etc/profile
- 7.2 rsync remote synchronization tool
rsync is mainly used for synchronization and mirroring; it is fast, avoids copying identical content, supports symbolic links, and only transfers the files that differ. The xsync script below wraps it:
#!/bin/bash
pcount=$#
if((pcount==0)); then
echo no args;
exit;
fi
# Get the file name
p1=$1
fname=`basename $p1`
echo fname=$fname
# Get the absolute path of the parent directory
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir
# Get the current user name
user=`whoami`
for((host=2; host<5; host++)); do
echo ------------hadoop$host----------
rsync -rvl $pdir/$fname $user@hadoop$host:$pdir
done
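Assuming the script is saved as /usr/local/bin/xsync (name and location are my choice here) and made executable, distributing the Hadoop config directory would look like:
chmod +x /usr/local/bin/xsync
xsync /usr/local/hadoop/hadoop-2.8.5/etc/hadoop/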
- 7.3 Cluster configuration
Cluster deployment plan:
        172.17.0.2 (hadoop2)    172.17.0.3 (hadoop3)            172.17.0.4 (hadoop4)
HDFS    NameNode, DataNode      DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager             ResourceManager, NodeManager    NodeManager
Configure hadoop2:
vim core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop2:9000</value>
</property>
vim hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop4:50090</value>
</property>
vim yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop3</value>
</property>
vim mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
- rsync download URL:
https://download.samba.org/pub/rsync/rsync-3.1.3.tar.gz
Note: use xsync to sync hadoop2's configuration to hadoop3 and hadoop4 (I currently changed them by hand)
- A per-file sync command looks like:
rsync -avzP --delete mapred-env.sh root@hadoop3:/usr/local/hadoop/hadoop-2.8.5/etc/hadoop/
Stop the services first
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh stop datanode
sbin/yarn-daemon.sh stop nodemanager
sbin/yarn-daemon.sh stop resourcemanager
sbin/mr-jobhistory-daemon.sh stop historyserver
- Delete the data and logs directories on hadoop2, hadoop3, and hadoop4, then format the NameNode
rm -rf logs
rm -rf /tmp/dfs/name/current
bin/hdfs namenode -format
- Clean up the DataNodes (a DataNode is not formatted; deleting its data directory is enough so that it picks up the new clusterID on the next start)
rm -rf /tmp/dfs/data/current
On hadoop2, start the NameNode and DataNode:
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
On hadoop3 and hadoop4, start the DataNode:
sbin/hadoop-daemon.sh start datanode
- Configure passwordless SSH login
Install ssh-copy-id
yum -y install openssh-clients
Generate id_rsa and id_rsa.pub
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Copy hadoop2's public key to the remote hosts
cd (cd .ssh)
ssh-copy-id root@172.17.0.3
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.17.0.4
The public key must also be copied to the host itself (on hadoop2, `ssh hadoop2` still asks for a password; we want ssh to the local host to be passwordless too; the same applies to hadoop3 and hadoop4):
ssh-copy-id hadoop2
Likewise, generate a key pair on hadoop3 and copy its public key to hadoop2 and hadoop4
Likewise, generate a key pair on hadoop4 and copy its public key to hadoop2 and hadoop3
Finally, check the authorized_keys file on hadoop2 (the two entries below come from hadoop3 and hadoop4):
[root@f71da2a2f780 .ssh]#cat authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCvvYSOd2BbJtzA4E1YnNXFQ78rbRX+bPR9WZARFNQ0Cyh187Sc97W+Bcn1qkxCDHQyI3mwJV/9w66x/Sg7qcBZpjOFpf1F1jT+CUwVaVwhWLj0PhdkvyUYvlMTRkVdl4JYkWezw97p5Sd0OjJ0Lirp92xzByr5Lt128ZqMfvWYE1elR/ZGfcv361U3A6agZyZEV3tvXZ/acqaPfXzbCcP2YGqd8jUKi42rx50dOVvXinSuD8v9mA+LJ0prKzB2dh0PIkCZ9mUBPu8IgyVtYmzZpdNc3bzcaeBjRUOKnlSVTGnssuHl89+mspETgk5y+huLqQ+3XK1aoMXXm0St9CAzujPlwv2kvjcZWSeyAci6/i2KKvML4or42kDZz1nYtzUhcMGoOZrjVMxoLgzs9eUUA4jIZazPf8FX8I5oh7Kpd5HY8XC6B63pFhWpAzlyyW2cq7j9wQDzb2dktzNtrOqEsylCKYMs8cbRXXSaZ2+3UILevtXwt5rI0AYyOwpNdqk= root@e9c4e3e03433
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDJ51DLAzR1kKHI+wrE//oiCT1blAymqEzICbXHrfYM0UcnNCW7J14Pab7KsdaVWhWz0qEPbRInJKqw53klgFK2o3lb7dPCYCeacPl90uI1RxupzEZwrzlZwAeuFgzZeoH4G7I/mCsxmK8upYVeYdoX2BhjbOJQVacRXpdtuMMTTo0VgeXxllWoxR9lJGVwXUn8dbIaQCr5HMGrqwiCuHpPw/zi31TN9V20ubb833eCzXmY+DgtVRoka01Ir8fnuqVAbd404SzwxN9bvM6oyoozK/23UT8tJJNFy2FvzO6trp+2LS+m7IPZN/eSvb0XgQOWV3RiF8e/pYkQ6ep1DA+XnCQsY1qVBk1X7zCr4zC14ounIvnYdHs00+uoAoSRK74uitr4i+GiJaONsN+6n2MQceT3HR1K0KSDhrOZb77EGi4eVH3PaM3+0mmvjVlNoc7gNKtWMe0h8sB6ROGVAAsKxlRjjVxcgtNtVr1nw01KV5HwHBmtWt2BpFMOms0arB8= root@6e59d53ca6b9
Key step: start the whole cluster at once
1. Configure slaves
cd /usr/local/hadoop/hadoop-2.8.5/etc/hadoop/
vim slaves
Add the following content (the file must not contain extra spaces or blank lines):
hadoop2
hadoop3
hadoop4
(Configure hadoop3 and hadoop4 in the same way)
Start the cluster from hadoop2:
(this starts the HDFS daemons: NameNode, DataNode, and SecondaryNameNode)
[root@f71da2a2f780 hadoop-2.8.5]# sbin/start-dfs.sh
Starting namenodes on [hadoop2]
hadoop2: starting namenode, logging to /usr/local/hadoop/hadoop-2.8.5/logs/hadoop-root-namenode-f71da2a2f780.out
hadoop3: datanode running as process 3551. Stop it first.
hadoop2: starting datanode, logging to /usr/local/hadoop/hadoop-2.8.5/logs/hadoop-root-datanode-f71da2a2f780.out
hadoop4: datanode running as process 3877. Stop it first.
Starting secondary namenodes [hadoop4]
hadoop4: secondarynamenode running as process 3970. Stop it first.
Start YARN as a whole
Start (or stop) YARN on hadoop3:
[root@e9c4e3e03433 hadoop-2.8.5]# sbin/stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
hadoop4: stopping nodemanager
hadoop2: stopping nodemanager
hadoop3: stopping nodemanager
hadoop4: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
hadoop2: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
hadoop3: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
- Note: if the NameNode (hadoop2) and the ResourceManager (hadoop3) are not on the same machine, YARN must not be started on the NameNode;
start YARN on the machine where the ResourceManager (hadoop3) lives.
- Note: daemons on hadoop3 and hadoop4 that were started from hadoop2 should also be stopped from hadoop2. Running the stop directly on hadoop3 gives:
[root@e9c4e3e03433 hadoop-2.8.5]# sbin/hadoop-daemon.sh stop datanode
no datanode to stop
This happens because the clusterIDs no longer match (so the DataNode was not actually running), as in the output below:
[root@f71da2a2f780 hadoop-2.8.5]# sbin/stop-dfs.sh
Stopping namenodes on [hadoop2]
hadoop2: no namenode to stop
hadoop2: stopping datanode
hadoop4: stopping datanode
hadoop3: stopping datanode
Stopping secondary namenodes [hadoop4]
hadoop4: stopping secondarynamenode
Basic cluster test:
[root@f71da2a2f780 hadoop-2.8.5]# bin/hdfs dfs -put wcinput/wc.input /
On-disk path where the uploaded file's blocks are stored:
/usr/local/hadoop/hadoop-2.8.5/tmp/dfs/data/current/BP-365936823-172.17.0.2-1568430331460/current/finalized/subdir0/subdir0/
Command reference:
1> Start/stop HDFS as a whole
sbin/start-dfs.sh / sbin/stop-dfs.sh
2> Start/stop YARN as a whole
sbin/start-yarn.sh / sbin/stop-yarn.sh
- -----------------Cluster time synchronization----------------------
1. Install crontab
yum -y install vixie-cron crontabs
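Note: vixie-cron is the CentOS 5/6 package name; on CentOS 7 the scheduler comes from cronie, so a hedged equivalent is:
yum -y install cronie crontabs
/usr/sbin/crond        # start the daemon directly in a container, or: systemctl start crond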
Option   Function
-e       Edit the current user's crontab entries
-l       List the current user's crontab entries
-r       Remove all crontab entries of the current user
Edit: crontab -e
* * * * * command-to-run
Field             Meaning                  Range
1. first "*"      minute of the hour       0~59
2. second "*"     hour of the day          0~23
3. third "*"      day of the month         1~31
4. fourth "*"     month of the year        1~12
5. fifth "*"      day of the week          0~7 (0 and 7 both mean Sunday)
Special symbol   Meaning
1  *     any value; e.g. a * in the first field means the command runs every minute
2  ,     a list of discrete values; e.g. "0 8,12,16 * * *" runs the command at 8:00, 12:00, and 16:00 every day
3  -     a continuous range; e.g. "0 5 * * 1-6" runs the command at 5:00 from Monday to Saturday
4  */n   every n units; e.g. "*/10 * * * *" runs the command every 10 minutes
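A hedged example of the usual cluster time sync: on hadoop3 and hadoop4, pull the time from hadoop2 every 10 minutes (this assumes ntpdate is installed and hadoop2 runs an NTP service):
*/10 * * * * /usr/sbin/ntpdate hadoop2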
-
Hands-on practice: