
Hadoop Environment Setup

2018-01-10  发光如星_275d

Now we start building the Hadoop environment proper. Splitting this into its own post makes it easier to read and to refer back to later.

Setting up Hadoop involves configuring several XML files. The following briefly explains what each of these files does.

core-site.xml

Mainly configures NameNode-related information. The main properties are as follows:

<configuration>
    <!-- Set the HDFS nameservice (namespace) to ns -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ns</value>
    </property>
    <!-- Hadoop temporary directory; the default /tmp/{$user} is unsafe because it may be wiped on every reboot -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/hdpdata/</value>
        <description>The hdpdata directory must be created manually</description>
    </property>
    <!-- ZooKeeper quorum addresses -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>node01:2181,node02:2181,node03:2181</value>
        <description>ZooKeeper addresses, comma-separated</description>
    </property>
</configuration>
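
As the description notes, the hdpdata directory is not created automatically. A minimal sketch for preparing it, assuming Hadoop lives under /usr/local/hadoop and is owned by the hadoop user:

# run on every node before the first start (as the hadoop user)
mkdir -p /usr/local/hadoop/hdpdata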

hdfs-site.xml

Mainly configures HDFS:

<configuration>
    <!-- NameNode HA configuration -->
    <property>
        <name>dfs.nameservices</name>
        <value>ns</value>
        <description>HDFS nameservice name ns; must match the value in core-site.xml</description>
    </property>
    <property>
        <name>dfs.ha.namenodes.ns</name>
        <value>nn1,nn2</value>
        <description>The ns nameservice has two NameNodes; the logical IDs are arbitrary, here nn1 and nn2</description>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns.nn1</name>
        <value>node01:9000</value>
        <description>RPC address of nn1</description>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns.nn1</name>
        <value>node01:50070</value>
        <description>HTTP address of nn1</description>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns.nn2</name>
        <value>node02:9000</value>
        <description>RPC address of nn2</description>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns.nn2</name>
        <value>node02:50070</value>
        <description>HTTP address of nn2</description>
    </property>
    <!-- JournalNode configuration -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://node03:8485;node04:8485;node05:8485/ns</value>
        <description>Where the NameNode edits metadata is stored on the JournalNodes</description>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/usr/local/hadoop/journaldata</value>
        <description>Local disk directory where the JournalNodes store their data</description>
    </property>
    <!-- NameNode HA automatic failover configuration -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
        <description>Enable automatic failover when the active NameNode fails</description>
    </property>
    <property>
        <name>dfs.client.failover.proxy.provider.ns</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
        <description>How clients locate the active NameNode during failover; used together with the built-in ZKFC</description>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>
            sshfence
            shell(/bin/true)
        </value>
        <description>Fencing methods, one per line; sshfence is tried first, and if it fails shell(/bin/true) runs, which always returns 0 (success)</description>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hadoop/.ssh/id_rsa</value>
        <description>sshfence requires passwordless SSH; path to the private key</description>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
        <description>Timeout for the sshfence fencing method, in milliseconds</description>
    </property>
    <!-- HDFS file attribute settings -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
        <description>Set the block replication factor to 3</description>
    </property>

    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
        <description>Set the block size to 128 MB (dfs.block.size is the deprecated name for this key)</description>
    </property>
</configuration>
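
The JournalNode directory and the SSH key referenced by sshfence also need to exist before HDFS is started. A sketch, assuming the hadoop user and the paths configured above:

# on node03, node04 and node05: local storage for the JournalNodes
mkdir -p /usr/local/hadoop/journaldata

# on node01 and node02: key pair used by sshfence (skip if one already exists)
ssh-keygen -t rsa -f /home/hadoop/.ssh/id_rsa -N ""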

yarn-site.xml

Some YARN configuration.
Since this setup runs in virtual machines and Hadoop assumes 8192 MB of memory per node by default, which a VM usually does not have, the memory-related parameters below are lowered to cap it; with plenty of RAM you can of course leave the defaults.

<configuration>
    <!-- Enable ResourceManager HA -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <!-- Cluster ID shared by the HA pair of ResourceManagers -->
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yarn-ha</value>
    </property>
    <!-- Logical IDs of the ResourceManagers; the names are arbitrary -->
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <!-- Addresses of each ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>node01</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address.rm1</name>
        <value>${yarn.resourcemanager.hostname.rm1}:8088</value>
        <description>HTTP (web UI) port</description>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>node02</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address.rm2</name>
        <value>${yarn.resourcemanager.hostname.rm2}:8088</value>
    </property>
    <!-- ZooKeeper cluster addresses -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>node01:2181,node02:2181,node03:2181</value>
    </property>
    <!-- Auxiliary service run by the NodeManager; must be mapreduce_shuffle for MapReduce jobs to run -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- HDFS directory for aggregated logs -->
    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/data/hadoop/yarn-logs</value>
    </property>
    <!-- Log retention time: 3 days, in seconds -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>259200</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
        <description>Minimum memory a single container can request; default 1024 MB</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>2048</value>
        <description>Memory available to this NodeManager; default 8192 MB (value in MB)</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>1</value>
        <description>Number of CPU vcores available to this NodeManager</description>
    </property>
</configuration>
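
The aggregated-log directory configured above lives on HDFS; once HDFS is running it can be created ahead of time, for example:

hadoop fs -mkdir -p /data/hadoop/yarn-logs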

Environment Setup


The plan for this setup is as follows:

Hostname  IP             Installed software        Running processes
node01    192.168.47.10  JDK, Hadoop, Zookeeper    NameNode (active), DFSZKFailoverController (zkfc), ResourceManager (standby), QuorumPeerMain (Zookeeper)
node02    192.168.47.11  JDK, Hadoop, Zookeeper    NameNode (standby), DFSZKFailoverController (zkfc), ResourceManager (active), QuorumPeerMain (Zookeeper), Jobhistory
node03    192.168.47.12  JDK, Hadoop, Zookeeper    DataNode, NodeManager, QuorumPeerMain (Zookeeper), JournalNode
node04    192.168.47.13  JDK, Hadoop               DataNode, NodeManager, JournalNode
node05    192.168.47.14  JDK, Hadoop               DataNode, NodeManager, JournalNode

Unless otherwise noted, all steps in the setup are performed as the hadoop user. Environment variables used throughout the setup:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
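
These exports are assumed to sit in the hadoop user's shell profile (for example ~/.bashrc); after reloading it, a quick sanity check:

source ~/.bashrc
hadoop version    # should print the Hadoop 2.7.4 banner
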
mapred-site.xml

MapReduce framework and JobHistory server configuration:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>Run MapReduce on YARN</description>
    </property>
    <!-- JobHistory server configuration -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node02:10020</value>
        <description>JobHistory server address and port</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node02:19888</value>
        <description>JobHistory server web UI address and port</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.joblist.cache.size</name>
        <value>2000</value>
        <description>Number of history file entries (mainly job directories) cached in memory</description>
    </property>
</configuration>
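
The JobHistory server is not started by the HDFS or YARN start scripts; on node02, where it is planned to run, it can be started with something like:

# on node02
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver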

Configure passwordless login for the hadoop user

Passwordless SSH needs to be set up between the cluster nodes.
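
A minimal sketch of one way to do this, run as the hadoop user (node01 shown; node02 needs the same so that sshfence and the start scripts work in both directions):

# generate a key pair, then push the public key to every node (including itself)
ssh-keygen -t rsa
for host in node01 node02 node03 node04 node05; do
    ssh-copy-id hadoop@${host}
done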

Configure Hadoop on the other nodes
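
A common approach is to finish the configuration on node01 and then copy the whole installation over, assuming the hadoop user can write to /usr/local on every node:

# run on node01
for host in node02 node03 node04 node05; do
    scp -r /usr/local/hadoop ${host}:/usr/local/
done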

Start the cluster
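
An HA cluster has to be brought up in a specific order the first time. A sketch of the sequence for this layout (ZooKeeper is assumed to already be installed on node01-node03; paths are relative to $HADOOP_HOME):

# 1. start ZooKeeper on node01, node02 and node03
zkServer.sh start

# 2. start the JournalNodes on node03, node04 and node05
sbin/hadoop-daemon.sh start journalnode

# 3. on node01: format HDFS, then copy the metadata directory to node02
bin/hdfs namenode -format
scp -r /usr/local/hadoop/hdpdata node02:/usr/local/hadoop/

# 4. on node01: initialise the HA state in ZooKeeper
bin/hdfs zkfc -formatZK

# 5. on node01: start HDFS (NameNodes, DataNodes, JournalNodes, ZKFCs)
sbin/start-dfs.sh

# 6. start YARN on node01, then the second ResourceManager on node02
sbin/start-yarn.sh                            # node01
sbin/yarn-daemon.sh start resourcemanager     # node02

# finally, start the JobHistory server on node02 as shown earlier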

HDFS web UI addresses
NameNode (active): http://192.168.47.10:50070
NameNode (standby): http://192.168.47.11:50070
ResourceManager web UI address
ResourceManager: http://192.168.47.11:8088
JobHistory web UI address
JobHistoryServer: http://192.168.47.11:19888

Cluster verification

  1. Upload a file to HDFS: hadoop fs -put /usr/local/hadoop/README.txt /
  2. On the active node, manually stop the active NameNode: sbin/hadoop-daemon.sh stop namenode
  3. On the HTTP port 50070 UI, check that the standby NameNode has switched to active, then restart the NameNode stopped in the previous step: sbin/hadoop-daemon.sh start namenode
  4. Run the WordCount program from the examples shipped with Hadoop (how to check its output is shown after this list):
hadoop fs -mkdir /wordcount
hadoop fs -mkdir /wordcount/input 
hadoop fs -mv /README.txt /wordcount/input 
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar wordcount /wordcount/input  /wordcount/output

Note: the output path /wordcount/output must not already exist on HDFS, otherwise the job will not run.

  5. Manually stop node02's ResourceManager: sbin/yarn-daemon.sh stop resourcemanager
  6. On the HTTP port 8088 UI, check the state of node01's ResourceManager
  7. Manually restart node02's ResourceManager: sbin/yarn-daemon.sh start resourcemanager
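
To check that the WordCount job actually produced output, the result directory can be listed and read back (the part file name below is typical but may differ):

hadoop fs -ls /wordcount/output
hadoop fs -cat /wordcount/output/part-r-00000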
