Setting Up Hadoop in Standalone Mode and Running Its Built-in Examples

2019-08-25  波洛的汽车电子世界

There are three installation modes: standalone, pseudo-distributed, and fully distributed.
Standalone mode is the default: no configuration changes are needed (only JAVA_HOME) and no daemons have to be started.
Pseudo-distributed mode also runs on a single machine, but you must edit the configuration (JAVA_HOME, the directory for the pid files, core-site, hdfs-site) and start the corresponding daemons.
Distributed mode is the full setup: it needs several machines, each properly configured and running its daemons.

Installation

  1. ssh localhost
    If that fails, generate a key pair:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
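    After appending the key, it is worth checking that passwordless login now works (a quick sanity sketch; on macOS, Remote Login must also be enabled in System Preferences > Sharing, and sshd may refuse an authorized_keys file with loose permissions):
$ chmod 600 ~/.ssh/authorized_keys   # tighten permissions, in case sshd rejects the file otherwise
$ ssh localhost                      # should log in without asking for a password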
  2. brew install hadoop
  3. Check which version was installed: run hadoop version in a terminal; it shows Hadoop was installed under /usr/local/Cellar/hadoop/3.1.2.
  4. Set JAVA_HOME in /usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop/hadoop-env.sh.
    Find the Java installation path: run /usr/libexec/java_home -V in a terminal,
    which gives /Library/Java/JavaVirtualMachines/openjdk-12.0.2.jdk/Contents/Home.
    Write this into JAVA_HOME and remove the leading #, so the line reads:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/openjdk-12.0.2.jdk/Contents/Home
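
To confirm the change took effect, you can re-run hadoop version (a quick check; if JAVA_HOME pointed to a bad path, Hadoop should complain about JAVA_HOME instead of printing the version banner):
$ hadoop version   # should print "Hadoop 3.1.2" with no JAVA_HOME errors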

Note: for standalone mode, this is the only configuration change you should make. Do not modify the other configuration files and then try to run standalone mode, or you will hit a connection refused error!
Next, let's run the standalone-mode example.

Standalone Mode

Source: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html
The page above provides a grep example. Suppose we want to find, across many files, the words that start with 'dfs' and how many times each appears. We need a directory holding those files, so we first create a directory named input and copy all of Hadoop's XML configuration files into it. Then we run the map-reduce job, which writes its results into output, and finally we print output.

Run

Huizhi$ cd /usr/local/Cellar/hadoop/3.1.2/libexec
libexec Huizhi$ mkdir input # create the input directory
libexec Huizhi$ cp etc/hadoop/*.xml input # copy the config files
libexec Huizhi$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+' # run the map-reduce job
libexec Huizhi$ cat output/* # print the output

To check the output, I added two extra words starting with dfs to the input. The final result is:

$ cat output/*
1   dfstwo
1   dfsone
1   dfsadmin

The pattern 'dfs[a-z.]+' matches any word that starts with dfs followed by one or more lowercase letters or dots, which is why dfsadmin and my two test words show up.
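
To see the effect of the pattern, you can re-run the job with a different regex. Map-reduce refuses to write into an existing output directory, so a fresh one is needed; output2 below is just a name chosen for this sketch:

libexec Huizhi$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output2 'yarn[a-z.]+'
libexec Huizhi$ cat output2/*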

I also ran into the error below. It appears because I had modified the other configuration files; either restore them to their defaults, or do what I did and uninstall and reinstall:

2019-08-23 16:14:16,220 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-23 16:14:17,048 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:9000
java.net.ConnectException: Call From HuizhiXu.local/172.16.233.171 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

After running this example I wanted to see what else the bundled share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar can do, so I ran

libexec Huizhi$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar 

which prints the list of available example programs.
To find out how to use one of them, run

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar <program name>
For example: hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar aggregatewordcount

which prints

usage: inputDirs outDir [numOfReducer [textinputformat|seq [specfile [jobName]]]]
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
This program counts how many times each word appears in the input files.
Note: when the input is plain text, it fails with the following error:
Caused by: java.io.IOException: file:/usr/local/Cellar/hadoop/3.1.2/libexec/input/capacity-scheduler.xml not a SequenceFile

This happens because aggregatewordcount can only read binary SequenceFiles; plain text first has to be converted into a SequenceFile with the Hadoop API.

aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
This one builds a histogram of word occurrences.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.

dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
Searches the input files with a regular expression and writes the results to the output.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
Format:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi <number of maps> <number of samples per map>
Example:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi 10 50
Result:

Job Finished in 2.973 seconds
Estimated value of Pi is 3.16000000000000000000

randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
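These three are usually run as a pipeline: teragen produces the data, terasort sorts it, and teravalidate checks the sorted output. A small local sketch (the row count and directory names are placeholders; teragen writes roughly 100-byte rows, so 1000 rows is only about 100 KB):
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar teragen 1000 tera-input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar terasort tera-input tera-output
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar teravalidate tera-output tera-report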
wordcount: A map/reduce program that counts the words in the input files.
This is the famous word-count program.
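It follows the same pattern as the grep example above; a minimal sketch (wc-output is just a new directory name and must not already exist):
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount input wc-output
cat wc-output/*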
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
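These last three take the same <in> <out> arguments as wordcount, for example (a sketch; wordmean-output is a placeholder directory name):
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordmean input wordmean-output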

Linux command notes:

  1. cd: changes the current working directory to dirName (the directory argument).
    Format: cd [dirName]
    "~" means the home directory, "." means the current directory, and ".." means the directory one level up.
    Examples: cd ~, cd ../..

  2. grep is a command-line tool that originated on Unix. Given a list of files or standard input, grep searches for text matching one or more regular expressions and prints only the matching lines or text. (Source: Wikipedia)
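    Example (a sketch that reuses the input directory created earlier):
grep -r 'dfs' input/    # print every line under input/ that contains the string dfs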

  3. mkdir: creates a directory named dirName.
    Format: mkdir dirName

  4. cp: copies files or directories.
    Format: cp [options] source dest
    The most commonly used option is -r.
    Example: $ cp -r test/ newtest copies all files and folders under test/ to newtest.
    scp is used like cp, except that it copies to or from another host over SSH and asks for a password.
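    Example (a sketch; user and remotehost are placeholders): scp -r test/ user@remotehost:/home/user/newtest copies the local test/ directory to the remote machine over SSH.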

References:
AggregateWordCount source code comments
MapReduce input formats
https://docs.microsoft.com/bs-latn-ba/azure/hdinsight/hadoop/apache-hadoop-run-samples-linux?view=netcore-2.0
