Hadoop-related knowledge
Hadoop architecture
RHadoop: use the R programming language for statistical data processing
Mahout: machine learning tools
Hive and Pig: analyse data without writing MapReduce code (Hive with a SQL-like language, Pig with a scripting language)
Sqoop: moves data into and out of the system
Core Hadoop
The two core parts of Hadoop:
HDFS: Hadoop Distributed File System -> stores data
MapReduce: processes data
Hadoop's surrounding ecosystem (software that works together with Hadoop):
Hadoop ecosystem
Software that works alongside Hadoop, designed to make Hadoop easier to use.
MapReduce code can be written in Java, Python, Ruby, Perl, or even expressed through SQL.
Hive: interprets SQL-like queries (e.g. SELECT * FROM ...) into MapReduce jobs.
Pig: lets you analyse data in a simple scripting language instead of writing MapReduce directly (the script is compiled into MapReduce jobs and run on the cluster).
Impala: queries the data directly with SQL, without going through MapReduce, giving low-latency queries; it typically runs faster than Hive.
Getting data into HDFS from outside
Sqoop: takes data from a traditional relational database, such as Microsoft SQL Server, and puts it into HDFS, so the data can be processed along with other data on the cluster.
Flume: ingests data as it is generated by external systems and puts it into the cluster.
HBase: a real-time database built on top of HDFS
Hue: graphical front-end to the cluster
Oozie: a workflow management tool
Mahout: a machine learning library
Cloudera offers a distribution of Hadoop called CDH (free and open source), which packages the tools of the Hadoop ecosystem together.
HDFS and MapReduce
HDFS:
A large file is stored by splitting the data into several blocks.
DataNode: a daemon that stores data blocks; a cluster (an HDFS) has several DataNodes, each holding blocks.
NameNode: a daemon that stores metadata about which blocks make up the original file.
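A toy Python illustration (not real HDFS code; the block size and names are invented for the example) of how the pieces relate: the file becomes blocks, the DataNodes hold the block contents, and the NameNode holds only the metadata that says which blocks make up the file.

# Toy model of HDFS storage -- illustration only, not the real HDFS API.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB here; real clusters commonly use 128 MB

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * (150 * 1024 * 1024))   # a "large file"

datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}        # DataNodes hold the block data
namenode_metadata = {"mydata.txt": []}               # the NameNode holds only metadata

for i, block in enumerate(blocks):
    block_id = "blk_{0}".format(i)
    namenode_metadata["mydata.txt"].append(block_id)
    datanodes["dn{0}".format(i % 3 + 1)][block_id] = block

print(namenode_metadata)                             # which blocks make up the file
print({dn: sorted(blks) for dn, blks in datanodes.items()})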
HDFS content
Where HDFS can fail
When a DataNode fails
HDFS replicates each block: every block is stored 3 times in HDFS. So if one DataNode fails, other DataNodes can still serve the data, and the blocks that were on the failed DataNode are re-replicated elsewhere.
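Continuing the toy model, a sketch of the replication idea (illustration only, not how HDFS is implemented): every block is placed on 3 DataNodes, and when one DataNode fails, its blocks are copied again from the surviving replicas.

# Toy sketch of 3x replication and re-replication -- illustration only.
REPLICATION = 3
datanodes = {"dn1": set(), "dn2": set(), "dn3": set(), "dn4": set()}
blocks = ["blk_0", "blk_1", "blk_2"]

names = sorted(datanodes)
for i, blk in enumerate(blocks):
    for r in range(REPLICATION):
        datanodes[names[(i + r) % len(names)]].add(blk)   # 3 copies of each block

def handle_datanode_failure(failed):
    lost = datanodes.pop(failed)                          # that node's copies are gone
    for blk in lost:
        survivors = [dn for dn in datanodes if blk in datanodes[dn]]
        spares = [dn for dn in datanodes if blk not in datanodes[dn]]
        if survivors and spares:
            datanodes[spares[0]].add(blk)                 # re-replicate from a survivor

handle_datanode_failure("dn1")
print(datanodes)   # every block is back to 3 replicas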
When the NameNode fails
This is a single point of failure. One solution is NFS (network file system): store the metadata on a remote disk as well, so that if the NameNode loses all of its data, a copy of the metadata still exists on the network.
The other solution: run 2 NameNodes, so a second one can take over if the first fails.
Basic Hadoop commands
HDFS is manipulated with Unix-like commands.
In the terminal:
hadoop fs -ls //show all files' information
hadoop fs -put purchases.txt //put purchases.txt into HDFS
hadoop fs -tail purchases.txt //show last few lines of purchases.txt
hadoop fs -cat purchases.txt //show entire contents of the file
hadoop fs -mv purchases.txt newname.txt //rename
hadoop fs -rm newname.txt // delete txt file
hadoop fs -mkdir myinput // create a directory in HDFS named myinput
hadoop fs -put purchases.txt myinput //upload txt to the new directory
MapReduce
The file is divided into chunks, which are then processed in parallel.
Intermediate data is stored as <Key, Value> pairs.
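A minimal Hadoop Streaming mapper sketch in Python, tied to the purchases.txt file used in the commands above. The column layout assumed here (tab-separated: date, time, store, item, cost, payment) is an assumption for the example, not taken from the notes; adjust the indices for the real file. The mapper reads lines from stdin and emits one tab-separated <key, value> pair per line.

#!/usr/bin/env python3
# mapper.py -- a hedged sketch of a Hadoop Streaming mapper.
# Assumption: purchases.txt is tab-separated as
#   date  time  store  item  cost  payment
# (adjust the indices below if the real layout differs).
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) == 6:
        store, cost = fields[2], fields[4]
        # Hadoop Streaming expects "key<TAB>value" on stdout.
        print("{0}\t{1}".format(store, cost))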
MapReduce process
How can the final results be in sorted order?
That leads to another question:
If there are only 2 reducers, which keys go to the first reducer?
We don't know: there is no guarantee that each reducer gets the same number of keys; one of them might even get none.
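The default behaviour is hash partitioning: each key is hashed, and the hash value decides which reducer receives it. A rough Python sketch of the idea (illustration only; Hadoop's real HashPartitioner works on the Java key's hashCode):

# Rough sketch of default hash partitioning -- illustration only.
NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    return hash(key) % num_reducers

for k in ["Miami", "NYC", "Chicago", "Anchorage"]:
    print(k, "-> reducer", partition(k))
# The keys are spread purely by hash value, so nothing guarantees the
# reducers get the same number of keys -- one could even get none.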
Daemons of MapReduce
The relationship is similar to that between the NameNode and the DataNodes.
A job is submitted to the JobTracker, which splits the work into mappers and reducers. The TaskTrackers run on the same machines as the DataNodes, so tasks can work on local data. If all the DataNodes holding a given block (the green block in the example) are busy, another node is chosen to process it and the block is streamed over the network (this happens rather rarely).
The mappers read their input data and produce intermediate data, which the Hadoop framework passes to the reducers (shuffle and sort). The reducers then process that data and write their final output back to HDFS.
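A matching reducer sketch (again a hedged example, not an official solution): Hadoop Streaming hands the reducer the mapper output already sorted by key, so it only has to detect key boundaries and aggregate; here it sums the sale amounts per store. The file names and paths in the run commands are placeholders.

#!/usr/bin/env python3
# reducer.py -- a hedged sketch of a Hadoop Streaming reducer.
# Input arrives as "key<TAB>value" lines, already sorted by key
# thanks to the shuffle-and-sort phase.
import sys

current_key = None
total = 0.0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key, _, value = line.partition("\t")
    if key != current_key:
        if current_key is not None:
            print("{0}\t{1}".format(current_key, total))
        current_key, total = key, 0.0
    total += float(value)

if current_key is not None:
    print("{0}\t{1}".format(current_key, total))

# Quick local test (no cluster needed):
#   cat purchases.txt | ./mapper.py | sort | ./reducer.py
# On the cluster (the streaming jar path depends on the installation):
#   hadoop jar /path/to/hadoop-streaming.jar \
#       -files mapper.py,reducer.py \
#       -mapper mapper.py -reducer reducer.py \
#       -input myinput -output joboutput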
Job Tracker & Task Trackers
Code for running a job
Java / Python
Thanks to Hadoop Streaming, the code can be written in pretty much any language.
Configuring a single-node cluster
The official tutorial is still the most reliable!!
https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/SingleCluster.html