Introduction to Hadoop/MapReduce
What is MapReduce?
Parallel programming model for big data processing:
split the data into chunks
define the steps to process each chunk
process the chunks in parallel
Hadoop is a platform that implements MapReduce.
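The three steps above can be sketched on a single machine. This is only an illustration of the model, not how Hadoop itself is implemented: it uses threads from Python's standard library, whereas Hadoop distributes chunks across the nodes of a cluster. The function names and the chunk size are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Step 1: split the data into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Step 2: the work applied to each chunk (here: sum the numbers)."""
    return sum(chunk)


def process_in_parallel(records, chunk_size=3, workers=4):
    """Step 3: process the chunks in parallel, then combine the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    print(process_in_parallel(list(range(1, 11))))  # 55
```

Each chunk is processed independently of the others, which is what makes the parallel step possible.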
1. Map
<key1, value1> -> <key2, value2>
e.g. <line#, text string> -> <word, count>
After mapping, the output is passed to the Reduce phase
2. Reduce
Merge/reduce the output of the Map phase. This phase is optional.
The output of MapReduce can be printed, summed, counted, loaded into a database, or sent to the next MapReduce job
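As a rough illustration of the two phases, the word-count example can be written in plain Python. This is a single-process sketch, not Hadoop code: the function names are assumptions, and the grouping step between Map and Reduce (which Hadoop performs for you) is written out by hand.

```python
from collections import defaultdict


def map_phase(line_number, text):
    """Map: <line#, text string> -> list of <word, 1> pairs."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group values by key so each reduce call sees one word and all its counts."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: <word, [count, ...]> -> <word, total>."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of text lines."""
    mapped = []
    for line_number, text in enumerate(lines, start=1):
        mapped.extend(map_phase(line_number, text))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog"]))
```

The map step never needs to see more than one line at a time, and the reduce step never needs to see more than one word at a time, so both can be spread over many machines.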
Idea: MapReduce, massive unstructured data storage
Physical: Java classes for MapReduce and the Hadoop Distributed File System (HDFS)
Hadoop Operational Modes
Java MapReduce Mode: reads records incrementally
Streaming Mode: any language; the input can be a line or a stream
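Because Streaming Mode works with any language, a mapper and reducer can be written in Python. The sketch below assumes the usual Hadoop Streaming convention of tab-separated `key<TAB>value` lines, with the reducer receiving its input already sorted by key. The functions take any iterable of lines so they can be tried locally; the names are illustrative.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; assumes the lines are already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual streaming job, each function would live in its own script that reads from standard input and writes to standard output, and the two scripts would be passed to Hadoop Streaming through its `-mapper` and `-reducer` options.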
Query Languages for Hadoop
Built on core Hadoop to enhance the development and manipulation of Hadoop clusters
Pig: data flow language and execution environment
Hive (HiveQL): query language based on SQL for building MapReduce jobs
HBase: column-oriented database
Pig (data flow language: Pig Latin)
Two execution modes:
Local file system
MapReduce in Hadoop environment
Suitable for large dataset and batch processing
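To give a feel for what "data flow" means, the sketch below mimics the style in Python: each step takes the result of the previous step and produces a new one (load, filter, group, aggregate). This is not Pig Latin and does not use Pig; the records, field names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical input records: (user, bytes_transferred)
RECORDS = [("ann", 120), ("bob", 30), ("ann", 80), ("cat", 5), ("bob", 70)]


def load(rows):
    """Step 1: load the raw records."""
    return list(rows)


def filter_rows(rows, min_bytes):
    """Step 2: keep only records at or above the threshold."""
    return [(user, size) for user, size in rows if size >= min_bytes]


def group_by_user(rows):
    """Step 3: group the sizes by user."""
    groups = defaultdict(list)
    for user, size in rows:
        groups[user].append(size)
    return groups


def total_per_group(groups):
    """Step 4: compute one total per group."""
    return {user: sum(sizes) for user, sizes in groups.items()}


def run_flow(rows, min_bytes=10):
    """Chain the steps: each one consumes the output of the one before."""
    return total_per_group(group_by_user(filter_rows(load(rows), min_bytes)))


if __name__ == "__main__":
    print(run_flow(RECORDS))
```

In MapReduce mode, a flow like this would be run as MapReduce jobs on the cluster rather than as in-memory function calls.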