
Introduction to Hadoop/MapReduce

2017-09-09  张荣恩Sophia

What is MapReduce?

Parallel programming model for big data processing:

split the data into chunks

define the steps to process the chunks

process the chunks in parallel

Hadoop is a platform that implements MapReduce.
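
The three steps above can be sketched in plain Python without Hadoop. This is only an illustration of the model: threads on one machine stand in for cluster nodes, and the names split_into_chunks, process_chunk and run_parallel are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Step 1: split the data into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Step 2: the work done on one chunk (here: count its words)."""
    return sum(len(line.split()) for line in chunk)


def run_parallel(records, chunk_size=2, workers=4):
    """Step 3: process the chunks in parallel, then combine the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(process_chunk, chunks))
    return sum(partial_counts)


# Hypothetical sample input
lines = ["big data is big", "hadoop runs mapreduce", "map then reduce"]
total_words = run_parallel(lines)
```

On a real cluster the chunks would be blocks of a file stored across machines, and the framework would handle scheduling and failures.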

1. Map

<key1, value1>  -> <key2, value2>

e.g. <line#, text string> -> <word, count>

After mapping, the output is passed to the Reduce phase.
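
A minimal sketch of such a map step for word counting, written as an ordinary Python function rather than against the Hadoop API. The function name map_line and the choice to lowercase words are assumptions made for the example.

```python
def map_line(line_number, text):
    """Map <line#, text string> to a list of <word, count> pairs."""
    # The input key (line_number) is not needed for word counting,
    # so only the text contributes to the output.
    return [(word.lower(), 1) for word in text.split()]


# Hypothetical input record
pairs = map_line(1, "Hadoop runs MapReduce on Hadoop")
```

Each word is emitted with a count of 1; adding up the counts for repeated words is left to the Reduce phase.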

2. Reduce

Merges/reduces the output of the Map phase. The Reduce phase is optional.

The output of MapReduce can be printed, summed, counted, loaded into a database, or sent to the next MapReduce job.
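
A small sketch of the reduce step for the word count example, again in plain Python. Between the two phases the framework groups all values that share a key; here that grouping is imitated by a helper. The names group_by_key and reduce_counts are hypothetical.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key (done by the framework between Map and Reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce <word, [count, ...]> to a single <word, total> pair."""
    return word, sum(counts)


# Hypothetical output of the Map phase
mapped = [("hadoop", 1), ("runs", 1), ("hadoop", 1)]
totals = dict(reduce_counts(w, c) for w, c in group_by_key(mapped).items())
```

The resulting totals could then be printed, stored, or fed into another job, as listed above.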

Idea: MapReduce, and storage for massive unstructured data

Physical: Java classes for MapReduce and the Hadoop Distributed File System (HDFS)

Hadoop Operational Modes

Java MapReduce Mode: reads records incrementally

Streaming Mode: any language; input can be a line or a stream
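
In Streaming Mode the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below follows that convention for word counting. The function names are made up, and the local dry run uses StringIO objects and sorted() in place of the real input, output and sort step.

```python
import io


def run_mapper(lines, out):
    """Streaming-style mapper: read text lines, write 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")


def run_reducer(lines, out):
    """Streaming-style reducer: expects input lines sorted by key."""
    current_word, current_total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            # Key changed: emit the finished total for the previous word
            if current_word is not None:
                out.write(f"{current_word}\t{current_total}\n")
            current_word, current_total = word, 0
        current_total += int(count)
    if current_word is not None:
        out.write(f"{current_word}\t{current_total}\n")


# Local dry run with hypothetical input
mapped = io.StringIO()
run_mapper(["map reduce map\n", "reduce map\n"], mapped)
reduced = io.StringIO()
run_reducer(sorted(mapped.getvalue().splitlines(keepends=True)), reduced)
result = reduced.getvalue()
```

In an actual streaming job each function would live in its own script and be called with sys.stdin and sys.stdout.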

MapReduce and HDFS

Query Languages for Hadoop

Built on core Hadoop to enhance the development and manipulation of Hadoop clusters

Pig: data flow language and execution environment

Hive (HiveQL): query language based on SQL for building MapReduce jobs

HBase: column-oriented database

Pig (data flow language: Pig Latin)

2 Execution environment modes:

Local file system

MapReduce in Hadoop environment

Suitable for large dataset and batch processing
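
To give a feel for what "data flow" means, the sketch below mimics in plain Python the kind of step-by-step pipeline a Pig Latin script expresses with operators such as FILTER, GROUP and ORDER. This is not Pig code, and the records and threshold are invented for the example.

```python
from collections import Counter

# Hypothetical records: (user, bytes_sent)
records = [("ann", 120), ("bob", 30), ("ann", 75), ("cy", 5)]

# Roughly what FILTER does: keep rows that satisfy a condition
large = [r for r in records if r[1] >= 30]

# Roughly what GROUP ... BY plus a per-group SUM does
totals = Counter()
for user, size in large:
    totals[user] += size

# Roughly what ORDER ... BY does: sort groups by total, largest first
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Each statement takes the result of the previous one as its input, which is the data flow style; in the Hadoop execution mode Pig would turn such a script into MapReduce jobs.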
