2018-01-24 6 HDFS Architecture a

2018-01-25 鸭鸭学语言

Architecture

Summary:

HDFS is a scalable distributed filesystem. Hadoop distributes big data as blocks stored on the local disks of the nodes, close to where the computation runs. The nodes consist of heterogeneous, low-cost commodity hardware.

Key point of design:

Distribute data as blocks across a scalable set of data nodes.

Features:

High data availability, achieved by replicating data across different nodes.

Simplified coherency model - write once, read many.

Move computation close to the data.

Relaxed POSIX requirements - increases throughput.

Architecture:

Name Node - manages the file system namespace and regulates client access to files.

Data Nodes - manage storage; serve read/write requests from clients; perform block creation, deletion, and replication based on instructions from the Name Node.
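The split of responsibilities above can be illustrated with a toy model. This is only a sketch: the class name `ToyNameNode`, the replication factor of 3, and the round-robin placement are assumptions made for illustration and do not reflect the real HDFS placement policy.

```python
class ToyNameNode:
    """Holds metadata only: file path -> block ids, block id -> data nodes."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        self.replication = replication  # assumed replication factor
        self.namespace = {}             # path -> [block_id, ...]
        self.block_map = {}             # block_id -> [data node name, ...]
        self._next_block_id = 0
        self._next_node = 0

    def allocate_block(self, path):
        """Register a new block for `path` and pick the nodes that will store it."""
        block_id = self._next_block_id
        self._next_block_id += 1
        n = len(self.data_nodes)
        # Round-robin placement (an assumption, chosen for simplicity).
        targets = [self.data_nodes[(self._next_node + i) % n]
                   for i in range(min(self.replication, n))]
        self._next_node = (self._next_node + 1) % n
        self.namespace.setdefault(path, []).append(block_id)
        self.block_map[block_id] = targets
        return block_id, targets

    def locate(self, path):
        """Tell a client which data nodes hold each block of `path`."""
        return [(b, self.block_map[b]) for b in self.namespace[path]]


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
name_node.allocate_block("/logs/day1")
name_node.allocate_block("/logs/day1")
print(name_node.locate("/logs/day1"))
```

Note that the toy Name Node never touches file contents; a client would use the answer from `locate` to read the blocks directly from the data nodes.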

Performance Envelope

Every block is represented as an object in the Name Node's memory.

The default block size is 64 MB (the Hadoop 1.x default; Hadoop 2.x and later use 128 MB). The file size determines how many blocks are created, which in turn:

impacts memory usage and network load from the namespace perspective

impacts the number of map tasks, each of which processes a block, and in turn disk I/O performance.
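A small calculation makes the effect concrete. It uses the 64 MB block size from the notes; the figure of roughly 150 bytes of Name Node memory per object is a commonly quoted rule of thumb and is an assumption here, as are the function names.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # block size taken from the notes
BYTES_PER_OBJECT = 150         # assumed rule of thumb, not from the notes


def block_count(file_size):
    """Number of blocks needed for a file of `file_size` bytes (ceiling division)."""
    return -(-file_size // BLOCK_SIZE)


def namenode_objects(file_sizes):
    """One object per file plus one object per block."""
    return sum(1 + block_count(size) for size in file_sizes)


GIB = 1024 ** 3
MIB = 1024 ** 2

one_big_file = [GIB]               # 1 GiB stored as a single file
many_small_files = [MIB] * 1024    # the same 1 GiB stored as 1024 files of 1 MiB

for label, sizes in [("one big file", one_big_file),
                     ("many small files", many_small_files)]:
    objects = namenode_objects(sizes)
    blocks = sum(block_count(s) for s in sizes)
    print(label, "objects:", objects,
          "approx. bytes of metadata:", objects * BYTES_PER_OBJECT,
          "blocks (and map tasks, assuming one per block):", blocks)
```

Under these assumptions the same amount of data costs 17 objects as one file but 2048 objects as small files, and the number of map tasks grows from 16 to 1024.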

How to improve performance:

- merge small files

- use sequence files

- HBase, Hive configuration

- CombineFileInputFormat
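The idea shared by these options is to make one unit of work cover many small files. The sketch below packs small files greedily into splits of bounded size, which is the general idea behind combining inputs; it is not Hadoop's CombineFileInputFormat implementation, and the function name and file sizes are made up for illustration.

```python
def combine_splits(files, max_split_size):
    """Greedily group (name, size) pairs into splits of at most max_split_size.

    A file larger than max_split_size ends up in a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in files:
        # Start a new split when adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical file sizes in MB, with a 64 MB split limit.
small_files = [("a", 30), ("b", 30), ("c", 30), ("d", 70), ("e", 10)]
print(combine_splits(small_files, 64))
```

With one map task per split instead of one per file, fewer tasks are launched for the same data.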

Write/Replication/Read Processes on HDFS

Initially, data is cached in a client-side buffer until it reaches one block size; only then is it sent on to the cluster.
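A minimal sketch of this client-side buffering, assuming a hypothetical `send_block` callback that stands in for the transfer to a data node (the class name and the tiny block size are also illustrative):

```python
class BufferingClient:
    """Accumulates written bytes and hands them on one full block at a time."""

    def __init__(self, block_size, send_block):
        self.block_size = block_size
        self.send_block = send_block  # hypothetical callback, not a real HDFS API
        self._buffer = bytearray()

    def write(self, data):
        self._buffer.extend(data)
        # Flush every complete block; keep the remainder buffered.
        while len(self._buffer) >= self.block_size:
            block = bytes(self._buffer[:self.block_size])
            del self._buffer[:self.block_size]
            self.send_block(block)

    def close(self):
        # The last block of a file may be smaller than the block size.
        if self._buffer:
            self.send_block(bytes(self._buffer))
            self._buffer.clear()


sent = []
client = BufferingClient(block_size=4, send_block=sent.append)
client.write(b"abc")       # below one block: nothing is sent yet
client.write(b"defghi")    # two full blocks are sent, one byte stays buffered
client.close()             # the remainder is sent as a final, shorter block
print(sent)
```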

lesson 6 - slides

HDFS command list

HDFS Architecture (official document)
