2018-01-24 6 HDFS Architecture a
Architecture
Summary:
HDFS is a scalable distributed filesystem. Hadoop distributes big data as blocks on local disks that sit close to the compute. Nodes consist of heterogeneous, low-cost commodity hardware.
Key design point:
Distribute data as blocks across a scalable set of data nodes.
Features:
High data availability through replication of data across different nodes.
Simplified coherency model - write once, read many.
Move computation close to the data.
Relaxed POSIX requirements - to increase throughput.
Architecture:
Name Node - manages the file system namespace and regulates client access to files.
Data Nodes - manage storage; serve read/write requests from clients; perform block creation, deletion, and replication on instruction from the Name Node.
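The split of responsibilities above can be sketched as a toy model. This is only an illustration, not Hadoop code: the class ToyNameNode, its methods, and the round-robin placement are all hypothetical (real HDFS placement is rack-aware). The point it shows is that the Name Node holds metadata only, while the block contents live on the Data Nodes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical model: holds only metadata, never file contents."""
    datanodes: list                                  # names of known Data Nodes
    replication: int = 3                             # replicas per block
    files: dict = field(default_factory=dict)        # path -> [block_id, ...]
    locations: dict = field(default_factory=dict)    # block_id -> [datanode, ...]
    _next_id: int = 0

    def add_file(self, path, num_blocks):
        """Register a file and choose Data Nodes for each of its blocks."""
        block_ids = []
        n = len(self.datanodes)
        for _ in range(num_blocks):
            block_id = self._next_id
            self._next_id += 1
            # Assumption: simple round-robin over distinct nodes.
            copies = min(self.replication, n)
            replicas = [self.datanodes[(block_id + i) % n] for i in range(copies)]
            self.locations[block_id] = replicas
            block_ids.append(block_id)
        self.files[path] = block_ids
        return block_ids

    def locate(self, path):
        """What a client asks for before reading: blocks and where they live."""
        return [(b, self.locations[b]) for b in self.files[path]]
```

A client would call locate() first and then fetch each block directly from one of the listed Data Nodes, so bulk data never flows through the Name Node.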
Performance Envelope
Every block is represented as an object in the Name Node's memory.
The default block size is 64 MB (128 MB since Hadoop 2.x). The file size determines how many blocks are created, which in turn:
impacts the memory usage and network load of the namespace
impacts the number of map tasks, since each task processes one block, and ultimately the disk I/O performance.
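The effect of file size on the namespace can be made concrete with a little arithmetic. The sketch below uses the 64 MB block size from these notes and assumes roughly 150 bytes of Name Node memory per object, which is a commonly quoted rule of thumb and not a measured value. The function names are made up for illustration.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB block size used in these notes
BYTES_PER_OBJECT = 150          # assumption: rough rule of thumb only


def block_count(file_size, block_size=BLOCK_SIZE):
    """Blocks needed for a file (integer ceiling division)."""
    return (file_size + block_size - 1) // block_size


def namespace_estimate(file_sizes, block_size=BLOCK_SIZE):
    """Return (objects, approx_bytes) tracked for the given files."""
    blocks = sum(block_count(size, block_size) for size in file_sizes)
    objects = len(file_sizes) + blocks   # one object per file, one per block
    return objects, objects * BYTES_PER_OBJECT


MB = 1024 * 1024
one_big_file = namespace_estimate([1024 * MB])     # 1 GB in a single file
many_small_files = namespace_estimate([MB] * 1024)  # 1 GB as 1024 files of 1 MB
```

Under these assumptions the single 1 GB file needs 16 blocks (17 objects), while the same data as 1024 small files needs 1024 blocks (2048 objects) and, with one map task per block, 1024 map tasks instead of 16.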
How to improve performance:
- merge small files
- use SequenceFiles
- HBase and Hive configuration
- CombineFileInputFormat
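The idea shared by merging small files and CombineFileInputFormat is to group many small inputs so that each group is about one block in size. The sketch below is a simple greedy grouping written for illustration. It is not the algorithm Hadoop uses, and pack_small_files is a hypothetical helper.

```python
def pack_small_files(files, max_bytes=64 * 1024 * 1024):
    """Greedily group (name, size) pairs so each group stays within max_bytes.

    A file larger than max_bytes ends up alone in its own group.
    """
    groups, current, used = [], [], 0
    for name, size in files:
        # Close the current group if this file would push it over the limit.
        if current and used + size > max_bytes:
            groups.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        groups.append(current)
    return groups
```

Each resulting group could then be written out as one merged file, or handled by a single map task, instead of paying the per-file and per-task overhead for every small input.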
Write/Replication/Read Processes on HDFS
Initially, data is cached in a client-side buffer until it reaches one block in size. Then: