A Brief Overview of HDFS

2018-08-08  须臾之北

The Hadoop Distributed Filesystem

1. Why HDFS?

2. The Design of HDFS

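2.1 Advantages

2.2 Disadvantages

* For small files, the seek time exceeds the read time, which runs against the design goals of HDFS.
* When an uploaded file is too small, the transfer itself takes only a few seconds, so a comparatively long seek time is wasteful (efficiency is best only when seek time and transfer time are in a suitable ratio).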
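
The seek-versus-transfer trade-off above can be put into numbers. The following is a minimal sketch, assuming a 10 ms seek time and a 100 MB/s transfer rate; both figures, and the class and method names, are illustrative assumptions and not values taken from this post or from the Hadoop API.

```java
// Hypothetical helper: the class and method names are illustrative, not part of Hadoop.
final class BlockSizeEstimate {

    /**
     * Smallest size (in bytes) for which the seek time is at most
     * maxSeekFraction of the transfer time.
     */
    static double minBlockSizeBytes(double seekSeconds, double bytesPerSecond, double maxSeekFraction) {
        // seek / transfer <= fraction, with transfer = size / rate,
        // gives size >= seek * rate / fraction.
        return seekSeconds * bytesPerSecond / maxSeekFraction;
    }

    /** Ratio of seek time to transfer time for a file of the given size. */
    static double seekToTransferRatio(double seekSeconds, double bytesPerSecond, double fileBytes) {
        double transferSeconds = fileBytes / bytesPerSecond;
        return seekSeconds / transferSeconds;
    }

    public static void main(String[] args) {
        double seek = 0.010; // assumed seek time: 10 ms
        double rate = 100e6; // assumed transfer rate: 100 MB/s

        // Size needed so that seeking costs about 1% of the transfer time.
        System.out.println(minBlockSizeBytes(seek, rate, 0.01));

        // Seek-to-transfer ratio for a 10 KB file.
        System.out.println(seekToTransferRatio(seek, rate, 10e3));
    }
}
```

Under these assumed figures, keeping seek time at about 1% of transfer time needs roughly 100 MB per read, while a 10 KB file spends about 100 times longer seeking than transferring. That is the sense in which small files work against the design.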

Block (interview question)

Benefits of the Block Abstraction
HDFS Architecture

3. Namenodes and Datanodes

Namenode

Datanode

4. The File System Namespace

Two Mechanisms Hadoop Provides for Namenode Recovery

5. Data Replication

5.1 Replica Placement: The First Baby Steps

5.2 Replica Selection

5.3 Safemode

5.3.1 The Persistence of File System Metadata

5.4 Data Disk Failure, Heartbeats and Re-Replication

5.5 Metadata Disk Failure

6. Block Caching

7. HDFS Federation

8. HDFS High Availability

8.1 HA Details

  1. In this implementation, there is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.

  2. The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode.

  3. There are two choices for the highly available shared storage: an NFS filer or a quorum journal manager (QJM). The QJM is a dedicated HDFS implementation whose sole purpose is to provide a highly available edit log, and it is the recommended choice for most HDFS installations.

  4. Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode's memory, and not on disk.

  5. The secondary namenode's role is subsumed by the standby, which takes periodic checkpoints of the active namenode's namespace.

  6. If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping.

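  1. The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active. Each namenode runs a lightweight failover controller process whose job is to monitor its namenode for failures and to trigger a failover if the namenode fails.

  2. The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage or causing corruption. This method is known as fencing.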
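
The failover sequence described in the points above (shared edit log, standby catch-up, fencing of the old active) can be sketched as a toy model. This is a simplified illustration only: `HaSketch`, `SharedEditLog`, `Namenode`, and `failover` are hypothetical names invented for this sketch, and they do not mirror Hadoop's real classes or its ZooKeeper-based controller.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every class and method here is hypothetical, not Hadoop's API.
final class HaSketch {

    enum State { ACTIVE, STANDBY, FENCED }

    /** Stands in for the highly available shared storage (NFS filer or QJM). */
    static final class SharedEditLog {
        private final List<String> entries = new ArrayList<>();
        void append(String op) { entries.add(op); }
        int size() { return entries.size(); }
        String get(int index) { return entries.get(index); }
    }

    static final class Namenode {
        final String name;
        State state;
        boolean healthy = true;
        final List<String> namespace = new ArrayList<>(); // in-memory state
        private final SharedEditLog log;
        private int applied = 0; // number of log entries already replayed

        Namenode(String name, State state, SharedEditLog log) {
            this.name = name;
            this.state = state;
            this.log = log;
        }

        /** Only the active namenode may write; standby and fenced nodes are rejected. */
        boolean write(String op) {
            if (state != State.ACTIVE) {
                return false;
            }
            log.append(op);
            catchUp();
            return true;
        }

        /** Replay any shared edit log entries not yet applied to the in-memory state. */
        void catchUp() {
            while (applied < log.size()) {
                namespace.add(log.get(applied));
                applied++;
            }
        }
    }

    /** Failover controller step: fence the failed active, then promote the standby. */
    static Namenode failover(Namenode active, Namenode standby) {
        if (active.healthy) {
            return active; // no failure detected, nothing to do
        }
        active.state = State.FENCED; // fencing: the old active can no longer write
        standby.catchUp();           // read to the end of the shared edit log
        standby.state = State.ACTIVE;
        return standby;
    }

    public static void main(String[] args) {
        SharedEditLog log = new SharedEditLog();
        Namenode nn1 = new Namenode("nn1", State.ACTIVE, log);
        Namenode nn2 = new Namenode("nn2", State.STANDBY, log);
        nn1.write("mkdir /a");
        nn1.healthy = false; // simulate a failure of the active namenode
        Namenode current = failover(nn1, nn2);
        System.out.println(current.name + " is now " + current.state);
    }
}
```

The ordering inside `failover` is the point of the sketch: the old active is fenced before the standby is promoted, so at no moment can two namenodes both accept writes.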

9. The Java Interface

Reading Data from a Hadoop URL

10. Data Flow: Anatomy of a File Read


11. Network Topology and Hadoop


12. Data Flow: Anatomy of a File Write
