Apache Hadoop 3.0 New Features

2019-12-21  爱摄影Sure

Apache Hadoop 3.0.0

Apache Hadoop 3.0.0 incorporates a number of significant enhancements over the previous major release line (hadoop-2.x).

This release is generally available (GA), meaning that it represents a point of API stability and quality that we consider production-ready.

Overview

Users are encouraged to read the full set of release notes. This page provides an overview of the major changes.

Minimum required Java version increased from Java 7 to Java 8

All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still using Java 7 or below must upgrade to Java 8.

Support for erasure coding in HDFS

Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.

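
The quoted 1.4x and 3x figures follow directly from the encoding parameters, as a quick check shows (the helper function below is illustrative, not a Hadoop API):

```python
# Space overhead: raw storage consumed per unit of logical data.
def storage_overhead(data_units, parity_units):
    """Total blocks stored divided by logical data blocks."""
    return (data_units + parity_units) / data_units

# Reed-Solomon (10,4): 10 data blocks + 4 parity blocks.
print(storage_overhead(10, 4))  # 1.4
# Standard HDFS replication: 1 data copy + 2 extra replicas.
print(storage_overhead(1, 2))   # 3.0
```

The same arithmetic explains why RS(6,3), another common policy, lands at 1.5x.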
Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.

More details are available in the HDFS Erasure Coding documentation.
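
As a sketch of how the feature is driven from the command line (the path is an example and the policy name should be checked against `hdfs ec -listPolicies` on your release):

```shell
# Example workflow: enable a built-in policy and apply it to a directory.
hdfs ec -listPolicies                                      # show available policies
hdfs ec -enablePolicy -policy RS-10-4-1024k                # enable Reed-Solomon (10,4)
hdfs ec -setPolicy -path /data/cold -policy RS-10-4-1024k  # files written here are erasure-coded
hdfs ec -getPolicy -path /data/cold                        # verify the effective policy
```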

YARN Timeline Service v.2

We are introducing an early preview (alpha 2) of a major revision of YARN Timeline Service: v.2. YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers can test it and provide feedback and suggestions for making it a ready replacement for Timeline Service v.1.x. It should be used only in a test capacity.

More details are available in the YARN Timeline Service v.2 documentation.

Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.

Incompatible changes are documented in the release notes, with related discussion on HADOOP-9902.

More details are available in the Unix Shell Guide documentation. Power users will also be pleased by the Unix Shell API documentation, which describes much of the new functionality, particularly related to extensibility.

Shaded client jars

The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.

HADOOP-11804 adds new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop’s dependencies into a single jar. This avoids leaking Hadoop’s dependencies onto the application’s classpath.

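
An application can switch to the shaded artifacts roughly like this in its pom.xml (the version shown is illustrative; use the release you deploy against):

```xml
<!-- Compile against the shaded public API only. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.0.0</version>
</dependency>
<!-- Shaded third-party dependencies, needed only at runtime. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.0.0</version>
  <scope>runtime</scope>
</dependency>
```
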
Support for Opportunistic Containers and Distributed Scheduling.

A notion of ExecutionType has been introduced, whereby Applications can now request for containers with an execution type of Opportunistic. Containers of this type can be dispatched for execution at an NM even if there are no resources available at the moment of scheduling. In such a case, these containers will be queued at the NM, waiting for resources to be available for it to start. Opportunistic containers are of lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.

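
The feature is switched on through YARN configuration; a minimal sketch, assuming the property names from the YARN opportunistic-containers documentation (verify against your release):

```xml
<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Allow each NodeManager to queue up to 10 opportunistic containers. -->
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>
```
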
Opportunistic containers are by default allocated by the central RM, but support has also been added to allow opportunistic containers to be allocated by a distributed scheduler which is implemented as an AMRMProtocol interceptor.

Please see documentation for more details.

MapReduce task-level native optimization

MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.

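
A job can opt in to the native collector roughly as follows (class name per the NativeTask work in MAPREDUCE-2841; confirm it against your release before relying on it):

```xml
<!-- Per-job opt-in to the native map output collector. -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
</property>
```
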
See the release notes for MAPREDUCE-2841 for more detail.

Support for more than 2 NameNodes.

The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.

The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.
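
Configuration follows the existing HA pattern, extended with a third NameNode entry; a minimal sketch, assuming a nameservice named mycluster and example hostnames (each NameNode also needs its own rpc-address and http-address properties):

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<!-- One of the per-NameNode address properties, repeated for nn1 and nn2. -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn3</name>
  <value>nn3.example.com:8020</value>
</property>
```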

Default ports of multiple services have been changed.

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.

These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our documentation has been updated appropriately, but see the release notes for HDFS-9427 and HADOOP-12811 for a list of port changes.

Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors

Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as alternative Hadoop-compatible filesystems.

Intra-datanode balancer

A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.

This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.

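
The typical workflow is plan, execute, query (the hostname and plan path are examples; `-plan` prints the location of the plan file it generates):

```shell
# Generate a balancing plan for one DataNode, run it, then track progress.
hdfs diskbalancer -plan dn1.example.com
hdfs diskbalancer -execute /path/to/dn1.example.com.plan.json
hdfs diskbalancer -query dn1.example.com
```
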
Reworked daemon and task heap management

A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.

HADOOP-10950 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated. See the full release notes of HADOOP-10950 for more detail.

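
The new variables can be set in hadoop-env.sh; a sketch, with sizes chosen purely for illustration:

```shell
# hadoop-env.sh: HADOOP_HEAPSIZE is deprecated; the new variables accept units.
export HADOOP_HEAPSIZE_MAX=4g   # upper bound for daemon heaps (-Xmx)
export HADOOP_HEAPSIZE_MIN=1g   # lower bound for daemon heaps (-Xms)
# Leave both unset to let the JVM auto-tune based on the host's memory.
```
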
MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configs that already specify both are not affected by this change. See the full release notes of MAPREDUCE-5785 for more details.

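
In practice this means a job can set only the container size; a sketch, assuming the ratio property introduced by MAPREDUCE-5785 (default 0.8, shown here only to make the derivation explicit):

```xml
<!-- mapred-site.xml: set only the container size; the task -Xmx is derived from it. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<!-- Optional: fraction of container memory given to the task heap. -->
<property>
  <name>mapreduce.job.heap.memory-mb.ratio</name>
  <value>0.8</value>
</property>
```
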
S3Guard: Consistency and Metadata Caching for the S3A filesystem client

HADOOP-13345 adds an optional feature to the S3A client of Amazon S3 storage: the ability to use a DynamoDB table as a fast and consistent store of file and directory metadata.

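
Enabling it is a matter of S3A configuration; a minimal sketch, with property names as documented for S3Guard and an example table name:

```xml
<!-- core-site.xml -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table</name>
  <value>my-s3guard-table</value>
</property>
<property>
  <!-- Create the DynamoDB table on first use if it does not exist. -->
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
</property>
```
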
See S3Guard for more details.

HDFS Router-Based Federation

HDFS Router-Based Federation adds a RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except the mount table is managed on the server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients.

See HDFS-10467 and the HDFS Router-based Federation documentation for more details.

API-based configuration of Capacity Scheduler queue configuration

The OrgQueue extension to the capacity scheduler provides a programmatic way to change configurations by providing a REST API that users can call to modify queue configurations. This enables automation of queue configuration management by administrators in the queue’s administer_queue ACL.

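
A hypothetical invocation might look like the following (host, port, and payload file are illustrative; the endpoint shape should be checked against YARN-5734 and the Capacity Scheduler documentation for your release):

```shell
# Push a queue-configuration update to the ResourceManager REST API.
curl -X PUT -H "Content-Type: application/xml" \
     -d @update-queue.xml \
     http://rm.example.com:8088/ws/v1/cluster/scheduler-conf
```
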
See YARN-5734 and the Capacity Scheduler documentation for more information.

YARN Resource Types

The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, the cluster administrator could define resources like GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources.

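
A new resource type is declared on the ResourceManager and advertised per node; a minimal sketch, with "resource1" as a placeholder name in the style of the YARN resource model docs:

```xml
<!-- resource-types.xml on the RM: declare the new countable type. -->
<property>
  <name>yarn.resource-types</name>
  <value>resource1</value>
</property>

<!-- node-resources.xml on each NM: how many units this node offers. -->
<property>
  <name>yarn.nodemanager.resource-type.resource1</name>
  <value>4</value>
</property>
```
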
See YARN-3926 and the YARN resource model documentation for more information.

Getting Started

The Hadoop documentation includes the information you need to get started using Hadoop. Begin with the Single Node Setup which shows you how to set up a single-node Hadoop installation. Then move on to the Cluster Setup to learn how to set up a multi-node Hadoop installation.
