What is New in Hadoop 3? Explore the Unique Hadoop 3 Features
The release of Hadoop 3.x is the next big milestone in the line of Hadoop releases. Many people wonder what feature enhancements Hadoop 3.x gives over Hadoop 2.x. So in this blog, we will take a look at what is new in Hadoop 3 and how it differs from the old versions.
What’s New in Hadoop 3?
Below are the 10 changes made in Hadoop 3 that make it unique and fast. Have a look at what’s new in Hadoop 3.x –
You must check – The Essential Guide to Learn Hadoop 3
1. The minimum version of Java supported in Hadoop 3.0 is JDK 8
They have compiled all the Hadoop jar files against the Java 8 runtime version. Users now have to install Java 8 to use Hadoop 3.0, and users on JDK 7 have to upgrade to JDK 8.
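As a quick sanity check before the upgrade, you can confirm the JVM on each node; a minimal sketch follows (the JAVA_HOME path is just an example for a typical Linux install):

```bash
# Confirm the JVM is JDK 8 or later before installing Hadoop 3.0
java -version 2>&1 | head -n 1   # expect something like: java version "1.8.0_xxx"

# Hadoop picks up the JVM from JAVA_HOME, so point it at a JDK 8 install
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example path; adjust for your system
```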
2. HDFS Supports Erasure Coding
Hadoop 3.x uses erasure coding for providing fault tolerance. Hadoop 2.x uses a replication technique to provide the same level of fault tolerance. Let us explore the difference between the two.
First, we will look at replication. Let us take the default replication factor of 3. Then for 6 blocks we have to store a total of 6*3, i.e. 18 blocks. For every replica of a block the storage overhead is 100%. Hence in our case, the storage overhead will be 200%.
Let us see what happens in erasure coding. For 6 blocks, 3 parity blocks get calculated. We call this process encoding. Now whenever a block goes missing or gets corrupted, it gets recalculated from the remaining blocks and the parity blocks. We call this process decoding. In this case, we store a total of 9 blocks for 6 blocks of data, making the storage overhead 50%. Hence we can achieve the same level of fault tolerance with much less storage. But there is always an overhead in terms of CPU and network for the encoding and decoding process. Thus erasure coding suits rarely accessed data.
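The built-in RS-6-3 policy matches this 6-data/3-parity example. Here is a minimal sketch using the hdfs ec sub-commands of Hadoop 3 (the /data/archive path is hypothetical):

```bash
# List the erasure coding policies known to the cluster
hdfs ec -listPolicies

# Enable the Reed-Solomon policy with 6 data blocks and 3 parity blocks
hdfs ec -enablePolicy -policy RS-6-3-1024k

# Apply the policy to a directory of rarely accessed data
hdfs ec -setPolicy -path /data/archive -policy RS-6-3-1024k

# Verify which policy the directory carries
hdfs ec -getPolicy -path /data/archive
```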
Recommended Reading – Hadoop Master-Slave Architecture
3. YARN Timeline Service v.2
Yarn Timeline Service v.2 is new in Hadoop 3. The Timeline server is responsible for storage and retrieval of the application’s current and historical information. This information is of two types –
Generic information about the completed application:

- Name of the queue
- User information
- Number of attempts per application
- Information about the containers which ran for each attempt

The ResourceManager stores this generic data about a completed application, and the web UI accesses it.
Per-framework information about running and completed applications:

- Number of map tasks
- Number of reduce tasks
- Counters
- Information published by application developers to the Timeline Server via the Timeline client
This data gets queried via REST API for rendering by application- or framework-specific UIs.
The Timeline Server v.2 addresses major shortcomings of version v.1. One of the issues is scalability. Timeline Server v.1 has a single instance of reader/writer and storage. It is not scalable beyond a small number of nodes. In version v.2, the Timeline Server has a distributed writer architecture and scalable backend storage. It separates the collection (writes) of data from the serving (reads) of data. Also, it uses one collector per YARN application. The reader is a separate instance which serves query requests via REST API. Timeline Server v.2 uses HBase for storage, which can scale to a huge size while giving good response times for reads and writes.
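To turn Timeline Service v.2 on, the YARN documentation describes a pair of yarn-site.xml properties; a sketch follows (the heredoc is illustrative only — merge the properties into the <configuration> element of your actual yarn-site.xml):

```bash
# Illustrative: these <property> elements belong inside <configuration>
cat >> "$HADOOP_CONF_DIR/yarn-site.xml" <<'EOF'
<property>
  <name>yarn.timeline-service.version</name>
  <value>2.0f</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
EOF

# Start the separate reader instance that serves REST queries
yarn --daemon start timelinereader
```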
4. Support for Opportunistic Containers and Distributed Scheduling
Hadoop 3 has introduced the concept of an execution type for containers. Opportunistic containers can be dispatched to a NodeManager even if there are no resources available at the moment; they wait in a queue at the NodeManager until resources free up. Opportunistic containers have lower priority than guaranteed containers. If guaranteed containers arrive in the middle of the execution of opportunistic containers, the latter get preempted to make room for the guaranteed containers.
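A sketch of the relevant switches, using the property names from the YARN opportunistic-container documentation (the queue-length value is an arbitrary example; merge into your real yarn-site.xml):

```bash
cat >> "$HADOOP_CONF_DIR/yarn-site.xml" <<'EOF'
<!-- Allow the ResourceManager to allocate opportunistic containers -->
<property>
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<!-- Cap how many opportunistic containers may queue at each NodeManager -->
<property>
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>
EOF
```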
5. Support for More Than Two NameNodes
Till now Hadoop supported a single active NameNode and a single standby NameNode. With edits replicated to three JournalNodes, this architecture tolerated the failure of one NameNode.
But some situations require a higher level of fault tolerance. By configuring five JournalNodes we can have a system of three NameNodes. Such a system would tolerate the failure of two NameNodes. Thus by introducing support for more than two NameNodes, Hadoop 3.0 has made the system more highly available.
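A sketch of the hdfs-site.xml fragment for a three-NameNode nameservice (the nameservice ID, hostnames, and the five JournalNodes are placeholders; merge into <configuration>):

```bash
cat >> "$HADOOP_CONF_DIR/hdfs-site.xml" <<'EOF'
<!-- Three NameNodes under one HA nameservice -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>master1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>master2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn3</name>
  <value>master3.example.com:8020</value>
</property>
<!-- Five JournalNodes tolerate the loss of two -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485;jn4:8485;jn5:8485/mycluster</value>
</property>
EOF
```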
6. Default Ports of Multiple Services Changed
Prior to Hadoop 3.0, many Hadoop services had their default ports in the Linux ephemeral port range (32768-61000). Because of this, these services would often fail to bind at startup, as they conflicted with other applications.
They have moved the default ports of these services out of the ephemeral range. The services include the NameNode, Secondary NameNode, DataNode, and Key Management Server.
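You can check what a running cluster resolves these to with hdfs getconf. For example, the NameNode web UI default moved from 50070 to 9870, and the DataNode data-transfer port from 50010 to 9866 (values assume stock defaults and may differ on your cluster):

```bash
# Effective NameNode web UI address (Hadoop 2.x default 0.0.0.0:50070, Hadoop 3.x 0.0.0.0:9870)
hdfs getconf -confKey dfs.namenode.http-address

# Effective DataNode data-transfer address (was 0.0.0.0:50010, now 0.0.0.0:9866)
hdfs getconf -confKey dfs.datanode.address
```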
7. Intra-DataNode Balancer
A DataNode manages many disks. During a write operation, these disks get filled evenly. But when we add or remove a disk, it results in a significant skew. The HDFS balancer addresses inter-node data skew, not intra-node skew.
The intra-node balancer addresses this situation. The CLI command hdfs diskbalancer invokes this balancer.
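A typical run looks like this (the DataNode hostname is a placeholder; the plan file path is printed by the -plan step):

```bash
# Generate a plan describing how to redistribute blocks across the node's disks
hdfs diskbalancer -plan datanode1.example.com

# Execute the generated plan (use the JSON path printed by the previous step)
hdfs diskbalancer -execute /system/diskbalancer/<creation-date>/datanode1.example.com.plan.json

# Track the progress of the data movement
hdfs diskbalancer -query datanode1.example.com
```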
8. Daemon and Task Heap Management Reworked
There are a number of changes in the Heap management of daemons and Map-Reduce tasks:
There are new ways to configure daemon heap sizes. The system auto-tunes based on the memory of the host. The HADOOP_HEAPSIZE variable is no longer used. In its place, we have the HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN variables. Also, they have removed the internal variable JAVA_HEAP_MAX. They have also removed the default heap sizes, which allows for auto-tuning by the JVM. All global and daemon heap-size variables support units; if the variable is only a number then it expects the size to be in megabytes. Also, if you want to enable the old default, configure HADOOP_HEAPSIZE_MAX in hadoop-env.sh.
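A sketch of hadoop-env.sh under the new scheme (the sizes are arbitrary examples):

```bash
# hadoop-env.sh -- Hadoop 3 heap settings
# Units are supported; a bare number is interpreted as megabytes.
export HADOOP_HEAPSIZE_MAX=4g     # replaces the old HADOOP_HEAPSIZE
export HADOOP_HEAPSIZE_MIN=1g
# Leave both unset to let the JVM auto-tune from the host's memory.
# Per-daemon overrides remain possible, e.g. for the NameNode:
export HDFS_NAMENODE_OPTS="-Xms2g -Xmx4g"
```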
If the value for mapreduce.map/reduce.memory.mb is set to the default of -1, then it is automatically inferred from the Xmx value specified in mapreduce.map/reduce.java.opts. Xmx is nothing but the heap-size system property. The reverse is also possible: suppose no Xmx value is specified in the mapreduce.map/reduce.java.opts keys; then the system derives it from the mapreduce.map/reduce.memory.mb keys. If we specify neither value, the default is 1024MB. Configuration and job code which specify these values explicitly are not affected.
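To illustrate the inference, here is a hypothetical submission that sets only the Xmx side and leaves the memory.mb keys at their -1 default (assumes the job driver uses ToolRunner so -D properties apply; the heap-to-container scaling is governed by mapreduce.job.heap.memory-mb.ratio, 0.8 by default):

```bash
# -Xmx1638m with the 0.8 ratio yields an inferred container size of 2048 MB
yarn jar my-app.jar com.example.MyJob \
  -D mapreduce.map.java.opts=-Xmx1638m \
  -D mapreduce.reduce.java.opts=-Xmx3276m \
  /input /output
```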
Have a look at Hadoop Ecosystem and its Components
9. Generalization of Yarn Resource Model
They have generalized the Yarn resource model to include user-defined resources apart from CPU and memory. These user-defined resources can be software licenses, GPUs, or locally attached storage. Yarn tasks get scheduled on the basis of these resources.
We can extend the Yarn resource model to include arbitrary “countable” resources. A countable resource is one which gets consumed by a container and which the system releases after completion. Both CPU and memory are countable resources. Likewise, GPUs (Graphics Processing Units) and software licenses are countable resources too. By default, Yarn tracks CPU and memory for each node, application, and queue. Yarn can be extended to track other user-defined countable resources like GPUs and software licenses. The integration of GPUs with containers has enhanced the performance of Data Science and AI use cases.
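A sketch of declaring a custom countable resource; the resource name licenses is made up, while the property names follow the YARN resource-model documentation (merge each fragment into the <configuration> element of the respective file):

```bash
# resource-types.xml on the ResourceManager and clients: declare the resource
cat >> "$HADOOP_CONF_DIR/resource-types.xml" <<'EOF'
<property>
  <name>yarn.resource-types</name>
  <value>licenses</value>
</property>
EOF

# node-resources.xml on each NodeManager: how many units this node offers
cat >> "$HADOOP_CONF_DIR/node-resources.xml" <<'EOF'
<property>
  <name>yarn.nodemanager.resource-type.licenses</name>
  <value>4</value>
</property>
EOF
```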
10. Consistency and Metadata Caching for S3A Client
The S3A client now has the capability to store metadata for files and directories in a fast and consistent way. It does this by using a DynamoDB table. We refer to this new feature as S3Guard. It caches directory information so that the S3A client gets faster lookups. Also, it provides resilience against inconsistencies between S3 list operations and the status of objects. When files get created with S3Guard enabled, we can always find them. S3Guard is experimental and we should consider it unstable.
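A sketch of switching it on (the DynamoDB table and bucket names are placeholders; merge the property into <configuration> of core-site.xml):

```bash
# core-site.xml: back S3A metadata with DynamoDB
cat >> "$HADOOP_CONF_DIR/core-site.xml" <<'EOF'
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
EOF

# Create the DynamoDB table that backs the metadata store
hadoop s3guard init -meta dynamodb://my-s3guard-table s3a://my-bucket/
```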
So, we have explored many new features of Hadoop 3 that make it unique and popular.
Summary
As we have progressed along different versions of Hadoop, it gets better and better. The developers have incorporated many changes to fix bugs, make it more user-friendly, and give it enhanced features. The changes made to the default ports of various Hadoop services have made it more convenient to use. Hadoop 3 includes various feature enhancements like erasure coding, the introduction of Timeline Service v.2, the adoption of the intra-node balancer, and so on. These changes have increased the chances of adoption of Hadoop by the industry. You must read the top Hadoop questions related to the latest version of Hadoop.
Share your feedback on what’s new in Hadoop 3 via comments.