

2019-07-29  胡巴Lei特

000 Hadoop Ecosystem and Its Components – A Complete Tutorial

1. Hadoop Ecosystem Components


The objective of this Apache Hadoop ecosystem components tutorial is to give an overview of the different components of the Hadoop ecosystem that make Hadoop so powerful, and thanks to which several Hadoop job roles are now available. We will also learn about Hadoop ecosystem components such as HDFS and the HDFS components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and the HBase components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper, and Apache Oozie, to dive deep into Big Data Hadoop and acquire master-level knowledge of the Hadoop ecosystem.


Hadoop Ecosystem and Its Components

2. Introduction to Hadoop Ecosystem


The figure above shows the different components of the Hadoop ecosystem. In this section, we are going to discuss these Hadoop components one by one in detail.


2.1. Hadoop Distributed File System


It is the most important component of the Hadoop ecosystem. HDFS is the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for Big Data. HDFS is a distributed file system that runs on commodity hardware. The default configuration is sufficient for many installations, although large clusters usually require tuning. Users interact directly with HDFS through shell-like commands.


HDFS Components:

Hadoop HDFS has two major components: NameNode and DataNode. Let's now discuss these HDFS components.


i. NameNode

It is also known as the Master node. The NameNode does not store the actual data or dataset. Instead, it stores metadata: the number of blocks, their locations, the rack and DataNode on which the data is stored, and other details. It consists of files and directories.


Tasks of HDFS NameNode:

- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as naming, opening, and closing files and directories.

ii. DataNode

It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations at the request of clients. Each replica block on a DataNode consists of two files on the local file system: the first holds the data, and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake that verifies the DataNode's namespace ID and software version. If a mismatch is found, the DataNode automatically shuts down.


Tasks of HDFS DataNode:

- Performs block creation, deletion, and replication according to the NameNode's instructions.
- Serves read and write requests from the file system's clients.

This was all about HDFS as a Hadoop Ecosystem component.

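Applications can also talk to HDFS programmatically through the FileSystem API, the same interface that backs the shell-like commands mentioned above. Below is a minimal Java sketch of a write followed by a read; the NameNode address (`hdfs://namenode:8020`) and the file path are assumptions to adapt to your cluster.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");

            // Write: the client asks the NameNode for target DataNodes,
            // then streams the block data to them directly.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations; bytes come from DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}
```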

Refer to the HDFS Comprehensive Guide to read about Hadoop HDFS in detail, then proceed with the Hadoop Ecosystem tutorial.


2.2. MapReduce

Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing. MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
MapReduce programs are parallel in nature, and are thus very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.



Hadoop MapReduce

Working of MapReduce

The Hadoop ecosystem component MapReduce works by breaking the processing into two phases:

- Map phase
- Reduce phase


Each phase has key-value pairs as input and output. In addition, the programmer specifies two functions: the map function and the reduce function.


The Map function takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs). Read about the Mapper in detail.
The Reduce function takes the output from the Map as input and combines those data tuples based on the key, modifying the value of the key accordingly. Read about the Reducer in detail.

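To make the map and reduce functions concrete, here is the classic word-count example, a minimal sketch using the Hadoop MapReduce Java API. The Mapper emits (word, 1) pairs, and the Reducer sums the counts for each key; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: break each input line into (word, 1) tuples.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the reducer is also registered as a combiner, partial sums are computed on the map side before any data crosses the network, which is one reason this pattern scales so well.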

Features of MapReduce:

- Simplicity: jobs can be written in languages such as Java, C++, or Python.
- Scalability: MapReduce can process petabytes of data across thousands of nodes.
- Fault tolerance: failed tasks are automatically re-executed on other nodes.
- Data locality: computation moves to where the data resides, reducing network traffic.

Refer to the MapReduce Comprehensive Guide for more details.


We hope this explanation of the Hadoop ecosystem is helpful to you. The next component we take up is YARN.


2.3. YARN

Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management. YARN is also one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.


Hadoop YARN Diagram

YARN has been positioned as the data operating system of Hadoop 2. Its main features are:

- Multi-tenancy: multiple processing engines share the same cluster and data.
- Cluster utilization: resources are allocated dynamically, improving utilization.
- Scalability: the scheduler is designed to scale to thousands of nodes.
- Compatibility: existing MapReduce applications run on YARN without changes.
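As a small illustration of YARN's role as resource manager, the sketch below uses the YarnClient Java API to ask the ResourceManager which applications are running. The ResourceManager address is an assumption; adjust it for your cluster.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        // Assumed ResourceManager address; replace with your cluster's value.
        conf.set(YarnConfiguration.RM_ADDRESS, "resourcemanager:8032");

        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();

        // Ask the ResourceManager, YARN's central scheduler, what is running.
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "\t" + app.getName()
                    + "\t" + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```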

Refer to the YARN Comprehensive Guide for more details.

2.4. Hive

The Hadoop ecosystem component Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, querying, and analysis.


Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.


Hive Diagram
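A common way to run HiveQL from a program is through the HiveServer2 JDBC driver. Below is a minimal sketch; the endpoint, credentials, and the `products` table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // The HiveServer2 JDBC driver; ships with hive-jdbc.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint, database, and credentials.
        String url = "jdbc:hive2://hiveserver2:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL; Hive compiles it into jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) AS cnt FROM products GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```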

The main parts of Hive are:

- Metastore: stores the metadata for tables and partitions.
- Driver: manages the lifecycle of a HiveQL statement.
- Compiler: compiles HiveQL into a DAG of tasks.
- Execution engine: runs the tasks produced by the compiler in dependency order.

Refer to the Hive Comprehensive Guide for more details.

2.5. Pig

Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS. Pig, as a component of the Hadoop ecosystem, uses the PigLatin language, which is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. To execute programs, Pig requires a Java runtime environment.


Pig Diagram
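Pig Latin scripts are usually run from the Grunt shell, but they can also be embedded in Java through the PigServer API. The sketch below loads a hypothetical `access.log`, filters it, and stores the result; local mode keeps the example self-contained.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        // Local mode keeps the sketch self-contained; use MAPREDUCE on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load, filter, and store data with Pig Latin, step by step.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') "
                + "AS (ip:chararray, status:int);");
        pig.registerQuery("errors = FILTER logs BY status >= 500;");
        pig.store("errors", "error_logs"); // triggers execution of the pipeline

        pig.shutdown();
    }
}
```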

Features of Apache Pig:

- Extensibility: users can write their own functions (UDFs) for special-purpose processing.
- Optimization opportunities: Pig optimizes query execution automatically, so users can focus on semantics rather than efficiency.
- Handles all kinds of data: Pig analyzes both structured and unstructured data.

Refer to Pig – A Complete Guide for more details.

2.6. HBase

Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access for reading and writing data in HDFS.


HBase Diagram

Components of HBase

There are two HBase components, namely HBase Master and RegionServer.


i. HBase Master

It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.


ii. RegionServer

It is the worker node that handles read, write, update, and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster, colocated with an HDFS DataNode.

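The sketch below shows the real-time read/write path with the standard HBase Java client: a Put followed by a Get against a hypothetical `users` table with an `info` column family. The ZooKeeper quorum address is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumed ZooKeeper quorum used by HBase; adjust for your cluster.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read it back in real time; the RegionServer hosting the row serves the request.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```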

Refer to the HBase Tutorial for more details.


2.7. HCatalog

It is a table and storage management layer for Hadoop. HCatalog allows different Hadoop ecosystem components, such as MapReduce, Hive, and Pig, to easily read and write data from the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats.


Benefits of HCatalog:

- Enables notifications of data availability.
- With the table abstraction, frees users from the overhead of managing data storage and format.
- Provides visibility for data cleaning and archiving tools.

2.8. Avro

Avro is a part of the Hadoop ecosystem and is a very popular data serialization system. Avro is an open-source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. Using Avro, Big Data programs written in different languages can exchange data.


Using the serialization service, programs can serialize data into files or messages. Avro stores the data definition together with the data in a single message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message.


Avro schema – Avro relies on schemas for serialization and deserialization; a schema is required both when writing and when reading data. When Avro data is stored in a file, its schema is stored with it, so the file can be processed later by any program.


Dynamic typing – This refers to serialization and deserialization without code generation. It complements the code generation that Avro offers for statically typed languages as an optional optimization.

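The sketch below illustrates both points: it serializes and deserializes a record with Avro's generic (dynamically typed) Java API, with no code generation, and the schema travels inside the container file. The inline `User` schema is just an example.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Schema defined inline; with generic records no code generation is needed.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

        File file = new File("users.avro");

        // Serialize: the schema is embedded in the container file alongside the data.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize: the reader recovers the schema from the file itself.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " " + rec.get("age"));
            }
        }
    }
}
```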

Features provided by Avro:

- Rich data structures.
- A compact, fast, binary data format.
- A container file for storing persistent data.
- Remote procedure call (RPC) support.
- Simple integration with dynamic languages.

2.9. Thrift

It is a software framework for scalable cross-language service development. Thrift is an interface definition language for RPC (remote procedure call) communication. Hadoop makes many RPC calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift for performance or other reasons.


Thrift Diagram

2.10. Apache Drill

The main purpose of this Hadoop ecosystem component is large-scale processing of data, including structured and semi-structured data. Drill is a low-latency distributed query engine designed to scale to several thousands of nodes and to query petabytes of data. Drill is the first distributed SQL query engine with a schema-free model.


Application of Apache Drill

Drill has become an invaluable tool at Cardlytics, a company that provides consumer purchase data for mobile and internet banking. Cardlytics uses Drill to quickly process trillions of records and execute queries.


Features of Apache Drill:

Drill has a specialized memory management system that eliminates garbage collection and optimizes memory allocation and usage. Drill plays well with Hive, allowing developers to reuse their existing Hive deployments.


2.11. Apache Mahout

Mahout is an open-source framework for creating scalable machine learning algorithms, along with a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.


Mahout's algorithm families are:

- Clustering: groups items that share similar characteristics, for example with k-means.
- Collaborative filtering: mines user behavior to make recommendations.
- Classification: learns from existing categories to assign new items to the best category.
- Frequent pattern mining: analyzes which items are likely to appear together.
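As a small collaborative-filtering example, the sketch below uses Mahout's Taste API to build a user-based recommender from a hypothetical `ratings.csv` file (lines of `userID,itemID,preference`).

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class MahoutRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,preference -- a hypothetical input file.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
        NearestNUserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 42 via collaborative filtering.
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " score=" + item.getValue());
        }
    }
}
```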

2.12. Apache Sqoop

Sqoop imports data from external sources into Hadoop ecosystem components such as HDFS, HBase, or Hive. It also exports data from Hadoop to external destinations. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.


Apache Sqoop Diagram

Features of Apache Sqoop:

- Parallel data transfer: imports and exports run as parallel map tasks, which speeds up transfers.
- Can import the results of an arbitrary SQL query, not just whole tables.
- Can load data directly into Hive or HBase as well as HDFS.
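Sqoop is normally invoked from the command line, but the same tool can also be launched from Java via `Sqoop.runTool`. The sketch below mirrors a typical import command; the JDBC URL, credentials, table, and paths are assumptions.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to the sqoop CLI; connection details here are assumptions.
        String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://db-host/shop",
                "--username", "etl",
                "--password-file", "/user/etl/.pw",
                "--table", "orders",
                "--target-dir", "/data/orders",
                "--num-mappers", "4"   // parallel transfer with 4 map tasks
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```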

2.13. Apache Flume

Flume efficiently collects, aggregates, and moves large amounts of data from their origin into HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment. It uses a simple, extensible data model that supports online analytic applications. Using Flume, we can move data from multiple servers into Hadoop immediately.


Apache Flume

Refer to the Flume Comprehensive Guide for more details.


2.14. Ambari

Ambari, another Hadoop ecosystem component, is a management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters. Hadoop management gets simpler because Ambari provides a consistent, secure platform for operational control.


Ambari Diagram

Features of Ambari:

- Simplified installation, configuration, and management of Hadoop clusters.
- Centralized security setup across the cluster.
- Full visibility into cluster health through monitoring dashboards and alerts.
- Extensibility for bringing custom services under management.

2.15. ZooKeeper

Apache ZooKeeper is a centralized service and a Hadoop ecosystem component for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper manages and coordinates large clusters of machines.


ZooKeeper Diagram

Features of ZooKeeper:

- Fast: ZooKeeper performs especially well in read-dominant workloads.
- Ordered: updates are totally ordered, which clients can build on for higher-level abstractions such as locks.
- Reliable: once an update is applied, it persists until it is overwritten.
- Simple: clients coordinate through a shared hierarchical namespace of znodes.
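The sketch below shows the configuration-maintenance use case with the standard ZooKeeper Java client: it stores a small piece of shared configuration in a znode and reads it back. The ensemble address and znode path are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Assumed ensemble address; connection events arrive on the watcher.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of shared configuration in a znode...
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=128".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ...and read it back; every client in the cluster sees the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```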

2.16. Oozie

It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.


Oozie Diagram

In Oozie, users can create a directed acyclic graph (DAG) of workflow actions, which can run in parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily start, stop, suspend, and rerun jobs, and it is even possible to skip a specific failed node or rerun it.


There are two basic types of Oozie jobs:

- Oozie workflow jobs: directed acyclic graphs (DAGs) of actions to execute.
- Oozie coordinator jobs: workflow jobs triggered by time and data availability.
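Workflows themselves are defined in XML and stored in HDFS; from Java, a workflow job can be submitted and monitored with the OozieClient API, as in the sketch below. The Oozie server URL, HDFS application path, and cluster addresses are assumptions.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmit {
    public static void main(String[] args) throws Exception {
        // Assumed Oozie server URL.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Assumed location of the workflow definition (workflow.xml) in HDFS.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/alice/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow DAG, then poll its status.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Workflow " + jobId + " finished: "
                + oozie.getJobInfo(jobId).getStatus());
    }
}
```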

This was all about the components of the Hadoop ecosystem.


3. Conclusion: Components of Hadoop Ecosystem

We have covered all of the Hadoop ecosystem components in detail; together, these components power Hadoop's functionality. Now that you have learned the components of the Hadoop ecosystem, refer to the Hadoop installation guide to put Hadoop to use. If you like this blog or have any questions, please feel free to share them with us.


Reference for Hadoop


https://data-flair.training/blogs/hadoop-ecosystem-components

