006 Top 5 Hadoop Analytics Tools: A Dive into Advanced Analytics
Hadoop is an open-source framework for distributed storage and processing, and it sits at the center of the growing big data ecosystem. It manages data processing and storage for big data applications, works with both structured and unstructured data, and is used for advanced analytics such as predictive analytics, data mining, and machine learning. Let's explore the main Hadoop analytics tools.
List of Top Hadoop Analytics Tools
Below are the top five Hadoop analytics tools. Let's discuss each of them in detail.
1. Spark
Apache Spark provides in-memory data processing for developers and data scientists. Its ease of development, flexibility, and speed have made it one of the most popular Apache projects. It is the successor to MapReduce as the standard execution engine for Hadoop, and it supports real-time, batch, and advanced analytics on the Hadoop platform. Spark is increasingly becoming the default execution engine for analytics workloads.
Features of Spark:
- Ability to cache datasets for interactive data analysis: extract a working set, cache it, and query it repeatedly.
- Interactive command-line interface in Scala or Python for low-latency data exploration.
- High-level library for stream processing, through Spark Streaming.
- High-level libraries for machine learning and graph processing. Because of its distributed, memory-based architecture, Spark is reported to be around ten times faster than disk-based Apache Mahout.
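The "extract a working set, cache it, and query it repeatedly" pattern can be illustrated with a small toy in plain Python. This is only a conceptual sketch and not the Spark API: the `LazyDataset` class, its methods, and the `evaluations` counter are hypothetical names invented for the example. It shows that a cached working set is computed once, while an uncached one is recomputed for every query.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the rows on demand
        self._cache_enabled = False
        self._cached = None
        self.evaluations = 0           # how many times the rows were recomputed

    def filter(self, predicate):
        # Nothing is computed here; the new dataset only remembers how to derive itself.
        return LazyDataset(lambda: [row for row in self._rows() if predicate(row)])

    def cache(self):
        # Ask for the rows to be kept in memory after the first evaluation.
        self._cache_enabled = True
        return self

    def _rows(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = self._compute()
        if self._cache_enabled:
            self._cached = rows
        return rows

    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


if __name__ == "__main__":
    source = LazyDataset(lambda: list(range(1, 1001)))

    # Cached working set: two queries, one evaluation.
    working_set = source.filter(lambda x: x % 10 == 0).cache()
    print(working_set.count(), working_set.total(), working_set.evaluations)

    # Uncached working set: every query recomputes the filter.
    uncached = source.filter(lambda x: x % 10 == 0)
    print(uncached.count(), uncached.total(), uncached.evaluations)
```

In real Spark the cached data is held in the memory of the cluster's executors rather than in a single Python object, but the trade-off is the same: pay for the computation once, then answer repeated interactive queries from memory.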
2. Apache Impala
Apache Impala provides massively parallel processing (MPP) SQL analytics and opens up interactive BI for business analysts. It performs well under demanding performance and concurrency requirements, both of which are necessary for building an analytic database. Impala is natively integrated with Hadoop and with leading BI tools, which makes it a low-cost platform for analytics.
Features of Impala:
- Performance equivalent to leading MPP (Massively Parallel Processing) databases.
- Faster time to insight than traditional databases, with faster interactive analytics directly on data stored in Hadoop.
- Cost savings due to reduced data movement, modeling, and storage.
- A more complete analysis of historical and raw data, without the information loss caused by aggregation or by conforming to a fixed schema.
- Freedom from vendor lock-in through the open-source Apache license.
- Security through Kerberos authentication and role-based authorization via the Apache Sentry project.
3. MapReduce
Hadoop MapReduce is a framework for writing applications that process huge amounts of data. These applications run in parallel on a large cluster of commodity hardware in a reliable, fault-tolerant manner. The job submitted by the client is divided into a number of independent tasks, which run in parallel and give high throughput. A MapReduce job consists mainly of map tasks and reduce tasks. Programmers usually write most of the business logic in the map task, while the reduce task summarizes the intermediate output that the map tasks produce.
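The map, shuffle, and reduce phases described above can be sketched in a few lines of plain Python. This is a single-process word-count illustration, not Hadoop's actual Java API: the function names `map_task`, `shuffle`, `reduce_task`, and `run_job` are made up for the example, and on a real cluster each map task would run on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_task(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: summarize all values that share one key."""
    return key, sum(values)


def run_job(lines):
    # Each line is handled by an independent map task; a cluster would run these in parallel.
    mapped = chain.from_iterable(map_task(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Spark processes data", "Hadoop processes data"]
    print(run_job(sample))
```

Because every call to `map_task` depends only on its own input line, the work can be split across machines without coordination, which is the property that lets the framework scale and rerun failed tasks.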
Features of Hadoop MapReduce:
- Easily scalable architecture: machines can be added to increase the processing power of the cluster.
- Fault tolerance: it recovers from failures automatically and seamlessly.
- Load balancing: HDFS provides an intra-DataNode balancer, invoked through the CLI, which resolves data skew within a node.
- Security: POSIX-based file permissions for users and groups, with optional LDAP integration.
4. Mahout
Apache Mahout is a library of scalable machine learning algorithms, implemented on top of Hadoop using the MapReduce paradigm. Machine learning is a branch of artificial intelligence that focuses on enabling machines to learn without being explicitly programmed. It is commonly used to improve future performance based on previous outcomes.
Features of Mahout:
- Collaborative filtering mines user behavior and makes product recommendations.
- Clustering takes items from a particular class and organizes them into naturally occurring groups, so that items in the same group are similar to each other.
- Classification learns from existing categorizations and then assigns unclassified items to the best-fitting category.
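To make collaborative filtering concrete, here is a minimal sketch in plain Python of an item co-occurrence recommender, one simple way of turning user behavior into recommendations. It is not Mahout's API; the function names and the sample purchase histories are hypothetical, and a production system would run the counting step as a distributed job over far larger data.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(histories):
    """Count how often each ordered pair of items appears in the same user's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, histories, top_n=2):
    """Score items the user has not seen by their co-occurrence with items the user has."""
    counts = cooccurrence_counts(histories)
    seen = set(histories[user])
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties are broken by item name so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    histories = {
        "alice": ["book", "lamp"],
        "bob": ["book", "lamp", "desk"],
        "carol": ["book", "desk", "chair"],
        "dave": ["lamp", "desk"],
    }
    print(recommend("alice", histories))
```

The pair-counting step is itself a natural map and reduce: each user's history emits item pairs, and the counts per pair are summed.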
5. Apache Hive
Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to query data using an SQL-like language called HiveQL (HQL). HiveQL also lets MapReduce programmers plug in their own mappers and reducers when expressing the logic in HiveQL would be inconvenient or inefficient.
Features of Hive:
- Provides indexing to accelerate queries. Index types include bitmap and compact indexes.
- Supports different storage types such as HBase, RC files, ORC, plain text, and others.
- Metadata storage in an RDBMS significantly reduces the time taken for semantic checks.
- Can operate on compressed data stored in the Hadoop ecosystem, using algorithms such as gzip, bzip2, and Snappy.
- Supports user-defined functions (UDFs) to manipulate data and strings. UDFs can be extended to handle use cases that the built-in functions do not cover.
- SQL-like HiveQL queries are implicitly converted into MapReduce jobs.
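The idea that an SQL-like query is turned into map and reduce steps can be illustrated with a small, hedged sketch. Python's built-in `sqlite3` is used here purely as a stand-in for HiveQL (it is not Hive), and the `access_log` table, its columns, and both function names are hypothetical. The second function computes the same answer by hand: roughly, the `WHERE` clause becomes a map-side filter, the `GROUP BY` column becomes the shuffle key, and `COUNT(*)` becomes the reduce step. Hive's real query planner is considerably more elaborate.

```python
import sqlite3
from collections import defaultdict


def sql_counts(rows):
    """Declarative version: a GROUP BY query, with sqlite3 standing in for HiveQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_log (page TEXT, status INTEGER)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?)", rows)
    query = "SELECT page, COUNT(*) FROM access_log WHERE status = 200 GROUP BY page"
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


def mapreduce_counts(rows):
    """Hand-written equivalent of the same query as map, shuffle, and reduce steps."""
    # Map: apply the WHERE filter and emit (group key, 1).
    mapped = [(page, 1) for page, status in rows if status == 200]
    # Shuffle: collect the emitted values under their GROUP BY key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: COUNT(*) is the sum of the emitted ones.
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    log = [("/home", 200), ("/cart", 500), ("/home", 200), ("/cart", 200), ("/home", 404)]
    print(sql_counts(log))
    print(mapreduce_counts(log))
```

Both functions return the same counts, which is the point of a layer like Hive: the analyst writes the one-line query and the engine generates the distributed job.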
That covers the top Hadoop analytics tools. We hope you found the explanation useful.
Summary
Hadoop is well suited to MapReduce-style analysis of huge amounts of data. Its typical use cases include data searching, data reporting, and large-scale indexing of files such as log files or data from web crawlers. Mahout lets you analyze large datasets effectively in less time. Impala makes processing data easy: it responds quickly, in near real time, and repeated queries perform even better. Apache Spark delivers high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, query optimizer, and physical execution engine.
If you still have any questions about Hadoop analytics tools, ask in the comments.