2018-01-10 Hadoop Platform and A

2018-01-10  鸭鸭学语言

YARN

It supports the classic MapReduce framework.

Other open source and commercial applications, such as Impala and Storm, can also run on it without modification.

It also supports user-developed applications.

It also enables frameworks such as Tez and Spark.


Execution Frameworks: YARN, Tez, Spark

    Support DAGs (directed acyclic graphs) of tasks.

    In-memory caching of data.


MapReduce

Application engine.

Applications that fit the MapReduce paradigm: the input can be split into distributed chunks that are processed independently of each other (map), and a shuffle step then groups the intermediate results and feeds them into the reduce step.
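To make the map, shuffle, and reduce steps concrete, here is a minimal single-process sketch in plain Python. It is only an illustration of the paradigm, not Hadoop's API; the function names and the sample sentences are made up for this example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) for every word in one independent chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # Each chunk is mapped without looking at any other chunk.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))

counts = word_count(["YARN runs MapReduce", "Tez runs on YARN", "Spark runs on YARN"])
print(counts["yarn"], counts["on"])  # 3 2
```

On a real cluster the chunks would live on different nodes and the shuffle would move data over the network; here everything is a list in one process.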

Applications that do not fit the MapReduce paradigm:

    Interactive data exploration - the data should be loaded into memory once to avoid reading it from disk again and again.

    Iterative data processing - for example, machine learning algorithms.
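The cost of re-reading data in every iteration can be sketched with a toy example. The `CountingSource` class below is a hypothetical stand-in for a dataset on disk, and the loop is a simple gradient-descent estimate of the mean; neither comes from Hadoop itself.

```python
class CountingSource:
    """Hypothetical stand-in for a dataset on disk; counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)

def iterate_reloading(source, steps, lr=0.5):
    """Every step re-reads the data, like a chain of separate batch jobs."""
    estimate = 0.0
    for _ in range(steps):
        data = source.load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= lr * gradient
    return estimate

def iterate_cached(source, steps, lr=0.5):
    """The data is read once and kept in memory for all steps."""
    data = source.load()
    estimate = 0.0
    for _ in range(steps):
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= lr * gradient
    return estimate

source_a = CountingSource([1.0, 2.0, 3.0, 4.0])
source_b = CountingSource([1.0, 2.0, 3.0, 4.0])
result_a = iterate_reloading(source_a, steps=20)
result_b = iterate_cached(source_b, steps=20)
print(source_a.reads, source_b.reads)  # 20 1
```

Both loops arrive at the same estimate; the only difference is the number of reads, which is what dominates when each read goes to disk.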


Tez

Application engine.

Features:

    Handles dataflow graphs with an expressive API.

    Supports custom data types and custom application logic, so it is not restricted to the MapReduce framework's model.

    Can run complex DAGs of tasks.

    Supports dynamic DAG changes.

    Reuses resources (containers) to avoid the cost of container startup, which makes it more efficient.
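What "running a DAG of tasks" means can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not the Tez API; the task names are invented, and the sketch only shows how tasks become runnable once their dependencies finish.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DAG: each task maps to the set of tasks it depends on.
dag = {
    "load_orders": set(),
    "load_users": set(),
    "clean_orders": {"load_orders"},
    "join": {"clean_orders", "load_users"},
    "report": {"join"},
}

def run_dag(graph):
    """Return the tasks in waves; tasks in one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # also raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unlocking its successors
    return waves

waves = run_dag(dag)
for number, tasks in enumerate(waves, start=1):
    print(number, tasks)
```

Tasks within a wave could run in parallel, and an engine that reuses containers could hand a container from a finished task to the next wave instead of starting a new one.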


Compare MapReduce and Tez:

    Use case:

        SELECT a.vendor, COUNT(*), AVG(c.cost) FROM a JOIN b ON (a.id=b.id) JOIN c ON (a.itemid=c.itemid) GROUP BY a.vendor

[Figure: MapReduce vs. Tez]
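To see which stages this query breaks into, here is a rough plain-Python evaluation on toy data. The rows are made up, the second join is assumed to be against table c, and `hash_join` and `group_by_vendor` are illustrative helpers, not part of Hive, MapReduce, or Tez.

```python
from collections import defaultdict

# Toy rows; the table contents are invented for illustration.
a = [
    {"id": 1, "itemid": 10, "vendor": "acme"},
    {"id": 2, "itemid": 11, "vendor": "acme"},
    {"id": 3, "itemid": 10, "vendor": "globex"},
]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemid": 10, "cost": 4.0}, {"itemid": 11, "cost": 6.0}]

def hash_join(left, right, key):
    """Inner join: index the right side by key, then probe with each left row."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def group_by_vendor(rows):
    """GROUP BY vendor, returning (COUNT(*), AVG(cost)) per vendor."""
    costs = defaultdict(list)
    for row in rows:
        costs[row["vendor"]].append(row["cost"])
    return {v: (len(xs), sum(xs) / len(xs)) for v, xs in costs.items()}

stage1 = hash_join(a, b, "id")           # a JOIN b ON a.id = b.id
stage2 = hash_join(stage1, c, "itemid")  # ... JOIN c ON a.itemid = c.itemid
result = group_by_vendor(stage2)         # GROUP BY a.vendor
print(result)
```

Each of the three stages consumes the output of the previous one. On MapReduce such stages are typically chained as separate jobs, while a DAG engine like Tez can express them as a single graph of tasks.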

Spark

Application engine.

It can run on HDFS directly, without requiring YARN. It can also run on other storage systems.

Features:

    Advanced DAG execution engine - data can be shared across DAGs and between iterations, and reused, which can make it much faster than other DAG engines.

    Supports cyclic data flow.

    In-memory computing. If it runs out of memory, it spills over to disk gracefully.

    Can be accessed from Java, Scala, Python, and R.

    Existing optimized libraries
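The idea of keeping data in memory and spilling to disk when memory runs out can be sketched as follows. This is a toy illustration, not how Spark implements it; `SpillingCache`, its least-recently-used policy, and the partition numbering are all assumptions made for this example.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class SpillingCache:
    """Keeps up to `capacity` partitions in memory; older ones spill to disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = OrderedDict()  # partition id -> data, least recent first
        self.spilled = {}            # partition id -> path of the spilled file
        self.spill_dir = tempfile.mkdtemp(prefix="spill_")

    def put(self, key, data):
        stale = self.spilled.pop(key, None)
        if stale is not None:
            os.remove(stale)  # drop an outdated on-disk copy
        self.memory[key] = data
        self.memory.move_to_end(key)
        while len(self.memory) > self.capacity:
            old_key, old_data = self.memory.popitem(last=False)
            path = os.path.join(self.spill_dir, f"{old_key}.pkl")
            with open(path, "wb") as fh:
                pickle.dump(old_data, fh)
            self.spilled[old_key] = path

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        path = self.spilled.pop(key)  # KeyError for unknown partitions
        with open(path, "rb") as fh:
            data = pickle.load(fh)
        os.remove(path)
        self.put(key, data)  # bring it back, possibly spilling another one
        return data

cache = SpillingCache(capacity=2)
for i in range(3):
    cache.put(i, list(range(i * 3, i * 3 + 3)))
first = cache.get(0)  # partition 0 was spilled and is read back from disk
print(first, sorted(cache.memory), sorted(cache.spilled))
```

The caller never sees whether a partition came from memory or from disk, only that spilled partitions are slower to fetch.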


Hadoop Resource Scheduling

Schedulers:

    FIFO (the default in classic Hadoop MapReduce; in Apache Hadoop 2.x, YARN's default is the Capacity scheduler)

    Fairshare - balances resources between applications. By default the only resource considered is memory, but CPU can be added as a resource.

        Balances out resource allocation among apps over time.

        Can be organized into queues/sub-queues

        Guaranteed minimum shares

        Weighted app priorities

    Capacity - guarantees resources for each application

        Queues and sub-queues

        Capacity Guarantee with elasticity

        ACLs for security

        Runtime changes/draining apps

        Resource based scheduling
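The weighted sharing that the Fairshare scheduler aims for can be sketched as a weighted max-min allocation. This is a simplified model and not YARN's scheduler code: it handles a single resource (memory in GB), ignores queues and minimum shares, and the app names, demands, and weights are hypothetical.

```python
def fair_shares(capacity, demands, weights):
    """Weighted max-min fair split of `capacity` among apps with given demands."""
    allocation = {app: 0.0 for app in demands}
    active = {app for app, demand in demands.items() if demand > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[app] for app in active)
        # Offer every active app a slice proportional to its weight.
        offers = {app: remaining * weights[app] / total_weight for app in active}
        satisfied = {app for app in active if offers[app] >= demands[app]}
        if not satisfied:
            # Nobody can be fully served: hand out the offers and stop.
            for app in active:
                allocation[app] = offers[app]
            break
        # Fully serve the small demands, then redistribute what is left over.
        for app in satisfied:
            allocation[app] = float(demands[app])
            remaining -= demands[app]
        active -= satisfied
    return allocation

shares = fair_shares(
    capacity=100,                                   # GB of memory in the cluster
    demands={"app_a": 20, "app_b": 60, "app_c": 80},
    weights={"app_a": 1, "app_b": 1, "app_c": 3},
)
print(shares)
```

Here app_a asks for less than its share and gets all of it, and the memory it leaves unused is split between app_b and app_c in proportion to their weights (1 to 3). A FIFO scheduler would instead give the earliest app everything it asks for before looking at the next one.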


    Lesson 4 Slides
