
《Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2》

2018-03-24  AlstonWilliams

Different Components

NodeManager

The NodeManager is YARN's per-node "worker" agent, taking care of the individual compute nodes in a Hadoop cluster. Its duties include keeping up to date with the ResourceManager, overseeing the life-cycle management of application containers, monitoring resource usage of individual containers, tracking node health, managing logs, and providing auxiliary services that may be exploited by different YARN applications.

On startup, the NodeManager registers with the ResourceManager; it then sends heartbeats with its status and waits for instructions. Its primary goal is to manage the application containers assigned to it by the ResourceManager.

YARN containers are described by a ContainerLaunchContext. This record includes a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payloads for NodeManager services, and the command necessary to create the process. After validating the authenticity of the container lease, the NodeManager configures the environment for the container, including initializing its monitoring subsystem with the resource constraints specified for the application. The NodeManager also kills containers as directed by the ResourceManager.
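
As a rough illustration (not code from the book), such a ContainerLaunchContext can be assembled with the YARN records API as sketched below; the class name, launch command, and the appJar resource are assumptions made for the example:

import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;

public class LaunchContextSketch {
  // appJar is assumed to be a LocalResource pointing at a jar in remote storage.
  static ContainerLaunchContext buildContext(LocalResource appJar) {
    Map<String, LocalResource> localResources =
        Collections.singletonMap("app.jar", appJar);            // remote dependencies
    Map<String, String> env =
        Collections.singletonMap("CLASSPATH", "./app.jar:./*"); // environment variables
    List<String> commands = Collections.singletonList(
        "java -Xmx512m com.example.Worker"
            + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
            + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr");
    return ContainerLaunchContext.newInstance(
        localResources, env, commands,
        null,   // payloads for auxiliary NodeManager services
        null,   // security tokens
        null);  // application ACLs
  }
}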

ApplicationMaster

The ApplicationMaster is the process that coordinates an application's execution in the cluster. Each application has its own unique ApplicationMaster, which is tasked with negotiating resources from the ResourceManager and working with the NodeManager to execute and monitor the tasks. In the YARN design, MapReduce is just one application framework; this design permits building and deploying distributed applications using other frameworks. For example, YARN ships with a Distributed-Shell application that allows a shell script to be run on multiple nodes on the YARN cluster.

Once the ApplicationMaster is started (as a container), it will periodically send heartbeats to the ResourceManager to affirm its health and to update the record of its resource demands. After building a model of its requirements, the ApplicationMaster encodes its preferences and constraints in a heartbeat message to the ResourceManager. In response to subsequent heartbeats, the ApplicationMaster will receive a lease on containers bound to an allocation of resources at a particular node in the cluster. Depending on the containers it receives from the ResourceManager, the ApplicationMaster may update its execution plan to accommodate the excess or lack of resources. Container allocation/deallocation can take place in a dynamic fashion as the application progresses.

Client ResourceRequest

The client must first notify the ResourceManager that it wants to submit an application. The ResourceManager responds with an ApplicationID and information about the capabilities of the cluster that will aid the client in requesting resources.
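
A sketch of this client-side step using the YarnClient library (the underlying protocol is ApplicationClientProtocol); the configuration is assumed to point at a running cluster:

import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NewApplicationSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager for a new ApplicationId plus cluster capabilities.
    YarnClientApplication app = yarnClient.createApplication();
    GetNewApplicationResponse response = app.getNewApplicationResponse();
    System.out.println("ApplicationId: " + response.getApplicationId());
    System.out.println("Max container capability: "
        + response.getMaximumResourceCapability());

    yarnClient.stop();
  }
}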

ApplicationMaster Container Allocation

When the ResourceManager receives the application submission context from a client, it schedules an available container for the ApplicationMaster. This container is often called "Container0" because it is the ApplicationMaster, which must request additional containers. If there are no applicable containers, the request must wait. If a suitable container can be found, then the ResourceManager contacts the appropriate NodeManager and starts the ApplicationMaster. As part of this step, the ApplicationMaster RPC port and tracking URL for monitoring the application's status will be established.

In response to the registration request, the ResourceManager will send information about the minimum and maximum capabilities of the cluster. At this point the ApplicationMaster must decide how to use the capabilities that are currently available. Unlike some resource schedulers in which clients request hard limits, YARN allows applications to adapt to the current cluster environment.

Based on the available capabilities reported from the ResourceManager, the ApplicationMaster will request a number of containers. This request can be very specific, including containers with multiples of the resource minimum values. The ResourceManager will respond, as best as possible based on scheduling policies, to this request with container resources that are assigned to the ApplicationMaster.

As a job progresses, heartbeat and progress information is sent from the ApplicationMaster to the ResourceManager. Within these heartbeats, it is possible for the ApplicationMaster to request and release containers. When the job finishes, the ApplicationMaster sends a finish message to the ResourceManager and exits.

ApplicationMaster-Container Manager Communication

The ApplicationMaster will independently contact its assigned node managers and provide them with a ContainerLaunchContext that includes environment variables, dependencies located in remote storage, security tokens, and commands needed to start the actual process. When the container starts, all data files, executables, and necessary dependencies are copied to local storage on the node. Dependencies can potentially be shared between containers running the application.

Once all containers have started, their status can be checked by the ApplicationMaster. The ResourceManager is absent from the application's progress and is free to schedule and monitor other resources. The ResourceManager can direct the NodeManager to kill containers. Expected kill events can happen when the ApplicationMaster informs the ResourceManager of its completion, when the ResourceManager needs nodes for another application, or when the container has exceeded its limits. When a container is killed, the NodeManager cleans up the local working directory. When a job is finished, the ApplicationMaster informs the ResourceManager that the job completed successfully. The ResourceManager then informs the NodeManager to aggregate logs and clean up container-specific files. The NodeManagers are also instructed to kill any remaining containers (including the ApplicationMaster) if they have not already exited.

LocalResource Visibilities

The ApplicationMaster specifies the visibility of a LocalResource to a NodeManager while starting the container; the NodeManager itself doesn't make any decisions or classify resources. Similarly, for the container running the ApplicationMaster itself, the client has to specify visibilities for all the resources that the ApplicationMaster needs.

In the case of a MapReduce application, the MapReduce job client decides the resource type, which the corresponding ApplicationMaster then forwards to the NodeManager.
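
A minimal sketch of declaring a LocalResource and its visibility, assuming a jar already sits in remote storage at the given path; the class and method names are illustrative:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class LocalResourceSketch {
  static LocalResource jarResource(FileSystem fs, Path jarPath) throws Exception {
    FileStatus status = fs.getFileStatus(jarPath);
    return LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(jarPath), // where the NodeManager downloads from
        LocalResourceType.FILE,                     // FILE, ARCHIVE, or PATTERN
        LocalResourceVisibility.APPLICATION,        // PUBLIC, PRIVATE, or APPLICATION
        status.getLen(),                            // size used for validation
        status.getModificationTime());              // timestamp used for validation
  }
}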

Lifetime of LocalResources

Different types of LocalResources have different life cycles: APPLICATION-scoped resources are deleted when the application completes, while PRIVATE and PUBLIC resources outlive individual applications and are garbage-collected from the NodeManager's local caches once no running container needs them.

Hadoop YARN architecture

ResourceManager components

Client interaction with the ResourceManager
Application interaction with the ResourceManager

The following describes how the ApplicationMasters interact with the ResourceManager once they have started.

Interaction of Nodes with the ResourceManager

The following components in the ResourceManager interact with the NodeManagers running on cluster nodes.

Core ResourceManager Components
Security-related components in the ResourceManager

The ResourceManager has a collection of components called SecretManagers that are charged with managing the tokens and secret keys that are used to authenticate/authorize requests on various RPC interfaces.

NodeManager

Its responsibilities include the following tasks:

Overview of the NodeManager components

After registering with the ResourceManager, the NodeManager periodically sends a heartbeat with its current status and receives instructions, if any, from the ResourceManager. When the scheduler gets to process the node's heartbeat, containers are allocated against that NodeManager and are then subsequently returned to the ApplicationMasters when the ApplicationMasters themselves send a heartbeat to the ResourceManager.

Before actually launching a container, the NodeManager copies all the necessary libraries - data files, executables, tarballs, jar files, shell scripts, and so on - to the local file system. The downloaded libraries may be shared between containers of a specific application via a local application-level cache, between containers launched by the same user via a local user-level cache, and even between users via a public cache, as specified in the ContainerLaunchContext. The NodeManager eventually garbage-collects libraries that are not in use by any running containers.

The NodeManager may also kill containers as directed by the ResourceManager. Containers may be killed in the following situations:

Whenever a container exits, the NodeManager will clean up its working directory in local storage. When an application completes, all resources owned by its containers are cleaned up.

In addition to starting and stopping containers, cleaning up after exited containers, and managing local resources, the NodeManager offers other local services to containers running on the node. For example, the log aggregation service uploads all the logs written by the application's containers to stdout and stderr to a file system once the application completes.

NodeManager components
Important NodeManager Functions

ApplicationMaster

Overview

The process starts when an application submits a request to the ResourceManager. Next, the ApplicationMaster is started and registers with the ResourceManager. The ApplicationMaster then requests containers from the ResourceManager to perform actual work. The assigned containers are presented to the NodeManager for use by the ApplicationMaster. Computation takes place in the containers, which keep in contact with the ApplicationMaster as the job progresses. When the application is complete, containers are stopped and the ApplicationMaster is unregistered from the ResourceManager.

Once successfully launched, the ApplicationMaster is responsible for the following tasks:

Liveliness

The first operation that any ApplicationMaster has to perform is to register with the ResourceManager. As part of the registration, ApplicationMasters can inform the ResourceManager about an IPC address and/or a web URL.

In the registration response, the ResourceManager returns information that the ApplicationMaster can use, such as the minimum and maximum sizes of resources that YARN accepts, and the ACLs associated with the application that are set by the user during application submission. The ApplicationMaster can use these ACLs for authorizing user requests on its own client-facing service.

Once registered, an ApplicationMaster periodically needs to send heartbeats to the ResourceManager to affirm its liveliness and health.
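
A hedged sketch of registration and the heartbeat loop using the AMRMClient convenience library (the raw protocol is ApplicationMasterProtocol); the host name, port, tracking URL, and the two placeholder methods are assumptions for the example:

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRegistrationSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // Register the AM's own IPC address and a tracking URL (may be empty).
    RegisterApplicationMasterResponse reg =
        rmClient.registerApplicationMaster("am-host.example.com", 0, "");
    System.out.println("Max capability: " + reg.getMaximumResourceCapability());

    // Heartbeats: even with nothing to ask for, allocate() must be called
    // periodically so the ResourceManager considers this AM alive.
    while (workRemains()) {
      rmClient.allocate(getProgress());   // progress is a float in [0, 1]
      Thread.sleep(1000);
    }
    // (unregistration is covered later in this chapter)
  }
  static boolean workRemains() { return false; }  // placeholder
  static float getProgress() { return 0.0f; }     // placeholder
}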

Resource Requirements

Resource requirements are referred to as static when they are decided at the time of application submission and, once the ApplicationMaster starts running, there is no change in that specification. For example, in the case of MapReduce, the number of mappers and reducers is fixed at job submission.

When dynamic resource requirements are applied, the ApplicationMaster may choose how many resources to request at run time based on criteria such as user hints, availability of cluster resources, and business logic.

Scheduling

When an ApplicationMaster accumulates enough resource requests or a timer expires, it can send the requests in a heartbeat message, via the 'allocate' API, to the ResourceManager. The 'allocate' call is the single most important API between the ApplicationMaster and the scheduler; it is used by the ApplicationMaster to update its resource requests and to learn about newly allocated containers. Only one thread of the ApplicationMaster should invoke the 'allocate' API at a time; all such calls are serialized on the ResourceManager per ApplicationAttempt. Because of this, if multiple threads ask for resources via the 'allocate' API, each thread may get an inconsistent view of the overall resource requests.

Scheduling Protocol and Locality
  1. ResourceRequests
    The ResourceRequest object is used by the ApplicationMaster for resource requests (a sketch in code follows this list). It includes the following elements:
    • Priority of the request
    • The name of the resource location on which the allocation is desired. It currently accepts a machine or a rack name. A special value of "*" signifies that any host/rack is acceptable to the application.
    • Resource capability, which is the amount or size of each container required for that request.
    • Number of containers, with respect to the specifications of priority and resource location, that are required by the application.
    • A Boolean relaxLocality flag (defaults to true), which tells the ResourceManager if the application wants locality to be loose (i.e., allow fall-through to rack or "*" in case of no local containers) or strict (i.e., specify hard constraints on container placement).
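
As a rough sketch (not from the book), the same elements map onto the AMRMClient library's ContainerRequest, which builds the underlying ResourceRequest; the memory/vcore numbers and method name are arbitrary:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResourceRequestSketch {
  static void requestContainers(AMRMClient<ContainerRequest> rmClient, int count) {
    Resource capability = Resource.newInstance(1024, 1);   // 1 GB, 1 vcore per container
    Priority priority = Priority.newInstance(0);           // application-internal priority
    String[] nodes = null;                                  // no specific hosts -> "*"
    String[] racks = null;
    boolean relaxLocality = true;                           // allow fall-through to rack / "*"
    for (int i = 0; i < count; i++) {
      rmClient.addContainerRequest(
          new ContainerRequest(capability, nodes, racks, priority, relaxLocality));
    }
  }
}
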
Priorities

Higher-priority requests of an application are served first by the ResourceManager before the lower-priority requests of the same application are handled. There is no cross-application implication of priorities.

Launching Containers

Once the ApplicationMaster obtains containers from the ResourceManager, it can then proceed to actual launch of the containers. Before launching a container, it first has to construct the ContainerLaunchContext object according to its needs, which can include allocated resource capability, security tokens, the command to be executed to start the container, and more. It can either launch containers one by one by communicating to a NodeManager, or it can batch all containers on a single node together and launch them in a single call by providing a list of StartContainerRequests to the NodeManager.
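
As a sketch, a single container can be launched through the NMClient convenience library; the underlying protocol call is the batched startContainers described next. The helper method shown here is illustrative:

import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerLaunchSketch {
  static void launch(Container container, ContainerLaunchContext ctx) throws Exception {
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(new YarnConfiguration());
    nmClient.start();
    // Hands the ContainerLaunchContext to the NodeManager that owns this container.
    Map<String, ByteBuffer> servicesMetaData = nmClient.startContainer(container, ctx);
    nmClient.stop();
  }
}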

The NodeManager sends a response via StartContainersResponse that includes a list of successfully launched containers, a container ID-to-exception map for each failed StartContainerRequest in which the exception indicates errors per container, and an allServicesMetaData map from the names of auxiliary services to their corresponding metadata.

The ApplicationMaster can also get updated statuses for submitted but still to-be-launched containers as well as already launched containers.

The ApplicationMaster can also request a NodeManager to stop a list of containers running on that node by sending a StopContainersRequest that includes the container IDs of the containers that should be stopped. The NodeManager sends a response via StopContainersResponse, which includes a list of container IDs of successfully stopped containers as well as a container ID-to-exception map for each failed request in which the exception indicates errors from the particular container.

When an ApplicationMaster exits, depending on its submission context, the ResourceManager may choose to kill all the running containers that are not explicitly terminated by the ApplicationMaster itself.

Completed Containers

When containers finish, the ApplicationMasters are informed by the ResourceManager about the event. Because the ResourceManager does not interpret the container status, the ApplicationMaster determines the semantics of the success or failure of the container exit status reported through the ResourceManager.

Handling of container failures is the responsibility of the applications/frameworks. YARN is responsible only for providing information to the applications/frameworks. The ResourceManager collects information about all the finished containers as part of the 'allocate' API's response, and it returns this information to the corresponding ApplicationMaster. It is up to the ApplicationMaster to look at information such as the container status, exit code, and diagnostics information and act on it appropriately. For example, when the MapReduce ApplicationMaster learns about container failures, it retries map or reduce tasks by requesting new containers from the ResourceManager until a configured number of attempts fail for a single task.
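
A sketch of what such handling might look like on the ApplicationMaster side, based on the AllocateResponse returned by the AMRMClient library; the retry decisions in the comments are placeholders for application-specific logic:

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class CompletedContainersSketch {
  static void checkCompleted(AMRMClient<ContainerRequest> rmClient) throws Exception {
    AllocateResponse response = rmClient.allocate(0.5f);
    for (ContainerStatus status : response.getCompletedContainersStatuses()) {
      int exitStatus = status.getExitStatus();
      if (exitStatus == ContainerExitStatus.SUCCESS) {
        // mark the corresponding task as done
      } else if (exitStatus == ContainerExitStatus.PREEMPTED) {
        // not the container's fault: re-request without counting it as a failure
      } else {
        System.err.println("Container " + status.getContainerId()
            + " failed: " + status.getDiagnostics());
        // application-specific retry, e.g. request a replacement container
      }
    }
  }
}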

ApplicationMaster failures and recovery

The ApplicationMaster is also tasked with recovering the application after a restart that was due to the ApplicationMaster's own failure. When an ApplicationMaster fails, the ResourceManager simply restarts the application by launching a new ApplicationMaster for a new ApplicationAttempt; it is the responsibility of the new ApplicationMaster to recover the application's previous state. This goal can be achieved by having the current ApplicationAttempt persist its state to external storage for use by future attempts. Any ApplicationMaster can obviously just run the application from scratch all over again instead of recovering the past state. The Hadoop MapReduce framework's ApplicationMaster recovers its completed tasks, but running tasks, as well as tasks that completed during ApplicationMaster recovery, are killed and rerun.

Coordination and output commit

If a framework supports multiple containers contending for a resource or an output commit, the ApplicationMaster should provide synchronization primitives for them, so that only one of those containers can access the shared resource, or it should promote the output of one container while the others are ordered to wait or abort. The MapReduce ApplicationMaster defines the number of attempts per task that can potentially run concurrently; it also provides APIs for tasks so that the output-commit operation remains consistent.

Because of network partitions, YARN cannot guarantee that only one attempt of an application is running at any given time. Application writers must therefore be aware of this possibility and should code their applications and frameworks to handle the potential multiple-writer problems.

Information for clients
Security

If the application exposes a web service or an HTTP/socket/RPC interface, it is also responsible for all aspects of its secure operation; YARN merely secures its own deployment.

Cleanup on ApplicationMaster exit

When an ApplicationMaster is done with all its work, it should explicitly unregister with the ResourceManager by sending a FinishApplicationMasterRequest. Similar to registration, as part of this request ApplicationMasters can report IPC and web URLs where clients can go once the application finishes and the ApplicationMaster is no longer running.

After the 'finish' API is invoked, the application is not considered fully complete until the ApplicationMaster exits on its own or its liveliness interval expires. This is done so as to enable ApplicationMasters to do some cleanup after the finish API is successfully recorded on the ResourceManager.
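
A minimal sketch of this unregistration step via the AMRMClient library; the diagnostic message and history URL are placeholders:

import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AmFinishSketch {
  static void finish(AMRMClient<ContainerRequest> rmClient, boolean succeeded)
      throws Exception {
    rmClient.unregisterApplicationMaster(
        succeeded ? FinalApplicationStatus.SUCCEEDED : FinalApplicationStatus.FAILED,
        "diagnostics message shown to clients",      // optional message
        "http://history.example.com/app");           // URL clients can visit afterwards
    rmClient.stop();
  }
}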

YARN Containers

Container Environment

Dynamic information includes settings that can potentially change during the lifetime of a container. It is composed of things like the location of the parent ApplicationMaster and the location of map outputs for a reduce task. Most of this information is the responsibility of the application-specific implementation. Some of the options include the following:

Communication with the ApplicationMaster

When containers exit, the ApplicationMaster will eventually learn about their completion status, either directly from the NodeManager or via the NodeManager-ResourceManager-ApplicationMaster channel for the status of completed containers.

If an application needs its containers to be in communication with its ApplicationMaster, however, it is entirely up to the application/framework to implement such a protocol.

Summary for Application-writers

CapacityScheduler

Ideas

  1. Elasticity with multitenancy
  2. Security: ACLs
  3. Resource Awareness
  4. Granular Scheduling
  5. Locality
  6. Scheduling Policies

Queues

Every queue in the CapacityScheduler has the following properties:

Hierarchical Queues
  1. Key Characteristics
    Queues are of two types: parent queues and leaf queues.
    Parent queues enable the management of resources across organizations and suborganizations. They can contain more parent queues or leaf queues. They do not themselves accept any application submissions directly.
    Leaf queues denote the queues that live under a parent queue and accept applications. Leaf queues do not have any more child queues.

  2. Scheduling Among Queues
    The scheduling algorithm works as follows:

Example hierarchies for use by CapacityScheduler:


There are limitations on how one can name the queues. To avoid confusion, the CapacityScheduler doesn't allow two leaf queues to have the same name across the whole hierarchy.
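
As an illustration (not a configuration from the book), the hierarchy used in the examples below could be declared with the per-queue 'queues' properties; the child lists here are assumptions:

{
  "yarn.scheduler.capacity.root.queues": "grumpy-engineers,finance-wizards,marketing-moguls",
  "yarn.scheduler.capacity.root.grumpy-engineers.queues": "infinite-monkeys",
  "yarn.scheduler.capacity.root.finance-wizards.queues": "thrifty-treasurers"
}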

Queue Access Control

Although application submission can really happen only at the leaf-queue level, an ACL on a parent queue can be set to control admittance to all the descendant queues.

Access Control Lists can be configured in the following format: a comma-separated list of users, followed by a space separator, followed by a comma-separated list of groups, for example:
user1,user2 group1,group2

If it is set to "*", all users and groups are allowed to perform the operation guarded by the ACL in question. If it is set to " ", no users or groups are allowed to perform the operation.
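
For instance (an illustrative configuration, not from the book), submission to the finance-wizards queues could be restricted with the acl_submit_applications property in the same format:

{
  "yarn.scheduler.capacity.root.finance-wizards.acl_submit_applications": "user1,user2 group1,group2"
}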

Capacity Management with Queues

Assume the administrators decide to share the cluster resources among the grumpy-engineers, finance-wizards, and marketing-moguls in a 6:1:3 ratio; the corresponding queue configuration will be as follows:

{
  "yarn.scheduler.capacity.root.grumpy-engineers.capacity": "60",
  "yarn.scheduler.capacity.root.finance-wizards.capacity": "10",
  "yarn.scheduler.capacity.root.marketing-moguls.capacity": "30"
}

During scheduling, queues at any level in the hierarchy are sorted in the order of their current used capacity, and available resources are distributed among them, starting with those queues that are the most under-served at that point in time. With respect to just capacities, the resource scheduling has the following flow:

  1. The more under-served the queues, the higher the priority that is given to them during resource allocation. The most under-served queue is the queue with the smallest ratio of used capacity to the total cluster capacity.
    1.1 The used capacity of any parent queue is defined as the aggregate sum of used capacity of all the descendant queues recursively.
    1.2 The used capacity of a leaf queue is the amount of resources that is used by allocated containers of all applications running in that queue.

  2. Once it is decided to give a parent queue the freely available resources, further similar scheduling is done to decide recursively as to which child queue gets to use the resources based on the same concept of used capacities.

  3. Scheduling inside a leaf queue further happens to allocate resources to applications arriving in a FIFO order.
    3.1 Such scheduling also depends on locality, user level limits, and application limits
    3.2 Once an application within a leaf queue is chosen, scheduling happens within an application, too. Applications may have different resource requests at different priorities.

  4. To ensure elasticity, capacity that is configured but not utilized by any queue due to lack of demand is automatically assigned to the queues that are in need of resources.

To prevent a single queue from growing unchecked, monopolizing the cluster's resources, and thereby blocking applications in other queues, administrators can set a maximum capacity limit:

{
  "yarn.scheduler.capacity.root.grumpy-engineers.infinite-monkeys.maximum-capacity": "40"
}

Capacities and maximum capacities can be dynamically changed at run time using the rmadmin "refresh queues" functionality.
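
For example, after editing the queue configuration, the new capacities can be applied without restarting the ResourceManager:

yarn rmadmin -refreshQueues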

User Limits

Leaf queues have the additional responsibility of ensuring fairness with regard to scheduling applications submitted by various users in that queue. The CapacityScheduler places various limits on users to enforce this fairness. Recall that applications can only be submitted to leaf queues in the CapacityScheduler; thus, parent queues do not have any role in enforcing user limits.

Assume the queue capacity needs to be shared among not more than five users in the thrifty-treasurers queue. When you account for fairness, this results in each of those five users being given an equal share (20%) of the capacity of the "root.finance-wizards.thrifty-treasurers" queue. The following configuration for the finance-wizards queue applies this limit:

{
  "yarn.sheduler.capacity.root.finance-wizards.thrifty-treasures.minimum-user-limit-percent": "20"
}

The CapacityScheduler's leaf queues have the ability to restrict or expand a user's share within and beyond the queue's capacity through the per-leaf-queue "user-limit-factor" configuration. It denotes the multiple of the queue capacity up to which any single user can grow, irrespective of whether there are idle resources in the cluster.

{
  "yarn.scheduler.capacity.root.finance-wizards.user-limit-factor": "1"
}

The default value of 1 means that any single user in that queue can, at maximum, occupy only the queue's configured capacity. This value avoids the case in which users in a single queue monopolize resources across all queues in a cluster. By extension, setting the value to 2 allows the queue to grow to a maximum of twice the size of the queue's configured capacity.

Reservations

The CapacityScheduler's responsibility is to match free resources in the cluster with the resource requirements of an application. Many times, however, a scheduling cycle occurs in such a way that even though there are free resources on a node, they are not large enough to satisfy the application waiting at the head of the queue. This situation typically happens with large-memory applications whose resource demand for each of their containers is much larger than that of the typical application running in the cluster. When such applications run in the cluster, anytime a regular application's containers finish, thereby releasing previously used resources for new cycles of scheduling, nodes will have freely available resources but the large-memory applications cannot take advantage of them because the resources are still too small. If left unchecked, this mismatch can cause starvation of resource-intensive applications.

The CapacityScheduler solves this problem with a feature called reservation. The scheduling flow for reservations resembles the following:

The CapacityScheduler only maintains one active reservation per node.

State of the Queues

Queues can be in one of two states: RUNNING (the default) and STOPPED.

For an application to be accepted at any leaf queue, all of the queues in its ancestry - all the way to the root queue - need to be running.

The following configuration indicates the state of the 'finance-wizards' queue:

{
  "yarn.scheduler.capacity.root.finance-wizards.state": "RUNNING"
}

Limits on Applications

The following configuration controls the total number of concurrently active (both running and pending) applications at any single point in time:

{
  "yarn.scheduler.capacity.maximum-applications": "10000"
}

The maximum number of applications for each queue can be configured as follows:

{
  "yarn.scheduler.capacity.<queue-path>.maximum-applications": "absolute-capacity * yarn.scheduler.capacity.maximum-applications"
}

The limit on the maximum percentage of cluster resources that can be used by ApplicationMasters can be configured by:

{
  "yarn.scheduler.capacity.maximum-am-resource-parent": "0.1"
}

The same limit for each queue can be configured by:

{
  "yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent": "0.1"
}