Using Flink: Troubleshooting Yarn Resource Problems

2023-01-13 AlienPaul


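Preface

When submitting a Flink job, you may find that the job cannot be submitted at all, or that it stays in the ACCEPTED state for a long time. In either case, the Yarn resource configuration is the first thing to check.

This article walks through how to troubleshoot resource problems for Flink on Yarn.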

Typical error

If resources are insufficient when a Flink on Yarn program is submitted, the JobManager reports an error similar to the following:

java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

The root cause is that Yarn does not have enough resources, or that the request exceeds a configured limit.

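Determining the resources Flink uses

First, use the Flink configuration file or the job submission command to determine the Flink resource settings, which queue the job was submitted to, and whether a node label was specified and, if so, which one.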

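For Flink jobs submitted through a Yarn session, check the following parameters:

-s: number of slots per TaskManager.
-qu: the queue the job is submitted to.
-nl: the Yarn node label whose resources the job uses.
-jm: memory used by the JobManager.
-tm: memory used by each TaskManager.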

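Yarn cluster mode:

-ys: number of slots per TaskManager.
-yqu: the queue the job is submitted to.
-ynl: the Yarn node label whose resources the job uses.
-yjm: memory used by the JobManager.
-ytm: memory used by each TaskManager.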

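In addition, the following option applies when submitting a job:

-p: the default parallelism for the submitted job. Operators that do not set their own parallelism use this value. It overrides the parallelism.default configuration value.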
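When a submission command is long, it can help to pull out just the resource-related options before comparing them with the Yarn limits. The following is a minimal Python sketch that does this for the short options listed above. The helper name extract_resource_options and the sample command (queue name, label, jar name) are hypothetical and only serve as an illustration; long-form options are not handled.

```python
import shlex

# Short options from the lists above, mapped to a descriptive name.
RESOURCE_FLAGS = {
    "-s": "slots_per_taskmanager", "-ys": "slots_per_taskmanager",
    "-qu": "queue", "-yqu": "queue",
    "-nl": "node_label", "-ynl": "node_label",
    "-jm": "jobmanager_memory", "-yjm": "jobmanager_memory",
    "-tm": "taskmanager_memory", "-ytm": "taskmanager_memory",
    "-p": "parallelism",
}


def extract_resource_options(command: str) -> dict:
    """Return the resource-related options found in a submission command."""
    tokens = shlex.split(command)
    found = {}
    # Stop one token early so tokens[i + 1] always exists.
    for i, token in enumerate(tokens[:-1]):
        if token in RESOURCE_FLAGS:
            found[RESOURCE_FLAGS[token]] = tokens[i + 1]
    return found


# Hypothetical command line, for illustration only.
example = ("flink run -m yarn-cluster -yqu streaming -ynl ssd "
           "-ys 4 -yjm 2048m -ytm 8192m -p 12 job.jar")
options = extract_resource_options(example)
for name, value in sorted(options.items()):
    print(f"{name} = {value}")
```

Options that do not appear in the command are simply missing from the result, which is a hint that the value comes from flink-conf.yaml instead.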

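The flink-conf.yaml configuration file:

taskmanager.numberOfTaskSlots: number of slots per TaskManager.
parallelism.default: default parallelism. Together with the slot count, it determines how many TaskManagers are started.
jobmanager.memory.process.size: total memory of the JobManager process, including on-heap memory, off-heap memory, and memory used by the JVM itself, such as the JVM metaspace.
taskmanager.memory.process.size: total memory of the TaskManager process.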
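To collect these four values from a configuration file in one step, something like the following Python sketch may be enough. It assumes the flat "key: value" layout of flink-conf.yaml and does not use a YAML parser; the function name read_resource_settings and all sample values are hypothetical.

```python
# The four resource-related keys described above.
FLINK_RESOURCE_KEYS = (
    "taskmanager.numberOfTaskSlots",
    "parallelism.default",
    "jobmanager.memory.process.size",
    "taskmanager.memory.process.size",
)


def read_resource_settings(conf_text: str) -> dict:
    """Extract the resource-related keys from flat 'key: value' text."""
    settings = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip() in FLINK_RESOURCE_KEYS:
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical file content, for illustration only.
sample_conf = """
# resource settings
taskmanager.numberOfTaskSlots: 4
parallelism.default: 12
jobmanager.memory.process.size: 2048m
taskmanager.memory.process.size: 8192m
rest.port: 8081
"""
settings = read_resource_settings(sample_conf)
for key in FLINK_RESOURCE_KEYS:
    print(key, "=", settings.get(key, "(not set)"))
```

Remember that options given on the command line take precedence over the values read here.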

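Yarn resource configuration

This section lists the Yarn settings that deserve the closest attention.

Memory:

yarn.nodemanager.resource.memory-mb: maximum amount of memory on each NodeManager that can be allocated to containers. The memory available to the whole Yarn cluster is yarn.nodemanager.resource.memory-mb * number of NodeManager nodes.
yarn.scheduler.minimum-allocation-mb: minimum memory the RM allocates to each container. This is the lower bound on container memory.
yarn.scheduler.maximum-allocation-mb: maximum memory the RM allocates to each container. This is the upper bound on container memory.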
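The lower and upper bounds matter because the memory a container actually occupies can differ from what Flink asks for. The Python sketch below illustrates one common behavior, stated here as an assumption: with the capacity scheduler, a request is rounded up to a multiple of yarn.scheduler.minimum-allocation-mb, and a request above yarn.scheduler.maximum-allocation-mb is refused. The function name normalize_container_memory and the numbers are hypothetical; other schedulers or versions may round differently.

```python
import math


def normalize_container_memory(requested_mb: int,
                               min_alloc_mb: int,
                               max_alloc_mb: int) -> int:
    """Estimate the memory Yarn reserves for one container request.

    Assumption: requests are rounded up to a multiple of the minimum
    allocation and rejected when they exceed the maximum allocation.
    """
    if requested_mb > max_alloc_mb:
        raise ValueError(
            f"request of {requested_mb} MB exceeds the maximum "
            f"allocation of {max_alloc_mb} MB")
    # At least one allocation unit, otherwise the next multiple upward.
    units = max(1, math.ceil(requested_mb / min_alloc_mb))
    return min(units * min_alloc_mb, max_alloc_mb)


# Hypothetical values: 1024 MB minimum, 8192 MB maximum.
print(normalize_container_memory(2500, 1024, 8192))
print(normalize_container_memory(512, 1024, 8192))
```

With these sample values, a 2500 MB TaskManager would occupy 3072 MB of the queue, so the rounded figure is the one to use when estimating whether a job fits.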

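CPU:

yarn.nodemanager.resource.percentage-physical-cpu-limit: percentage of CPU resources that containers may use. It only takes effect when CGroups are enabled (yarn_cgroups_enabled).
yarn.nodemanager.resource.cpu-vcores: maximum number of CPU vcores on each NodeManager that can be allocated to containers. The number of vcores available to the whole Yarn cluster is yarn.nodemanager.resource.cpu-vcores * number of NodeManager nodes.
yarn.scheduler.minimum-allocation-vcores: minimum number of CPU vcores the RM allocates to each container. This is the lower bound on container vcores.
yarn.scheduler.maximum-allocation-vcores: maximum number of CPU vcores the RM allocates to each container. This is the upper bound on container vcores.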
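The per-node settings translate into cluster totals by simple multiplication, and they also bound how many containers of a given size fit on one node. A small Python sketch of that arithmetic follows. It assumes the scheduler accounts for both memory and vcores; if only memory is considered in your cluster, ignore the vcore term. Function names and numbers are hypothetical.

```python
def cluster_capacity(node_count: int,
                     memory_mb_per_node: int,
                     vcores_per_node: int) -> tuple:
    """Total memory (MB) and vcores that containers can use cluster-wide."""
    return node_count * memory_mb_per_node, node_count * vcores_per_node


def containers_per_node(memory_mb_per_node: int, vcores_per_node: int,
                        container_mb: int, container_vcores: int) -> int:
    """How many identical containers fit on a single node."""
    by_memory = memory_mb_per_node // container_mb
    by_vcores = vcores_per_node // container_vcores
    # The scarcer resource decides.
    return min(by_memory, by_vcores)


# Hypothetical cluster: 10 nodes, 64 GB and 16 vcores per node.
total_mb, total_vcores = cluster_capacity(10, 65536, 16)
print(total_mb, total_vcores)
# Hypothetical TaskManager container: 8 GB and 4 vcores.
print(containers_per_node(65536, 16, 8192, 4))
```

In this example memory would allow eight containers per node, but vcores allow only four, so the CPU settings are the ones to raise.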

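Capacity scheduler resource configuration:

yarn.scheduler.capacity.<queue-path>.capacity: minimum (guaranteed) resources of the queue. This minimum must still be honored when preemption occurs.
yarn.scheduler.capacity.<queue-path>.maximum-capacity: maximum resources of the queue. When other queues are not under resource pressure, the queue can use more than yarn.scheduler.capacity.<queue-path>.capacity, up to this limit.
yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent: minimum share of the queue's resources, as a percentage, that each user is guaranteed.
yarn.scheduler.capacity.<queue-path>.user-limit-factor: maximum resources a single user may use, expressed as a multiple of the queue's configured capacity (its minimum resources).
yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb: maximum memory allocated to each container in the queue. It overrides yarn.scheduler.maximum-allocation-mb and must be less than or equal to the cluster-level value.
yarn.scheduler.capacity.<queue-path>.maximum-allocation-vcores: maximum number of CPU vcores allocated to each container in the queue. It overrides yarn.scheduler.maximum-allocation-vcores and must be less than or equal to the cluster-level value.
yarn.scheduler.capacity.<queue-path>.user-settings.<user-name>.weight: resource allocation weight of a specific user. Users with a larger weight receive more resources.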
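To see how capacity, maximum-capacity and user-limit-factor interact, the following Python sketch computes the three resulting memory figures for one queue. It is a simplification under stated assumptions: the queue sits directly under root so its percentages apply to the whole cluster, only memory is considered, and preemption, user weights and minimum-user-limit-percent are ignored. The function name queue_limits and the numbers are hypothetical.

```python
def queue_limits(cluster_mb: float, capacity_pct: float,
                 maximum_capacity_pct: float,
                 user_limit_factor: float) -> tuple:
    """Return (guaranteed_mb, queue_max_mb, single_user_max_mb)."""
    guaranteed_mb = cluster_mb * capacity_pct / 100
    queue_max_mb = cluster_mb * maximum_capacity_pct / 100
    # A single user may use capacity * user-limit-factor,
    # but never more than the queue itself can grow to.
    single_user_max_mb = min(guaranteed_mb * user_limit_factor, queue_max_mb)
    return guaranteed_mb, queue_max_mb, single_user_max_mb


# Hypothetical queue: 30% capacity, 60% maximum-capacity, factor 1.5,
# on a cluster with 100000 MB available to containers.
guaranteed, queue_max, user_max = queue_limits(100000, 30, 60, 1.5)
print(guaranteed, queue_max, user_max)
```

If all Flink jobs are submitted by the same user, the third figure is usually the effective ceiling, which explains jobs that stall even though the queue as a whole still has room.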

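Capacity scheduler application limits:

yarn.scheduler.capacity.maximum-applications / yarn.scheduler.capacity.<queue-path>.maximum-applications: maximum number of running and pending applications that may exist at the same time in the cluster or in the queue. This is a hard limit: applications submitted after it is reached are rejected.
yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent: share of the cluster's or queue's resources that can be used to run Application Masters.
yarn.scheduler.capacity.max-parallel-apps / yarn.scheduler.capacity.<queue-path>.max-parallel-apps: also limits the number of applications running at the same time, cluster-wide or per queue, but as a soft limit. Applications submitted after the limit is reached stay in the ACCEPTED state and run once the condition is satisfied.
yarn.scheduler.capacity.user.max-parallel-apps: maximum number of applications each user can run at the same time (applies to all users). Soft limit.
yarn.scheduler.capacity.user.<username>.max-parallel-apps: maximum number of applications a specific user can run at the same time. Soft limit.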

Order of resource limit checks:

The official Hadoop documentation explains it as follows:

maximum-applications check - if the limit is exceeded, the submission is rejected immediately.
max-parallel-apps check - the submission is accepted, but the application will not transition to RUNNING state. It stays in ACCEPTED until the queue / user limits are satisfied.
maximum-am-resource-percent check - if there are too many Application Masters running, the application stays in ACCEPTED state until there is enough room for it.

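In summary:

Check maximum-applications. If it is exceeded, the submission is rejected. (Hard limit)
Check max-parallel-apps. If it is exceeded, the application stays in the ACCEPTED state and starts running once the limit is satisfied. (Soft limit)
Check maximum-am-resource-percent. If it is exceeded, the application stays in the ACCEPTED state and starts running once the limit is satisfied. (Soft limit)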
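The three checks can be written down as a small decision function, which is handy when working out why a job is stuck in ACCEPTED. This Python sketch only models the order described above and is not how the scheduler is implemented. It assumes maximum-am-resource-percent is given as a fraction (for example 0.1 for 10%) and considers memory only; the function name admission_state and every number are hypothetical.

```python
def admission_state(active_apps: int, maximum_applications: int,
                    running_apps: int, max_parallel_apps: int,
                    am_used_mb: int, am_request_mb: int,
                    queue_mb: int, maximum_am_resource_percent: float):
    """Return (state, limit_that_blocked_the_application_or_None)."""
    # 1. Hard limit: running + pending applications.
    if active_apps >= maximum_applications:
        return "REJECTED", "maximum-applications"
    # 2. Soft limit: applications running in parallel.
    if running_apps >= max_parallel_apps:
        return "ACCEPTED", "max-parallel-apps"
    # 3. Soft limit: resources reserved for Application Masters.
    am_limit_mb = queue_mb * maximum_am_resource_percent
    if am_used_mb + am_request_mb > am_limit_mb:
        return "ACCEPTED", "maximum-am-resource-percent"
    return "RUNNING", None


# Hypothetical queue of 100000 MB with a 10% Application Master share.
# Existing AMs use 9000 MB; the new JobManager asks for 2048 MB.
print(admission_state(10, 10000, 2, 5, 9000, 2048, 100000, 0.1))
```

For Flink on Yarn the JobManager is the Application Master, so a queue running many small jobs can exhaust the Application Master share long before its total capacity is used.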

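Access control:

yarn.scheduler.capacity.<queue-path>.state: queue state, either RUNNING or STOPPED. A STOPPED queue does not accept new applications, but applications that were already submitted keep running until they finish.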

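Node label scheduling:

Yarn nodes can be bound to labels, which restricts the physical nodes that Yarn applications are scheduled on and can also be used to limit the resources of a job. Note that nodes without any label form a category of their own, and they can be used by all queues.

Use the yarn node -list -showDetails command to view the nodes of the Yarn cluster and the labels bound to them. From the number of nodes bound to a label and the per-node memory and vcore settings described earlier, you can calculate the maximum resources managed by that label.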

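Queue label configuration (shown for the root queue; for other queues, replace root with the queue path):

yarn.scheduler.capacity.root.accessible-node-labels: which labeled resources the queue can access. Resources on unlabeled nodes are accessible to all queues.
yarn.scheduler.capacity.root.default-node-label-expression: if an app submitted to the queue does not specify a label, it uses the labeled resources given by default-node-label-expression. By default this setting is empty, which means the app uses unlabeled nodes. This setting matters: without it, an application submitted without a label cannot use labeled resources even when a queue is specified.

If the queue label configuration is wrong, or the label used when submitting the application is wrong, the application will very likely fail to obtain enough resources and never run.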
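The label rules above boil down to a short lookup: the label given at submission wins, otherwise the queue's default-node-label-expression applies, otherwise the application lands on unlabeled nodes. The Python sketch below models that lookup. It is an illustration under assumptions, not scheduler code: it treats "*" in accessible-node-labels as "all labels" and raises an error when the queue cannot access the requested label. The function name resolve_partition and the label names are hypothetical.

```python
def resolve_partition(app_label, queue_default_label, accessible_labels):
    """Return the label whose nodes the application will use.

    An empty string stands for the unlabeled nodes.
    """
    # A label given at submission takes precedence over the queue default.
    label = app_label or queue_default_label or ""
    if label and "*" not in accessible_labels and label not in accessible_labels:
        raise PermissionError(
            f"queue cannot access nodes labeled '{label}'")
    return label


# Hypothetical queue with access to the "ssd" label and no default label.
print(repr(resolve_partition("ssd", "", ["ssd"])))  # label given at submission
print(repr(resolve_partition(None, "", ["ssd"])))   # no label: unlabeled nodes
```

The second call shows the pitfall described above: the queue may access "ssd" nodes, but a job submitted without a label still ends up on unlabeled nodes because no default label is configured.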

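Calculating Flink resources

Number of TaskManagers = ceil(P / taskmanager.numberOfTaskSlots), where P is parallelism.default or the parallelism specified at run time.

If operators have different parallelism, use the largest operator parallelism as P.

Total number of containers = 1 + number of TaskManagers.

The 1 is the JobManager (in the AppMaster role), which occupies a container of its own.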
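As a worked example, the calculation above can be sketched in a few lines of Python. The memory total at the end is an addition of mine for convenience: it simply sums the configured process sizes and ignores any rounding Yarn applies to container requests. Function names and sample values are hypothetical.

```python
import math


def taskmanager_count(parallelism: int, slots_per_taskmanager: int) -> int:
    """Number of TaskManagers = ceil(parallelism / slots per TaskManager)."""
    return math.ceil(parallelism / slots_per_taskmanager)


def container_count(parallelism: int, slots_per_taskmanager: int) -> int:
    """One container for the JobManager plus one per TaskManager."""
    return 1 + taskmanager_count(parallelism, slots_per_taskmanager)


def total_memory_mb(parallelism: int, slots_per_taskmanager: int,
                    jobmanager_mb: int, taskmanager_mb: int) -> int:
    """Sum of the configured process sizes, before any rounding by Yarn."""
    tms = taskmanager_count(parallelism, slots_per_taskmanager)
    return jobmanager_mb + tms * taskmanager_mb


# Hypothetical job: parallelism 13, 4 slots per TaskManager,
# 2048 MB JobManager, 8192 MB per TaskManager.
print(taskmanager_count(13, 4))
print(container_count(13, 4))
print(total_memory_mb(13, 4, 2048, 8192))
```

Here a parallelism of 13 needs a fourth TaskManager for a single slot, so the job asks for five containers in total. Comparing that request with the queue and label limits described earlier usually shows which setting blocks the job.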

References

https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

https://zhuanlan.zhihu.com/p/335881182