E-HPC Supports Multi-Queue Management and Auto Scaling

2018-09-30 Alibaba Cloud Yunqi (阿里云云栖号)

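Alibaba Cloud E-HPC (Elastic High Performance Computing) recently added support for multi-queue scheduling and management, along with auto scaling policies for multi-queue scheduling.
This article covers the following:

Background and use cases for multi-queue scheduling
How E-HPC implements multi-queue scheduling
How each HPC scheduler type configures and manages queues and node groups
How to call the E-HPC multi-queue features through OpenAPI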

Background

When a traditional on-premises HPC cluster is migrated to the cloud, some deployments adopt a hybrid-cloud model.

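Cloud compute resources may have different specifications from the on-premises compute nodes, so a single cluster has to support several kinds of compute resources. HPC clusters typically use separate job queues or node groups to manage nodes of different specifications, and then dispatch jobs to different queues to keep cloud jobs and on-premises jobs apart.

Some customers also need to run different types of jobs in one E-HPC cluster, each with different resource requirements. For example, pre-processing jobs need ordinary ECS virtual machines with 8 cores and 32 GiB of memory, while back-end compute-intensive tasks need bare metal servers.

Multi-queue support in E-HPC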

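E-HPC supports multi-queue deployment through the following features:

A new instance type can be specified when scaling out
Nodes can join a specified queue when a cluster is created or scaled out; if the queue does not exist, it is created automatically
Jobs can be submitted to a specified queue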


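The auto scaling service supports per-queue scaling policies. The following can be configured for each queue:
Instance type used for automatic scale-out
Billing method for scale-out: pay-as-you-go or preemptible instances
For preemptible instances, the bidding strategy: automatic bidding by the system, or a maximum price
All other scaling settings are shared from the cluster's global configuration. Auto scaling can also be enabled for some queues and left disabled for others.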


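Multi-queue support in HPC schedulers

E-HPC can create and deploy clusters with several HPC schedulers, and each scheduler type handles queues differently. A brief overview follows.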

PBSPro

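PBSPro has two queue types:

execution: an execution queue; a job must be in an execution queue to be dispatched and run
routing: a routing queue, used to move jobs to other queues; the destination can be a routing queue or an execution queue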

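PBSPro creates and enables an execution queue named workq by default. If no queue is specified when nodes are added, jobs in workq can be dispatched to those nodes.
The following commands manage PBSPro queues: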

 qmgr -c "create queue high queue_type = execution"
 qmgr -c "set queue high started = true"
 qmgr -c "set queue high enabled = true"
 # Assign node node001 to queue high; the node will then run only jobs from queue high
 qmgr -c "set node node001 queue = high"

E-HPC currently manages only execution queues on PBSPro clusters.
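
When several nodes have to be moved into a new execution queue, the qmgr commands above can be generated instead of typed by hand. The following is a minimal sketch, not part of E-HPC: the helper name pbs_queue_commands and the second node name are hypothetical, and it only builds the command strings shown above without running them.

```python
def pbs_queue_commands(queue, nodes):
    """Return the qmgr command lines that create an execution queue
    and bind each of the given nodes to it."""
    commands = [
        'qmgr -c "create queue {0} queue_type = execution"'.format(queue),
        'qmgr -c "set queue {0} started = true"'.format(queue),
        'qmgr -c "set queue {0} enabled = true"'.format(queue),
    ]
    for node in nodes:
        # A node bound to a queue only runs jobs from that queue.
        commands.append('qmgr -c "set node {0} queue = {1}"'.format(node, queue))
    return commands


# node002 is a made-up node name used only for illustration.
for line in pbs_queue_commands("high", ["node001", "node002"]):
    print(line)
```

The printed lines can be reviewed and then run on the scheduler node by an administrator.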

Slurm

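In Slurm the counterpart of a queue is a partition. A partition can be viewed as a node group that divides the nodes into sets. It can also be viewed as a job queue, because limits can be set on the jobs that run in it, such as a maximum run time or user access restrictions.
On E-HPC Slurm clusters the default partition is comp, and all compute nodes belong to it.
The following shows how partitions are configured in Slurm: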

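 # Create a new partition and assign nodes to it. This change is not persistent: restarting the slurmctld service overwrites it
 scontrol create PartitionName=heavy nodes=compute0

 # Alternatively, edit the configuration file
 # Open /opt/slurm/17.02.4/etc/slurm.conf; the partition settings are at the end of the file
 PartitionName=comp Nodes=ALL Default=YES MaxTime=INFINITE State=UP
 # Add a new partition, for example (Default=NO keeps comp as the only default partition)
 PartitionName=light Nodes=compute0,compute1 Default=NO MaxTime=INFINITE State=UP
 # Restart slurmctld
 systemctl restart slurmctld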
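
To keep partition lines in slurm.conf consistent, they can be rendered from a small helper. This is a sketch under assumptions: the function name partition_line is hypothetical, and it only emits the five fields that appear in the configuration above. It does not edit slurm.conf or restart slurmctld.

```python
def partition_line(name, nodes, default=False, max_time="INFINITE", state="UP"):
    """Render one slurm.conf partition definition using the fields shown above."""
    return "PartitionName={0} Nodes={1} Default={2} MaxTime={3} State={4}".format(
        name,
        ",".join(nodes),               # Slurm expects a comma-separated node list
        "YES" if default else "NO",    # only one partition should be the default
        max_time,
        state,
    )


print(partition_line("light", ["compute0", "compute1"]))
```

Leaving default as False for additional partitions avoids having two partitions that both claim to be the default.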

LSF/CUBE

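The default queue in LSF and CUBE is normal, and all nodes join it by default. Individual nodes or host groups can be assigned to a specific queue. The queue is configured as follows: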

 # Open the queue configuration file lsb.queues (for CUBE the configuration path is /opt/cubeman/etc; LSF is similar)
 # Add the following queue definition
 Begin Queue
 QUEUE_NAME   = high
 PRIORITY     = 30
 NICE         = 20
 #QJOB_LIMIT   = 60         # job limit of the queue
 #UJOB_LIMIT   = 5               # job limit per user
 #PJOB_LIMIT   = 2               # job limit per processor
 #RUN_WINDOW   = 5:19:00-1:8:30 20:00-8:30
 #r1m         = 0.7/2.0        # loadSched/loadStop
 #r15m          = 1.0/2.5
 #pg          = 4.0/8
 #ut           = 0.2
 #io          = 50/240
 #CPULIMIT     = 180/apple      # 3 hours of host apple
 #FILELIMIT    = 20000
 #MEMLIMIT     = 5000           # jobs bigger than this (5M) will be niced
 #DATALIMIT    = 20000          # jobs data segment limit
 #STACKLIMIT   = 2048
 #CORELIMIT    = 20000
 #PROCLIMIT    = 5              # job processor limit
 #USERS        = all            # users who can submit jobs to this queue
 HOSTS        = high            # hostgroup high
 #PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
 #POST_EXEC    = /usr/local/lsf/misc/testq_post |grep -v "Hey"
 #REQUEUE_EXIT_VALUES = 55 34 78
 DESCRIPTION  = For normal low priority jobs, running only if hosts are \
 lightly loaded.
 End Queue

 # Open the host group configuration file lsb.hosts and append the host group definition (for CUBE the configuration path is /opt/cubeman/etc; LSF is similar)
 Begin HostGroup
 GROUP_NAME    GROUP_MEMBER    # Key words
 high        (compute0 compute1)    # Define a host group
 End HostGroup

 # Restart the service
 service cubeman restart
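
Because most keys in the sample queue definition are commented out, it can be useful to check which settings are actually active before restarting the service. The following is a rough sketch of a reader for Begin Queue / End Queue sections. It is not LSF's or CUBE's own parser; the function name parse_lsb_queues is hypothetical, and it handles only `#` comments, `KEY = value` pairs and trailing-backslash line continuations.

```python
def parse_lsb_queues(text):
    """Return one dict of active KEY -> value settings per queue section."""
    queues = []
    current = None   # settings of the section being read, or None outside a section
    pending = ""     # text carried over from a line that ended with a backslash
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()   # drop comments
        if pending:
            line = pending + " " + line.strip()
            pending = ""
        if line.endswith("\\"):
            pending = line[:-1].rstrip()
            continue
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == "Begin Queue":
            current = {}
        elif stripped == "End Queue":
            if current is not None:
                queues.append(current)
            current = None
        elif current is not None and "=" in stripped:
            key, value = stripped.split("=", 1)
            current[key.strip()] = value.strip()
    return queues


SAMPLE = "\n".join([
    "Begin Queue",
    "QUEUE_NAME   = high",
    "PRIORITY     = 30",
    "#USERS        = all            # users who can submit jobs to this queue",
    "HOSTS        = high            # hostgroup high",
    "DESCRIPTION  = For normal low priority jobs, running only if hosts are \\",
    "lightly loaded.",
    "End Queue",
])

for queue in parse_lsb_queues(SAMPLE):
    print(queue["QUEUE_NAME"], "->", queue["HOSTS"])
```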

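SGE (Sun Grid Engine)

The default queue in SGE is all.q and the default host group is @allhosts, which contains all nodes by default.
The following shows how queues are configured in SGE: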

 # Add a host group
 qconf -ahgrp @high

 group_name @high
 hostlist compute0 compute1

 # Add a queue
 qconf -aq high
 # Set the queue's host group
 hostlist              @high
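
The qconf -ahgrp command opens an editor, which is awkward to script. One possible workaround is to render the host group definition as text and load it from a file. The sketch below only produces the two fields shown above; the helper name hostgroup_config is hypothetical, and loading the result with a file-based option such as qconf -Ahgrp is an assumption to verify against your Grid Engine version.

```python
def hostgroup_config(name, hosts):
    """Render an SGE host group definition with the two fields shown above."""
    if not name.startswith("@"):
        name = "@" + name   # SGE host group names start with '@'
    return "group_name {0}\nhostlist {1}\n".format(name, " ".join(hosts))


print(hostgroup_config("high", ["compute0", "compute1"]), end="")
```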

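API examples

Some customers and partners integrate with E-HPC through OpenAPI, so this section shows how to call the relevant APIs. The samples are in Python; sample code in other languages is available in OpenAPI Explorer.

CreateCluster (create a cluster)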

 #!/usr/bin/env python
 #coding=utf-8

 from aliyunsdkcore.client import AcsClient
 from aliyunsdkcore.request import CommonRequest
 client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

 request = CommonRequest()
 request.set_accept_format('json')
 request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
 request.set_method('GET')
 request.set_version('2018-04-12')
 request.set_action_name('CreateCluster')

 # Set the queue: compute nodes created with the cluster join this queue, which is created automatically
 request.add_query_param('JobQueue', 'high')

 # Set the other CreateCluster parameters
 # ......

 response = client.do_action_with_exception(request)

AddNodes

 #!/usr/bin/env python
 #coding=utf-8

 from aliyunsdkcore.client import AcsClient
 from aliyunsdkcore.request import CommonRequest
 client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

 request = CommonRequest()
 request.set_accept_format('json')
 request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
 request.set_method('GET')
 request.set_version('2018-04-12')
 request.set_action_name('AddNodes')
 # Set the queue: newly added compute nodes join this queue, which is created automatically if it does not exist
 request.add_query_param('JobQueue', 'high')

 # Set the other AddNodes parameters
 # ......

 response = client.do_action_with_exception(request)

ListQueues

A new API that lists the queues of a cluster.

 #!/usr/bin/env python
 #coding=utf-8

 from aliyunsdkcore.client import AcsClient
 from aliyunsdkcore.request import CommonRequest
 client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

 request = CommonRequest()
 request.set_accept_format('json')
 request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
 request.set_method('GET')
 request.set_version('2018-04-12')
 request.set_action_name('ListQueues')

 request.add_query_param('RegionId', 'cn-hangzhou')
 request.add_query_param('ClusterId', '<clusterId>')

 response = client.do_action_with_exception(request)
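
The response body is returned as JSON. The sketch below pulls the queue names out of such a body. The response layout is an assumption made for illustration (queues listed under Queues.QueueInfo, each with a QueueName field), the sample payload is made up, and the helper name queue_names is hypothetical. Check the actual ListQueues response in OpenAPI Explorer before relying on it.

```python
import json


def queue_names(response_body):
    """Extract queue names from a ListQueues response body (bytes or str)."""
    if isinstance(response_body, bytes):
        response_body = response_body.decode("utf-8")
    data = json.loads(response_body)
    # ASSUMPTION: queues are listed under Queues -> QueueInfo, each with a QueueName.
    entries = data.get("Queues", {}).get("QueueInfo", [])
    return [entry["QueueName"] for entry in entries if "QueueName" in entry]


# Made-up payload for illustration only; not captured from a real cluster.
sample = b'{"RequestId": "hypothetical-id", "Queues": {"QueueInfo": [{"QueueName": "workq"}, {"QueueName": "high"}]}}'
print(queue_names(sample))
```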

SubmitJob

 #!/usr/bin/env python
 #coding=utf-8

 from aliyunsdkcore.client import AcsClient
 from aliyunsdkcore.request import CommonRequest
 client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

 request = CommonRequest()
 request.set_accept_format('json')
 request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
 request.set_method('GET')
 request.set_version('2018-04-12')
 request.set_action_name('SubmitJob')

 # Submit the job to this queue
 request.add_query_param('JobQueue', 'high')

 # Set the other SubmitJob parameters
 # ......

 response = client.do_action_with_exception(request)

SetAutoScaleConfig

 #!/usr/bin/env python
 #coding=utf-8

 from aliyunsdkcore.client import AcsClient
 from aliyunsdkcore.request import CommonRequest
 client = AcsClient('<accessKeyId>', '<accessSecret>','cn-hangzhou')

 request = CommonRequest()
 request.set_accept_format('json')
 request.set_domain('ehpc.cn-hangzhou.aliyuncs.com')
 request.set_method('GET')
 request.set_version('2018-04-12')
 request.set_action_name('SetAutoScaleConfig')

 # For queue high: scale out with the GPU instance type ecs.gn6v-c8g1.8xlarge, pay-as-you-go
 request.add_query_param('Queues.1.QueueName', 'high')
 request.add_query_param('Queues.1.InstanceType', 'ecs.gn6v-c8g1.8xlarge')
 request.add_query_param('Queues.1.SpotStrategy', 'NoSpot')
 request.add_query_param('Queues.1.SpotPriceLimit', '0')

 # For queue low: scale out with preemptible ecs.g5.large instances, with a maximum bid of 0.1
 request.add_query_param('Queues.2.QueueName', 'low')
 request.add_query_param('Queues.2.InstanceType', 'ecs.g5.large')
 request.add_query_param('Queues.2.SpotStrategy', 'SpotWithPriceLimit')
 request.add_query_param('Queues.2.SpotPriceLimit', '0.1')

 # Set the other SetAutoScaleConfig parameters
 # ......
 
 response = client.do_action_with_exception(request)
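
With more than a couple of queues, writing the Queues.N.* parameters by hand becomes error-prone. A minimal sketch of one way to generate them from a list of per-queue settings follows. The helper name queue_scale_params is hypothetical; the field names are the ones used in the sample above, and the helper does not validate them.

```python
def queue_scale_params(queues):
    """Flatten per-queue settings into 1-based Queues.N.Field query parameters."""
    params = {}
    for index, queue in enumerate(queues, start=1):
        for field, value in queue.items():
            params["Queues.{0}.{1}".format(index, field)] = str(value)
    return params


queues = [
    {"QueueName": "high", "InstanceType": "ecs.gn6v-c8g1.8xlarge",
     "SpotStrategy": "NoSpot", "SpotPriceLimit": 0},
    {"QueueName": "low", "InstanceType": "ecs.g5.large",
     "SpotStrategy": "SpotWithPriceLimit", "SpotPriceLimit": 0.1},
]

for key, value in sorted(queue_scale_params(queues).items()):
    print(key, "=", value)
```

Each resulting key and value can then be passed to request.add_query_param(key, value) on the request built as in the sample above.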

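Additional setup for LSF/CUBE clusters

LSF and CUBE require a license. After the cluster is created, the user must manually configure license authentication and then manually configure the queues and host groups as described in the LSF/CUBE section above. After that, multi-queue management is automatic for subsequent node scale-out and auto scaling.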

Author: 缘督


This article is original content of the Yunqi Community (云栖社区) and may not be reproduced without permission.