
01. Slurm: Cluster Management and Job Scheduling System

2019-12-14  GradientDescent

Introduction

https://slurm.schedmd.com/overview.html

Overview
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
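The three functions above can be sketched as a toy model. This is only an illustration under simplifying assumptions (exclusive node allocation, a strict first-in-first-out queue); the names `Job` and `ToyScheduler` are hypothetical and this is not how Slurm itself is implemented or configured.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes_needed: int


@dataclass
class ToyScheduler:
    """Toy model of a workload manager; not Slurm's real scheduling logic."""
    free_nodes: set
    pending: deque = field(default_factory=deque)
    running: dict = field(default_factory=dict)  # job name -> allocated nodes

    def submit(self, job):
        # Function 3: contention is handled by queueing work that cannot start yet.
        self.pending.append(job)
        self._schedule()

    def finish(self, name):
        # Release the job's nodes, then check whether queued work can now start.
        self.free_nodes |= self.running.pop(name)
        self._schedule()

    def _schedule(self):
        # Strict FIFO (an assumption of this sketch): only the head of the queue is considered.
        while self.pending and self.pending[0].nodes_needed <= len(self.free_nodes):
            job = self.pending.popleft()
            # Function 1: allocate nodes to the job (exclusively, in this sketch).
            allocated = {self.free_nodes.pop() for _ in range(job.nodes_needed)}
            # Function 2: record the job as started on its allocated nodes.
            self.running[job.name] = allocated


sched = ToyScheduler(free_nodes={"n1", "n2", "n3"})
sched.submit(Job("a", 2))   # starts immediately on two nodes
sched.submit(Job("b", 2))   # must wait: only one node is free
waiting_before = len(sched.pending)
sched.finish("a")           # frees two nodes, so "b" can start
```

A real workload manager adds priorities, time limits, and shared (non-exclusive) allocations on top of this basic allocate, run, and queue loop.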

Architecture



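Slurm uses a primary/worker architecture: one primary controller (slurmctld) handles job management, multiple compute nodes (each running slurmd) execute the computing tasks, and the primary can have an optional backup.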
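A minimal sketch of that layout, assuming a simple "use the backup when the primary is down" rule. The classes `Controller` and `Cluster` and the host names are hypothetical, and this is not Slurm's actual failover mechanism; it only models who is in charge.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Controller:
    host: str           # machine that would run slurmctld
    alive: bool = True


@dataclass
class Cluster:
    primary: Controller
    nodes: List[str]                      # machines that would run slurmd
    backup: Optional[Controller] = None   # the backup controller is optional

    def active_controller(self) -> Controller:
        # Prefer the primary; fall back to the backup only if one exists and is up.
        if self.primary.alive:
            return self.primary
        if self.backup is not None and self.backup.alive:
            return self.backup
        raise RuntimeError("no controller available")

    def dispatch(self, job: str) -> dict:
        # The controller manages the job; the compute nodes execute it.
        ctl = self.active_controller()
        return {"controller": ctl.host, "job": job, "nodes": list(self.nodes)}


cluster = Cluster(Controller("ctl1"), ["node1", "node2"], backup=Controller("ctl2"))
first = cluster.dispatch("job-1")["controller"]
cluster.primary.alive = False             # simulate losing the primary
second = cluster.dispatch("job-2")["controller"]
```

In this sketch the first job is managed by `ctl1` and the second by `ctl2` once the primary is marked as down.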

Tutorial

https://slurm.schedmd.com/tutorials.html

Go straight to this document: https://www.open-mpi.org/video/slurm/Slurm_EMC_Dec2012.pdf

Concepts:

SLURM Entities

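Operations
Several run modes
Other commands
MPI support
Lessons from the release cadence

Continuous integration, with usable features released on a regular schedule.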

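Build and installation

Slurm ships with its own test suite, which can be used for regression verification once installation is complete.

2019.12.14: Finished reading the tutorial.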
