Here is how the whole incident unfolded:
# srun hostname
srun: Required node not available (down, drained or reserved)
srun: job 58 queued and waiting for resources
# squeue
58 compute hostname root PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
That is, the nodes required for the job are down, drained, or reserved for jobs in higher-priority partitions.
# scontrol show jobs
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
control up infinite 1 drain* m1
compute* up infinite 1 drain c1
Use the scancel command with this job ID to cancel the job, then try to set the node back to the idle state:
# scancel 58
# scontrol update NodeName=m1 State=idle
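As a side note (standard sinfo/scontrol usage, not taken from the original session): sinfo -R lists the Reason string recorded for every down or drained node, and State=resume clears the drain flag once the underlying problem has been fixed.
# sinfo -R
# scontrol update NodeName=m1 State=resume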
Checking the controller log /var/log/slurmctld.log showed:
error: Nodes m1 not responding
# scontrol show node
Compute node status: Reason=Low socket*core*thread count, Low CPUs [slurm@2021-09-15T15:18:53]
Control node status: Reason=Not responding [slurm@2021-10-12T14:34:34]
Finally, compare the servers' actual hardware resources against what is declared in slurm.conf. The values declared in slurm.conf turned out to be higher than the real hardware, so the file had to be corrected.
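A convenient way to make this comparison (standard slurmd usage, not part of the original session) is to run slurmd -C on each node; it prints the hardware Slurm actually detects, already formatted in slurm.conf syntax. The output below is only illustrative, not the real values from this cluster:
# slurmd -C
NodeName=c1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=972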
# vim /etc/slurm/slurm.conf
The relevant parts to modify:
ControlMachine=m1
ControlAddr=192.168.8.150
SlurmUser=slurm
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
NodeName=m1 NodeAddr=192.168.8.150 CPUs=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=200 Procs=1 State=UNKNOWN
NodeName=c1 NodeAddr=192.168.8.145 CPUs=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=200 Procs=1 State=UNKNOWN
PartitionName=control Nodes=m1 Default=NO MaxTime=INFINITE State=UP
PartitionName=compute Nodes=c1 Default=YES MaxTime=INFINITE State=UP
Simply lower the values of the CPUs=1, CoresPerSocket=1, ThreadsPerCore=1 and RealMemory=200 parameters so that they do not exceed the node's actual hardware.
Note:
1. If you modify the configuration file slurm.conf, run the scontrol reconfig command on the master to reload the configuration.
2. At present every machine in the cluster uses the same configuration file; if you change it, update the conf on all machines accordingly.
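For example (assuming password-less ssh from m1 to c1; hostnames and paths follow the cluster described above, and the systemd unit names may differ on other distributions):
# scp /etc/slurm/slurm.conf c1:/etc/slurm/slurm.conf
# scontrol reconfig
If the daemons need a full restart instead, systemctl restart slurmctld on m1 and systemctl restart slurmd on c1 will do it.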