Two brand-new computers running Ubuntu 18.04. I do not know the CPU details of either machine, so you may assume them yourself. Please give me a complete Slurm installation tutorial; all software can be installed via apt install. Neither computer can connect to the internet. The master node's IP is 10.18.10.200 and the compute node's IP is 10.18.10.199; the master node is the control node and also acts as a compute node. In the end it should be possible to run srun -N 2 o.sh, where the file o.sh exists only on the master node and not on the compute node. The two computers use different usernames.
Slurm is an open-source cluster management and job scheduling system for Linux. It lets administrators allocate compute resources efficiently, control jobs, and monitor cluster performance. This article provides a complete Slurm installation tutorial to help you set up Slurm on two fresh Ubuntu 18.04 machines. The detailed steps are as follows:
Step 1: Install Slurm and its dependencies
Before configuring Slurm we need to install a few packages. Since neither machine has internet access, the packages must come from a local APT mirror or from .deb files prepared in advance on a machine that does have access (a sketch of that offline workflow follows the commands below). Open a terminal and run:
```
sudo apt-get update
sudo apt-get install munge libmunge-dev mariadb-server mariadb-client libmariadb-dev libmysqlclient-dev libmunge2 slurm-wlm slurm-wlm-basic-plugins slurm-wlm-torque
```
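Because the two nodes are offline, one possible way to get these packages onto them is to stage the .deb files on a third, internet-connected Ubuntu 18.04 machine. This is only a sketch; the staging directory and transfer method are placeholders you can change freely:

```bash
# On an internet-connected Ubuntu 18.04 machine: fetch the packages and all
# of their missing dependencies without installing them
# (the files land in /var/cache/apt/archives/).
sudo apt-get update
sudo apt-get install --download-only \
    munge libmunge-dev libmunge2 \
    mariadb-server mariadb-client libmariadb-dev libmysqlclient-dev \
    slurm-wlm slurm-wlm-basic-plugins slurm-wlm-torque

# Collect the downloaded .deb files and carry them to each offline node
# (USB drive, scp over the local network, etc.).
mkdir -p ~/slurm-debs
cp /var/cache/apt/archives/*.deb ~/slurm-debs/

# On each offline node, install everything from the copied directory.
sudo dpkg -i ~/slurm-debs/*.deb
```

Note that `--download-only` only fetches dependencies that are not already installed on the staging machine, so a clean machine (or a chroot/container) gives the most complete set of .deb files.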
Step 2: Configure Munge
Munge is the authentication service Slurm uses to sign and verify messages between nodes; it is a required component. We need to configure Munge so that Slurm can use it.
First, create a Munge key on the master node:
```sudo /usr/sbin/create-munge-key -r```
Then the same key has to be present on every node. Since it was just generated on the master node (10.18.10.200), it only needs to be copied to the compute node. A regular user cannot write into /etc/munge/ over scp, so copy it to a temporary location first and move it into place afterwards, as shown below (replace <compute_username> with the account that exists on the compute node):
```
sudo scp /etc/munge/munge.key <compute_username>@10.18.10.199:/tmp/
```
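On the compute node, move the key into place and restore the ownership and permissions Munge expects (these are standard Munge requirements; the path matches the copy above):

```bash
# Run on the compute node (10.18.10.199).
sudo mv /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key   # the key must belong to the munge user
sudo chmod 400 /etc/munge/munge.key           # and be readable only by that user
```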
Now start the Munge service on both nodes and enable it at boot (on the compute node, do this after the key has been moved into place):
```
sudo systemctl restart munge
sudo systemctl enable munge
```
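To confirm that Munge authentication works both locally and across the two nodes, you can use the standard munge/unmunge round trip (the SSH username is whatever account exists on the compute node):

```bash
# Local check (run on each node): encode a credential and decode it again.
munge -n | unmunge

# Cross-node check (run on the master): a credential created here must
# decode successfully on the compute node, proving both share the same key.
munge -n | ssh <compute_username>@10.18.10.199 unmunge
```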
Step 3: Configure MariaDB
Slurm's accounting layer (the slurmdbd daemon) uses MariaDB to store job accounting data; the cluster configuration itself lives in slurm.conf, not in the database. We will create a database named slurm and an account named slurm_acct for it.
Open a terminal and run the following command to secure the MariaDB installation:
```sudo mysql_secure_installation```
Then create the slurm database and the slurm_acct user. Open a terminal and run:
```
sudo mysql -u root -p
MariaDB [(none)]> CREATE DATABASE slurm;
MariaDB [(none)]> GRANT ALL ON slurm.* TO 'slurm_acct'@'localhost' IDENTIFIED BY 'yourpassword';
MariaDB [(none)]> GRANT ALL ON slurm.* TO 'slurm_acct'@'%' IDENTIFIED BY 'yourpassword';
MariaDB [(none)]> FLUSH PRIVILEGES;
MariaDB [(none)]> exit;
```
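You can optionally check that the new account can reach the database (use the password you chose above):

```bash
# Should list the 'slurm' database among the results.
mysql -u slurm_acct -p -e "SHOW DATABASES;"
```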
Step 4: Configure Slurm
Next, configure Slurm. On the master node, open a terminal and run:
```sudo nano /etc/slurm-llnl/slurm.conf```
Then add the following content to the file:
```
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=10.18.10.200
AuthType=auth/munge
MpiDefault=none
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
#ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
#SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
#TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=none
#TmpFS=/tmp
#TrackWCKey=yes
#TreeWidth=50
#UnkillableStepProgram=
#UsePAM=0
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
FastSchedule=1
#DefMemPerCPU=500
#MaxMemPerCPU=1000
#SchedulerRootFilter=1
SchedulerPort=7321
#SchedulerRootOnly=0
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=limits
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
#JobCompType=
#JobCompUser=
#JobContainerType=job_container/none
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
# COMPUTE NODES
NodeName=10.18.10.200 CPUs=8 CoresPerSocket=4 ThreadsPerCore=2 Sockets=1 RealMemory=32000 State=UNKNOWN
NodeName=10.18.10.199 CPUs=8 CoresPerSocket=4 ThreadsPerCore=2 Sockets=1 RealMemory=32000 State=UNKNOWN
PartitionName=debug Nodes=10.18.10.200,10.18.10.199 Default=YES MaxTime=INFINITE State=UP
```
In this file, adjust the following lines to match your own cluster. The CPU and memory figures are assumed values (the real hardware was not specified); read the actual values off each machine as shown after this step. Also note that Slurm expects NodeName (and ControlMachine) to match what each machine calls itself, i.e. the output of `hostname -s`; if your hostnames are not these IP addresses, use the real hostnames here, add NodeAddr=/ControlAddr= entries with the IPs, and list both machines in /etc/hosts on both nodes, since there is no DNS:
```
ControlMachine=10.18.10.200
AuthType=auth/munge
NodeName=10.18.10.200 CPUs=8 CoresPerSocket=4 ThreadsPerCore=2 Sockets=1 RealMemory=32000 State=UNKNOWN
NodeName=10.18.10.199 CPUs=8 CoresPerSocket=4 ThreadsPerCore=2 Sockets=1 RealMemory=32000 State=UNKNOWN
PartitionName=debug Nodes=10.18.10.200,10.18.10.199 Default=YES MaxTime=INFINITE State=UP
```
Then save and close the file.
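To find the real values for the NodeName lines, Slurm can print a ready-made node definition based on the hardware it detects; run this on each machine and copy the relevant fields into slurm.conf:

```bash
# Prints a line of the form
# "NodeName=<host> CPUs=... Sockets=... CoresPerSocket=... ThreadsPerCore=... RealMemory=..."
sudo slurmd -C
```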
Step 5: Start Slurm
On the master node, open a terminal and run the following commands. Because the master node is also a compute node, it must run both slurmctld (the controller daemon) and slurmd (the node daemon):
```
sudo systemctl start slurmctld slurmd
sudo systemctl enable slurmctld slurmd
```
Then, on the compute node, first copy /etc/slurm-llnl/slurm.conf from the master node to the same path (the file must be identical on all nodes; you can transfer it the same way as the Munge key), and then run:
```
sudo systemctl start slurmd
sudo systemctl enable slurmd
```
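Once both daemons are up, a few standard Slurm commands help confirm that the cluster is healthy:

```bash
# Partition and node overview; both nodes should eventually show as "idle".
sinfo

# Detailed per-node state, useful when a node is reported as down or drained.
scontrol show nodes

# If a node stayed down while you were fixing the configuration, return it to service:
sudo scontrol update NodeName=10.18.10.199 State=RESUME
```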
Step 6: Test Slurm
Slurm is now installed and configured. You can check that it is working with the following command:
```srun -N 2 hostname```
If everything is working, you should see the hostnames of both nodes.
Now you can test whether Slurm can run a job script:
```srun -N 2 /path/to/o.sh```
This runs the o.sh script on both nodes. Keep in mind that o.sh exists only on the master node and srun does not copy it for you, so the script must first be made available on the compute node at the same path, or staged automatically at launch time; a sketch of both options follows.
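A minimal illustration, assuming /path/to/o.sh is a placeholder for wherever the script actually lives and <compute_username> is the compute node's account; the --bcast option is available on reasonably recent Slurm releases (including the 17.11 series shipped with Ubuntu 18.04):

```bash
# An example o.sh (illustrative content only); make it executable on the master node.
cat > /path/to/o.sh <<'EOF'
#!/bin/bash
echo "running on $(hostname)"
EOF
chmod +x /path/to/o.sh

# Option 1: copy the script to the same path on the compute node, then launch it.
scp /path/to/o.sh <compute_username>@10.18.10.199:/path/to/o.sh
srun -N 2 /path/to/o.sh

# Option 2: let srun broadcast the script to /tmp/o.sh on every allocated node
# before executing it, so no manual copy is needed.
srun -N 2 --bcast=/tmp/o.sh /path/to/o.sh
```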
Congratulations! You have successfully installed and configured Slurm on two fresh Ubuntu 18.04 machines and can now start running jobs on your cluster.