- Cluster Login
- Partitions (Queues) and Compute Nodes
- How to use Environment Module
- How to use PBS TORQUE
- Click here for PBS TORQUE Examples
- How to use SLURM
- Click here for Slurm Examples
- Useful References and Cheatsheets
- Queue Information
At this moment, we have three clusters, each of which has a head node. To submit jobs to a cluster, you must connect to its head node via the ssh protocol. Microsoft Windows users may use programs such as PuTTY or the free version of MobaXterm, while Linux and macOS users can use a terminal (note that the Windows 10 Command Prompt and PowerShell also support ssh after the April 2018 update). The three head nodes are:
- dvorak (for part of the cpu cluster).
- gpu (for the gpu cluster).
- slurm (for part of the cpu cluster).
The first two head nodes use PBS TORQUE and the third uses Slurm as the workload manager. In order to connect to a head node, type:
ssh <USERNAME>@<HEAD-NODE>
where <USERNAME> is your cluster userid and <HEAD-NODE> is one of dvorak, gpu, or slurm. For example, a user with the userid ‘abc123’ connecting to the Slurm head node would use:
ssh abc123@slurm
Partitions (Queues) and Compute Nodes
As mentioned above, we currently have three clusters; here is how to get information about the partitions and compute nodes on each of them:
- On the dvorak head node, you can use the “cpus.py” command to see all the available partitions and compute nodes, together with information about the jobs running on them. This command also shows the amount of memory and the number of cpus on each compute node. You may also use “qstat -Q” to see the partitions and the number of submitted and currently running jobs in each partition.
- On the gpu head node, you can use the “gpus.py” command to see a list of available compute nodes together with useful information about the gpu cards inside them. You can also use the “cpus.py” and “qstat -Q” commands mentioned above.
- On the slurm head node, you can use the “sinfo” or “snodes” commands to see all the partitions and available compute nodes.
How to use Environment Module
The module package provides a dynamic environment for the user: it creates and removes the related environment (variable) settings on demand. The following examples show how to use module:
module avail
shows the available modules
module load anaconda/3
loads anaconda version 3
module unload anaconda/3
unloads anaconda version 3
module list
lists the loaded modules
module purge
unloads all the loaded modules
How to use PBS TORQUE
In order to use PBS TORQUE, you need to ssh to either the dvorak head node (for cpu jobs) or the gpu head node (for gpu jobs), as explained above in the “Cluster Login” section.
PBS TORQUE provides several commands to submit, cancel, or monitor jobs. Here are a few examples:
qsub submit.pbs
submits the submit.pbs script
qstat
shows the status of all jobs
qstat -a
shows the status of all jobs with more information
qstat -u user
shows only the status of user’s jobs
qdel job_id
removes a job from the queue
qsub -I -q big_memory -l nodes=1:ppn=8,walltime=1:00:00
an interactive job on the big_memory partition asking for 8 cores for one hour
qsub -I -q dept_gpu -l nodes=1:ppn=10:gpus=1:gtx1080Ti
an interactive job on the dept_gpu partition asking for 10 cores and one gpu card with the “gtx1080Ti” property
Note: in the last example above, we requested the “gtx1080Ti” property for the gpu card. To see the properties attached to the available gpu cards, use the “gpus.py” command (on the gpu head node).
On the dvorak or gpu head nodes, to run a batch job you need to prepare a submit script using the PBS TORQUE syntax. On this page, we provide a few submit script examples for running jobs on cpu and gpu nodes.
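As a minimal sketch of the PBS TORQUE syntax (the job name, queue, resource request, and program below are illustrative assumptions; adjust them to your own job and check “qstat -Q” for real queue names), a cpu submit script might look like:

```shell
#!/bin/bash
# Sketch of a PBS TORQUE submit script; names and resources are placeholders.
#PBS -N example_job            # job name (hypothetical)
#PBS -q big_memory             # queue/partition; list real ones with "qstat -Q"
#PBS -l nodes=1:ppn=4          # one node, 4 cores
#PBS -l walltime=01:00:00      # one-hour wall-clock limit
#PBS -j oe                     # merge stdout and stderr into one file

cd "$PBS_O_WORKDIR"            # PBS starts jobs in $HOME; return to submit dir
module load anaconda/3         # load whatever environment the job needs
python my_script.py            # my_script.py is a placeholder program
```

You would submit this with “qsub submit.pbs” and monitor it with “qstat -u user”.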
How to use SLURM
In order to use Slurm, you need to log in to the Slurm head node first, as explained above in the “Cluster Login” section.
sbatch submit.sh
submits submit.sh to the queue
squeue -u user
shows the user’s jobs status
squeue -l
shows jobs status with more info
scancel job_id
deletes a job
scontrol show job job_id
shows detailed info about a job
scontrol hold job_id
holds a job
scontrol release job_id
releases a job (from being held)
salloc -p dept_24 --mem=24000MB --ntasks-per-node=10 srun --pty /bin/bash -i
requests an interactive job on dept_24 partition with memory requirement of 24GB and 10 cores
salloc -p dept_gpu --gres=gpu:1 --ntasks-per-node=4 srun --pty /bin/bash -i
requests an interactive job on dept_gpu partition with one gpu card and 4 cores
Slurm has an option called “Feature”, which assigns one or more flags to a compute node. You can request a feature in your submit script using the “--constraint” option. For example, if one or more nodes have a feature called “24C”, you can use “--constraint=24C” in your script so that the job runs on one of those nodes. You can also use boolean expressions: to run your job on a node having either the 8C or 24C feature, use “--constraint=8C|24C”; to run it on a node having both features, use “--constraint=8C&24C”. To find the available features, use the “snodes” command (10th column).
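A submit script using such a constraint could look like the following sketch (the partition, resources, and command are assumptions for illustration; check “sinfo” and “snodes” for real names):

```shell
#!/bin/bash
# Sketch of a Slurm submit script using a node feature constraint.
#SBATCH --partition=dept_24       # partition is an assumption; see "sinfo"
#SBATCH --ntasks-per-node=2       # two cores
#SBATCH --time=00:05:00           # five-minute wall-clock limit
#SBATCH --constraint="8C|24C"     # run only on a node with feature 8C or 24C

hostname                          # report which node satisfied the constraint
```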
SSH to Node
When you submit a job and Slurm assigns a node to run it, you can ssh to that node and monitor your job. For instance, “ssh <USERNAME>@n001” connects you to node n001.
On this page, we provide two Slurm scripts: the first shows how to run a stress test on a dept_24 node using 24 cores for 120 seconds, and the second demonstrates how to run an array job of dimension four on dept_24 nodes using two cores per task for 120 seconds. Each line of code has a line of comment.
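As a sketch of the second (array) case, assuming hypothetical per-task input files named input_<N>.dat, an array submit script might look like:

```shell
#!/bin/bash
# Sketch of a Slurm array job; job name and input-file names are assumptions.
#SBATCH --job-name=array_sketch   # job name (hypothetical)
#SBATCH --partition=dept_24       # partition taken from the text above
#SBATCH --ntasks-per-node=2       # two cores per array task
#SBATCH --time=00:02:00           # 120-second wall-clock limit
#SBATCH --array=1-4               # array of dimension four: task IDs 1..4

# Slurm sets SLURM_ARRAY_TASK_ID to a different value in each task;
# a common pattern is to map it to a per-task input file.
INPUT="input_${SLURM_ARRAY_TASK_ID}.dat"
echo "task ${SLURM_ARRAY_TASK_ID} on $(hostname) using ${INPUT}"
```

Each of the four tasks is scheduled independently and selects its own input via the task ID.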