Using the Cluster

Contents

    1. Cluster Login
    2. Partitions (Queues) and Compute Nodes
    3. How to use Environment Module
    4. How to use PBS TORQUE
      1. PBS TORQUE Examples
    5. How to use SLURM
      1. Slurm Examples
    6. Useful References and Cheatsheets
    7. Queue Information

    Cluster Login

    At the moment, we have three clusters, each with its own head node. To submit jobs to a cluster, you should connect to its head node via the SSH protocol. Microsoft Windows users may use programs such as PuTTY or the free version of MobaXterm, while Linux and macOS users can use a terminal (Windows 10 users can also use the Command Prompt or PowerShell after the April 2018 update). The three head nodes are:

    1. dvorak (part of the CPU cluster).
    2. gpu (the GPU cluster).
    3. slurm (part of the CPU cluster).
    The first two head nodes use PBS TORQUE and the third one uses Slurm as its workload manager. In order to connect to a head node, type ssh <USERNAME>@<HEAD-NODE>.csb.pitt.edu, where <USERNAME> is your cluster userid and <HEAD-NODE> is one of dvorak, gpu, or slurm. For example, a user with the userid ‘abc123’ connecting to the Slurm head node would use: ssh abc123@slurm.csb.pitt.edu.

    Partitions (Queues) and Compute Nodes

    As mentioned above, we have three clusters. This is how to get information about the partitions and compute nodes on each of them:

    1. On the dvorak head node, you can use the “cpus.py” command to see all available partitions and compute nodes, together with information about the jobs running on them. This command also shows the amount of memory and the number of CPUs on each compute node. You may also use “qstat -Q” to see the partitions and the number of submitted and currently running jobs in each partition.
    2. On the gpu head node, you can use the “gpus.py” command to see a list of available compute nodes, together with useful information about the GPU cards inside them. You can also use both the “cpus.py” and “qstat -Q” commands mentioned above.
    3. On the slurm head node, you can use the “sinfo” or “snodes” commands to see all partitions and available compute nodes.

    How to use Environment Module

    The module package provides a dynamic environment for a user: loading or unloading a module adds or removes the corresponding environment (variable) settings on the fly. The following examples show how to use module (a typical session is sketched after the list):

      1. module avail
        shows the available modules
      2. module load anaconda/3
        loads anaconda version 3
      3. module unload anaconda/3
        unloads anaconda
      4. module list
        lists the loaded modules
      5. module purge
        unloads all the loaded modules
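
    As an illustration only, a typical session on a head node might look like the following. The anaconda/3 module name is taken from the examples above; the exact module names on each cluster are whatever “module avail” reports, and the final command simply assumes the loaded module provides a python executable:

      module purge              # start from a clean environment
      module avail              # list the modules available on this head node
      module load anaconda/3    # load Anaconda (Python 3)
      module list               # confirm which modules are currently loaded
      python --version          # check the python provided by the loaded module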

    How to use PBS TORQUE

    In order to use PBS TORQUE, you need to ssh to either the dvorak head node (for CPU jobs) or the gpu head node (for GPU jobs), as explained above in the “Cluster Login” section.

    PBS TORQUE provides several commands to submit, cancel, or monitor jobs. These are a few examples:
    1. qsub submit.pbs
      submits the submit.pbs script
    2. qstat
      shows the status of all jobs

      1. qstat -nt1
        shows the status of all jobs with more information
      2. qstat -u user
        shows only the status of user’s jobs
    3. qdel <JOBID>
      removes a job from the queue
    4. qsub -I -q big_memory -l nodes=1:ppn=8,walltime=1:00:00
      an interactive job on the big_memory partition asking for 8 cores for one hour
    5. qsub -I -q dept_gpu -l nodes=1:ppn=10:gpus=1:gtx1080Ti
      an interactive job on the dept_gpu partition asking for 10 cores and one GPU card with the “gtx1080Ti” property.

Note: in example 5 above, we used the “gtx1080Ti” property of the requested GPU card. To see the properties attached to the available GPU cards, you can use the “gpus.py” command (on the gpu head node).

On the dvorak or gpu head nodes, in order to run a batch job you need to prepare a submit script using the PBS TORQUE syntax. In this page, we provide a few submit script examples to run jobs on CPU and GPU nodes; a minimal sketch is also shown below.
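
As a rough sketch only (not one of the linked examples), a CPU batch script might look like the following. The job name, script name, and resource values are placeholders; the big_memory queue is taken from the interactive example above, and the available queues can be listed with “qstat -Q”:

  #!/bin/bash
  #PBS -N my_job                    # job name (placeholder)
  #PBS -q big_memory                # queue/partition; see qstat -Q for the available queues
  #PBS -l nodes=1:ppn=8             # one node with 8 cores (adjust as needed)
  #PBS -l walltime=1:00:00          # one hour of wall time
  #PBS -j oe                        # merge stdout and stderr into a single output file

  cd $PBS_O_WORKDIR                 # start in the directory the job was submitted from
  module load anaconda/3            # load whatever software the job needs
  python my_script.py               # run the actual work (placeholder)

The script would then be submitted from the head node with “qsub submit.pbs”, as in example 1 above.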

How to use SLURM

In order to use Slurm, you need to log in to the Slurm head node first, as explained above in the “Cluster Login” section.

  • To use the Slurm workload manager, you use Slurm commands together with a submit (shell) script written in the Slurm syntax. These are a few examples of Slurm commands:
    1. sbatch submit.sh
      submits submit.sh to the queue
    2. squeue -u user
      shows the user’s jobs status
    3. sjobs
      shows jobs status with more info
    4. scancel job_id
      deletes a job
    5. scontrol show job job_id
      shows detailed info about a job
    6. scontrol hold job_id
      holds a job
    7. scontrol release job_id
      releases a job (from being held)
    8. salloc -p dept_24 --mem=24000MB --ntasks-per-node=10 srun --pty /bin/bash -i
      requests an interactive job on dept_24 partition with memory requirement of 24GB and 10 cores
    9. salloc -p dept_gpu --gres=gpu:1 --ntasks-per-node=4 srun --pty /bin/bash -i
      requests an interactive job on dept_gpu partition with one gpu card and 4 cores
  • Slurm Feature

    Slurm has an option called “Feature” which is used to assign one or more flags to a compute node. You can request a feature in your submit
    script using the “--constraint” option. For example, if one or a series of nodes have a feature called “24C”, you can use “--constraint=24C” in your script so that the job runs on one of those nodes. Note that you can combine features with boolean expressions: to run your job on a node having either the 8C or the 24C feature, use “--constraint=8C|24C”, and to run it on a node having both features, use “--constraint=8C&24C”. To find out which features are available, use the “snodes” command (10th column). A small example is shown below.
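
    As an illustration, a submit script could request a feature with directives like the following; the dept_24 partition and the 24C feature name are taken from the examples above, but the features on your cluster are whatever “snodes” reports:

      #SBATCH --partition=dept_24       # partition to run in
      #SBATCH --constraint=24C          # run only on nodes tagged with the 24C feature

    The same option can also be given on the command line, e.g. sbatch --constraint="8C|24C" submit.sh (quoted so the shell does not interpret the “|”).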

    SSH to Node

    Once you have submitted a job and Slurm has assigned a node to run it, you can ssh to that node to monitor your job. For instance, “ssh <USERNAME>@n001” connects to node n001.

    In this page, we provide two Slurm scripts: the first shows how to run a stress test on a dept_24 node using 24 cores for 120 seconds, and the second demonstrates how to run an array job of dimension four on dept_24 nodes using two cores for 120 seconds. Each line of code has a line of comment. A rough sketch of the first case is shown below.
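
    The following is only a sketch of the stress-test case under stated assumptions: the job name, the output file name, and the use of the stress tool are assumptions (the tool must be available on the node), while the dept_24 partition, the 24 cores, and the 120 seconds come from the description above:

      #!/bin/bash
      #SBATCH --job-name=stress_test        # job name (placeholder)
      #SBATCH --partition=dept_24           # partition named in the description above
      #SBATCH --nodes=1                     # a single node
      #SBATCH --ntasks-per-node=24          # 24 cores on that node
      #SBATCH --output=stress_%j.out        # output file; %j expands to the job id

      module purge                          # start from a clean environment
      stress --cpu 24 --timeout 120         # load 24 CPU workers for 120 seconds

    The script would be submitted with “sbatch submit.sh” and can be monitored with “squeue -u <USERNAME>” while it runs.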

    Useful References & Cheatsheets