Batch Processing on the Calclabs

Last update: 08Aug2019

Long-running batch jobs may be submitted from any of the Calclab login servers, calclabnx.math.tamu.edu. You must request a Calclab account (help //AT// math.tamu.edu) before running jobs there. These jobs will be scheduled on available desktops in the Calclabs. "Availability" is defined as a block of time greater than 3 hours for which no class is scheduled in a particular room. We use SLURM for batch scheduling. You may log in to calclabnx using SSH or X2Go. There should be NO processing on the login servers; they are for editing, compiling, debugging, job submission, and data collection only. Any long-running process on the login servers will be terminated without warning, and your account will be closed if the abuse continues.

All jobs must be submitted from the /data directory. First, create a scratch directory for yourself:

mkdir /data/scratch/$USER

The $USER variable above expands to your username. Copy your programs, input files, and batch scripts to that directory before submitting your jobs. Note that files under /data are not backed up, so you are responsible for copying important data elsewhere. Your home directory is not accessible from batch jobs because of the security policies in place on the Calclab desktops.
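
For example, staging a program, its input file, and a batch script into the scratch directory might look like this (the file names below are just placeholders for your own):

mkdir -p /data/scratch/$USER/mysubdir
cp myprog infile.txt myserjob.slrm /data/scratch/$USER/mysubdir/
cd /data/scratch/$USER/mysubdir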

Please consider checkpointing your application appropriately, as nodes may go down while your job is running.
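
How you checkpoint is specific to your application. As a rough sketch only, assuming a program that periodically writes a state file and accepts a hypothetical --restart option, the batch script could resume from the last checkpoint if one exists:

cd /data/scratch/$USER/mysubdir
if [ -f checkpoint.dat ]; then
    # Resume from the previous state file (checkpoint.dat and --restart are hypothetical)
    ./myprog --restart checkpoint.dat
else
    # No checkpoint yet; start from the beginning
    ./myprog
fi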

The SLURM partitions are configured as follows:

Partition    Time limit    Soft memory limit
night        9 hours       1.9GB/job
weekend      42 hours      1.9GB/job

Please check this page for updates to the queue configuration.

Serial Job Example

Jobs are submitted using the sbatch command.
Example (serial job):
sbatch -p night myserjob.slrm

The contents of the myserjob.slrm file for a serial (single node) application may look something like this:

#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH -p night
cd /data/scratch/$USER/mysubdir
./myprog -j 1 -f outfile.lis << EOT
2
45.5 62
infile.txt
14
EOT
exit 0

In the above example, the #SBATCH directives are interpreted by the sbatch command, so those options do not need to be specified on the command line. Here we set a walltime limit of 2 hours and use the night queue. When the job starts, we change to the /data/scratch/$USER/mysubdir subdirectory and execute myprog there with the command-line arguments -j 1 -f outfile.lis. The lines between the two EOT markers contain the input that myprog would normally read from the terminal when run interactively.
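
By default, sbatch writes the job's standard output and error to a file named slurm-<jobid>.out in the directory from which the job was submitted. You can choose a different name or location with the -o option, for example (the path below is only an illustration; %j expands to the job ID):

#SBATCH -o /data/scratch/$USER/myserjob-%j.out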

Job Status

You can see the status of your job using the squeue command.
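
For example, to list only your own jobs (the ST column shows PD for a pending job and R for a running one):

squeue -u $USER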

Cancel a Job

You can cancel a queued job (PD status in the squeue output) with scancel jobID, where jobID is the job ID reported by squeue. You can send Unix signal 15 (SIGTERM) to a running job (R status) with scancel -s 15 jobID.
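
For example, using a hypothetical job ID of 12345:

# Remove job 12345 from the queue (or kill it if it is already running)
scancel 12345
# Send signal 15 (SIGTERM) to the running job instead of killing it outright
scancel -s 15 12345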

Matlab Job Example

Matlab is only available to Texas A&M University faculty, staff, and students. This example is similar to the one above. We will be running a serial job in which we call Matlab.

#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH -p night

matlab -nojvm < /data/scratch/$USER/myfile.m > /data/scratch/$USER/matlab.out

exit 0

We start matlab with the -nojvm option to prevent loading the Java VM. The input for matlab is read from /data/scratch/$USER/myfile.m and output is stored in /data/scratch/$USER/matlab.out.

Job Arrays

A job array is created using the --array=<indices> option to sbatch. The indices can be a range such as 0-7, or a comma-separated list of values such as 0,2,4,6,8. The stdout and stderr file names can use SLURM filename patterns and can be specified as, for example, -o jarray-%A-%a.out, where %A expands to the array job ID and %a to the task ID within the array.

Consider the example below, in which a 3-task array is submitted with sbatch --array=0-2. Each task runs its own copy of the script and receives its own SLURM_JOB_ID, while the SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID environment variables identify the array and the task within it. The jarray.slrm script below prints these variables and the hostname of the node where each task runs:

#!/bin/bash
#SBATCH -J jarray
#SBATCH -p night
#SBATCH -o jarray-%A-%a.out
#SBATCH --time=00:01:00
#SBATCH -N1
#SBATCH --ntasks-per-core=1
#SBATCH --mem-per-cpu=100

echo "starting at `date` on `hostname`"

echo "SLURM_JOBID=$SLURM_JOBID"
echo "SLURM_ARRAY_JOB_ID=$SLURM_ARRAY_JOB_ID"
echo "SLURM_ARRAH_TASK_ID=$SLURM_ARRAY_TASK_ID"

echo "srun -l /bin/hostname"
srun -l /bin/hostname
sleep 30
echo "ended at `date` on `hostname`"
exit 0

Here is the command sequence for submitting the job and checking its status:

$ sbatch --array=0-2 jarray.slrm
Submitted batch job 44
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              44_0     night   jarray s-johnso  R       0:02      1 bloc122-00
              44_1     night   jarray s-johnso  R       0:02      1 bloc122-00
              44_2     night   jarray s-johnso  R       0:02      1 bloc122-00
$ squeue   # after the job completes, only the header line remains
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ ls jarray-*
jarray-44-0.out  jarray-44-1.out  jarray-44-2.out
$ cat jarray-44-1.out
starting at Tue Jan  7 14:13:57 CST 2014 on bloc122-00
SLURM_JOBID=45
SLURM_ARRAY_JOB_ID=44
SLURM_ARRAY_TASK_ID=1
srun -l /bin/hostname
0: bloc122-00
ended at Tue Jan  7 14:14:27 CST 2014 on bloc122-00
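
A common use of the task ID is to give each array task its own input. A minimal sketch, assuming per-task input files named input-0.txt, input-1.txt, and input-2.txt (these names are hypothetical):

#!/bin/bash
#SBATCH -p night
#SBATCH --array=0-2
#SBATCH -o jarray-%A-%a.out
#SBATCH --time=00:30:00
cd /data/scratch/$USER/mysubdir
# Each task reads the input file matching its array index
./myprog < "input-${SLURM_ARRAY_TASK_ID}.txt"
exit 0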

To cancel one or more (or all) elements of a job array:

# Submit job array and check queue
$ sbatch --array=0-7 jarray.slrm
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              47_0     night   jarray s-johnso  R       0:02      1 bloc122-00
              47_1     night   jarray s-johnso  R       0:02      1 bloc122-00
              47_2     night   jarray s-johnso  R       0:02      1 bloc122-00
              47_3     night   jarray s-johnso  R       0:02      1 bloc122-00
              47_4     night   jarray s-johnso  R       0:02      1 bloc122-00
              47_5     night   jarray s-johnso  R       0:02      1 bloc122-00
              47_6     night   jarray s-johnso  R       0:02      1 bloc122-00
              47_7     night   jarray s-johnso  R       0:02      1 bloc122-00
# Cancel elements 5,6,7
$ scancel 47_[5-7]
# See that they're gone
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              47_0     night   jarray s-johnso  R       0:16      1 bloc122-00
              47_1     night   jarray s-johnso  R       0:16      1 bloc122-00
              47_2     night   jarray s-johnso  R       0:16      1 bloc122-00
              47_3     night   jarray s-johnso  R       0:16      1 bloc122-00
              47_4     night   jarray s-johnso  R       0:16      1 bloc122-00
# Cancel the entire remaining array
$ scancel 47
# All gone!
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Parallel Job Example

For a parallel job that uses MPI for message passing, the script myparajob.slrm would issue the mpirun command as follows:

#!/bin/bash
#SBATCH -N 8
#SBATCH --mem=4000
#SBATCH --time=00:10:00
#SBATCH -p weekend
cd /data/scratch/$USER/mysubdir
mpirun -np 8 ./myparaprog -j 1 -f outfile.list << EOT
6.6
infile.txt
-1 20
qfile.out
EOT
exit 0

This job is submitted with:
sbatch myparajob.slrm

In this example, we set a walltime limit of 10 minutes and request that the job run in the weekend queue on 8 nodes. When SLURM starts your job, it allocates the nodes to be used.
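
Rather than hard-coding the rank count in the mpirun command, you can read the node count that SLURM exports to the batch script. A minimal sketch, assuming one MPI rank per allocated node:

# SLURM_JOB_NUM_NODES holds the number of nodes allocated to this job
mpirun -np "$SLURM_JOB_NUM_NODES" ./myparaprog -j 1 -f outfile.list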

Node Status

You can view the status of the cluster on the Status page. This page is only available on campus and via TAMU VPN.

Appropriate Use

The system administrators reserve the right to monitor all processes for appropriate use of the resources. Appropriate use is defined as legitimate academic work appropriate for Texas A&M University. Cracking MD5, mining bitcoins, or running apps such as Folding@Home are not considered legitimate. Abuse of the batch system will result in account termination.