Batch Processing on the Calclabs
Last update: 08Aug2019
Long-running batch jobs may be submitted from any of the Calclab login servers, calclabnx.math.tamu.edu. You must request a Calclab account (help //AT// math.tamu.edu) before running jobs there. These jobs will be scheduled on available desktops in the Calclabs, where "availability" is defined as a block of time greater than 3 hours during which no class is scheduled in a particular room. We use SLURM for batch scheduling. You may log in to calclabnx using SSH or X2Go.

There should be NO processing on the login servers; they are for edit/compile/debug, job submission, and data collection purposes only. Any long-running process on the login servers will be terminated without warning, and your account will be closed if the abuse continues.
All jobs must be submitted from the /data directory. First create a scratch directory for yourself:

mkdir /data/scratch/$USER

(The $USER variable expands to your username.)
Copy your programs, input files, and batch scripts to that directory before submitting your jobs. Note that files under /data are not backed up, so you are responsible for copying important data elsewhere.
Your home directory is not accessible from the batch jobs due to
security policies that are in place on the Calclab desktops.
Please consider checkpointing your application appropriately as nodes may go down during the running of your job.
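Exactly how to checkpoint depends on your application, but a minimal shell sketch of the restart logic might look like the following. The state.ckpt file name and the commented-out myprog flags are illustrative, not part of any Calclab convention:

```shell
#!/bin/bash
# Hypothetical restart sketch: resume from a saved state file if one exists,
# otherwise start from step 0. Your application must know how to write and
# read its own checkpoint format.
CKPT="state.ckpt"
if [ -f "$CKPT" ]; then
    START=$(cat "$CKPT")
else
    START=0
fi
echo "resuming from step ${START}"
# ./myprog --start "$START" --checkpoint "$CKPT"   # real work would go here
```

If a node goes down mid-run, resubmitting the same script then repeats only the work since the last checkpoint.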
The SLURM partitions are configured as follows:
Partition | Time limit | Soft memory limit
---|---|---
night | 9 hours | 1.9 GB/job
weekend | 42 hours | 1.9 GB/job
Please check this page for updates to the queue configuration.
Serial Job Example
Jobs are submitted using the sbatch command.
Example (serial job):
sbatch -p night myserjob.slrm
The contents of the myserjob.slrm file for a serial (single-node) application may look something like this:

#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH -p night

cd /data/scratch/$USER/mysubdir
./myprog -j 1 -f outfile.lis << EOT
2 45.5
62
infile.txt
14
EOT
exit 0
In the above example, the #SBATCH directives are interpreted by the sbatch command and do not need to be specified on the command line. Here we set a walltime limit of 2 hours and use the night queue. When the job starts, we change to the /data/scratch/$USER/mysubdir subdirectory and execute myprog there with the command-line arguments -j 1 -f outfile.lis. The lines between the two EOT tags contain the input that myprog would normally read when executed interactively.
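The EOT construct is an ordinary shell "heredoc": everything between the markers is fed to the program's standard input. A quick way to see the mechanism, using cat in place of myprog:

```shell
# The lines between the EOT markers become stdin for the command;
# cat simply echoes them back, so you can see exactly what a program
# in its place would read.
cat << EOT
2 45.5
62
EOT
```

This prints the two input lines verbatim, just as myprog would receive them in the batch script above.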
Job Status
You can see the status of your job using the squeue command.
Cancel a Job
You can cancel your pending jobs ('PD' status from squeue) with scancel jobID, where jobID is found in the squeue output. You can send Unix signal 15 (SIGTERM) to a running job ('R' status) with scancel -s 15 jobID.
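If your program can act on signal 15, your batch script can catch it and save state before exiting. A minimal sketch; the handler name and cleanup action are illustrative:

```shell
#!/bin/bash
# Trap signal 15 (SIGTERM, which is what `scancel -s 15 jobID` delivers)
# so the job can do last-minute cleanup instead of dying immediately.
on_term() {
    echo "caught SIGTERM, writing checkpoint before exit"
    # checkpoint/cleanup commands would go here
    exit 1
}
trap on_term TERM

echo "job running; work would go here"
# long-running work; if scancel signals us, on_term runs first
```

Note that the shell only runs the trap between commands, so a handler cannot interrupt a single long-running binary; the binary itself receives the signal and must handle it.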
Matlab Job Example
Matlab is only available to Texas A&M University faculty, staff, and students. This example is similar to the one above. We will be running a serial job in which we call Matlab.
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH -p night

matlab -nojvm < /data/scratch/$USER/myfile.m > /data/scratch/$USER/matlab.out
exit 0
We start matlab with the -nojvm option to prevent loading the Java VM. The input for matlab is read from /data/scratch/$USER/myfile.m and output is stored in /data/scratch/$USER/matlab.out.
Job Arrays
A job array is created using the --array=<indices> option to sbatch. The indices can be a range, such as 0-7, or a comma-separated list of distinct values, such as 0,2,4,6,8. The stdout and stderr file names can use SLURM placeholders, e.g. -o jarray-%A-%a.out, where %A is the array's job ID and %a is the task ID within the array.
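A common use of job arrays is to give each task its own input file by keying off SLURM_ARRAY_TASK_ID, which SLURM sets in each task's environment. A sketch, with an illustrative input-N.txt naming scheme and a hypothetical myprog:

```shell
#!/bin/bash
# Each array task picks the input file matching its own index.
# SLURM sets SLURM_ARRAY_TASK_ID per task; we default to 0 so the
# script can also be tried outside the batch system.
TASK=${SLURM_ARRAY_TASK_ID:-0}
INFILE="input-${TASK}.txt"
echo "task ${TASK} will read ${INFILE}"
# ./myprog -f "$INFILE"   # real work would go here
```

Submitted with --array=0-7, this runs eight tasks, each reading a different input-N.txt.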
Consider an example where a 3-task job is submitted with sbatch --array=0-2. If the SLURM job ID assigned to it is 151, then each task would have SLURM_ARRAY_JOB_ID set to 151 and SLURM_ARRAY_TASK_ID set to its own index (0, 1, or 2). The following script, jarray.slrm, prints these variables:
#!/bin/bash
#SBATCH -J jarray
#SBATCH -p night
#SBATCH -o jarray-%A-%a.out
#SBATCH --time=00:01:00
#SBATCH -N1
#SBATCH --ntasks-per-core=1
#SBATCH --mem-per-cpu=100

echo "starting at `date` on `hostname`"
echo "SLURM_JOBID=$SLURM_JOBID"
echo "SLURM_ARRAY_JOB_ID=$SLURM_ARRAY_JOB_ID"
echo "SLURM_ARRAY_TASK_ID=$SLURM_ARRAY_TASK_ID"
echo "srun -l /bin/hostname"
srun -l /bin/hostname
sleep 30
echo "ended at `date` on `hostname`"
exit 0
Here's the command sequence for submitting the job and examining its output:
$ sbatch --array=0-2 jarray.slrm
Submitted batch job 44
$ squeue
 JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
  44_0     night jarray s-johnso  R  0:02     1 bloc122-00
  44_1     night jarray s-johnso  R  0:02     1 bloc122-00
  44_2     night jarray s-johnso  R  0:02     1 bloc122-00
$ squeue    # no output, job is complete
 JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
$ ls jarray-*
jarray-44-0.out  jarray-44-1.out  jarray-44-2.out
$ cat jarray-44-1.out
starting at Tue Jan 7 14:13:57 CST 2014 on bloc122-00
SLURM_JOBID=45
SLURM_ARRAY_JOB_ID=44
SLURM_ARRAY_TASK_ID=1
srun -l /bin/hostname
0: bloc122-00
ended at Tue Jan 7 14:14:27 CST 2014 on bloc122-00
To cancel one or more (or all) elements of a job array:
# Submit job array and check queue
$ sbatch --array=0-7 jarray.slrm
$ squeue
 JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
  47_0     night jarray s-johnso  R  0:02     1 bloc122-00
  47_1     night jarray s-johnso  R  0:02     1 bloc122-00
  47_2     night jarray s-johnso  R  0:02     1 bloc122-00
  47_3     night jarray s-johnso  R  0:02     1 bloc122-00
  47_4     night jarray s-johnso  R  0:02     1 bloc122-00
  47_5     night jarray s-johnso  R  0:02     1 bloc122-00
  47_6     night jarray s-johnso  R  0:02     1 bloc122-00
  47_7     night jarray s-johnso  R  0:02     1 bloc122-00
# Cancel elements 5,6,7
$ scancel 47_[5-7]
# See that they're gone
$ squeue
 JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
  47_0     night jarray s-johnso  R  0:16     1 bloc122-00
  47_1     night jarray s-johnso  R  0:16     1 bloc122-00
  47_2     night jarray s-johnso  R  0:16     1 bloc122-00
  47_3     night jarray s-johnso  R  0:16     1 bloc122-00
  47_4     night jarray s-johnso  R  0:16     1 bloc122-00
# Cancel the entire remaining array
$ scancel 47
# All gone!
$ squeue
 JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
Parallel Job Example
For a parallel job using MPI for message passing, the script myparajob.slrm would issue the mpirun command as follows:
#!/bin/bash
#SBATCH -N 8
#SBATCH --mem=4000
#SBATCH --time=00:10:00
#SBATCH -p weekend

cd /data/scratch/$USER/mysubdir
mpirun -np 8 ./myparaprog -j 1 -f outfile.list << EOT
6.6
infile.txt
-1 20
qfile.out
EOT
exit 0
This job is submitted with:
sbatch myparajob.slrm
In this example, we set a walltime limit of 10 minutes and request that the job be started from the weekend queue on 8 nodes. When SLURM starts your job, it allocates the nodes to be used.
Node Status
You can view the status of the cluster on the Status page. This page is only available on campus and via TAMU VPN.
Appropriate Use
The system administrators reserve the right to monitor all processes for appropriate use of the resources. Appropriate use is defined as legitimate academic work appropriate for Texas A&M University. Cracking MD5, mining bitcoins, or running apps such as Folding@Home are not considered legitimate. Abuse of the batch system will result in account termination.