Wildwood HPC Cluster
The hpcman software was developed, in part, to help ease migration to the new Wildwood HPC infrastructure at the CQLS.
The SGE queuing system is presently in use on the CQLS research and teaching infrastructures. SGE will remain available on the new 'Wildwood' HPC infrastructure, but we are encouraging labs to move their compute nodes over to the Slurm queuing system. SGE will primarily be intended for legacy workflows that are too costly or difficult to migrate to Slurm.
To facilitate this migration, the hpcman queue set of commands was developed to replace the SGE_Batch, SGE_Array, and SGE_Avail commands. Both SGE and Slurm jobs can be submitted and monitored using the hpcman queue commands.
Submitting jobs
By default, jobs are sent to Slurm partitions on the 'Wildwood' HPC cluster when hpcman queue submit (alias: hqsub) is used. The target queuing system can be changed with the --queuetype flag. See the user guide or the command-line help (hqsub -h) for more information.
Under the hood, the hqsub software generates an SGE- or Slurm-compatible job script and then submits that script using qsub or sbatch, respectively.
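As a rough sketch of what this looks like in practice: the exact invocation is best checked with hqsub -h, and the quoted job command and the values accepted by --queuetype below are illustrative assumptions, not the confirmed interface.

    # Illustrative only -- see `hqsub -h` for the actual options and values.
    # Submit a job; by default it is sent to a Slurm partition on Wildwood:
    hqsub "blastn -query seqs.fa -db nt -out results.txt"

    # Toggle the queuing system with --queuetype (accepted values are
    # listed in the help output):
    hqsub --queuetype sge "blastn -query seqs.fa -db nt -out results.txt"

    # Under the hood, hqsub writes a job script and hands it to the native
    # scheduler, roughly equivalent to:
    sbatch generated_job_script.sh   # Slurm
    qsub generated_job_script.sh     # SGE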
Monitoring jobs
Jobs from both SGE and Slurm can be monitored using the hpcman queue stat (alias: hqstat) command.
Jobs are queried using qstat or squeue for SGE and Slurm, respectively. Job attributes are then collated into a table and displayed together, so that jobs from both queuing systems can be monitored at the same time.
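If you prefer to query a single scheduler directly, the same information comes from the underlying commands; the -u filter shown here is a standard option for both tools.

    # Query one scheduler at a time (hqstat collates both):
    qstat -u $USER     # SGE jobs for the current user
    squeue -u $USER    # Slurm jobs for the current user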
Finding available resources
Resources for both types of queuing system can be found using the hpcman queue avail (alias: hqavail) command.
Resources for SGE are gathered using the qstat -f command, and resources for Slurm are gathered using the sinfo -Nl command.
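These commands can also be run directly to see the raw output that hqavail summarizes:

    qstat -f     # SGE: full listing of queue instances, slots, and load
    sinfo -Nl    # Slurm: node-oriented, long-format partition and node status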
Stopping jobs
SGE jobs are killed using the qdel $JOBID command, while Slurm jobs are stopped using the scancel $JOBID command.
These features have not yet been integrated into the hpcman queue software, but they are on the roadmap.
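Until then, jobs can be stopped manually with the native scheduler commands; the job ID below is illustrative only:

    qdel 123456      # stop an SGE job
    scancel 123456   # stop a Slurm job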