Guidance on using the CQLS queuing system
The CQLS queuing system
Queuing systems allow for multiple users to make use of multiple compute nodes
across an infrastructure. At the CQLS, we log in to the
shell.cqls.oregonstate.edu machine, currently called
users can submit batch or array jobs (1 or many tasks, respectively) to the,
queuing system such that they run autonomously, and will continue running even
when users log off.
Please see this previous post with some information regarding the CQLS infrastructure along with some of the commands that are a part of the SGE queuing system.
How does one use the queuing system?
The CQLS is currently using the Son of Grid Engine (SGE) queuing system. To
facilitate the use of the queuing system, the CQLS wrote and maintains two
helper scripts for job submission:
SGE_Batch is designed to support a job submission with individual tasks (one
can submit array jobs using
SGE_Batch, but we recommend using
SGE_Array is designed for submitting array jobs including multiple
These wrapper scripts write a shell script to the filesystem and submit a job to
the queuing system for you according to the options you requested. In the
background, the scripts call the
qsub command for you with the shell script.
To see what resources are available to you, use the
SGE_Avail script. To
monitor currently running jobs, use
qstat, and to kill jobs, use
the above post linked above for more information regarding these underlying
To check out a node interactively, use the
qrsh command. You can specify a
-q QUEUE where
QUEUE is the name of the queue you’d like to check out a
node from. You can check out multiple processors on a machine using the
qrsh -pe thread N option, where N is the number of processors you need.
To land on a specific node on a queue, use the
-q QUEUE@node syntax, where
QUEUE is e.g.
bpp and node is e.g.
symbiosis. This syntax can be used
qrsh and the
SGE_Batch helper scripts.
Please make sure to exit the node when you are finished so that the queue can reclaim the processors for others to use.
We’ve established what the queuing system is and what commands one can use to interact with the queuing system. Let’s put these scripts to use in some hypothetical scenarios that may be related to your own research.
Naming your jobs (run-name)
You must specify a run-name (
-r flag) for each job that you submit. I suggest
using a syntax whereby each job submitted to SGE has a run-name prefix with
sge.; while using the infrastructure for 10+ years, I have had to find
different ways to deal with having these run directories in my various
workflows. What I found is that having the directories always prefixed with
sge. allows me to identify them even in a directory full of other input and
Additionally, if one uses sample names as part of the run-names, one might find
that there are errors whenever a sample name starts with a number. This is
because numbers are disallowed in the run-names by SGE. Therefore, prefixing all
sge. will alleviate this issue.
I also generally suggest including the program that one is running in the run-name.
Therefore, a standard syntax that one might adopt is
This will also help identify what jobs are running more easily when one runs
qstat, although the names sometimes do get cut-off in that output.
You can run the
qstat-long command to get the full names for jobs.
When you submit a job, it might be useful to use the
watch qstat command to
monitor the job progress until your job enters the
Matching command settings and SGE settings
SGE does not and cannot know what settings you applied to your commands in terms
of CPU and/or memory utilization. Therefore, whenever you are submitting a
command to the queue, it is a good idea to set your number of CPUs or threads
(often using the
-t option, this will be program dependent) and
-P option of
SGE_Array. Setting a higher value of
will not make your program automatically use more threads. Furthermore, setting
the number of threads of your program to a higher value without the corresponding
-P will cause the server to be overloaded.
Standard output and standard error
Your standard output and error will end up in the run-name directory that you
specify on job creation. The STDOUT will be in the files with the
suffix and the STDERR will be in the files with the
Generally, it’s useful to specify your output to go to a file, either using
STDOUT redirection (i.e.
command > output.txt) or using the program specific
command -o output.txt) instead of pulling the output from the
Use the local /data drives on compute nodes
In order to speed up computation and reduce the network congestion, it can be
useful to specify the local
/data drive as input or output for your commands.
If you are running a command from a
/home directory, then both the
reads and writes to disk will happen over the network. With processes that are
especially heavy on input/output (I/O), this can cause slowdowns and disk waits
that will reduce efficiency of both your own processing and others using the
same compute nodes as you.
One common high I/O process is genome assembly using short reads, like the
SPAdes genome assembler. The authors of
this software suggest using local hard drives, like
/data, for processing
using SPAdes. At minimum, it will be useful to specify the output directory on
/data drive. For the most speed-up, you can also copy the input files to
/data drive as well, prior to assembly job submssion. NOTE: You will
have to remember which node you copied the reads over to and the output was
written to so that you can copy the outputs off afterward.
Some programs also allow you to specify a temporary directory to write
intermediates to, and do not always use the
$TMPDIR environment variable as
default, so it can be useful to specify the temp dir as
/data manually when
that option is available.
SGE job states listed using qstat
The common SGE job states are:
|Category||State||SGE Letter Code|
|Pending||pending, queue wait||qw|
|Pending||pending, holding (for other job) queue wait||hqw|
|Error||all pending states with error||Eqw, Ehqw|
|Deleted||all running and suspended states with deletion||dr, dt|
The most commonly seen states will be
qw of queue wait and
r of running.
At times, you might find an
E state (along with other codes) that indicates
some issue with job submission. You can run
qstat -j $JOBID for that job and
figure out why that job has an error (
error and you will find the
Sometimes you can use the
clear_Eqw_job.sh script to clear the errors and the
will continue as submitted, but often you have to kill and resubmit the jobs.
If you find jobs stuck in the
d state, the node you are trying to delete the
job on is likely waiting for disk I/O and is unresponsive. The node may need
to be restarted.
SGE_Avail queue states
|State||SGE Letter Code||Note|
|alarm||a||Too many resources being used (load average too high)|
|unreachable||u||Machine is offline or off network|
|Error||E||Transfer and/or network error most often|
|normal||normal||All is well|
If a queue is in error state, you can run
qstat -explain E to find more
information about the error. If you see
a state, you can run
qstat -explain a (or
qstat -explain aE for both). The
-explain flag can be
-q flag as well to get information about specific queues.
qstat -explain aE | grep -B 1 -e alarm -e ERROR shows all
Individual job submission
There are several different ways to successfully submit a job through
The two most common methods are:
- Include the bare command in quotes as the
- Write the commands in a bash script (e.g.
run.sh) and submit the bash script as the
SGE_Batch -c 'bash ./run.sh' ...)
For one-off commands, I think the utility of the bare
-c command is obvious
because of the ease of use and quick access.
For jobs as part of a larger analysis, it may be useful to include the commands
in a bash script, thus saving a record of what you ran and making it easier to
re-run the commands in the future. In addition, if you are on a node
qrsh, you can run the command in the bash script without
having to retype the command or scroll through your command history.
An additional benefit of using a shell script to store your commands is that it is easier to run commands serially, e.g.
when compared to running several
SGE_Batch -c 'command'... after the previous
Finally, if you need to use a piece of software that is in a conda environment, you can include the commands to activate the environment in the bash script.
In this way, you can record what conda environment needs to be activated for the program to work properly.
Multiple job (task) submission
As with many workflows, you may have multiple files to process through the same
program. This will be the case with any project where you have multiple samples.
Instead of submitting multiple (thousand)
SGE_Batch commands, it can be useful
to organize your job submissions using
SGE_Array allows you to submit a single job that includes mutliple tasks (in
SGE terminology) that will be grouped. This can be especially useful when doing
some sort of large comparative genomics project with thousands of individual
tasks - you can group the tasks in a single job and not take over the entire
infrastructure by using the concurrency flag (
-b; 50 by default). As a general
rule, if you are using a lot of resources per job, you can limit your concurrency
to a smaller number so others can use the infrastructure. Conversely, if your
jobs finish quickly and are unlikely to stay in the queue for a long period
(say, > 24h), then you can increase the concurrency flag to speed up the
Let’s assume you have some predicted proteomes (NCBI .faa format) and you want
to characterize their predicted functions using interproscan. Here is one way
you could run them all through using
Here is our directory structure:
Here is the contents of
You’ll note that this just runs
echo to print the commands to the terminal,
which is a good thing because it allows you to check your command before you
submit it. For example, while writing this tutorial, I neglected to include the
-i $faa at the end of the command such that I did not provide any input files
to the command.
Let’s give the script a try:
So far so good. Now we can submit the jobs using
SGE_Array, which takes STDIN
as the commands as default.
One nice feature of
SGE_Array is that you will get an automatic run-name if you
do not provide a run-name with the
-r flag. The downside is you will have to
command.N files inside the directory to figure out what you ran.
Let’s take a look at the SGE directory:
Each of the
command.N.txt files corresponds to the commands taken in as input
on the commandline when you submit the
SGE_Array command (all of which are
found in the
jobname.eJOBID.N files corresponds to the stdout
and stderr of each task, respectively.
While you can get by with the above commands and protocols, there are other ways that one can use the queuing system as well.
bash scripting for automation
One method for automating the job submission process takes two distinct steps:
- Writing a shell script that includes the program, options, env vars, and activating conda envs
- Writing a shell script to pass all input files through the script using SGE.
Then, to submit the commands to SGE
bash ./submit_bakta.sh | SGE_Array -P 16 -q ...
or, to preserve a record of the commands
or, if you don’t need to run multiple jobs at once, and can run them serially
Using this two step approach is flexible, as you can see.
Snakemake is a job running system that can be configured to submit jobs through SGE. The snakemake system is too large of a topic to cover here, however, here is some information that might get you in the right direction.
default cluster config:
job specific config:
Using the just command runner
I have recently taken to using the just command runner for my analysis, and other, jobs. One difficulty I’ve experienced over the years of bioinformatics work is keeping track of individual commands that I have run. For me, making each and every command as a shell script to keep a record of what I’ve done has not been possible or even desireable.
One of the main tasks that we have to deal with is converting files between
formats. While there are tools that can assist us in this, often we end up using
a pipeline of
paste to join our commands
just doesn’t completely eliminate barriers to writing the commands
in a file, it certainly has the benefit of not cluttering the working directory
with shell scripts. All commands can be contained in a file called
then can be invoked using
just $command, as might be expected when selecting
options from any program. Further, arguments can be provided (and defaults set)
so that you can re-use your
justfiles if desired.
While there is some syntax and language-specific syntax to be learned (mainly,
commands are submitted from the directory the
justfile is in rather than the
pwd), you can use
just similarly to your standard
bash scripts except that
you get a command-line parser built in for you.
Real world example
Rather than re-writing the documentation that you can find by following the link above, I’ll walk you through a scenario from today.
Here is my directory structure (some files removed for readability):
I needed to run a genome hashing algorithm so I could compare some complete and incomplete genomes and later cluster them based on the hash distance.
Here is the justfile I wrote:
You’ll see some parameters at the top, and then three commands, the default, which
just lists the available commands, and then
This is what you see when you run
just on the command line:
The default command is hidden because it is prefixed with and underscore in the
I was then able to submit the
run-dist command in this way:
SGE_Batch -c 'just run-dist' -q bact -P 16 -r sge.run-dashing-dist
Since I was unfamiliar with the outputs, I used
qrsh to check out a node and
navigate to the directory where the outputs were. I realized I needed to modify
the output files, and instead of running the commands individually, I wrote
convert command above. Then, I could just run the command using
just convert from the commandline. Since I had a node checked out, I did not
have to worry about taking up too many resources (although the convert command
could easily run on vaughan as well).
Additionally, I was able to re-run the
run-dist command with several
settings and examine the outputs just by changing the options in the
One could extend this idea using the
bash scripting for automation section above,
and have a
run-command recipe that has the commands necesary for running the
command, and a
submit-command recipe that echos the command so it’s easy to
SGE; I’ve done this in the past and it has been helpful. You can
run recipes while in other recipes, making this set up even easier.