The CQLS infrastructure and you
Preface
So you’ve requested an account on the CQLS infrastructure, the account information email is sitting in your inbox, and you don’t know where to start. Well, you’re in the right place!
Get familiar with the CQLS infrastructure policies before you begin your journey.
Ideally, you would have a more senior member of your lab, or maybe another grad
student in your department, to go to for the ins and outs of the infrastructure.
If you don’t, fear not, because this document should have all of the information
and resources you need to become the wise senior lab member yourself!
Either way, if your professor has file space and/or compute nodes on the
infrastructure, you need to have them either send an email to Chris Sullivan
with your name and the username that you signed up for, or submit a support
ticket with that information here: Support Request. That will gain
you access to the biocomputing resources you need.
NOTE: This document draws heavily on the wonderful ‘Primer for Computational Biology’ written by our own Shawn O’Neil. You can find that book here: https://open.oregonstate.education/computationalbiology/
I will point you to the most relevant passages to get you up and running the quickest, but feel free to browse other sections as you find them helpful.
Assumptions
I’m going to assume that 1) you know how to connect to the infrastructure using ssh in some form and that 2) you know how to type commands and navigate a command-line interface. If these are unfair assumptions, you can register for a course/workshop or utilize these other resources:
- Introduction to Unix/Linux and Command-Line Data Analysis taught by Matthew Peterson
- Contact a CQLS trainer or analyst for some one-on-one training
- Join the CQLS-BUG mailing list or the OSU-CQLS-BUG Slack channel and ask for help
- Check out the section about logging in to an infrastructure from the bioinformatics primer
If you are on Windows 10, I would recommend installing the Windows Subsystem for Linux as described here: https://docs.microsoft.com/en-us/windows/wsl/install-win10, with Ubuntu 18.04 LTS as the recommended Linux distribution. The reason for this over, say, PuTTY is that you will be able to install native Linux programs on your own computer, so the experience will mirror that of the infrastructure. You can also practice using the command line and command-line tools on your own computer, and practice coding in Python, Perl, etc., much more easily than in Windows 10 itself. ssh should be installed by default, and you can use the bash shell to get access to it.
Other alternatives are the venerable PuTTY, Terminus, or something like the new Microsoft Terminal with its built-in SSH client, if you don’t want to go all-in on the Linux subsystem.
Connecting to the CQLS infrastructure
When you sign up for an account, you will receive instructions on how to connect to the infrastructure. Do NOT lose these instructions, as they contain key pieces of information that you will need to successfully log in in the future. Follow the instructions to connect and change your password. If you aren’t familiar with the command-line at all, take some time to familiarize yourself with this information from the primer -> the command line and filesystem
Here are some good command-line tips & guidelines: https://github.com/jlevy/the-art-of-command-line
Here is a site for novice bioinformatics users: https://astrobiomike.github.io/unix/
You will log in to the shell.cqls.oregonstate.edu machine, which is also known as vaughan.
This machine is named after Dorothy Johnson Vaughan (September 20, 1910 – November 10, 2008) who was an American mathematician and human computer who worked for the National Advisory Committee for Aeronautics (NACA), and NASA, at Langley Research Center in Hampton, Virginia. In 1949, she became acting supervisor of the West Area Computers, the first African-American woman to supervise a group of staff at the center.
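To connect, for example, from a terminal on your own computer (a minimal sketch; replace username with your CQLS username, and note that your account email may include additional required options):

```
# connect to the login machine (vaughan); extra options from your account
# instructions (e.g. a non-standard port) may also be required
ssh username@shell.cqls.oregonstate.edu
```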
On login, you will be in your home directory on the infrastructure.
Your home directory, along with the lab and department directories described below, lives on a networked volume, and each of the networked volumes is stored on a separate file server.
You have 25G of space in your home directory, which is not very much once you start generating raw sequencing data or downloading a lot of data from NCBI. As mentioned above, if your professor or department has file space purchased already, you should have had them submit a request to get you access to those resources as well.
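If you want to check how much of that 25G you are currently using, a quick sketch:

```
# total size of your home directory
du -sh ~

# the largest items within it, sorted smallest to largest
du -sh ~/* | sort -h
```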
Generally, these lab or department directories will live under /nfs and follow a pattern like /nfs/DEPT/Lab, where DEPT is your department or center and Lab is your lab’s name (the same pattern used in the examples later in this document).
Each lab space will be managed differently by the members of the lab. In general, each lab member will have a designated folder in their lab space where they put their files, programs, and projects.
In order to help you manage your projects, we have a program called git
installed that can save versions of your scripts as you write them. You can back
up the repositories you make with git to our GitLab instance at
https://gitlab.cgrb.oregonstate.edu. This will help if you accidentally change
a file and need to recover it, or if you want to work on the same project from
different computers. NOTE: git is not a complete replacement for real,
dedicated backups; it does not handle very large files well and struggles with
directories containing many small files. See the
version control with git
section of the primer for more information.
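For example, a minimal sketch of the day-to-day workflow (the project name, file name, and GitLab path are placeholders; you would first create the repository through the GitLab web interface):

```
# start tracking a project directory
cd ~/my_scripts
git init
git add analysis.sh
git commit -m "First working version of the analysis script"

# connect it to a repository created on the CGRB GitLab and push
# (the remote path below is hypothetical)
git remote add origin git@gitlab.cgrb.oregonstate.edu:username/my_scripts.git
git push -u origin master
```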
Contact Matthew Peterson about getting backups for your lab space running. We only charge for the space for backups; there is no extra cost to maintain or recover backups!
Making your navigation easier
There are several ways to get you into your lab’s directory from your personal home directory.
The three most common are:
- Setting a shell variable to the path of your lab directory
home=/path/to/lab/home
- usage:
cd $home
- Making an alias that changes your directory to the lab directory
alias home='cd /path/to/lab/home'
- usage:
home
- Linking the lab directory in your home, and then using that path along with
the tilde (~) shortcut to get home faster.
cd; ln -s /path/to/lab/home ./home
- usage:
cd ~/home
Using one of these shortcuts will save a lot of time and energy over the long haul. For the record, I use the first method explained above in my daily routine, and it works best for me. Since it’s a variable, I can use the shortcut for things other than just changing my directory; it’s available to any command that I type.
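For example, with the variable approach (bash syntax; the lab path is a placeholder):

```
# set once (add this line to ~/.bashrc to make it permanent)
home=/nfs/DEPT/Lab/user

# then use it with any command, not just cd
cd $home
ls $home/projects
cp results.tab $home/projects/
```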
Where are my programs?
For the vast majority of commonly used programs, we will have a version
installed. Look at
this chapter
of the primer to learn more about permissions and the $PATH
variable.
So, most programs will already be available to you when you log in, without
you needing to do anything. Pretty neat!
Additionally, even if you cannot find something in your $PATH, there are
other locations where the program could be installed.
The first is /local/cluster/bin. This directory may not be in your $PATH,
so the programs there would not be visible to your shell
unless you specify the full path to them. The second place would be
/local/cluster/program
or some variation of that, e.g.
/local/cluster/SPAdes
for the SPAdes assembler.
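For example, a quick way to check whether a program is already on your $PATH and, if not, to look for it under /local/cluster (the exact SPAdes layout below is an assumption; check the actual directory):

```
# is the program already on my $PATH?
which spades.py

# if not, look for an install directory under /local/cluster
ls /local/cluster | grep -i spades

# and call it by its full path (layout is an assumption)
/local/cluster/SPAdes/bin/spades.py --version
```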
If you still can’t find the program, submit a support ticket request here and someone should contact you in approximately 24-48 hours, pending other requests. You can also submit tickets for program updates; it’s difficult to keep track of updating thousands of pieces of software at once!
Additionally, you can maintain a set of some of your own software. We strongly encourage users to submit a request form, so that everyone can benefit from the program updates. If you are working on projects with other members of your lab, you could set up a shared program directory so that you could all use the same new programs.
Compiling a program from source is examined in
the primer.
I’d recommend calling your local program directory either ‘local’ or ‘opt’,
e.g. /nfs/DEPT/Lab/user/local or /nfs/DEPT/Lab/user/opt, and then adding the
‘bin’ directory one level below that to your path:
export PATH=/nfs/DEPT/Lab/user/local/bin:${PATH}
for bash or
setenv PATH /nfs/DEPT/Lab/user/local/bin:${PATH}
for tcsh (current default
shell)
You can add the export/setenv line to your .cshrc or .bashrc, depending on
your shell, so that the directory will always be in your $PATH, even after you
log out and log back in. When you install new programs, specify
--prefix=/nfs/DEPT/Lab/user/local and they will be installed into a location
that is already on your $PATH.
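Putting it all together, a rough sketch of installing a typical autotools-style program into your own space (the program name and URL are placeholders; the prefix follows the layout suggested above):

```
# download and unpack a source tarball (name/URL are hypothetical)
wget https://example.org/someprogram-1.0.tar.gz
tar -xzf someprogram-1.0.tar.gz
cd someprogram-1.0

# configure it to install into your lab-space local directory
./configure --prefix=/nfs/DEPT/Lab/user/local
make
make install

# the new binary now lives in a directory already on your $PATH
which someprogram
```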
Running jobs on the CQLS infrastructure
Now you know where programs are and where you are going to store the data. The last piece of information you need is how to submit a job to the Son of Grid Engine (SGE, a successor to Sun Grid Engine) queuing system. Some basic information is found here:
https://arc.liv.ac.uk/SGE/howto/basic_usage.html
Command | Purpose | Notes |
---|---|---|
qstat | Check status of running jobs; queue information | Specify -u '*' for all users; qstat -g c -U <username> shows a queue summary for the queues you can access |
qsub | Submit a script containing configuration & shell commands | Use the SGE_Array or SGE_Batch wrappers instead |
qhost | Shows information about each compute node | |
qrsh/qlogin | Submits shell or login commands | qrsh with no arguments submits a qlogin request |
qacct | Allows querying of finished jobs | qacct -u <username> -d 7 -j -E shows jobs finished in the last 7 days |
In order to make things easier for our users, we have provided three scripts to help with job submission and queue maintenance:
Wrapper | Purpose | Notes |
---|---|---|
SGE_Batch | Submits a single job to the queue | Must provide the -r flag (the runid, which names the directory where standard error/output will go); -c takes a quoted set of commands or an executable shell script |
SGE_Array | Submits many jobs from a text file to the queue | Useful when submitting a lot of similar jobs at once |
SGE_Avail | Shows queueing resources available to you at this time | Specify an alternative user with the -u flag |
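For example, a minimal sketch of submitting many similar jobs with SGE_Array (the file name and commands are placeholders, and the flags are assumed to mirror SGE_Batch; check SGE_Array --help for the actual options):

```
# one command per line; each line becomes its own job
cat > gzip_commands.txt <<'EOF'
gzip sample1.fastq
gzip sample2.fastq
gzip sample3.fastq
EOF

# submit the whole file to the queue
# (-c/-r are assumed to mirror the SGE_Batch flags above)
SGE_Array -c gzip_commands.txt -r sge.gzip_samples
```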
All resource-intensive commands should be run through the queuing system. Moving/copying files, editing text files, and setting up shell scripts are fine to do from vaughan, but everything else should be done on a compute node through SGE.
In order to check out a node for interactive use, you can use the qrsh
command as listed above. To check out multiple processors at once on a
machine, use the qrsh -pe thread N
and replace N with the number of
processors you’d like to check out.
Please exit the node when you are finished so that those resources can be reclaimed by the queue.
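For example (the queue name here is just the one used in the BLAST example below; pick one that SGE_Avail shows is available to you):

```
# request an interactive session with 4 processors on a compute node
qrsh -pe thread 4 -q micro

# ... do your interactive work on the node ...

# when finished, log out so the queue can reclaim the resources
exit
```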
Hey, without a graphical interface, how do I even edit a text file?!
Relevant section of the primer
I know, this was a problem for me too. You have several options:
nano, emacs, and vi(m) are the three primary command-line text editors. They are listed, arguably, in order of learning curve, from easiest to hardest. I would encourage everyone who plans to live on the command-line for an extended period to attempt to use vim - it keeps your hands on the home row and allows you to have higher productivity. I understand there is a steep learning curve because I went through it myself, having used emacs for a long time prior to starting vim.
Here’s a place to get a vim config file for starters: https://github.com/amix/vimrc
There have been numerous articles discussing this, so you can search on Google about the editor wars. I’ll link two articles here to save you time.
Medium post, Linux training academy post
In the end, you’ll just have to pick something. If you are unhappy with your decision, then you can always switch to another option!
Can I run BLAST from the command-line?
Yes! We have set your $BLASTDB
environment variable to
/nfs1/CGRB/databases/NCBI/latest_blast_DB
for you. If you need to search
against the nt or nr databases, you can specify them as the -db flag for blastn
or blastp e.g. blastp -db nr -query query.fasta -outfmt 7 -out query_vs_nr.tab
to do a BLAST search against nr with a query sequence.
NOTE: The BLAST databases have been updated to v5. The new location is /nfs1/CGRB/databases/NCBI/v5/latest_blast_DB, and the $BLASTDB environment variable has been updated for you, so you should not have to change anything yourself to access these DBs. You shouldn’t notice any discernible differences in usage, but if you do, let us know!
See here for more info: https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf
Here’s an example BLAST command you can run on the infrastructure to give you an idea of how to use the SGE system.
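A sketch of such a command (query.fasta is a placeholder for your own file, and the -P flag for requesting processors is an assumption; check SGE_Batch --help for the exact option):

```
# submit a blastp search against nr through the queue
# (-r sets the runid; -P, the processor-count flag, is an assumption)
SGE_Batch \
  -c "blastp -db nr -query query.fasta -num_threads 8 -outfmt 7 -out query_vs_nr.tab" \
  -r sge.blastp_nr \
  -q micro \
  -P 8
```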
Replace the -q micro with a queue that is available for you. This search should take around 14 minutes with 8 threads. You can check whether the job is running with the qstat command. I prefix all of my SGE jobs with sge., but it’s not necessary. One restriction on naming is that runids cannot start with a number; since I sometimes generate runids automatically from sample names, which may start with numbers, the sge. prefix ensures that every runid is valid.
You’ll see a directory has been generated that has the same name as the runid that you’ve provided. Within the directory, there are files that contain the standard error and standard output of the job. Here’s the relevant section of the primer.
When the job is finished, you can see some stats about the run in the .e (standard error) file. You can also examine stats about the run using the qacct command, e.g. qacct -j <job_id> (see the table above for other useful qacct flags).
Now you should be able to get around the infrastructure and find your programs!
How do the compute nodes access the same data as on vaughan?
The compute nodes are all connected over the network to the same data that is stored on networked drives. The networked drives are attached to several different servers that run the file system.
In this way, the same files one edits on vaughan are available on each compute node, and vice-versa.
How do I get data on and off the infrastructure?
You can connect using sftp or scp to the files.cqls.oregonstate.edu machine, aka dayhoff, to get your data on or off of the infrastructure. Please do not run compute jobs on dayhoff; submit those jobs to the queuing system instead.
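For example, a minimal sketch run from your own computer (the username, file names, and lab path are placeholders):

```
# copy a file from your computer up to your lab space
scp reads.fastq.gz username@files.cqls.oregonstate.edu:/nfs/DEPT/Lab/user/

# copy results back down to your computer
scp username@files.cqls.oregonstate.edu:/nfs/DEPT/Lab/user/query_vs_nr.tab .
```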
Feel free to contact me if you have specific questions.