Frequently Asked Questions

2022-10-01

New User ‘onboarding’ / FAQs

Assumptions

You have experience using the command-line; if not, see Training below

Accounts

On macOS or Linux-based operating systems, you should have access to a terminal and the ssh program, which lets you connect to the infrastructure. On Windows, you will need to use the Windows Subsystem for Linux (WSL) or PuTTY.

You should log in to shell.cqls.oregonstate.edu using the information provided when you signed up. To eliminate the need to use Duo, you can set up ssh keys that help confirm your identity.
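
As a rough sketch of the key setup (bash syntax; the default tcsh shell would need this saved as a script and run with bash), the commands below use the standard OpenSSH tools. The user name and key file name are placeholders, and the host name is the one used elsewhere in this FAQ. By default the script only prints the commands; set DRY_RUN=0 to run them.

```shell
#!/usr/bin/env bash
# Sketch: create an SSH key pair and install the public key on the login host.
# ASSUMPTIONS: OpenSSH client tools are installed; "your_username" and the key
# file name are placeholders, not values taken from the CQLS documentation.

LOGIN_HOST="shell.cqls.oregonstate.edu"
REMOTE_USER="your_username"
KEY_FILE="$HOME/.ssh/id_ed25519_cqls"

# Print each command by default; set DRY_RUN=0 to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_ssh_key() {
    # Generate an Ed25519 key pair; ssh-keygen prompts for a passphrase.
    run ssh-keygen -t ed25519 -f "$KEY_FILE" -C "$REMOTE_USER@cqls"
    # Append the public key to ~/.ssh/authorized_keys on the login host.
    run ssh-copy-id -i "$KEY_FILE.pub" "$REMOTE_USER@$LOGIN_HOST"
    # Log in using the new key.
    run ssh -i "$KEY_FILE" "$REMOTE_USER@$LOGIN_HOST"
}

setup_ssh_key
```

Protecting the private key with a passphrase is generally a good idea, since anyone who obtains the key file could otherwise log in as you.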

How do I change my password?

The passwd program allows you to change your password. You will have to enter your current (even temporary) password before entering your new password twice. If you have ssh keys set up, you do not need to enter your password to log in. Your password is still useful for signing in to https://gitlab.cqls.oregonstate.edu.

What if I forget my password?

You have to submit a support ticket at https://shell.cqls.oregonstate.edu/support/

What am I allowed to do on the shell server?

On shell.cqls.oregonstate.edu, a machine named vaughan, you can edit text files, submit jobs to our queuing system (currently SGE), and do basic text processing. Jobs requiring lots of processors and/or memory will be killed.

What am I NOT allowed to do on the shell server? Why?

Most processing jobs will be killed. This ensures that everyone can log in to shell.cqls and that processing jobs do not slow down the shell machine. If all processors on the shell.cqls machine were in use, users would have a difficult time logging in, and users who are already logged in would have difficulty submitting jobs to SGE. If your command on shell.cqls gets killed, please submit the job using SGE_Batch or SGE_Array.

What is the default shell?

The default shell is tcsh. Users can request a change of default shell to bash by submitting a ticket.

What is a shell?

The shell is a command-line interface between you (the user) and the computer or server. The shell reads what you type, interprets it as commands, and passes those commands to the computer or server in a form it can carry out.

Do I have a quota for my $HOME directory?

Users have a 25GB quota for their home directories.

How do I check my quota?

Use the quota -s command to see your current usage and quota.

What do I do if I go over my quota?

If you exceed your quota, you will need to remove files until your usage is under the 25GB limit.
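
To decide what to remove, it can help to see which files and directories take the most space. The following is a minimal sketch in bash syntax using only find, du, sort, and head; the function name largest_entries is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: list the largest entries in a directory to decide what to remove.

# largest_entries DIR [N]: print the N largest items directly inside DIR
# (sizes in KB), biggest first. Hidden files and directories are included.
largest_entries() {
    local dir="${1:-$HOME}"
    local count="${2:-10}"
    # du -s: one total per entry; -k: report sizes in kilobytes
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}

# Example:
#   largest_entries "$HOME" 10
```

Running it again inside the largest directory narrows the search down one level at a time.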

What should I store in my $HOME directory?

Only minimal configuration files and other small files should be stored in your $HOME directory. All processing should be done on networked filesystem drives and the local /data drives of the processing machines.

File Transfers

What server should I use to transfer files via SFTP/SCP?

You must use files.cqls.oregonstate.edu for file transfers. File transfers are disabled on shell.cqls.oregonstate.edu.

What is SFTP?

SFTP stands for Secure File Transfer Protocol. SFTP allows for file transfers using the same security provided by ssh.

What is SCP?

SCP stands for Secure Copy Protocol. The scp program allows secure file transfers between the infrastructure and your own computer.
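
A minimal sketch of typical scp usage from your own computer follows (bash syntax). The user name and all paths are placeholders; the script prints the commands by default, and DRY_RUN=0 runs them.

```shell
#!/usr/bin/env bash
# Sketch: copy files to and from the transfer host with scp.
# ASSUMPTIONS: "your_username" and all paths are placeholders.

TRANSFER_HOST="files.cqls.oregonstate.edu"
REMOTE_USER="your_username"

# Print each command by default; set DRY_RUN=0 to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# upload LOCAL_FILE REMOTE_DIR: copy a local file to the infrastructure.
upload() {
    run scp "$1" "$REMOTE_USER@$TRANSFER_HOST:$2/"
}

# download REMOTE_PATH LOCAL_DIR: copy a remote file or directory (-r) locally.
download() {
    run scp -r "$REMOTE_USER@$TRANSFER_HOST:$1" "$2"
}

upload reads.fastq.gz /path/to/lab/space
download /path/to/lab/space/results .
```

For an interactive session, sftp your_username@files.cqls.oregonstate.edu opens a prompt where put and get transfer individual files.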

Can I use FTP?

FTP (File Transfer Protocol) is insecure and CQLS servers will not host files over FTP. However, you can use the ftp program from files.cqls.oregonstate.edu to transfer files externally, e.g. to NCBI, if necessary.

Why can’t I transfer files via SFTP to shell.cqls?

The shell.cqls.oregonstate.edu machine is reserved for logging in, submitting jobs, and basic text processing, not file transfers.

I want to publish data via the web. How can I do this?

You can be provided a lab directory on the files.cqls.oregonstate.edu machine to publish data externally. Please submit a support request to find out more information. See here for examples.

Can I access my files via the web?

Not currently.

Storage

What is ZFS / NFS?

ZFS is a file system and volume manager designed to scale to very large storage pools while protecting against data loss. Our networked file system (NFS) drives are built on ZFS.

What is DFS / Quobyte?

DFS is a distributed file system. Its use has been discontinued at the CQLS.

What is stored on NFS, what should I be using it for?

All data and outputs should be stored on the NFS. We recommend using the /data drives, which are specific to each compute node, for doing the analysis and then copying the results back onto an NFS location.
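
The stage, run, and copy-back pattern can be sketched as follows (bash syntax). The function name run_on_scratch and the convention that the command writes its results to ./out are choices made for this example, and temporary directories stand in for /data and the NFS location so that it runs anywhere.

```shell
#!/usr/bin/env bash
# Sketch: stage an input on node-local scratch, run there, copy results back.
# ASSUMPTIONS: the scratch and NFS paths are placeholders; on a compute node
# the scratch root would be a directory under /data.

# run_on_scratch SCRATCH_ROOT INPUT_FILE RESULT_DIR COMMAND [ARGS...]
# COMMAND runs inside a private scratch directory with INPUT_FILE copied in;
# everything it writes to ./out is copied to RESULT_DIR afterwards.
run_on_scratch() {
    local scratch_root="$1" input="$2" result_dir="$3"
    shift 3
    local workdir
    workdir="$(mktemp -d "$scratch_root/job.XXXXXX")" || return 1
    mkdir -p "$workdir/out" "$result_dir"
    # Run in a subshell so the caller's working directory is unchanged.
    cp "$input" "$workdir/" \
        && ( cd "$workdir" && "$@" ) \
        && cp -r "$workdir/out/." "$result_dir/"
    local status=$?
    rm -rf "$workdir"   # always clean up the scratch space
    return "$status"
}

# Example with temporary directories standing in for /data and NFS.
scratch="$(mktemp -d)"
nfs="$(mktemp -d)"
printf 'c\na\nb\n' > "$nfs/input.txt"
run_on_scratch "$scratch" "$nfs/input.txt" "$nfs/results" \
    sh -c 'sort input.txt > out/sorted.txt'
cat "$nfs/results/sorted.txt"
```

Cleaning up the scratch directory even when the command fails keeps the node-local drive from filling up for other users.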

Does my lab have NFS space?

You may have access to NFS space; ask the postdoc or professor in your lab, who will know.

I accidentally deleted a file; are my files backed up?

The $HOME directory is backed up. Each NFS location may or may not be backed up, depending on whether your lab pays for storage backup. Please contact Matthew Peterson if you need backup for your space.

What is tape backup?

Tape backup is a long term storage backup for recovery after some disaster. All sequencing runs are copied to tape prior to deletion. Each lab is still responsible for copying and maintaining raw sequencing data; tape backup is used for emergencies only.

How do I get more space added to our NFS space?

Please contact Chris Sullivan to purchase more NFS disk space.

Batch Processing

What is batch processing?

A batch process is one that can run without human interaction. When we submit processes in a non-interactive mode on the infrastructure, we are submitting batch processes.

What is SGE?

SGE is Son of Grid Engine, which is a queuing system. SGE allows us to submit batch jobs to different compute nodes across the infrastructure, such that each job runs when resources permit.

What is an SGE queue?

We have different queues available on the infrastructure so that each lab may have different resources available at any given time.

What queues are available to me?

Not all labs/colleges have access to the same resources. You can see which resources are available to you by running SGE_Avail.

How do I submit jobs?

There are multiple ways to submit jobs. A single job can be run with the SGE_Batch command. More information about queueing systems will be released in a separate post.

What is SGE_Array?

SGE_Array allows easy submission of array jobs, which are most commonly used when a user has a command that they want to run on many individual inputs (10s-1000s). Instead of submitting hundreds or thousands of jobs, you can submit a single array job and control how many tasks are running at once.
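
One common way to prepare such a run is to write one command per input file. The sketch below (bash syntax) generates a list of fastp commands; the function name make_commands and the directory layout are illustrative, and the idea that SGE_Array reads one command per line from a file is an assumption to confirm against the SGE_Array help output.

```shell
#!/usr/bin/env bash
# Sketch: write one command per input file, ready to submit as an array job.
# ASSUMPTIONS: the fastp options and directory layout are illustrative; how
# SGE_Array takes its command list should be checked with its help output.

# make_commands INPUT_DIR OUTPUT_DIR: print one fastp command per *.fastq.gz.
make_commands() {
    local in_dir="$1" out_dir="$2" fq name
    for fq in "$in_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue          # skip if the glob matched nothing
        name="$(basename "$fq" .fastq.gz)"
        echo "fastp -i $fq -o $out_dir/$name.trimmed.fastq.gz"
    done
}

# Example:
#   make_commands raw_reads trimmed > commands.txt
#   wc -l commands.txt      # one task per line
```

Reviewing the generated file before submitting makes it easy to catch a wrong path while it is still one typo rather than a thousand failed tasks.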

How do I check the status of jobs?

The qstat command shows which of your jobs are queued or running. To see more detail about a specific job, run qstat -j $JOBID.

How do I kill jobs?

You kill jobs with the qdel $JOBID command.

What happens if I see an ‘E’ state?

If you see an E state, run qstat -j $JOBID to read the error message. Sometimes the error state can be cleared with the clear_Eqw_job.sh command. Other times, the job should be killed with qdel so that you can change settings before resubmitting it.
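
With many jobs in the queue, a small filter can pick out the ones in an error state. This sketch (bash syntax) assumes the default qstat layout, with two header lines and the state in the fifth column; the sample listing is made up so the example runs without a cluster.

```shell
#!/usr/bin/env bash
# Sketch: list the IDs of jobs whose state contains "E" (for example Eqw).
# ASSUMPTION: input is in the default qstat layout, where the first two lines
# are a header and the fifth column is the job state.

# error_jobs: read qstat output on stdin, print the IDs of jobs in error.
error_jobs() {
    awk 'NR > 2 && $5 ~ /E/ { print $1 }'
}

# Made-up sample standing in for real qstat output.
sample_qstat() {
cat <<'EOF'
job-ID  prior   name     user   state submit/start at     queue        slots
-----------------------------------------------------------------------------
 100001 0.55500 align    alice  r     10/01/2022 10:00:00 all.q@node1  4
 100002 0.00000 assemble alice  Eqw   10/01/2022 10:05:00              1
 100003 0.00000 trim     alice  qw    10/01/2022 10:06:00              1
EOF
}

sample_qstat | error_jobs

# On the cluster this would be:   qstat -u "$USER" | error_jobs
# Inspect each reported ID with:  qstat -j JOBID
```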

How do I know how long my bioinformatics job will run?

We suggest running a small test dataset through your pipeline(s) to determine an expected amount of processing time/resource utilization. The user is responsible for ensuring that their jobs are only using the amount of resources requested in the queuing system. Please monitor your jobs (you can qrsh to the same node that your job is running on to check the health of the machine using e.g. htop) to ensure everything is going as requested.
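
One simple way to build a small test dataset is to take the first reads of a FASTQ file. The sketch below (bash syntax) assumes uncompressed FASTQ with four lines per record; first_reads and my_pipeline are placeholder names. The first reads of a run are not a random sample, which is usually acceptable for a timing test.

```shell
#!/usr/bin/env bash
# Sketch: take the first N reads of a FASTQ file as a quick test dataset.
# A FASTQ record is four lines, so N reads are the first 4*N lines.
# ASSUMPTION: the file is uncompressed FASTQ with four lines per record.

# first_reads FASTQ N: print the first N reads of FASTQ.
first_reads() {
    head -n "$(( $2 * 4 ))" "$1"
}

# Example (my_pipeline is a placeholder for your own command):
#   first_reads reads.fastq 100000 > test.fastq
#   time my_pipeline test.fastq
```

Run time does not always scale linearly with input size, so treat the extrapolation as a rough estimate and check memory use as well as time.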

How does my lab obtain more processing resources?

Most colleges have access to shared computing resources; if you are a member of a college where you think you should have access to machines and they are not available, please submit a support request. If you or your college does not have resources available, we have machines available to rent for up to 6 months. If you need more than that, you can email Chris Sullivan to discuss other options, including current costs of purchasing machines.

Another lab has asked me to collaborate with them, but I cannot access their files or compute resources, what do I do?

Email Chris Sullivan and cc the appropriate collaborators to get access to their files.

Support Tickets

How do I follow up to obtain further support?

To request general support, use the support form.

Please use the ‘cgrb-support’ option for general questions and ‘cgrb-software’ option for software requests.

How do I accurately describe my issue?

Please provide information regarding:

What machine you are having an issue with (use qstat to see the node)
What software you are trying to run
What your expected output is, and what the observed output is
A copy/paste of any error messages you may have
How to reproduce the issue
What you may have tried to resolve the issue
Any links to the software or software help pages that might help
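
Several of the items above can be captured in one step. The following sketch (bash syntax) runs a command and saves the machine name, date, command line, all output, and the exit status to a file that can be pasted into a ticket; the function name capture_for_ticket is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: run a command and record what a support ticket usually needs.

# capture_for_ticket REPORT_FILE COMMAND [ARGS...]
capture_for_ticket() {
    local report="$1"
    shift
    {
        echo "Host:    $(uname -n)"
        echo "Date:    $(date)"
        echo "Command: $*"
        echo "--- output and error messages ---"
        "$@" 2>&1
        # $? here is still the exit status of the command above.
        echo "--- exit status: $? ---"
    } > "$report"
}

# Example (my_program is a placeholder):
#   capture_for_ticket ticket_report.txt my_program --input data.txt
```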

If you are submitting a software install request, please provide a link to the github page or other source material.

How do I check on the status of my support ticket?

Please follow up by emailing Support for support requests, and Ed Davis for software requests.

Training

How can I learn more about using the command line?

We offer ‘Intro to Unix/Linux’ and ‘Command-line data analysis’ courses; see the Workshops page for more information. We also offer one-on-one training for an hourly fee; please email the bioinformatics team.

Software

What software do I use for…

Adapter trimming

For automated adapter trimming, we currently recommend fastp. fastp is a good option for situations where you have adapters on the 3’ end of reads due to read-through of short inserts into the sequencing adapter. For trimming of primer sequences or other custom sequences, we suggest using bbduk.sh or cutadapt.

Short read alignment

bwa mem

Spliced alignment

STAR or hisat2

Long read alignment

minimap2

Genome assembly (Illumina)

SPAdes

Genome assembly (long read)

flye or nextdenovo

RNA-Seq quantification

salmon

Differential gene expression analysis

deseq2

Pairwise sequence alignment

blast or diamond

Multiple sequence alignment

mafft --auto

Phylogenetic tree construction

IQ-TREE; fasttree can be useful for preliminary analysis

Orthologous group calculation (Prokaryote)

PIRATE for cultured organisms; PPanGGOLiN for MAGs/SAGs

anvio is useful for pangenome analysis as well

Principal component analysis or other ordination/dimensional reduction

Using R: prcomp for principal component analysis (PCA); vegan for nonmetric multidimensional scaling (NMDS) or constrained ordination e.g. redundancy analysis (RDA); ape for principal coordinate analysis (PCoA)

How do I…

Figure out how to run a program

Use -h e.g. $program -h
Use --help e.g. $program --help
Use help e.g. $program help
Use man (this works for system installed things like cat, mkdir, ls) e.g. man $program
Use tldr (works for common programs, awk, sed, tar, wget) e.g. tldr $program
Examine the script with less e.g. which $program; less -S /local/cluster/bin/$program (Note: does not work for compiled software)
Search the program name on Google
Search the program name on the updates website https://software.cqls.oregonstate.edu/updates/tags

You can also try helpme for some curated help data.

Display a formatted markdown file on the command line

Use glow

Download reads from NCBI

Use prefetch and fasterq-dump. See here for more info.
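
A minimal sketch of the two steps with the SRA Toolkit follows (bash syntax; fasterq-dump is the toolkit's name for the FASTQ conversion program). The accession SRR0000000 and the output directory are placeholders. The script prints the commands by default, and DRY_RUN=0 runs them.

```shell
#!/usr/bin/env bash
# Sketch: download one SRA run and convert it to FASTQ with the SRA Toolkit.
# ASSUMPTIONS: prefetch and fasterq-dump are on your PATH; the accession and
# output directory are placeholders.

# Print each command by default; set DRY_RUN=0 to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# fetch_reads ACCESSION OUTDIR
fetch_reads() {
    # Download the run in SRA format.
    run prefetch "$1"
    # Convert to FASTQ, writing paired reads to separate files.
    run fasterq-dump --split-files --outdir "$2" "$1"
}

fetch_reads SRR0000000 fastq
```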

Download genomes from NCBI

You can use the data-hub to get genome data. Use files.cqls.oregonstate.edu for downloads.

You can also use the get_assemblies program (use python3 -m pip install --user get-assemblies to install).

Generate a BLASTDB

Use makeblastdb -in INPUT.fasta -dbtype [nucl|prot] to generate your BLASTDB. Please submit using SGE_Batch -c ....

Do a BLASTN or BLASTP search

Use blastp or blastn. Use the -help flag for options. Do not use blastall; it is old and no longer supported.
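
Putting database creation and a search together, a nucleotide example might look like the sketch below (bash syntax). The file names, the e-value cutoff, and the choice of tabular output are illustrative. The script prints the commands by default; on the infrastructure they should be submitted through the queuing system rather than run on the shell machine.

```shell
#!/usr/bin/env bash
# Sketch: build a nucleotide BLAST database and search it with blastn.
# ASSUMPTIONS: BLAST+ is on your PATH; file names and the e-value cutoff are
# placeholders chosen for illustration.

DB_TYPE="nucl"   # use "prot" (and blastp instead of blastn) for proteins

# Print each command by default; set DRY_RUN=0 to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_and_search SUBJECT_FASTA QUERY_FASTA OUT_TSV
build_and_search() {
    local db="${1%.*}"   # database name: the FASTA path without its extension
    run makeblastdb -in "$1" -dbtype "$DB_TYPE" -out "$db"
    # -outfmt 6 writes tab-separated hits, one per line.
    run blastn -query "$2" -db "$db" -outfmt 6 -evalue 1e-5 -out "$3"
}

build_and_search genome.fasta genes.fasta hits.tsv
```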

Miscellaneous

Do my sequences have adapters?

Index sequences are not included in the raw reads. Adapters could be on the 3' end of reads, depending on the library prep type. Please check the FastQC reports sent by Matthew with each sequencing run to determine whether your reads have adapters. If they do, use the fastp program to remove them.

What is going on with my 16S sequencing results?

See this page for some information regarding the 16S preps provided at the CQLS.