Frequently Asked Questions

2022-10-01

New User ‘onboarding’ / FAQs

Assumptions

You have experience using the command line; if not, see Training below.

Accounts

How do I login? (SSH)

When using macOS or Linux-based operating systems, you should have access to a terminal and the ssh program that will let you connect to the infrastructure. On Windows, you will need to use the Windows Subsystem for Linux (WSL) or PuTTY.

What server do I login to?

You should log in to shell.cqls.oregonstate.edu using the information provided upon signing up. To eliminate the need to use Duo, you can set up ssh keys that help confirm your identity.
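A minimal sketch of setting up ssh keys, assuming OpenSSH's ssh-keygen and ssh-copy-id are available on your own computer. The make_key helper and your_username are placeholders for illustration, and the ed25519 key type is an assumption rather than a stated site requirement.

```shell
# make_key KEYFILE: create an ed25519 key pair at KEYFILE unless it already exists.
# (Hypothetical helper. -N "" sets an empty passphrase so the sketch runs
# non-interactively; drop -N "" to be prompted for a passphrase instead.)
make_key() {
  local key="$1"
  mkdir -p "$(dirname "$key")" || return 1
  [ -f "$key" ] || ssh-keygen -q -t ed25519 -f "$key" -N ""
}

# On your own computer (not run here; your_username is a placeholder):
#   make_key ~/.ssh/id_ed25519
#   ssh-copy-id -i ~/.ssh/id_ed25519.pub your_username@shell.cqls.oregonstate.edu
#   ssh your_username@shell.cqls.oregonstate.edu
```

ssh-copy-id appends the public key to ~/.ssh/authorized_keys on the server; the private key never leaves your computer.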

How do I change my password?

The passwd program allows you to change your password. You’ll have to enter your current (even temporary) password before entering your new password twice. If you have ssh keys set up, you do not need to enter your password to log in, but your password is still needed for signing in to https://gitlab.cqls.oregonstate.edu

What if I forget my password?

You have to submit a support ticket at https://shell.cqls.oregonstate.edu/support/

What am I allowed to do on the shell server?

On shell.cqls.oregonstate.edu, a machine named vaughan, you can edit text files, submit jobs to our queuing system (currently SGE), and do basic text processing. Jobs requiring lots of processors and/or memory will be killed.

What am I NOT allowed to do on the shell server? Why?

Most processing jobs will be killed. This keeps the shell machine responsive so that everyone has equal access for logging in to shell.cqls. If all processors on the shell.cqls machine were in use, users would have a difficult time logging in, and users already logged in would have difficulty submitting jobs to SGE. If your command on shell.cqls gets killed, please submit it as a job using SGE_Batch or SGE_Array.

What is the default shell?

The default shell is tcsh. Users can request a change of default shell to bash by submitting a ticket.

What is a shell?

The shell is a command-line interface between you (the user) and the computer or server. The shell reads what you type, interprets it as commands, and passes those commands to the computer or server to run.

Do I have a quota for my $HOME directory?

Users have a 25GB quota for their home directories.

How do I check my quota?

Use the quota -s command to see your current usage and quota.

What do I do if I go over my quota?

If you exceed your quota, you will need to remove files until you are back under the 25GB limit.
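To decide what to remove, it helps to see which directories take the most space. A minimal sketch, assuming GNU du and sort; biggest_dirs is a hypothetical helper name.

```shell
# biggest_dirs DIR [N]: list the N largest directories directly under DIR,
# with sizes in KB. The first line is the total for DIR itself.
biggest_dirs() {
  du -k --max-depth=1 "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example:
#   biggest_dirs "$HOME" 10
```

Hidden directories are included in the listing and are counted against your quota like any others.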

What should I store in my $HOME directory?

Store only configuration files and other small files in your $HOME directory. All processing should be done on networked filesystem drives and the local /data drives of the processing machines.

How do I edit my $PATH variable and save it across log-ins?

The exact changes you need to make depend on your $SHELL (either bash or tcsh).

I suggest making the change temporarily first, and then changing your configuration file (~/.bashrc for bash, and ~/.cshrc for tcsh).

You’ll need the full path to the directory that contains the program(s) you want to add to your $PATH. To temporarily add the programs to your $PATH, run the appropriate command below (export for bash, setenv for tcsh):

export PATH=/path/to/new/directory:${PATH}
setenv PATH /path/to/new/directory:${PATH}

After you test out the new command, you can add the line that matches your shell to your config file:

echo 'export PATH=/path/to/new/directory:${PATH}' >> ~/.bashrc
echo 'setenv PATH /path/to/new/directory:${PATH}' >> ~/.cshrc

Keys to doing this properly are:

Make sure to use the >> append redirect so you don’t overwrite your config (> will overwrite the file).
Make sure to use single quotes ' for your echo command; otherwise the ${PATH} variable gets expanded when you run echo, and your current $PATH is written into the config file instead of the variable reference.

You can also edit the appropriate file directly with a text editor (vim, emacs, nano) if you are comfortable doing so.

After you add the changes to your config files, the updated $PATH will get loaded on every new log-in shell.
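If your config file is read more than once in a session, a plain export line adds the same directory to $PATH each time. A minimal sketch for bash only (not tcsh) that avoids duplicates; path_prepend is a hypothetical helper name, not something provided on the infrastructure.

```shell
# path_prepend DIR: put DIR at the front of PATH unless it is already listed.
path_prepend() {
  case ":${PATH}:" in
    *":$1:"*) ;;                  # already present, leave PATH unchanged
    *) PATH="$1:${PATH}" ;;
  esac
  export PATH
}

# Example (placeholder path, as in the commands above):
#   path_prepend /path/to/new/directory
#   command -v some_program    # confirm the shell now finds the program
```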

File Transfers

What server should I use to transfer files via SFTP/SCP?

You must use files.cqls.oregonstate.edu for file transfers. File transfers are disabled on shell.cqls.oregonstate.edu.

What is SFTP?

SFTP stands for SSH File Transfer Protocol (often expanded as Secure File Transfer Protocol). SFTP allows for file transfers, both to and from the infrastructure, using the same security provided by ssh.

What is SCP?

SCP stands for Secure Copy Protocol. The scp program allows secure file transfers in either direction between the infrastructure and your own computer.
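A rough sketch of scp in both directions, run from your own computer. The username, remote directory, and file name are placeholders, and the run wrapper is a hypothetical dry-run helper that only prints each command unless DRY_RUN=0 is set.

```shell
REMOTE="your_username@files.cqls.oregonstate.edu"   # your_username is a placeholder
REMOTE_DIR="/path/to/lab/dir"                        # placeholder remote directory

# run: print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Upload a local file to the infrastructure.
run scp results.tar.gz "${REMOTE}:${REMOTE_DIR}/"

# Download a file from the infrastructure into the current directory.
run scp "${REMOTE}:${REMOTE_DIR}/results.tar.gz" .
```

The remote side is always written as user@host:path; whichever argument comes last is the destination.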

Can I use FTP?

FTP (File Transfer Protocol) is insecure and CQLS servers will not host files over FTP. However, you can use the ftp program from files.cqls.oregonstate.edu to transfer files externally, e.g. to NCBI, if necessary.

Why can’t I transfer files via SFTP to shell.cqls?

The shell.cqls.oregonstate.edu machine should be used for editing files and submitting jobs only, not file transfers. Use files.cqls.oregonstate.edu for transfers.

Can I transfer files using a Windows drive share to the infrastructure?

TBD

I want to publish data via the web, how can I do this?

You can be provided a lab directory on the files.cqls.oregonstate.edu machine to publish data externally. Please submit a support request to find out more information. See here for examples

Can I access my files via the web?

See above.

Storage

What is ZFS / NFS?

ZFS is a file system and volume manager designed to scale to very large storage pools and to protect against data loss. Our networked file system (NFS) drives use ZFS as the underlying file system.

What is DFS / Quobyte?

DFS is a distributed file system. Its use has been discontinued at the CQLS.

What is stored on NFS, what should I be using it for?

All data and outputs should be stored on NFS. We recommend doing the analysis on the /data drives, which are local to each compute node, and then copying the results back to an NFS location.
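The copy-in, compute, copy-back pattern can be sketched roughly as follows. stage_and_run is a hypothetical helper, and the convention that the command writes its outputs into ./out is an assumption made for this example, as is being able to create a working directory directly under the scratch location.

```shell
# stage_and_run NFS_DIR SCRATCH_ROOT INPUT COMMAND...
# Copies INPUT from NFS_DIR into a fresh directory under SCRATCH_ROOT, runs
# COMMAND there (it should write its outputs into ./out), copies ./out back
# to NFS_DIR/results, and removes the scratch directory.
stage_and_run() {
  local nfs_dir="$1" scratch_root="$2" input="$3" work
  shift 3
  work=$(mktemp -d "${scratch_root}/job_XXXXXX") || return 1
  mkdir "${work}/out" &&
    cp "${nfs_dir}/${input}" "${work}/" &&
    ( cd "${work}" && "$@" ) &&
    mkdir -p "${nfs_dir}/results" &&
    cp -r "${work}/out/." "${nfs_dir}/results/"
  local status=$?
  rm -rf "${work}"      # clean up scratch space whether or not the run worked
  return "${status}"
}

# Example (placeholder paths):
#   stage_and_run /path/to/nfs/project /data reads.txt sh -c 'sort reads.txt > out/sorted.txt'
```

Cleaning up the scratch directory at the end matters because the /data drives are shared by every job that lands on that node.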

Does my lab have NFS space?

You may have access to NFS space; ask the postdoc or professor in your lab, who will know.

I accidentally deleted a file; are my files backed up?

The $HOME directory is backed up. Each NFS location may or may not be backed up, depending on whether your lab pays for storage backup. Contact Support if you need to start backups on your space. Please contact Matthew Peterson if you need to recover previously backed up data.

What is tape backup?

Tape backup is a long term storage backup for recovery after some disaster. All sequencing runs are copied to tape prior to deletion. Each lab is still responsible for copying and maintaining raw sequencing data; tape backup is used for emergencies only, and is not guaranteed.

How do I get more space added to our NFS space?

Please contact Support to purchase more NFS disk space.

Batch Processing

What is batch processing?

A batch process is one that can run without human interaction. When we submit processes in a non-interactive mode on the infrastructure, we are submitting batch processes.

What is SGE?

SGE is Son of Grid Engine, which is a queuing system. SGE allows us to submit batch jobs to different compute nodes across the infrastructure, such that each job runs when resources permit.

What is an SGE queue?

A queue is a named set of compute resources that jobs are submitted to. We have different queues available on the infrastructure so that each lab may have different resources available at any given time.

What queues are available to me?

Not all labs/colleges have access to the same resources. You can see which resources are available to you by running SGE_Avail.

How do I submit jobs?

There are multiple ways to submit jobs. A single job can be run with the SGE_Batch command. More information about queueing systems will be released in a separate post.

How do I check out a compute node for interactive use?

Use SGE_Avail to make sure you have access to a queue with the I (interactive) attribute. Then, check out the node using the qrsh command. You can request multiple processors using the qrsh -pe thread N option, where N is the number of processors you’d like to check out.

What is SGE Array?

SGE_Array allows easy submission of array jobs, which are most commonly used when a user has a command that they want to run on many individual inputs (10s-1000s). Instead of submitting hundreds or thousands of jobs, you can submit a single array job and control how many tasks are running at once.
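The options of SGE_Array itself are not described here, so the sketch below uses plain Grid Engine array jobs (qsub -t) as a stand-in to show the idea: every task runs the same script and uses its SGE_TASK_ID to pick one input. task_input, inputs.txt, job.sh, and some_program are hypothetical names.

```shell
# task_input LIST_FILE: print the line of LIST_FILE that belongs to this array task.
# Grid Engine sets SGE_TASK_ID (1, 2, 3, ...) for each task of an array job.
task_input() {
  sed -n "${SGE_TASK_ID:?SGE_TASK_ID is not set}p" "$1"
}

# Inside a job script submitted as an array job, for example with
#   qsub -t 1-100 -tc 10 job.sh      (100 tasks, at most 10 running at once)
# each task would process one input:
#   input=$(task_input inputs.txt)
#   some_program "$input" > "task_${SGE_TASK_ID}.out"
```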

How do I check the status of jobs?

The qstat command allows you to see which jobs are queued or running. Run qstat -j $JOBID to see more detail about a specific job.

How do I kill jobs?

You kill jobs with the qdel $JOBID command.

What happens if I see an ‘E’ state?

If you see an E state, run qstat -j $JOBID to find the error message. Sometimes the error state can be cleared with the clear_Eqw_job.sh command. Other times, the job should be killed with qdel so that you can change settings before re-submitting it.
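When you have many jobs, a small filter can pick out the ones in an error state. A minimal sketch, assuming the default qstat column layout (two header lines, job ID in column 1, state in column 5); error_jobs is a hypothetical helper name.

```shell
# error_jobs: read default `qstat` output on stdin and print the IDs of jobs
# whose state column (the 5th) contains E, for example Eqw.
error_jobs() {
  awk 'NR > 2 && $5 ~ /E/ { print $1 }'
}

# Example:
#   qstat -u "$USER" | error_jobs
#   qstat -j JOBID        # then read the error reason for one of the IDs
```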

How do I know how long my bioinformatics job will run?

We suggest running a small test dataset through your pipeline(s) to determine an expected amount of processing time/resource utilization. The user is responsible for ensuring that their jobs are only using the amount of resources requested in the queuing system. Please monitor your jobs (you can qrsh to the same node that your job is running on to check the health of the machine using e.g. htop) to ensure everything is going as requested.
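One quick way to build a small test dataset is to take the first reads of a FASTQ file. A minimal sketch, assuming standard 4-line FASTQ records; fastq_head and some_pipeline are hypothetical names, and the first reads of a run are not always representative of the whole file.

```shell
# fastq_head N [FILE]: print the first N reads of an uncompressed FASTQ
# (4 lines per read), reading FILE or stdin.
fastq_head() {
  local n="$1"
  shift
  head -n "$((n * 4))" "$@"
}

# Example: make a 100,000-read test set, then time the pipeline on it.
#   fastq_head 100000 reads.fastq > test.fastq
#   zcat reads.fastq.gz | fastq_head 100000 > test.fastq
#   time some_pipeline test.fastq
```

Run time often does not scale linearly with input size, so treat the result as a rough lower bound when requesting resources.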

How does my lab obtain more processing resources?

Most colleges have access to shared computing resources; if you are a member of a college where you think you should have access to machines and they are not available, please submit a support request. If you or your college does not have resources available, we have machines available to rent for up to 6 months in length. If you have more needs than that, you can email Support and discuss other options, including current costs of purchasing machines.

Another lab has asked me to collaborate with them, but I cannot access their files or compute resources, what do I do?

Email Support and cc the appropriate collaborators to get access to their files.

Support Tickets

How do I follow up to obtain further support?

To request general support, use the support form

Please use the ‘cgrb-support’ option for general questions and ‘cgrb-software’ option for software requests.

How do I accurately describe my issue?

Please provide information regarding:

What machine you are having an issue with (use qstat to see the node)
What software you are trying to run
What your expected output is, and what the observed output is
A copy/paste of any error messages you may have
How to reproduce the issue
What you may have tried to resolve the issue
Any links to the software or software help pages that might help

If you are submitting a software install request, please provide a link to the github page or other source material.
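Some of the details listed above can be gathered with a few commands and pasted into the ticket. A rough sketch; support_report is a hypothetical helper, and JOB_ID is the variable Grid Engine sets inside a running job.

```shell
# support_report PROGRAM: print basic details to paste into a support ticket.
support_report() {
  echo "date:    $(date)"
  echo "host:    $(uname -n)"
  echo "user:    $(id -un)"
  echo "shell:   ${SHELL:-unknown}"
  echo "program: $(command -v "$1" || echo 'not found in PATH')"
  echo "job id:  ${JOB_ID:-not running inside an SGE job}"
}

# Example:
#   support_report blastn
```

You still need to add the exact command you ran and a copy/paste of the error message.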

How do I check on the status of my support ticket?

Please follow up by emailing Support for support requests, and Ed Davis for software requests.

Training

How can I learn more about using the command line?

We offer ‘Intro to Unix/Linux’ and ‘Command-line data analysis’ courses - see the Workshops page for more information. We also offer one-on-one training for an hourly fee; please email the bioinformatics team.

Software

Conda

How do I get conda set up?

Follow the instructions here.

How do I fix my broken login/configs?

The raw configuration files can be found here:

/local/cluster/etc/inits/.bashrc
/local/cluster/etc/inits/.cshrc

You can make a backup of your current file and then copy the raw configuration files into your home directory. You can also remove a ~/.tcshrc file if it’s present.

mv ~/.bashrc ~/.bashrc.bak
mv ~/.cshrc ~/.cshrc.bak
rm -f ~/.tcshrc
cp /local/cluster/etc/inits/.bashrc ~
cp /local/cluster/etc/inits/.cshrc ~

Then log out and log back in. If your configuration seems fixed, you can add some of the modifications from your backups to the newly refreshed config files.

If I already have conda set up, how do I access the system envs?

If you set up conda prior to February 2023, you likely do not have access to the latest conda envs when you run conda env list; they are installed in /local/cluster/conda-envs/envs.

If that’s the case for you, please run these commands to get access to them:

bash
conda config --append envs_dirs /local/cluster/conda-envs/envs
conda config --append pkgs_dirs /local/cluster/conda-envs/pkgs

Where can I learn more about conda env activation?

See this link from the conda documentation

For most conda environments on our infrastructure, I run the scripts in /local/cluster/conda/conda_*_setup.sh to resolve version mismatches.

R isn’t working in my conda env, why?

You likely have $R_LIBS or $R_LIBS_USER set, and R is pulling libraries from your home directory or another location that are incompatible with the R in your conda environment. You can manually unset those env vars, or go to the base env of your conda directory and run bash /local/cluster/conda/conda_R_setup.sh to automatically set up the appropriate env vars on conda activate and conda deactivate.

For a copy/paste option if the conda env is active:

cd $CONDA_PREFIX
conda deactivate
bash /local/cluster/conda/conda_R_setup.sh
conda activate .

See above for more information
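If you only want to unset the variables by hand, a minimal bash sketch looks like this (tcsh uses unsetenv instead). It affects the current session only.

```shell
# Clear R library overrides for the current bash session only.
unset R_LIBS R_LIBS_USER

# Confirm nothing is left over (prints nothing when both are unset).
env | grep '^R_LIBS' || true

# With the conda env active, check where R now looks for libraries:
#   Rscript -e '.libPaths()'
```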

Python is not working or has version mismatches

You may need to unalias python by running unalias python. You can add that command to your ~/.bashrc or ~/.cshrc file as well. You will have to fully type out /local/cluster/bin/python2 or /local/cluster/bin/python3, or add /local/cluster/bin to your $PATH upstream of /usr/bin, e.g. export PATH=/local/cluster/bin:${PATH}, so you don’t have to type it out fully.

Your python is probably pulling from your .local install. You can manually unset those env vars or go to the base env of your conda directory and run bash /local/cluster/conda/conda_python_setup.sh to automatically set up the appropriate env vars on conda activate and conda deactivate.

For a copy/paste option if the conda env is active:

cd $CONDA_PREFIX
conda deactivate
bash /local/cluster/conda/conda_python_setup.sh
conda activate .

See above for more information
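A manual alternative for bash, using two standard Python environment variables. Whether conda_python_setup.sh does exactly this is an assumption; the sketch only shows one way to stop Python from picking up packages installed under ~/.local.

```shell
# Tell Python to skip the per-user site-packages directory under ~/.local.
export PYTHONNOUSERSITE=1

# Drop any extra module search paths inherited from the login environment.
unset PYTHONPATH

# With the conda env active, check which interpreter and search path are in use:
#   command -v python
#   python -c 'import sys; print(sys.prefix); print(sys.path)'
```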

Perl is not working or has version mismatches

Your perl is probably pulling modules from the directories in your PERL5LIB env var. You can manually unset that env var or go to the base env of your conda directory and run bash /local/cluster/conda/conda_perl_setup.sh to automatically set up the appropriate env vars on conda activate and conda deactivate.

For a copy/paste option if the conda env is active:

cd $CONDA_PREFIX
conda deactivate
bash /local/cluster/conda/conda_perl_setup.sh
conda activate .

See above for more information

Software is not working due to a mismatch in linked libraries (lib.so missing)

The compiler on our infrastructure is old (gcc 4.8.5) and does not provide the most up-to-date linked libraries, and the conda LD_LIBRARY_PATH is not getting set properly. You can manually set your LD_LIBRARY_PATH to include $CONDA_PREFIX/lib, or you can go to the base env of your conda directory and run bash /local/cluster/conda/conda_LD_setup.sh to automatically set up the appropriate env vars on conda activate and conda deactivate.

For a copy/paste option if the conda env is active:

cd $CONDA_PREFIX
conda deactivate
bash /local/cluster/conda/conda_LD_setup.sh
conda activate .

See above for more information
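Setting LD_LIBRARY_PATH by hand in bash can be sketched as follows. use_conda_libs is a hypothetical helper, and the change lasts only for the current session.

```shell
# use_conda_libs: prepend the active env's lib directory to LD_LIBRARY_PATH.
use_conda_libs() {
  if [ -z "${CONDA_PREFIX:-}" ]; then
    echo "no conda env is active" >&2
    return 1
  fi
  # ${VAR:+...} adds the ":" separator only when LD_LIBRARY_PATH is already set.
  export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Example: list any shared libraries that still cannot be found.
#   use_conda_libs && ldd "$(command -v some_program)" | grep 'not found'
```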

What software do I use for…

Adapter trimming

For automated adapter trimming, we currently recommend fastp. fastp is a good option for situations where you have adapters on the 3’ end of reads due to read-through of short inserts into the sequencing adapter. For trimming of primer sequences or other custom sequences, we suggest using bbduk.sh or cutadapt.

Short read alignment

bwa mem

Spliced alignment

STAR or hisat2

Long read alignment

minimap2

Genome assembly (Illumina)

SPAdes

Genome assembly (long read)

flye or nextdenovo

Genome annotation (prokaryote)

Bakta

RNA-Seq quantification

salmon

Differential gene expression analysis

deseq2

Pairwise sequence alignment

blast or diamond

Multiple sequence alignment

mafft --auto

Phylogenetic tree construction

IQ-TREE; fasttree can be useful for preliminary analysis

Orthologous group calculation (Prokaryote)

PIRATE for cultured organisms; PPanGGOLiN for MAGs/SAGs

anvio is useful for pangenome analysis as well

Principal component analysis or other ordination/dimensional reduction

Using R: prcomp for principal component analysis (PCA); vegan for nonmetric multidimensional scaling (NMDS) or constrained ordination e.g. redundancy analysis (RDA); ape for principal coordinate analysis (PCoA)

How do I…

Figure out how to run a program

Use -h e.g. $program -h
Use --help e.g. $program --help
Use help e.g. $program help
Use man (this works for system installed things like cat, mkdir, ls) e.g. man $program
Use tldr (works for common programs, awk, sed, tar, wget) e.g. tldr $program
Examine the script with less e.g. which $program; less -S /local/cluster/bin/$program (Note: does not work for compiled software)
Search the program name on google
Search the program name on the updates website https://software.cqls.oregonstate.edu/updates/tags

You can also try helpme for some curated help data.
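Since reading a program with less only helps when it is a script, a quick check of the first two bytes tells you whether it is worth trying. A minimal sketch; is_script is a hypothetical helper name.

```shell
# is_script PROGRAM: succeed if PROGRAM on your PATH is a text script,
# judged by whether the file starts with "#!".
is_script() {
  local path
  path=$(command -v "$1") || return 2
  [ "$(head -c 2 "$path" 2>/dev/null)" = '#!' ]
}

# Example ($program is a placeholder for the command you are investigating):
#   is_script "$program" && less -S "$(command -v "$program")"
```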

Display a formatted markdown file on the command line

Use glow

Download reads from NCBI

Use prefetch and fasterq-dump. See here for more info.

Download genomes from NCBI

You can use the data-hub to get genome data. Use files.cqls.oregonstate.edu for downloads.

You can also use the get_assemblies program (use python3 -m pip install --user get-assemblies to install).

Generate a BLASTDB

Use makeblastdb -in INPUT.fasta -dbtype [nucl|prot] to generate your BLASTDB. Please submit using SGE_Batch -c ....

Do a BLASTN or BLASTP search

Use blastp or blastn. Use the -help flag for options. Do not use blastall as it is old and unsupported now.

Miscellaneous

My terminal output is garbled

Run reset. This should reset the output on your screen and you should be able to continue as normal.

Do my sequences have adapters?

Index sequences are not included in the raw reads. Adapters could be on the 3' end of reads, depending on the library prep type. Please check the FastQC reports sent by Matthew with each sequencing run to determine if your reads have adapters. Use the fastp program to remove them.

What is going on with my 16S sequencing results?

See this page for some information regarding the 16S preps provided at the CQLS.
