# Using the SRA Toolkit


# Getting started with the SRA Toolkit

Let's say you want to download some sequences from the Sequence Read Archive
(SRA) at NCBI. How do you do it? First, we need to make sure the SRA Toolkit
binaries are in your path.

```
which fastq-dump
```
This should print the full path to your `fastq-dump` binary. If it prints
nothing, then we'll need to add the directory where the binary is stored to
your path. Copy the command below that matches your shell (the first is for
tcsh, the second for bash) and run it to add the SRA Toolkit bin directory to
your path.

```
echo 'setenv PATH ${PATH}:/local/cluster/sratoolkit/bin' >> ~/.cshrc
echo 'export PATH=${PATH}:/local/cluster/sratoolkit/bin' >> ~/.bashrc
```

Now you can run `exec tcsh` or `exec bash` (or `exit` and log in again) to pick
up your updated `$PATH`. After that, typing `which fastq-dump` should print the
full path to the `fastq-dump` binary.
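Because the `echo ... >>` commands above append a new line every time they are run, it can help to check whether the directory is already in your path before adding it again. Here is a minimal sketch for bash, assuming the same `/local/cluster/sratoolkit/bin` location; the `SRA_BIN` variable and the `path_contains` helper are names made up for this illustration.

```shell
# Directory from the commands above; adjust if your install lives elsewhere.
SRA_BIN=/local/cluster/sratoolkit/bin

# path_contains DIR: succeed if DIR is already an entry in $PATH.
# (Hypothetical helper, not part of the SRA Toolkit.)
path_contains() {
    case ":${PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if path_contains "$SRA_BIN"; then
    echo "$SRA_BIN is already in your PATH"
else
    echo "$SRA_BIN is not in your PATH yet"
fi
```

Wrapping `$PATH` in colons lets the pattern match a whole entry rather than a substring of a longer directory name.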

The next step is to set the path where the temporary (large) sequence read
archive files will be stored. By default, these go into your home directory,
and they can quickly use up all of your home directory space. Therefore, we
want to specify a location to store them that is on a `/nfs` drive.

Generally, this location will be something like:

```
/nfs[0123]/<YOUR DEPT>/<YOUR LAB>/<USERNAME>/ncbi/tmp
```

Identify the full path to this location, copy it, and then run the following
command:

```
vdb-config -i --interactive-mode textual
```

Type 4, and then paste the path that you previously copied. Type Y and
`<ENTER>`.

You can confirm that these changes were made by running this command:

```
cat $HOME/.ncbi/user-settings.mkfg
```

You should see that `/repository/user/main/public/root` has the value that you
provided to `vdb-config`.
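To avoid scanning the whole file by eye, you could filter for just that key. The sketch below assumes each setting sits on its own line beginning with its key; `show_sra_root` is a hypothetical helper name, not an SRA Toolkit command.

```shell
# show_sra_root FILE: print the line that sets the public repository root.
# (Hypothetical helper; the key name is the one mentioned above.)
show_sra_root() {
    grep '^/repository/user/main/public/root' "$1"
}

cfg="$HOME/.ncbi/user-settings.mkfg"
if [ -f "$cfg" ]; then
    show_sra_root "$cfg" || echo "repository root is not set in $cfg"
else
    echo "$cfg does not exist yet; run vdb-config first"
fi
```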

Next, we'll need to download the SRA accession(s) that you are interested in.

Navigate to the [SRA website](https://www.ncbi.nlm.nih.gov/sra) and copy the
run accession(s) that you want. These have the SRR prefix. Generally I like to
put these into a text file, one accession per line. I name the file `accs.txt`,
but the name doesn't matter as long as you remember later what the file is
for.
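Accessions copied from a web page can pick up blank lines, trailing spaces, or duplicates. A small filter like the one below could tidy the list before downloading. It is a sketch that assumes one accession per line and keeps only lines that look like `SRR` followed by digits; `clean_accessions` is a hypothetical helper name, and note that `sort -u` reorders the list.

```shell
# clean_accessions FILE: print the unique SRR run accessions found in FILE,
# one per line, dropping blank lines, stray whitespace, and anything else.
# (Hypothetical helper for this example.)
clean_accessions() {
    tr -d ' \t\r' < "$1" | grep -E '^SRR[0-9]+$' | sort -u
}

# Example:
#   clean_accessions accs.txt > accs.clean.txt
```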

For the prefetch stage, you'll want to sign into `files.cgrb.oregonstate.edu`
instead of `shell.cgrb.oregonstate.edu`. This is to help reduce web traffic
congestion on the shell. Once you are on `files`, you should navigate to the
folder that contains the `accs.txt` file, and type this command:

```
cat accs.txt | xargs prefetch
```

Alternatively, to potentially speed up the download, you can run up to 4
downloads at a time using `parallel` as below:

```
export TMPDIR=/tmp
cat accs.txt | parallel -j 4 prefetch
```

This will feed the accessions to the prefetch command, which will result in the
raw sequences being downloaded from NCBI's servers. You should see some
progress messages as the files download.
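With many accessions, it can be hard to tell from the progress messages which downloads failed. One option is a plain loop that records the accessions whose command exited with an error. This is a sketch rather than a replacement for the commands above: `run_each` is a hypothetical helper, and it assumes the command it runs (for example `prefetch`) returns a non-zero exit status on failure.

```shell
# run_each CMD FILE: run CMD once per accession listed in FILE and print,
# on standard output, every accession for which CMD exited non-zero.
# CMD's own output is sent to standard error so the two do not mix.
# (Hypothetical helper for this example.)
run_each() {
    cmd=$1
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue                  # skip blank lines
        "$cmd" "$acc" < /dev/null >&2 || echo "$acc"
    done < "$2"
}

# Example:
#   run_each prefetch accs.txt > failed.txt
```

If `failed.txt` is not empty afterwards, it can be fed back in as the accession list for another attempt.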

Next, head back over to `shell.cgrb.oregonstate.edu` so we can extract the
FASTQ files from the raw prefetched data. At this point it's important to note
what type of reads you are expecting. You'll have to ensure that you get
paired-end reads from the SRA accessions where they are expected. The newer
program called `fasterq-dump` appears to be aware of paired-end datasets, and
splits them accordingly even if the option is not specified. Therefore, you
should be able to run:

```
SGE_Batch -c "cat accs.txt | xargs -n 1 fasterq-dump -t /data/<USERNAME>/ncbi" -q <QUEUE NAME> -r sge.fasterq-dump -P 6
```

This extracts your reads into .fastq files. By default, `fasterq-dump` uses
6 threads, but you can specify a different number using the `-e` flag. I would
**NOT** recommend submitting this type of operation as an array job because
you will hammer the filesystem and potentially bring everyone's jobs to a
crawl. You need to specify the temporary directory with `-t` so that the
program stores the intermediate files on the node's local hard drive and
copies only the final .fastq files to the networked file system.

Using the `-n 1` flag of `xargs`, your accessions will be extracted serially,
which is what we want in this case, in order to reduce the load on the file
servers.

Now, you should be able to `ls` and see your brand new .fastq files, named
`SRRXXXXX.fastq` for single-end data and `SRRXXXXX_[12].fastq` for paired-end
data. If you aren't getting the expected outputs, you can also explicitly
specify the `-S` flag to split the paired-end files.
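For a long list of accessions, checking the output by eye gets tedious. The sketch below reports, per accession, whether paired, single, or no .fastq files are present. It assumes the file naming described above and that the files are in the current directory unless another is given; `fastq_layout` is a hypothetical helper name.

```shell
# fastq_layout ACC [DIR]: report which output files exist for ACC in DIR
# (DIR defaults to the current directory). Prints "ACC paired",
# "ACC single", or "ACC missing". (Hypothetical helper for this example.)
fastq_layout() {
    acc=$1
    dir=${2:-.}
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        echo "$acc paired"
    elif [ -f "$dir/$acc.fastq" ]; then
        echo "$acc single"
    else
        echo "$acc missing"
    fi
}

# Example:
#   while read -r acc; do fastq_layout "$acc"; done < accs.txt
```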

Happy downloading!