# Seqenv 1.3.0

{{< admonition tip "Conda" true >}} See the 'activating the conda environment' section below to access this software. {{< /admonition >}}

## seqenv

* Assigns environment ontology (EnvO) terms to short DNA sequences.
* All code written by [Lucas Sinclair](http://envonautics.com/#lucas).
* Publication at: https://peerj.com/articles/2690/

### Usage

Once that is done, you can start processing FASTA files from the command line. To use the default parameters, just type:

```console
$ seqenv sequences.fasta
```

We will then assume that you have input 16S sequences. To change the database or input a different type of sequences:

```console
$ seqenv sequences.fasta --seq_type prot --search_db nr
```

To change the minimum identity in the similarity search, use the following:

```console
$ seqenv sequences.fasta --min_identity 0.97
```

If you have abundance data you would like to add to your analysis, you can provide it as a TSV file:

```console
$ seqenv sequences.fasta --abundances counts.tsv
```

### All parameters

Several other options are available. Here is a list describing them all:

* `--seq_type`: Sequence type, `nucl` or `prot`, for nucleotides or amino acids respectively (Default: `nucl`).
* `--search_algo`: Search algorithm. Either `blast` or `vsearch` (Default: `blast`).
* `--search_db`: The database to search against (Default: `nt`). You can specify the full path or make a `~/.ncbirc` file.
* `--normalization`: Can be either `flat`, `ui` or `upui` (Default: `flat`).
    * If you choose `flat`, every isolation source is counted independently, even if the same text appears several times for the same input sequence.
    * If you choose `ui`, standing for unique isolation, every identical isolation source is counted only once within the same input sequence.
    * If you choose `upui`, standing for unique isolation and unique PubMed ID, the counts are uniquified based on the text entry of the isolation sources as well as on the PubMed identifiers from which the GI was obtained.
* `--proportional`: Should we divide the counts of every input sequence by the number of EnvO terms that were associated with it? Defaults to `True`.
* `--backtracking`: For every term identified by the tagger, propagate the frequency counts up the directed acyclic graph described by the ontology. Defaults to `False`.
* `--restrict`: Restrict the output to the descendants of a single EnvO term, removing all other terms that are not reachable through the given node. For instance, you could specify `ENVO:00010483` (Disabled by default).
* `--num_threads`: Number of cores to use (Defaults to the total number of cores). Use `1` for non-parallel processing.
* `--out_dir`: The output directory in which to store the result and intermediary files. Defaults to the same directory as the input file.
* `--min_identity`: Minimum identity in the similarity search (Default: `0.97`). Note: not available when using `blastp`.
* `--e_value`: Minimum e-value in the similarity search (Default: `0.0001`).
* `--max_targets`: Maximum number of reference matches in the similarity search (Default: `10`).
* `--min_coverage`: Minimum query coverage in the similarity search (Default: `0.97`).
* `--abundances`: Abundances file as TSV with OTUs as rows and sample names as columns (Default: None).
* `--N`: If abundances are given, pick only the top N sequences (Disabled by default).

### Why make this?

The continuous drop in associated costs, combined with the increased efficiency of the latest high-throughput sequencing technologies, has resulted in an unprecedented growth in sequencing projects. Ongoing endeavors such as the [Earth Microbiome Project](http://www.earthmicrobiome.org) and the [Ocean Sampling Day](http://www.microb3.eu/osd) are transcending national boundaries and attempting to characterize global microbial taxonomic and functional diversity for the benefit of mankind.
The collection of sequencing information generated by such efforts is vital to shed light on the ecological features and processes characterizing different ecosystems. Yet the full knowledge-discovery potential can only be unleashed if the associated metadata is also exploited to extract hidden patterns. For example, the majority of genomes submitted to NCBI have an associated PubMed publication, and in some cases there is a GenBank field called "isolation source" that contains rich environmental information. With the advances in community-generated standards and the adherence to recommended annotation guidelines such as those of [MIxS](http://gensc.org/gc_wiki/index.php/MIxS) from the Genomics Standards Consortium, it is now feasible to support intelligent queries and automated inference on such text resources.

The [Environmental Ontology](http://environmentontology.org/) (or EnvO) is a critical part of this approach, as it provides an ontology for the concise, controlled description of environments. It thus supplies a structured and controlled vocabulary for unified metadata annotation, and also serves as a source for naming environmental information. We have therefore developed the `seqenv` pipeline, capable of annotating sequences with environment-descriptive terms occurring within their records and/or in relevant literature.

The `seqenv` pipeline can be applied to any set of nucleotide or protein sequences. Annotation of metagenomic samples, in particular 16S rRNA sequences, is also supported. The pipeline has already been applied to a range of datasets (e.g. Greek lagoon, Swedish lake/river, African and Asian pit latrine, and Black Sea sediment datasets).

### What does it do exactly?

Given a set of DNA sequences, `seqenv` first retrieves highly similar sequences from public repositories (e.g. NCBI GenBank) using BLAST or a similar algorithm.
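Each hit then resolves to a GenBank record whose metadata carries the environmental text. As a rough, stdlib-only illustration of the idea (this is not seqenv's actual code, and the record below is a hypothetical, heavily truncated example), the `isolation_source` qualifier can be pulled out of a GenBank flat-file record with a regular expression:

```python
import re

# A truncated, made-up GenBank flat-file record; only the feature
# qualifier we care about is shown (real records are much longer).
record = '''
FEATURES             Location/Qualifiers
     source          1..1542
                     /organism="uncultured bacterium"
                     /isolation_source="lake sediment"
                     /db_xref="taxon:77133"
'''

def isolation_source(genbank_text):
    """Return the /isolation_source qualifier text, or None if absent."""
    match = re.search(r'/isolation_source="([^"]+)"', genbank_text)
    return match.group(1) if match else None

print(isolation_source(record))  # lake sediment
```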
Subsequently, from each of these homologous records, text fields carrying environmental context information, such as the reference title and the **isolation source** field found in the metadata, are extracted. Once the relevant pieces of text from each matching sequence have been gathered, they are processed by a text-mining module capable of identifying any EnvO terms they contain (e.g. the words "glacier", "pelagic", "forest", etc.). The identified EnvO terms, along with their frequencies of occurrence, are then subjected to multivariate statistics, producing matrices relating your samples to their putative sources as well as other useful outputs.

### Pipeline overview

The publication contains more information, of course, but here is a schematic overview of what happens inside `seqenv`:

{{< image src="../images/seqenv/frequencies.png" caption="Seqenv diagram" >}}

### Tutorial

We will first run `seqenv` on a 16S rRNA dataset using ***isolation sources*** as a text source. Here, `abundance.tsv` is a species abundance file (97% OTUs) processed with the [`illumitag`](https://github.com/limno/illumitag) software, and `centers.fasta` contains the corresponding sequences for the OTUs.

```console
$ ls
abundance.tsv  centers.fasta
$ seqenv centers.fasta --abundances abundance.tsv --seq_type nucl --out_dir output --N 1000 --min_identity 0.99
```

The output you receive should look something like this:

```console
seqenv version 1.2.0 (pid 52169)
Start at: 2016-03-02 00:22:09.727377
--> STEP 1: Parse the input FASTA file.
Elapsed time: 0:00:00.005811
Using: output/renamed.fasta
--> STEP 2: Similarity search against the 'nt' database with 5 processes
Elapsed time: 0:02:11.215829
--> STEP 3: Filter out bad hits from the search results
Elapsed time: 0:00:00.002071
--> STEP 4: Parsing the search results
Elapsed time: 0:00:00.002099
--> STEP 5: Setting up the SQLite3 database connection.
Elapsed time: 0:00:00.054077
Got 81 GI hits and 65 of them had one or more EnvO terms associated.
--> STEP 6: Computing EnvO term frequencies.
Elapsed time: 0:00:00.721455
------------
Success. Outputs are in 'output/'
End at: 2016-03-02 00:24:22.504485
Total elapsed time: 0:02:12.777297
```

Once the pipeline has finished processing, you will have the following contents in the output folder:

```console
$ ls output/
list_concepts_found.tsv  samples_to_names.tsv  seq_to_names.tsv   top_seqs.fasta.parts
renamed.fasta            seq_to_concepts.tsv   top_seqs.blastout
samples.biom             seq_to_gis.pickle     top_seqs.fasta
```

The most interesting files are probably:

* `list_concepts_found.tsv`: links every OTU to all its relevant BLAST hits and the linked EnvO terms.
* `seq_to_names.tsv`: a matrix linking every OTU to its "composition" in terms of EnvO identifiers translated to readable names.
* `samples_to_names.tsv`: if an abundance file was provided, a matrix linking every one of your samples to its "composition" in terms of EnvO identifiers translated to readable names.
* `graphviz/`: a directory containing ontology graphs for every one of the input sequences, such as in the following example:

![seqenv ontology graph](../images/seqenv/ontology_graph.png)

-------------------------------------------------------------------------------

## Activating the conda environment

Check out a node with `qrsh` and run:

```console
bash
source /local/cluster/seqenv/activate.sh
```

To use over SGE, add the source line above to your shell scripts prior to your seqenv commands.

## Usage notes

Please always set `--num_threads` to the desired number of threads; otherwise all available threads will be used.

The `seqenv` command looks for a specific version of `nt`. This version is already downloaded in `/nfs1/CGRB/databases/seqenv`, so you do not need to download it. The `$BLASTDB` variable is updated upon conda environment activation, so you can reference the database as `nt` (or just leave the db as default).
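Downstream of a run, the TSV matrices above are easy to post-process. The exact header layout of `seq_to_names.tsv` is not reproduced here, so the sketch below just assumes a plain tab-separated matrix with a header row of readable EnvO term names and one row per OTU; the toy data is invented for illustration.

```python
import csv
import io

# Toy stand-in for seq_to_names.tsv: one row per OTU, one column per
# readable EnvO term name.  The real file's header layout may differ.
toy_tsv = (
    "OTU\tglacier\tforest\tpelagic\n"
    "OTU_1\t0.8\t0.2\t0.0\n"
    "OTU_2\t0.0\t0.1\t0.9\n"
)

def top_term(tsv_text):
    """Map each OTU to the EnvO term name with the highest weight."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    result = {}
    for row in reader:
        otu, weights = row[0], [float(x) for x in row[1:]]
        result[otu] = header[1 + weights.index(max(weights))]
    return result

print(top_term(toy_tsv))  # {'OTU_1': 'glacier', 'OTU_2': 'pelagic'}
```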
## Location and version

```console
$ which seqenv
/local/cluster/seqenv-1.3.0/bin/seqenv
```

## Help message

```console
$ seqenv --help
usage: seqenv [-h] [--min_identity MIN_IDENTITY] [--min_coverage MIN_COVERAGE]
              [--out_dir OUT_DIR] [--abundances ABUNDANCES] [--N N]
              [--restrict RESTRICT] [--proportional PROPORTIONAL]
              [--search_db SEARCH_DB] [--backtracking BACKTRACKING]
              [--seq_type SEQ_TYPE] [--num_threads NUM_THREADS]
              [--search_algo SEARCH_ALGO] [--normalization NORMALIZATION]
              [--max_targets MAX_TARGETS] [--e_value E_VALUE]
              input_file

seqenv version 1.3.0

positional arguments:
  input_file            The fasta file to process

optional arguments:
  -h, --help            show this help message and exit
  --min_identity MIN_IDENTITY
                        Minimum identity in similarity search. Defaults to 0.97.
  --min_coverage MIN_COVERAGE
                        Minimum query coverage in similarity search. Defaults
                        to 0.97.
  --out_dir OUT_DIR     Place all the outputs in the specified directory.
                        Defaults to the input file's directory.
  --abundances ABUNDANCES
                        If you have sample information, give is as a TSV file
                        with OTUs as rows and sample names as columns.
  --N N                 Use only the top `N` sequences in terms of their
                        abundance. Disabled by default.
  --restrict RESTRICT   Restrict the output to the descendants of just one
                        ENVO term.
  --proportional PROPORTIONAL
                        Should we divide the counts of every input sequence
                        by the number of envo terms that were associated to
                        it. Defaults to `True`.
  --search_db SEARCH_DB
                        The path to the database to search against. Defaults
                        to `nt`.
  --backtracking BACKTRACKING
                        For every term identified by the tagger, we will
                        propagate the frequency counts up the acyclic
                        directed graph described by the ontology. Defaults to
                        `False`.
  --seq_type SEQ_TYPE   Either `nucl` or `prot`. Defaults to `nucl`.
  --num_threads NUM_THREADS
                        Number of threads to use. Default to the number of
                        cores on the current machine.
  --search_algo SEARCH_ALGO
                        Either 'blast' or 'usearch'. Defaults to `blast`.
  --normalization NORMALIZATION
                        What normalization strategy should we use for the
                        frequency counts. Refer to the README.
  --max_targets MAX_TARGETS
                        Maximum number of reference matches in similarity
                        search. Defaults to 10.
  --e_value E_VALUE     Minimum e-value in similarity search. Defaults to
                        0.0001.
```
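When driving `seqenv` from a batch script (e.g. a shell script submitted over SGE), it can help to assemble the command line programmatically so that `--num_threads` is always set explicitly, per the usage note above. A small sketch; the wrapper below is not part of seqenv and the file names are illustrative:

```python
def seqenv_command(fasta, out_dir, threads, abundances=None, min_identity=0.97):
    """Build a seqenv argument list with --num_threads always set explicitly."""
    cmd = ["seqenv", fasta,
           "--out_dir", out_dir,
           "--num_threads", str(threads),
           "--min_identity", str(min_identity)]
    if abundances is not None:
        cmd += ["--abundances", abundances]
    return cmd

# The resulting list can be handed to subprocess.run(cmd) in a job script.
print(" ".join(seqenv_command("centers.fasta", "output", 4, "abundance.tsv")))
```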