# Seqenv 1.3.0

{{< admonition tip "Conda" true >}} See the 'activating the conda environment' section below to access this software. {{< /admonition >}}

## seqenv

* Assigns environment ontology (EnvO) terms to short DNA sequences.
* All code written by [Lucas Sinclair](http://envonautics.com/#lucas).
* Publication at: https://peerj.com/articles/2690/

### Usage

Once that is done, you can start processing FASTA files from the command line. To use the default parameters, just type:

```console
$ seqenv sequences.fasta
```

We will then assume that you have input 16S sequences. To change the database or input a different type of sequences:

```console
$ seqenv sequences.fasta --seq_type prot --search_db nr
```

To change the minimum identity in the similarity search, use the following:

```console
$ seqenv sequences.fasta --min_identity 0.97
```

If you have abundance data you would like to add to your analysis, you can provide it as a TSV file:

```console
$ seqenv sequences.fasta --abundances counts.tsv
```

### All parameters

Several other options are available. Here is a list describing them all:

* `--seq_type`: Sequence type, `nucl` or `prot`, for nucleotides or amino acids respectively (Default: `nucl`).
* `--search_algo`: Search algorithm. Either `blast` or `vsearch` (Default: `blast`).
* `--search_db`: The database to search against (Default: `nt`). You can specify the full path or make a `~/.ncbirc` file.
* `--normalization`: Can be either `flat`, `ui` or `upui` (Default: `flat`).
    * If you choose `flat`, every isolation source is counted independently, even if the same text appears several times for the same input sequence.
    * If you choose `ui`, standing for unique isolation, every identical isolation source is counted only once within the same input sequence.
    * If you choose `upui`, standing for unique isolation and unique PubMed ID, the counts are uniquified based on the text entry of the isolation sources as well as on the PubMed identifiers from which the GI was obtained.
* `--proportional`: Should we divide the counts of every input sequence by the number of EnvO terms that were associated with it? Defaults to `True`.
* `--backtracking`: For every term identified by the tagger, propagate the frequency counts up the directed acyclic graph described by the ontology. Defaults to `False`.
* `--restrict`: Restrict the output to the descendants of a single EnvO term, removing all other terms that are not reachable through the given node. For instance, you could specify `ENVO:00010483` (Disabled by default).
* `--num_threads`: Number of cores to use (Defaults to the total number of cores). Use `1` for non-parallel processing.
* `--out_dir`: The output directory in which to store the result and intermediary files. Defaults to the same directory as the input file.
* `--min_identity`: Minimum identity in the similarity search (Default: `0.97`). Note: not available when using `blastp`.
* `--e_value`: Minimum e-value in the similarity search (Default: `0.0001`).
* `--max_targets`: Maximum number of reference matches in the similarity search (Default: `10`).
* `--min_coverage`: Minimum query coverage in the similarity search (Default: `0.97`).
* `--abundances`: Abundances file as TSV with OTUs as rows and sample names as columns (Default: None).
* `--N`: If abundances are given, pick only the top N sequences (Disabled by default).

### Why make this?

The continuous drop in associated costs, combined with the increased efficiency of the latest high-throughput sequencing technologies, has resulted in an unprecedented growth in sequencing projects. Ongoing endeavors such as the [Earth Microbiome Project](http://www.earthmicrobiome.org) and the [Ocean Sampling Day](http://www.microb3.eu/osd) are transcending national boundaries and attempting to characterize global microbial taxonomic and functional diversity for the benefit of mankind.
The collection of sequencing information generated by such efforts is vital to shed light on the ecological features and processes characterizing different ecosystems. Yet the full knowledge-discovery potential can only be unleashed if the associated metadata is also exploited to extract hidden patterns. For example, the majority of genomes submitted to NCBI have an associated PubMed publication, and in some cases there is a GenBank field called "isolation source" that contains rich environmental information. With the advances in community-generated standards and the adherence to recommended annotation guidelines such as those of [MIxS](http://gensc.org/gc_wiki/index.php/MIxS) from the Genomics Standards Consortium, it is now feasible to support intelligent queries and automated inference on such text resources.

The [Environmental Ontology](http://environmentontology.org/) (or EnvO) is a critical part of this approach, as it provides an ontology for the concise, controlled description of environments. It thus supplies a structured and controlled vocabulary for unified metadata annotation, and also serves as a source for naming environmental information. We have therefore developed the `seqenv` pipeline, capable of annotating sequences with environment-descriptive terms occurring within their records and/or in relevant literature.

The `seqenv` pipeline can be applied to any set of nucleotide or protein sequences. Annotation of metagenomic samples, in particular 16S rRNA sequences, is also supported. The pipeline has already been applied to a range of datasets (e.g. Greek lagoon, Swedish lake/river, African and Asian pit latrine, and Black Sea sediment datasets).

### What does it do exactly?

Given a set of DNA sequences, `seqenv` first retrieves highly similar sequences from public repositories (e.g. NCBI GenBank) using BLAST or a similar algorithm.
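Each hit then resolves to a GenBank record whose metadata carries the environmental text. As a rough, stdlib-only illustration of the idea (this is not seqenv's actual code, and the record below is a hypothetical, heavily truncated example), the `isolation_source` qualifier can be pulled out of a GenBank flat-file record with a regular expression:

```python
import re

# A truncated, made-up GenBank flat-file record; only the feature
# qualifier we care about is shown (real records are much longer).
record = '''
FEATURES             Location/Qualifiers
     source          1..1542
                     /organism="uncultured bacterium"
                     /isolation_source="lake sediment"
                     /db_xref="taxon:77133"
'''

def isolation_source(genbank_text):
    """Return the /isolation_source qualifier text, or None if absent."""
    match = re.search(r'/isolation_source="([^"]+)"', genbank_text)
    return match.group(1) if match else None

print(isolation_source(record))  # lake sediment
```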
Subsequently, from each of these homologous records, text fields carrying environmental context information, such as the reference title and the **isolation source** field found in the metadata, are extracted. Once the relevant pieces of text from each matching sequence have been gathered, they are processed by a text-mining module capable of identifying any EnvO terms they contain (e.g. the words "glacier", "pelagic", "forest", etc.). The identified EnvO terms, along with their frequencies of occurrence, are then subjected to multivariate statistics, producing matrices relating your samples to their putative sources as well as other useful outputs.

### Pipeline overview

The publication contains more information, of course, but here is a schematic overview of what happens inside `seqenv`:

{{< image src="../images/seqenv/frequencies.png" caption="Seqenv diagram" >}}

### Tutorial

We will first run `seqenv` on a 16S rRNA dataset using ***isolation sources*** as a text source. Here, `abundance.tsv` is a species abundance file (97% OTUs) processed with the [`illumitag`](https://github.com/limno/illumitag) software, and `centers.fasta` contains the corresponding sequences for the OTUs.

```console
$ ls
abundance.tsv  centers.fasta
$ seqenv centers.fasta --abundances abundance.tsv --seq_type nucl --out_dir output --N 1000 --min_identity 0.99
```

The output you receive should look something like this:

```console
seqenv version 1.2.0 (pid 52169)
Start at: 2016-03-02 00:22:09.727377
--> STEP 1: Parse the input FASTA file.
Elapsed time: 0:00:00.005811
Using: output/renamed.fasta
--> STEP 2: Similarity search against the 'nt' database with 5 processes
Elapsed time: 0:02:11.215829
--> STEP 3: Filter out bad hits from the search results
Elapsed time: 0:00:00.002071
--> STEP 4: Parsing the search results
Elapsed time: 0:00:00.002099
--> STEP 5: Setting up the SQLite3 database connection.
Elapsed time: 0:00:00.054077
Got 81 GI hits and 65 of them had one or more EnvO terms associated.
--> STEP 6: Computing EnvO term frequencies.
Elapsed time: 0:00:00.721455
------------
Success. Outputs are in 'output/'
End at: 2016-03-02 00:24:22.504485
Total elapsed time: 0:02:12.777297
```

Once the pipeline has finished processing, you will have the following contents in the output folder:

```console
$ ls output/
list_concepts_found.tsv  samples_to_names.tsv  seq_to_names.tsv   top_seqs.fasta.parts
renamed.fasta            seq_to_concepts.tsv   top_seqs.blastout
samples.biom             seq_to_gis.pickle     top_seqs.fasta
```

The most interesting files are probably:

* `list_concepts_found.tsv`: links every OTU to all its relevant BLAST hits and the linked EnvO terms.
* `seq_to_names.tsv`: a matrix linking every OTU to its "composition" in terms of EnvO identifiers translated to readable names.
* `samples_to_names.tsv`: if an abundance file was provided, a matrix linking every one of your samples to its "composition" in terms of EnvO identifiers translated to readable names.
* `graphviz/`: a directory containing ontology graphs for every one of the input sequences, such as in the following example:

![seqenv ontology graph](../images/seqenv/ontology_graph.png)

-------------------------------------------------------------------------------

## Activating the conda environment

Check out a node with `qrsh` and run:

```console
bash
source /local/cluster/seqenv/activate.sh
```

To use over SGE, add the source line above to your shell scripts prior to your seqenv commands.

## Usage notes

Please always set `--num_threads` to the desired number of threads; otherwise all available threads will be used.

The `seqenv` command looks for a specific version of `nt`. This version is already downloaded in `/nfs1/CGRB/databases/seqenv`, so you do not need to download it. The `$BLASTDB` variable is updated upon conda environment activation, so you can reference the database as `nt` (or just leave the db as default).
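Downstream of a run, the TSV matrices above are easy to post-process. The exact header layout of `seq_to_names.tsv` is not reproduced here, so the sketch below just assumes a plain tab-separated matrix with a header row of readable EnvO term names and one row per OTU; the toy data is invented for illustration.

```python
import csv
import io

# Toy stand-in for seq_to_names.tsv: one row per OTU, one column per
# readable EnvO term name.  The real file's header layout may differ.
toy_tsv = (
    "OTU\tglacier\tforest\tpelagic\n"
    "OTU_1\t0.8\t0.2\t0.0\n"
    "OTU_2\t0.0\t0.1\t0.9\n"
)

def top_term(tsv_text):
    """Map each OTU to the EnvO term name with the highest weight."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    result = {}
    for row in reader:
        otu, weights = row[0], [float(x) for x in row[1:]]
        result[otu] = header[1 + weights.index(max(weights))]
    return result

print(top_term(toy_tsv))  # {'OTU_1': 'glacier', 'OTU_2': 'pelagic'}
```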
## Location and version

```console
$ which seqenv
/local/cluster/seqenv-1.3.0/bin/seqenv
```

## Help message

```console
$ seqenv --help
usage: seqenv [-h] [--min_identity MIN_IDENTITY] [--min_coverage MIN_COVERAGE]
              [--out_dir OUT_DIR] [--abundances ABUNDANCES] [--N N]
              [--restrict RESTRICT] [--proportional PROPORTIONAL]
              [--search_db SEARCH_DB] [--backtracking BACKTRACKING]
              [--seq_type SEQ_TYPE] [--num_threads NUM_THREADS]
              [--search_algo SEARCH_ALGO] [--normalization NORMALIZATION]
              [--max_targets MAX_TARGETS] [--e_value E_VALUE]
              input_file

seqenv version 1.3.0

positional arguments:
  input_file            The fasta file to process

optional arguments:
  -h, --help            show this help message and exit
  --min_identity MIN_IDENTITY
                        Minimum identity in similarity search. Defaults to 0.97.
  --min_coverage MIN_COVERAGE
                        Minimum query coverage in similarity search. Defaults
                        to 0.97.
  --out_dir OUT_DIR     Place all the outputs in the specified directory.
                        Defaults to the input file's directory.
  --abundances ABUNDANCES
                        If you have sample information, give is as a TSV file
                        with OTUs as rows and sample names as columns.
  --N N                 Use only the top `N` sequences in terms of their
                        abundance. Disabled by default.
  --restrict RESTRICT   Restrict the output to the descendants of just one
                        ENVO term.
  --proportional PROPORTIONAL
                        Should we divide the counts of every input sequence
                        by the number of envo terms that were associated to
                        it. Defaults to `True`.
  --search_db SEARCH_DB
                        The path to the database to search against. Defaults
                        to `nt`.
  --backtracking BACKTRACKING
                        For every term identified by the tagger, we will
                        propagate the frequency counts up the acyclic
                        directed graph described by the ontology. Defaults to
                        `False`.
  --seq_type SEQ_TYPE   Either `nucl` or `prot`. Defaults to `nucl`.
  --num_threads NUM_THREADS
                        Number of threads to use. Default to the number of
                        cores on the current machine.
  --search_algo SEARCH_ALGO
                        Either 'blast' or 'usearch'. Defaults to `blast`.
  --normalization NORMALIZATION
                        What normalization strategy should we use for the
                        frequency counts. Refer to the README.
  --max_targets MAX_TARGETS
                        Maximum number of reference matches in similarity
                        search. Defaults to 10.
  --e_value E_VALUE     Minimum e-value in similarity search. Defaults to
                        0.0001.
```
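When driving `seqenv` from a batch script (e.g. a shell script submitted over SGE), it can help to assemble the command line programmatically so that `--num_threads` is always set explicitly, per the usage note above. A small sketch; the wrapper below is not part of seqenv and the file names are illustrative:

```python
def seqenv_command(fasta, out_dir, threads, abundances=None, min_identity=0.97):
    """Build a seqenv argument list with --num_threads always set explicitly."""
    cmd = ["seqenv", fasta,
           "--out_dir", out_dir,
           "--num_threads", str(threads),
           "--min_identity", str(min_identity)]
    if abundances is not None:
        cmd += ["--abundances", abundances]
    return cmd

# The resulting list can be handed to subprocess.run(cmd) in a job script.
print(" ".join(seqenv_command("centers.fasta", "output", 4, "abundance.tsv")))
```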