Submitting data to the NCBI Sequence Read Archive (SRA)

Pre-processing steps

  • Collect fastq.gz files for each sample
  • Rename fastq.gz files from long name to short name per sample
  • Get biosample data early (especially if it’s from collaborators)
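
The renaming step can be scripted. Below is a minimal sketch, assuming the long names follow the common Illumina pattern <sample>_S<n>_L<lane>_R<read>_001.fastq.gz. The pattern and the function names short_name and rename_reads are illustrative assumptions, so adjust the regular expression to whatever your sequencing core actually delivers.

```python
import re
from pathlib import Path

# Assumed long-name pattern, e.g. "MouseA_S12_L001_R1_001.fastq.gz".
LONG_NAME = re.compile(r"^(?P<sample>.+)_S\d+_L\d{3}_R(?P<read>[12])_001\.fastq\.gz$")


def short_name(long_name: str) -> str:
    """Return "<sample>_R<read>.fastq.gz", or the input unchanged if it does not match."""
    match = LONG_NAME.match(long_name)
    if match is None:
        return long_name
    return f"{match['sample']}_R{match['read']}.fastq.gz"


def rename_reads(directory: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename every matching fastq.gz file in `directory`; return (old, new) pairs."""
    renamed = []
    for path in sorted(Path(directory).glob("*.fastq.gz")):
        new = short_name(path.name)
        if new == path.name:
            continue  # already short, or not in the assumed pattern
        if not dry_run:
            path.rename(path.with_name(new))
        renamed.append((path.name, new))
    return renamed
```

With dry_run left at its default of True, rename_reads only reports what it would do, which makes it easy to check the mapping before any file is touched.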

Basic Checklist

  • Project Title
  • Public Description for Project
  • Grant Funding

Start a New Submission

The example I use here is a host-associated 16S sequencing study.

Go to the link above and start a submission. Fill out the project title, description, grant funding, and departmental information as requested.

If you are following along with host-associated 16S data, choose "Packages for metagenome submitters", then choose "MIMS Environmental/Metagenome" from the GSC MIxS section on the right. Otherwise, fill out the form with your organism and follow along.

Choose "Upload a file using Excel or text format (tab-delimited) that includes the attributes for each of your BioSamples", then download the template.

Minimum Biosample Checklist

Note: columns marked with "delete" should be removed from the template entirely.

Columns marked with * are required

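env_broad_scale and env_local_scale take a value for samples from environmental sources. For host-associated samples, "not applicable" is accepted, although a host-associated ENVO term can be given instead, as for env_local_scale below.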

  • *sample_name
  • sample_title - delete
  • bioproject_accession - delete
  • *organism - mouse metagenome (or other host metagenome)
  • *collection_date - YYYY-MM-DD (one of the accepted formats)
  • *env_broad_scale - not applicable
  • *env_local_scale - animal-associated environment [ENVO:01001002] (or other host-environment)
  • *env_medium - fecal material [ENVO:00002003] (or other host tissue)
  • *geo_loc_name - USA: Oregon
  • *host - Mus musculus (or other species binomial)
  • *lat_lon - 44.566 N 123.283 W
  • genetic_mod
  • host_common_name - mouse (or other species common name)
  • host_diet
  • host_genotype
  • host_sex - male or female
  • host_subject_id
  • host_taxid - 10090 or other host taxid
  • misc_param
  • perturbation - experimental or control group
  • neg_cont_type - kit or water
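
Filling the template by hand gets tedious with many samples, so the rows can be generated instead. The sketch below writes tab-delimited text using the example values from the checklist above. The helper names biosample_row and to_tsv are hypothetical, and the sketch assumes you paste the resulting rows under the header of the template you downloaded rather than replacing the template itself.

```python
import csv
import io

# Required columns from the checklist above (marked with * there).
REQUIRED = [
    "sample_name", "organism", "collection_date", "env_broad_scale",
    "env_local_scale", "env_medium", "geo_loc_name", "host", "lat_lon",
]

# Example values copied from the checklist; replace them for other hosts or sites.
DEFAULTS = {
    "organism": "mouse metagenome",
    "env_broad_scale": "not applicable",
    "env_local_scale": "animal-associated environment [ENVO:01001002]",
    "env_medium": "fecal material [ENVO:00002003]",
    "geo_loc_name": "USA: Oregon",
    "host": "Mus musculus",
    "lat_lon": "44.566 N 123.283 W",
}


def biosample_row(sample_name, collection_date, **extra):
    """Build one row; keyword arguments add or override attributes (e.g. host_sex)."""
    row = dict(DEFAULTS, sample_name=sample_name, collection_date=collection_date, **extra)
    missing = [col for col in REQUIRED if not row.get(col)]
    if missing:
        raise ValueError(f"missing required attributes: {missing}")
    return row


def to_tsv(rows):
    """Return the rows as tab-delimited text: required columns first, then optional ones."""
    optional = sorted({key for row in rows for key in row} - set(REQUIRED))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=REQUIRED + optional,
                            delimiter="\t", lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

For example, to_tsv([biosample_row("MouseA", "2020-01-15", host_sex="male")]) produces a header line plus one row; the sample name and date here are made up for illustration.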

Minimum SRA metadata

  • sample_name - must match the sample_name submitted in the BioSample table above
  • library_ID - can match sample_name
  • title - 16S metabarcoding of Mus musculus: feces
  • library_strategy - AMPLICON
  • library_source - METAGENOMIC
  • library_selection - PCR
  • library_layout - paired
  • platform - ILLUMINA
  • instrument_model - Illumina MiSeq
  • design_description - choose from below
    • Earth Microbiome Project 16S PCR protocol
    • Illumina 16S PCR protocol
  • filetype - fastq
  • filename - sample_name_R1.fastq.gz
  • filename2 - sample_name_R2.fastq.gz
  • filename3 - delete
  • filename4 - delete
  • assembly - delete
  • fasta_file - delete

Uploading data using ftp

  • Expand the FTP instructions
  • Log in to the server that holds your reads using ssh
  • Navigate to the directory containing the reads
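  • Connect to NCBI over ftp using the credentials given in the FTP instructions:
    • ftp -i
    • The -i flag turns off interactive prompting, so multiple files transfer without a confirmation for each one
  • Change (cd) to the directory assigned to you on the website
  • Use mkdir to create a new directory for this submission, then cd into it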
  • Use mput to upload multiple files at once
    • mput *.fastq.gz
  • Wait until the files are available to select in the web interface
    • May take 10+ minutes
  • Choose the directory in the Select preload folder dialog and click Continue
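
The same upload can be scripted with Python's standard ftplib module instead of the interactive ftp client. The sketch below mirrors the steps above (change directory, make a new directory, upload every fastq.gz file). The function name upload_reads and the directory name my_submission are hypothetical, and the host, user name, password, and remote directory are placeholders that must come from the FTP instructions on the submission page.

```python
from ftplib import FTP
from pathlib import Path


def upload_reads(ftp, remote_dir, new_dir, local_dir="."):
    """cd to remote_dir, create and enter new_dir, then upload each *.fastq.gz file."""
    ftp.cwd(remote_dir)   # directory assigned to you on the submission page
    ftp.mkd(new_dir)      # new directory for this submission
    ftp.cwd(new_dir)
    uploaded = []
    for path in sorted(Path(local_dir).glob("*.fastq.gz")):
        with path.open("rb") as handle:
            ftp.storbinary(f"STOR {path.name}", handle)
        uploaded.append(path.name)
    return uploaded


# Usage sketch; host, user, password and remote_dir are placeholders taken
# from the FTP instructions, and "my_submission" is a made-up directory name.
# with FTP(host) as ftp:
#     ftp.login(user, password)
#     upload_reads(ftp, remote_dir, "my_submission")
```

As with mput, the files may take some time to appear in the web interface after the transfer finishes.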