How to download an NGS dataset from an online archive? 🧬
TL;DR: A simplified guide for downloading sequenced datasets from the NCBI Sequence Read Archive (SRA). This is just one way of doing it! 😌
Disclaimer: I’m neither a bioinformatician nor an expert in this domain, just a graduate student with a CS background. This article is compiled solely from what I learned on my own as a research assistant at the Agricultural Biotechnology Centre of Peradeniya University.
Tools
- SRA Toolkit: follow the official installation guide, or install it with conda using the command
conda install -c bioconda sra-tools
(if you don't have conda yet, install it first 😛)
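Before starting, it can help to confirm that the tools are actually on your PATH. The snippet below is only a small sketch: check_tools is a made-up helper name (it is not part of the SRA Toolkit), and the list of tools is simply the set of commands used later in this guide.

```shell
# Report which of the given command-line tools are missing from PATH.
# check_tools is a hypothetical helper name, not part of the SRA Toolkit.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return non-zero if at least one tool was not found.
    return "$missing"
}

# The commands used in this guide (pigz is optional).
check_tools prefetch fasterq-dump gzip pigz || echo "Install the missing tools before continuing."
```

If pigz is reported as missing, that is fine as long as you plan to use plain gzip in Step 3.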
Sometimes you may want to include a sequencing dataset from a previous study in your analysis. It could be either a DNA or an RNA dataset. Most of the time, you can find such datasets on the NCBI Sequence Read Archive, where the authors of the study deposited them. The following guide should help when downloading them.
For example, let’s assume you want to download the whole-genome sequencing dataset with the SRA accession SRR11577048¹.
Step 1 — Download dataset in SRA format
Download the dataset in SRA format using the prefetch command, which is part of the SRA Toolkit.
prefetch --max-size 50G --progress SRR11577048
Use the --max-size option to change the download size limit (the default is 20 GB, and this SRA file is about 49 GB).
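A download this large can get corrupted or interrupted, so it may be worth validating it right away. Here is a rough sketch that chains prefetch with vdb-validate, another SRA Toolkit command. fetch_sra is a hypothetical wrapper name, and the sketch assumes a recent toolkit version where prefetch saves the run into a folder named after the accession in the current directory, so that vdb-validate can find it by accession.

```shell
# fetch_sra is a hypothetical wrapper name; prefetch and vdb-validate
# are the SRA Toolkit commands it calls.
fetch_sra() {
    accession="$1"
    max_size="${2:-20G}"   # falls back to the default limit mentioned above

    # Download the run, then check the downloaded file for corruption.
    # vdb-validate only runs if prefetch succeeded.
    prefetch --max-size "$max_size" --progress "$accession" &&
        vdb-validate "$accession"
}

# Example (downloads about 49 GB, so it is left commented out):
# fetch_sra SRR11577048 50G
```

If the download is interrupted, running the same prefetch command again is usually the first thing to try before deleting anything.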
Step 2 — Convert SRA format to standard fastq format
We can use fasterq-dump from the SRA Toolkit to convert the SRA format to fastq format. (fasterq-dump is the multi-threaded version of the fastq-dump tool, and it is only available in SRA Toolkit ≥ v2.9.1.)
fasterq-dump --split-files -e 8 SRR11577048
To change the number of threads, use the -e option (the default is 6 threads).
This will create two fastq files named SRR11577048_1.fastq and SRR11577048_2.fastq. In this case, each file is about 180 GB 🤯.
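For a paired-end dataset like this one, a cheap sanity check is that both files contain the same number of reads. The sketch below assumes the usual four-lines-per-read fastq layout; count_reads and check_pairs are hypothetical helper names, not SRA Toolkit commands.

```shell
# count_reads is a hypothetical helper: a fastq record is normally four lines,
# so the read count is the line count divided by four.
count_reads() {
    lines=$(wc -l < "$1")
    echo $(($lines / 4))
}

# check_pairs compares the read counts of the two mate files.
check_pairs() {
    reads_1=$(count_reads "$1")
    reads_2=$(count_reads "$2")
    if [ "$reads_1" -eq "$reads_2" ]; then
        echo "OK: $reads_1 read pairs"
    else
        echo "MISMATCH: $reads_1 vs $reads_2 reads"
        return 1
    fi
}

# Example:
# check_pairs SRR11577048_1.fastq SRR11577048_2.fastq
```

Counting lines in files of this size takes a while, so run it only when you have a reason to doubt the conversion.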
Step 3 — Compress fastq files for space saving
Use gzip to compress the two fastq files.
gzip SRR11577048_1.fastq SRR11577048_2.fastq
The two files will be gzipped into SRR11577048_1.fastq.gz and SRR11577048_2.fastq.gz, about 36 GB each 😌.
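Keep in mind that gzip replaces each original file with its .gz version, so it may be worth checking that the compressed files are intact. A minimal sketch using gzip's -t (test) option follows; compress_and_test is a hypothetical helper name.

```shell
# compress_and_test is a hypothetical helper name.
compress_and_test() {
    for fastq in "$@"; do
        # gzip replaces each input file with a .gz file of the same name.
        gzip "$fastq" || return 1
        # -t reads the compressed file and checks its integrity
        # without writing the decompressed data anywhere.
        gzip -t "$fastq.gz" || return 1
        echo "verified: $fastq.gz"
    done
}

# Example:
# compress_and_test SRR11577048_1.fastq SRR11577048_2.fastq
```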
Instead of gzip, you can use pigz, a parallel implementation of gzip. To install pigz on Debian or Ubuntu: sudo apt update; sudo apt install pigz. Use -p to select the number of threads.
pigz -p 8 SRR11577048_1.fastq SRR11577048_2.fastq
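If you move between machines where pigz may or may not be installed, a small wrapper can pick whichever compressor is available. This is just a sketch with a made-up function name (compress_fastq). The output is a regular .gz file either way, because pigz writes the gzip format.

```shell
# compress_fastq is a hypothetical helper name.
# Usage: compress_fastq THREADS FILE...
compress_fastq() {
    threads="$1"
    shift
    if command -v pigz >/dev/null 2>&1; then
        # Parallel compression with the requested number of threads.
        pigz -p "$threads" "$@"
    else
        # Single-threaded fallback; the thread count is ignored here.
        echo "pigz not found, falling back to gzip" >&2
        gzip "$@"
    fi
}

# Example:
# compress_fastq 8 SRR11577048_1.fastq SRR11577048_2.fastq
```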
Further detailed information
References
[1] Lyons, Leslie A., et al. “Whole genome sequencing in cats, identifies new models for blindness in AIPL1 and somite segmentation in HES7.” BMC Genomics 17.1 (2016): 1–11.