How to download an NGS dataset from an online archive? 🧬

Jun 20, 2021

TL;DR: A simplified guide for downloading sequenced datasets from NCBI Sequence Read Archive. This is just one way of doing it! 😌

Disclaimer: I’m neither a bioinformatician nor an expert in this domain, just a graduate student with a CS background. This article is compiled solely from what I learned as a research assistant at the Agricultural Biotechnology Centre of Peradeniya University.

Tools

SRA Toolkit (follow the official installation guide, or install it with the command conda install -c bioconda sra-tools. If you are not using conda yet, install it 😛)

Sometimes you may want to include a sequencing dataset from a previous study in your analysis. It could be either a DNA or an RNA dataset. Most of the time, you can find such datasets on the NCBI Sequence Read Archive, where the authors of the study deposited them. The following guide should help you download them.

For example, let’s assume you want to download the whole-genome sequencing dataset with the SRA accession SRR11577048¹.

Step 1 — Download dataset in SRA format

Download the dataset in SRA format using the command prefetch, which is a part of the SRA Toolkit.

prefetch --max-size 50G --progress SRR11577048

Use the --max-size option to change the download size limit (the default is 20 GB, and this SRA file is about 49 GB).
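Before starting a download of this size, it can help to confirm that the target disk has enough room. The following is a minimal sketch, not part of the SRA Toolkit: has_free_space is a hypothetical helper name, and it assumes a POSIX df whose fourth column reports the available space.

```shell
# Hypothetical helper (not part of the SRA Toolkit):
# succeed only if directory $1 has at least $2 gigabytes available.
has_free_space() {
    local dir="$1" need_gb="$2"
    local avail_kb
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}
```

For this dataset you could run, for example, has_free_space . 450 && prefetch --max-size 50G --progress SRR11577048. The 450 is only a rough guess that covers the SRA file (about 49 GB) plus the two fastq files produced in the next step (about 180 GB each); fasterq-dump also writes temporary files while it runs, so leave extra headroom.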

Step 2 — Convert SRA format to standard fastq format

We can use fasterq-dump from the SRA Toolkit to convert the SRA format to fastq format. (fasterq-dump is the multi-threaded version of the fastq-dump tool, and it is available only in SRA Toolkit ≥ v2.9.1.)

fasterq-dump --split-files -e 8 SRR11577048

To change the number of threads, use the -e option (the default is 6 threads).
This will create two fastq files named SRR11577048_1.fastq and SRR11577048_2.fastq. In this case, each file is about 180 GB 🤯.
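A quick sanity check after the conversion is to confirm that both files hold the same number of reads. Here is a minimal sketch, assuming the usual layout of four lines per fastq record; count_reads and same_read_count are hypothetical helper names, not SRA Toolkit commands.

```shell
# Hypothetical helper: print the number of reads in an uncompressed fastq file.
# Fails if the line count is not a multiple of four (likely a truncated file).
count_reads() {
    local lines
    lines=$(wc -l < "$1") || return 1
    [ $(( lines % 4 )) -eq 0 ] || return 1
    echo $(( lines / 4 ))
}

# Hypothetical helper: succeed only if both files hold the same number of reads.
same_read_count() {
    local a b
    a=$(count_reads "$1") || return 1
    b=$(count_reads "$2") || return 1
    [ "$a" = "$b" ]
}
```

For example: same_read_count SRR11577048_1.fastq SRR11577048_2.fastq && echo "read counts match". On files of this size, counting lines takes a while because each file is read in full.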

Step 3 — Compress fastq files for space saving

Use gzip to compress the two fastq files.

gzip SRR11577048_1.fastq SRR11577048_2.fastq

The two files will be gzipped into SRR11577048_1.fastq.gz and SRR11577048_2.fastq.gz, about 36 GB each 😌.

Instead of gzip, you can use the parallel gzip tool pigz. To install pigz on Debian or Ubuntu: sudo apt update; sudo apt install pigz. Use -p to select the number of threads.

pigz -p 8 SRR11577048_1.fastq SRR11577048_2.fastq
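By default, neither gzip nor pigz keeps the original fastq files, so it is worth checking that the compressed files are intact before relying on them. A minimal sketch using gzip -t, which tests the integrity of a compressed file without extracting it; check_gz is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if every given .gz file passes
# gzip's built-in integrity test (gzip -t decompresses without writing output).
check_gz() {
    local f
    for f in "$@"; do
        gzip -t "$f" || return 1
    done
}
```

For example: check_gz SRR11577048_1.fastq.gz SRR11577048_2.fastq.gz && echo "archives OK".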

Further detailed information

References
[1] Lyons, Leslie A., et al. “Whole genome sequencing in cats, identifies new models for blindness in AIPL1 and somite segmentation in HES7.” BMC Genomics 17.1 (2016): 1–11.