3. Useful commands
3.1. Download FASTQ sequencing data from the SRA
To download fastq sequences deposited in the Sequence Read Archive (SRA) of the NCBI we wil use SRA Tools. You can install SRA Tools with Conda.
Go to the SRA website and search the accession number of the study you interested in (e.g. PRJEB30331). A list of all the sequences deposited under that accession number is returned, click on Send results to run selector. In the Run Selector all the details associated with the sequencing data are reported and you can select single samples/sequences by ticking the box next to each one. If you select any samples, remember to tick also the Selected button. Then click on the tab Metadata
A file SRR_Acc_List.txt is downloaded containing the list of samples that you selected (the Run accession number), or the full list if you did not select any samples.
The sra filea are downloaded in a folder created by SRA Tools which normally has the following path: ~/ncbi/public/sra.
To download the sequences type the following command:
cat SRR_Acc_List.txt | xargs -I{} prefetch {}
To extract the fastq sequences from the sra file use fastq-dump:
fastq-dump --split-files --gzip *.sra
Validate the integrity of the sra file:
vdb-validate SRR061294.sra
3.2. Download referece sequences from the NCBI
Download the assembly_summary.txt file from the RefSeq database for the taxonomic group of interest. You can download assemblies of different groups:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/assembly_summary.txt ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/protozoa/assembly_summary.txt ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Yersinia_pestis/assembly_summary.txt
Keep in mind that the RefSeq content includes assembled genome sequence and annotation data. All RefSeq genomes have annotation. Subdirectories include:
archaea
bacteria
fungi
invertebrate
plant
protozoa
vertebrate_mammalian
vertebrate_other
viral
mitochondrion [Content is from the RefSeq release FTP site.]
plasmid [Content is from the RefSeq release FTP site.]
plastid [Content is from the RefSeq release FTP site.]
Once downloaded, we will used this file to retrieve the ftp address that contains the folder with the genome of the species we are looking for, plus many other files.
You can open the assembly_summary file in a text editor and search the assemblies available the species of interest.
Otherwise you can explore the complete genomes and the latest assemblies for a peculiar species by piping grep and awk. AWK is a scripting language for advanced text elaboration. You can find more infos in here (italian, english).
grep "Yersinia pestis" assembly_summary.txt | awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $0}'
If too many results are returned you can pipe less, and use the option -S to view the lines as unwrapped (they will be truncated if longer than the screen width):
grep "Yersinia pestis" assembly_summary.txt | awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $0}' | less
grep "Yersinia pestis" assembly_summary.txt | awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $0}' | less -S
Note
The print sintax in awk determines which column of the text file is displayed based on the column separator, which in turn is defined by the option -F ("\t" stands for tab-delimited text). To display all the columns use $0, otherwise select the columns based on the order ($1,$2 to display the first two columns).
You can display only the columns corresponding to the ftp directory ($20), the organism name ($8) and the strain ($9), and copy the ftp path to the species directory of interest (this will be used in the rsync command below).
grep "Yersinia pestis" assembly_summary.txt | awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $8,$9,$20}' | less
If you want, you can also redirect the output to a file (called here ftpdirpaths.txt).
grep "Yersinia pestis" assembly_summary.txt | awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $8,$9,$20}' > ftpdirpaths.txt
In this example, the ftp path of the species of interest looks like this: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/717/545/GCF_001717545.1_ASM171754v1
To download the ftp directory containing the reference sequence, we will use rsync, a versatile file copying tool that can copy locally, to/from another host over any remote shell. Paste the ftp path to the following command to download the sequence:
rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/717/545/GCF_001717545.1_ASM171754v1 .
Inside the folder, the reference sequence that you will use for the alignment of your reads has the suffix _genomic.fna.gz.