7. Alignment of reads to a reference genome
The metagenomic screenig of the shotgun library detected reads assigned to Yersinia pestis. The following step is to ascertain that these molecules are authentic.
You can do that by mapping your pre-processed fastq files (merged and trimmed) to the Yersinia pestis CO92 strain reference sequence, available in the RefSeq NCBI database.
Here, in order to obtain an optimal coverage for the subsequent variant call, we will run the alignment on a different fastq file that we prepared to simulate an enriched library.
7.1. Preparation of the reference sequence
7.1.1. Index the reference sequence with bwa
To align the reads to the reference sequence we will use the program BWA, in particular the BWA aln algorithm. BWA first needs to construct the FM-index for the reference genome, with the command BWA index. FM-indexing in Burrows-Wheeler transform is used to efficiently find the number of occurrences of a pattern within a compressed text, as well as locate the position of each occurrence. It is an essential step for querying the DNA reads to the reference sequence. This command generates five files with different extensions: amb, ann, bwt, pac, sa.
bwa index -a is reference.fasta
Note
The option -a indicates the algorithm to use for constructing the index. For genomes smaller than < 2 Gb use the is algorithm. For larger genomes (>2 Gb), use the bwtsw algorithm.
7.1.2. Create a reference dictionary
A dictionary file (dict) is necessary to run later in the pipeline GATK RealignerTargetCreator. A sequence dictionary contains the sequence name, sequence length, genome assembly identifier, and other information about the sequences. To create the dict file we use Picard.
picard CreateSequenceDictionary R= referece.fasta O= reference.dict
Note
In our server environment we can call Picard just by typing the program name. In other environments (including your laptop) you may have to call Picard by providing the full path to the java file jar of the program:
java -jar /path/to/picard.jar CreateSequenceDictionary R= referece.fasta O= ref.dict
7.1.3. Index the reference sequence with Samtools
The reference sequence has to be indexed in order to run later in the pipeline GATK IndelRealigner. To do that, we will use Samtools, in particular the tool samtools faidx, which enables efficient access to arbitrary regions within the reference sequence. The index file typically has the same filename as the corresponding reference sequece, with the extension fai appended.
samtools faidx reference.fasta
7.2. Alignment of the reads to the reference sequence
7.2.1. Alignment of pre-processed reads to the reference genome with BWA aln
To align the reads to the reference genome we will use BWA aln, which supports an end-to-end alignment of reads to the reference sequence. The alternative algorithm, BWA mem supports also local (portion of the reads) and chimeric alignments (resulting in a larger number of mapped reads than BWA aln). BWA aln is more suitable for aliging short reads, like expected for ancient DNA samples. The following comand will generate a sai file.
bwa aln reference.fasta filename.fastq.gz -n 0.1 -l 1000 > filename.sai
Some of the options available in BWA aln:
Option |
Function |
|---|---|
-n number |
Maximum edit distance if the value is an integer. If the value is float the edit distance is automatically chosen for different read lengths (default=0.04) |
-l integer |
Seed length. If the value is larger than the query sequence, seeding will be disabled. |
-o integer |
Maximum number of gap opens. For aDNA, tolerating more gaps helps mapping more reads (default=1). |
Note
Due to the particular damaged nature of ancient DNA molecules, carrying deaminations at the molecules ends, we deactivate the
seed-lengthoption by giving it a high value (e.g.-l 1000).Here we are aligning reads to a bacterial reference genome. To reduce the impact of spurious alignemnts due to presence bacterial species closely related to the one that we are investigating, we will adopt stringent conditions by decreasing the
maximum edit distanceoption (-n 0.1). For alignment of DNA reads to the human reference sequence, less stringent conditions can be used (-n 0.01and-o 2).
Once obtained the sai file, we align the reads (fastq file) to the reference (fasta file) using BWA samse, to generate the alignment file sam.
bwa samse reference.fasta filename.sai filename.fastq.gz > filename.sam
7.2.2. Converting sam file to bam file
For the downstream analyses we will work with the binary (more compact) version of the sam file, called bam. To convert the sam file in bam we will use Samtools view.
samtools view -Sb filename.sam > filename.bam
Note
The conversion from sam to bam can be piped (|) in one command right after the alignment step:
bwa samse reference.fasta filename.sai filename.fastq.gz | samtools view -Sb - > filename.bam
To view the content of a sam file we can just use standard commands like head, tail, less, while to view the content of a bam file (binary format of sam) we have to use Samtools view:
samtools view filename.bam
You may want to display on the screen one read/line (scrolling with the spacebar):
samtools view filename.bam | less -S
while to display just the header of the bam file:
samtools view -H filename.bam
7.2.3. Sorting and indexing the bam file
To go on with the analysis, we have to sort the reads aligned in the bam file by leftmost coordinates (or by read name when the option -n is used) with Samtools sort. The option -o is used to provide an output file name:
samtools sort filename.bam -o filename.sort.bam
The sorted bam files are then indexed with Samtools index. Indexes allow other programs to retrieve specific parts of the bam file without reading through each sequence. The following command generates a bai file, a companion file of the bam which contains the indexes:
samtools index filename.sort.bam
7.2.5. Removing duplicates
Amplification through PCR of genomic libraries leads to duplication formation (reads originating from a single fragment of DNA).
To remove duplicates we will use DeDup, which is specifically designed for ultra-short DNA (e.g. ancient DNA) paired-end and merged sequence data.
The issue with ultra-short DNA is that during the typical sequencing chemistry cycles the entire molecule will be sequenced and therefore we will find both ends.
Compared to typical deduplication tools, that only look for reads with the same starting position at the 5’ end of a read, DeDup will remove ‘true’ duplicates using BOTH (5’ and 3’) ends of the reads. This can help increase coverage in low-preservation samples such used in ancient DNA by being more exact as to what are duplicates or not.
DeDup generates a bam file with the suffix _rmdup.bam
dedup -i filename.RG.bam -o folder_name -m -u
Option |
Function |
|---|---|
-h |
Shows the help page |
-i string |
Select your input, otherwise you may use pipes to pipe in your data |
-m |
The input only contains merged reads - don’t care about missing prefixes for merged/reverse/forward reads |
-o string |
output folder |
-u |
Do not automatically sort the output |
Note
The output can be put in a separate folder (-o folder_name), or more simply if you are working in the folder containing the input bam, use -o . to have the output in that same location.
Note
Alternatively, you can use the MarkDuplicates tool of Picard, which marks the reads as duplicates when the 5’-end positions of both reads and read-pairs match. A metric file with various statistics is created, and reads are removed from the bam file by using the REMOVE_DUPLICATES=True option (the default option is False, which simply ‘marks’ duplicate reads keep them in the bam file).
picard MarkDuplicates I= filename.RG.bam O= filename.DR.bam M=output_metrics.txt REMOVE_DUPLICATES=True VALIDATION_STRINGENCY=LENIENT
Once removed the duplicates with DeDup, we sort the reads and index the bam file:
samtools sort filename_rmdup.bam -o filename_rmdup_sort.bam
samtools index filename_rmdup_sort.bam
7.2.6. Local realignment of reads
The presence of insertions or deletions (indels) in the genome may be responsible of misalignments and bases mismatches that are easily mistaken as SNPs. For this reason, we locally realign reads to minimize the number of mispatches around the indels. The realignment process is done in two steps using two different tools of GATK called with the -T option. We first detect the intervals which need to be realigned with the GATK RealignerTargetCreator, and save the list of these intevals in a file that we name target.intervals:
gatk -T RealignerTargetCreator -R reference.fasta -I filename_rmdup.bam -o target.intervals
Note
Like Picard, in some server environment you can call GATK just by typing the program name. In other environments (or in your local computer) you have to call GATK by providing the full path to the java jar file. For example, if it’s in a folder named GenomeAnalysisTK-3.8 in the home of your computer the absolute path to the file is ~/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar:
java -jar ~/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -h
Warning
In version 4 of GATK the indel realigment tools have been retired from the best practices (they are unnecessary if you are using an assembly based caller like Mutect2 or HaplotypeCaller). To use the indel realignment tools make sure to install version 3 of GATK.
Then, we realign the reads over the intervals listed in the target.intervals file with the option -targetIntervals of the tool IndelRealigner in GATK:
java -jar ~/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I filename_rmdup.bam -targetIntervals target.intervals -o filename.final.bam --filter_bases_not_stored
Note
If you want, you can redirect the standard output of the command into a
logfile by typing at the end of the command&> logFile.logThe option
--filter_bases_not_storedis used to filter out reads with no stored bases (i.e. with * where the sequence should be), instead of failing with an error
The final bam file has to be sorted and indexed as previously done:
samtools sort filename.final.bam -o filename.final.sort.bam
samtools index filename.final.sort.bam
7.2.7. Generate flagstat file
We can generate a file with useful information about our alignment with Samtools flagstat. This file is a final summary report of the bitwise FLAG fields assigned to the reads in the sam file.
samtools flagstat filename.final.sort.bam > flagstat_filename.txt
Note
You could generate a flagstat file for the two
bamfiles before and after refinement and see the differences.You can decode each
FLAGfield assigned to a read on the Broad Institute website.
7.2.8. Visualization of reads alignment
Once generated the final bam file, you can compare the bam files before and after the refinement and polishing process (duplicates removal, realignment around indels and sorting). To do so, we will use the program IGV, in which we will first load the reference fasta file from Genomes –> Load genome from file and then we will add one (or more) bam files with File –> Load from file:
7.3. Create mapping summary reports
We will use Qualimap to create summary reports from the generated bam files. As mentioned in the website, Qualimap examines sequencing alignment data in sam/bam files according to the features of the mapped reads and provides an overall view of the data that helps to detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.
qualimap bamqc -c -bam input.bam
Here are some screenshots of the outputs:
At this stage we have created different type of summary report using FastQC and Qualimap. To create a unique summary that integrate and compare all the generated reports, we will use MultiQC. If all the reports are in the same directory and its sub-directories, you can run MultiQC as follows:
multiqc .
A list of programs that generate output files recognized by MultiQC are availble here: https://github.com/ewels/MultiQC
Multiqc will create a summary report in html format that will let you compare all the summary reports for each of your samples:
7.4. Alignment of reads to circular reference genomes
The mtDNA and most of the bacterial genomes are circular. However, their sequence is represented as linear when the alignment of reads is done. This poses limitations to our ability of mapping the reads at the two ends of the linearized genome. The tool CircularMapper circumvents these limitations by extending the end of the linearized reference genome by a number of nucleotides from the beginning of the sequence. This ‘simulates’ the contiguity of the circular sequence at the two ends of the linearized genome, facilitating the alignment of the reads in those regions and consequently achieving more even coverages at the two ends. You can install CircularMapper with Conda
7.4.1. Create an extended reference sequence
You can first check the header of your referece sequence (e.g. the human mtDNA reference sequence), to copy the name reported in the header (right after the > symbol, in this example NC_012920.1).
If you are using the full human genome, you can grep the headers of the various chromosomes and refer specifically to the name of the mtDNA to extend its sequence (e.g. use the option -s chrMT).
head human_mt.fasta
grep ">" human_mt.fasta
Then, you can run the circulargenerator command of CircularMapper to extend the end of the reference sequence with 200 nucleotides from its beginning. A file human_mt_200.fasta will be generated.
circulargenerator -e 200 -i human_mt.fasta -s NC_012920.1
Option |
Function |
|---|---|
-e digit |
Elongation factor (number of nucleotides copied from the beginning and pasted at the end of the reference sequence) |
-i string |
Input reference sequence name |
-s string |
Name of the reference sequence as reported in the header of the fasta file. |
7.4.2. Realign reads to original reference sequence
Once generated the extended reference sequence, you can use it to align the reads with bwa aln and bwa samse as described in the section Alignment of pre-processed reads to the reference genome with BWA aln. To keep note of the fact that an extended reference is used, you could include in the name of the sam file the elongation factor (e.g. filename_elongated200.bam, or similar)
After running bwa samse and once obtained the sam alignment file, you must readjust the mapped reads (which now span the circular portion of the genome) to the original reference sequence (the not-extended file) by running the realignsamfile command of CircularMapper.
realignsamfile -e 200 -i filename.sam -r human_mt.fasta
Option |
Function |
|---|---|
-e digit |
Elongation factor (number of nucleotides copied from the beginning and pasted at the end of the reference sequence) |
-i string |
Input sam file |
-r string |
Input reference fasta file (original, not extended) |
-f true/false |
Filter the reads that don’t map to a circular identifier, e.g. when using the full human genome (default: false). |
-x true/false |
Filter the sequence identifiers that are not circular identifiers, e.g. when using the full human genome (default: false). |
The command above will generate a filename_realigned.bam file, which must be sorted and indexed, as usual
samtools sort filename_realigned.bam -o filename_realigned.sort.bam
samtools index filename_realigned.sort.bam
You can then move forward with the other steps described for bam refinement from the step Adding Read Group tags and indexing bam files
7.5. Damage analysis and quality rescaling of the BAM file
To authenticate our analysis we will assess the post-mortem damage of the reads aligned to the reference sequence. We can track the post-portem damage accumulated by DNA molecules in the form of fragmentation due to depurination and cytosine deamination, which generates the typical pattern of C->T and G->A variation at the 5’- and 3’-end of the DNA molecules. To assess the post-mortem damage patterns in our bam file we will use mapDamage, which analyses the size distribution of the reads and the base composition of the genomic regions located up- and downstream of each read, generating various plots and summary tables. To start the analysis we need the final bam and the reference sequence:
mapDamage -i filename.final.sort.bam -r reference.fasta
mapDamage creates a new folder where the output files are created. One of these files, is named Fragmisincorporation_plot.pdf which contains the following plots:
If DNA damage is detected, we can run mapDamage again using the --rescale-only option and providing the path to the results folder that has been created by the program (option -d). This command will downscale the quality scores at positions likely affected by deamination according to their initial quality values, position in reads and damage patterns.
A new rescaled bam file is then generated.
mapDamage -i filename.final.sort.bam -r reference.fasta --rescale-only -d results_folder
You can also rescale the bam file directly in the first command with the option --rescale:
mapDamage -i filename.final.sort.bam -r reference.fasta --rescale
Note
Another useful tool for estimating post-mortem damage (PMD) is PMDTools. This program uses a model incorporating PMD, base quality scores and biological polymorphism to assign a PMD score to the reads. PMD > 0 indicates support for the sequence being genuinely ancient. PMDTools filters the damaged reads (based on the selected score) in a separate bam file which can be used for downstream analyses (e.g. variant call).
The rescaled bam file has to be indexed, as usual.
samtools index filename.final.sort.rescaled.bam
7.6. Edit Distance
The edit distance defines the number of nucleotide changes that have to be made to one read sequence for it to be identical to the reference sequence.
To be more confident about the quality and authenticity of your sequencing data, you need to align your reads againt your reference sequence and the genome of a closely related species.
Here we will align our fastq file against the Yersinia pseudotuberculosis genome, following all the steps from 4.1 to 4.4.
The edit distance must be lower when aligning the reads to the reference sequence compared to the closely related species.
To calculate the edit distance we will use BAMStats, a tool for summarising Next Generation Sequencing alignments. The commands to generate summary-charts, including the edit distance is:
BAMStats -i filename.rescaled.bam -v html -d -q -o outfile.html
Option |
Function |
|---|---|
-i filename |
SAM or BAM input file (must be sorted). |
-v html/simple |
View option for output format (currently accepts ‘simple’ or ‘html’; default, simple). |
-d |
If selected, edit distance statistics will also be displayed as a separate table (optional). |
-q |
If selected, mapping quality (MAPQ) statistics will also be displayed as a separate table (optional). |