Mapping tutorial for Porecamp 2016

In this tutorial we will explore the types of data that the MinION produces, and try to look at the error mode by visual inspection of alignments.

WARNING: You might get strange errors when copying and pasting commands from this file into a terminal window if the text contains double-quote characters. In these situations, please type in the double-quotes yourself.

Data

We’ll be using one sample sequenced for the Nick and Josh’s Ebola sequencing project.

The reads are already on the PoreCamp2016 server in /data/raw/hist/Ebola_R7 (downloaded from the ENA project PRJEB10571 sample ERR1014225).

Working directory

You will be putting files into a sub-directory of your home directory.

cd  
mkdir MappingTute  
cd MappingTute  

Extract the data using poretools

poretools is a tool for extracting and interrogating nanopore data published by Nick Loman and Aaron Quinlan.

First, we’ll extract the 2D pass reads in FASTA format:

poretools fasta --type 2D /data/raw/hist/Ebola_R7/reads/pass > Ebola2D.fasta

How many reads do we have?

grep ">" Ebola2D.fasta | wc -l

Download the reference genome

We can use wget to download a file on the web directory to our server

wget https://raw.githubusercontent.com/nickloman/ebov/master/refs/EM_079517.fasta

If you are running in X2Go run:

source ~/.bash_profile

And then create the bwa index file required to run bwa mem later

bwa index EM_079517.fasta

Map the reads

And now we actually map the reads, convert the SAM output to BAM format and then sort it by mapping coordinate (rather than read name) and save it as Ebola2D.sorted.bam and create the SAM index file required to run other samtools subtools later.

bwa mem -x ont2d EM_079517.fasta Ebola2D.fasta | samtools view -bS - | samtools sort -o Ebola2D.sorted.bam -

samtools index Ebola2D.sorted.bam

Basic QC of the data

As a first QC of the aligned reads, we can run samtools stats

samtools stats Ebola2D.sorted.bam > Ebola2D.stats.txt  
head -n 40 Ebola2D.stats.txt

How many reads were mapped?
What was the average length of the reads?

We can also plot the read depth across the reference genome by using the output of samtools stats and then plotting in Rstudio

grep "^COV" Ebola2D.stats.txt > Ebola2D.coverage.txt

First, in a web browser, open 147.188.173.136:8773 then type in your group username and password. Then Rstudio should open for you and you can type the following:

library(ggplot2)  
cov=read.table("/path/to/your/Ebola2D.coverage.txt", sep="\t")  
cov[1,]  
ggplot(cov, aes(x=V3, y=V4)) + geom_bar(stat='identity') + xlab('coverage') + ylab('count')

You could also do something similar using the output of samtools depth if you have time later.

Consolidating your knowledge

Now, repeat this process from the beginning, but do it for a different dataset, choose from:

All 1D pass reads (hint: –type fwd,rev)
All pass forward reads (hint: –type fwd)
All pass reverse reads (hint: –type rev)
All 2D fail reads (hint: –type 2D, use the fail directory)

1D reads only! Ensure you use a different file name, e.g. Ebola1D.fasta

How does the number of reads change?
How does the mapping frequency change?

Inspecting alignments

Now, let’s download the BAM file and inspect the alignment. My favoured tool for this is Tablet. It requires Java.

https://ics.hutton.ac.uk/tablet/

You need to copy the files from the server to your laptop using a GUI interface like WinSCP or filezilla or use the scp command in a terminal window or PuTTY by typing the following on your laptop:

cd /path/to/your/workingdirectory  
scp groupX@147.188.173.136:/path/to/your/something.bam .  

You need to load the following two files into Tablet:

alignment file: Ebola2D.sorted.bam reference file: EM_079517.fasta

Inspect the alignment.

Did the alignment confirm your earlier suspicions about how the sample was prepared?
What are the pros and cons of this approach?
Which regions might you be suspicious of?

Have a look at the error profile. Are some parts of the genome better than others? Can you correlate this with the sequence?

Variant calling

The Ebola virus mutation rate is in the order of 1.2 x 10^-3 mutations/site/year. The genome size is 19000 bases long. This sample was collected about a year after the reference genome. Approximately how many SNPs do you expect to see?

Call SNPs - by eye!

Make a list of SNPs - which ones are hard to assess?

Variant calling with nanopolish

Calling variants with nanopolish relies on squiggle data to generate the best consensus and gives a nicer result.

To call variants, there are three steps:

align the reads with BWA (or another aligner, such as marginAlign, or LAST)
align the events with nanopolish eventalign
call a VCF with nanopolish variants

Copy the model files into your current directory from: /data2/models/ into your current directory.

cp /data2/models/* .

We’ve already aligned the reads (output file from BWA was Ebola2D.sorted.bam)

nanopolish-r7 eventalign --reads Ebola2D.fasta -b Ebola2D.sorted.bam -g EM_079517.fasta --sam | samtools view -bS - | samtools sort -o Ebola2D.eventalign.bam -

We need to index the new BAM file that nanopolish eventalign produced:

samtools index Ebola2D.eventalign.bam

And now we need to get the variants in VCF format:

nanopolish-r7 variants --progress -t 1 --reads Ebola2D.fasta -o Ebola2D.eventalign.vcf -b Ebola2D.sorted.bam -e Ebola2D.eventalign.bam -g EM_079517.fasta -vv -w EM_079517:0-20000 --snp

It is actually possible to use different models with nanopolish variants specifying the model filenames –models-fofn offset_models.fofn. In this case we swap the original 5-mer model for a 6-mer model.

Compare this list with the list of variants that you already eyeballed. How do they compare?

Did nanopolish spot things that you didn’t?

Did nanopolish get anything wrong? Could you figure out a way of filtering the VCF to remove these errors?

To get the consensus sequence from the reference, vcf and bam file:

/home/ubuntu/scripts/margin_cons.py EM_079517.fasta Ebola2D.eventalign.vcf Ebola2D.sorted.bam > Ebola2D.eventalign.consensus.fasta 2> Ebola2D.eventalign.variants.txt

SNP calling with 6-mer model

nanopolish-r7 variants --progress -t 1 --reads Ebola2D.fasta -o Ebola2D.6mer.vcf -b Ebola2D.sorted.bam -e Ebola2D.eventalign.bam -g EM_079517.fasta -vv -w "EM_079517:0-20000" --snp --models-fofn offset_models.fofn

How does the new VCF Ebola2D.6mer.vcf look compared with the old one?

Software versions

This tutorial was tested with the following software versions:

samtools version 1.3.1
bwa version 0.7.15-r1140

porecamp.github.io

PoreCamp - a training bootcamp based on the Oxford Nanopore MinION