Notes from Nick’s intro to bioinformatics, Thu 18 Aug, PoreCamp 2016

QC approaches

Issues

read lengths, read number, data yield
BLAST a few reads to check what the sample actually is
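
A minimal sketch of the three QC numbers above (read number, read lengths, data yield), assuming plain 4-line FASTQ records with unwrapped sequence lines. The function names are made up for illustration and do not come from any of the tools listed in these notes.

```python
def read_lengths_from_fastq(lines):
    """Yield the sequence length of each record in 4-line FASTQ text."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of every record is the sequence
            yield len(line.strip())


def summarise(lengths):
    """Return read count, total yield, longest read, mean length and N50."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:  # reads this long or longer hold half the bases
            n50 = length
            break
    return {
        "reads": len(lengths),
        "yield_bases": total,
        "longest": lengths[0] if lengths else 0,
        "mean": total / len(lengths) if lengths else 0,
        "n50": n50,
    }
```

To use it on a real file, pass an open file handle, e.g. summarise(read_lengths_from_fastq(open("reads.fastq"))), where reads.fastq is a placeholder name.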

Software

Mapping vs Assembly

Issues

might not have a reference
consensus assembly may not capture the diversity in the system

Assembly is better when you have

no references
lots of repeats between query sequence and reference genome
polymorphisms
structural variations

Mapping-based approaches are better when

doing population-genetic studies of SNPs
you have high coverage and want to see the alignments

Mapping

Mapping-related software

nanook (QC of mapped reads, gives nice reports)
[bwa] mem -x ont2d (http://bio-bwa.sourceforge.net/)
smalt (used by the class on Illumina data only so far)
blasr (written for PacBio)
[graphmap] (https://github.com/isovic/graphmap)
bowtie2 (Nick: optimised for short reads)
geneious (wrapper for existing software)
blast
vicuna
diamond (for short reads in protein space, an accelerated blastx, output can be input to megan)
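
A rough sketch of how the bwa mem -x ont2d entry above could be driven from a script. The file names and helper functions are placeholders for illustration; only bwa index, bwa mem, -x ont2d and -t are actual bwa options, and bwa is assumed to be on the PATH.

```python
import subprocess


def bwa_index_command(reference):
    """Command that builds the bwa index for a reference FASTA."""
    return ["bwa", "index", reference]


def bwa_mem_ont_command(reference, reads, threads=1):
    """Command that maps nanopore reads with the ont2d preset."""
    return ["bwa", "mem", "-x", "ont2d", "-t", str(threads), reference, reads]


def map_reads(reference, reads, sam_out, threads=1):
    """Index the reference, then write the SAM alignments to sam_out."""
    subprocess.run(bwa_index_command(reference), check=True)
    with open(sam_out, "w") as out:
        subprocess.run(
            bwa_mem_ont_command(reference, reads, threads),
            stdout=out,
            check=True,
        )
```

The SAM output would then normally be sorted and indexed before looking at alignments or calling variants.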

How to call variants after mapping?

samtools mpileup
GATK
nanopolish

De novo assembly

How much data do you need for an assembly

only really need about 10x coverage (Lander-Waterman statistics: how much data do you need to see every part of the genome at least once? For human, need about 7-8x coverage)
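
A small worked sketch of the coverage arithmetic, assuming the idealised Lander-Waterman model (reads placed uniformly at random, no sequencing bias). Under that model, coverage is c = N * L / G and the chance that a given base is not covered by any read is about e^-c. The function names and the example numbers are illustrative only.

```python
import math


def coverage(n_reads, mean_read_length, genome_size):
    """Average depth c = N * L / G."""
    return n_reads * mean_read_length / genome_size


def fraction_uncovered(c):
    """Expected fraction of bases with zero reads at depth c (Poisson model)."""
    return math.exp(-c)


def expected_uncovered_bases(c, genome_size):
    """Expected number of bases never sequenced at depth c."""
    return genome_size * fraction_uncovered(c)


if __name__ == "__main__":
    # Hypothetical run: 10,000 reads of mean length 5 kb on a 5 Mb genome.
    c = coverage(10_000, 5_000, 5_000_000)
    print(c, fraction_uncovered(c), expected_uncovered_bases(c, 5_000_000))
```

Real data are less even than this model, so treat the numbers as a lower bound on what is needed.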

Assembler types

OLC assemblers (overlap-layout-consensus; CANU is best for nanopore)
de Bruijn assemblers (use k-mers)
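
A toy sketch of the core idea behind each assembler type, assuming error-free reads and exact matches only. Real assemblers tolerate errors and are far more involved; the function names here are made up for illustration.

```python
from collections import defaultdict


def suffix_prefix_overlap(a, b, min_len=3):
    """OLC idea: length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0  # no overlap of at least min_len


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_edges(reads, k):
    """de Bruijn idea: each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

The overlap approach compares whole reads with each other, which suits long noisy reads; the k-mer approach breaks reads up first, which suits large numbers of accurate short reads.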

Software

CANU (new version of the Celera Assembler, for long reads)
miniasm (OL (no consensus) assembly - very fast, but no correction stage)
racon
IDBA-UD
busco (for eukaryotes)
velvet (for short reads)
ALLPATHS-LG

Research and development

Jared is working on getting near-perfect de novo genomes; the aim is to reach 99.99999% accuracy
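
To put the 99.99999% figure in context, here is a short sketch that converts an accuracy into a Phred-style quality value and an expected error count. It assumes the figure means per-base consensus accuracy, which the notes do not state explicitly; the function names and the 5 Mb genome size are illustrative.

```python
import math


def phred_from_accuracy(accuracy_percent):
    """Q = -10 * log10(error rate), with accuracy given in percent."""
    error_rate = 1 - accuracy_percent / 100
    return -10 * math.log10(error_rate)


def expected_errors(accuracy_percent, genome_size):
    """Expected number of wrong bases in a genome of the given size."""
    return genome_size * (1 - accuracy_percent / 100)


if __name__ == "__main__":
    # Hypothetical 5 Mb bacterial genome.
    print(phred_from_accuracy(99.99999), expected_errors(99.99999, 5_000_000))
```

At that accuracy the error rate is about 1 in 10 million bases, so a bacterial-sized assembly would be expected to contain less than one error on average.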

Typical de novo pathway

nanopore reads -> de novo assembly -> de novo error correction -> polished assembly
nanopore reads -> de novo assembly -> short-read error correction -> polished assembly
nanopore reads -> CANU or miniasm -> assembly
nanopore reads -> miniasm -> assembly -> racon -> polished assembly
nanopore reads -> CANU or miniasm -> assembly -> assembly + events -> nanopolish -> polished assembly
nanopore reads -> CANU or miniasm -> assembly -> assembly + short reads -> pilon -> polished assembly

Typical hybrid assembly pathway

nanopore reads + illumina -> spades -> polished assembly

Species identification (taxonomic assignment of reads)

kraken (WIMP is a Metrichor workflow based on kraken)
megan (16S)
MetaPhlAn

Genome Annotation

prokka

CLIMB

You can download the PoreCamp2016 CLIMB image, then run it on VirtualBox or Amazon services.

Q and A

Where do the short fragments come from in the read length distribution?

The best theory is that they come from the bead beating used to break the cell walls of Gram-positive bacteria and extract the DNA.

porecamp.github.io

PoreCamp - a training bootcamp based on the Oxford Nanopore MinION