Notes from Nick’s introduction to bioinformatics, Thu 18 Aug, PoreCamp 2016
QC approaches
Issues
- read lengths, read number, data yield
- blast a few reads to see what it is
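The first QC numbers above (read number, data yield, read lengths) can be pulled straight from a FASTQ file. A minimal sketch, assuming plain 4-line-per-record FASTQ; the function names and the tiny example reads are made up for illustration, and N50 is added here as a common read-length summary that was not mentioned in the session:

```python
from io import StringIO


def read_lengths(fastq_handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the sequence is the second line of every record
            yield len(line.strip())


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarise(fastq_handle):
    """Return read number, data yield and simple read-length statistics."""
    lengths = list(read_lengths(fastq_handle))
    return {
        "reads": len(lengths),
        "yield_bases": sum(lengths),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0,
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Made-up example: three reads of length 4, 8 and 12.
example = StringIO(
    "@r1\nACGT\n+\n!!!!\n"
    "@r2\nACGTACGT\n+\n!!!!!!!!\n"
    "@r3\nACGTACGTACGT\n+\n!!!!!!!!!!!!\n"
)
stats = summarise(example)
print(stats)
```

For a real run, replace the StringIO example with an open file handle on the FASTQ file.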
Software
Mapping vs Assembly
Issues
- might not have a reference
- consensus assembly may not capture the diversity in the system
Assembly is better when you have
- no references
- lots of repeats between query sequence and reference genome
- polymorphisms
- structural variations
Mapping-based approaches are better when
- doing population-genetic studies of SNPs
- you have high coverage and want to see the alignments
Mapping
Mapping-related software
- nanook (QC of mapped reads, gives nice reports)
- bwa mem -x ont2d (http://bio-bwa.sourceforge.net/)
- smalt (used by the class, but with illumina only so far)
- blasr (written for PacBio)
- graphmap (https://github.com/isovic/graphmap)
- bowtie2 (Nick: optimised for short reads)
- geneious (wrapper for existing software)
- blast
- vicuna
- diamond (for short reads in protein space, an accelerated blastx, output can be input to megan)
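As a rough illustration of the mapping step, the bwa entry in the list above can be turned into a command line. This is only a sketch: the helper name, the file names and the thread count are placeholders, and it builds and prints the commands without running them. bwa needs the reference indexed first (bwa index), and bwa mem writes SAM to standard output.

```python
import shlex


def bwa_index_command(reference):
    """Command that builds the bwa index for a reference FASTA."""
    return ["bwa", "index", reference]


def bwa_mem_ont_command(reference, reads, threads=4):
    """Command that maps nanopore reads with the ont2d preset from the notes."""
    return ["bwa", "mem", "-x", "ont2d", "-t", str(threads), reference, reads]


def as_shell(command):
    """Render an argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in command)


# Placeholder file names, for illustration only.
index_cmd = bwa_index_command("reference.fasta")
map_cmd = bwa_mem_ont_command("reference.fasta", "reads.fastq", threads=8)
print(as_shell(index_cmd))
print(as_shell(map_cmd) + " > mapped.sam")
```

To actually run the mapping, the argument lists could be handed to subprocess.run with stdout redirected to a file, assuming bwa is installed and on the PATH.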
How to call variants after mapping?
- samtools mpileup
- GATK
- nanopolish
De novo assembly
How much data do you need for an assembly
- only really need 10x coverage (Lander-Waterman statistics: how much data do you need to see every part of the genome at least once? For human, need about 7-8x coverage)
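The coverage argument can be made concrete with the idealised model behind these statistics: if reads land uniformly at random, the chance that a given base is missed at coverage c is about e^(-c). A minimal sketch under that assumption (it ignores read-end effects, repeats and sequencing bias; the function names and the 3 Gb human genome size are illustrative):

```python
import math


def coverage(read_count, mean_read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return read_count * mean_read_length / genome_size


def fraction_covered(depth):
    """Expected fraction of the genome seen at least once at this depth."""
    return 1.0 - math.exp(-depth)


def expected_uncovered_bases(genome_size, depth):
    """Expected number of bases not covered by any read."""
    return genome_size * math.exp(-depth)


HUMAN_GENOME = 3_000_000_000  # approximate size, used only as an example

for depth in (1, 5, 7, 8, 10):
    print(
        f"{depth:>2}x  covered={fraction_covered(depth):.5f}  "
        f"uncovered_bases={expected_uncovered_bases(HUMAN_GENOME, depth):,.0f}"
    )
```

In this model the uncovered fraction shrinks exponentially with depth, which is why a modest depth already covers almost all of the genome. More depth is still needed to get a good consensus.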
Assembler types
- OLC assemblers (overlap-layout-consensus; CANU is best for nanopore)
- de Bruijn assemblers (uses k-mers)
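To show what "uses k-mers" means for a de Bruijn assembler, here is a toy sketch: reads are cut into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by walking an unbranched path. It is not how any real assembler is implemented; it ignores reverse complements, sequencing errors and k-mer counts, and the reads and function names are made up.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Made-up error-free reads that overlap along one short sequence.
reads = ["ACGTTG", "GTTGCA", "TGCATC"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unbranched(graph, "ACG")
print(contig)
```

An OLC assembler works differently: it compares whole reads against each other to find overlaps. That suits long, error-prone reads better than exact k-mer matching does.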
Software
- CANU (new Celera Assembler for long reads)
- miniasm (overlap-layout assembly with no consensus step - very fast, but no correction stage)
- racon
- IDBA-UD
- busco (for eukaryotes)
- velvet (for short reads)
- ALLPATHS-LG
Research and development
- Jared is working on getting near-perfect de novo genomes - wants to get to 99.99999%
Typical de novo pathway
- nanopore reads -> de novo assembly -> de novo error correction -> polished assembly
- nanopore reads -> de novo assembly -> short-read error correction -> polished assembly
- nanopore reads -> CANU or miniasm -> assembly
- nanopore reads -> miniasm -> assembly -> racon -> polished assembly
- nanopore reads -> CANU or miniasm -> assembly -> assembly + events -> nanopolish -> polished assembly
- nanopore reads -> CANU or miniasm -> assembly -> assembly + short reads -> pilon -> polished assembly
Typical hybrid assembly pathway
- nanopore reads + illumina -> spades -> polished assembly
Species identification (taxonomic assignment of reads)
Genome Annotation
CLIMB
You can download the PoreCamp2016 CLIMB image, then run it on VirtualBox or Amazon services.
Q and A
Where do the short fragments come from in the read length distribution?
- The best theory is that they come from the bead beating used to break the cell wall of Gram-positive bacteria and extract the DNA.