Data Extraction and QC
- File formats
- Using HDF5 command line tools
- Browsing HDF5 files
- Basic QC in poRe
- Extracting metadata using poRe
- Parsing the sequencing summary file
- Extracting FASTQ/A using Poretools
- Extracting FASTQ/A using Nanopolish
- Extracting FASTQ/A using poRe
- Extracting FASTQ/A and Metadata using poRe GUIs
File formats
Oxford Nanopore are very bad at releasing official definitions of file formats, therefore much guess work is involved. Most of the time (still true as I type…) working with ONT data means working with FAST5 files - these are in fact HDF5 files, a binary, compressed format that stores structured data in a single file and allows random access to subsets of that data. With a plethora of base callers now available, and several iterations of MinKNOW and the ONT chemistry, it really is very hard to keep up with all of the different FAST5 formats. Expect some difficulty.
Whilst ONT base-callers now output FASTQ, it is not well formatted FASTQ and we currently don’t recommend using it.
Using HDF5 command line tools
h5ls and h5dump can be quite useful.
h5ls reveals the structure of fast5 files.
h5ls /vol_c/public_data/minion_ecoli_sample/nanopore2_20170301_FNFAF09967_MN17024_mux_scan_170301_MG1655_PC_RAD002_76964_ch420_read41_strand.fast5
Analyses Group
Raw Group
UniqueGlobalKey Group
Adding the -r flag makes this recursive
h5ls -r /vol_c/public_data/minion_ecoli_sample/nanopore2_20170301_FNFAF09967_MN17024_mux_scan_170301_MG1655_PC_RAD002_76964_ch420_read41_strand.fast5
/ Group
/Analyses Group
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/BaseCalled_template Group
/Analyses/Basecall_1D_000/BaseCalled_template/Events Dataset {37390}
/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_000/Configuration Group
/Analyses/Basecall_1D_000/Configuration/basecall_1d Group
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_000/Summary/basecall_1d_template Group
/Analyses/Segmentation_000 Group
/Analyses/Segmentation_000/Configuration Group
/Analyses/Segmentation_000/Configuration/event_detection Group
/Analyses/Segmentation_000/Configuration/stall_removal Group
/Analyses/Segmentation_000/Summary Group
/Analyses/Segmentation_000/Summary/segmentation Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_41 Group
/Raw/Reads/Read_41/Signal Dataset {192202/Inf}
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group
Unsurprisingly h5dump dumps the entire file to STDOUT
h5dump /vol_c/public_data/minion_ecoli_sample/nanopore2_20170301_FNFAF09967_MN17024_mux_scan_170301_MG1655_PC_RAD002_76964_ch420_read41_strand.fast5
Browsing HDF5 files
Any HDF5 file can be opened using hdfview and browsed/edited in a GUI
hdfview /vol_c/public_data/minion_ecoli_sample/nanopore2_20170301_FNFAF09967_MN17024_mux_scan_170301_MG1655_PC_RAD002_76964_ch420_read41_strand.fast5 &
Basic QC in poRe
poRe is a library for R available from SourceForge and published in bioinformatics. poRe is incredibly simple to install and relies simply on R 3.0 or above and a few additional libraries.
The poRe library is set up to read v1.1 data by default, and offers users parameters to enable reading of v1.0 data.
You can start R from the command-line:
R
Alternatively, navigate to the Rstudio server for your given VM using any browser on your laptpop:
Log-in with your linux credentials
Then, within R:
library(poRe)
We can read a pre-computed meta-data file
meta <- read.table("/vol_c/public_data/minion_brown_metagenome/brown_metagenome.meta.txt", sep="\t", header=TRUE)
And plot yield over time
yield <- plot.cumulative.yield(meta)
And we can dp some ggplots too
library(ggplot2)
# 2D over time
ggplot(yield, aes(x=time,y=cum.2d)) + geom_line()
# all over time
melty <- melt(yield, id="time", variable.name="read.type", value.name="length")
ggplot(melty, aes(x=time,y=length, color=read.type)) + geom_line()
We can plot read lengths:
lens <- plot.length.histogram(meta)
To be honest, these are pretty bad, so there are some alternatives:
ggplot(data=meta, aes(tlen)) + geom_histogram()
Or plot all three:
meltl <- melt(meta[,c("filename","tlen","clen","len2d")], id="filename",
variable.name="read.type",
value.name="length")
# Overlaid histograms
ggplot(meltl, aes(x=length, fill=read.type))
+ geom_histogram(binwidth=1000, alpha=.5, position="identity")
# Interleaved histograms
ggplot(meltl, aes(x=length, fill=read.type))
+ geom_histogram(binwidth=1000, position="dodge")
# Density plots
ggplot(meltl, aes(x=length, colour=read.type))
+ geom_density()
# Density plots with semi-transparent fill
ggplot(meltl, aes(x=length, fill=read.type))
+ geom_density(alpha=.3)
Obviously this is old data and 2D has been retired, so let’s look at Nick’s whale data which is more recent:
# to do
Extracting metadata using poRe
By far the easiest and quickest way to extract metadata in poRe is by using the extractMeta script from the poRe_scripts repo:
extractMeta -h
This is a compute intensive operation, so it’s best to use multiple cores:
extractMeta -c 4 /vol_c/public_data/minion_brown_metagenome > brown_metagenome.meta.txt
Or for a larger sample
extractMeta -c 4 /vol_c/public_data/minion_ecoli_sample > ecoli_sample.meta.txt
These can then be loaded and visualised in R as above
Parsing the sequencing summary file
After base-calling with Albacore, you’ll find a file called sequencing_summary.txt. We can read this in R:
ss <- read.table("/mnt/porecamp_usa/outputs/monday_test_2/sequencing_summary.txt", sep="\t", header=TRUE)
head(ss)
Then we can plot yield over time:
# order by start time
so <- ss[order(ss$start_time),]
# get X and Y vectors
tplot <- so$start_time - mind
ctlen <- cumsum(so$sequence_length_template)
# plot it
plot(tplot, ctlen)
Extracting FASTQ/A using Poretools
Poretools is a somewhat popular package for extracting data and information from FAST5 files. Extracting FASTA and FASTQ is easy:
# extract FASTQ
poretools fastq directory_of_fast5
# extract FASTA
poretools fasta directory_of_fast5
The default is to extract all types of sequence data (template, complement and 2D) if present. You can control this with –type:
poretools fastq --type all test_data/
poretools fastq --type fwd test_data/
poretools fastq --type rev test_data/
poretools fastq --type 2D test_data/
poretools fastq --type fwd,rev test_data/
Extracting FASTQ/A using Nanopolish
Nanopolish has many cool features, but the one we will focus on here is FASTQ and FASTA extraction
nanopolish extract directory_of_fast5
By default nanopolish extracts FASTA. If you want FASTQ, then
nanopolish extract -q directory_of_fast5
Nanopolish defaults to “2d-or-template”, and this can be controlled using the –type option:
nanopolish extract -q --type=template directory_of_fast5
nanopolish extract -q --type=complement directory_of_fast5
nanopolish extract -q --type=2d directory_of_fast5
To recursively extract from a diretory, use -r
nanopolish extract -r directory_of_fast5
Extracting FASTQ/A using poRe
Again, we can do this using the extractSequence script from the poRe_scripts repo:
extractSequence -h
This tool will extract FASTA/Q which is identical in form to nanopolish, which is useful for downstream processing
Extracting FASTQ/A and Metadata using poRe GUIs
Start up R or RStduio and run:
pore_parallel()