Problem 1: Analysis is done after sequencing. [Diagram: Sequencing → Analysis]
Problem 2: Much of your data is unnecessary. Shotgun data is randomly sampled, so you need high coverage for high sensitivity.
Problem 3: Current variant calling approaches are multipass. [Diagram: Data → Mapping → Sorting → Calling → Answer]
Problem 4: Allelic mapping bias favors the reference genome. [Figure: bias vs. number of differentiating polymorphisms; Stevenson et al., 2013 (BMC Genomics)]
Problem 5: Current approaches are often insensitive to indels. (Iqbal et al., Nat Gen 2012)
Why are we concerned at all!? Looking forward 5 years… (Navin et al., 2011)
Some basic math: • 1000 single cells from a tumor… • …sequenced to 40x haploid coverage with Illumina… • …yields 120 Gbp per cell… • …or 120 Tbp of data. • HiSeq X10 can do the sequencing in ~3 weeks. • The variant calling will require 2,000 CPU weeks… • …so, given ~2,000 computers, we can do this all in one month.
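The arithmetic above checks out in a few lines. (The ~3 Gbp haploid human genome size is an assumption not stated on the slide; the other figures come from the slide itself.)

```python
# Sanity check of the slide's numbers. The ~3 Gbp haploid genome size
# is an assumed value; coverage, cell count, and CPU-weeks are quoted
# directly from the slide.
GENOME_GBP = 3        # haploid human genome, Gbp (assumption)
COVERAGE = 40         # 40x haploid coverage
CELLS = 1000          # single cells from the tumor

per_cell_gbp = GENOME_GBP * COVERAGE        # data per cell, Gbp
total_tbp = per_cell_gbp * CELLS // 1000    # total data, Tbp
print(per_cell_gbp, total_tbp)              # 120 120

# 2,000 CPU-weeks of variant calling spread over ~2,000 machines is
# ~1 week, so sequencing (~3 weeks) plus calling fits in ~1 month.
print(2000 / 2000)                          # 1.0
```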
Similar math applies: • Pathogen detection in blood; • Environmental sequencing; • Sequencing rare DNA from circulating blood. • Two issues: • Volume of data & compute infrastructure; • Latency for clinical applications.
Can we improve this situation? • Tie directly into the machine as it generates sequence (Illumina, PacBio, and Nanopore can all do streaming, in theory). • Analyze data as it comes off; for some (many?) applications, can stop the run early if signal is detected. • Avoid using a reference genome for primary variant calling. • Easier indel detection, less allelic mapping bias. • Can still use the reference for interpretation. Does such a magical approach exist!?
~Digression: Digital normalization (a computational version of library normalization). Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B, you need 100x of A. Overkill! The high-coverage reads in sample A are unnecessary for assembly and, in fact, distract from it.
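The diginorm rule itself is tiny. A minimal sketch, assuming an exact Python dict in place of khmer's low-memory probabilistic counting table (K, CUTOFF, and the read are illustrative values):

```python
from statistics import median

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # coverage cutoff C (illustrative)
counts = {}   # khmer uses a probabilistic counting table here

def kmers(seq):
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]

def keep_read(read):
    """Keep a read only if its estimated coverage so far
    (median k-mer abundance) is below CUTOFF; single pass."""
    if median(counts.get(km, 0) for km in kmers(read)) < CUTOFF:
        for km in kmers(read):
            counts[km] = counts.get(km, 0) + 1
        return True
    return False

# Reads from an over-sampled locus saturate and get discarded:
read = "ATCGGCTAAGCTTGCACGTAGGATCCGTTAACGGATTCAG"  # toy 40 bp read
kept = [keep_read(read) for _ in range(100)]
print(sum(kept))  # 20 -- only ~CUTOFF copies are retained
```

Because the decision for each read depends only on k-mers already seen, the pass is streaming: no read is touched twice.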
Digital normalization is streaming
Some key points -- • Digital normalization is streaming. • Digital normalization is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass). • Currently primarily used as a prefilter for assembly, but it relies on an underlying abstraction (the De Bruijn graph) that is also used in variant calling.
Assembly now scales with richness, not diversity. • 10-100 fold decrease in memory requirements • 10-100 fold speed up in analysis
Diginorm is widely useful: 1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341) 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep) 3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al, 2013; pmid 24225886)
Anecdata: diginorm is used in Illumina long-read sequencing (?)
Diginorm is “lossy compression” • Nearly perfect from an information-theoretic perspective: • Discards 95% or more of the data for genomes. • Loses < 0.02% of the information.
Digital normalization => graph alignment. What we are actually doing at this stage is building a graph of all the reads, and aligning new reads to that graph.
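As a toy illustration of the graph-alignment idea (a hypothetical simplification: the "graph" here is just a k-mer set, where khmer uses a compact probabilistic De Bruijn graph):

```python
K = 5  # illustrative k-mer size

def kmers(seq):
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]

# Build the "graph" from reads seen so far.
graph = set()
for read in ["ACGTACGGAT", "CGTACGGATT"]:
    graph.update(kmers(read))

# Align a new read from the same locus carrying one substitution
# (G->T at position 6): every k-mer spanning the changed base is
# absent from the graph, flagging a likely error (or true variant).
new_read = "ACGTACTGAT"
hits = [km in graph for km in kmers(new_read)]
print(hits)  # [True, True, False, False, False, False]
```

The run of missing k-mers localizes the mismatch, which is what makes the same machinery usable for both error correction and variant detection.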
Error correction via graph alignment Jason Pell and Jordan Fish
Error correction on simulated E. coli data (1% error rate, 100x coverage):

              TP (corrected)      FP (mistakes)   TN (OK)        FN (missed)
  ideal       3,469,834 (99.1%)    8,186          460,655,449     31,731 (0.9%)
  1-pass      2,827,839 (80.8%)   30,254          460,633,381    673,726 (19.2%)
  1.2-pass    3,403,171 (97.2%)    8,764          460,654,871     98,394 (2.8%)

Jordan Fish and Jason Pell
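The quoted percentages are sensitivity, TP / (TP + FN), and can be recomputed directly from the table's counts:

```python
# Recompute per-pass sensitivity from the TP and FN columns above.
rows = {
    "ideal":    (3_469_834, 31_731),
    "1-pass":   (2_827_839, 673_726),
    "1.2-pass": (3_403_171, 98_394),
}
sens = {name: round(100 * tp / (tp + fn), 1)
        for name, (tp, fn) in rows.items()}
print(sens)  # {'ideal': 99.1, '1-pass': 80.8, '1.2-pass': 97.2}
```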
Error correction => variant calling. Single-pass, reference-free, tunable, streaming online variant calling.
Analysis is done after sequencing. [Diagram: Sequencing → Analysis]
Streaming with bases. [Diagram: as each base arrives, successive k-mers (k, k+1, k+2, …) are added to the graph, and variants are called incrementally.]
Integrate sequencing and analysis. Are we done yet? [Diagram: Sequencing ↔ Analysis]
The streaming approach also supports more compute-intensive interludes – remapping, etc. (Rimmer et al., 2014)
Streaming algorithms can be very efficient. [Diagram: Data → 1-pass → Answer] See also eXpress, Roberts et al., 2013.
So: reference-free variant calling • Streaming & online algorithm; single pass. • For real-time diagnostics, can be applied as bases are emitted from the sequencer. • Reference free: independent of reference bias. • Coverage of variants is adaptively adjusted to retain all signal. • Parameters are easily tuned, although the theory needs to be developed. • High sensitivity (e.g. C=50 in 100x coverage) => poor compression. • Low sensitivity (C=20) => good compression. • Can “subtract” the reference => novel structural variants. • (See: Cortex, Zam Iqbal.)
Two other features -- • More single-computer scalable than current approaches: low disk access, high parallelizability. • Openness – our software is free to use, reuse, remix; no intellectual property restrictions. (Hence “We hear Illumina is using it…”)
Prospectus for streaming variant detection • The underlying concept is sound and offers many advantages over current approaches; • We have proofs of concept implemented; • We know the underlying approach works well in amplification situations, as well; • Tuning and math/theory needed! • …grad students keep on getting poached by Amazon and Google. (This is becoming a serious problem.)
[Diagram: Raw data (~10-100 GB) → Compression → "Information" (~2 GB) → Analysis (~1 GB) → Database & integration.] Lossy compression can substantially reduce data size while retaining the information needed for later (re)analysis.
[Same diagram, annotated: save the raw data in cold storage; save the compressed "information" for reanalysis and investigation.]
Data integration? Once you have all the data, what do you do? "Business as usual simply cannot work." Looking at millions to billions of genomes. (David Haussler, 2014)
Data recipes Standardized (versioned, open, remixable, cloud) pipelines and protocols for sequence data analysis. See: khmer-recipes, khmer-protocols. Increases buy-in :)
Training! Lots of training planned at Davis – open workshops. ivory.idyll.org/blog/2014-davis-and-training.html Increases buy-in x 2!
Acknowledgements

Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Qingpeng Zhang, Tim Brom, Jordan Fish, Michael Crusoe.

Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI; Eran Andrechek, MSU.

Funding: USDA NIFA; NSF IOS; NIH NHGRI; NSF BEACON.