10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics
Genome Exome Transcriptome Metagenome
Differences between … Individuals in populations Child and parents Cancer and host genome Large pedigrees of animals Bacterial populations inside individuals Bacterial populations in the world
Real world problems … What is wrong with this newborn child? Why are these cells cancerous and what should we do about it? We have 6,000 individuals in 1,500 families with cleft palate: what causes this?
Real world problems … There is a hard-to-treat infectious disease in a hospital ward: where did it come from, and is it the same as the one at another hospital? Is this water safe to drink? …
Human Genome 3 billion nucleotides Exome 30 million nucleotides
Shapes of the Jigsaw Pieces
Differences between human genomes - SNPs
A C G T T A G T G A
A C G T T C G T G A
A C G T T G G T G A
~ 1 / 1,000 (3,000,000 nt)
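The SNP slide can be made concrete with a minimal sketch: compare two equal-length sequences position by position, and scale the roughly 1-in-1,000 rate up to the 3 billion nt genome. The function and constant names are illustrative and do not come from the talk.

```python
def snp_positions(a: str, b: str) -> list[int]:
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


GENOME_NT = 3_000_000_000   # human genome size quoted on the slides
SNP_SPACING = 1_000         # roughly one SNP every 1,000 nt

# Integer arithmetic keeps the estimate exact.
expected_snps = GENOME_NT // SNP_SPACING

reference = "ACGTTAGTGA"
print(snp_positions(reference, "ACGTTCGTGA"))   # [5]
print(expected_snps)                            # 3000000
```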
Differences between human genomes - MNPs A C G T T A G T G A A C G T T C A G A A C G T T G T G A
Differences between human genomes - indels A C G T T A G T G A A C G T T G T G A A C G T T G G T G A ~ 1 / 10, ,000
Differences between human genomes - inserts A C G T T A G T G A Up to 1,000,000 nt total 3,000,000 nt T T A G G A C C C A
REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG
Solving the Jigsaw Indexing Alignment SNP/MNP/Indel calling Mapping
Indexing A C G T T A G T G A A G A C G T T C G T G A A G A C G T T C G T G A A G A C G T T A G T G A A G 4.5 billion
Aligning A C G T T A G T G A A G A C G T T C G T G A A G 1.6 billion
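The indexing and aligning slides can be illustrated with a small seed-and-verify sketch. It assumes a plain k-mer hash index and ungapped verification by mismatch counting; the talk does not say which index or alignment method the production software uses, so the function names, the value of `k`, and the mismatch limit are all hypothetical.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align(read: str, reference: str, index, k: int, max_mismatches: int = 2):
    """Seed with each k-mer of the read, then verify by counting mismatches.

    Returns (start, mismatches) for the best ungapped placement, or None.
    """
    best = None
    tried = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start in tried or start < 0 or start + len(read) > len(reference):
                continue
            tried.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


# Toy reference: the slide's 12 nt sequence with made-up flanking bases.
reference = "TTGCA" + "ACGTTAGTGAAG" + "CCATG"
index = build_index(reference, k=4)
print(align("ACGTTCGTGAAG", reference, index, k=4))   # (5, 1)
```

The read differs from the reference at a single base, so it is placed at offset 5 with one mismatch. Gapped alignment, needed for indels, is left out of this sketch.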
Cutting Edge Run Human genome (3 billion nt) 1 billion reads of 100 nt (coverage of about 30×) Indexing + aligning in 27 minutes
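The figures on this slide can be checked with a few lines of arithmetic. The raw ratio comes out slightly above 30, which is consistent with the nominal coverage quoted; the throughput figure is simply derived from the stated 27 minutes. Variable names are illustrative.

```python
GENOME_NT = 3_000_000_000   # human genome size
READS = 1_000_000_000       # number of reads in the run
READ_LEN = 100              # nucleotides per read
MINUTES = 27                # indexing + aligning time

# Coverage is the total number of sequenced bases divided by the genome size.
coverage = READS * READ_LEN / GENOME_NT

# Average number of reads indexed and aligned per second over the run.
reads_per_second = READS / (MINUTES * 60)

print(f"coverage: {coverage:.1f}x")
print(f"throughput: {reads_per_second:,.0f} reads/s")
```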
i7 Quad Core
2 sockets × 4 cores × 2 hyperthreads = 16 threads, … GB RAM 10 computers 1 TB disk/genome = 500 GB + 200 GB + 200 GB + 0.3 GB × thousands of genomes
Shapes of the Jigsaw Pieces
Paired End Reads 100 nt ,000 nt Index Align Index Align Match
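The final "Match" step for paired-end reads can be sketched as follows: each mate is indexed and aligned on its own, then candidate positions are kept only when the two mates sit a plausible distance apart. The span limits below are hypothetical placeholders, because the fragment size on the slide did not survive extraction.

```python
def pair_hits(left_hits, right_hits, min_span, max_span):
    """Return (left, right) position pairs whose separation fits the expected span."""
    pairs = []
    for left in left_hits:
        for right in right_hits:
            if min_span <= right - left <= max_span:
                pairs.append((left, right))
    return pairs


# Hypothetical candidate positions for each mate, e.g. from a repeat-rich region.
left_candidates = [1_000, 52_000]
right_candidates = [1_350, 90_000]

# Hypothetical span limits; the real values depend on the sequencing library.
print(pair_hits(left_candidates, right_candidates, min_span=200, max_span=500))
# [(1000, 1350)]
```

Requiring a consistent pair is what lets a mate that maps ambiguously on its own be placed with confidence.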
Solving the Jigsaw without the picture Indexing Alignment Assembly
T A G T G A A G A A T T A C G T T C G T G A A G A C G T T C G T G A A G T A G T G A A G A A T T A C G T T ? G T G A A G A A T T
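The assembly slide shows two reads that overlap but disagree at one base, which produces the "?" in the merged sequence. A minimal sketch of that idea: find the longest suffix/prefix overlap that tolerates a small number of mismatches, and mark disagreements with "?". This is a toy pairwise merge under assumed thresholds, not the assembler described in the talk.

```python
def merge_with_overlap(left: str, right: str, min_overlap: int = 4, max_mismatches: int = 1):
    """Merge two reads on their longest suffix/prefix overlap.

    Positions where the reads disagree are written as '?'. Returns None when no
    overlap of at least min_overlap bases fits within max_mismatches.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        tail, head = left[-size:], right[:size]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            consensus = "".join(a if a == b else "?" for a, b in zip(tail, head))
            return left[:-size] + consensus + right[size:]
    return None


# The two reads from the slide.
print(merge_with_overlap("ACGTTCGTGAAG", "TAGTGAAGAATT"))   # ACGTT?GTGAAGAATT
```

With real data the "?" would be resolved by looking at how many reads support each base, which is where SNP calling comes in.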
SNP calling
15A 13C: A:C heterozygous SNP
15A 4C, 5A 2C, 1A 2C: Bayesian statistics (SNPs 1/1,000)
31A 42C: throw it out
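A minimal sketch of how Bayesian statistics can turn base counts such as 15 A and 13 C into a diploid call. It scores the ten diploid genotypes with a 1/1,000 SNP prior, as on the slide. The per-base error rate of 1%, the even split of the prior across non-reference genotypes, and all function names are assumptions made for illustration, not the model used in the talk's software.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(counts, ref_base, error_rate=0.01, snp_prior=1 / 1_000):
    """Posterior probability of each of the 10 diploid genotypes.

    counts maps a base to the number of reads showing it at this site.
    """
    log_post = {}
    for genotype in combinations_with_replacement(BASES, 2):
        # Assumed prior: homozygous reference gets 1 - snp_prior and the
        # other nine genotypes share snp_prior equally.
        if genotype == (ref_base, ref_base):
            prior = 1 - snp_prior
        else:
            prior = snp_prior / 9
        log_lik = 0.0
        for base, n in counts.items():
            # Chance that a read drawn from either allele shows `base`.
            p = sum(
                (1 - error_rate) if allele == base else error_rate / 3
                for allele in genotype
            ) / 2
            log_lik += n * math.log(p)
        log_post[genotype] = math.log(prior) + log_lik
    # Normalise in log space to avoid underflow.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call(counts, ref_base="A"):
    """Return the most probable genotype and its posterior."""
    post = genotype_posteriors(counts, ref_base)
    best = max(post, key=post.get)
    return "".join(best), post[best]


for counts in ({"A": 15, "C": 13}, {"A": 15, "C": 4}, {"A": 5, "C": 2}, {"A": 1, "C": 2}):
    print(counts, call(counts))
```

Under these assumptions the balanced 15/13 site is a confident heterozygous call, while the low-count sites sit close to the decision boundary, which is why the prior matters. A site with far more reads than the expected coverage would normally be filtered out before this step.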
Lane Multiple technologies and read lengths SAM Calibration Mapping SNP calling VCF SNPs, MNPs, indels Filtering Complex regions
SNP calling - Diploid Bayesian SAM Genome statistics Calibration Error model Priors Bayesian Model A C G T A:C A:G A:T C:G C:T G:T … log posteriors Counts filter Ambiguity filter VCF Simple isolated SNP insert Adjacent SNPs, inserts Complex region calling SNPs, indels, MNPs
Complex Region Calling Genome Aligned Reads Modified Genome Probabilistic realignment through all paths for each read against each modified genome
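"Probabilistic realignment through all paths" can be sketched as a dynamic program that sums the probability of every alignment of a read against a candidate (modified) genome, instead of keeping only the single best alignment. The version below is a deliberately simplified single-state model with assumed mismatch and gap probabilities. It is not a full pair HMM and is not claimed to be the method used in the talk's software.

```python
import math


def read_likelihood(read: str, haplotype: str, mismatch=0.01, gap=0.001) -> float:
    """Sum the probability of every global alignment path of read vs haplotype."""
    n, m = len(read), len(haplotype)
    diagonal = 1.0 - 2.0 * gap   # probability of a base-to-base step
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                same = read[i - 1] == haplotype[j - 1]
                emit = 1.0 - mismatch if same else mismatch / 3.0
                total += f[i - 1][j - 1] * diagonal * emit
            if i > 0:   # read base with no haplotype partner
                total += f[i - 1][j] * gap
            if j > 0:   # haplotype base skipped by the read
                total += f[i][j - 1] * gap
            f[i][j] = total
    return f[n][m]


def best_haplotype(reads, candidates):
    """Pick the candidate with the highest summed log-likelihood over all reads."""
    scores = {
        h: sum(math.log(read_likelihood(r, h)) for r in reads) for h in candidates
    }
    return max(scores, key=scores.get)


# Illustrative window around an insertion: the reference and one modified version.
candidates = ["GCTGCAATCTGCA", "GCTGCAAGAATAATCTGCA"]
reads = ["GCTGCAAGAATAATCTGCA", "GCTGCAAGAATAATCTGCA"]
print(best_haplotype(reads, candidates))
```

Because every path contributes, reads that place the same insertion at slightly different positions all end up supporting the same modified genome.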
Comparing twins 3,000,000 SNPs Do any of them differ between the twins? 15A 4C 3A 10C 3G
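One simple way to ask whether a site differs between the twins is to compare the two sets of base counts directly. The sketch below computes a Pearson chi-square statistic in plain Python. It assumes the slide's counts mean 15 A and 4 C in one twin against 3 A, 10 C and 3 G in the other. The chi-square test is a stand-in chosen for illustration; the talk does not state which test is used, and with counts this small an exact test would be preferable.

```python
def chi_square(table):
    """Pearson chi-square statistic for a rows x columns table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Counts of (A, C, G) at one site, read from the slide.
twin_1 = [15, 4, 0]
twin_2 = [3, 10, 3]

stat = chi_square([twin_1, twin_2])
CRITICAL_DF2_P01 = 9.21   # chi-square critical value for 2 degrees of freedom at p = 0.01
print(f"chi-square = {stat:.2f}, differs at p < 0.01: {stat > CRITICAL_DF2_P01}")
```

With 3,000,000 candidate sites, a per-site threshold like this would still let many chance differences through, so any real comparison needs a much stricter cut-off or a correction for multiple testing.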
DNA mRNA protein Gene
Cancer comparison
Copy Number Variants Varying levels of extraction of reads across genome (use differences) Locate boundaries (as accurately as possible) Extract number of variants Use in combination with calling SNPs
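The copy number steps above can be sketched with windowed read depth: take the ratio of sample depth to a control depth per window, convert it to an integer copy number, and report boundaries where that number changes. The window depths, the rounding rule, and the function name are assumptions for illustration only.

```python
def copy_number_segments(sample_depth, control_depth, ploidy=2):
    """Estimate copy number per window and merge runs of equal windows.

    Returns (start_window, end_window_exclusive, copy_number) tuples.
    """
    estimates = []
    for sample, control in zip(sample_depth, control_depth):
        ratio = sample / control if control > 0 else 1.0
        estimates.append(round(ploidy * ratio))
    segments = []
    start = 0
    for i in range(1, len(estimates) + 1):
        # Close the current segment at the end of the data or at a change.
        if i == len(estimates) or estimates[i] != estimates[start]:
            segments.append((start, i, estimates[start]))
            start = i
    return segments


# Hypothetical mean read depth per window for a sample and a matched control.
sample = [30, 31, 29, 46, 44, 45, 30, 15, 14, 30]
control = [30] * 10
print(copy_number_segments(sample, control))
# [(0, 3, 2), (3, 6, 3), (6, 7, 2), (7, 9, 1), (9, 10, 2)]
```

Using the ratio against a control, rather than raw depth, cancels out regions that are always over- or under-sequenced. Boundaries here are only as precise as the window size.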
Large pedigrees
Chlorocebus pygerythrus
Metagenomics or what is living on you Mapping reads back onto a database of known bacteria/viruses Many are ambiguous Many don’t map at all Estimate frequency of each species Remove human “contamination”
TS
gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Akkermansia muciniphila ATCC BAA
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ | Bifidobacterium adolescentis ATCC
TS
gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ | Bacteroides fragilis NCTC 9343 plasmid pBF
gi| |ref|NC_ | Campylobacter jejuni subsp. jejuni plasmid pTet
gi| |ref|NC_ | Eubacterium rectale ATCC
TS
gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ | Campylobacter jejuni subsp. jejuni plasmid pTet
gi| |ref|NC_ | Bifidobacterium longum NCC
gi| |ref|NC_ | Bifidobacterium longum DJO10A
Metagenomics Map reads to database Estimate most likely frequencies (a hill-climbing estimation problem) Can anything be done about unmapped reads?
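The frequency estimation step can be sketched with expectation-maximisation (EM), one common hill-climbing scheme in which each iteration never lowers the likelihood. Ambiguous reads are shared between the species they map to in proportion to the current frequency estimates, and the estimates are then recomputed. EM is swapped in here as an illustration. The talk only says the problem is solved by hill climbing, and the species labels and read counts below are made up.

```python
def estimate_frequencies(read_hits, species, iterations=100):
    """Estimate species frequencies from ambiguous read mappings by EM.

    read_hits is a list of sets; each set holds the species one read maps to.
    Reads that map nowhere (empty sets) are ignored.
    """
    freq = {s: 1.0 / len(species) for s in species}
    for _ in range(iterations):
        expected = {s: 0.0 for s in species}
        for hits in read_hits:
            weight = sum(freq[s] for s in hits)
            if weight == 0:
                continue
            for s in hits:
                # E-step: share the read among its candidate species.
                expected[s] += freq[s] / weight
        total = sum(expected.values())
        if total == 0:
            break
        # M-step: the new frequencies are the normalised expected counts.
        freq = {s: expected[s] / total for s in species}
    return freq


species = ["B. vulgatus", "B. fragilis", "E. rectale"]
read_hits = (
    [{"B. vulgatus"}] * 6
    + [{"B. fragilis"}] * 2
    + [{"B. vulgatus", "B. fragilis"}] * 4   # ambiguous reads
    + [set()]                                # an unmapped read
)
print(estimate_frequencies(read_hits, species))
```

In this toy case the ambiguous reads are split 3:1, following the ratio of uniquely mapped reads, so the estimates settle at 0.75 and 0.25 with nothing assigned to the third species.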
How do we get there? Software engineering (500,000 lines of code) Algorithms Bayesian statistics Testing calibration/simulation/analysis
How do we get there? Performance optimization algorithms disk I/O and compression parallel execution optimization for memory size optimization for cache size targeted code optimization