10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.

10 Billion Piece Jigsaw Puzzles John Cleary Netvalue Ltd. Real Time Genomics.

Genome Exome Transcriptome Metagenome

Differences between …: individuals in populations; child and parents; cancer and host genome; large pedigrees of animals; bacterial populations inside individuals; bacterial populations in the world

Real-world problems … What is wrong with this newborn child? Why are these cells cancerous and what should we do about it? We have 6,000 individuals in 1,500 families with cleft palate – what causes this?

Real-world problems … There is a hard-to-treat infectious disease in a hospital ward – where did it come from and is it the same as the one at another hospital? Is this water safe to drink? …

Human genome: 3 billion nucleotides. Exome: 30 million nucleotides.

Shapes of the Jigsaw Pieces

Differences between human genomes - SNPs
A C G T T A G T G A
A C G T T C G T G A
A C G T T G G T G A
~ 1 / 1,000 (3,000,000 nt)
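As a minimal sketch of what a SNP means operationally, the snippet below lists the positions at which two already aligned, equal-length sequences differ, using the three sequences from the slide. The function name `snp_positions` is hypothetical and is not taken from the talk; the last lines only redo the slide's arithmetic (about 1 SNP per 1,000 nt over a 3 billion nt genome).

```python
def snp_positions(a: str, b: str) -> list[int]:
    """Return 0-based positions where two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# The three example sequences shown on the slide.
seqs = ["ACGTTAGTGA", "ACGTTCGTGA", "ACGTTGGTGA"]
for other in seqs[1:]:
    print(seqs[0], other, snp_positions(seqs[0], other))  # differs at position 5

# Expected number of SNP sites at ~1 per 1,000 nt in a 3 billion nt genome.
expected_snps = 3_000_000_000 // 1_000
print(expected_snps)  # 3000000
```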

Differences between human genomes - MNPs A C G T T A G T G A A C G T T C A G A A C G T T G T G A

Differences between human genomes - indels A C G T T A G T G A A C G T T G T G A A C G T T G G T G A ~ 1 / 10, ,000

Differences between human genomes - inserts A C G T T A G T G A Up to 1,000,000 nt total 3,000,000 nt T T A G G A C C C A

REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG

Solving the Jigsaw: indexing; alignment; SNP/MNP/indel calling; mapping

Indexing A C G T T A G T G A A G A C G T T C G T G A A G A C G T T C G T G A A G A C G T T A G T G A A G 4.5 billion
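The slide does not say how the index is built. One common approach, shown here purely as an assumed illustration and not as the method used in the talk, is a hash index of fixed-length substrings (k-mers) of the reference. Each k-mer of a read is then looked up to propose candidate start positions, which an aligner would go on to verify. The names `build_kmer_index` and `candidate_positions` and the choice `k=4` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def candidate_positions(index: dict[str, list[int]], read: str, k: int) -> set[int]:
    """Look up each k-mer of the read and return the read start positions they imply."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start >= 0:
                hits.add(start)
    return hits


# Toy reference made from two of the 12 nt sequences on the slide.
reference = "ACGTTAGTGAAG" + "ACGTTCGTGAAG"
index = build_kmer_index(reference, k=4)
print(sorted(candidate_positions(index, "ACGTTCGTGAAG", k=4)))  # [0, 12]
```

Both positions are reported because the two halves of the toy reference share most of their k-mers; separating the exact match at 12 from the near match at 0 is the job of the alignment step.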

Aligning A C G T T A G T G A A G A C G T T C G T G A A G 1.6 billion

Cutting Edge Run: human genome (3 billion nt); 1 billion reads of 100 nt; coverage of 30; indexing + aligning in 27 minutes
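The figures on this slide can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the function name `coverage` is hypothetical.

```python
def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each position of the genome."""
    return num_reads * read_length / genome_length


depth = coverage(num_reads=1_000_000_000, read_length=100, genome_length=3_000_000_000)
print(round(depth, 1))  # 33.3, which the slide rounds to a coverage of 30

# Throughput implied by indexing + aligning 1 billion reads in 27 minutes.
reads_per_second = 1_000_000_000 / (27 * 60)
print(round(reads_per_second))  # roughly 617,000 reads per second
```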

i7 Quad Core

2 sockets X 4 cores X 2 hyperthreads = GB RAM; 10 computers; 1 TB disk/genome = 500 GB + 200 GB + 200 GB + 0.3 GB; X thousands of genomes

Shapes of the Jigsaw Pieces

Paired End Reads 100 nt ,000 nt Index Align Index Align Match

Solving the Jigsaw without the picture: indexing; alignment; assembly

T A G T G A A G A A T T A C G T T C G T G A A G A C G T T C G T G A A G T A G T G A A G A A T T A C G T T ? G T G A A G A A T T
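The slide appears to show two reads, ACGTTCGTGAAG and TAGTGAAGAATT, being joined into ACGTT?GTGAAGAATT, with "?" marking the base on which they disagree. A minimal sketch of that single step follows, assuming a simple suffix/prefix overlap search with a small mismatch budget. The function names and the `min_len` and `max_mismatches` parameters are hypothetical, and a real assembler does far more than this.

```python
def best_overlap(left: str, right: str, min_len: int = 4, max_mismatches: int = 1) -> int:
    """Length of the longest suffix of `left` that matches a prefix of `right`
    within the mismatch budget, or 0 if no overlap of at least `min_len` exists."""
    for length in range(min(len(left), len(right)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(left[-length:], right[:length]))
        if mismatches <= max_mismatches:
            return length
    return 0


def merge(left: str, right: str, length: int) -> str:
    """Join two reads over an overlap of `length`, writing '?' where they disagree."""
    cut = len(left) - length
    overlap = "".join(x if x == y else "?" for x, y in zip(left[cut:], right[:length]))
    return left[:cut] + overlap + right[length:]


left, right = "ACGTTCGTGAAG", "TAGTGAAGAATT"
n = best_overlap(left, right)
print(n)                      # 8
print(merge(left, right, n))  # ACGTT?GTGAAGAATT
```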

SNP calling: 15A 13C: AC heterozygous SNP; 15A 4C, 5A 2C, 1A 2C: Bayesian statistics (SNPs 1/1,000); 31A 42C: throw it out

Lane Multiple technologies and read lengths SAM Calibration Mapping SNP calling VCF SNPs, MNPs, indels Filtering Complex regions

SNP calling - Diploid Bayesian: SAM; genome statistics; calibration; error model; priors; Bayesian model (A, C, G, T, A:C, A:G, A:T, C:G, C:T, G:T, …); log posteriors; counts filter; ambiguity filter; VCF; simple isolated SNP, insert; adjacent SNPs, inserts; complex region calling; SNPs, indels, MNPs
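A toy version of diploid Bayesian calling over the ten genotypes listed on the slide can be sketched as follows. It is an illustration under stated assumptions, not the talk's actual model: a single fixed per-base error rate (`error=0.01`) stands in for the calibrated error model, and the 1/1,000 SNP prior is spread evenly over the nine non-reference genotypes. All function and parameter names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 diploid genotypes: A, C, G, T (homozygous) and A:C, A:G, ... (heterozygous).
GENOTYPES = list(combinations_with_replacement(BASES, 2))


def log_posteriors(counts: dict[str, int], ref: str,
                   error: float = 0.01, snp_prior: float = 1e-3) -> dict:
    """Normalised log posterior of each genotype given base counts at one site."""
    hom_ref = (ref, ref)
    logs = {}
    for genotype in GENOTYPES:
        # Assumed prior: 1 - 1/1,000 for homozygous reference, the rest shared evenly.
        prior = 1.0 - snp_prior if genotype == hom_ref else snp_prior / (len(GENOTYPES) - 1)
        total = math.log(prior)
        for base, n in counts.items():
            # Each read comes from either allele with probability 1/2.
            p = sum(1.0 - error if base == allele else error / 3.0
                    for allele in genotype) / 2.0
            total += n * math.log(p)
        logs[genotype] = total
    # Normalise with log-sum-exp so the posteriors sum to 1.
    top = max(logs.values())
    norm = top + math.log(sum(math.exp(v - top) for v in logs.values()))
    return {g: v - norm for g, v in logs.items()}


def call_genotype(counts: dict[str, int], ref: str) -> tuple:
    """Return the genotype with the highest posterior."""
    post = log_posteriors(counts, ref)
    return max(post, key=post.get)


print(call_genotype({"A": 15, "C": 13}, ref="A"))  # ('A', 'C'): a clear heterozygous call
print(call_genotype({"A": 20}, ref="A"))           # ('A', 'A'): homozygous reference
```

With low or unbalanced counts the posteriors of competing genotypes become close, which is where the prior and the error model decide the call.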

Complex Region Calling: genome; aligned reads; modified genome; probabilistic realignment through all paths for each read against each modified genome
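"Realignment through all paths" can be illustrated with a dynamic program that sums the scores of every possible alignment of a read against a candidate sequence, instead of keeping only the best one. The sketch below is a simplified, unnormalised score and not the talk's model: the `error` and `gap` values are assumed constants, and the two candidate sequences are short hypothetical windows of a reference and of a modified genome carrying a one-base deletion.

```python
def all_paths_score(read: str, candidate: str, error: float = 0.01, gap: float = 0.001) -> float:
    """Sum over all global alignments of `read` against `candidate`.

    Diagonal steps are weighted by match/mismatch probability; horizontal and
    vertical steps (gaps) by a constant `gap` weight. Suitable only for short
    sequences, since plain floats are used rather than log space.
    """
    rows, cols = len(read) + 1, len(candidate) + 1
    f = [[0.0] * cols for _ in range(rows)]
    f[0][0] = 1.0
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                emit = 1.0 - error if read[i - 1] == candidate[j - 1] else error / 3.0
                total += emit * f[i - 1][j - 1]
            if i > 0:
                total += gap * f[i - 1][j]  # base in the read with no partner
            if j > 0:
                total += gap * f[i][j - 1]  # base in the candidate with no partner
            f[i][j] = total
    return f[-1][-1]


read = "ACGTTGTGA"
candidates = {"reference": "ACGTTAGTGA", "with deletion": "ACGTTGTGA"}
for name, seq in candidates.items():
    print(name, all_paths_score(read, seq))
```

The read scores much higher against the candidate that contains the deletion, which is the kind of evidence that would then be combined across all reads covering the region.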

Comparing twins: 3,000,000 SNPs. Do any of them differ between the twins? 15A 4C; 3A 10C 3G

DNA mRNA protein Gene

Cancer comparison

Copy Number Variants: varying levels of extraction of reads across genome (use differences); locate boundaries (as accurately as possible); extract number of variants; use in combination with calling SNPs
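A minimal sketch of the idea of using read-depth differences to find copy number changes and their boundaries: compare per-window read depth in a sample against a baseline, round the ratio to an integer copy number, and merge adjacent windows with the same value. Window-based depth ratios, the `ploidy=2` default and the function name are assumptions made for illustration; the slide does not describe the actual method.

```python
def copy_number_segments(sample_depth: list[float], baseline_depth: list[float],
                         ploidy: int = 2) -> list[tuple[int, int, int]]:
    """Return (start_window, end_window_exclusive, copy_number) segments.

    Assumes both lists have one positive depth value per window.
    """
    calls = [round(ploidy * s / b) for s, b in zip(sample_depth, baseline_depth)]
    segments = []
    start = 0
    for i in range(1, len(calls) + 1):
        # Close the current segment at the end of the data or when the call changes.
        if i == len(calls) or calls[i] != calls[start]:
            segments.append((start, i, calls[start]))
            start = i
    return segments


sample = [30, 31, 29, 45, 46, 44, 30, 15]
baseline = [30] * 8
print(copy_number_segments(sample, baseline))
# [(0, 3, 2), (3, 6, 3), (6, 7, 2), (7, 8, 1)]
```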

Large pedigrees

Chlorocebus pygerythrus

Metagenomics, or what is living on you: mapping reads back onto a database of known bacteria/viruses; many are ambiguous; many don’t map at all; estimate frequency of each species; remove human “contamination”

TS gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Akkermansia muciniphila ATCC BAA
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ | Bifidobacterium adolescentis ATCC
TS gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ |Bacteroides fragilis NCTC 9343 plasmid pBF
gi| |ref|NC_ |Campylobacter jejuni subsp. jejuni plasmid pTet
gi| |ref|NC_ |Eubacterium rectale ATCC
TS gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ |Campylobacter jejuni subsp. jejuni plasmid pTet
gi| |ref|NC_ |Bifidobacterium longum NCC
gi| |ref|NC_ |Bifidobacterium longum DJO10A

Metagenomics: map reads to database; estimate most likely frequencies (a hill-climbing estimation problem); can anything be done about unmapped reads?
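One standard way to estimate the most likely species frequencies when many reads map ambiguously is expectation-maximisation (EM), which climbs the likelihood at every iteration. The sketch below uses EM as an assumed stand-in for the hill-climbing estimator mentioned on the slide. It ignores genome length and mapping quality, takes only reads that mapped to at least one species, and the read data and function name are made up for illustration.

```python
def estimate_frequencies(read_hits: list[set[str]], iterations: int = 100) -> dict[str, float]:
    """Estimate species frequencies from reads that may map to several species.

    `read_hits` holds, for each mapped read, the non-empty set of species it maps to.
    """
    species = sorted({s for hits in read_hits for s in hits})
    theta = {s: 1.0 / len(species) for s in species}  # start from uniform frequencies
    for _ in range(iterations):
        totals = dict.fromkeys(species, 0.0)
        for hits in read_hits:
            # E-step: split each read among its candidate species in proportion to theta.
            weight = sum(theta[s] for s in hits)
            for s in hits:
                totals[s] += theta[s] / weight
        # M-step: new frequencies are the normalised fractional read counts.
        theta = {s: totals[s] / len(read_hits) for s in species}
    return theta


# Hypothetical data: 6 reads unique to one species, 3 ambiguous, 1 unique to the other.
reads = ([{"Bacteroides vulgatus"}] * 6
         + [{"Bacteroides vulgatus", "Bacteroides fragilis"}] * 3
         + [{"Bacteroides fragilis"}])
for name, freq in estimate_frequencies(reads).items():
    print(name, round(freq, 3))
```

In this example the ambiguous reads end up shared in the same 6:1 ratio as the uniquely mapped reads, so the estimates converge to 6/7 and 1/7.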

How do we get there? Software engineering (500,000 lines of code); algorithms; Bayesian statistics; testing calibration/simulation/analysis

How do we get there? Performance optimization: algorithms; disk I/O and compression; parallel execution; optimization for memory size; optimization for cache size; targeted code optimization