10 Billion Piece Jigsaw Puzzles John Cleary Netvalue Ltd. Real Time Genomics.

Presentation transcript:

10 Billion Piece Jigsaw Puzzles John Cleary Netvalue Ltd. Real Time Genomics

Scale (number of pieces): hundred, thousand, 10 thousand, 100 thousand, million, 10 million, 100 million, billion, 10 billion, 100 billion

Genome, Transcriptome, Cancer

Genomes of …: human; reference species (mouse, chimp, arabidopsis, …); agricultural species (cattle, sheep, pig, …, rice, wheat, grape, …); bacterial (disease, human “ecosystem”)

Differences between …: individuals; populations; disease and “quantitative traits”; somatic and tumor genomes; transcriptome of child and parents; bacterial populations of individuals

Human Genome: 3 billion nucleotides

Shapes of the Jigsaw Pieces
Company               Lengths (nt)
Illumina
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*

Differences between genomes - SNPs
A C G T T A G T G A
A C G T T C G T G A
A C G T T G G T G A
~ 1 / 1,000; 3,000,000 nt
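As a rough illustration of the SNP slide, the sketch below compares two already-aligned sequences position by position and also multiplies out the slide's rate of about 1 difference per 1,000 nt over a 3 billion nt genome. Treating the first sequence as the reference is an assumption made here, and the function name `snp_positions` is hypothetical.

```python
def snp_positions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference base, sample base) wherever two aligned sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must already be aligned to the same length")
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


# The three 10 nt sequences from the slide; the first is assumed to be the reference.
reference = "ACGTTAGTGA"
samples = ["ACGTTCGTGA", "ACGTTGGTGA"]
for sample in samples:
    print(sample, snp_positions(reference, sample))

# Expected number of SNPs at roughly 1 per 1,000 nt in a 3 billion nt genome.
GENOME_LENGTH_NT = 3_000_000_000
SNP_RATE = 1 / 1_000
expected_snps = round(GENOME_LENGTH_NT * SNP_RATE)
print(expected_snps)
```

Positions are 0-based, so both samples differ from the assumed reference at index 5.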

REF: aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
SIM: T AAGAAT
CALL: T G
CALL: T T
READ: ATGTTTTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
READ: TTCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
READ: TCTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
READ: CTCAGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
READ: AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
READ: ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
READ: GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
READ: TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
READ: GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
READ: AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
READ: CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
READ: TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
READ: TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
READ: GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
READ: CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
READ: _ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
READ: TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
READ: GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
READ: TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
READ: GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
READ: GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
READ: TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
READ: CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
READ: TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
READ: CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
READ: TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGT
READ: AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCG
READ: AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTT
READ: ______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTTGCGTAGTTAGTGTTCG

Differences between human genomes - MNPs A C G T T A G T G A A C G T T C A G A A C G T T G T G A

Differences between human genomes - indels A C G T T A G T G A A C G T T G T G A A C G T T G G T G A ~ 1 / 10, ,000

Differences between genomes - inserts A C G T T A G T G A Up to 1,000,000 nt total 3,000,000 nt T T A G G A C C C A

Differences between genomes – structural variants: tandem repeat, inversion, copy

Solving the Jigsaw: indexing; alignment; SNP/MNP/indel/SV calling; mapping

Indexing
A C G T T A G T G A A G
A C G T T C G T G A A G
A C G T T C G T G A A G
A C G T T A G T G A A G
4.5 billion
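The slides do not say how the index is built. One common approach, shown here only as an assumed sketch, is a hash table from every k-mer (substring of length k) in the reference to the positions where it occurs. The function name `build_kmer_index` and the choice k = 4 are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer in the reference to the 0-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return dict(index)


# One of the 12 nt sequences from the Indexing slide, used as a toy reference.
index = build_kmer_index("ACGTTAGTGAAG", k=4)
print(index["GTGA"])
```

Looking up a k-mer taken from a read then gives candidate positions in the reference without scanning the whole genome.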

Aligning
A C G T T A G T G A A G
A C G T T C G T G A A G
1.6 billion
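To make the aligning step concrete, here is a minimal sketch that slides a read along a reference and keeps the placement with the fewest mismatches. It is a brute-force stand-in, not the method used in the talk, and it ignores indels. The reference string and the name `best_alignment` are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_alignment(read: str, reference: str) -> tuple[int, int]:
    """Return (position, mismatches) for the ungapped placement with fewest mismatches."""
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    candidates = (
        (hamming(read, reference[pos:pos + len(read)]), pos)
        for pos in range(len(reference) - len(read) + 1)
    )
    mismatches, position = min(candidates)  # ties go to the leftmost position
    return position, mismatches


# Hypothetical reference containing the slide's first sequence; the read is the second.
print(best_alignment("ACGTTCGTGAAG", "TTACGTTAGTGAAGCC"))
```

The single mismatch reported for this read is the A/C difference between the two sequences on the slide.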

Cutting Edge Run: human genome (3 billion nt); 1 billion reads of 100 nt; coverage of 30; indexing + aligning in 27 minutes
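The figures on this slide can be checked with a little arithmetic. The sketch below computes the nominal coverage (total sequenced bases divided by genome length) and the implied throughput. The variable names are illustrative, and the nominal result comes out slightly above the slide's rounded figure of 30.

```python
genome_nt = 3_000_000_000   # human genome length from the slide
reads = 1_000_000_000       # number of reads
read_length_nt = 100        # read length
minutes = 27                # indexing + aligning time

# Coverage: how many times, on average, each genome position is sequenced.
coverage = reads * read_length_nt / genome_nt
# Throughput implied by finishing all reads in the stated time.
reads_per_second = reads / (minutes * 60)

print(f"coverage ~ {coverage:.1f}x")
print(f"throughput ~ {reads_per_second:,.0f} reads/s")
```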

i7 Quad Core

2 sockets X 4 cores X 2 hyperthreads = GB RAM 10 computers 1 TB disk/genome = 500GB + 200GB + 200GB + 0.3GB X thousands of genomes

Paired End Reads 100 nt ,000 nt Index Align Index Align Match 100 nt

Solving the Jigsaw without the picture: indexing; alignment; assembly

T A G T G A A G A A T T
A C G T T C G T G A A G
A C G T T C G T G A A G
T A G T G A A G A A T T
A C G T T ? G T G A A G A A T T
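The assembly example above can be reproduced with a small overlap merge: find the longest suffix of one read that matches a prefix of the other within a mismatch budget, then join them and mark disagreeing positions with "?". This is a toy sketch under those assumptions, not a real assembler. The name `merge_reads` and its parameters are hypothetical.

```python
from typing import Optional


def merge_reads(left: str, right: str, min_overlap: int = 4,
                max_mismatches: int = 1) -> Optional[str]:
    """Merge two reads on their longest suffix/prefix overlap, or return None."""
    for overlap in range(min(len(left), len(right)), min_overlap - 1, -1):
        tail, head = left[-overlap:], right[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            # Keep agreed bases; mark positions where the reads disagree.
            consensus = "".join(a if a == b else "?" for a, b in zip(tail, head))
            return left[:-overlap] + consensus + right[overlap:]
    return None


# The two distinct 12 nt reads from the slide.
print(merge_reads("ACGTTCGTGAAG", "TAGTGAAGAATT"))
```

With one mismatch allowed, the reads overlap by 8 nt and the disagreeing base becomes the "?" in the merged sequence.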

SNP calling: 15A 13C: AC heterozygous SNP. 15A 4C, 5A 2C, 1A 2C: Bayesian statistics (SNPs 1/1,000). 31A 42C: throw it out.
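The slide names Bayesian statistics with a SNP prior of about 1/1,000 but gives no model. The sketch below is one simple possibility, stated as an assumption: reads are independent, each base is miscalled with a fixed error rate, the reference allele is A, and the 1/1,000 prior is split evenly between the heterozygous and homozygous-alternate genotypes. The function name, the error rate of 0.01, and the prior split are all illustrative choices.

```python
import math


def call_genotype(count_a: int, count_c: int, error_rate: float = 0.01,
                  variant_prior: float = 1 / 1000) -> tuple[str, dict[str, float]]:
    """Return the most probable genotype and the posterior over AA, AC, CC."""
    priors = {"AA": 1 - variant_prior, "AC": variant_prior / 2, "CC": variant_prior / 2}
    # Probability that a single read shows A under each genotype.
    prob_read_is_a = {"AA": 1 - error_rate, "AC": 0.5, "CC": error_rate}

    log_post = {
        g: math.log(priors[g]) + count_a * math.log(p) + count_c * math.log(1 - p)
        for g, p in prob_read_is_a.items()
    }
    # Normalise in log space to avoid underflow at high coverage.
    top = max(log_post.values())
    norm = top + math.log(sum(math.exp(v - top) for v in log_post.values()))
    posterior = {g: math.exp(v - norm) for g, v in log_post.items()}
    return max(posterior, key=posterior.get), posterior


for a, c in [(15, 13), (15, 4), (5, 2), (1, 2)]:
    genotype, posterior = call_genotype(a, c)
    print(f"{a}A {c}C -> {genotype} (posterior {posterior[genotype]:.3f})")
```

The borderline counts on the slide are sensitive to the assumed error rate and prior, which is why a probabilistic model is needed for them rather than a fixed threshold.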

Comparing twins: 3,000,000 SNPs. Do any of them differ between the twins? 15A 4C 3A 10C 3G

Gene: DNA → mRNA → protein

Cancer comparison

Copy Number Variants: Varying levels of extraction of reads across genome (use differences). Locate boundaries (as accurately as possible). Extract number of variants. Use SNPs.
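One way to read "use differences" is to compare read depth in a sample against a control over the same windows, so that uneven extraction across the genome cancels out. The sketch below follows that reading as an assumption: it converts the depth ratio into a rounded copy number per window and reports runs of equal copy number, whose edges approximate the boundaries. The depths, the window layout, and the name `copy_number_segments` are hypothetical, and both samples are assumed to be sequenced to the same overall depth.

```python
def copy_number_segments(sample_depth, control_depth, normal_copies=2):
    """Return (start_window, end_window_exclusive, copies) runs along the genome."""
    calls = []
    for sample, control in zip(sample_depth, control_depth):
        if control == 0:
            calls.append(None)  # no reads in the control: no information here
        else:
            calls.append(round(normal_copies * sample / control))

    segments = []
    start = 0
    for i in range(1, len(calls) + 1):
        # Close the current run at the end of the data or when the call changes.
        if i == len(calls) or calls[i] != calls[start]:
            segments.append((start, i, calls[start]))
            start = i
    return segments


# Made-up per-window read depths: windows 3 to 5 look like a gain to 3 copies.
sample = [30, 31, 29, 45, 46, 44, 30]
control = [30, 30, 30, 30, 30, 30, 30]
print(copy_number_segments(sample, control))
```

Boundaries found this way are only as precise as the window size; a real caller would refine them, for example with the SNP evidence the slide mentions.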

Metagenomics, or what is living on you: Mapping reads back onto a database of known bacteria/viruses. Many are ambiguous. Many don’t map at all. Estimate frequency of each species. Remove human “contamination”.

TS
gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Akkermansia muciniphila ATCC BAA
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ | Bifidobacterium adolescentis ATCC
TS
gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ |Bacteroides fragilis NCTC 9343 plasmid pBF
gi| |ref|NC_ |Campylobacter jejuni subsp. jejuni plasmid pTet
gi| |ref|NC_ |Eubacterium rectale ATCC
TS
gi| |ref|NC_ | Bacteroides thetaiotaomicron VPI-5482 plasmid p
gi| |ref|NC_ | Bacteroides vulgatus ATCC
gi| |ref|NC_ |Campylobacter jejuni subsp. jejuni plasmid pTet
gi| |ref|NC_ |Bifidobacterium longum NCC
gi| |ref|NC_ |Bifidobacterium longum DJO10A

Metagenomics: Map reads to database. Estimate most likely frequencies: a hill climbing estimation problem. Can anything be done about unmapped reads?
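The slide does not say which hill-climbing method is used. A standard choice for this kind of problem, shown here as an assumed sketch, is expectation-maximisation: each ambiguous read is shared among the species it maps to in proportion to the current frequency estimates, and the estimates are then recomputed, which never decreases the likelihood. The read data and the name `estimate_frequencies` are invented, and unmapped reads are assumed to have been filtered out beforehand.

```python
def estimate_frequencies(read_hits, iterations=100):
    """Estimate species frequencies from reads that may map to several species."""
    species = sorted({s for hits in read_hits for s in hits})
    freq = {s: 1 / len(species) for s in species}  # start from a uniform guess

    for _ in range(iterations):
        counts = dict.fromkeys(species, 0.0)
        for hits in read_hits:
            # E-step: split this read among its candidate species.
            total = sum(freq[s] for s in hits)
            for s in hits:
                counts[s] += freq[s] / total
        # M-step: new frequencies are the normalised fractional counts.
        freq = {s: counts[s] / len(read_hits) for s in species}
    return freq


# Made-up mappings: two reads unique to one species, one unique to another, one ambiguous.
reads = [
    {"Bacteroides vulgatus"},
    {"Bacteroides vulgatus"},
    {"Bacteroides vulgatus", "Akkermansia muciniphila"},
    {"Akkermansia muciniphila"},
]
print(estimate_frequencies(reads))
```

The ambiguous read ends up credited mostly to the species that the unique reads already favour, which is the behaviour the frequency estimate is meant to capture.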