From First Assembly Towards a New Cyberpharmaceutical Computing Paradigm Sorin Istrail Senior Director, Informatics Research.

Stan Ulam’s Vision “Don’t ask what Mathematics can do for Biology, ask what Biology can do for Mathematics.”

The Gene Counting Problem “The more I see the less I know for sure.” John Lennon

The O.J. Simpson Problem No matter how good the data look, it is full of errors! David Botstein, Albuquerque, 1994

Science and Technology
• Genomics
• Comparative Genomics
• Proteomics
• Pharmacogenomics
• Structural Genomics
• Drug and Vaccine Design
• DNA Expression Chips
• Animal Models

GENOMICS

There is Nothing More Important than the Assembly!
Gene Myers and Granger Sutton and the Assembly Team
Celera Human Assembly: Largest Non-Defence Computation
• 20,000 CPU hours, done 5 times now
• 160 processors, Compaq architecture

Assembly Progression (Macro View)

Generally 3-10 pairs link each consecutive contig

Completed Microbial Genomes (timeline figure spanning 1996 to 1999; parentheses as on the original slide)
• Methanococcus jannaschii
• Mycoplasma genitalium
• Haemophilus influenzae
• Archaeoglobus fulgidus
• Methanobacterium thermoautotrophicum
• Saccharomyces cerevisiae
• Mycoplasma pneumoniae
• Treponema pallidum
• Borrelia burgdorferi
• Helicobacter pylori
• Bacillus subtilis
• Escherichia coli
• Aquifex aeolicus
• Mycobacterium tuberculosis H37Rv
• Pyrococcus horikoshii
• Chlamydia trachomatis
• Synechocystis sp.
• (Mycobacterium tuberculosis CSU#93)
• (Deinococcus radiodurans)
• (Thermotoga maritima)
• (Rickettsia prowazekii)

GENOMES SEQUENCED AT TIGR and CELERA
(* The Institute for Genomic Research; ** Celera Genomics)
Pathogens: *Haemophilus influenzae Rd, *Mycoplasma genitalium, *Helicobacter pylori, *Borrelia burgdorferi, *Treponema pallidum, *Plasmodium falciparum, *Neisseria meningitidis, *Chlamydia trachomatis, *Chlamydia pneumoniae, *Vibrio cholerae, *Streptococcus pneumoniae, *Mycobacterium tuberculosis, *Porphyromonas gingivalis, *Trypanosoma brucei, *Staphylococcus aureus, *Enterococcus faecalis, *Chlamydia psittaci
Plants: *Arabidopsis thaliana
Environment: *Methanococcus jannaschii, *Archaeoglobus fulgidus, *Thermotoga maritima, *Deinococcus radiodurans, *Chlorobium tepidum, *Caulobacter crescentus, *Shewanella putrefaciens, *Desulfovibrio vulgaris, *Pseudomonas putida
Insects: **Drosophila melanogaster
Mammals: **Human, **Mouse

Genesis of Celera, August 1998
• The new 3700 automated DNA sequencer changed what was possible in sequencing
• Combined with the TIGR whole genome sequencing strategy
• And 64-bit computing

Celera’s Sequencing / SNP Discovery Center

Celera Supercomputing Facility
• Celera's system is one of the most powerful civilian supercomputing facilities in the world
• Currently over 1.5 teraflops of computing power in a virtual compute farm of Compaq processors with 100 terabytes of storage
• Next phase: a 100-teraflop computer

Obstacles to Genome Sequencing
• Sequencing reactions produce short reads (~550 bp).
  – Human genome: ~3 billion bases
  – Sequence read: ~550 bases
• The human genome is repeat-rich. Many short reads look identical to each other.
  – Example reads: GCATTA...GACCGT, CGGATAGACATAAC, CAGCAGCAGCAGCA

Comparison of Sequencing Strategies
1. Mapping and Walking
2. Mapping and Clone by Clone Shotgun
3. Whole Genome Shotgun with Mate Pairs
The strategies are ordered from lab-intense (SLOW) to compute-intense (FAST).

Clone by Clone Shotgun Sequencing (Mapping and Shotgun)
1) Replicate mapped spans of DNA. (Figure labels: Chromosome; Mapped span (BAC); 35,000)
2) Shear the replicates randomly and sequence the pieces. (Example read: cgattc)
3) Assemble reads by overlap matching. Infer the original sequence by consensus.
   Computed overlaps: cgattc
   Computed sequence: cgattcggattctcgattctacgaa
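Step 3, assembling reads by overlap matching, can be illustrated with a toy greedy merger. This is only a sketch under simplifying assumptions: error-free reads, a single strand, no repeats, and no consensus step. The function names, the `min_len` parameter, and the example reads are hypothetical and are not taken from the Celera assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads sampled from the toy sequence ATGGCGTACCTGAAGTC.
toy_reads = ["CCTGAAGTC", "ATGGCGTAC", "GCGTACCTGA"]
contigs = greedy_assemble(toy_reads)
```

On repeat-rich input a greedy merger like this can join the wrong reads, which is the obstacle the mate-pair strategy is meant to address.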

Mate-Pair Shotgun DNA Sequencing
1. DNA target sample
2. SHEAR & SIZE (e.g., 10Kbp ± 8% std. dev.)
3. CLONE & END SEQUENCE
4. End Reads / Mate Pairs (figure labels: 590bp reads at the ends of a 10,000bp insert)

Whole Genome Sequencing Approaches
• Whole Genome Shotgun Sequencing:
  – Read pairs from inserts of 2Kbp, 10Kbp, and 50Kbp: ~27 million reads for Human.
  – Collect 10-15x BAC inserts and end sequence (BAC 5' and BAC 3' ends): ~300K pairs for Human.
  – Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
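The read counts above translate into fold coverage by simple arithmetic: coverage = (number of reads × read length) / genome size. The sketch below uses the round figures from these slides (~27 million reads, ~550 bases per read, ~3 billion bases). The covered-fraction estimate assumes reads land uniformly at random, which is the idealized Lander-Waterman model and not a claim from the slides; the function names are hypothetical.

```python
import math


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def expected_fraction_covered(coverage: float) -> float:
    """Fraction of bases hit by at least one read, assuming uniformly
    random read placement (an idealized model that ignores repeats)."""
    return 1.0 - math.exp(-coverage)


# Round figures from the slides; real read lengths vary.
coverage = fold_coverage(27_000_000, 550, 3_000_000_000)
covered = expected_fraction_covered(coverage)
print(f"coverage ~ {coverage:.2f}x, expected fraction covered ~ {covered:.4f}")
```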

Building Scaffolds
• Mated reads: confirmed if at least 2 join the same unitigs and one of them is a U-unitig.
• 1 in 10^15 chance that a confirmed pair is in error.
1. 2k-10k Scaffolds: Compute all "unitigs" in the graph of U-unitigs connected by confirmed mate links.
2. BAC Scaffolds: Compute all "unitigs" in the graph of 2/10K scaffolds connected by confirmed BAC links.
Figure labels: Scaffold; Sequence or Repeat Gaps (with estimated distances)
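The confirmation rule for mate links can be sketched as a counting step. This is a minimal illustration, assuming each mate pair has already been mapped to the two unitigs that contain its reads; it ignores orientation and insert-size consistency, which a real scaffolder would also check. The function name, `min_support`, and the unitig identifiers are hypothetical.

```python
from collections import Counter


def confirmed_links(mate_links, u_unitigs, min_support: int = 2):
    """Return the unitig pairs joined by at least min_support mate pairs,
    where at least one unitig of the pair is a U-unitig."""
    counts = Counter(
        frozenset(pair) for pair in mate_links if pair[0] != pair[1]
    )
    return {
        pair
        for pair, n in counts.items()
        if n >= min_support and pair & u_unitigs
    }


# Each tuple names the two unitigs hit by the two reads of one mate pair.
links = [
    ("u1", "u2"), ("u2", "u1"),   # two mates agree: confirmed
    ("u2", "u3"),                 # only one mate: not confirmed
    ("r1", "r2"), ("r1", "r2"),   # two mates, but neither is a U-unitig
]
confirmed = confirmed_links(links, u_unitigs={"u1", "u2", "u3"})
```

Scaffolds would then be built by following these confirmed links between unitigs.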

The Gene Counting Problem
• The number will probably never be known exactly.
• Current estimates: 30,000-40,000
• Other estimates: 120,000
• Gene discovery:
  – sequence analysis
  – motif recognition
  – matches to mRNA
  – computational predictions
  – mouse data matches
  – experimental validation

Gene Counting
• Random ESTs from tissues
• CpG islands (55% of known genes are in CpG islands)
• Complexity of EST data sets: sampling is biased by tissue and depth of collection
• Underrepresented in databases: low-abundance genes, and genes in inaccessible tissues or developmental stages
• Overrepresented: EST data sets are composed of incomplete sequences of mRNA, and non-overlapping pieces of the same mRNA

Functional Assignment using Gene Ontology 13,601 Genes Drosophila

Gene Number in the Human Genome

Haemophilus vs. Drosophila

                         H. flu    Drosophila    X
Genome Size (Mbp)        1.8       120           67
Sequences                26,000    3,100,000     116
Months in sequencing     4         4             1
Sequencing Staff         24        50            2.1
Assembly Group Staff     1         10            10

Human Genome Sequence from 5 Humans (3 females, 2 males) completed
• Human sequencing started 9/8/99
• Over 39X coverage of the genome in paired plasmid reads
• First Assembly announced June 26: 2.9 billion bp
• Published in Science, February 16, 2001

BDGP STS Order Validation Against STS-map
• Scaffolds were aligned against the BDGP STS-content map.
• All scaffolds spanning 2 or more STSs were checked for order discrepancies.
• 16 STS sites out of 2175 (0.73%) were out of order, well within the estimated error rate of the STS map. 10 have been determined to be incorrect.
Figure: Celera scaffold and STS order for chromosome arms 2L, 2R, 3L, 3R, X, and 4.

Components vs. GeneMap ‘99

Order & Orientation is Essential to Finding Genes
• Correct layout: Exon 1, Exon 2, Exon 3, Exon 4
• But if contigs are not correctly put together: 1, 4, 3 (reversed), 2
• Exons are shuffled and unoriented, significantly impacting the ability of gene finding programs to make a correct prediction.
• Users consistently report finding genes that they can't find elsewhere.

Contactin-associated protein gene (CNTNAP2)
Comparison of genomic DNA sequences retrieved from the public working draft and the Celera database. Genomics 73, 108-112, 2001. http://www.idealibrary.com
Figure panels: Working draft; Celera

Scaffold Sizes
Figure: percent of genome covered versus scaffold length (Mbp), for Mouse WGA, Human WGA, and Human CSA.

Scaffold Sizes
Figure: percent of genome covered versus scaffold length (Mbp), for Mouse WGA, Human WGA, Human CSA, Drosophila WGA, and Celera-only WGA.

Human and Mouse Assemblies

All
Assembly      Span (Mbp)   Scaffolds   Gap (Mbp)   Gaps        30K %   100K %
WGA           2,847        119,000     261         221,000     90.4    88.6
CSA           2,905        53,000      252         170,000     94.6    92.9
C-only WGA    2,781        6,500       134         *182,000    99.0    98.7
Mouse WGA     2,446        19,778      212         265,000     96.8    95.5

30K
Assembly      Span (Mbp)   Scaffolds   Gap (Mbp)   Gaps
WGA           2,574        2,507       240         99,000
CSA           2,748        2,845       224         112,000
C-only WGA    2,754        537         118         174,000
Mouse WGA     2,367        1,779       193         242,000

The Human Genome is NOT
• THE Book of Life
• The Blueprint of Humanity
• The Language of God
• The Parts List of Humanity

Molecular Function of Predicted proteins

BLAST, FASTA, and SIM4 Sorin Istrail Celera Genomics

BLAST (Basic Local Alignment Search Tool)
• A suite of sequence comparison algorithms, optimized for speed, used to search sequence databases for optimal local alignments to a protein or nucleotide query

References:
Altschul, Gish, Miller, Myers, Lipman, "Basic Local Alignment Search Tool", J. Mol. Biol. 215(3):403-10 (1990)
Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", NAR 25(17):3389-402 (1997) (and references therein)

Program    Query                           Database
blastp     protein                         protein
blastn     DNA                             DNA
blastx     DNA (translated in 6 frames)    protein
tblastn    protein                         DNA (translated in 6 frames)
tblastx    DNA (translated in 6 frames)    DNA (translated in 6 frames)

The BLAST algorithm
• Detect all word hits (exact, or nearly identical matches) of a given length between the two sequences
  – k=10 for nucleotide sequences (exact word matches)
  – k=3 for protein sequences (nearly identical word matches)
• Extend the word hits in both directions to high-scoring gap-free segment pairs (HSPs)
  – retain only HSPs that score above a threshold
  – start from the center of the HSP (original BLAST, 1990), or from the center of a pair of HSPs located close to each other on the same diagonal (gapped BLAST, 1997)
• Extend the HSPs in both directions allowing for gaps
  – use dynamic programming, and stop when the alignment score falls more than a threshold X below the best score yet seen
• Report all statistically significant local alignments
  – E-value (starting with BLAST 2.0) is used to measure the statistical significance
  – E-value = the number of alignments with score equal to or higher than s one would expect to find by chance when searching the database
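The first two stages, word-hit detection and gap-free extension with an X-drop stopping rule, can be sketched for nucleotide sequences as follows. This is a toy illustration, not the BLAST implementation: the scoring values, the default `x_drop`, the small word size used in the example, and all function names are assumptions chosen for readability.

```python
from collections import defaultdict

MATCH, MISMATCH = 1, -3  # assumed toy nucleotide scores


def word_hits(query: str, subject: str, k: int):
    """All (query_pos, subject_pos) pairs where an exact k-mer matches."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [
        (i, j)
        for i in range(len(query) - k + 1)
        for j in index.get(query[i:i + k], [])
    ]


def extend(a: str, b: str, x_drop: int):
    """Gap-free extension from the start of a and b. Stops once the running
    score falls more than x_drop below the best score seen so far.
    Returns (best_score, length_at_best_score)."""
    score = best = best_len = 0
    for n, (x, y) in enumerate(zip(a, b), start=1):
        score += MATCH if x == y else MISMATCH
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break
    return best, best_len


def ungapped_hsp(query: str, subject: str, i: int, j: int, k: int, x_drop: int = 5):
    """Extend the word hit at (i, j) in both directions.
    Returns (score, query_start, subject_start, length)."""
    right_score, right_len = extend(query[i + k:], subject[j + k:], x_drop)
    left_score, left_len = extend(query[:i][::-1], subject[:j][::-1], x_drop)
    score = k * MATCH + left_score + right_score
    return score, i - left_len, j - left_len, left_len + k + right_len


query, subject = "AAACGTACGTTT", "CCACGTACGTGG"
hits = word_hits(query, subject, k=4)
hsp = ungapped_hsp(query, subject, 2, 2, k=4)
```

A fuller version would keep only HSPs above a score threshold and then run the gapped, dynamic-programming extension described in the third bullet.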

FASTA
• A program for rapid alignment of pairs of protein and DNA sequences, building a local alignment from matching sequence patterns, or words
• Algorithm for comparing a query to a database of sequences
  For each database sequence:
  – Identify the 10 diagonal regions having the largest number of perfect word matches of a given length
    (word size: k=1,2 for protein, and k=6-10 for nucleotide searches)
  – Re-score these regions using a given scoring matrix (e.g., PAM250), and trim them to form (gap-free) maximal scoring initial regions
  – Join (non-overlapping) initial regions from adjacent diagonals to generate longer regions, allowing for gaps
  – Re-score these based on the initial regions' scores, assessing a penalty for each joining
  Align the query sequence to each of the sequences in the search set having the highest overall scores

Reference:
Pearson and Lipman, "Improved tools for biological sequence comparison", Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).
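The first FASTA step, finding the diagonals with the most perfect word matches, can be sketched by indexing the k-mers of the query and tallying matches by diagonal (subject position minus query position). This simplification counts matches along whole diagonals, whereas FASTA scores localized regions within them; the function name and the example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def top_diagonals(query: str, subject: str, k: int, top: int = 10):
    """Return up to `top` (diagonal, word_match_count) pairs, best first.
    A diagonal is subject_pos - query_pos of a matching k-mer."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    diagonal_counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonal_counts[j - i] += 1
    return diagonal_counts.most_common(top)


# The query occurs in the subject shifted by two positions,
# so diagonal 2 collects the most word matches.
diagonals = top_diagonals("ACGTACGT", "TTACGTACGT", k=4)
```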

Sim4
• Aligns an expressed DNA (EST, cDNA, mRNA) sequence with a genomic sequence for that gene, allowing for introns and sequencing errors
Figure: a cDNA aligned to a genomic sequence (5' to 3') containing Exon 1, Exon 2, and Exon 3, separated by introns that begin with GT and end with AG.

Reference:
Florea, Hartzell, Zhang, Rubin, Miller, "A computer program for aligning expressed DNA and genomic sequences", Genome Res 8(9):967-74 (1998)

Stages and algorithmic techniques
• Detect basic homology blocks
  – Determine gap-free matches (HSPs) using a 'blast'-like homology search
    · Detect all exact word matches of length k (e.g., k=12)
    · Extend the word hits in both directions, by substitutions, to gap-free high-scoring segment pairs (HSPs)
    · Retain only HSPs scoring above a threshold
  – Connect the HSPs to form larger blocks ('exon cores') using sparse dynamic programming
• Extend or trim the exon cores to eliminate gaps or overlaps in the cDNA sequence
  – Extend the similarity blocks using fast greedy sequence comparison algorithms
  – Detect new exon cores with the 'blast'-like homology search tuned for higher sensitivity
• Refine the introns
  – Predict the locations of splice junctions using a combined measure of the accuracy of alignment and the intensity of splice signals at the ends of each intron
• Generate the spliced alignment
  – Align the sequences within individual exons using greedy alignment algorithms
  – Connect the chain of exon alignments by gaps (introns)
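The step that connects HSPs into exon cores amounts to picking the best-scoring chain of HSPs that appear in the same order in both the genomic and the cDNA sequence. The sketch below uses a plain quadratic dynamic program in place of the sparse dynamic programming named on the slide, and it applies no gap or intron penalties. The tuple layout and the example coordinates are hypothetical.

```python
def chain_hsps(hsps):
    """Pick the highest-scoring collinear, non-overlapping chain of HSPs.

    Each HSP is (genomic_start, cdna_start, length, score).
    Simple O(n^2) dynamic programming, not sim4's sparse DP.
    """
    hsps = sorted(hsps)
    if not hsps:
        return []
    best = [h[3] for h in hsps]   # best chain score ending at each HSP
    prev = [None] * len(hsps)     # back-pointer to the previous HSP
    for b, (g_b, c_b, _, score_b) in enumerate(hsps):
        for a in range(b):
            g_a, c_a, len_a, _ = hsps[a]
            # HSP a must end before HSP b starts in both sequences.
            collinear = g_a + len_a <= g_b and c_a + len_a <= c_b
            if collinear and best[a] + score_b > best[b]:
                best[b] = best[a] + score_b
                prev[b] = a
    end = max(range(len(hsps)), key=best.__getitem__)
    chain = []
    while end is not None:
        chain.append(hsps[end])
        end = prev[end]
    return chain[::-1]


# Three collinear HSPs (candidate exon cores) plus one that conflicts
# with the first in cDNA coordinates.
toy_hsps = [(100, 0, 50, 50), (1000, 50, 40, 40), (500, 10, 30, 30), (5000, 90, 60, 60)]
exon_cores = chain_hsps(toy_hsps)
```

The large jumps in genomic coordinates between consecutive chained HSPs correspond to the introns that the later stages refine.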
