2
From First Assembly Towards a New Cyberpharmaceutical Computing Paradigm
Sorin Istrail, Senior Director, Informatics Research
3
Stan Ulam’s Vision “Don’t ask what Mathematics can do for Biology, ask what Biology can do for Mathematics.”
4
The Gene Counting Problem “The more I see the less I know for sure.” John Lennon
5
The O.J. Simpson Problem No matter how good the data look, it is full of errors! David Botstein, Albuquerque, 1994
6
Science and Technology Genomics Comparative Genomics Proteomics Pharmacogenomics Structural Genomics Drug and Vaccine Design DNA Expression Chips Animal Models
7
GENOMICS
8
There is Nothing More Important than the Assembly!
Gene Myers and Granger Sutton and the Assembly Team
Celera Human Assembly: Largest Non-Defence Computation
- 20,000 CPU hours, done 5 times now
- 160 processors, Compaq architecture
9
Assembly Progression (Macro View)
10
Generally 3-10 pairs link each consecutive contig
12
Completed Microbial Genomes (timeline figure; year labels 1996, 1997, 1998, 1999)
Genomes shown: Archaeoglobus fulgidus, Methanobacterium thermoautotrophicum, Saccharomyces cerevisiae, Mycoplasma pneumoniae, Methanococcus jannaschii, Mycoplasma genitalium, Haemophilus influenzae, Treponema pallidum, Borrelia burgdorferi, Helicobacter pylori, Bacillus subtilis, Escherichia coli, Aquifex aeolicus, Mycobacterium tuberculosis H37Rv, Pyrococcus horikoshii, (Mycobacterium tuberculosis CSU#93), (Deinococcus radiodurans), (Thermotoga maritima), (Rickettsia prowazekii), Chlamydia trachomatis, Synechocystis sp.
13
GENOMES SEQUENCED AT TIGR and CELERA
(* The Institute for Genomic Research, ** Celera Genomics)
Pathogens: *Haemophilus influenzae Rd, *Mycoplasma genitalium, *Helicobacter pylori, *Borrelia burgdorferi, *Treponema pallidum, *Plasmodium falciparum, *Neisseria meningitidis, *Chlamydia trachomatis, *Chlamydia pneumoniae, *Vibrio cholerae, *Streptococcus pneumoniae, *Mycobacterium tuberculosis, *Porphyromonas gingivalis, *Trypanosoma brucei, *Staphylococcus aureus, *Enterococcus faecalis, *Chlamydia psittaci
Plants: *Arabidopsis thaliana
Environment: *Methanococcus jannaschii, *Archaeoglobus fulgidus, *Thermotoga maritima, *Deinococcus radiodurans, *Chlorobium tepidum, *Caulobacter crescentus, *Shewanella putrefaciens, *Desulfovibrio vulgaris, *Pseudomonas putida
Insects: **Drosophila melanogaster
Mammals: **Human, **Mouse
14
Genesis of Celera, August 1998
- The new 3700 automated DNA sequencer changed what was possible in sequencing
- Combined with the TIGR whole genome sequencing strategy
- And 64-bit computing
15
Celera’s Sequencing / SNP Discovery Center
16
Celera Supercomputing Facility
- Celera’s system is one of the most powerful civilian supercomputing facilities in the world
- Currently over 1.5 teraflops of computing power in a virtual compute farm of Compaq processors with 100 terabytes of storage
- Next phase: a 100 teraflop computer
19
Obstacles to Genome Sequencing
- Sequencing reactions produce short reads (~550 bp): the human genome is ~3 billion bases, a sequence read ~550 bases.
- The human genome is repeat-rich. Many short reads look identical to each other.
(Figure: example reads GCATTA...GACCGT, CGGATAGACATAAC, CAGCAGCAGCAGCA.)
20
Comparison of Sequencing Strategies
1. Mapping and Walking
2. Mapping and Clone by Clone Shotgun
3. Whole Genome Shotgun with Mate Pairs
(Figure axis: from Lab-Intense (SLOW) to Compute-Intense (FAST).)
21
Clone by Clone Shotgun Sequencing (Mapping and Shotgun)
1) Replicate mapped spans of DNA. (Figure labels: chromosome; mapped span (BAC), 35,000.)
2) Shear the replicates randomly and sequence the pieces.
3) Assemble reads by overlap matching. Infer the original sequence by consensus. (Figure labels: reads such as cgattc; computed overlaps; computed sequence cgattcggattctcgattctacgaa.)
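Step 3 (assemble reads by overlap matching) can be illustrated with a toy greedy merger. This is only a sketch: the function names, the `min_len` cutoff, and the example reads are hypothetical, it assumes error-free reads from one strand, and it is not the assembler used at TIGR or Celera. Greedy merging of this kind is exactly what breaks down on repeat-rich sequence.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical, repeat-free example reads
reads = ["ATGGCGTACC", "CGTACCTGAA", "CTGAAGTTC"]
print(greedy_assemble(reads))
```

With a repeated region longer than a read, two different merges can look equally good, which is why the later slides rely on mate pairs rather than overlaps alone.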
22
Mate-Pair Shotgun DNA Sequencing
1. DNA target sample
2. SHEAR & SIZE, e.g., 10 Kbp ± 8% std. dev.
3. CLONE & END SEQUENCE
4. End Reads / Mate Pairs
(Figure labels: 590 bp reads at each end of a 10,000 bp insert.)
23
Whole Genome Sequencing Approaches: Whole Genome Shotgun Sequencing
- Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
- Collect 10-15x BAC inserts and end sequence: ~300K pairs for Human.
- ~27 million reads for Human.
(Figure labels: BAC 5’ and BAC 3’ ends; insert sizes 2 Kbp, 10 Kbp, 50 Kbp.)
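As a rough check on the read count, sequence coverage is the number of reads times the read length divided by the genome size. The sketch below plugs in the ~27 million reads from this slide together with the ~550 bp read length and ~3 billion base genome from the "Obstacles" slide. The uncovered-fraction estimate assumes idealized uniformly random reads (a Lander-Waterman style approximation, which is an assumption here and not something the slides state); the function names are hypothetical.

```python
import math


def sequence_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Average number of reads covering each base."""
    return n_reads * read_len / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Fraction of bases hit by no read, assuming uniformly random read placement."""
    return math.exp(-coverage)


c = sequence_coverage(27_000_000, 550, 3_000_000_000)
print(f"coverage ~ {c:.2f}x, uncovered fraction ~ {expected_uncovered_fraction(c):.4f}")
```

Under these assumptions the reads give roughly 5x sequence coverage.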
24
Building Scaffolds
- Mated reads: a link is confirmed if at least 2 mate pairs join the same unitigs and one of them is a U-unitig.
- The chance that a confirmed pair is in error is 1 in 10^15.
1. 2k-10k Scaffolds: Compute all “unitigs” in graph of U-unitigs connected by confirmed mate links.
2. BAC Scaffolds: Compute all “unitigs” in graph of 2/10K scaffolds connected by confirmed BAC links.
(Figure labels: Scaffold; Sequence or Repeat Gaps, with estimated distances.)
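The confirmation rule for mate links can be sketched as a simple count. This is a minimal illustration under assumptions: `mate_links` is a hypothetical list holding one unitig pair per mate pair whose two reads fall in different unitigs, and the sketch ignores the orientation and distance checks a real scaffolder would also need.

```python
from collections import Counter


def confirmed_links(mate_links, u_unitigs, min_support=2):
    """Return unitig pairs joined by >= min_support mate pairs, where at least one is a U-unitig."""
    support = Counter(
        tuple(sorted(link)) for link in mate_links if link[0] != link[1]
    )
    return {
        pair
        for pair, n in support.items()
        if n >= min_support and (pair[0] in u_unitigs or pair[1] in u_unitigs)
    }


# Hypothetical data: u* are U-unitigs, r* are repeat unitigs
links = [("u1", "u2"), ("u2", "u1"), ("u2", "u3"), ("r1", "r2"), ("r1", "r2")]
print(confirmed_links(links, u_unitigs={"u1", "u2", "u3"}))
```

Here only the u1-u2 link survives: u2-u3 has a single supporting pair, and r1-r2 has two pairs but no U-unitig.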
25
The Gene Counting Problem
- The number will probably never be known exactly.
- Current estimates: 30,000-40,000
- Other estimates: 120,000
- Gene discovery: sequence analysis, motif recognition, matches to mRNA, computational predictions, mouse data matches, experimental validation
26
Gene Counting
- Random ESTs from tissues
- CpG islands (55% of known genes are in CpG islands)
- Complexity of EST data sets: sampling is biased by tissue and depth of collection
- Underrepresented in databases: low-abundance genes, and genes in inaccessible tissues or developmental stages
- Overrepresented: EST data sets are composed of incomplete sequences of mRNA, and non-overlapping pieces of the same mRNA
27
Functional Assignment using Gene Ontology 13,601 Genes Drosophila
28
Gene Number in the Human Genome
29
Haemophilus vs. Drosophila
                        Hflu      Drosophila   X
Genome Size (Mbp)       1.8       120          67
Sequences               26,000    3,100,000    116
Months in sequencing    4         4            1
Sequencing Staff        24        50           2.1
Assembly Group Staff    1         10           10
31
Human Genome Sequence
- Sequence from 5 humans (3 females, 2 males) completed
- Human sequencing started 9/8/99
- Over 39X coverage of the genome in paired plasmid reads
- First Assembly announced June 26: 2.9 billion bp
- Published in Science, February 16, 2001
32
BDGP STS Order Validation Against STS-map
- Scaffolds were aligned against the BDGP STS-content map.
- All scaffolds spanning 2 or more STSs were checked for order discrepancies.
- 16 STS sites out of 2175 (0.73%) were out of order, well within the estimated error rate of the STS map.
- 10 have been determined to be incorrect.
(Figure: Celera Scaffold and STS Order, for chromosome arms 2L, 2R, 3L, 3R, X, and 4.)
33
Components vs. GeneMap ‘99
34
Order & Orientation is Essential to Finding Genes
- Correct layout: Exon 1, Exon 2, Exon 3, Exon 4
- But if contigs are not correctly put together: 1, 4, 3 (reversed), 2
- Exons are then shuffled and unoriented, significantly impacting the ability of gene finding programs to make a correct prediction.
- Users consistently report finding genes that they can’t find elsewhere.
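To make the order and orientation point concrete, the sketch below rebuilds a sequence from contigs given an order and a per-contig "reversed" flag. The contig strings and the `layout` helper are hypothetical, chosen only to show that a contig stored on the opposite strand must be reverse-complemented before a gene finder sees a contiguous coding sequence.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def layout(contigs, order):
    """Concatenate contigs in the given order; order holds (index, is_reversed) pairs."""
    return "".join(
        reverse_complement(contigs[i]) if is_reversed else contigs[i]
        for i, is_reversed in order
    )


# Hypothetical contigs: the second one is stored on the opposite strand and out of order
contigs = ["ATGG", "TCAT", "CCAA"]
print(layout(contigs, [(0, False), (1, False), (2, False)]))  # naive, as stored
print(layout(contigs, [(0, False), (2, False), (1, True)]))   # corrected order and orientation
```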
35
Contactin-associated protein gene (CNTNAP2) Comparison of genomic DNA sequences retrieved from the public working draft and the Celera database Genomics 73, 108-112, 2001. http://www.idealibrary.com Working draft Celera
36
Scaffold Sizes (figure): percent of genome covered vs. scaffold length (Mbp, axis 0 to 25) for Mouse WGA, Human WGA, and Human CSA.
37
Scaffold Sizes (figure): percent of genome covered vs. scaffold length (Mbp, axis 0 to 80) for Mouse WGA, Human WGA, Human CSA, Drosophila WGA, and Celera-only WGA.
38
Human and Mouse Assemblies
Set   Assembly     Span (Mbp)   Scaffolds   Gap (Mbp)   Gaps       30K %   100K %
All   WGA          2,847        119,000     261         221,000    90.4    88.6
All   CSA          2,905        53,000      252         170,000    94.6    92.9
30K   WGA          2,574        2,507       240         99,000
30K   CSA          2,748        2,845       224         112,000
All   C-only WGA   2,781        6,500       134         *182,000   99.0    98.7
30K   C-only WGA   2,754        537         118         174,000
All   Mouse WGA    2,446        19,778      212         265,000    96.8    95.5
30K   Mouse WGA    2,367        1,779       193         242,000
39
The Human Genome is NOT
- THE Book of Life
- The Blueprint of Humanity
- The Language of God
- The Parts List of Humanity
41
ADRB2
43
Molecular Function of Predicted proteins
45
BLAST, FASTA, and SIM4 Sorin Istrail Celera Genomics
46
BLAST (Basic Local Alignment Search Tool)
A suite of sequence comparison algorithms optimized for speed, used to search sequence databases for optimal local alignments to a protein or nucleotide query.
Program   Query                          Database
blastp    protein                        protein
blastn    DNA                            DNA
blastx    DNA (translated in 6 frames)   protein
tblastn   protein                        DNA (translated in 6 frames)
tblastx   DNA (translated in 6 frames)   DNA (translated in 6 frames)
References:
Altschul, Gish, Miller, Myers, Lipman, “Basic Local Alignment Search Tool”, J. Mol. Biol. 215(3):403-10 (1990)
Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, NAR 25(17):3389-402 (1997) (and references therein)
47
The BLAST algorithm
1. Detect all word hits (exact, or nearly identical matches) of a given length between the two sequences
   - k=10 for nucleotide sequences (exact word matches)
   - k=3 for protein sequences (nearly identical word matches)
2. Extend the word hits in both directions to high-scoring gap-free segment pairs (HSPs)
   - retain only HSPs that score above a threshold
   - start from the center of the HSP (original BLAST, 1990), or from the center of a pair of HSPs located close to each other on the same diagonal (gapped BLAST, 1997)
3. Extend the HSPs in both directions, allowing for gaps
   - use dynamic programming, and stop when the alignment score falls more than a threshold X below the best score yet seen
4. Report all statistically significant local alignments
   - E-value (starting with BLAST 2.0) is used to measure the statistical significance
   - E-value = the number of alignments with score equal to or higher than s one would expect to find by chance when searching the database
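The first two stages (exact word hits, then gap-free extension) can be sketched for nucleotide sequences as follows. This is a toy illustration, not the BLAST source: the function names and the match/mismatch scores are hypothetical, and the drop-off rule is borrowed from the gapped stage described on the slide and applied here to the ungapped extension as an assumption.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, k: int):
    """All (query_pos, subject_pos) pairs where the two sequences share an exact k-mer."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return [
        (i, j)
        for j in range(len(subject) - k + 1)
        for i in index.get(subject[j:j + k], [])
    ]


def extend_ungapped(query, subject, i, j, k, match=1, mismatch=-3, x_drop=5):
    """Extend the seed at (i, j) in both directions without gaps.

    Returns (score, query_start, query_end, subject_start), with query_end exclusive.
    """
    def walk(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score fell too far below the best seen so far
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + k, j + k, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    return (k * match + left + right, i - left_len, i + k + right_len, j - left_len)


query, subject = "CCGATTACAGG", "TTGATTACATT"
hits = word_hits(query, subject, k=4)
print(hits)
print(extend_ungapped(query, subject, *hits[0], k=4))
```

Indexing the query words once and scanning the subject keeps the seeding step linear in the sequence lengths plus the number of hits, which is the main reason the word-hit heuristic is fast.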
48
FASTA
A program for rapid alignment of pairs of protein and DNA sequences, building a local alignment from matching sequence patterns, or words.
Algorithm for comparing a query to a database of sequences:
1. For each database sequence:
   - Identify the 10 diagonal regions having the largest number of perfect word matches of a given length (word size: k=1,2 for protein, and k=6-10 for nucleotide searches)
   - Re-score these regions using a given scoring matrix (e.g., PAM250), and trim them to form (gap-free) maximal scoring initial regions
   - Join (non-overlapping) initial regions from adjacent diagonals to generate longer regions, allowing for gaps
   - Re-score these based on the initial regions’ scores, assessing a penalty for each joining
2. Align the query sequence to each of the sequences in the search set having the highest overall scores
Pearson and Lipman, “Improved tools for biological sequence comparison”, Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).
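The first FASTA step, ranking diagonals by their number of perfect word matches, can be sketched as below. A word at query position i and subject position j lies on diagonal j - i, so counting hits per diagonal is a single pass. The function name and example sequences are hypothetical, and the later re-scoring and joining steps are not shown.

```python
from collections import Counter, defaultdict


def top_diagonals(query: str, subject: str, k: int = 2, n_best: int = 10):
    """Return up to n_best (diagonal, hit_count) pairs, most word matches first."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    diag_counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            diag_counts[j - i] += 1  # same diagonal means same offset between sequences
    return diag_counts.most_common(n_best)


print(top_diagonals("GATTACA", "TTGATTACATT", k=2))
```

In this example the diagonal with offset 2 collects six word matches, reflecting that the query occurs in the subject starting at position 2.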
49
Sim4
Aligns an expressed DNA (EST, cDNA, mRNA) sequence with a genomic sequence for that gene, allowing for introns and sequencing errors.
(Figure: a cDNA aligned to a genomic sequence, 5’ to 3’; Exons 1, 2, and 3 are separated by introns that begin with GT and end with AG.)
Florea, Hartzell, Zhang, Rubin, Miller, “A computer program for aligning expressed DNA and genomic sequences”, Genome Res 8(9):967-74 (1998)
50
Stages and algorithmic techniques
1. Detect basic homology blocks
   - Determine gap-free matches (HSPs) using a ‘blast’-like homology search
     - Detect all exact word matches of length k (e.g., k=12)
     - Extend the word hits in both directions, by substitutions, to gap-free high-scoring segment pairs (HSPs)
     - Retain only HSPs scoring above a threshold
   - Connect the HSPs to form larger blocks (‘exon cores’) using sparse dynamic programming
2. Extend or trim the exon cores to eliminate gaps or overlaps in the cDNA sequence
   - Extend the similarity blocks using fast greedy sequence comparison algorithms
   - Detect new exon cores with the ‘blast’-like homology search tuned for higher sensitivity
3. Refine the introns
   - Predict the locations of splice junctions using a combined measure of the accuracy of alignment and the intensity of splice signals at the ends of each intron
4. Generate the spliced alignment
   - Align the sequences within individual exons using greedy alignment algorithms
   - Connect the chain of exon alignments by gaps (introns)
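The step that connects HSPs into exon cores can be illustrated with a small chaining routine. Note the substitution: the slide names sparse dynamic programming, whereas this sketch uses a plain quadratic dynamic program for clarity. The tuple layout, the colinearity rule (no overlap in either sequence), and the sum-of-scores objective are assumptions made for illustration and are not taken from the sim4 source.

```python
def chain_hsps(hsps):
    """Pick the highest-scoring colinear chain of HSPs.

    Each HSP is (g_start, g_end, c_start, c_end, score) with half-open
    coordinates on the genomic (g) and cDNA (c) sequences.
    """
    order = sorted(hsps)  # by genomic start
    if not order:
        return []
    best = [h[4] for h in order]  # best chain score ending at each HSP
    prev = [None] * len(order)
    for b in range(len(order)):
        for a in range(b):
            # a may precede b only if it ends before b starts in both sequences
            if order[a][1] <= order[b][0] and order[a][3] <= order[b][2]:
                candidate = best[a] + order[b][4]
                if candidate > best[b]:
                    best[b], prev[b] = candidate, a
    end = max(range(len(order)), key=best.__getitem__)
    chain = []
    while end is not None:
        chain.append(order[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical HSPs; the third one conflicts with the second in cDNA coordinates
hsps = [
    (100, 150, 0, 50, 50),
    (300, 340, 50, 90, 40),
    (200, 230, 60, 90, 30),
    (500, 560, 90, 150, 60),
]
print(chain_hsps(hsps))
```

The gaps between consecutive HSPs on the genomic axis (150 to 300 and 340 to 500 here) are the candidate introns that the later stages refine.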