Computational Analysis of Genome Sequences Steven Salzberg The Institute for Genomic Research (TIGR) and The Johns Hopkins University.


Computational Analysis of Genome Sequences Steven Salzberg The Institute for Genomic Research (TIGR) and The Johns Hopkins University

The Genomics Revolution
• 1995: 1st genome (H. influenzae, TIGR)
• 1996: 1st eukaryote (S. cerevisiae)
• 2000: 29 complete microbial genomes
  - 22 in progress at TIGR
  - 50+ in progress worldwide
• 3 complete eukaryotes: yeast, nematode, fruit fly
• 2 major projects in 2000:
  - Human (3.3 billion bp)
  - Arabidopsis thaliana (125 million bp)

Genomes Completed at TIGR
Organism (genome size)                  Reference
Haemophilus influenzae (1.83 Mb)        Fleischmann et al., Science 269 (1995)
Mycoplasma genitalium (0.58 Mb)         Fraser et al., Science 270 (1995)
Methanococcus jannaschii (1.7 Mb)       Bult et al., Science 273 (1996)
Helicobacter pylori (1.6 Mb)            Tomb et al., Nature 388 (1997)
Archaeoglobus fulgidus (2.1 Mb)         Klenk et al., Nature 390 (1997)
Borrelia burgdorferi (1.5 Mb)           Fraser et al., Nature 390 (1997)
Treponema pallidum (1.1 Mb)             Fraser et al., Science 281 (1998)
Plasmodium falciparum chr2 (1 Mb)       Gardner et al., Science 282 (1998)
Thermotoga maritima (1.8 Mb)            Nelson et al., Nature 399 (1999)
Deinococcus radiodurans (3.3 Mb)        White et al., Science 286 (1999)
Arabidopsis thaliana chr2 (19 Mb)       Lin et al., Nature 402 (1999)
Neisseria meningitidis (2.3 Mb)         Tettelin et al., Science 287 (2000)
Chlamydia pneumoniae (1.2 Mb)           Read et al., Nucleic Acids Res 28 (2000)
Chlamydia trachomatis (1.0 Mb)          Read et al., Nucleic Acids Res 28 (2000)
Vibrio cholerae (4.0 Mb)                Heidelberg et al., Nature, in press
Mycobacterium tuberculosis (4.4 Mb)     Fleischmann et al., manuscript in preparation
Streptococcus pneumoniae (2.2 Mb)       Tettelin et al., manuscript in preparation
Caulobacter crescentus (4.0 Mb)         Nierman et al., manuscript in preparation
Chlorobium tepidum (2.1 Mb)             Eisen et al., manuscript in preparation
Porphyromonas gingivalis (2.2 Mb)       Fleischmann et al., manuscript in preparation

Genomes in progress at TIGR
Organism (genome size)                       Funding source
Plasmodium falciparum chr 14 (3.4 Mb)        BWF/DoD
Plasmodium falciparum chr 10, 11 (4 Mb)      NIAID/DoD
Trypanosoma brucei chr 2 (1 Mb)              NIAID
Enterococcus faecalis (3.0 Mb)               NIAID
Mycobacterium avium (4.4 Mb)                 NIAID
Pseudomonas putida (6.2 Mb)                  DOE
Shewanella putrefaciens (4.5 Mb)             DOE
Staphylococcus aureus (2.8 Mb)               NIAID, MGRI
Dehalococcoides ethenogenes (1.5 Mb)         DOE
Desulfovibrio vulgaris (3.2 Mb)              DOE
Thiobacillus ferrooxidans (2.9 Mb)           DOE
Chlamydia psittaci GPIC (1.2 Mb)             NIAID
Bacillus anthracis (5.0 Mb)                  ONR/DOE/NIAID
Treponema denticola (3.0 Mb)                 NIDR
C. hydrogenoformans (2.0 Mb)                 DOE
Methylococcus capsulatus (4.6 Mb)            DOE
Geobacter sulfurreducens (4.0 Mb)            DOE
Wolbachia sp (Drosophila) (1.4 Mb)           NIH
Colwellia sp (1.0 Mb)                        DOE
Mycobacterium smegmatis (4.0 Mb)             NIAID
Staphylococcus epidermidis (2.5 Mb)          NIAID
Theileria parva (10 Mb)                      ILRI/TIGR

A Microbial Genome Sequencing Project
Random sequencing: Library construction, Colony picking, Template preparation, Sequencing reactions, Base calling, Sequence files
Genome Assembly: TIGR Assembler, Genome scaffold, Ordered contig set, Gap closure, sequence editing, Re-assembly, ONE ASSEMBLY!, Combinatorial PCR, POMP
Annotation: Gene finding, Homology searches, Initial role assignments, Metabolic pathways, Gene families, Comparative genomics, Transcriptional/translational regulatory elements, Repetitive sequences
Data Release: Publication
Sample tracking

Gene Finding
• Gene finding plays an ever-larger role in high-speed DNA sequencing projects
  - There's no time for much else!
  - 1000s of genes generated each month at a high-throughput sequencing facility
• Separate gene finders are needed for every organism
  - Training on organism X, finding genes on Y, generates inferior results
  - Bootstrapping problem: training data is hard to find

Open Reading Frames: 6 possibilities
TCG TAC GTA GCT AGC TAG CTA AGC ATG CAT CGA TCG ATC GAT
T CGT ACG TAG CTA GCT AGC TAA GCA TGC ATC GAT CGA TCG AT
TC GTA CGT AGC TAG CTA GCT AAG CAT GCA TCG ATC GAT CGA T
(identical sequence read in three frames; the reverse-complement strand supplies the other three)
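
The six reading frames can be enumerated mechanically. The following is a minimal Python sketch, not code from GLIMMER: the function names (`reverse_complement`, `six_frames`, `longest_open_stretch`) are made up for illustration, and the stop-codon set assumes the standard genetic code.

```python
# Illustrative helpers; names are hypothetical, not taken from GLIMMER.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}  # assumes the standard genetic code


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to lists of complete codons."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


def longest_open_stretch(codons):
    """Length (in codons) of the longest run containing no stop codon."""
    best = run = 0
    for codon in codons:
        if codon in STOPS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best


slide_seq = "TCGTACGTAGCTAGCTAGCTAAGCATGCATCGATCGATCGAT"
frames = six_frames(slide_seq)
```

Applied to the sequence on the slide, `frames["+1"]`, `frames["+2"]` and `frames["+3"]` reproduce the three codon groupings shown, and the `-` entries are the same three offsets on the opposite strand.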

GLIMMER: A Microbial Gene Finder
• GLIMMER 2.0: released late 1999
• > 200 site licenses worldwide
• Works on bacteria, archaea, viruses too
• Malaria (eukaryotic) version: GLIMMERM
• Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al., NAR, 1999
• Web site and code:

Uniform Markov Models
• Use the conditional probability of a sequence position given the previous k positions in the sequence.
• Fixed, kth-order model: bigger k's yield better models (as long as data is sufficient).
• Probability (score) of sequence s_1 s_2 s_3 … s_n is:

$$P(s_1 s_2 \dots s_n) = \prod_{i=1}^{n} P(s_i \mid s_{i-k} \dots s_{i-1})$$

  (for the first k positions, the context is shortened to the bases available)

Uniform Markov Models
• Advantages:
  - Easy to train. Count frequencies of (k+1)-mers in training data.
  - Easy to assign a score to a sequence.
• Disadvantages:
  - (k+1)-mers can be undersampled; i.e., occur too infrequently in training data.
  - Models sequence as fixed-length chunks, which may not be the best model of biology.

Interpolated Markov Models
• Use a linear combination of Markov chains of different orders (here orders 0 through 8); for example:
    c_8 P(g|atcagtta) + c_7 P(g|tcagtta) + … + c_1 P(g|a) + c_0 P(g)
  where c_0 + c_1 + … + c_8 = 1
• Equivalent to interpolating the results of multiple Markov chains
• Score of a sequence is the product of interpolated probabilities of bases in the sequence
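
To make the interpolation concrete, here is a small Python sketch under stated assumptions. It trains fixed-order models of orders 0 through k by counting, then mixes them with one global weight per order. The weights, the add-one pseudocount and all function names are illustrative assumptions; GLIMMER itself derives context-specific weights from the training data.

```python
import math
from collections import defaultdict


def train_counts(seqs, max_order):
    """counts[o][context][base]: how often `base` follows each o-base context."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for s in seqs:
        for i, base in enumerate(s):
            for o in range(min(i, max_order) + 1):
                counts[o][s[i - o:i]][base] += 1
    return counts


def fixed_order_prob(counts, order, context, base, pseudo=1.0):
    """Smoothed P(base | last `order` bases of context); pseudocount is an assumption."""
    ctx = context[len(context) - order:]
    row = counts[order].get(ctx, {})
    total = sum(row.values())
    return (row.get(base, 0) + pseudo) / (total + 4 * pseudo)


def interpolated_prob(counts, weights, context, base):
    """Weighted mix of the order-0..k estimates; weights are renormalized
    when the available context is shorter than the highest order."""
    usable = min(len(context), len(weights) - 1)
    w = weights[:usable + 1]
    z = sum(w)
    return sum(c / z * fixed_order_prob(counts, o, context, base)
               for o, c in enumerate(w))


def log_score(counts, weights, seq):
    """Log of the product of interpolated probabilities over the sequence."""
    return sum(math.log(interpolated_prob(counts, weights, seq[:i], seq[i]))
               for i in range(len(seq)))


counts = train_counts(["ACGTACGTACGT"], max_order=2)
weights = [0.2, 0.3, 0.5]  # hypothetical c_0, c_1, c_2; they sum to 1
p_g = interpolated_prob(counts, weights, "AC", "G")
```

Working in log space avoids underflow when the per-base probabilities are multiplied over a long ORF.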

IMMs vs. Fixed-Order Models
• Performance:
  - IMM should always do at least as well as fixed-order.
  - E.g., even if kth-order model is correct, it can be simulated by (k+1)st-order
  - Our results support this.
• IMM result can be used as fixed-order model.
• IMM slightly harder to train and uses more memory.

IMM Training
• Problem: How to determine the weights of all the thousands of k-mers?
• Traditionally done with E-M algorithm using cross-validation (deleted estimation).
  - Slow.
  - Overtraining can be a problem.

GLIMMER IMM Training
• Our approach assumes:
  - Longer context is always better
  - Only reason not to use it is undersampling in training data.
• If a sequence occurs frequently enough in training data, use it, i.e., λ = 1
• Otherwise, use frequency and χ² significance to set λ.

How GLIMMER Works
• Three separate programs:
  - long-orfs: automatically extract long open reading frames that do not overlap other long ORFs.
  - IMM model builder. Takes any kind of sequence data.
  - Gene predictor. Takes genome sequence and finds all the genes.

Gene Predictor
• Finds & scores entire ORFs.
• Uses 7 competing models: 6 reading frames plus "random" model.
• Score for an ORF is the probability that the "right" model generated it.
  - 3-periodic Markov model
• High-scoring ORFs are then checked for overlaps.
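
One way to read "the probability that the right model generated it" is as a normalization over the competing models. The sketch below assumes each model has already produced a log-likelihood for the ORF and that all models have equal prior weight. Both the equal priors and the function name `model_posteriors` are assumptions for illustration, not details confirmed by the slides.

```python
import math


def model_posteriors(log_likelihoods):
    """Turn per-model log-likelihoods into probabilities that sum to 1.

    Equal prior weight per model is assumed; the maximum is subtracted
    before exponentiating so that very negative values do not underflow.
    """
    top = max(log_likelihoods.values())
    scaled = {name: math.exp(v - top) for name, v in log_likelihoods.items()}
    z = sum(scaled.values())
    return {name: e / z for name, e in scaled.items()}


# Hypothetical log-likelihoods for one ORF under the 7 competing models.
example = {"+1": -410.2, "+2": -431.8, "+3": -428.5,
           "-1": -433.0, "-2": -429.9, "-3": -432.4, "random": -426.7}
posteriors = model_posteriors(example)
```

An ORF would then be scored by the entry for its own frame, here `posteriors["+1"]`.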

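Glimmer 2.0 IMM design
[Figure: context tree for the example sequence ATGCATGATCGAG (12 bp context); each node selects a context position (Pos -1, Pos -2, Pos -3, Pos -4, ...) and branches on a, c, g, t; 8 levels deep]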

Better Overlap Resolution

GLIMMER 2.0's Performance
Organism          Genes annotated   Genes found       Additional genes
H. influenzae     [lost]            [lost] (99.0%)    250 (14%)
M. genitalium     483               480 (99.4%)       81 (17%)
M. jannaschii     [lost]            [lost] (99.7%)    221 (13%)
H. pylori         [lost]            [lost] (97.5%)    293 (18%)
E. coli           [lost]            [lost] (97.4%)    824 (19%)
B. subtilis       [lost]            [lost] (98.3%)    586 (14%)
A. fulgidus       [lost]            [lost] (98.6%)    274 (11%)
B. burgdorferi    853               843 (99.3%)       62 (7%)
T. pallidum       [lost]            [lost] (97.6%)    180 (17%)
T. maritima       [lost]            [lost] (98.8%)    190 (10%)

GLIMMER 2.0 on known genes
Columns on the slide: Organism, Genes Annotated, Known Genes, Correct Predictions (only the percentages of correct predictions survive in this transcript)
Organism          Correct predictions
H. influenzae     99.7%
M. genitalium     99.6%
M. jannaschii     99.8%
H. pylori         99.3%
E. coli           99.1%
B. subtilis       98.6%
A. fulgidus       99.3%
B. burgdorferi    99.8%
T. pallidum       98.9%
T. maritima       99.3%
Average           99.3%

• Speed
  - Training for 2 Megabase genome: < 1 minute (on a Pentium-450)
  - Find all genes in 2 Mb genome: < 1 minute
• Impact: GLIMMER was used for:
  - B. burgdorferi (Lyme disease), T. pallidum (syphilis) (TIGR)
  - C. trachomatis (blindness, STD) (Berkeley/Stanford)
  - C. pneumoniae (pneumonia) (Berkeley/Stanford/UCSF)
  - T. maritima, D. radiodurans, M. tuberculosis, V. cholerae, S. pneumoniae, C. trachomatis, C. pneumoniae, N. meningitidis (TIGR)
  - X. fastidiosa (Brazilian consortium)
  - Plasmodium falciparum (malaria) [GlimmerM]
  - Arabidopsis thaliana (model plant) [GlimmerM]
  - Others: viruses, simple eukaryotes, more bacteria

Self-Similarity Scans
• Idea: analyze a whole genome by counting 3-mers in all 6 frames
• Analyze small windows (2000 bp, 10000 bp) using the same statistic
• Algorithm:
  - Build model of entire sequence
  - Apply the χ² statistic to compare windows to the genome itself
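
The scan can be sketched in a few lines of Python. This is an illustration under assumptions rather than the code used for the plots. It pools overlapping 3-mers from both strands into a single table, which covers all six frames without separating them, and the function names, window size and step are chosen only for the example.

```python
from collections import Counter

_COMP = str.maketrans("ACGT", "TGCA")


def both_strand_trimers(seq):
    """Count overlapping 3-mers on the sequence and its reverse complement."""
    rc = seq.translate(_COMP)[::-1]
    counts = Counter()
    for s in (seq, rc):
        counts.update(s[i:i + 3] for i in range(len(s) - 2))
    return counts


def chi_square(window_counts, genome_freqs, n_window):
    """Chi-square distance of window 3-mer counts from genome-wide expectation."""
    total = 0.0
    for kmer, freq in genome_freqs.items():
        expected = freq * n_window
        observed = window_counts.get(kmer, 0)
        total += (observed - expected) ** 2 / expected
    return total


def scan(genome, window, step):
    """Return (window start, chi-square score) pairs along the genome."""
    g = both_strand_trimers(genome)
    n = sum(g.values())
    freqs = {k: v / n for k, v in g.items()}
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        w = both_strand_trimers(genome[start:start + window])
        scores.append((start, chi_square(w, freqs, sum(w.values()))))
    return scores


# Toy genome with a compositionally atypical island in the middle.
toy = "ACGT" * 200 + "GGGGGGGGGC" * 20 + "ACGT" * 200
toy_scores = scan(toy, window=100, step=100)
```

Windows whose composition departs from the genome-wide model receive high scores, which is the signal the following slides plot along each genome.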

Haemophilus influenzae (meningitis) [Figure: genome scan; axis label: GC%]

Thermotoga maritima (hyperthermophile)

Vibrio cholerae (cholera)

On the other side of CTXφ prophage is a region encoding an RTX toxin (rtxA) and its activator (rtxC) and transporters (rtxBD). A third transporter gene has been identified that is a paralog of rtxB, and is transcribed in the same direction as rtxBD. Downstream of this gene are two genes encoding a sensor histidine kinase and response regulator. Trinucleotide composition analysis suggests that the RTX region was horizontally acquired along with the sensor histidine kinase/response regulator, suggesting these regulators effect expression of the closely linked RTX transcriptional units. --Heidelberg et al., Nature, in press.

MUMmer
• Aligns 2 complete genomes
• Maximal Unique Matches
• Suffix trees
• Very fast alignment of very long DNA sequences
• Ref: Delcher et al., Nucl. Acids Res., 1999
• Software at:

The Problem
• Efficiently compute alignments between long sequences to identify biologically interesting features.
  - E.g., two strains of M. tuberculosis, each ~4.4 Mb
  - E.g., two versions of a genome at different stages of closure
• Compute alignment in less than 2 minutes

Maximal Unique Sequences
Sequences in genomes A and B that:
• Occur exactly once in A and in B
• Are not contained in any larger such sequence

Select the longest consistent set of MUMs
• Occur in the same order in A and B
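
The two steps above, finding MUMs and keeping the longest consistent set, can be illustrated on short strings. The Python sketch below deliberately swaps in a brute-force search for the suffix-tree method used by MUMmer, so it is only practical for toy inputs. The function names, the `min_len` cutoff, and the choice to weight the chain by total matched length are all assumptions made for the example.

```python
def occurs_once(text, pattern):
    """True if `pattern` occurs exactly once in `text` (overlaps included)."""
    first = text.find(pattern)
    return first != -1 and text.find(pattern, first + 1) == -1


def find_mums(a, b, min_len=5):
    """Brute-force MUMs as (start in a, start in b, length) tuples."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not maximal
            n = 0
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            if n >= min_len:
                s = a[i:i + n]
                if occurs_once(a, s) and occurs_once(b, s):
                    mums.append((i, j, n))
    return mums


def longest_consistent_chain(mums):
    """Pick MUMs in the same order in both sequences, maximizing total length
    (a weighted longest-increasing-subsequence, O(n^2) dynamic programming)."""
    mums = sorted(mums)
    if not mums:
        return []
    best = [m[2] for m in mums]   # best chain weight ending at each MUM
    prev = [-1] * len(mums)
    for x in range(len(mums)):
        for y in range(x):
            if mums[y][1] < mums[x][1] and best[y] + mums[x][2] > best[x]:
                best[x] = best[y] + mums[x][2]
                prev[x] = y
    x = max(range(len(mums)), key=best.__getitem__)
    chain = []
    while x != -1:
        chain.append(mums[x])
        x = prev[x]
    return chain[::-1]


# Toy example: b carries the last two blocks of a in swapped order.
seq_a = "GATTACA" + "T" + "CCTGAGG" + "A" + "TTGCGCATAC"
seq_b = "GATTACA" + "C" + "TTGCGCATAC" + "G" + "CCTGAGG"
toy_mums = find_mums(seq_a, seq_b)
toy_chain = longest_consistent_chain(toy_mums)
```

In the toy example the middle block of `seq_a` is found as a MUM but dropped from the chain, because it appears out of order in `seq_b`.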

Suffix Trees
• A tree with edges labelled by strings
• Labels of child edges of a node begin with distinct letters
• Each leaf L represents a sequence: the labels on the path to L from the root
• Holds all suffixes of a set of sequences
  - A suffix is a substring that extends to the end of its sequence
• The suffix tree for sequences A and B:
  - Contains fewer than 2(|A| + |B|) nodes.
  - Can be constructed in O(|A| + |B|) time!
  - Still need lots of RAM
  - All the analyses here were run on a desktop PC

• Analyze the gaps between adjacent MUMs
• Small gaps can be aligned with Smith-Waterman algorithm
• Large gaps can be aligned recursively
• Large inserts can be searched for separately. Many will be inconsistent MUMs
• Overlapping MUMs indicate variation in copy number of small repeats
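
For the small gaps, a score-only Smith-Waterman local alignment is enough to show the idea. The sketch below is a textbook version with a linear gap penalty. The scoring values and the function name are illustrative assumptions, not MUMmer's parameters.

```python
def smith_waterman_score(x, y, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between x and y (linear gap penalty).

    Keeps only two rows of the dynamic-programming table, so memory use
    is proportional to len(y).
    """
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


gap_score = smith_waterman_score("ACGT", "TTACGTTT")
```

The quadratic cost is acceptable here because the MUMs have already anchored most of the alignment and only the short regions between them are left to fill in.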

M. tuberculosis CSU93 vs. H37Rv [Figure: alignment display; legend: A, C, G, T; a MUM]

M. genitalium vs. M. pneumoniae

H. pylori vs. J99

V. cholerae (forward) vs. E. coli [Figure label: Origin]

V. cholerae (reverse) vs. E. coli

V. cholerae (both strands) vs. E. coli: a puzzle?

V. cholerae vs. itself

S. pyogenes vs. S. pneumoniae

S. pyogenes vs. itself

M. leprae vs. M. tuberculosis [Figure axes: M. leprae, M. tuberculosis]

X-alignments: how? Ori

Chr 2 vs. Chr 4 of Arabidopsis thaliana: discovery of a 4 Mb duplication
• 1100 genes
• 430 (39%) duplicated

Acknowledgements
GLIMMER, GLIMMERM: Arthur Delcher, Simon Kasif, Owen White, Mihaela Pertea
MUMmer: Arthur Delcher, Simon Kasif, Jeremy Peterson, Rob Fleischmann, Owen White
Analyses: Numerous TIGR faculty and staff, including: Jonathan Eisen, Owen White, Rob Fleischmann, Hervé Tettelin, Tim Read, Maria Ermolaeva, John Heidelberg, Ian Paulsen, Malcolm Gardner, Claire Fraser, Clyde Hutchison, ...
Supported by: National Institutes of Health (NHGRI, NLM); National Science Foundation (CISE, BIO)