An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Similarity-Based Approaches.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Dynamic Programming: Sequence alignment
Hidden Markov Model.
Hidden Markov Models in Bioinformatics
Ab initio gene prediction Genome 559, Winter 2011.
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Hidden Markov Models in Bioinformatics
Profiles for Sequences
Hidden Markov Models in Bioinformatics Example Domain: Gene Finding Colin Cherry
Gene Prediction: Similarity-Based Approaches (selected from Jones/Pevzner lecture notes)
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
Sequence Alignment Tutorial #2
How many transcripts does it take to reconstruct the splice graph? Introduction Alternative splicing is the process by which a single gene may be used.
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA Intergene State First Exon State Intron State.
Sequence Alignment.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Gene Finding Charles Yan.
CSE182-L10 Gene Finding.
CSE182-L12 Gene Finding.
Comparative ab initio prediction of gene structures using pair HMMs
Eukaryotic Gene Finding
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Lecture 12 Splicing and gene prediction in eukaryotes
Eukaryotic Gene Finding
Biological Motivation Gene Finding in Eukaryotic Genomes
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
Dynamic Programming II
Sequencing a genome and Basic Sequence Alignment
Hidden Markov Models In BioInformatics
Sequence Alignment.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University.
Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Regular Expression ^ beginning of string $ end of string. any character except newline * match 0 or more times + match 1 or more times ? match 0 or 1 times;
Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
Sequencing a genome and Basic Sequence Alignment
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Mark D. Adams Dept. of Genetics 9/10/04
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Splicing Exons: A Eukaryotic Challenge to Gene Prediction Ian McCoy.
(H)MMs in gene prediction and similarity searches.
Gene Structure Prediction (Gene Finding) I519 Introduction to Bioinformatics, 2012.
1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12, 2005 ChengXiang Zhai Department of Computer Science.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Graphical Models for Segmenting and Labeling Sequence Data Manoj Kumar Chinnakotla NLP-AI Seminar.
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
Eukaryotic Gene Finding
Ab initio gene prediction
Reference based assembly
Sequence Alignment Using Dynamic Programming
Introduction to Bioinformatics II
Pair Hidden Markov Model
Dynamic Programming (cont’d)
Intro to Alignment Algorithms: Global and Local
CSE 589 Applied Algorithms Spring 1999
Presentation transcript:

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Similarity-Based Approaches

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Outline 1.Introduction 2.Exon Chaining Problem 3.Spliced Alignment 4.Gene Prediction Tools

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Section 1: Introduction

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Similarity-Based Approach to Gene Prediction Some genomes may be well-studied, with many genes having been experimentally verified. Closely-related organisms may have similar genes. In order to determine the functions of unknown genes, researchers may compare them to known genes in a closely- related species. This is the idea behind the similarity-based approach to gene prediction.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Similarity-Based Approach to Gene Prediction There is one issue with comparison: genes are often “split.” The exons, or coding sections of a gene, are separated from introns, or the non-coding sections. Example: Our Problem: Given a known gene and an unannotated genome sequence, find a set of substrings in the genomic sequence whose concatenation best matches the known gene. This concatenation will be our candidate gene. …AGGGTCTCATTGTAGACAGTGGTACTGATCAACGCAGGACTT… CodingNon-codingCoding Non-coding

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Frog Gene (known) Human Genome Using Similarities to Find Exon Structure A (known) frog gene is aligned to different locations in the human genome. Find the “best” path to reveal the exon structure of the corresponding human gene.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Finding Local Alignments A (known) frog gene is aligned to different locations in the human genome. Find the “best” path to reveal the exon structure of the corresponding human gene. Use local alignments to find all islands of similarity. Frog Gene (known) Human Genome

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info genome mRNA Exon 3Exon 1Exon 2 { {{ Intron 1Intron 2 { { Reverse Translation Reverse Translation Problem: Given a known protein, find a gene which codes for it. Inexact: amino acids map to > 1 codon This problem is essentially reduced to an alignment problem. Example: Comparing Genomic DNA Against mRNA.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Reverse Translation The reverse translation problem can be modeled as traveling in Manhattan grid with “free” horizontal jumps. Each horizontal jump models insertion of an intron. Complexity of Manhattan grid is O(n 3 ). Issue: Would match nucleotides pointwise and use horizontal jumps at every opportunity.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Section 2: Exon Chaining Problem

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Chaining Local Alignments Aim: Find candidate exons, or substrings that match a given gene sequence. Define a candidate exon as (l, r, w): l = starting position of exon r = ending position of exon w = weight of exon, defined as score of local alignment or some other weighting score Idea: Look for a chain of substrings with maximal score. Chain: a set of non-overlapping nonadjacent intervals.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Locate the beginning and end of each interval (2n points). Find the “best” concatenation of some of the intervals Exon Chaining Problem: Illustration

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Locate the beginning and end of each interval (2n points). Find the “best” concatenation of some of the intervals Exon Chaining Problem: Illustration

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Locate the beginning and end of each interval (2n points). Find the “best” concatenation of some of the intervals Exon Chaining Problem: Illustration

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Exon Chaining Problem: Formulation Goal: Given a set of putative exons, find a maximum set of non-overlapping putative exons (chain). Input: A set of weighted intervals (putative exons). Output: A maximum chain of intervals from this set.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Exon Chaining Problem: Formulation Goal: Given a set of putative exons, find a maximum set of non-overlapping putative exons (chain). Input: A set of weighted intervals (putative exons). Output: A maximum chain of intervals from this set. Question: Would a greedy algorithm solve this problem?

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info This problem can be solved with dynamic programming in O(n) time. Idea: Connect adjacent endpoints with weight zero edges, connect ends of weight w exon with weight w edge. Exon Chaining Problem: Graph Representation

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info ExonChaining (G, n) //Graph, number of intervals 1.for i ← 1 to 2n 2. s i ← 0 3.for i ← 1 to 2n 4. if vertex v i in G corresponds to right end of the interval I 5. j ← index of vertex for left end of the interval I 6. w ← weight of the interval I 7. s j ← max {s j + w, s i-1 } 8.else 9. s i ← s i-1 10.return s 2n Exon Chaining Problem: Pseudocode

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Exon Chaining: Deficiencies 1.Poor definition of the putative exon endpoints. 2.Optimal chain of intervals may not correspond to a valid alignment. Example: First interval may correspond to a suffix, whereas second interval may correspond to a prefix.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Human Genome Frog Genes (known) Infeasible Chains: Illustration Red local similarities form two non-overlapping intervals but do not form a valid global alignment.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info The cell carries DNA as a blueprint for producing proteins, like a manufacturer carries a blueprint for producing a car. Gene Prediction Analogy

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction Analogy: Using Blueprint Each protein has its own distinct blueprint for construction.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction Analogy: Assembling Exons

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction Analogy: Still Assembling…

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Section 3: Spliced Alignment

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment Mikhail Gelfand and colleagues proposed a spliced alignment approach of using a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome. Spliced alignment begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem). This set is further filtered in a such a way that attempts to retain all true exons, with some false ones.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment Problem: Formulation Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence. Input: Genomic sequences G, target sequence T, and a set of candidate exons B. Output: A chain of exons C such that the global alignment score between C* and T is maximum among all chains of blocks from B. Here C* is the concatenation of all exons from chain C.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Example: Lewis Carroll’s “Jabberwocky” Genomic Sequence: “It was a brilliant thrilling morning and the slimy, hellish, lithe doves gyrated and gamboled nimbly in the waves.” Target Sequence: “Twas brillig, and the slithy toves did gyre and gimble in the wabe.” Alignment: Next Slide…

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Example: Lewis Carroll’s “Jabberwocky”

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Example: Lewis Carroll’s “Jabberwocky”

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Example: Lewis Carroll’s “Jabberwocky”

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Example: Lewis Carroll’s “Jabberwocky”

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Example: Lewis Carroll’s “Jabberwocky”

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment: Idea Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T: S(i,j) But what is “i-prefix” of G? There may be a few i-prefixes of G, depending on which block B we are in.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment: Idea Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T: S(i,j) But what is “i-prefix” of G? There may be a few i-prefixes of G, depending on which block B we are in. Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T under the assumption that the alignment uses the block B at position i: S(i, j, B).

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment Recurrence If i is not the starting vertex of block B: If i is the starting vertex of block B: where p is some indel penalty and δ is a scoring matrix.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment Recurrence After computing the three-dimensional table S(i, j, B), the score of the optimal spliced alignment is:

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment: Complications Considering multiple i-prefixes leads to slow down. Running time: where m is the target length, n is the genomic sequence length and |B| is the number of blocks. A mosaic effect: short exons are easily combined to fit any target protein.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment: Speedup

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment: Speedup

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Spliced Alignment: Speedup

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Exon Chaining vs Spliced Alignment In Spliced Alignment, every path spells out the string obtained by concatenation of labels of its edges. The weight of the path is defined as optimal alignment score between concatenated labels (blocks) and target sequence. Defines weight of entire path in graph, but not the weights for individual edges. Exon Chaining assumes the positions and weights of exons are pre-defined.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Section 4: Gene Prediction Tools

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Aligning Genome vs. Genome Goal: Align entire human and mouse genomes. Predict genes in both sequences simultaneously as chains of aligned blocks (exons). This approach does not assume any annotation of either human or mouse genes.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction Tools 1.GENSCAN/Genome Scan 2.TwinScan 3.Glimmer 4.GenMark

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info The GENSCAN Algorithm Algorithm is based on probabilistic model of gene structure. GENSCAN uses a “training set,” then the algorithm returns the exon structure using maximum likelihood approach standard to many HMM algorithms (Viterbi algorithm). Biological input: Codon bias in coding regions, gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc). Benefit: Covers cases where input sequence contains no gene, partial gene, complete gene, or multiple genes.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info GENSCAN Limitations Does not use similarity search to predict genes. Does not address alternative splicing. Could combine two exons from consecutive genes together.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info GenomeScan Incorporates similarity information into GENSCAN: predicts gene structure which corresponds to maximum probability conditional on similarity information. Algorithm is a combination of two sources of information: 1.Probabilistic models of exons-introns. 2.Sequence similarity information.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info TwinScan Aligns two sequences and marks each base as gap ( - ), mismatch (:), or match (|). Results in a new alphabet of 12 letters Σ = {A-, A:, A |, C-, C:, C |, G-, G:, G |, T-, T:, T|}. Then run Viterbi algorithm using emissions e k (b) where b ∊ {A-, A:, A|, …, T|}.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info TwinScan The emission probabilities are estimated from human/mouse gene pairs. Example: e I (x|) e E (x-) since gaps (as well as mismatches) are favored in introns. Benefit: Compensates for dominant occurrence of poly-A region in introns.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Glimmer Glimmer: Stands for Gene Locator and Interpolated Markov ModelER Finds genes in bacterial DNA Uses interpolated Markov Models

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Glimmer Made of 2 programs: 1.BuildIMM: Takes sequences as input and outputs the Interpolated Markov Models (IMMs). 2.Glimmer: Takes IMMs and outputs all candidate genes. Automatically resolves overlapping genes by choosing one, hence limited. Marks “suspected to truly overlap” genes for closer inspection by user.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info GenMark Based on non-stationary Markov chain models. Results displayed graphically with coding vs. noncoding probability dependent on position in nucleotide sequence.