Published by Gwenda Greene. Modified over 9 years ago.
1
Gene Prediction: Similarity-Based Methods
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 15, 2005
ChengXiang Zhai
Department of Computer Science, University of Illinois, Urbana-Champaign
Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm
2
The Gene Prediction Problem
Given genome sequences, determine where the genes are
The problem is easier for prokaryotes (no introns)
The problem is significantly harder for eukaryotes (introns and alternative splicing)
3
Splicing Causes Problem…
4
Exons vs. Introns
Exon: a portion of the gene that appears in both the primary and the mature mRNA transcripts.
Intron: a portion of the gene that is transcribed but excised prior to translation.
5
Definition of a Gene
Regulatory regions: up to 50 kb upstream of the +1 site
Exons: protein-coding and untranslated regions (UTR)
– 1 to 178 exons per gene (mean 8.8)
– 8 bp to 17 kb per exon (mean 145 bp)
Introns: splice acceptor and donor sites, junk DNA
– average 1 kb – 50 kb per intron
Gene size: largest 2.4 Mb (dystrophin); mean 27 kb
6
Different Views of a Gene
[Figure: the same gene shown as DNA (ATGCTTGCCAAAT…TCG…) with exons and introns, as pre-mRNA and mRNA with exons e1, e2, e3, and as protein (MSRTAQ…)]
7
Approaches to Gene Prediction
Similarity-based approaches:
– Exploit the fact that many genes are conserved across species
– Can be highly reliable
– Only good for finding genes that resemble already-known genes
Statistical approaches:
– Exploit statistical characteristics of coding and non-coding regions, and other knowledge about genes
– Can potentially detect new genes
– May not be reliable
They can/should be combined
– Currently no principled approaches for doing this
8
Outline
The idea of the similarity-based approach to gene prediction
Exon Chaining Problem
Spliced Alignment Problem
9
Using Known Genes to Predict New Genes
Some organisms' genomes are very well documented, with many genes experimentally verified.
Closely related organisms may have similar genes.
Unknown genes in one species can be found by comparison with known genes in a closely related species.
10
Comparing Genes in Two Genomes Small islands of similarity corresponding to similarities between exons
11
Reverse Translation
Given a known protein, find a gene in the genome which codes for it
One might infer the coding DNA of the given protein by reversing the translation process
– Inexact: most amino acids map to more than one codon
– The problem is therefore essentially reduced to an alignment problem
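To make the "inexact" point concrete, here is a minimal sketch that counts how many distinct coding DNA sequences could encode a given protein. It assumes the standard genetic code; the names `CODON_COUNT` and `count_coding_sequences` are illustrative and not from the lecture. The sample protein is the fragment MSRTAQ shown on the "Different Views of a Gene" slide.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (assumption: standard code, stop codons ignored).
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_coding_sequences(protein: str) -> int:
    """Return how many DNA sequences translate to the given protein."""
    return prod(CODON_COUNT[aa] for aa in protein)


if __name__ == "__main__":
    # Six residues already admit over a thousand candidate DNA strings.
    print(count_coding_sequences("MSRTAQ"))
```

Because the count grows multiplicatively with protein length, enumerating candidate DNA strings is hopeless for real proteins, which is why the search is recast as an alignment.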
12
Comparing Genomic DNA Against mRNA
[Figure: mRNA (codon sequence) aligned against a portion of the genome; the mRNA matches exon1, exon2 and exon3 and skips intron1 and intron2]
13
Using Similarities to Find the Exon Structure
The known frog gene is aligned to different locations in the human genome
Find the “best” path to reveal the exon structure of the human gene
[Figure: frog gene (known) plotted against the human genome]
14
Finding Local Alignments
Use local alignments to find all islands of similarity
[Figure: frog genes (known) plotted against the human genome]
15
Chaining Local Alignments
Find substrings that match a given gene sequence (candidate exons)
Represent each candidate exon as a triple (l, r, w): left end, right end, and weight, where the weight is the score of the local alignment
Look for a maximum-weight chain of substrings
– Chain: a set of non-overlapping, nonadjacent intervals.
16
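Exon Chaining Problem
Locate the beginning and end of each interval (2n points)
Find the “best” path
[Figure: seven weighted intervals (weights 3, 4, 11, 9, 15, 5, 5) over an axis with endpoints 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32]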
17
Exon Chaining Problem: Formulation
Exon Chaining Problem: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons
Input: a set of weighted intervals (putative exons)
Output: a maximum-weight chain of intervals from this set
18
Exon Chaining: Graph Representation
This problem can be solved with dynamic programming in O(n) time, given the 2n interval endpoints in sorted order (sorting them first costs O(n log n)).
19
Exon Chaining Algorithm
ExonChaining(G, n)   // graph, number of intervals
  for i ← 0 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of interval I
      w ← weight of interval I
      s_i ← max {s_j + w, s_(i-1)}
    else
      s_i ← s_(i-1)
  return s_(2n)
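The pseudocode above can be sketched in Python as follows. This is one possible rendering, not the lecture's own code: the function name `exon_chaining`, the event-list representation in place of an explicit graph, and the tie-breaking rule for equal coordinates are all assumptions. The sample intervals are made up for illustration.

```python
def exon_chaining(intervals):
    """Return the maximum total weight of a chain of disjoint intervals.

    intervals: list of (left, right, weight) tuples with left <= right.
    """
    # One event per endpoint: (coordinate, kind, interval index).
    # kind 0 = left end, 1 = right end. At equal coordinates a left end
    # sorts first, so intervals that merely touch count as overlapping
    # (an assumption; the slides take all 2n endpoints to be distinct).
    events = []
    for idx, (left, right, _weight) in enumerate(intervals):
        events.append((left, 0, idx))
        events.append((right, 1, idx))
    events.sort()

    s = [0] * (len(events) + 1)  # s[i]: best chain weight over the first i endpoints
    left_pos = {}                # interval index -> position of its left endpoint
    for i, (_coord, kind, idx) in enumerate(events, start=1):
        if kind == 0:
            left_pos[idx] = i
            s[i] = s[i - 1]
        else:
            w = intervals[idx][2]
            # Either end a chain with this interval, or skip it.
            s[i] = max(s[left_pos[idx]] + w, s[i - 1])
    return s[-1]


if __name__ == "__main__":
    # Hypothetical putative exons as (l, r, w) triples.
    print(exon_chaining([(1, 4, 5), (3, 8, 6), (6, 10, 4), (9, 12, 7)]))
```

The sort dominates the running time; the scan over the 2n endpoints is linear, matching the dynamic-programming step on the slide.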
20
Exon Chaining: Deficiencies
– Poor definition of the putative exon endpoints
– Optimal chain of intervals may not correspond to any valid alignment
  First interval may correspond to a suffix, whereas second interval may correspond to a prefix
  Combination of such intervals is not a valid alignment
21
Spliced Alignment
Proposed in 1996 by Mikhail Gelfand and colleagues
Goal: use a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.
Method
– Begin by selecting candidate exons: either all putative exons between potential acceptor and donor sites, or all substrings similar to the target protein (as in the Exon Chaining Problem)
– Find a chain of putative exons that has the highest similarity to the target protein
22
Spliced Alignment Problem: Formulation
Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence
Input: genomic sequence G, target sequence T, and a set of candidate exons (blocks) B.
Output: a chain of exons Γ such that the global alignment score s(Γ*, T) is maximum among all chains of blocks from B, where Γ* is the string formed by concatenating the strings in Γ.
Essentially an alignment problem…
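The formulation can be illustrated with a brute-force sketch that tries every chain of blocks and scores its concatenation against the target. This is not the efficient spliced alignment algorithm, which the lecture defers; it only restates the objective in code and takes exponential time in the number of blocks. The scoring scheme (match +1, mismatch −1, gap −1), the inclusive (l, r) block coordinates, the function names, and the toy sequences are all assumptions.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (assumed simple linear scoring)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


def best_chain(genome, target, blocks):
    """Try every chain of blocks; return (best score, chain as a list).

    blocks: (l, r) index pairs into genome, both ends inclusive.
    """
    blocks = sorted(blocks)
    best = (global_score("", target), [])  # the empty chain as a baseline
    for k in range(1, len(blocks) + 1):
        for chain in combinations(blocks, k):
            # A chain requires each block to end before the next begins.
            if any(chain[i][1] >= chain[i + 1][0] for i in range(k - 1)):
                continue
            spliced = "".join(genome[l:r + 1] for l, r in chain)
            score = global_score(spliced, target)
            if score > best[0]:
                best = (score, list(chain))
    return best


if __name__ == "__main__":
    # Hypothetical toy data: two blocks spell out the target exactly.
    print(best_chain("ACGCCCCTTGAAA", "ACGTTG", [(0, 2), (1, 4), (3, 5), (7, 9)]))
```

Unlike exon chaining, the objective here scores the whole concatenated string against the target, so a chain whose pieces do not fit together into one alignment cannot win.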
23
Lewis Carroll Example
24
The solution to the spliced alignment problem will be discussed later when we talk about sequence alignment…
25
What You Should Know
Why splicing causes difficulty in gene prediction
The formulation and algorithm for Exon Chaining
Why Spliced Alignment is a better formulation than Exon Chaining