15-20 September, WABI03. A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence. Paola Bonizzoni Graziano.


1 15-20 September, WABI03
A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence
Paola Bonizzoni, Graziano Pesole*, Raffaella Rizzi
DISCo, University of Milan-Bicocca, Italy
*Department of Physiology and Biochemistry, University of Milan, Italy
Supported by FIRB Bioinformatics: Genomics and Proteomics

2 Outline
- Gene structure and alternative splicing (AS)
- Problem definition and algorithm
- ASPic program
- Experimental results and discussion

3 Mechanism of Splicing
[Figure: DNA (5' to 3') -> TRANSCRIPTION -> pre-mRNA containing exon 1, exon 2 and exon 3 separated by introns -> SPLICING by spliceosome -> splicing product (mRNA): exon 1, exon 2, exon 3 joined. An EST (Expressed Sequence Tag, cDNA) is shown against exons 1, 2 and 3.]
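
The splicing step in the figure can be illustrated with a minimal Python sketch. The toy sequence, the coordinates, and the function name `splice` are assumptions made for illustration only; exons are written in upper case and introns in lower case purely for readability.

```python
from typing import List, Tuple


def splice(genomic: str, exons: List[Tuple[int, int]]) -> str:
    """Join exon substrings of `genomic`.

    Each exon is a 0-based, half-open (start, end) interval; everything
    between consecutive exons is treated as an intron and dropped.
    """
    return "".join(genomic[start:end] for start, end in exons)


# Toy pre-mRNA: three exons (upper case) separated by two introns
# (lower case) that start with "gt" and end with "ag".
pre_mrna = "AAA" + "gtccag" + "CCC" + "gtttag" + "GGG"
exon_1, exon_2, exon_3 = (0, 3), (9, 12), (18, 21)

mrna = splice(pre_mrna, [exon_1, exon_2, exon_3])   # all three exons
skipped = splice(pre_mrna, [exon_1, exon_3])        # exon 2 skipped

# An EST is a fragment of a spliced transcript, so it can span an
# exon-exon junction that is not contiguous on the genomic sequence.
est = mrna[1:8]
```

Here `est` is a substring of the spliced transcript but not of `pre_mrna`, which is why ESTs have to be aligned to the genomic sequence exon by exon.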

4 Modes of Alternative Splicing
[Figure: a genomic sequence with exons 1, 2, 3 separated by introns. First splicing mode: 1 2 3. Second splicing mode: 1 3. Third splicing mode: 2 3.]

5 Modes of Alternative Splicing
[Figure: exons 1, 2, 3 and 2b, illustrating competing 5'-3' splice sites and exclusive exons.]

6 Why is AS important?
- AS occurs in 59% of human genes (Graveley, 2001)
- AS expands protein diversity (a single gene generates multiple transcripts)
- AS is tissue-specific (Graveley, 2001)
- AS is related to human diseases

7 Motivations
Regulation of AS is still an open problem.
We NEED tools to:
- predict alternative splicing forms
- analyze such a mechanism by a representation of splicing forms

8 What is available?
Fast programs that produce a single EST alignment to a genomic sequence:
- Spidey (Wheelan et al., 2001)
- Squall (Ogasawara & Morishita, 2002)
But predicting the exon-intron gene structure is a complicated goal because:
- sequencing errors in ESTs make it difficult to locate splice sites by alignment
- duplications and repeated sequences may produce more than one possible EST alignment

9 Open Problems
- Formal definition of the AS prediction problem …
- Combined analysis of the EST alignments related to the same gene, by agreeing ESTs to a common exon-intron gene structure
- Optimization criteria

10 Formal Definitions
Def 1 Genomic sequence: $G = I_1 f_1 I_2 f_2 I_3 f_3 \dots I_n f_n I_{n+1}$, where $I_i$ ($i = 1, 2, \dots, n+1$) are introns and $f_i$ ($i = 1, 2, \dots, n$) are exons.
Def 2 Exon factorization of $G$: $G_E = f_1 f_2 f_3 \dots f_n$.
Def 3 An EST factorization of an EST $S$ compatible with $G_E$ is $S = s_1 s_2 \dots s_k$ such that there exist $1 \le i_1 < i_2 < \dots < i_k \le n$ with:
- $s_t = f_{i_t}$ for $t = 2, 3, \dots, k-1$
- $s_1$ is a suffix of $f_{i_1}$ and $s_k$ is a prefix of $f_{i_k}$
- splice variant: $s_t = \mathrm{suff}(f_{i_t})$ or $s_t = \mathrm{pref}(f_{i_t})$
Def 3 with equality replaced by a bounded edit distance:
- $\mathrm{edit}(s_t, f_{i_t}) \le \mathit{error}$ for $t = 2, 3, \dots, k-1$
- $\mathrm{edit}(s_1, \mathrm{suff}(f_{i_1})) \le \mathit{error}$ and $\mathrm{edit}(s_k, \mathrm{pref}(f_{i_k})) \le \mathit{error}$
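
To make Def 3 concrete, here is a minimal Python sketch that searches for an EST factorization compatible with a given exon factorization. It covers only the exact (error-free) case with at least two factors and ignores the splice-variant condition; the function name `est_factorization` and the toy exons are assumptions for illustration, not part of ASPic.

```python
from typing import List, Optional


def est_factorization(est: str, exons: List[str]) -> Optional[List[str]]:
    """Return factors s_1 ... s_k (k >= 2) of `est` such that s_1 is a
    suffix of some exon, inner factors are whole exons, s_k is a prefix
    of an exon, and exon indices strictly increase; None if impossible.
    Exons are assumed to be non-empty strings.
    """
    n = len(exons)

    def extend(pos: int, last: int) -> Optional[List[str]]:
        # est[:pos] is already factorized; `last` is the last exon index used.
        rest = est[pos:]
        for j in range(last + 1, n):
            f = exons[j]
            if rest and f.startswith(rest):
                return [rest]                      # final factor: prefix of f_j
            if rest.startswith(f) and len(rest) > len(f):
                tail = extend(pos + len(f), j)     # inner factor: whole exon f_j
                if tail is not None:
                    return [f] + tail
        return None

    for i, f in enumerate(exons):
        for cut in range(len(f)):                  # s_1 = f[cut:], a suffix of f_i
            s1 = f[cut:]
            if est.startswith(s1) and len(est) > len(s1):
                tail = extend(len(s1), i)
                if tail is not None:
                    return [s1] + tail
    return None


exons = ["ACGT", "GGA", "TTCA"]                    # toy exon factorization G_E
full = est_factorization("GTGGATT", exons)         # uses exons 1, 2, 3
skip = est_factorization("GTTTC", exons)           # skips exon 2
none = est_factorization("GGAAC", exons)           # no compatible factorization
```

The error-tolerant variant of Def 3 would replace the `startswith` comparisons with an edit-distance test against the error bound.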

11 The Problem
Input
- A genomic sequence $G$
- A set of EST sequences $S = \{S_1, S_2, \dots, S_n\}$
Output
An exon factorization $G_E$ of $G$ ($G_E = f_1, f_2, \dots, f_n$) and a set of EST factorizations compatible with $G_E$
Objective: minimize $n$, the number of exons in $G_E$

12 Example
[Figure: a genomic sequence G and an EST set S = {S1, S2, S3}, with factors labelled A1, A2, B, C1, C2, D1, D2. Two exon factorizations of G are compared: one with 7 exons and one with 4 exons.]

13 Results
- MEFC is MAX-SNP-hard (linear reduction from NODE-COVER)
- Heuristic algorithm: iterate a process that factorizes each EST, backtracking to recompute previous EST factors if they are not compatible with $G_E$

14 The algorithm
Iterative j-th step: partial EST factorization of S_i (compute factor s_ij)
- if (Compatible(e_m, exon_list)) then add e_m to exon_list;
- otherwise try to place s_ij elsewhere;
- if not possible then backtrack;
After placing all the factors s_ij for the set S, place the external factors.
[Figure: factors s_i1, ..., s_ij-1, s_ij of S_i placed on exons e_1, e_2, ..., e_m of G; the factors s_i-1 1, ..., s_i-1 n of the previous EST S_i-1 are shown for the backtracking step.]
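
The control flow of this step (place a factor, test compatibility, try elsewhere, backtrack) can be sketched in Python as follows. This is only an illustration of the backtracking pattern, not the ASPic implementation. In particular, it assumes that the candidate placements of every factor are already known as intervals on G, and the `compatible` test (a candidate exon must equal a known exon or overlap none) is a stand-in for the slide's Compatible predicate, whose exact definition is not given here.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the genomic sequence


def compatible(e: Interval, exon_list: List[Interval]) -> bool:
    # ASSUMPTION: accept e if it equals a known exon or overlaps none of them.
    return all(e == x or e[1] <= x[0] or x[1] <= e[0] for x in exon_list)


def place(factors: List[List[Interval]],
          exon_list: Optional[List[Interval]] = None,
          j: int = 0) -> Optional[List[Interval]]:
    """Choose one placement per factor so that all chosen exons are
    mutually compatible; return the exon list, or None if impossible."""
    exon_list = [] if exon_list is None else exon_list
    if j == len(factors):
        return exon_list
    for e in factors[j]:                    # try each candidate placement
        if compatible(e, exon_list):
            added = e not in exon_list
            if added:
                exon_list.append(e)
            result = place(factors, exon_list, j + 1)
            if result is not None:
                return result
            if added:
                exon_list.pop()             # backtrack: undo this placement
    return None


# Hypothetical candidates: the first choice for factor 0 conflicts with
# the only placement of factor 2, so the search must backtrack.
candidates = [[(0, 10), (0, 8)], [(5, 12), (20, 30)], [(9, 15)]]
exon_list = place(candidates)
```

In this toy run the first placement (0, 10) is abandoned after factor 2 fails, and the search succeeds with (0, 8).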

15 The algorithm (more details)
Compute factor s_ij:
- s_ij can be divided into n components c_k (k=1,2,…,n)
- At least one of these components for k from 1 to (n-1) is error-free and can be placed on G
- The algorithm searches for a perfect match of c_1 on G
- Suppose that c_1 has no perfect match on G. Then the algorithm searches for a perfect match of c_2 on G
- Suppose that c_2 has a perfect match on G. Then the entire factor s_ij can be placed on G
- Find the canonical ag pattern on the left
- Find the rightmost gt pattern such that the edit distance between s_ij y and the genomic substring from ag to gt is bounded
[Figure: factor s_ij of S_i split into components c_1, ..., c_5; c_2 is matched on G and the exon is delimited by ag on the left and gt on the right.]
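
The steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the ASPic code: the whole factor is assumed to be known in advance, the first exact occurrence of a component is taken as the anchor (repeats are ignored), and the function names, the search window of `max_err` positions around the expected boundaries, and the toy sequences are all hypothetical.

```python
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def split_components(s: str, n: int) -> List[str]:
    """Cut s into consecutive components of equal size (at most n of them)."""
    size = -(-len(s) // n)  # ceiling division
    return [s[k:k + size] for k in range(0, len(s), size)]


def find_anchor(s: str, g: str, n: int) -> Optional[Tuple[int, int, int]]:
    """Return (k, offset in s, position in g) for the first component
    that has a perfect match on g, or None."""
    offset = 0
    for k, c in enumerate(split_components(s, n)):
        pos = g.find(c)  # ASSUMPTION: the first occurrence is the right one
        if pos != -1:
            return k, offset, pos
        offset += len(c)
    return None


def place_factor(s: str, g: str, n: int, max_err: int) -> Optional[Tuple[int, int]]:
    """Return the half-open exon interval (start, end) on g for factor s."""
    hit = find_anchor(s, g, n)
    if hit is None:
        return None
    _, offset, pos = hit
    expected = pos - offset  # where the factor would start with no indels
    window = sorted(range(expected - max_err, expected + max_err + 1),
                    key=lambda x: abs(x - expected))
    # Left boundary: exon starts right after a canonical "ag".
    start = next((x for x in window if x >= 2 and g[x - 2:x] == "ag"), None)
    if start is None:
        return None
    # Right boundary: rightmost "gt" keeping the edit distance bounded.
    hi = min(expected + len(s) + max_err, len(g) - 2)
    for end in range(hi, start, -1):
        if g[end:end + 2] == "gt" and edit_distance(s, g[start:end]) <= max_err:
            return start, end
    return None


genome = "ttttag" + "acctgcatcgga" + "gtttt"   # intron end, exon, intron start
factor = "acgtgcatcgga"                        # exon with one substitution
anchor = find_anchor(factor, genome, 3)
exon = place_factor(factor, genome, 3, 1)
```

In this toy run the first component contains the error and has no perfect match, the second component anchors the factor, and the exon is then delimited by the flanking ag and gt.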

16 ASPic (Alternative Splicing PredICtion)
Input
- A minimum length of an exon
- A maximum number of exons in the exon factorization of the genomic sequence
- An error percentage
- A genomic sequence
- An EST set (or cluster)
Output
- A text file for all EST alignments
- An HTML file for the exon factorization of the genomic sequence

17 ASPic data validation
ASPic INPUT:
- Genomic sequences from ASAP database
- EST clusters of human chromosome 1 from UniGene database
Validation Database:
- ASAP (Lee et al., 2003)

18 Experimental Results
[Table columns: Genomic sequence (official gene name); Introns detected by ASAP; ASAP introns detected by ASPic; Novel introns detected by ASPic; Genomic shift detected by ASPic]

19 Execution times
Pentium IV, 1600 MHz, 256 MB, running Linux

20 An example of data (gene HNRPR)
- ASPic finds a novel intron from 2144 to 5333, confirmed by 18 EST sequences
- Positions are counted from 0 in ASPic and from 1 in ASAP

21 An example of data (gene HNRPR, intron 2144-5333)
[Table columns: EST ID; Left and right ends of the two exons (EST exons, Genomic exons)]

22 WEB site

23 WEB site

24 WEB site

25 Credits
- Project leaders: Prof. Paola Bonizzoni, Prof. Graziano Pesole
- Software design lead: Raffaella Rizzi
- Web site: Gabriele Ravanelli
- Graphical representation: Francesco Perego, Anna Redondi
- Data analysis: Francesca Rossin
- Other contributions: Gianluca Dellavedova

26 THANK YOU!

