WABI03, 15-20 September. A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence. Paola Bonizzoni, Graziano Pesole, Raffaella Rizzi.

Presentation transcript:

A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence
Paola Bonizzoni, Graziano Pesole*, Raffaella Rizzi
DISCo, University of Milan-Bicocca, Italy
*Department of Physiology and Biochemistry, University of Milan, Italy
WABI03, 15-20 September
Supported by FIRB Bioinformatics: Genomics and Proteomics

Outline
- Gene structure and alternative splicing (AS)
- Problem definition and algorithm
- ASPic program
- Experimental results and discussion

Mechanism of Splicing
[Figure: DNA (5’ to 3’) is transcribed into a pre-mRNA containing exon 1, exon 2 and exon 3; splicing by the spliceosome joins exon 1, exon 2 and exon 3 into the splicing product (mRNA). An EST (Expressed Sequence Tag, cDNA) is shown against the exons.]

Modes of Alternative Splicing
[Figure: a genomic sequence with exons 1, 2 and 3 separated by introns, and three splicing modes: the first with exons 1, 2 and 3, the second with exons 1 and 3, the third with exons 2 and 3.]

Modes of Alternative Splicing
[Figure: further modes on exons 1, 2 and 3, labelled "Competing 5’–3’" and "Exclusive exons" (with an additional exon 2b).]

Why is AS important?
- AS occurs in 59% of human genes (Graveley, 2001)
- AS expands protein diversity (it generates multiple transcripts from a single gene)
- AS is tissue-specific (Graveley, 2001)
- AS is related to human diseases

Motivations
- Regulation of AS is still an open problem
- Tools are NEEDED to:
  - predict alternative splicing forms
  - analyze the mechanism through a representation of splicing forms

What is available?
Fast programs that produce a single EST alignment to a genomic sequence:
- Spidey (Wheelan et al., 2001)
- Squall (Ogasawara & Morishita, 2002)
But predicting the exon-intron gene structure is a complicated goal because:
- sequencing errors in ESTs make it difficult to locate splice sites by alignment
- duplications and repeated sequences may produce more than one possible EST alignment

Open Problems
- Formal definition of the AS prediction problem …
- Combined analysis of the EST alignments related to the same gene, by agreeing the ESTs to a common exon-intron gene structure
- Optimization criteria

Formal Definitions
- Def 1. Genomic sequence: $G = I_1 f_1 I_2 f_2 I_3 f_3 \dots I_n f_n I_{n+1}$, where $I_i$ ($i = 1, 2, \dots, n+1$) are introns and $f_i$ ($i = 1, 2, \dots, n$) are exons.
- Def 2. Exon factorization of $G$: $G_E = f_1 f_2 f_3 \dots f_n$.
- Def 3. An EST factorization of an EST $S$ compatible with $G_E$ is $S = s_1 s_2 \dots s_k$ such that there exist $1 \le i_1 < i_2 < \dots < i_k \le n$ with:
  - $s_t = f_{i_t}$ for $t = 2, 3, \dots, k-1$
  - $s_1$ is a suffix of $f_{i_1}$ and $s_k$ is a prefix of $f_{i_k}$
  - splice variant: $s_t = \mathrm{suff}(f_{i_t})$ or $s_t = \mathrm{pref}(f_{i_t})$
- Def 3 (with errors). Same as Def 3, with equality replaced by a bound on the edit distance:
  - $\mathrm{edit}(s_t, f_{i_t}) \le \mathit{error}$ for $t = 2, 3, \dots, k-1$
  - $\mathrm{edit}(s_1, \mathrm{suff}(f_{i_1})) \le \mathit{error}$ and $\mathrm{edit}(s_k, \mathrm{pref}(f_{i_k})) \le \mathit{error}$
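The exact (error-free) form of Def 3 can be illustrated with a small search. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the function name `find_factorization` and the toy exons are hypothetical, exons are assumed to be non-empty strings, and only factorizations with at least two factors ($k \ge 2$) are considered.

```python
from typing import List, Optional


def find_factorization(est: str, exons: List[str]) -> Optional[List[int]]:
    """Return 0-based exon indices i_1 < ... < i_k (k >= 2) such that
    est = s_1 ... s_k, where s_1 is a suffix of the first chosen exon,
    s_k is a prefix of the last chosen exon, and every internal factor
    equals a whole exon. Return None if no such factorization exists."""
    n = len(exons)

    def extend(pos: int, last: int) -> Optional[List[int]]:
        rest = est[pos:]  # part of the EST still to be explained
        for i in range(last + 1, n):
            f = exons[i]
            # Last factor s_k: the remainder is a non-empty prefix of f.
            if rest and f.startswith(rest):
                return [i]
            # Internal factor: the whole exon f must come next.
            if rest.startswith(f):
                tail = extend(pos + len(f), i)
                if tail is not None:
                    return [i] + tail
        return None

    for i, f in enumerate(exons):
        # First factor s_1: a non-empty suffix of f that starts the EST,
        # leaving at least one character for the following factors.
        for length in range(1, min(len(f), len(est) - 1) + 1):
            if est.startswith(f[len(f) - length:]):
                tail = extend(length, i)
                if tail is not None:
                    return [i] + tail
    return None


# Hypothetical toy data: four exons f_1..f_4 of an exon factorization G_E.
toy_exons = ["ACGT", "GGA", "TTCA", "CCGG"]
print(find_factorization("GTGGATT", toy_exons))  # uses exons 1, 2, 3
print(find_factorization("CGTTTC", toy_exons))   # skips exon 2
print(find_factorization("AAAA", toy_exons))     # not compatible
```

An EST that skips an exon (the second call) is still compatible, which is how one exon factorization can explain several splicing modes. The error-tolerant form of Def 3 would replace the string equalities with an edit-distance bound.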

The Problem
Input
- A genomic sequence $G$
- A set of EST sequences $S = \{S_1, S_2, \dots, S_n\}$
Output
- An exon factorization $G_E$ of $G$ ($G_E = f_1, f_2, \dots, f_n$) and a set of EST factorizations compatible with $G_E$
Objective: minimize $n$, the number of exons in $G_E$

Example
[Figure: a genomic sequence $G$ and an EST set $S = \{S_1, S_2, S_3\}$ whose factors are built from the blocks $A_1A_2$, $A_2$, $B$, $C_1$, $C_1C_2$, $D_1$ and $D_1D_2$. Two exon factorizations are shown: one with 7 exons and one with 4 exons ($A_1A_2$, $B$, $C_1C_2$, $D_1D_2$).]

Results
- MEFC is MAX-SNP-hard (linear reduction from NODE-COVER)
- Heuristic algorithm:
  - iterate the process to factorize each EST
  - backtrack to recompute previous EST factors if they are not compatible with $G_E$

The algorithm
Iterative $j$-th step: partial EST factorization of $S_i$ (compute factor $s_{ij}$, placed on $G$ as a candidate exon $e_m$).
- if Compatible($e_m$, exon_list) then add $e_m$ to exon_list; otherwise try to place $s_{ij}$ elsewhere
- if that is not possible then backtrack (to the factors $s_{i-1,1}, \dots, s_{i-1,n}$ of $S_{i-1}$)
- after placing all the factors $s_{ij}$ for the set $S$, place the external factors
[Figure: the factors $s_{i1}, \dots, s_{i,j-1}, s_{ij}$ of $S_i$ aligned to the exons $e_1, e_2, \dots, e_m$ on $G$.]
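The place-or-backtrack loop above can be sketched in Python. This is a simplified illustration, not the ASPic code: it assumes each EST already comes with a list of alternative placements (each a list of exon intervals), and it assumes a hypothetical compatibility rule in which two exons are compatible only if they are identical or disjoint. The slides do not define `Compatible`, so both the rule and the names `compatible` and `agree` are assumptions.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # half-open interval [start, end) on the genomic sequence


def compatible(exon: Exon, exon_list: List[Exon]) -> bool:
    """Assumed rule: an exon is compatible with the list if, for every
    exon already there, it is either identical to it or disjoint from it."""
    return all(
        exon == other or exon[1] <= other[0] or other[1] <= exon[0]
        for other in exon_list
    )


def agree(candidates: List[List[List[Exon]]]) -> Optional[Tuple[List[int], List[Exon]]]:
    """candidates[i] lists the alternative placements of EST i.
    Return the chosen placement index per EST plus the shared exon list,
    or None if no combination of placements is mutually compatible."""
    chosen: List[int] = []

    def place(i: int, exon_list: List[Exon]) -> Optional[List[Exon]]:
        if i == len(candidates):
            return exon_list
        for idx, placement in enumerate(candidates[i]):
            extended = list(exon_list)
            ok = True
            for exon in placement:
                if not compatible(exon, extended):
                    ok = False  # try to place this EST elsewhere
                    break
                if exon not in extended:
                    extended.append(exon)
            if ok:
                chosen.append(idx)
                result = place(i + 1, extended)
                if result is not None:
                    return result
                chosen.pop()  # backtrack: recompute this EST's factors
        return None

    result = place(0, [])
    if result is None:
        return None
    return list(chosen), sorted(result)


# Hypothetical data: the first placement of EST 0 clashes with EST 1,
# so the search backtracks and picks the second placement.
example = [
    [[(0, 10), (22, 30)], [(0, 10), (20, 30)]],
    [[(20, 30), (40, 50)]],
]
print(agree(example))
```

The sketch works on whole placements per EST rather than factor by factor, which keeps it short; the step described on the slide backtracks at the level of individual factors.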

The algorithm (more details)
Compute factor $s_{ij}$:
- $s_{ij}$ can be divided into $n$ components $c_k$ ($k = 1, 2, \dots, n$)
- At least one of these components for $k$ from 1 to $n-1$ is error-free and can be placed on $G$
- The algorithm searches for a perfect match of $c_1$ on $G$
- Suppose that $c_1$ has no perfect match on $G$; then the algorithm searches for a perfect match of $c_2$ on $G$
- Suppose that $c_2$ has a perfect match on $G$; then the entire factor $s_{ij}$ can be placed on $G$
- Find the canonical ag pattern on the left
- Find the rightmost gt pattern such that the edit distance between $s_{ij}\,y$ and the genomic substring from ag to gt is bounded
[Figure: factor $s_{ij}$ of $S_i$ split into components $c_1, \dots, c_5$; $c_2$ is matched on $G$ and the exon is delimited by ag on the left and gt on the right.]
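The component search and the ag/gt delimitation can be sketched as follows. This is a hedged Python illustration rather than the ASPic implementation: the helper names (`split_components`, `find_anchor`, `edit_distance`, `place_factor`), the equal-length split, the use of the nearest ag at or before the implied factor start, and the toy sequences are all assumptions. The role of $y$ in the slide is unclear, so the sketch compares the factor alone against the genomic substring.

```python
from typing import List, Optional, Tuple


def split_components(factor: str, n: int) -> List[Tuple[int, str]]:
    """Split factor into n contiguous, nearly equal components.
    Returns (offset in factor, component) pairs."""
    size, extra = divmod(len(factor), n)
    parts, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        parts.append((start, factor[start:end]))
        start = end
    return parts


def find_anchor(factor: str, genome: str, n: int) -> Optional[Tuple[int, int, int]]:
    """Try c_1, ..., c_{n-1} in order; return (k, offset, genome position)
    for the first component with a perfect match on the genome."""
    for k, (offset, comp) in enumerate(split_components(factor, n)[: n - 1]):
        pos = genome.find(comp) if comp else -1
        if pos != -1:
            return k, offset, pos
    return None


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def place_factor(factor: str, genome: str, n: int, max_err: int) -> Optional[Tuple[int, int]]:
    """Return (exon start, exon end) on the genome, with ag just before
    the start and gt at the end, or None if the factor cannot be placed."""
    anchor = find_anchor(factor, genome, n)
    if anchor is None:
        return None
    _, offset, pos = anchor
    approx_start = max(pos - offset, 0)
    ag = genome.rfind("ag", 0, approx_start)  # nearest ag to the left
    if ag == -1:
        return None
    start = ag + 2
    ideal_end = start + len(factor)
    hi = min(ideal_end + max_err, len(genome) - 2)
    lo = max(ideal_end - max_err, start)
    for end in range(hi, lo - 1, -1):  # rightmost gt first
        if genome.startswith("gt", end) and edit_distance(factor, genome[start:end]) <= max_err:
            return start, end
    return None


# Hypothetical toy data: the factor has one substitution in c_1,
# so c_1 has no perfect match and c_2 becomes the anchor.
toy_genome = "ttcctcag" + "atgccattcgacctgaatcc" + "gtaagtcc"
toy_factor = "atgcgattcgacctgaatcc"
print(find_anchor(toy_factor, toy_genome, 4))
print(place_factor(toy_factor, toy_genome, 4, 2))
```

Anchoring on an error-free component avoids aligning the whole factor against the whole genomic sequence; the edit-distance check is only run on the short window between the candidate ag and gt sites.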

ASPic (Alternative Splicing PredICtion)
Input
- A minimum length of an exon
- A maximum number of exons in the exon factorization of the genomic sequence
- An error percentage
- A genomic sequence
- An EST set (or cluster)
Output
- A text file for all EST alignments
- An HTML file for the exon factorization of the genomic sequence

ASPic data validation
ASPic input:
- Genomic sequences from the ASAP database
- EST clusters of human chromosome 1 from the UniGene database
Validation database:
- ASAP (Lee et al., 2003)

Experimental Results
[Table with columns: Genomic sequence (official gene name); Introns detected by ASAP; ASAP introns detected by ASPic; Novel introns detected by ASPic; Genomic shift detected by ASPic]

Execution times
Pentium IV, 1600 MHz, 256 MB, running Linux

An example of data (gene HNRPR)
- ASPic finds a novel intron from 2144 to 5333, confirmed by 18 EST sequences
- Positions are counted from 0 for ASPic and from 1 for ASAP
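Because the two tools count positions from different origins, coordinates have to be shifted before they are compared. The following is a minimal Python sketch; it assumes the tools differ only in the index origin and treat interval ends the same way, which the slides do not state, and the function names are hypothetical.

```python
def aspic_to_asap(position: int) -> int:
    """Convert a 0-based ASPic position to a 1-based ASAP position."""
    return position + 1


def asap_to_aspic(position: int) -> int:
    """Convert a 1-based ASAP position to a 0-based ASPic position."""
    return position - 1


# The novel HNRPR intron reported by ASPic, in ASPic (0-based) coordinates.
intron_aspic = (2144, 5333)
# Same intron in ASAP-style (1-based) coordinates, under the assumption above.
intron_asap = tuple(aspic_to_asap(p) for p in intron_aspic)
print(intron_asap)
```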

An example of data (gene HNRPR, intron )
[Table with columns: EST ID; Left and right ends of the two exons (EST exons, Genomic exons)]

WEB site

Credits
- Project leaders: Prof. Paola Bonizzoni, Prof. Graziano Pesole
- Software design lead: Raffaella Rizzi
- Web site: Gabriele Ravanelli
- Graphical representation: Francesco Perego, Anna Redondi
- Data analysis: Francesca Rossin
- Other contributions: Gianluca Dellavedova

THANK YOU!