Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.

Slides:



Advertisements
Similar presentations
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Advertisements

Sequential Minimal Optimization Advanced Machine Learning Course 2012 Fall Semester Tsinghua University.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
Particle swarm optimization for parameter determination and feature selection of support vector machines Shih-Wei Lin, Kuo-Ching Ying, Shih-Chieh Chen,
Ab initio gene prediction Genome 559, Winter 2011.
Finding Eukaryotic Open reading frames.
CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation.
Comparative ab initio prediction of gene structures using pair HMMs
New EDA-approaches to feature selection for classification (of biological sequences) Yvan Saeys.
The Influence of Alternative Splicing in Protein Structure The fact that gene number is not significantly different between mammals and some invertebrates.
Lecture 12 Splicing and gene prediction in eukaryotes
Remote Homology detection: A motif based approach CS 6890: Bioinformatics - Dr. Yan CS 6890: Bioinformatics - Dr. Yan Swati Adhau Swati Adhau 04/14/06.
Progress report Yiming Zhang 02/10/2012. All AS events in ASIP Intron retention Exon skipping Alternative Acceptor site NAGNAG AltA Alternative Donor.
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.
Marcin Marszałek, Ivan Laptev, Cordelia Schmid Computer Vision and Pattern Recognition, CVPR Actions in Context.
Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.
GA-Based Feature Selection and Parameter Optimization for Support Vector Machine Cheng-Lung Huang, Chieh-Jen Wang Expert Systems with Applications, Volume.
From Structure to Function. Given a protein structure can we predict the function of a protein when we do not have a known homolog in the database ?
Random Sets Approach and its Applications Basic iterative feature selection, and modifications. Tests for independence & trimmings (similar to HITON algorithm).
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
MPL Identification of alternative spliced mRNA variants related to cancers by genome-wide ESTs alignment KIM DAE SOO Oncogene Apr.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
10/29/20151 Gene Finding Project (Cont.) Charles Yan.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
CISC Machine Learning for Solving Systems Problems Presented by: Ashwani Rao Dept of Computer & Information Sciences University of Delaware Learning.
Meng-Han Yang September 9, 2009 A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins.
Background & Motivation Problem & Feature Construction Experiments Design & Results Conclusions and Future Work Exploring Alternative Splicing Features.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Gang WangDerek HoiemDavid Forsyth. INTRODUCTION APROACH (implement detail) EXPERIMENTS CONCLUSION.
PROTEIN STRUCTURE SIMILARITY CALCULATION AND VISUALIZATION CMPS 561-FALL 2014 SUMI SINGH SXS5729.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
RNA Rearrangements in the Spliceosome
A Non-EST-Based Method for Exon-Skipping Prediction Rotem Sorek, Ronen Shemesh, Yuval Cohen, Ortal Basechess, Gil Ast and Ron Shamir Genome Research August.
Pre-mRNA secondary structures influence exon recognition Michael Hiller Bioinformatics Group University of Freiburg, Germany.
Feature Extraction Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and.
Prediction of Protein Binding Sites in Protein Structures Using Hidden Markov Support Vector Machine.
Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel Dr. Robertas Damaševičius.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features 王荣 14S
Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine 朱林娇 14S
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – the Transcription.
Combining multiple learners Usman Roshan. Decision tree From Alpaydin, 2010.
1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12, 2005 ChengXiang Zhai Department of Computer Science.
Nawanol Theera-Ampornpunt, Seong Gon Kim, Asish Ghoshal, Saurabh Bagchi, Ananth Grama, and Somali Chaterji Fast Training on Large Genomics Data using Distributed.
Machine Learning and Data Mining: A Math Programming- Based Approach Glenn Fung CS412 April 10, 2003 Madison, Wisconsin.
A distributed PSO – SVM hybrid system with feature selection and parameter optimization Cheng-Lung Huang & Jian-Fan Dun Soft Computing 2008.
Hybrid Ant Colony Optimization-Support Vector Machine using Weighted Ranking for Feature Selection and Classification.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
10. Decision Trees and Markov Chains for Gene Finding.
bacteria and eukaryotes
Results for all features Results for the reduced set of features
Babak Alipanahi1, Andrew Delong, Matthew T Weirauch & Brendan J Frey
Evaluating classifiers for disease gene discovery
Introduction Feature Extraction Discussions Conclusions Results
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
Reference based assembly
Finding Functionally Significant Structural Motifs in Proteins
Finding regulatory modules
Alternative Splicing May Not Be the Key to Proteome Complexity
Shih-Wei Lin, Kuo-Ching Ying, Shih-Chieh Chen, Zne-Jung Lee
Modeling of Spliceosome
Alternative Splicing: New Insights from Global Analyses
Gene Structure.
Deep Learning in Bioinformatics
Gene Structure.
Presentation transcript:

Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating different gene transcripts (called isoforms) from the same genomic sequence. Finding alternative splicing events experimentally is both expensive and time consuming. Computational methods, in general, and machine learning algorithms, in particular, can be used to complement experimental methods in the process of identifying alternative splicing events. In this paper, we explore the predictive power of a rich set of features that have been experimentally shown to affect alternative splicing. We use these features to build support vector machine (SVM) classifiers for distinguishing between alternatively spliced exons and constitutive exons. Our results show that simple linear SVM classifiers built from a rich set of features give results comparable to those of more sophisticated SVM classifiers that use more basic sequence features. Furthermore, we use feature selection methods to identify computationally the most informative features for the prediction. Introduction Data Set Strength of splice sites: for each splice site in a sequence,we calculate a strength score: (-3, +7) around donor sites, and (-26, +2) around acceptor sites. The range of this region covers the main AG dinucleotide which are binded by splicing factor around acceptor sites and the adjacent polypyrimidine tracts (PPT). For years, it was believed that one gene corresponds to one protein, but the discovery of alternative splicing provided a mechanism through which one gene can generate several distinct proteins. The task of accurately identifying alternative splicing isoforms is particularly intricate, as different transcriptional isoforms can be found in different tissues or cell types, at different development stages, or can be induced by external stimuli. ASEs are most frequent alternative splicing events among other patterns. Two transcripts isofomrs in the fig have different splicing of exons. The red labeled exon is skipped by one isoform. Other patterns of alternative splicing has alternative 3' and 5' exons, exons with intron retention. Motif features: three sets of motifs, intronic splicing regulator, MEME/MAST motifs, Exonic splicing enhancer. Publications [1] G. R Ratsch, S. Sonnenburg, and B. Scholkof. RASE: recog nition of alternative spliced exons in C. elegans Bioinformatics 21(Suppl 1):369–377, June [2] projects/RASE Feature Selection SVM feature importance: the importance of each class of features was estimated by taking the average, across all features in a class, of the corresponding feature weight in the weight vector w Information Gain: for measuring the importance of the features Support Vector Machine The support vector machine (SVM) algorithm is one of the most effective machine learning algorithms for many complex binary classification problems. We include previous defined features as input space for SVM. Model selection results in judicious choice of various parameters of the algorithm. e.g. the cost of constraint violation and kernel function of SVM Alternative Spliced Exons Secondary structure of pre-mRNA: affect alternative splicing through restrain proteins binding to motif sites based on motifs' locations. It was used as a filter to reduce motif sets. Optimal folding energy: corresponds to the stability of folding of secondary structure The data set contains alternatively spliced and constitutive exons in C.elegans. It has been used in related work [1] and is available at [2]. It contains 487 alternatively spliced exons and 2531 constitutive exons. The final data set was split into 5 independent subsets of training and testing files for cross validation purposes. Motif Evaluation The purpose of the motif evaluation in this section is to identify the splicing motifs that appear in several different sets, as those motifs are probably the most informative for alternative splicing. E.g. Motif from MEME GC content ratio: defined as GC-ratio, a sliding window scanning the local sequence around each splice site. Lengths of the exon: to be predicted and its flanking introns. Frame of the stop codon Experimental Results Comparison of ROC based on mixed features and basic sequence features AUC score comparison between data sets with reduced motifs by secondary structure and data set s with unfiltered motifs Intersection between MAST motifs and intronic splicing regulators