Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel Dr. Robertas Damaševičius.

Presentation transcript:

Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel Dr. Robertas Damaševičius Software Engineering Department, Kaunas University of Technology Studentų , Kaunas, Lithuania

Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain

What is splicing?
Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined
Splice junctions: boundary points between exons and introns where splicing occurs
Donor: upstream part of intron, conserved dinucleotide GT
Acceptor: downstream part of intron, conserved dinucleotide AG
Pseudo splice-sites
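To illustrate why pseudo splice sites are a problem, the following minimal sketch scans a sequence for the conserved GT and AG dinucleotides. The function name and the example sequence are hypothetical; every hit is only a candidate, since most GT/AG occurrences in real DNA are not true splice junctions.

```python
def candidate_sites(seq, dinucleotide):
    """Return the 0-based start positions of every occurrence of a dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]


# Hypothetical toy sequence, not taken from the dataset used in the talk.
sequence = "CAGGTAAGTTTGTCAGG"
donor_candidates = candidate_sites(sequence, "GT")     # [3, 7, 11]
acceptor_candidates = candidate_sites(sequence, "AG")  # [1, 6, 14]
print(donor_candidates, acceptor_candidates)
```

A classifier is then needed to separate the true sites from the pseudo sites among these candidates.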

Problem
Splice-junction site recognition
  • Important for successful gene prediction
  • Study of genetic diseases
  • Understanding of genetic mechanisms
Difficulties
  • Noisy data
  • Pseudo splice sites
  • Non-canonical splice sites (intron is not GT...AG)
  • Alternative splicing
  • Multitude of consensus sequences
Machine Learning: Support Vector Machine (SVM)
  • Feature space mapping for SVM
  • Which frequency-based feature mapping is the best?

Support Vector Machine (SVM)
[Equation not recovered from the slide.]
Legend (symbols lost in extraction): training data vectors, unknown data vectors, target space, kernel function.
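For orientation, the standard textbook form of the kernel SVM decision function is given below. The symbols are assumptions chosen to match the slide's legend (x_i for training vectors, x for an unknown vector, y_i for the target labels, K for the kernel function); the slide's own notation was not recoverable and may differ.

```latex
f(\mathbf{x}) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad y_i \in \{-1, +1\}
```

Here the coefficients \alpha_i and the bias b are learned during training, and only the support vectors have \alpha_i > 0.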

What factors influence quality of classification?
Training data
  • size of dataset, generation of negative examples, imbalanced datasets
Mapping of data into feature space
  • Orthogonal, single nucleotide, nucleotide grouping, ...
Selection of an optimal kernel function
  • linear, polynomial, RBF, sigmoid
Kernel parameters
SVM learning parameters
  • Regularization parameter, Cost factor

SVM feature space
Feature space: multidimensional vector representing data instances
Mapping of data into features: achieving better classification accuracy
Feature space construction:
  • nucleotide position-dependent
  • nucleotide position-independent
  • both nucleotide position-dependent and -independent information
Feature mapping rule:
  • N – the length of a DNA sequence, M – the length of feature vector
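As a point of comparison for the position-dependent case, here is a minimal sketch of the orthogonal (one-hot) encoding that the talk lists among the feature space mappings. The exact encoding used by the author is not shown in the transcript, so the table and function name below are assumptions; with this choice a sequence of length N maps to a vector of length M = 4N.

```python
# Assumed one-hot code per nucleotide; the column order A, C, G, T is arbitrary.
ORTHOGONAL = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def orthogonal_encode(seq):
    """Map a DNA sequence of length N to a position-dependent vector of length 4N."""
    vector = []
    for base in seq.upper():
        # Unknown symbols (e.g. N) are encoded as all zeros in this sketch.
        vector.extend(ORTHOGONAL.get(base, (0, 0, 0, 0)))
    return vector


print(orthogonal_encode("GT"))  # [0, 0, 1, 0, 0, 0, 0, 1]
```

Unlike a k-mer frequency vector, this representation keeps the position of every nucleotide, at the cost of a vector length that grows with the sequence length.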

K-mers
K-mer: a k-base long sequence (k-tuple) of DNA
K-mer feature vector: constructed using a frequency (or probability) of each k-mer in a DNA sequence
Σ – alphabet, N – length of a DNA sequence, k – length of k-mer, n_j – number of occurrences of the j-th k-mer in a DNA sequence
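A k-mer frequency vector of this kind can be sketched as follows. The slide's formula was not recoverable, so the normalization is an assumption: each count n_j is divided by the number of overlapping windows, N - k + 1. The function name and the example sequence are hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer, in alphabet order.

    The result has len(alphabet) ** k entries, regardless of the sequence length.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with symbols outside the alphabet are skipped
            counts[kmer] += 1
    total = max(windows, 1)  # avoid division by zero for sequences shorter than k
    return [counts[kmer] / total for kmer in kmers]


features = kmer_frequencies("AAGGTAAGT", k=2)
print(len(features))  # 16
print(features[0])    # frequency of "AA": 0.25
```

Because the vector has one entry per possible k-mer, its length depends only on the alphabet size and k, which makes the representation position-independent.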

K-mer frequency mapping rules
4-letter (ACGT): Σ = {A, C, G, T}, ||Σ|| = 4
  • Disadvantage: feature space growth ~ 4^k
Nucleotide grouping based: SW, KM & RY
SW: Σ = {S, W}, ||Σ|| = 2
  • Strong (C, G) nucleotides – 3 H bonds
  • Weak (A, T) nucleotides – 2 H bonds
RY: Σ = {R, Y}, ||Σ|| = 2
  • A and G – purines (R)
  • C and T – pyrimidines (Y)
KM: Σ = {K, M}, ||Σ|| = 2
  • A and C – amines (M)
  • G and T – ketones (K)
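The three grouping rules amount to translating the sequence into a two-letter alphabet before counting k-mers. A minimal sketch, with hypothetical names and a toy sequence:

```python
from collections import Counter

# Translation tables built from the groupings listed on the slide.
GROUPINGS = {
    "SW": str.maketrans("ACGT", "WSSW"),  # C, G strong (S); A, T weak (W)
    "RY": str.maketrans("ACGT", "RYRY"),  # A, G purines (R); C, T pyrimidines (Y)
    "KM": str.maketrans("ACGT", "MMKK"),  # A, C amines (M); G, T ketones (K)
}


def regroup(seq, rule):
    """Rewrite a DNA sequence in the two-letter alphabet of the given rule."""
    return seq.upper().translate(GROUPINGS[rule])


ry = regroup("AAGGTAAGT", "RY")
print(ry)  # RRRRYRRRY

# Overlapping 2-mer counts over the reduced alphabet.
counts = Counter(ry[i:i + 2] for i in range(len(ry) - 1))
print(counts["RR"], counts["RY"], counts["YR"], counts["YY"])  # 5 2 1 0
```

After regrouping, a 2-mer vector has only 4 entries instead of 16, which is the feature space reduction the grouping rules are meant to deliver.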

Example: 2-mer frequency mapping

Case study
Dataset: UCI repository, Genbank 64.1 primate data
  • 3175 sequences, each (-30 bp, +30 bp) with regard to splice site
Three splice site recognition sub-problems:
  • Exon/Intron (EI) vs. Negative (N)
  • Intron/Exon (IE) vs. Negative (N)
  • Exon/Intron (EI) vs. Intron/Exon (IE)
Three datasets:
  • EI vs. N: 767 EI and 1655 N
  • IE vs. N: 768 IE and 1655 N
  • EI vs. IE: 767 EI and 768 IE
Power series kernel
Accuracy evaluation metric: F-measure

Classification results: Exon/Intron vs. Negative

Classification results: Intron/Exon vs. Negative

Classification results: Intron/Exon vs. Exon/Intron

Classification time

Feature vector size
Intron/exon splice sites, 2422 sequences

Evaluation of results
Classification accuracy:
  • Exon/Intron vs. N. – 4-mer ACGT frequency mapping (78.05%)
  • Intron/Exon vs. N. – 6-mer ACGT frequency mapping (70.75%)
  • E/I vs. I/E – 6-mer ACGT frequency mapping (90.59%)
  • 4-mers and 6-mers better than 5-mers
  • RY always better than SW or KM
Feature space size:
  • ACGT k-mer: 4^k
  • SW, RY, KM k-mer: 2^k
Classification speed:
  • SW/KM/RY k-mer frequency based classification can be ~ 2 times faster than ACGT k-mer classification
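The feature space sizes quoted above follow directly from the alphabet size raised to the power k. A small sketch (the function name is hypothetical) tabulates both cases for the k values discussed in the talk:

```python
def feature_space_size(alphabet_size, k):
    """Number of distinct k-mers, and hence feature vector length, for an alphabet."""
    return alphabet_size ** k


print("k  ACGT  SW/RY/KM")
for k in range(1, 7):
    print(k, feature_space_size(4, k), feature_space_size(2, k))
# For k = 6 this gives 4096 features for ACGT versus 64 for a two-letter grouping.
```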

Why is RY better than SW or KM?

Rule   Donor (EI) consensus       Acceptor (IE) consensus
ACGT   (C|A)AG / GT(A|G)AGT       (C|T)n N(C|T)AG / G
SW     (S|W)SW / WS(S|W)SWS       (S|W)n N(S|W)SW / W
KM     KKM / MM(K|M)KMM           (K|M)n N(K|M)KM / M
RY     (R|Y)RR / RYRRRY           Yn NYRR / R

Acceptor consensus sequence has long runs of pyrimidines (Y)

Conclusions
Selection of the appropriate feature mapping rule can greatly influence the DNA sequence classification results
Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules
For a trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency mapping
Open research problem: “forbidden” k-mers

Questions?

SVM kernel function optimization
Introduction of additional kernel parameters
Introduction of new kernels
Power series kernel function
Advantage:
  • more parameters for optimization
  • better separation of classes in feature space
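The kernel's formula did not survive in the transcript. The sketch below assumes one plausible reading of a "power series kernel": a truncated power series in the dot product, K(x, y) = a_1 (x·y) + a_2 (x·y)^2 + ... + a_n (x·y)^n. The function name, the coefficient list, and this exact form are assumptions, not the author's confirmed definition.

```python
def power_series_kernel(x, y, coeffs):
    """Assumed form: sum over i >= 1 of coeffs[i-1] * (x . y) ** i.

    Each coefficient is a separate tunable kernel parameter. Keeping all
    coefficients non-negative keeps the function a valid (positive
    semi-definite) kernel.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return sum(c * dot ** i for i, c in enumerate(coeffs, start=1))


# Hypothetical feature vectors and coefficients, for illustration only.
value = power_series_kernel([1.0, 2.0], [3.0, 4.0], coeffs=[1.0, 0.5])
print(value)  # 11 + 0.5 * 121 = 71.5
```

Under this reading, the linear kernel is the special case with a single coefficient, and each extra term adds one more parameter to optimize, which matches the advantage stated on the slide.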

SW k-mer frequency mapping rule
SW ({A,T} vs. {C,G}) mapping rule reflects the difference in the number of hydrogen bonds in the DNA molecule
  • Strong (C, G) nucleotides - 3 H bonds
  • Weak (A, T) nucleotides - 2 H bonds
Related to physical-chemical properties of DNA
  • transport of electrons
  • mechanical waves along the DNA helix

RY k-mer frequency mapping rule
The RY mapping rule ({A, G} vs. {C, T}) describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence.
  • A and G – purines (R)
  • C and T – pyrimidines (Y)
Corresponds to the chemical composition bias in the DNA strand

KM k-mer mapping rule
The KM mapping rule ({A,C} vs. {G,T}) describes how ketones (K) and amines (M) are distributed along the DNA sequence
  • A and C – amines (M)
  • G and T – ketones (K)

Classification metric
F-measure
Advantage:
  • One measure that takes into account both recall and precision: a spectacular score in one does not compensate for a bad score in the other
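The F-measure formula is not preserved in the transcript. The sketch below assumes the usual balanced form (F1), the harmonic mean of precision and recall; the function name and the example counts are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure (F1) from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Perfect precision with poor recall still gives a low score.
lopsided = f_measure(tp=10, fp=0, fn=90)   # precision 1.0, recall 0.1
balanced = f_measure(tp=80, fp=20, fn=20)  # precision 0.8, recall 0.8
print(round(lopsided, 3), round(balanced, 3))
```

Because the harmonic mean is dominated by the smaller of the two values, the lopsided classifier scores far below the balanced one, which is the property the slide highlights.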