Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel
Dr. Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
robertas.damasevicius@ktu.lt

Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain

Slide 2: What is splicing?
Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined
Splice junctions: boundary points between exons and introns where splicing occurs
Donor: upstream part of intron, conserved dinucleotide GT
Acceptor: downstream part of intron, conserved dinucleotide AG
Pseudo splice-sites

Slide 3: Problem
Splice-junction site recognition
- Important for successful gene prediction
- Study of genetic diseases
- Understanding of genetic mechanisms
Difficulties
- Noisy data
- Pseudo splice sites
- Non-canonical splice sites (intron is not GT...AG)
- Alternative splicing
- Multitude of consensus sequences
Machine Learning: Support Vector Machine (SVM)
- Feature space mapping for SVM
- Which frequency-based feature mapping is the best?

Slide 4: Support Vector Machine (SVM)
[formula and its symbols not preserved in this transcript]
The surviving legend names four quantities: the training data vectors, the unknown data vectors, the target space, and the kernel function.
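
For orientation, the textbook kernel SVM decision function is written out here. This is the standard formulation, not a reconstruction of the formula shown on the slide, and every symbol is an assumption chosen for this note: x_i are the training data vectors, x is an unknown data vector, y_i in {-1, +1} are the class labels of the training vectors, alpha_i and b are the learned coefficients and bias, and K is the kernel function.

```latex
f(\mathbf{x}) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right)
```

Only training vectors with alpha_i > 0 (the support vectors) contribute to the sum.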

Slide 5: What factors influence quality of classification?
Training data
- size of dataset, generation of negative examples, imbalanced datasets
Mapping of data into feature space
- Orthogonal, single nucleotide, nucleotide grouping, ...
Selection of an optimal kernel function
- linear, polynomial, RBF, sigmoid
Kernel parameters
SVM learning parameters
- Regularization parameter, cost factor

Slide 6: SVM feature space
Feature space: multidimensional vector space in which data instances are represented
Mapping of data into features: achieving better classification accuracy
Feature space construction:
- nucleotide position-dependent
- nucleotide position-independent
- both nucleotide position-dependent and -independent information
Feature mapping rule: [formula not preserved in this transcript]
- N – the length of a DNA sequence, M – the length of the feature vector

Slide 7: K-mers
K-mer: a k-base long sequence (k-tuple) of DNA
K-mer feature vector: constructed using the frequency (or probability) of each k-mer in a DNA sequence [formula not preserved in this transcript]
Σ – alphabet, N – length of a DNA sequence, k – length of k-mer, n_j – number of occurrences of the j-th k-mer in a DNA sequence
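
A minimal sketch of a k-mer frequency feature vector follows. It assumes overlapping windows and normalization by the number of windows, N - k + 1; the slide's exact formula is not preserved, so this normalization, the function name `kmer_frequencies`, and the handling of symbols outside the alphabet are all assumptions.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer over `alphabet`.

    The vector has len(alphabet)**k entries, ordered lexicographically
    by the order of symbols in `alphabet`.
    """
    windows = len(seq) - k + 1  # number of overlapping k-long windows
    if k < 1 or windows < 1:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")

    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    for i in range(windows):
        kmer = seq[i:i + k]
        # Windows containing symbols outside the alphabet (e.g. 'N') are skipped.
        if kmer in counts:
            counts[kmer] += 1

    return [counts[kmer] / windows for kmer in kmers]


if __name__ == "__main__":
    vec = kmer_frequencies("ACGT", 2)
    print(len(vec))  # 16 possible 2-mers over {A, C, G, T}
```

For the sequence `ACGT` and k = 2 there are three windows (AC, CG, GT), so each of those three entries is 1/3 and the other thirteen are 0.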

Slide 8: K-mer frequency mapping rules
4-letter (ACGT): Σ = {A, C, G, T}, ||Σ|| = 4
- Disadvantage: feature space growth ~ 4^k
Nucleotide grouping based: SW, KM & RY
SW: Σ = {S, W}, ||Σ|| = 2
- Strong (C, G) nucleotides – 3 H bonds
- Weak (A, T) nucleotides – 2 H bonds
RY: Σ = {R, Y}, ||Σ|| = 2
- A and G – purines (R)
- C and T – pyrimidines (Y)
KM: Σ = {K, M}, ||Σ|| = 2
- A and C – amines (M)
- G and T – ketones (K)
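
The three grouping rules can be sketched as simple character translations applied before k-mers are counted. The names `GROUPINGS`, `recode`, and `kmer_counts` are illustrative and not taken from the slides; the letter assignments follow the groupings listed above.

```python
from collections import Counter

# Translation tables from the 4-letter alphabet to each 2-letter alphabet.
# The target strings give the codes for A, C, G, T in that order.
GROUPINGS = {
    "SW": str.maketrans("ACGT", "WSSW"),  # strong (C, G) vs. weak (A, T)
    "RY": str.maketrans("ACGT", "RYRY"),  # purines (A, G) vs. pyrimidines (C, T)
    "KM": str.maketrans("ACGT", "MMKK"),  # amines (A, C) vs. ketones (G, T)
}


def recode(seq, rule):
    """Rewrite a DNA sequence in the two-letter alphabet named by `rule`."""
    return seq.upper().translate(GROUPINGS[rule])


def kmer_counts(seq, k):
    """Count overlapping k-mers in a (possibly recoded) sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    dna = "AGGTAAGT"
    for rule in ("SW", "RY", "KM"):
        coded = recode(dna, rule)
        print(rule, coded, dict(kmer_counts(coded, 2)))
```

Recoding first and counting afterwards keeps the counting step independent of the alphabet, so the same routine serves the 4-letter and all three 2-letter mappings.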

Slide 9: Example: 2-mer frequency mapping
[example figure not preserved in this transcript]

Slide 10: Case study
Dataset: UCI repository, Genbank 64.1 primate data
- 3175 sequences, each (-30 bp, +30 bp) with regard to splice site
Three splice site recognition sub-problems:
- Exon/Intron (EI) vs. Negative (N)
- Intron/Exon (IE) vs. Negative (N)
- Exon/Intron (EI) vs. Intron/Exon (IE)
Three datasets:
- EI vs. N: 767 EI and 1655 N
- IE vs. N: 768 IE and 1655 N
- EI vs. IE: 767 EI and 768 IE
Power series kernel
Accuracy evaluation metric: F-measure

Slide 11: Classification results: Exon/Intron vs. Negative [chart not preserved in this transcript]

Slide 12: Classification results: Intron/Exon vs. Negative [chart not preserved in this transcript]

Slide 13: Classification results: Intron/Exon vs. Exon/Intron [chart not preserved in this transcript]

Slide 14: Classification time [chart not preserved in this transcript]

Slide 15: Feature vector size
Intron/exon splice sites, 2422 sequences [chart not preserved in this transcript]

Slide 16: Evaluation of results
Classification accuracy:
- Exon/Intron vs. N. – 4-mer ACGT frequency mapping (78.05%)
- Intron/Exon vs. N. – 6-mer ACGT frequency mapping (70.75%)
- E/I vs. I/E – 6-mer ACGT frequency mapping (90.59%)
- 4-mers and 6-mers better than 5-mers
- RY always better than SW or KM
Feature space size:
- ACGT k-mer: 4^k
- SW, RY, KM k-mer: 2^k
Classification speed:
- SW/KM/RY k-mer frequency based classification can be ~2 times faster than ACGT k-mer classification
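
The feature space sizes quoted above follow directly from the alphabet size raised to the power k. The short sketch below tabulates them for the k values mentioned in the results; the helper name `feature_space_size` is illustrative.

```python
def feature_space_size(alphabet_size, k):
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


if __name__ == "__main__":
    print("k  ACGT  SW/RY/KM")
    for k in (4, 5, 6):
        print(k, feature_space_size(4, k), feature_space_size(2, k))
```

At k = 6 the 4-letter mapping yields 4096 features against 64 for any of the 2-letter mappings, which gives a sense of the size gap behind the speed comparison.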

Slide 18: Conclusions
Selection of the appropriate feature mapping rule can greatly influence the DNA sequence classification results
Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules
For a trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency
Open research problem: “forbidden” k-mers

Slide 19: Questions?

Slide 20: SVM kernel function optimization
Introduction of additional kernel parameters
Introduction of new kernels
Power series kernel function [formula not preserved in this transcript]
Advantage:
- more parameters for optimization
- better separation of classes in feature space
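
Because the kernel formula itself did not survive, the following is only an illustration of the general idea of a kernel with several tunable parameters, not the author's definition. It assumes a truncated power series in the dot product, K(x, y) = a_0 + a_1 (x·y) + a_2 (x·y)^2 + ..., with non-negative coefficients a_n, which keeps the function a valid (positive semi-definite) kernel. The function name and the coefficient list are hypothetical.

```python
def power_series_kernel(x, y, coeffs):
    """Evaluate sum_n coeffs[n] * (x . y)**n for two equal-length vectors.

    `coeffs` holds the series coefficients a_0, a_1, ..., a_d; they are the
    extra kernel parameters available for tuning. This form is an assumption.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if any(a < 0 for a in coeffs):
        raise ValueError("coefficients must be non-negative")

    dot = sum(xi * yi for xi, yi in zip(x, y))

    # Horner's scheme: evaluate the polynomial in `dot` from the highest
    # coefficient down, avoiding explicit powers.
    value = 0.0
    for a in reversed(coeffs):
        value = value * dot + a
    return value


if __name__ == "__main__":
    x, y = [1.0, 2.0], [3.0, 4.0]
    print(power_series_kernel(x, y, [0.0, 1.0]))         # reduces to the linear kernel
    print(power_series_kernel(x, y, [1.0, 0.5, 0.25]))   # three tunable coefficients
```

With coefficients `[0, 1]` the sketch reduces to the linear kernel, so the ordinary linear and polynomial kernels are special cases of this assumed form.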

Slide 21: SW k-mer frequency mapping rule
The SW ({A,T} vs. {C,G}) mapping rule reflects the difference in the number of hydrogen bonds in the DNA molecule
- Strong (C, G) nucleotides – 3 H bonds
- Weak (A, T) nucleotides – 2 H bonds
It is related to physical-chemical properties of DNA
- transport of electrons
- mechanical waves along the DNA helix

Slide 22: RY k-mer frequency mapping rule
The RY mapping rule ({A, G} vs. {C, T}) describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence.
- A and G – purines (R)
- C and T – pyrimidines (Y)
It corresponds to the chemical composition bias in the DNA strand

Slide 23: KM k-mer mapping rule
The KM mapping rule ({A,C} vs. {G,T}) describes how ketones (K) and amines (M) are distributed along the DNA sequence
- A and C – amines (M)
- G and T – ketones (K)

Slide 24: Classification metric
F-measure [formula not preserved in this transcript]
Advantage:
- One measure that takes into account both recall and precision: a spectacular score in one does not compensate for a bad score in the other
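
A small sketch of the F-measure follows, assuming the balanced form F = 2PR / (P + R), the harmonic mean of precision P and recall R. The slide's own formula is not preserved, so the balanced weighting and the function name `f_measure` are assumptions.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # The harmonic mean is pulled toward the smaller of the two values.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision 0.9 with recall 0.5: the score lands well below the
    # arithmetic mean of 0.7.
    print(round(f_measure(tp=90, fp=10, fn=90), 4))
```

In the usage example, precision is 0.9 and recall is 0.5, and the resulting score of about 0.643 shows how a high value in one does not offset a low value in the other.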
