Published by Anabel Day. Modified over 9 years ago.
1
Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel
Dr. Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
robertas.damasevicius@ktu.lt
2
Int. Workshop on Intelligent Informatics in Biology and Medicine (IIBM’2008), March 4-7, 2008, Barcelona, Spain
What is splicing?
Splicing: modification of genetic information after transcription, in which introns are removed and exons are joined
Splice junctions: boundary points between exons and introns where splicing occurs
  Donor: upstream part of intron, conserved dinucleotide GT
  Acceptor: downstream part of intron, conserved dinucleotide AG
Pseudo splice sites
3
Problem
Splice-junction site recognition is important for:
  successful gene prediction
  the study of genetic diseases
  understanding of genetic mechanisms
Difficulties:
  noisy data
  pseudo splice sites
  non-canonical splice sites (intron is not GT...AG)
  alternative splicing
  multitude of consensus sequences
Machine learning: Support Vector Machine (SVM)
  feature space mapping for SVM
  which frequency-based feature mapping is the best?
4
Support Vector Machine (SVM)
$f(x) = \operatorname{sign}\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right)$
where $x_i$ are training data vectors, $x$ are unknown data vectors, $y_i \in \{-1, +1\}$ is the target space, and $K$ is the kernel function; $\alpha_i$ and $b$ are the coefficients and bias found during training.
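The usual kernel SVM decision rule, sign(sum_i alpha_i * y_i * K(x_i, x) + b), can be sketched in a few lines. This is only an illustration: the RBF kernel, the value of `gamma`, and the hand-picked support vectors, labels, and coefficients below are assumptions made for the example, not values from the presentation, and no training is performed.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; gamma=0.5 is an arbitrary illustrative value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Classify the unknown vector x as +1 or -1 with the kernel decision rule."""
    score = sum(alpha * y * kernel(sv, x)
                for sv, y, alpha in zip(support_vectors, labels, alphas))
    score += bias
    return 1 if score >= 0 else -1

# Hypothetical, hand-picked model (a trained SVM would supply these values).
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
labels = [-1, +1]
alphas = [1.0, 1.0]
bias = 0.0

near_positive = svm_decision((0.9, 0.8), support_vectors, labels, alphas, bias)
near_negative = svm_decision((0.1, 0.2), support_vectors, labels, alphas, bias)
```

Only the kernel function changes when a different kernel is selected; the decision rule itself stays the same.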
5
What factors influence quality of classification?
Training data
  size of dataset, generation of negative examples, imbalanced datasets
Mapping of data into feature space
  orthogonal, single nucleotide, nucleotide grouping, ...
Selection of an optimal kernel function
  linear, polynomial, RBF, sigmoid
Kernel parameters
SVM learning parameters
  regularization parameter, cost factor
6
SVM feature space
Feature space: multidimensional vector space in which data instances are represented
Mapping of data into features: aims at achieving better classification accuracy
Feature space construction can use:
  nucleotide position-dependent information
  nucleotide position-independent information
  both nucleotide position-dependent and -independent information
Feature mapping rule: maps a DNA sequence of length N to a feature vector of length M
7
K-mers
K-mer: a k-base long sequence (k-tuple) of DNA
K-mer feature vector: constructed using a frequency (or probability) of each k-mer in a DNA sequence
$f_j = \dfrac{n_j}{N - k + 1}, \quad j = 1, \dots, |\Sigma|^k$
where $\Sigma$ is the alphabet, $N$ is the length of a DNA sequence, $k$ is the length of the k-mer, and $n_j$ is the number of occurrences of the j-th k-mer in the DNA sequence
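A minimal sketch of a k-mer frequency feature vector follows. It assumes overlapping windows (so a sequence of length N yields N - k + 1 k-mers) and lexicographic ordering of the k-mers; the function name `kmer_frequencies` and both of these choices are illustrative assumptions rather than details confirmed by the slides.

```python
from itertools import product

def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Return the relative frequency of every k-mer over `alphabet` in `seq`.

    The vector has len(alphabet) ** k entries, ordered lexicographically.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with letters outside the alphabet are skipped
            counts[kmer] += 1
    return [counts[m] / windows for m in kmers]

# "ACGTAC" contains the 2-mers AC, CG, GT, TA, AC (5 overlapping windows).
features = kmer_frequencies("ACGTAC", 2)
```

The resulting vector is position-independent: it records how often each k-mer occurs, not where.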
8
K-mer frequency mapping rules
4-letter (ACGT): Σ = {A, C, G, T}, |Σ| = 4
  Disadvantage: feature space growth ~ 4^k
Nucleotide grouping based: SW, KM & RY
SW: Σ = {S, W}, |Σ| = 2
  Strong (C, G) nucleotides: 3 H bonds
  Weak (A, T) nucleotides: 2 H bonds
RY: Σ = {R, Y}, |Σ| = 2
  A and G: purines (R)
  C and T: pyrimidines (Y)
KM: Σ = {K, M}, |Σ| = 2
  A and C: amines (M)
  G and T: ketones (K)
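The three grouping rules amount to a letter-for-letter recoding of the sequence before k-mers are counted. The sketch below shows that recoding; the names `GROUPINGS` and `regroup` and the sample sequence are illustrative assumptions.

```python
# Two-letter alphabets from the slide: S/W (strong/weak), R/Y (purine/pyrimidine),
# K/M (keto/amino).
GROUPINGS = {
    "SW": {"C": "S", "G": "S", "A": "W", "T": "W"},
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "KM": {"G": "K", "T": "K", "A": "M", "C": "M"},
}

def regroup(seq, rule):
    """Recode an ACGT sequence into the two-letter alphabet named by `rule`.

    Letters other than A, C, G, T (for example N) raise KeyError in this sketch.
    """
    table = GROUPINGS[rule]
    return "".join(table[base] for base in seq.upper())

sample = "CAGGTAAGT"  # hypothetical sequence chosen for illustration
as_sw = regroup(sample, "SW")
as_ry = regroup(sample, "RY")
as_km = regroup(sample, "KM")
```

Counting k-mers on the recoded string then gives a vector of 2^k entries instead of 4^k.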
9
Example: 2-mer frequency mapping
10
Case study
Dataset: UCI repository, Genbank 64.1 primate data
  3175 sequences, each (-30 bp, +30 bp) with regard to splice site
Three splice site recognition sub-problems:
  Exon/Intron (EI) vs. Negative (N)
  Intron/Exon (IE) vs. Negative (N)
  Exon/Intron (EI) vs. Intron/Exon (IE)
Three datasets:
  EI vs. N: 767 EI and 1655 N
  IE vs. N: 768 IE and 1655 N
  EI vs. IE: 767 EI and 768 IE
Power series kernel
Accuracy evaluation metric: F-measure
11
Classification results: Exon/Intron vs. Negative
12
Classification results: Intron/Exon vs. Negative
13
Classification results: Intron/Exon vs. Exon/Intron
14
Classification time
15
Feature vector size
Intron/exon splice sites, 2422 sequences
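The size of a k-mer frequency vector is the alphabet size raised to the power k, so the growth for the 4-letter and the 2-letter (SW, RY, KM) alphabets can be tabulated directly. The helper name `feature_vector_size` and the range of k shown are assumptions made for illustration.

```python
def feature_vector_size(alphabet_size, k):
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k

# (ACGT size, grouped-alphabet size) for k = 1..6
sizes = {k: (feature_vector_size(4, k), feature_vector_size(2, k))
         for k in range(1, 7)}

for k, (full, grouped) in sizes.items():
    print(f"k={k}: ACGT={full:5d}  SW/RY/KM={grouped:3d}")
```

At k = 6 the 4-letter mapping needs 4096 features, while a grouped mapping needs 64.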
16
Evaluation of results
Classification accuracy:
  Exon/Intron vs. N.: 4-mer ACGT frequency mapping (78.05%)
  Intron/Exon vs. N.: 6-mer ACGT frequency mapping (70.75%)
  E/I vs. I/E: 6-mer ACGT frequency mapping (90.59%)
  4-mers and 6-mers better than 5-mers
  RY always better than SW or KM
Feature space size:
  ACGT k-mer: 4^k
  SW, RY, KM k-mer: 2^k
Classification speed:
  SW/KM/RY k-mer frequency based classification can be ~2 times faster than ACGT k-mer classification
17
Why is RY better than SW or KM?

Rule  Donor (EI) consensus    Acceptor (IE) consensus
ACGT  (C|A)AG / GT(A|G)AGT    (C|T)_n N(C|T)AG / G
SW    (S|W)SW / WS(S|W)SWS    (S|W)_n N(S|W)SW / W
KM    KKM / MM(K|M)KMM        (K|M)_n N(K|M)KM / M
RY    (R|Y)RR / RYRRRY        Y_n NYRR / R

Acceptor consensus sequence has long runs of pyrimidines (Y)
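The effect of the pyrimidine tract can be illustrated by measuring the longest run of a single symbol after recoding. In the sketch below the sequence `acceptor_like` is a made-up pyrimidine-rich example, not data from the case study, and the SW and RY tables follow the standard strong/weak and purine/pyrimidine groupings.

```python
from itertools import groupby

RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
SW = {"C": "S", "G": "S", "A": "W", "T": "W"}

def recode(seq, table):
    """Recode an ACGT sequence with a two-letter grouping table."""
    return "".join(table[base] for base in seq)

def longest_run(seq, symbol):
    """Length of the longest uninterrupted run of `symbol` in `seq`."""
    return max((len(list(group)) for s, group in groupby(seq) if s == symbol),
               default=0)

acceptor_like = "TTTCTCTTTCCTTCAGG"  # hypothetical pyrimidine-rich region + AG/G

ry_run = longest_run(recode(acceptor_like, RY), "Y")  # C and T merge into one run
sw_run = longest_run(recode(acceptor_like, SW), "W")  # C/T alternation breaks runs
```

Under RY the whole C/T tract collapses into a single run of Y, whereas SW splits it wherever C and T alternate, which is one way to read the slide's explanation.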
18
Conclusions
Selection of the appropriate feature mapping rule can greatly influence the DNA sequence classification results
Anomalies in consensus sequences (such as long runs) can be exploited for better classification results when selecting mapping rules
For a trade-off between classification accuracy and speed, RY k-mer frequency based mapping can be used instead of 4-letter k-mer frequency mapping
Open research problem: “forbidden” k-mers
19
Questions?
20
SVM kernel function optimization
Introduction of additional kernel parameters
Introduction of new kernels
Power series kernel function
  Advantage: more parameters for optimization
  better separation of classes in feature space
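The exact kernel formula from this slide did not survive in the text, so the following is only a generic sketch of a power series kernel, assuming the form K(u, v) = sum_n a_n (u . v)^n with non-negative coefficients a_n. The function name, the coefficient values, and the form itself are assumptions, not the author's definition. Each coefficient acts as an extra tunable parameter, which matches the "more parameters for optimization" point above.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def power_series_kernel(u, v, coeffs):
    """Assumed form: K(u, v) = sum_n coeffs[n] * (u . v) ** n.

    Non-negative coefficients keep the function a valid (positive
    semidefinite) kernel, so negative values are rejected here.
    """
    if any(c < 0 for c in coeffs):
        raise ValueError("coefficients must be non-negative")
    s = dot(u, v)
    result = 0.0
    for c in reversed(coeffs):  # Horner evaluation of the polynomial in s
        result = result * s + c
    return result

coeffs = [1.0, 1.0, 0.5]  # illustrative values only
value = power_series_kernel((0.5, 0.5), (0.5, 0.5), coeffs)  # 1 + 0.5 + 0.5 * 0.25
```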
21
SW k-mer frequency mapping rule
SW ({A,T} vs. {C,G}) mapping rule reflects the difference in the number of hydrogen bonds in the DNA molecule
  Strong (C, G) nucleotides: 3 H bonds
  Weak (A, T) nucleotides: 2 H bonds
Related to physical-chemical properties of DNA:
  transport of electrons
  mechanical waves along the DNA helix
22
RY k-mer frequency mapping rule
The RY mapping rule ({A, G} vs. {C, T}) describes how purines (R) and pyrimidines (Y) are distributed along the DNA sequence
  A and G: purines (R)
  C and T: pyrimidines (Y)
Corresponds to the chemical composition bias in the DNA strand
23
KM k-mer mapping rule
The KM mapping rule ({A,C} vs. {G,T}) describes how ketones (K) and amines (M) are distributed along the DNA sequence
  A and C: amines (M)
  G and T: ketones (K)
24
Classification metric
F-measure
Advantage: one measure that takes into account both recall and precision; a spectacular score in one does not compensate for a bad score in the other
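A short sketch of the F-measure, assuming the common balanced form F = 2PR / (P + R), where P is precision and R is recall (the slide's own formula is not preserved, so the balanced form is an assumption). The counts used below are invented to show the behaviour, not results from the case study.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions: precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

balanced = f_measure(tp=90, fp=10, fn=30)  # precision 0.90, recall 0.75
lopsided = f_measure(tp=10, fp=0, fn=90)   # precision 1.00, recall 0.10
```

The second case has perfect precision, yet its F-measure stays close to the poor recall, which is the property the slide highlights.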