1
Fast Kernel Methods for SVM Sequence Classifiers
Pavel Kuksa and Vladimir Pavlovic
Department of Computer Science, Rutgers University
2
Outline
- Problem & Motivation: Barcode of Life
- Spectrum kernels: exact and mismatch kernels
- Full spectrum kernel algorithms
- Exact spectrum kernel algorithms
- Adding mismatches: incremental mismatch kernel computations
- Sparse spectrum kernels & algorithms
- Algorithmic complexity
- Experimental results
- Conclusions
3
Problem & Motivation
- Barcoding of Life: classification and identification of organisms using 'barcodes'
- DNA barcode: a short fragment (≈600 bp) from a standard mitochondrial gene region
- Task: assign an unidentified sample to a taxonomic group at a target level (class, family, species, etc.) using reference data with known class information
- Supported by many organizations (CBOL, BOLD)
- Applications: biodiversity monitoring and assessment, ecological studies, taxonomic research, etc.
4
Problem & Motivation
- Widely used methods: kernel-based machine learning; most accurate results in many sequence analysis and prediction tasks, but often computationally expensive
- This work: general spectrum-like string kernel methods, with a focus on DNA-based sequence analysis
- Goal: fast and accurate classification — small space, fast to evaluate, accurate results
5
Barcode-based species identification
- Problem: given a reference set R of barcodes from different species S, identify, for a newly obtained barcode b, its category, or decide that it belongs to a new category
- Identification at the species/sub-species level
- Multiclass classification with species as target classes; no phylogenetic analysis and/or higher-level classification
- Approach: solve multiple binary membership problems using kernel-based classifiers
6
Kernel methods for sequence analysis
- Fisher kernel: a profile HMM is built for each class and then used as a feature extractor
- Alignment-based kernels: measure mutual distances between sequences or their profiles; local alignment (LA) kernels, SVM-pairwise, etc.; extremely time-consuming
- Spectrum-like string kernels
- Many others!
7
String spectrum kernels
- General idea: use sequence features (e.g., short substrings or certain patterns) to compute histograms of feature frequencies (the spectrum), then compare the feature histograms; sequences with similar content have similar spectra
- Examples:
  - the exact spectrum kernel: counts fixed-length substrings
  - the spectrum kernel with mismatches: in addition to the explicit substrings, also counts similar substrings (neighbors); yields a less sparse spectrum
8
The exact spectrum kernel
- The k-spectrum of a sequence x is the set of all its substrings of length k
- Dimensionality of the k-mer feature space is |Σ|^k, where |Σ| is the size of the alphabet
- The k-spectrum map: Φ_k(x) = [φ_a^k(x)] for all a ∈ Σ^k, where φ_a^k(x) = |{j : x(j : j + k − 1) = a}|
- The k-spectrum kernel: K_k(x, y) = ⟨Φ_k(x), Φ_k(y)⟩
9
Example
- The 3-spectrum of 'ACGAC' is a 4^3 = 64-dimensional vector: Φ_3(x) = [AAA/0, ..., ACG/1, ..., CGA/1, ..., GAC/1, ...]
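The 3-spectrum above can be reproduced with a short Python sketch (illustrative only, not the authors' code); only the nonzero entries of the 64-dimensional vector are stored:

```python
from collections import Counter

def spectrum(x, k):
    """Count all length-k substrings (k-mers) of sequence x."""
    return Counter(x[j:j + k] for j in range(len(x) - k + 1))

# The 3-spectrum of 'ACGAC' has three distinct 3-mers, each occurring once.
phi = spectrum("ACGAC", 3)
```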
10
Mismatch kernel
- Two integer parameters: k and m
- Parameter m is the maximum number of character mismatches allowed between two k-length substrings
- The mismatch feature map: φ_a^(k,m)(x) = |{j : d(x(j : j + k − 1), a) ≤ m}| for a ∈ Σ^k, where d is the Hamming distance
- The mismatch kernel: K_(k,m)(x, y) = ⟨Φ_(k,m)(x), Φ_(k,m)(y)⟩
- (k, m)-neighborhood size: v = Σ_{i=0..m} C(k, i)(|Σ| − 1)^i
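A brute-force Python sketch of the mismatch feature map and the neighborhood-size formula (for illustration only; it enumerates all |Σ|^k k-mers, which the paper's algorithms are designed to avoid):

```python
from itertools import product
from math import comb

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def mismatch_spectrum(x, k, m, alphabet="ACGT"):
    """For each k-mer a, count windows of x within Hamming distance m of a."""
    windows = [x[j:j + k] for j in range(len(x) - k + 1)]
    return {a: sum(hamming(w, a) <= m for w in windows)
            for a in ("".join(p) for p in product(alphabet, repeat=k))}

def neighborhood_size(k, m, sigma=4):
    """v = sum_{i=0}^{m} C(k, i) * (|Sigma| - 1)^i"""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(m + 1))
```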
11
Spectrum kernel computations
- Single kernel computations: compute the kernel K(x, y) for two sequences
- Kernel matrix computations: compute an N×N matrix for N sequences
- Kernel vector computations: compute an N×1 kernel vector for a sequence x and N support sequences
- Focus: kernel matrices and vectors (SVM training, evaluation/testing, clustering, etc.)
12
Exact spectrum kernel computations
- Suffix tree-based algorithms: for each pair (x, y), build a suffix tree over the sequence features and compute the kernel K_k(x, y) at the tree leaves; O(N^2 kn) for N sequences of length n each
- Explicit mapping (EMap) algorithms: explicitly map sequences to feature vectors of size |Σ|^k; the kernel matrix is then computed as K = M·M^T, where M (N×|Σ|^k) is the feature matrix; O(Nkn + |Σ|^k N^2)
- Sorting-based algorithms: represent k-mers as numbers and apply a favorite sorting algorithm
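The EMap approach fits in a few lines of Python with NumPy (a sketch, practical only for small k, since M has |Σ|^k columns):

```python
import numpy as np
from itertools import product

def kernel_matrix_emap(seqs, k, alphabet="ACGT"):
    """Explicit mapping: build the N x |Sigma|^k feature matrix M of k-mer
    counts, then compute the kernel matrix as K = M M^T."""
    kmers = {"".join(p): c for c, p in enumerate(product(alphabet, repeat=k))}
    M = np.zeros((len(seqs), len(kmers)))
    for i, x in enumerate(seqs):
        for j in range(len(x) - k + 1):
            M[i, kmers[x[j:j + k]]] += 1
    return M @ M.T
```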
13
Counting-sort spectrum kernel
Algorithm:
1. Extract k-length substrings from the input sequences and store them in a list L: O(knN)
2. Sort the list L using k passes of counting sort: O(knN)
3. Scan the sorted list, updating the kernel matrix on each change in the feature value
Complexity: O(Nnk + min(u, n)·N^2), where u is the number of unique k-mers in the input
Update step: K(upd_f, upd_f) = K(upd_f, upd_f) + c_f c_f^T, where upd_f = {i : f ∈ x_i} is the set of input sequences containing f and c_f is the vector of feature counts
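A Python sketch of the sort-and-scan computation; for brevity it uses Python's built-in sort in place of the k passes of counting sort, but the scan and the rank-one update step mirror the algorithm above:

```python
from collections import Counter
import numpy as np

def kernel_matrix_sorted(seqs, k):
    """Sorting-based spectrum kernel: sort (k-mer, sequence-id) pairs, then
    for each run of equal k-mers f apply the update K += c_f c_f^T restricted
    to the sequences upd_f that contain f."""
    pairs = sorted((x[j:j + k], i)
                   for i, x in enumerate(seqs)
                   for j in range(len(x) - k + 1))
    N = len(seqs)
    K = np.zeros((N, N))
    start = 0
    while start < len(pairs):
        end = start
        while end < len(pairs) and pairs[end][0] == pairs[start][0]:
            end += 1                                  # run of one feature f
        c = Counter(i for _, i in pairs[start:end])   # feature counts c_f
        idx = list(c)                                 # upd_f: sequences with f
        cf = np.array([c[i] for i in idx], dtype=float)
        K[np.ix_(idx, idx)] += np.outer(cf, cf)
        start = end
    return K
```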
14
Mismatch kernel using sorting
- Extract the unique features using sorting
- Expand the set of unique features to include their neighbors
- Sort the resulting set
- Scan the sorted list and update the kernel matrix on each change in the feature value
Complexity: O(Nnk + uvk + uN^2)
15
Divide-and-Conquer Mismatch Kernel
- Basic idea: infer the count for a k-mer f using the counts of its neighbors
- Cluster the combined feature set S = ∪_{i=1..N} Spectrum(x_i) to find sets of neighboring features
- The sizes of the resulting clusters/subclusters give the desired counts of feature occurrences
- For DNA, since u << nN, performance improves by using the unique features instead of the original redundant set
16
Divide-and-Conquer Method
- Divide step: the combined feature set S is partitioned into subsets S_1, ..., S_|Σ| using character-based clustering
- Conquer step: the same procedure (divide step) is applied to each of the obtained subsets
- After k divisions, the kernel matrix is updated according to the contribution of the corresponding k-mer f
Complexity:
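One way to realize the divide step in code is a depth-wise recursion that partitions the k-mer instances by the character at the current position while tracking accumulated mismatches. This is an illustrative Python sketch of the idea, not the authors' implementation (which operates on unique features rather than all instances):

```python
import numpy as np

def mismatch_kernel_dc(seqs, k, m, alphabet="ACGT"):
    """Divide-and-conquer mismatch kernel sketch.

    Instances (sequence id, start offset, mismatches so far) are partitioned
    by the character at the current depth; an instance survives a branch only
    while its mismatch count stays <= m. After k divisions, all survivors lie
    within Hamming distance m of the implied k-mer, and their per-sequence
    counts contribute a rank-one update to the kernel matrix."""
    N = len(seqs)
    K = np.zeros((N, N))
    instances = [(i, j, 0) for i, x in enumerate(seqs)
                 for j in range(len(x) - k + 1)]

    def recurse(inst, depth):
        if not inst:
            return
        if depth == k:
            counts = np.zeros(N)
            for i, _, _ in inst:
                counts[i] += 1
            K[:, :] += np.outer(counts, counts)
            return
        for c in alphabet:  # divide step: one branch per character
            nxt = [(i, j, mm + (seqs[i][j + depth] != c))
                   for i, j, mm in inst
                   if mm + (seqs[i][j + depth] != c) <= m]
            recurse(nxt, depth + 1)

    recurse(instances, 0)
    return K
```

With m = 0 this reduces to the exact spectrum kernel, which makes a convenient sanity check.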
17
Sparse kernel
- Can we further reduce computation costs?
- Preselect features (e.g., using filtering) and evaluate the kernel for a set F of selected features: K = M_F^T M_F, where M_F is an |F|×N matrix of feature counts
- Reduces the complexity of computations:
  - Spectrum kernel: O(Nnk + |F|N^2) vs. O(Nnk + uN^2)
  - Mismatch kernel: O(Nnk + |F|vk + |F|N^2) vs. O(Nnk + uvk + uN^2)
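A minimal Python sketch of the sparse kernel: only k-mers belonging to a preselected feature set F are counted, and the kernel matrix is K = M_F^T M_F (how F is chosen, e.g. by filtering, is outside this sketch):

```python
import numpy as np

def sparse_spectrum_kernel(seqs, k, F):
    """Sparse spectrum kernel over a preselected feature set F.
    Builds the |F| x N count matrix M_F, then returns K = M_F^T M_F."""
    index = {f: r for r, f in enumerate(F)}
    M = np.zeros((len(F), len(seqs)))
    for i, x in enumerate(seqs):
        for j in range(len(x) - k + 1):
            r = index.get(x[j:j + k])
            if r is not None:       # skip k-mers outside the selected set
                M[r, i] += 1
    return M.T @ M
```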
18
Complexity comparison: Spectrum
- Previously known bounds: O(knN^2 + nN^2)
- New bounds: O(knN + uN^2)
- Advantages of counting sort-based computations: more time-efficient; smaller memory requirements in practice than suffix trees; easier to implement
19
Complexity comparison: Mismatch
- Previously known bounds: O(nk^(m+1) |Σ|^m N^2)
- New bounds: O(uk^(m+1) |Σ|^m + uN^2)
- EMap = explicit map, EMap+Sort = EMap with presorting, DC = divide and conquer
- v = neighborhood size, u = number of different k-mers in the input, u′ = number of different k-mers including neighbors
20
Experimental framework
- Barcode datasets
- Classification: multiclass & binary (CV/ROC/ROC50)
- Kernels: Fisher, Spectrum, Mismatch
- Algorithms: SVM, ridge regression, 1-NN
- Running time analysis: training (matrix) and testing (vector); different kernel parameters k, m; different feature selection levels
21
Classification performance: Multiclass
- 10-fold cross-validation error rates (%)
- MK = mismatch kernel, SK = spectrum kernel, SMK/SSK = with feature selection, NN = nearest neighbor, FK = Fisher kernel
- With 10% of the features, classifiers improve or retain performance
- Improved performance compared to previous studies in [Matz & Nielsen, 2005] and [Nielsen & Matz, 2006] (error rates of 9-20%)
22
Classification performance: Binary
- 10-fold cross-validation error rates
- Average ROC/ROC50 scores
23
Classification performance: ROC
- Feature selection improves performance
- 90% reduction in the number of features
- Datasets: Astraptes, Hesperiidae
24
Running time: Mismatch
- Running times measured in seconds
- Significant time improvement compared to the state-of-the-art spectrum kernel implementation
- EMap requires much larger storage than D&C
- Pre-sorting significantly improves computing time for EMap
25
Mismatch + Feature Selection
- D&C scales almost linearly with the number of features
26
Running time: Mismatch vector
- D&C outperforms EMap in many cases while requiring only linear space
27
Summary of results
- Efficient computation of spectrum kernel matrices and vectors using counting-sort and divide-and-conquer techniques
- Spectrum kernels enable accurate and fast DNA barcode-based species identification
- A few sequence features can successfully discriminate species
- Small discriminative subsets of k-mers (signatures) exist for many taxonomic groups
28
Future work
- Position-aware string kernels: taking feature interactions into account
- Smoothed kernels (independent of the choice of k)
- Efficient feature selection methods
- Learning low-dimensional representations
- Semi-supervised setting
- Direct multiclass methods
29
References
[Kuang 04] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund & Christina Leslie. Profile-Based String Kernels for Remote Homology Detection and Motif Extraction. In CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, pages 152–160, Washington, DC, USA, 2004. IEEE Computer Society.
[Leslie 02a] Christina S. Leslie, Eleazar Eskin & William Stafford Noble. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Pacific Symposium on Biocomputing, pages 566–575, 2002.
[Leslie 02b] Christina S. Leslie, Eleazar Eskin, Jason Weston & William Stafford Noble. Mismatch String Kernels for SVM Protein Classification. In NIPS, pages 1417–1424, 2002.
[Hebert 03] P.D.N. Hebert, A. Cywinska, S.L. Ball & J.R. deWaard. Biological identifications through DNA barcodes. In Proceedings of the Royal Society of London, pages 313–322, 2003.
[Vishwanathan 02] S. V. N. Vishwanathan & Alexander J. Smola. Fast Kernels for String and Tree Matching. In NIPS, pages 569–576, 2002.