
1 Fast Kernel Methods for SVM Sequence Classifiers Pavel Kuksa and Vladimir Pavlovic Department of Computer Science Rutgers University

2 Outline Problem & Motivation: Barcode of Life Spectrum kernels: exact and mismatch kernels Full spectrum kernel algorithms  Exact spectrum kernel algorithms  Adding mismatches: incremental mismatch kernel computations Sparse spectrum kernels & algorithms Algorithmic complexity Experimental results Conclusions

3 Problem & Motivation Barcoding of Life: classification and identification of organisms using 'barcodes' DNA barcode: a short fragment (≈600bp) from a standard mitochondrial gene region Assign an unidentified sample to the taxonomic group at a target level (class, family, species, etc.) using reference data with known class information Supported by many organizations (CBOL, BOLD) Applications: biodiversity monitoring and assessment, ecological studies, taxonomic research, etc.

4 Problem & Motivation Methods widely used: kernel-based machine learning methods  most accurate results in many sequence analysis and prediction tasks  often computationally expensive This work: general spectrum-like string kernel methods with a focus on DNA-based sequence analysis Goal: fast and accurate classification  small space, fast to evaluate, accurate results

5 Barcode-based species identification Problem: given a reference set R of barcodes from known species S, identify, for a newly obtained barcode b, its category, or decide that it belongs to a new category Identification at the species/sub-species level Multiclass classification with species as target classes; no phylogenetic analysis and/or higher-level classification Solve multiple binary membership problems using kernel-based classifiers

6 Kernel methods for sequence analysis Fisher kernel: a profile HMM is built for each class and then used as a feature extractor Alignment-based kernels: measure mutual distances between sequences or their profiles  local alignment (LA) kernels  SVM-pairwise, etc.  extremely time consuming Spectrum-like string kernels Many others!

7 String spectrum kernels General idea:  use sequence features (e.g. short substrings or certain patterns) to compute histograms of feature frequencies (the spectrum)  compare the feature histograms  sequences with similar content have similar spectra Examples:  the exact spectrum kernel: counts fixed-length substrings  the spectrum kernel with mismatches: in addition to the explicit substrings, also counts similar substrings (neighbors)  less sparse spectrum

8 The exact spectrum kernel The k-spectrum of a sequence x is the set of all its substrings of length k Dimensionality of the k-mer feature space is |Σ|^k, where |Σ| is the size of the alphabet The k-spectrum map Φ_k(x):  Φ_k(x) : x → [φ_a^k(x)] ∀a ∈ Σ^k  φ_a^k(x) = |{j : x(j : j + k − 1) = a}| The k-spectrum kernel: K_k(x, y) = ⟨Φ_k(x), Φ_k(y)⟩ = Σ_{a∈Σ^k} φ_a^k(x) φ_a^k(y)

9 Example The 3-spectrum of 'ACGAC' is a 4^3 = 64-dimensional vector Φ_3(x) = [AAA/0, ..., ACG/1, ..., CGA/1, ..., GAC/1, ...]
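
To make the spectrum map concrete, here is a minimal Python sketch (mine, not the authors' implementation): `spectrum` builds the k-mer histogram with a dictionary, and `spectrum_kernel` evaluates K_k as a sparse dot product over the shared k-mers. The function names and test sequences are illustrative.

```python
from collections import Counter

def spectrum(x, k):
    """Histogram of all length-k substrings (k-mers) of sequence x."""
    return Counter(x[j:j + k] for j in range(len(x) - k + 1))

def spectrum_kernel(x, y, k):
    """K_k(x, y): inner product of the two k-mer count histograms."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    # Only k-mers occurring in both sequences contribute to the sum.
    return sum(cnt * sy[a] for a, cnt in sx.items())

# The slide's example: the 3-spectrum of 'ACGAC' lives in a 4^3 = 64-dim
# space but has only three nonzero coordinates (ACG, CGA, GAC).
print(spectrum('ACGAC', 3))                  # Counter({'ACG': 1, 'CGA': 1, 'GAC': 1})
print(spectrum_kernel('ACGAC', 'ACGACG', 3))  # ACG: 1*2 + CGA: 1*1 + GAC: 1*1 = 4
```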

10 Mismatch kernel Two integer parameters: k and m Parameter m is the maximum number of character mismatches between two k-substrings The mismatch feature map: φ_a^{(k,m)}(x) = |{j : d(x(j : j + k − 1), a) ≤ m}|, a ∈ Σ^k The mismatch kernel: K_{k,m}(x, y) = ⟨Φ_{k,m}(x), Φ_{k,m}(y)⟩ (k, m)-neighborhood size: v = Σ_{i=0}^{m} C(k, i)(|Σ| − 1)^i
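
A small sketch of the mismatch feature map, under the assumption that the (k, m)-neighborhood is enumerated by brute-force substitution (the helper names are mine, not from the source); the final assert checks the slide's neighborhood-size formula v = Σ_{i=0}^{m} C(k, i)(|Σ| − 1)^i for k=3, m=1 over DNA.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = 'ACGT'

def mismatch_neighborhood(a, m, alphabet=ALPHABET):
    """All k-mers within Hamming distance m of the k-mer a."""
    k = len(a)
    neighbors = set()
    for i in range(m + 1):
        for positions in combinations(range(k), i):
            # substitute a *different* character at each chosen position
            choices = [[c for c in alphabet if c != a[p]] for p in positions]
            for subs in product(*choices):
                b = list(a)
                for p, c in zip(positions, subs):
                    b[p] = c
                neighbors.add(''.join(b))
    return neighbors

def mismatch_feature_map(x, k, m):
    """phi_{k,m}(x): each observed k-mer of x adds 1 to every k-mer a
    in its (k, m)-neighborhood, i.e. to every a with d(x(j:j+k-1), a) <= m."""
    phi = Counter()
    for j in range(len(x) - k + 1):
        for a in mismatch_neighborhood(x[j:j + k], m):
            phi[a] += 1
    return phi

def mismatch_kernel(x, y, k, m):
    px, py = mismatch_feature_map(x, k, m), mismatch_feature_map(y, k, m)
    return sum(c * py[a] for a, c in px.items())

# v for k=3, m=1 over DNA: C(3,0)*3^0 + C(3,1)*3^1 = 1 + 9 = 10
assert len(mismatch_neighborhood('ACG', 1)) == 10
```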

11 Spectrum kernel computations Single kernel computations: compute kernel K(x, y) for two sequences Kernel matrix computations: compute the N×N matrix for N sequences Kernel vector computations: compute the N×1 kernel vector for sequence x and N support sequences Focus on kernel matrices and vectors (SVM training, evaluation/testing, clustering, etc.)

12 Exact spectrum kernel computations Suffix tree-based algorithms:  for each pair (x, y), build a suffix tree for the sequence features and compute the kernel K_k(x, y) at the tree leaves  O(N²kn) for N sequences of length n each Explicit mapping (EMap) algorithms:  explicitly map sequences to feature vectors of size |Σ|^k; the kernel matrix is then computed as K = M Mᵀ, where M is the N×|Σ|^k feature matrix  O(Nkn + |Σ|^k N²) Sorting-based algorithms:  represent k-mers as numbers and use any standard sorting algorithm
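
A hedged sketch of the EMap approach in Python/NumPy: build the N×|Σ|^k count matrix M explicitly and take K = M Mᵀ. This mirrors the slide's description, not the authors' code, and is only feasible for small k since the index has |Σ|^k entries.

```python
import numpy as np
from itertools import product

def emap_kernel_matrix(seqs, k, alphabet='ACGT'):
    """Explicit-map spectrum kernel: K = M M^T over the full k-mer space."""
    # Enumerate all |alphabet|^k possible k-mers and assign column indices.
    index = {''.join(a): i for i, a in enumerate(product(alphabet, repeat=k))}
    M = np.zeros((len(seqs), len(index)))
    for r, x in enumerate(seqs):
        for j in range(len(x) - k + 1):
            M[r, index[x[j:j + k]]] += 1
    return M @ M.T

print(emap_kernel_matrix(['ACGAC', 'ACGT'], 3))  # [[3. 1.] [1. 2.]]
```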

13 Counting-sort spectrum kernel Algorithm: 1. Extract k-length substrings from the input sequences and store them in a list L, O(knN) 2. Sort the list L using k passes of counting sort, O(knN) 3. Scan the sorted list, updating the kernel matrix on each change in the feature value Complexity: O(Nnk + min(u, n)·N²), u = number of unique k-mers in the input Update step:  K(upd_f, upd_f) = K(upd_f, upd_f) + c_f c_fᵀ  upd_f = {i : f ∈ x_i} = input sequences containing f  c_f = vector of feature counts
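
The following sketch follows the slide's three steps (extract, sort with k passes of counting sort, scan-and-update); it is an illustrative reimplementation, not the authors' code, and the rank-1 update K += c_f c_fᵀ is applied to the full matrix, which is equivalent to updating only the rows/columns in upd_f since c_f is zero elsewhere.

```python
import numpy as np

def counting_sort_spectrum_kernel(seqs, k, alphabet='ACGT'):
    """Spectrum kernel matrix via k passes of counting sort (LSD radix sort)."""
    code = {c: i for i, c in enumerate(alphabet)}
    # Step 1: list L of (encoded k-mer, source sequence id) pairs.
    items = [([code[c] for c in x[j:j + k]], i)
             for i, x in enumerate(seqs) for j in range(len(x) - k + 1)]
    # Step 2: k stable counting-sort passes, least-significant position first.
    for pos in range(k - 1, -1, -1):
        buckets = [[] for _ in alphabet]
        for kmer, i in items:
            buckets[kmer[pos]].append((kmer, i))
        items = [it for b in buckets for it in b]
    # Step 3: scan runs of identical k-mers; each run contributes c_f c_f^T.
    N = len(seqs)
    K = np.zeros((N, N))
    start = 0
    for end in range(1, len(items) + 1):
        if end == len(items) or items[end][0] != items[start][0]:
            c = np.zeros(N)                      # c_f: per-sequence counts of f
            for _, i in items[start:end]:
                c[i] += 1
            K += np.outer(c, c)                  # rank-1 update K += c_f c_f^T
            start = end
    return K

print(counting_sort_spectrum_kernel(['ACGAC', 'ACGT'], 3))  # matches EMap
```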

14 Mismatch kernel using sorting Sorting-based computations:  Extract unique features using sorting  Expand the set of unique features to include their neighbors  Sort the resulting set  Scan the sorted list and update the kernel matrix on each change in the feature value Complexity: O(Nnk + uvk + uN²)
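
A possible realization of these four steps in Python (illustrative only; Python's built-in sort stands in for counting sort, and it reuses `mismatch_neighborhood` from the slide-10 sketch above):

```python
from collections import defaultdict
import numpy as np

def sorted_mismatch_kernel(seqs, k, m, alphabet='ACGT'):
    """Mismatch kernel matrix via the extract/expand/sort/scan recipe."""
    N = len(seqs)
    counts = defaultdict(lambda: np.zeros(N))    # unique k-mer f -> c_f
    for i, x in enumerate(seqs):
        for j in range(len(x) - k + 1):
            counts[x[j:j + k]][i] += 1
    # Expand each unique feature into its (k, m)-neighborhood, keeping
    # track of which source feature produced each neighbor.
    expanded = [(b, f) for f in counts
                for b in mismatch_neighborhood(f, m, alphabet)]
    expanded.sort()                              # stand-in for counting sort
    K = np.zeros((N, N))
    start = 0
    for end in range(1, len(expanded) + 1):
        if end == len(expanded) or expanded[end][0] != expanded[start][0]:
            # all source features whose neighborhoods contain this k-mer
            c = sum(counts[f] for _, f in expanded[start:end])
            K += np.outer(c, c)
            start = end
    return K
```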

15 Divide-and-Conquer Mismatch Kernel Basic idea:  Infer the count for a k-mer f using counts of its neighbors  Cluster the combined feature set S = ∪_{i=1}^{N} Spectrum(x_i) to find sets of neighboring features  The size of the resulting clusters/subclusters gives the desired counts of feature occurrences For DNA, since u << nN, improve performance by using the set of unique features instead of the original redundant set.

16 Divide-and-Conquer Method Divide step: the combined feature set S is partitioned into subsets S_1, ..., S_|Σ| using character-based clustering Conquer step: the same procedure (divide step) is applied to each of the obtained subsets. After k divisions, the kernel matrix is updated according to the contribution of the corresponding k-mer f. Complexity: O(uk^{m+1}|Σ|^m + uN²) (the new bound stated on slide 19)
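
One way to realize the divide/conquer steps is the recursive sketch below (my reading of the slides, not the authors' code): at each depth the current feature group is split by the character at that position, and a feature also follows non-matching branches while it still has mismatch budget left; each surviving depth-k branch spells one k-mer and triggers one rank-1 kernel update.

```python
from collections import defaultdict
import numpy as np

def dc_mismatch_kernel(seqs, k, m, alphabet='ACGT'):
    """Divide-and-conquer mismatch kernel over the unique-feature set."""
    N = len(seqs)
    counts = defaultdict(lambda: np.zeros(N))   # unique k-mers only: u << nN
    for i, x in enumerate(seqs):
        for j in range(len(x) - k + 1):
            counts[x[j:j + k]][i] += 1
    K = np.zeros((N, N))

    def divide(group, depth):
        # group = [(k-mer, remaining mismatch budget)]
        if depth == k:
            # The branch path spells one k-mer a; 'group' holds exactly the
            # source k-mers f with d(f, a) <= m, so add c_f c_f^T for a.
            c = sum(counts[f] for f, _ in group)
            K[...] += np.outer(c, c)
            return
        for ch in alphabet:                     # character-based clustering
            sub = []
            for f, budget in group:
                if f[depth] == ch:
                    sub.append((f, budget))
                elif budget > 0:                # spend one mismatch
                    sub.append((f, budget - 1))
            if sub:
                divide(sub, depth + 1)

    divide([(f, m) for f in counts], 0)
    return K
```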

17 Sparse kernel Can we further reduce computation costs?  Preselect features (e.g. using filtering) and evaluate the kernel for a set F of selected features  K = M_Fᵀ M_F, where M_F is the |F|×N matrix of feature counts Reduces complexity of computations:  Spectrum kernel: O(Nnk + |F|N²) vs O(Nnk + uN²)  Mismatch kernel: O(Nnk + |F|vk + |F|N²) vs O(Nnk + uvk + uN²)
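
A minimal sketch of the sparse kernel, assuming the feature-selection stage has already produced the set F of k-mers to keep (here simply passed in as a list): restrict counting to F and compute K = M_Fᵀ M_F.

```python
import numpy as np

def sparse_spectrum_kernel(seqs, k, selected):
    """Spectrum kernel restricted to a preselected feature set F."""
    index = {f: r for r, f in enumerate(selected)}
    MF = np.zeros((len(selected), len(seqs)))   # |F| x N count matrix
    for i, x in enumerate(seqs):
        for j in range(len(x) - k + 1):
            r = index.get(x[j:j + k])
            if r is not None:                   # skip unselected k-mers
                MF[r, i] += 1
    return MF.T @ MF

# e.g. keep only two 'signature' 3-mers (hypothetical selection)
print(sparse_spectrum_kernel(['ACGAC', 'ACGT'], 3, ['ACG', 'GAC']))
```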

18 Complexity comparison: Spectrum Previously known bounds: O(knN² + nN²) New bounds: O(knN + uN²) Advantages of counting sort-based computations:  more time efficient  smaller memory requirements in practice than suffix trees  easier to implement

19 Complexity comparison: Mismatch Previously known bounds: O(nk^{m+1}|Σ|^m N²) New bounds: O(uk^{m+1}|Σ|^m + uN²) EMap = explicit map, EMap+Sort = EMap with presorting, DC = divide and conquer v = neighborhood size, u = number of different k-mers in the input, u′ = number of different k-mers including neighbors

20 Experimental framework Barcode datasets Classification: multiclass & binary (CV/ROC/ROC50) Kernels: Fisher, Spectrum, Mismatch Algorithms: SVM, ridge regression, 1-NN Running time analysis:  training (matrix) and testing (vector)  different kernel parameters: k, m  different feature selection levels

21 Classification performance: Multiclass 10-fold cross-validation MK = mismatch kernel, SK = spectrum kernel, SMK/SSK = with feature selection, NN = nearest neighbor, FK = Fisher kernel With 10% of the features, classifiers improve or retain performance Improved performance compared to previous studies in [Matz & Nielsen, 2005] and [Nielsen & Matz, 2006] (rates of 9-20%)

22 Classification performance: Binary 10-fold cross-validation error rates Average ROC/ROC50 scores

23 Classification performance: ROC Feature selection improves performance 90% reduction in the number of features (ROC curves for the Astraptes and Hesperiidae datasets)

24 Running time: Mismatch (running times in seconds) Significant time improvement compared to the state-of-the-art spectrum kernel implementation EMap requires much larger storage than D&C Pre-sorting significantly improves computing time for EMap

25 Mismatch + Feature Selection D&C scales almost linearly with the number of features

26 Running time: Mismatch vector D&C outperforms EMap in many cases while requiring only linear space

27 Summary of results Efficient computation of spectrum kernel matrices and vectors: counting sort and divide-and-conquer techniques Spectrum kernels for accurate and fast DNA barcode-based species identification Few sequence features can successfully discriminate species Small discriminative subsets of k-mers (signatures) exist for many taxonomic groups

28 Future work Position-aware string kernels: taking feature interactions into account Smoothed kernels (independent of the choice of k) Efficient feature selection methods Learning low-dimensional representations Semi-supervised setting Direct multiclass methods

29 References [Kuang 04] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund & Christina Leslie. Profile-Based String Kernels for Remote Homology Detection and Motif Extraction. In CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, pages 152-160, Washington, DC, USA, 2004. IEEE Computer Society. [Leslie 02a] Christina S. Leslie, Eleazar Eskin & William Stafford Noble. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Pacific Symposium on Biocomputing, pages 566-575, 2002. [Leslie 02b] Christina S. Leslie, Eleazar Eskin, Jason Weston & William Stafford Noble. Mismatch String Kernels for SVM Protein Classification. In NIPS, pages 1417-1424, 2002. [Hebert 03] P.D.N. Hebert, A. Cywinska, S.L. Ball & J.R. deWaard. Biological identifications through DNA barcodes. In Proceedings of the Royal Society of London, pages 313-322, 2003. [Vishwanathan 02] S. V. N. Vishwanathan & Alexander J. Smola. Fast Kernels for String and Tree Matching. In NIPS, pages 569-576, 2002.

