Download presentation
Presentation is loading. Please wait.
Published byLindsay Flowers Modified over 9 years ago
1
Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models Kai-Wei Chang and Dan Roth Experiment Settings Block Minimization [Yu et al. 10’] Restrict to use at most 2GB memory. Compared methods: Online methods: VW, Perceptron, MIRA, CW. Block Minimization Framework [Yu et al. 2010]: BMD BM-PEGASOS. Proposed methods: SBM with various settings. Motivation Linear Classifiers handle large-scale data well. But, a key challenge for large-scale classification is dealing with data sets that cannot fit in memory. Examples include: Spam Filtering, Classifying Query Logs, Classifying Web Data and Images. Recent public challenges are also large scale: Spam filtering data in Pascal challenge: 20GB. ImageNet challenge: ~130GB. Existing methods for learning large-scale data: Data smaller than memory: efficient training methods are well-developed. Data beyond disk size: distributed solution. When data cannot fit into memory: Batch Learner: suffers due to disk swapping. Online Learner: requires a large number of iterations to converge large amount of I/O. Block minimization[Yu et al. 2010]: requires many block loads since it treats all training examples uniformly. Our Challenge: How to efficiently handle data larger than memory capacity in one machine. Training time = running time in memory (training) + accessing data from disk. Two orthogonal directions to reduce disk access: 1)Apply compression to lessen loading time. 2)Better learning from data in memory. E.g., better utilizing memory in learning Selective Block Minimization (SBM) algorithm: select informative examples and cache them in memory. The SBM algorithm has following properties: Our method significantly improves the I/O cost: Spam filtering data: SBM obtained an accurate model with just a single pass over data set (the current best method requires 10 rounds). As a result, SBM efficiently gives a stable model. SBM maintains good convergence properties: Usually a method that selects to cache data and treats samples non-uniformly can only converge to an approximate solution. SBM can be proved to converge linearly to the global optimal solution on the entire data regardless of the cache selection strategy used. Has been implemented in a branch of liblinear. Training Time and Testing Accuracy Experiments on streaming data Split data into blocks and store them in compressed files. At each time, load and train on a data block by solving a sub-problem. The drawback: spends the same time and memory on important and unimportant samples. Intuitive: try to only focus on informative samples. If we have an oracle of support vectors, we only need to train the model on the support vectors. At step it considers a data block consisting of 1)new data load from disk B j, 2)informative samples cached in memory t, j it then solves: Sub-problems can be solved by a linear classification package such as LIBLINEAR. Finally, it updates the model and select cached samples t, j+1 from { t, j ∪ B j }. Related to selective sampling and shrinking strategy. The samples that are close to the separator are usually important – cache these samples. Define a scoring function and keep the samples with higher scores in cache Disk level shrinking. Selecting Cached data Selective Block Minimization Learning From Stream Data Implementation Issues Store cached samples in memory: need a careful design to avoid duplicating data. Deal with large : We need to store all the variables even the corresponding instances are not in memory. Can be very large if #instances is large: 1)If is sparse (support vectors << #instances) we can store nonzero in a hash table. 2) Otherwise, we can store unused in disk. We can treat a large data set as a stream and process the data in an online manner. Most online learners cannot obtain a good model with only single pass over data because they use a simple update rule on one sample at a time. Apply SBM with only one iteration: considering a data block at a time can be better. No need to store if the corresponding instances are removed from memory Keep more data in memory. The model is more stable and can adjust to the available resources. Training Time Analysis Contributions The Algorithm This research is sponsored by DARPA Machine Reading Program and Bootstrapping Learning Program TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAA A A AA Linear SVM The code to regenerate the experiments can be download at: http://cogcomp.cs.illinois.edu/page/publication_view/660
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.