Divide-and-Conquer and Statistical Inference for Big Data

Presentation transcript:

Divide-and-Conquer and Statistical Inference for Big Data Michael I. Jordan University of California, Berkeley September 8, 2012

Statistics Meets Computer Science I Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)

Statistics Meets Computer Science II Bring algorithmic principles more fully into contact with statistical inference. The principle in today’s talk: divide-and-conquer.

Inference and Massive Data Presumably any methodology that allows us to scale to terabytes and beyond will need to subsample data points. That’s fine for point estimates (although challenging to do well), but for estimates of uncertainty it seems quite problematic. With subsampled data sets, estimates of uncertainty will be inflated, and in general we don’t know how to rescale. Let’s think this through in the setting of the bootstrap.

Part I: The Big Data Bootstrap with Ariel Kleiner, Purnamrita Sarkar and Ameet Talwalkar University of California, Berkeley

Assessing the Quality of Inference Machine learning and computer science are full of algorithms for clustering, classification, regression, etc. What’s missing: a focus on the uncertainty in the outputs of such algorithms (“error bars”). An application that has driven our work: develop a database that returns answers with error bars to all queries. The bootstrap is a generic framework for computing error bars (and other assessments of quality). Can it be used on large-scale problems?

The Bootstrap and Scale Why the bootstrap? Naively, it would seem to be a perfect match to modern distributed computing environments: the bootstrap involves resampling with replacement, and each bootstrap resample can be processed independently in parallel. But the expected number of distinct points in a bootstrap resample is ~ 0.632n. If the original dataset has size 1 TB, then each resample will have size ~ 632 GB, so we can’t do the bootstrap on massive data!
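
The 0.632 figure follows from the probability that a given point appears at least once in n draws with replacement, 1 - (1 - 1/n)^n, which tends to 1 - 1/e. A minimal sketch that checks this by simulation; the function name, seed, and sizes are illustrative choices, not taken from the talk:

```python
import math
import numpy as np

def distinct_fraction(n, rng):
    """Fraction of distinct original points in one bootstrap resample of size n."""
    idx = rng.integers(0, n, size=n)      # n index draws with replacement
    return np.unique(idx).size / n

rng = np.random.default_rng(0)            # fixed seed, chosen arbitrarily
n = 100_000
simulated = float(np.mean([distinct_fraction(n, rng) for _ in range(20)]))
exact = 1 - (1 - 1 / n) ** n              # P(a given point appears at least once)
limit = 1 - math.exp(-1)                  # large-n limit, about 0.632
print(simulated, exact, limit)
```

All three printed values should be close to one another for large n.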

Assessing the Quality of Inference Observe data X1, ..., Xn. Form a “parameter” estimate θn = θ(X1, ..., Xn). Want to compute an assessment ξ of the quality of our estimate θn (e.g., a confidence region).

The Unachievable Frequentist Ideal Ideally, we would: Observe many independent datasets of size n. Compute θn on each. Compute ξ based on these multiple realizations of θn. But we only observe one dataset of size n.

The Underlying Population

Sampling

Approximation

Pretend The Sample Is The Population

The Bootstrap (Efron, 1979) Use the observed data to simulate multiple datasets of size n: Repeatedly resample n points with replacement from the original dataset of size n. Compute θ*n on each resample. Compute ξ based on these multiple realizations of θ*n as our estimate of ξ for θn.
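
The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the quality assessment is taken to be a standard error, the estimator is the sample mean, and the function name `bootstrap_se` and all sizes are hypothetical.

```python
import numpy as np

def bootstrap_se(data, estimator, n_resamples=200, rng=None):
    """Bootstrap standard error of `estimator` on a 1-D array `data`."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    estimates = np.empty(n_resamples)
    for r in range(n_resamples):
        # Resample n points with replacement from the original n points.
        resample = data[rng.integers(0, n, size=n)]
        estimates[r] = estimator(resample)
    # Spread of the resampled estimates serves as the quality assessment.
    return float(estimates.std(ddof=1))

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=5_000)    # synthetic data (assumption)
se_boot = bootstrap_se(x, np.mean, rng=rng)
se_formula = float(x.std(ddof=1) / np.sqrt(len(x)))  # textbook SE of the mean
print(se_boot, se_formula)
```

For the sample mean the bootstrap value can be compared against the textbook formula s/√n; for estimators with no such formula, the same loop applies unchanged.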

The Bootstrap: Computational Issues Seemingly a wonderful match to modern parallel and distributed computing platforms. But the expected number of distinct points in a bootstrap resample is ~ 0.632n; e.g., if the original dataset has size 1 TB, then expect a resample to have size ~ 632 GB. Can’t feasibly send resampled datasets of this size to distributed servers. Even if one could, can’t compute the estimate locally on datasets this large.

Subsampling (Politis, Romano & Wolf, 1999) [Figure: the dataset of size n and a subsample of size b]

Subsampling There are many subsets of size b < n. Choose some sample of them and apply the estimator to each. This yields fluctuations of the estimate, and thus error bars. But a key issue arises: the fact that b < n means that the error bars will be on the wrong scale (they’ll be too large). Need to analytically correct the error bars.

Subsampling Summary of algorithm: Repeatedly subsample b < n points without replacement from the original dataset of size n. Compute θ*b on each subsample. Compute ξ based on these multiple realizations of θ*b. Analytically correct to produce final estimate of ξ for θn. The need for analytical correction makes subsampling less automatic than the bootstrap. Still, much more favorable computational profile than bootstrap. Let’s try it out in practice…
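
A minimal sketch of the subsampling steps, to make the rescaling issue concrete. It assumes the quality assessment is a standard error and that the estimator converges at the usual √n rate, so the analytical correction is a factor of √(b/n); that factor is an assumption of this example and must be derived case by case in general. The function name and sizes are hypothetical.

```python
import numpy as np

def subsampling_se(data, estimator, b, n_subsamples=200, rng=None):
    """Return (raw, corrected) standard errors from size-b subsamples."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    estimates = np.empty(n_subsamples)
    for r in range(n_subsamples):
        # Subsample b < n points WITHOUT replacement.
        sub = data[rng.choice(n, size=b, replace=False)]
        estimates[r] = estimator(sub)
    raw = float(estimates.std(ddof=1))        # on the scale of b points: too large
    corrected = raw * float(np.sqrt(b / n))   # assumed sqrt-n rate correction
    return raw, corrected

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=20_000)   # synthetic data (assumption)
b = int(len(x) ** 0.6)
se_raw, se_corrected = subsampling_se(x, np.mean, b, rng=rng)
se_formula = float(x.std(ddof=1) / np.sqrt(len(x)))
print(se_raw, se_corrected, se_formula)
```

The uncorrected value is larger than the full-data standard error by roughly √(n/b), which is the "wrong scale" problem described above.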

Empirical Results: Bootstrap and Subsampling Multivariate linear regression with d = 100 and n = 50,000 on synthetic data. x coordinates sampled independently from StudentT(3). y = w^T x + ε, where w ∈ R^d is a fixed weight vector and ε is Gaussian noise. Estimate θn = wn ∈ R^d via least squares. Compute a marginal confidence interval for each component of wn and assess accuracy via relative mean (across components) absolute deviation from true confidence interval size. For subsampling, use b(n) = n^γ for various values of γ. Similar results obtained with Normal and Gamma data generating distributions, as well as when estimating a misspecified model.
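
The data-generating setup described above might look roughly like the following. The weight vector (all ones), the unit noise standard deviation, and the function names are assumptions made for illustration; the talk does not specify them. The usage lines run a reduced problem size to keep the example quick.

```python
import numpy as np

def make_regression(n=50_000, d=100, noise_sd=1.0, rng=None):
    """Synthetic regression data: StudentT(3) covariates, Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_t(df=3, size=(n, d))     # coordinates drawn independently
    w = np.ones(d)                            # fixed weight vector (assumed value)
    y = X @ w + rng.normal(scale=noise_sd, size=n)
    return X, y, w

def least_squares(X, y):
    """Ordinary least-squares estimate of the weight vector."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

# Reduced sizes for a quick run; the slide uses n = 50,000 and d = 100.
X, y, w = make_regression(n=5_000, d=10, rng=np.random.default_rng(0))
w_hat = least_squares(X, y)
max_err = float(np.max(np.abs(w_hat - w)))
print(max_err)
```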

Empirical Results: Bootstrap and Subsampling

Bag of Little Bootstraps I’ll now present a new procedure that combines the bootstrap and subsampling, and gets the best of both worlds. It works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms. But, like the bootstrap, it doesn’t require analytical rescaling. And it’s successful in practice.

Towards the Bag of Little Bootstraps [Figure: the dataset of size n and a subsample of size b]

Approximation

Pretend the Subsample is the Population And bootstrap the subsample! This means resampling n times with replacement, not b times as in subsampling.

The Bag of Little Bootstraps (BLB) The subsample contains only b points, and so the resulting empirical distribution has its support on b points. But we can (and should!) resample it with replacement n times, not b times. Doing this repeatedly for a given subsample gives bootstrap confidence intervals on the right scale; no analytical rescaling is necessary! Now do this (in parallel) for multiple subsamples and combine the results (e.g., by averaging).
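
A minimal serial sketch of the procedure just described, assuming the quality assessment is a standard error and the estimator accepts per-point counts. Each size-n resample of a size-b subsample is represented by a multinomial count vector of length b, so only b distinct points are ever stored. The names `blb_se` and `weighted_mean`, the loop counts, and the data are hypothetical; a real deployment would run the outer loop in parallel.

```python
import numpy as np

def blb_se(data, weighted_estimator, b, n_subsamples=10, n_resamples=100, rng=None):
    """Bag of Little Bootstraps estimate of a standard error."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    per_subsample = []
    for _ in range(n_subsamples):             # outer loop: parallelizable
        sub = data[rng.choice(n, size=b, replace=False)]
        estimates = np.empty(n_resamples)
        for r in range(n_resamples):
            # Resample n times (not b times) from the b subsample points,
            # stored as counts that sum to n.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates[r] = weighted_estimator(sub, counts)
        per_subsample.append(estimates.std(ddof=1))
    return float(np.mean(per_subsample))      # combine by averaging

def weighted_mean(values, counts):
    """Sample mean of the resample encoded by `counts`."""
    return np.dot(values, counts) / counts.sum()

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=20_000)   # synthetic data (assumption)
b = int(len(x) ** 0.6)
se_blb = blb_se(x, weighted_mean, b, rng=rng)
se_formula = float(x.std(ddof=1) / np.sqrt(len(x)))
print(se_blb, se_formula)
```

No rescaling factor appears anywhere: because every resample has nominal size n, the spread of the estimates is already on the scale of the full dataset.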

The Bag of Little Bootstraps (BLB)

Bag of Little Bootstraps (BLB) Computational Considerations A key point: Resources required to compute θ generally scale with the number of distinct data points. This is true of many commonly used estimation algorithms (e.g., SVM, logistic regression, linear regression, kernel methods, general M-estimators, etc.). Use a weighted representation of resampled datasets to avoid physical data replication. Example: if the original dataset has size 1 TB with each data point 1 MB, and we take b(n) = n^0.6, then expect subsampled datasets to have size ~ 4 GB and resampled datasets to have size ~ 4 GB (in contrast, bootstrap resamples have size ~ 632 GB).
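
The sizes in this example can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and exponent 0.6 for the subsample size; variable names are illustrative.

```python
import math

point_bytes = 10**6                     # 1 MB per data point (decimal units assumed)
dataset_bytes = 10**12                  # 1 TB dataset
n = dataset_bytes // point_bytes        # number of data points: 1,000,000
b = round(n ** 0.6)                     # subsample size b(n) = n^0.6

# A BLB resample holds at most b distinct points plus their counts,
# so it occupies about as much space as the subsample itself.
subsample_gb = b * point_bytes / 1e9

# A standard bootstrap resample holds about (1 - 1/e) * n distinct points.
bootstrap_gb = (1 - math.exp(-1)) * n * point_bytes / 1e9

print(n, b, subsample_gb, bootstrap_gb)
```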

Empirical Results: Bag of Little Bootstraps (BLB)

BLB: Theoretical Results BLB is asymptotically consistent and higher-order correct (like the bootstrap), under essentially the same conditions that have been used in prior analysis of the bootstrap. Theorem (asymptotic consistency): Under standard assumptions (particularly that θ is Hadamard differentiable and ξ is continuous), the output of BLB converges to the population value of ξ as n, b approach ∞.

BLB: Theoretical Results Higher-Order Correctness Assume: θ is a studentized statistic. ξ(Qn(P)), the population value of ξ for θn, can be written as [expansion missing from transcript], where the pk are polynomials in population moments. The empirical version of ξ based on resamples of size n from a single subsample of size b can also be written as [expansion missing from transcript], where the corresponding coefficients are polynomials in the empirical moments of subsample j. b ≤ n and [condition missing from transcript].

BLB: Theoretical Results Higher-Order Correctness Then: [bound missing from transcript]. Note that m_1 can grow significantly more slowly than n here (see discussion in paper).

BLB: Theoretical Results Higher-Order Correctness Also, if BLB’s outer iterations use disjoint chunks of data rather than random subsamples, then [bound missing from transcript].

Part II: Large-Scale Matrix Completion and Matrix Concentration with Lester Mackey and Ameet Talwalkar University of California, Berkeley and Richard Chen, Brendan Farrell and Joel Tropp Caltech