Big Data, Computation and Statistics
Michael I. Jordan
February 23, 2012


What Is the Big Data Phenomenon?
Big Science is generating massive datasets to be used both for classical testing of theories and for exploratory science.
Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets.
Sensor networks are becoming pervasive.

What Is the Big Data Problem?
Computer science studies the management of resources, such as time and space and energy.
Data has not been viewed as a resource, but as a “workload”.
The fundamental issue is that data now needs to be viewed as a resource
–the data resource combines with other resources to yield timely, cost-effective, high-quality decisions and inferences
Just as with time or space, it should be the case (to first order) that the more of the data resource, the better
–is that true in our current state of knowledge?

No, for two main reasons:
–query complexity grows faster than the number of data points
   the more rows in a table, the more columns
   the more columns, the more hypotheses that can be considered
   indeed, the number of hypotheses grows exponentially in the number of columns
   so, the more data, the greater the chance that random fluctuations look like signal (e.g., more false positives; see the sketch below)
–the more data, the less likely a sophisticated algorithm will run in an acceptable time frame
   and then we have to back off to cheaper algorithms that may be more error-prone
   or we can subsample, but this requires knowing the statistical value of each data point, which we generally don’t know a priori
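
To make the false-positive point concrete, here is a small simulation (not from the slides, and purely illustrative): every column is pure noise, yet the number of nominally significant correlations with a noise outcome grows with the number of columns tested.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_rows = 1_000

for n_cols in (10, 100, 1_000):
    X = rng.normal(size=(n_rows, n_cols))   # pure-noise "features"
    y = rng.normal(size=n_rows)             # pure-noise "outcome"
    # Test each column against y at the 5% level; every null is true,
    # so every rejection is a false positive.
    pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_cols)])
    print(n_cols, "columns ->", int((pvals < 0.05).sum()), "false positives")
```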

A Formulation of the Problem
Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound).
–this is far from being achieved in the current state of the literature!
–a solution will blend computer science and statistics, and fundamentally change both fields
I believe that it is an achievable goal, but one that will require a major, sustained effort.

An Example
The bootstrap is a generic statistical method for computing error bars (and other measures of risk) for statistical estimates.
It requires resampling the data with replacement a few hundred times, and computing the estimator independently on each resampled data set.
–seemingly a wonderful connection with modern multi-core and cloud computing architectures
But if we have a terabyte of data, each resampled data set is 632 gigabytes…
–and error bars computed on subsamples have the wrong scale
So the classical bootstrap is dead in the water for massive data.
–this issue hasn’t been considered before in the statistics literature

The Big Data Bootstrap
Ariel Kleiner, Purnamrita Sarkar, Ameet Talwalkar, and Michael I. Jordan
University of California, Berkeley

Assessing the Quality of Inference
Observe data X_1, ..., X_n.
Form a parameter estimate θ̂_n = θ(X_1, ..., X_n).
Want to compute an assessment ξ of the quality of our estimate θ̂_n (e.g., a confidence region).

The Goal
A procedure for quantifying estimator quality which is
–accurate
–automatic
–scalable

The Unachievable Frequentist Ideal
Ideally, we would:
① Observe many independent datasets of size n.
② Compute θ̂_n on each.
③ Compute ξ based on these multiple realizations of θ̂_n.
But, we only observe one dataset of size n.

The Underlying Population

Sampling

Approximation

Pretend The Sample Is The Population

The Bootstrap (Efron, 1979)
Use the observed data to simulate multiple datasets of size n:
① Repeatedly resample n points with replacement from the original dataset of size n.
② Compute θ̂*_n on each resample.
③ Compute ξ based on these multiple realizations of θ̂*_n as our estimate of ξ for θ̂_n.
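
As a concrete illustration (not from the slides), here is a minimal Python/NumPy sketch of the bootstrap for the standard error of an estimator; the default estimator, the number of resamples B = 200, and the assumption that `data` is a 1-D NumPy array are all illustrative choices.

```python
import numpy as np

def bootstrap_stderr(data, estimator=np.mean, B=200, rng=None):
    """Bootstrap estimate of the standard error of `estimator`.

    Resamples the data with replacement B times, recomputes the estimator
    on each resample, and returns the standard deviation of the B replicates.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    replicates = np.empty(B)
    for b in range(B):
        resample = rng.choice(data, size=n, replace=True)  # n draws with replacement
        replicates[b] = estimator(resample)
    return replicates.std(ddof=1)

# Example usage on synthetic data:
# x = np.random.default_rng(0).standard_t(df=3, size=10_000)
# print(bootstrap_stderr(x))
```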

The Bootstrap: Computational Issues
Expected number of distinct points in a bootstrap resample is ~ 0.632n
–each point is omitted from a given resample with probability (1 − 1/n)^n ≈ 1/e, so the expected fraction of distinct points is 1 − 1/e ≈ 0.632
Resources required to compute θ̂ generally scale in the number of distinct data points.
–this is true of many commonly used algorithms (e.g., SVM, logistic regression, linear regression, kernel methods, general M-estimators, etc.)
–use a weighted representation of resampled datasets to avoid physical data replication
–example: if the original dataset has size 1 TB, then expect a resample to have size ~ 632 GB

Subsampling (Politis, Romano & Wolf, 1999)
[figure: a subsample of size b is drawn from the n data points]

Compute θ̂ only on smaller subsets of the data of size b(n) < n, and analytically correct our estimates of ξ:
① Repeatedly subsample b(n) < n points without replacement from the original dataset of size n.
② Compute θ̂_b(n) on each subsample.
③ Compute ξ based on these multiple realizations of θ̂_b(n).
④ Analytically correct to produce the final estimate of ξ for θ̂_n.
Much more favorable computational profile than the bootstrap.
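
A minimal sketch of subsampling with one common form of the analytic correction, assuming a √n-consistent estimator so that interval widths computed on size-b subsamples can be rescaled by √(b/n); the choice b(n) = n^0.7, the percentile construction, and the assumption that `data` is a 1-D NumPy array are illustrative, not taken from the slides.

```python
import numpy as np

def subsample_ci_width(data, estimator=np.mean, b=None, num_subsamples=50,
                       alpha=0.05, rng=None):
    """Subsampling estimate of a (1 - alpha) confidence-interval width.

    Computes the estimator on many size-b subsamples drawn without
    replacement, forms a percentile interval of those estimates, and
    rescales its width by sqrt(b / n), the analytic correction for
    estimators that converge at the sqrt(n) rate.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    b = b if b is not None else int(n ** 0.7)          # illustrative choice of b(n)
    estimates = np.empty(num_subsamples)
    for j in range(num_subsamples):
        idx = rng.choice(n, size=b, replace=False)     # without replacement
        estimates[j] = estimator(data[idx])
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return (hi - lo) * np.sqrt(b / n)                  # rescale from size b to size n
```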

Empirical Results: Bootstrap and Subsampling
Multivariate linear regression with d = 100 and n = 50,000 on synthetic data.
x coordinates sampled independently from Student-t(3).
y = wᵀx + ε, where w ∈ ℝ^d is a fixed weight vector and ε is Gaussian noise.
Estimate θ̂_n = ŵ_n ∈ ℝ^d via least squares.
Compute a marginal confidence interval for each component of ŵ_n and assess accuracy via the relative mean (across components) absolute deviation from the true confidence interval size.
For subsampling, use b(n) = n^γ for various values of γ.
Similar results obtained with Normal and Gamma data-generating distributions, as well as when estimating a misspecified model.
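
A sketch of the synthetic-data setup described above; the random seed, the particular weight vector, and the unit noise standard deviation are illustrative assumptions (the slide does not specify them).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 50_000

# Covariates drawn independently from a Student-t distribution with 3 degrees of freedom.
X = rng.standard_t(df=3, size=(n, d))

# Fixed weight vector and additive Gaussian noise (noise scale is an assumption).
w = rng.normal(size=d)
y = X @ w + rng.normal(scale=1.0, size=n)

# Least-squares estimate of w.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```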

Empirical Results: Bootstrap and Subsampling [results figure]

Subsampling
[figure: a subsample of size b drawn from the n data points]

Approximation

Pretend the Subsample is the Population

And bootstrap the subsample! This means resampling n times with replacement, not b times as in subsampling

The Bag of Little Bootstraps (BLB)
① Repeatedly subsample b(n) < n points without replacement from the original dataset of size n.
② For each subsample do:
   a) Repeatedly resample n points with replacement from the subsample.
   b) Compute θ̂*_n on each resample.
   c) Compute an estimate of ξ based on these multiple resampled realizations of θ̂*_n.
③ We now have one estimate of ξ per subsample. Output their average as our final estimate of ξ for θ̂_n.
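
A minimal, unoptimized sketch of the BLB procedure just described; the numbers of subsamples and resamples, the exponent γ = 0.6 in b(n) = n^γ, the use of the replicate standard deviation as the quality measure ξ, and the assumption that `data` is a 1-D NumPy array are all illustrative.

```python
import numpy as np

def blb_stderr(data, estimator=np.mean, s=10, r=50, gamma=0.6, rng=None):
    """Bag of Little Bootstraps estimate of the standard error of `estimator`.

    Draws s subsamples of size b = n**gamma without replacement; for each
    subsample, draws r resamples of size n *with* replacement from that
    subsample, computes the estimator on each, and summarizes the r
    replicates by their standard deviation. The s per-subsample estimates
    of quality are then averaged.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    b = int(n ** gamma)
    per_subsample_quality = np.empty(s)
    for j in range(s):
        sub = data[rng.choice(n, size=b, replace=False)]   # subsample without replacement
        reps = np.empty(r)
        for k in range(r):
            resample = sub[rng.integers(0, b, size=n)]     # n draws with replacement
            reps[k] = estimator(resample)
        per_subsample_quality[j] = reps.std(ddof=1)
    return per_subsample_quality.mean()
```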

The Bag of Little Bootstraps (BLB) [schematic figure]

Bag of Little Bootstraps (BLB): Computational Considerations
A key point: resources required to compute θ̂ generally scale in the number of distinct data points.
–this is true of many commonly used estimation algorithms (e.g., SVM, logistic regression, linear regression, kernel methods, general M-estimators, etc.)
–use a weighted representation of resampled datasets to avoid physical data replication
Example: if the original dataset has size 1 TB with each data point 1 MB, and we take b(n) = n^0.6, then expect
–subsampled datasets to have size ~ 4 GB
–resampled datasets to have size ~ 4 GB
(in contrast, bootstrap resamples have size ~ 632 GB)
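
A sketch of the weighted-representation idea mentioned above: rather than materializing n resampled points, draw a Multinomial(n, 1/b) count vector over the b distinct subsample points and hand it to an estimator that accepts per-point weights. The weighted least-squares helper below is an illustrative stand-in, not the authors' implementation.

```python
import numpy as np

def blb_resample_weights(b, n, rng):
    """Counts of how many times each of the b subsample points appears in a
    resample of size n drawn with replacement: Multinomial(n, 1/b)."""
    return rng.multinomial(n, np.full(b, 1.0 / b))

def weighted_least_squares(X, y, weights):
    """Least-squares fit in which row i of X is counted `weights[i]` times."""
    Xw = X * weights[:, None]                                   # scale rows by their counts
    beta, *_ = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)  # weighted normal equations
    return beta

# Usage inside one BLB inner iteration, where X_sub, y_sub hold the b
# subsampled rows and n is the full dataset size (names are illustrative):
# rng = np.random.default_rng(0)
# counts = blb_resample_weights(len(y_sub), n, rng)
# beta_rep = weighted_least_squares(X_sub, y_sub, counts)
```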

Empirical Results: Bag of Little Bootstraps (BLB) [results figure]

BLB: Theoretical Results
Higher-Order Correctness
Then: [result not captured in the transcript]

BLB: Theoretical Results
BLB is asymptotically consistent and higher-order correct (like the bootstrap), under essentially the same conditions that have been used in prior analyses of the bootstrap.
Theorem (asymptotic consistency): Under standard assumptions (particularly that θ is Hadamard differentiable and ξ is continuous), the output of BLB converges to the population value of ξ as n, b approach ∞.

BLB: Theoretical Results
Higher-Order Correctness
Assume:
–θ̂ is a studentized statistic.
–ξ(Q_n(P)), the population value of ξ for θ̂_n, can be written as [expansion not captured in the transcript], where the p_k are polynomials in population moments.
–The empirical version of ξ based on resamples of size n from a single subsample of size b can also be written as [expansion not captured in the transcript], where the p̂_k^(j) are polynomials in the empirical moments of subsample j.
–b ≤ n and [remaining conditions not captured in the transcript]
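
The expansions referenced above were images on the original slides and are not captured in the transcript. As best recalled from the standard Edgeworth-style setup used in higher-order analyses of the bootstrap (and in the BLB analysis), they take roughly the following form (a reconstruction, not a verbatim quote):

```latex
% Assumed form of the expansions (reconstruction, not verbatim from the slides):
\xi\bigl(Q_n(P)\bigr) \;=\; z + \frac{p_1}{\sqrt{n}} + \frac{p_2}{n} + o\!\left(\frac{1}{n}\right),
\qquad
\xi\bigl(Q_n(\mathbb{P}^{(j)}_{n,b})\bigr) \;=\; z + \frac{\hat{p}^{(j)}_1}{\sqrt{n}} + \frac{\hat{p}^{(j)}_2}{n} + o_P\!\left(\frac{1}{n}\right),
```

where z is a constant, the p_k are polynomials in population moments, and the p̂_k^(j) are the same polynomials evaluated at the empirical moments of subsample j.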

BLB: Theoretical Results
Higher-Order Correctness
Also, if BLB’s outer iterations use disjoint chunks of data rather than random subsamples, then [result not captured in the transcript]

Conclusions
An initial foray into the evaluation of estimators for Big Data
Many further directions to pursue
–BLB for time series and graphs
–streaming BLB
–m out of n / BLB hybrids
–super-scalability
Technical Report available at: