A Scalable Bootstrap for Massive Data
Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, Michael I. Jordan
Why bootstrap?

The bootstrap made it possible to use computers not only to compute estimates but also to assess the quality of estimators. It provides a simple and powerful means of assessing estimator quality, and such quality assessments provide more information than a point estimate alone. The results are generally consistent (based on asymptotic theory, as $n \to \infty$) and often more accurate than those based on asymptotic approximations.
A New Setting

Two recent trends:
- Accelerated growth in the size of data sets ("massive" data).
- Computational resources shifting toward parallel and distributed architectures (multicore, cloud platforms).

From an inferential point of view, it is not yet clear how statistical methodology will transport to a world involving massive data on parallel and distributed computing platforms. However, the core inferential need to assess the quality of estimators remains. The uncertainty and bias in estimates based on large data can remain quite significant, as large datasets are often high dimensional and can have many potential sources of bias. In this new setting, even if sufficient data are available to allow highly accurate estimation, efficiently assessing estimator quality allows efficient use of available resources by processing only as much data as necessary. The bootstrap brings various desirable features to the massive data setting, notably its relatively automatic nature and its applicability to a wide variety of inferential problems. Massive data may also motivate considering a wide range of models and estimators, enhancing the need for control over bias and variance.
Why the Classic Bootstrap is Problematic

Recall that bootstrap-based quantities are typically computed by repeatedly applying the estimator in question to resamples of the entire observed data set, each resample having size on the order of the original data set, with approximately 63% of the data points appearing at least once in each resample. In the massive data setting, computing even a single point estimate on the full data can be quite computationally demanding. Can we use parallel computing? The large size of bootstrap resamples renders this approach problematic, as the cost of transferring data to independent processors or compute nodes can be overly high. In other words, there is a quantity that measures the estimator's quality, and its distribution depends on the underlying distribution $P$ and on the empirical distribution of the $n$ observations drawn from $P$.
Notation

We observe $X_1, \dots, X_n$ drawn IID from some unknown distribution $P$, and form an estimate $\hat\theta_n$ based on the empirical distribution of the data, $\mathbb{P}_n$. We are interested in assessing the quality of $\hat\theta_n$ via $\xi(Q_n(P), P)$, a quantity depending on $Q_n(P)$, the distribution of $\hat\theta_n$, and on the underlying distribution $P$. For instance, $\xi(Q_n(P)) = \mathrm{Var}(\hat\theta_n)$, or $\xi(Q_n(P)) = E_P[\hat\theta_n - \theta(P)]$ (the bias). Under this notation, the bootstrap simply computes the data-driven plug-in approximation $\xi(Q_n(P)) \approx \xi(Q_n(\mathbb{P}_n))$, via the empirical distribution of the repeatedly computed bootstrap estimates $\hat\theta_n^*$. Nonetheless, given knowledge of $Q_n(P)$, any direct dependence of $\xi$ on $P$ generally has a simple form, often involving only the parameter $\theta$.
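To make the plug-in principle concrete, the following is a minimal sketch (not code from the paper) of the classic bootstrap approximation $\xi(Q_n(P)) \approx \xi(Q_n(\mathbb{P}_n))$; the names `bootstrap_quality`, `estimator`, and `xi` are illustrative placeholders.

```python
import numpy as np

def bootstrap_quality(data, estimator, xi, n_resamples=200, seed=0):
    """Plug-in bootstrap approximation of xi(Q_n(P)).

    `estimator` maps a sample to a point estimate theta_hat; `xi` maps the
    collection of resampled estimates (an empirical stand-in for Q_n(P_n))
    to the quality measure of interest.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        resample = data[rng.integers(0, n, size=n)]  # n points, with replacement
        estimates.append(estimator(resample))
    return xi(np.asarray(estimates))

# Example: bootstrap standard error of the sample mean.
data = np.random.default_rng(1).standard_normal(10_000)
se_hat = bootstrap_quality(data, estimator=np.mean, xi=np.std)
```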
Previous Solutions

The literature on improving the computational efficiency of the bootstrap is largely devoted to reducing the size of the resamples. When using resamples of size $m < n$, we need to take into account that we are implicitly changing our goal from estimating features of $Q_n$ to estimating features of $Q_m$. Moreover, these methods' success is sensitive to the choice of the resample size $m$.
Subsampling & the m out of n Bootstrap

For any positive integer $m$, let the bootstrap sample $X_1^*, \dots, X_m^*$ be a sample drawn from $\mathbb{P}_n$, and denote the $m$-bootstrap version of $\hat\theta_n$ by $\hat\theta_m^*$, with distribution $Q_m(\mathbb{P}_n)$. Assume $\tau_n(\hat\theta_n - \theta) \to_d F$ for some normalizing sequence $\tau_n$. For any $m \le n$, bootstrap samples may be drawn with or without replacement. If they are drawn without replacement (known as subsampling), then $\tau_m(\hat\theta_m^* - \hat\theta_n) \to_d F$ under minimal conditions. If the resampling is done with replacement (the m out of n bootstrap), this limit holds if, in addition to the minimal conditions, $\hat\theta_m^*$ is not affected much by ties among the resampled points. The resulting estimate should then be rescaled by $\tau_m / \tau_n$. Thus, subsampling is more general than the m out of n bootstrap, since fewer assumptions are required. However, the m out of n bootstrap has the advantage that it allows the choice $m = n$; in that case, unlike subsampling, it enjoys the second-order properties of the $n$-bootstrap. The remaining problem is that the distribution of $\hat\theta_m^* - \hat\theta_n$ differs from that of $\hat\theta_n - \theta$ by a factor of $\tau_m / \tau_n$. For example, for the sample mean, as $m, n \to \infty$, $\sqrt{m}(\bar X_m^* - \bar X_n) \to_d N(0, \sigma^2)$ (Bickel and Freedman, 1981); hence $\sqrt{m}\,\bar X_m^*$ behaves like $N(\sqrt{m}\,\bar X_n, \sigma^2)$.
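As an illustration of the rescaling, here is a minimal sketch (not from the paper) of the m out of n bootstrap for the standard error of the sample mean, assuming $\tau_n = \sqrt{n}$ so that the correction factor is $\tau_m / \tau_n = \sqrt{m/n}$; the function name `m_out_of_n_se` is illustrative.

```python
import numpy as np

def m_out_of_n_se(data, m, n_resamples=500, seed=0):
    """m out of n bootstrap estimate of the SE of the sample mean.

    The resampled deviations theta*_m - theta_hat_n fluctuate on the
    1/sqrt(m) scale; multiplying by tau_m / tau_n = sqrt(m / n) rescales
    them to the 1/sqrt(n) scale of the full-sample estimator.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    theta_hat = data.mean()
    deviations = [data[rng.integers(0, n, size=m)].mean() - theta_hat
                  for _ in range(n_resamples)]
    return np.std(deviations) * np.sqrt(m / n)
```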
Bag of Little Bootstraps

Given a subset size $b < n$, we sample without replacement $s$ subsets (not necessarily disjoint) of size $b$ from the original $n$ data points. Let $I_1, \dots, I_s \subset \{1, \dots, n\}$ be the corresponding index sets ($|I_j| = b$ for all $j$), and let $\mathbb{P}_{n,b}^{(j)}$ denote the empirical distribution corresponding to subset $j$. For each $j$, repeatedly resample $n$ points IID from $\mathbb{P}_{n,b}^{(j)}$ to form the empirical distributions $\mathbb{P}_{n,k}^{*}$, and compute $\hat\theta(\mathbb{P}_{n,k}^{*})$ for each resample. Form the empirical distribution $\mathbb{Q}_{n,j}^{*}$ of the computed $\hat\theta$'s and compute $\xi(\mathbb{Q}_{n,j}^{*}) \approx \xi(Q_n(\mathbb{P}_{n,b}^{(j)}))$. BLB's estimate of $\xi(Q_n(P))$ is then given by $s^{-1} \sum_{j=1}^{s} \xi(Q_n(\mathbb{P}_{n,b}^{(j)}))$.
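The following is a minimal sketch of this procedure (not the authors' implementation), written for the simple case where $\hat\theta$ is the sample mean and $\xi$ is a standard error; the names `blb`, `estimator`, and `xi` are illustrative, and this naive version materializes each size-$n$ resample explicitly (the count-based representation discussed later avoids that).

```python
import numpy as np

def blb(data, estimator, xi, b, s=20, r=100, seed=0):
    """Bag of Little Bootstraps sketch.

    For each of s subsets of size b, draw r resamples of nominal size n,
    apply `estimator` to each resample, summarize the r estimates with
    `xi`, and average the s resulting values.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    xi_values = []
    for _ in range(s):
        subset = data[rng.choice(n, size=b, replace=False)]  # P_{n,b}^(j)
        thetas = []
        for _ in range(r):
            resample = subset[rng.integers(0, b, size=n)]    # n points from the subset
            thetas.append(estimator(resample))
        xi_values.append(xi(np.asarray(thetas)))
    return np.mean(xi_values, axis=0)

# Example: BLB estimate of the standard error of the mean, with b = n^0.7.
data = np.random.default_rng(1).standard_normal(20_000)
se_hat = blb(data, estimator=np.mean, xi=np.std, b=int(len(data) ** 0.7))
```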
[Schematic of the BLB procedure: $P$ → sample $n$ observations → $\mathbb{P}_n$ → sample $b$ observations without replacement ($s$ times) → $\mathbb{P}_{n,b}^{(1)}, \dots, \mathbb{P}_{n,b}^{(s)}$ → from each subset, sample $n$ observations with replacement ($r$ times) → $\mathbb{P}_{n,1}^{*}, \dots, \mathbb{P}_{n,r}^{*}$ → compute $\hat\theta(\mathbb{P}_{n,k}^{*})$ for each resample → form $\mathbb{Q}_{n,j}^{*}$ and compute $\xi(\mathbb{Q}_{n,j}^{*})$ → average: $\xi(Q_n(P)) \approx s^{-1} \sum_{j=1}^{s} \xi(\mathbb{Q}_{n,j}^{*})$.]
Computational Benefits

Each BLB resample, despite having nominal size $n$, contains at most $b$ distinct data points. To generate each resample, it suffices to draw a vector of counts from an $n$-trial uniform multinomial distribution over $b$ objects. We can then represent each resample by the at most $b$ distinct data points within it and the corresponding sampled counts, so each resample requires only $O(b)$ storage. If the estimator can work directly with this weighted data representation, then its computational requirements, with respect to both time and storage, scale only in $b$ rather than $n$. (That is the case for most commonly used estimators, such as general M-estimators.) Parallel computation?
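A minimal sketch of this count-based representation (again assuming the sample mean as the estimator; the helper name `blb_weighted_mean_se` is illustrative): the resample is never materialized, only an $n$-trial multinomial count vector over the $b$ subset points, which is passed to the estimator as weights.

```python
import numpy as np

def blb_weighted_mean_se(data, b, s=20, r=100, seed=0):
    """BLB standard error of the sample mean using the O(b) count representation."""
    rng = np.random.default_rng(seed)
    n = len(data)
    xi_values = []
    for _ in range(s):
        subset = data[rng.choice(n, size=b, replace=False)]
        thetas = []
        for _ in range(r):
            counts = rng.multinomial(n, np.full(b, 1.0 / b))   # resample as counts
            thetas.append(np.average(subset, weights=counts))  # weighted estimator
        xi_values.append(np.std(thetas))
    return float(np.mean(xi_values))
```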
Consistency of the BLB

The BLB has statistical properties identical to those of the bootstrap, under the same conditions that have been used in prior analyses of the bootstrap.

Theorem 1. Suppose that $\hat\theta_n = \phi(\mathbb{P}_n)$ and $\theta = \phi(P)$, where $\phi$ is Hadamard differentiable at $P$ tangentially to some subspace, with $P$, $\mathbb{P}_n$ and $\mathbb{P}_{n,b}^{(j)}$ viewed as maps from some Donsker class $\mathcal{F}$ to $\mathbb{R}$. Additionally, assume that $\xi(Q_n(P))$ is a continuous function of the distribution $Q_n(P)$ with respect to a metric that metrizes weak convergence. Then
$$ s^{-1} \sum_{j=1}^{s} \xi\big(Q_n(\mathbb{P}_{n,b}^{(j)})\big) - \xi\big(Q_n(P)\big) \xrightarrow{P} 0 $$
as $n \to \infty$, for any sequence $b \to \infty$ and for any fixed $s$.

Hadamard differentiability: a functional $\phi : D \to E$ is Hadamard differentiable at $\theta \in D$ if there exists a continuous linear map $\phi'_\theta : D \to E$ satisfying
$$ \left\| \frac{\phi(\theta + t h_t) - \phi(\theta)}{t} - \phi'_\theta(h) \right\| \to 0 \quad \text{as } t \to 0, $$
for every $h \in D$ and every $h_t$ with $\|h_t - h\| \to 0$. In other words, Gateaux differentiability requires the difference quotients to converge to some $\phi'(\theta, h)$ separately for each direction $h$, whereas Hadamard differentiability requires a single continuous linear $\phi'_\theta$ that works uniformly over directions $h_t \to h$.

Why this matters for estimation from the empirical distribution: the conclusion means that, as $b, n \to \infty$, the estimates $s^{-1} \sum_{j=1}^{s} \xi(Q_n(\mathbb{P}_{n,b}^{(j)}))$ returned by BLB approach the population value $\xi(Q_n(P))$ in probability. The empirical process is $\sqrt{n}(\mathbb{P}_n - P)$. For a Donsker class $\mathcal{F}$, $\sqrt{n}(\mathbb{P}_n - P) \rightsquigarrow G_P$, where $G_P$ is a $P$-Brownian bridge process, so the empirical process indexed by all functions in the class converges to the same Gaussian limit; in particular, for any bounded continuous function $h$, $E\,h\big(\sqrt{n}(\mathbb{P}_n - P)\big) \to E\,h(G_P)$. Combined with the functional delta method, the Hadamard assumption then provides the form of the random element to which $\sqrt{n}(\hat\theta_n - \theta)$ converges: $\sqrt{n}\big(\phi(\mathbb{P}_n) - \phi(P)\big) \rightsquigarrow \phi'_P(G_P)$.
Rate of Convergence of the BLB

Prior work has shown that the bootstrap is higher-order correct in many cases, converging to the true value $\xi(Q_n(P))$ at rate $O_P(1/n)$. The BLB shares the same degree of higher-order correctness, assuming that $b$ and $s$ are chosen to be sufficiently large. Importantly, sufficiently large values of $b$ can still be significantly smaller than $n$, with $b/n \to 0$ as $n \to \infty$.

Theorem 2. Suppose that $\xi(Q_n(P))$ admits an expansion as an asymptotic series
$$ \xi(Q_n(P)) = z + \frac{p_1}{\sqrt{n}} + \dots + \frac{p_k}{n^{k/2}} + o\!\left(\frac{1}{n^{k/2}}\right), $$
where $z$ is a constant independent of $P$ and the $p_k$ are polynomials in the moments of $P$. Additionally, assume that the empirical version of $\xi(Q_n(P))$ for any $j$ admits a similar expansion. Then
$$ s^{-1} \sum_{j=1}^{s} \xi\big(Q_n(\mathbb{P}_{n,b}^{(j)})\big) - \xi\big(Q_n(P)\big) = O_P\!\left(\frac{1}{n}\right), $$
meaning that the asymptotic error of the estimate is a term of order $n^{-1}$.

The notation $Y_n = O_P(Z_n)$ means that the sequence $Y_n / Z_n$ is stochastically bounded: for any $\varepsilon > 0$ there exist a finite $M > 0$ and a finite $N > 0$ such that, for all $n > N$,
$$ P\big(|Y_n / Z_n| > M\big) < \varepsilon, \quad \text{equivalently} \quad P\big(|Y_n| > M |Z_n|\big) < \varepsilon. $$
Simulation Results (Regression & Classification)

The simulated data: $(X_i, Y_i) \sim P$ IID for $i = 1, \dots, n$, with $X_i \in \mathbb{R}^d$; $Y_i \in \mathbb{R}$ for regression and $Y_i \in \{0, 1\}$ for classification. In each case, $\hat\theta_n \in \mathbb{R}^d$ is the estimated parameter of a linear or generalized linear model between $Y_i$ and $X_i$. $\xi$ is a procedure that computes a set of marginal 95% confidence intervals, one for each component of $\hat\theta_n$; the true $\xi$ is therefore given by the 2.5th and 97.5th percentiles of the marginal componentwise distributions defined by $Q_n(P)$. Given an estimated marginal CI width $c$ and a true width $c_0$, the relative deviation is defined as $|c - c_0| / c_0$.

In the regression setting, either $Y_i = X_i^T \mathbf{1}_d + \varepsilon_i$ or $Y_i = X_i^T \mathbf{1}_d + X_i^T X_i + \varepsilon_i$, with $d = 100$. In the classification setting, either $Y_i \sim \mathrm{Bernoulli}\big((1 + \exp(-X_i^T \mathbf{1}_d))^{-1}\big)$ or $Y_i \sim \mathrm{Bernoulli}\big((1 + \exp(-X_i^T \mathbf{1}_d - X_i^T X_i))^{-1}\big)$, with $d = 20$. In both settings, $n = 20{,}000$. For each method, $b = n^\gamma$ with $\gamma \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$, and $r = 100$.

Note that we evaluate based on confidence interval widths, rather than coverage probabilities, to control the running times of the experiments: in this experimental setting, even a single run of a quality assessment procedure requires non-trivial time, and computing coverage probabilities would require a large number of such runs.
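For concreteness, here is a small sketch (hypothetical helper names, and assuming standard normal covariates and noise, one of several distributions used in the experiments) of the linear regression data-generating model and of the marginal-CI-width summary used as $\xi$.

```python
import numpy as np

def simulate_linear_regression(n=20_000, d=100, seed=0):
    """One draw from the linear model Y_i = X_i^T 1_d + eps_i."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = X @ np.ones(d) + rng.standard_normal(n)
    return X, y

def marginal_ci_widths(thetas, alpha=0.05):
    """xi: widths of the marginal 95% CIs, one per component, computed from
    an array of resampled estimates with shape (n_resamples, d)."""
    lo = np.percentile(thetas, 100 * alpha / 2, axis=0)
    hi = np.percentile(thetas, 100 * (1 - alpha / 2), axis=0)
    return hi - lo
```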
Regression Setting

Left column: linear data-generating model, with $X_{i,j}, \varepsilon_i \sim \mathrm{Gamma}$.
Middle column: quadratic data-generating model, with $X_{i,j}, \varepsilon_i \sim \mathrm{Gamma}$.
Right column: linear data-generating model, with $X_{i,j} \sim \mathrm{StudentT}(3)$ and $\varepsilon_i \sim \mathrm{Normal}(0, 10)$.

- BLB always converges faster than the bootstrap.
- BLB is not as sensitive as the BOFN (b out of n) and SS (subsampling) methods to the value of $b$ (i.e., of $\gamma$).
- The quadratic model was somewhat more challenging for BLB.
- The value of $s$ required for convergence of BLB is 1-2 for $b = n^{0.9}$ and up to 10-14 for $b = n^{0.5}$.
Classification Setting

Left column: linear data-generating model, with $X_{i,j} \sim \mathrm{Gamma}$.
Middle column: quadratic data-generating model, with $X_{i,j} \sim \mathrm{Gamma}$.
Right column: linear data-generating model, with $X_{i,j} \sim \mathrm{StudentT}(3)$.

- The case of the linear data-generating model with $X_{i,j} \sim \mathrm{Gamma}$ appears to be the most challenging: for $b \le n^{0.6}$, BLB fails to converge to the bootstrap's relative error.
- In every scenario, BLB is more robust than BOFN.
- The value of $s$ required for convergence of BLB is 1-2 for $b = n^{0.9}$ and up to 10-20 for $b \le n^{0.6}$.
Relative Error vs. n

Left column: classification with a linear data-generating model, with $X_{i,j} \sim \mathrm{Gamma}$.
Right column: classification with a linear data-generating model, with $X_{i,j} \sim \mathrm{StudentT}(3)$.

As expected, BLB's relative error here is higher than that of the bootstrap for the smallest values of $n$ and $b$ considered. Nonetheless, BLB's relative error decreases to that of the bootstrap as $n$ increases, for all considered values of $\gamma$. BLB's relative error is consistently and substantially lower than that of the b out of n bootstrap.
Computational Scalability

When computing on a single processor, BLB generally requires less time, and hence less total computation, than the bootstrap to attain comparably high accuracy. These results only hint at BLB's superior ability to scale computationally to large datasets through parallel computing architectures. The most natural avenue for applying the bootstrap to large-scale data using distributed computing is the following: given data partitioned across a cluster of compute nodes, parallelize the computation of the estimate on each resample across the cluster, processing one resample at a time. Each computation of the estimate then requires the use of the entire cluster. In contrast, BLB permits computation on multiple (or even all) subsamples and resamples simultaneously in parallel. Because BLB subsamples and resamples can be significantly smaller than the original dataset, they can be transferred to, stored by, and processed independently on individual (or very small sets of) compute nodes.
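A minimal single-machine sketch of this parallelization pattern, using a process pool as a stand-in for a cluster scheduler (the function names are illustrative, and the estimator is again the sample mean): each worker receives only one size-$b$ subset, so the data shipped per task is $O(b)$, and the $s$ subsets are processed simultaneously.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def process_subset(args):
    """Work assigned to one worker: a single subset and its r resamples."""
    subset, n, r, seed = args
    rng = np.random.default_rng(seed)
    b = len(subset)
    thetas = [np.average(subset, weights=rng.multinomial(n, np.full(b, 1.0 / b)))
              for _ in range(r)]
    return np.std(thetas)  # xi for this subset

def parallel_blb(data, b, s=20, r=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    jobs = [(data[rng.choice(n, size=b, replace=False)], n, r, seed + 1 + j)
            for j in range(s)]
    with ProcessPoolExecutor() as pool:
        return float(np.mean(list(pool.map(process_subset, jobs))))
```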
[Schematic comparing cluster usage: with the bootstrap, each resample has size on the order of $n$, so computing the estimate on a single resample occupies the entire cluster of compute nodes, one resample at a time. With BLB, each subsample $\mathbb{P}_{n,b}^{(j)}$ and its resamples fit on an individual compute node, so the $s$ subsamples, $j = 1, \dots, s$, and their resamples are processed in parallel across the cluster.]
Choosing r and s

The bottom plot shows the relative error achieved by BLB for different values of $r$ and $s$ (with $b$ fixed). For all but the smallest values of $r$ and $s$ ($r \ge 50$, $s \ge 5$), it is possible to choose these values independently of one another such that BLB achieves low relative error. The upper plot shows relative error vs. processing time (without parallelization) for BLB using adaptive selection of $r$ and $s$ (the resulting stopping times of the BLB trajectories are marked by large squares). Both plots are from the classification setting with a linear data-generating model, with $X_{i,j} \sim \mathrm{StudentT}(3)$.

Concretely, to select $r$ adaptively in the inner loop of Algorithm 1, the authors propose an iterative scheme whereby, for any given subsample $j$, we continue to process resamples and update $\xi^*_{n,j}$ until it has ceased to change significantly. The same scheme can be used to select $s$ adaptively, by processing more subsamples (i.e., increasing $s$) until BLB's output value $s^{-1} \sum_{j=1}^{s} \xi^*_{n,j}$ has stabilized; in this case, one can simultaneously also choose $r$ adaptively and independently for each subsample. Smaller values of $r$ are selected when $\xi$ is easier to compute.
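A minimal sketch of this adaptive scheme (not the paper's exact stopping rule; the tolerance, window sizes, and helper name `blb_adaptive` are illustrative choices), again for the standard error of the mean:

```python
import numpy as np

def blb_adaptive(data, b, rel_tol=0.02, window_r=20, window_s=3,
                 max_r=500, max_s=50, seed=0):
    """Adaptive BLB for the SE of the mean: keep drawing resamples for a
    subset until its xi estimate changes by less than rel_tol over the last
    window_r resamples, and keep adding subsets until the running average
    of the per-subset xi values stabilizes in the same sense."""
    rng = np.random.default_rng(seed)
    n = len(data)
    xi_values, output_trace = [], []
    for j in range(max_s):
        subset = data[rng.choice(n, size=b, replace=False)]
        thetas, xi_trace = [], []
        for k in range(max_r):
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            thetas.append(np.average(subset, weights=counts))
            xi_trace.append(np.std(thetas))
            if (k >= window_r and
                    abs(xi_trace[-1] - xi_trace[-1 - window_r])
                    <= rel_tol * abs(xi_trace[-1 - window_r])):
                break  # r chosen adaptively for this subset
        xi_values.append(xi_trace[-1])
        output_trace.append(np.mean(xi_values))
        if (j >= window_s and
                abs(output_trace[-1] - output_trace[-1 - window_s])
                <= rel_tol * abs(output_trace[-1 - window_s])):
            break  # s chosen adaptively
    return output_trace[-1]
```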
Real Data

Results for BLB, the bootstrap, and the b out of n bootstrap on the UCI connect4 dataset: logistic regression with $d = 42$, $n = 67{,}557$. Notably, the outputs of BLB for all values of $b$ considered, and the output of the bootstrap, are tightly clustered around the same value. Additionally, as expected, BLB converges more quickly than the bootstrap. In contrast, the values produced by the b out of n bootstrap vary significantly as $b$ changes, further highlighting that procedure's lack of robustness.
Time Series

To extend BLB to time series data, the authors suggest an adaptation of the stationary bootstrap (Politis and Romano, 1994). It suffices to select each subsample as a (uniformly) randomly positioned block of length $b$ within the observed time series of length $n$. Given a subsample of size $b$, we generate each resample in the following way: given $p \in [0, 1]$, we first select uniformly at random a data point in the subsample series; then, with probability $1 - p$, we append to our resample the next point in the subsample series, and with probability $p$ we (uniformly at random) select and append a new point of the subsample series, continuing until the resample has the desired length.

A strictly stationary stochastic process is one where, given $t_1, \dots, t_\ell$, the joint distribution of $X_{t_1}, \dots, X_{t_\ell}$ is the same as the joint distribution of $X_{t_1 + \tau}, \dots, X_{t_\ell + \tau}$ for all $\ell$ and $\tau$.

[Schematic: the observed series $X_1, X_2, \dots, X_n$, a contiguous block of length $b$ selected as the subsample, and a resample $X_1^*, \dots, X_n^*$ generated from it.]
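A minimal sketch of this resampling scheme (hypothetical helper names; the wrap-around at the end of the block and the default value of $p$ are assumptions, in the spirit of the stationary bootstrap):

```python
import numpy as np

def block_subsample(series, b, seed=0):
    """Select a uniformly random contiguous block of length b as the subsample."""
    rng = np.random.default_rng(seed)
    start = rng.integers(0, len(series) - b + 1)
    return series[start:start + b]

def stationary_resample(subsample, n, p=0.1, seed=0):
    """Generate one resample of length n: start at a uniformly random point of
    the subsample; at each step, with probability 1 - p take the next point of
    the subsample (wrapping around at its end), and with probability p jump to
    a new uniformly random point."""
    rng = np.random.default_rng(seed)
    b = len(subsample)
    idx = rng.integers(0, b)
    out = np.empty(n)
    for t in range(n):
        out[t] = subsample[idx]
        idx = rng.integers(0, b) if rng.random() < p else (idx + 1) % b
    return out
```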
Simulation: Time Series

The generated stationary time series: $X_1, \dots, X_n \in \mathbb{R}$, where $X_t = Z_t + Z_{t-1} + Z_{t-2} + Z_{t-3} + Z_{t-4}$ with $Z_t \sim \mathrm{Normal}(0, 1)$ and $n = 5{,}000$. The estimator is $\hat\theta = \sum_{t=1}^{n} X_t / n$, and $\xi(Q_n(P))$ is the standard deviation of $\sqrt{n}\,\hat\theta$ (with true value $\xi(Q_n(P)) \approx 5$). The following table shows the results for estimating $\xi(Q_n(P))$ using the bootstrap, the stationary bootstrap, BLB, and stationary BLB. This exploration of stationary BLB is intended as a proof of concept; additional investigation would help to further elucidate and perhaps improve the performance characteristics of this BLB extension.
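A short way to generate this series (a sketch; the helper name is illustrative), using the fact that each $X_t$ is the sum of five consecutive innovations:

```python
import numpy as np

def simulate_ma4(n=5_000, seed=0):
    """X_t = Z_t + Z_{t-1} + Z_{t-2} + Z_{t-3} + Z_{t-4}, with Z_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n + 4)
    # Each output is the sum of 5 consecutive innovations; the long-run SD of
    # sqrt(n) * mean(X) is therefore 5.
    return np.convolve(z, np.ones(5), mode="valid")
```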
Conclusion

The BLB procedure provides an alternative for assessing estimator quality that shares the favorable statistical properties (i.e., consistency and higher-order correctness) and generic applicability of the bootstrap. Its clear advantage is that it is well suited to large-scale data and to modern parallel and distributed computing architectures. Additionally, BLB is consistently more robust than the b out of n bootstrap and subsampling to the choice of subset size, and it does not require the use of analytical corrections.