Signal Processing and Networking for Big Data Applications
Lecture 10: Sublinear Algorithms
Zhu Han, University of Houston
Thanks to Professor Dan Wang for his slides
Outline: Motivations, Inequalities and classifications, Examples, Applications
Motivation for Sublinear-Time Algorithms
Massive datasets: world-wide web, online social networks, genome project, sales logs, census data, high-resolution images, scientific measurements
Long access time: communication bottleneck (slow connection), implicit data (an experiment per data point)
What Can We Hope For?
What can an algorithm compute if it reads only a sublinear portion of the data, or runs in sublinear time?
Some problems have exact deterministic solutions; for most interesting problems, algorithms must be approximate and randomized.
Quality of approximation; resources: number of queries, running time
Types of Approximation
Classical approximation: need to compute a value, and the output should be close to the desired value. This is the familiar notion, e.g., approximating the average or median by sampling.
Property testing: need to answer YES or NO. Intuition: only require correct answers on two sets of instances that are very different from each other.
Why Is It Useful?
Algorithms for big data used by big companies (ultra-fast randomized algorithms for approximate decision making)
Networking applications (counting and detecting patterns in small space)
Distributed computations (small sketches to reduce communication overheads)
Aggregate Knowledge: a startup doing streaming algorithms, acquired for $150M
Today: Applications to soccer
Puzzles 5 1 8 11 9 7 6 3 4 2
Which number was missing?
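One standard O(1)-space answer to this puzzle (the trick itself is not spelled out on the slide): if the stream holds the numbers 1..n with exactly one value left out, keep a running sum and subtract it from n(n+1)/2.

```python
def find_missing(stream, n):
    """Return the one number in 1..n absent from the stream.

    Uses O(1) extra space: compare the running sum to n*(n+1)//2.
    """
    return n * (n + 1) // 2 - sum(stream)

# The puzzle's stream: the numbers 1..11 with one value missing.
print(find_missing([5, 1, 8, 11, 9, 7, 6, 3, 4, 2], 11))  # → 10
```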
Puzzle #1
Puzzle #2 (Google interview question)
Answers to the puzzles
Each sample is kept with uniform probability, even when the stream index i exceeds the sample size s
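The answer described above is reservoir sampling: item i enters the reservoir with probability s/i, which keeps every stream item in the sample with equal probability. A minimal sketch:

```python
import random

def reservoir_sample(stream, s):
    """Keep a uniform random sample of size s from a stream of unknown length.

    Item i (1-indexed) replaces a random reservoir slot with probability s/i,
    so every item ends up in the sample with the same probability s/n.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)
        else:
            j = random.randrange(i)  # uniform in 0 .. i-1
            if j < s:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(10**6), 10)` scans the stream once and never stores more than 10 items.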
Outline: Motivations, Inequalities and classifications, Examples, Applications
Inequalities Markov inequality Chebyshev inequality Chernoff bound
Markov’s Inequality
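The formula on this slide did not survive extraction; the standard statement, for a non-negative random variable $X$ and any $a > 0$, is:

```latex
\Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a}
```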
Markov Inequality: Example
Markov Inequality: Example
Markov Inequality: Example
Markov + Union Bound: Example
Chernoff bound
Chernoff bound (corollary)
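The bound itself was lost in extraction; a commonly used multiplicative form, for $X$ a sum of independent 0/1 random variables with mean $\mu = \mathbb{E}[X]$ and $0 < \delta < 1$, is:

```latex
\Pr\bigl[\,\lvert X - \mu\rvert \ge \delta\mu\,\bigr] \;\le\; 2\,e^{-\mu\delta^{2}/3}
```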
Chernoff: Example
Chernoff: Example
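As a sanity check (not taken from the slides), one can simulate sums of fair coin flips and confirm that the empirical probability of a large deviation sits below the multiplicative bound 2·exp(−μδ²/3):

```python
import math
import random

def chernoff_check(n=1000, delta=0.2, trials=2000, seed=1):
    """Empirically compare Pr[|X - mu| >= delta*mu] against the Chernoff
    bound 2*exp(-mu*delta^2/3), for X = sum of n fair coin flips (mu = n/2)."""
    rng = random.Random(seed)
    mu = n / 2
    deviations = 0
    for _ in range(trials):
        x = sum(rng.random() < 0.5 for _ in range(n))
        if abs(x - mu) >= delta * mu:
            deviations += 1
    empirical = deviations / trials
    bound = 2 * math.exp(-mu * delta * delta / 3)
    return empirical, bound
```

With n = 1000 and δ = 0.2, the bound is about 0.0025 while the empirical frequency is essentially zero, illustrating that the bound holds (though it is not tight).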
Sublinear Algorithms Classification
Outline: Motivations, Inequalities and classifications, Examples, Applications
A Housewife Example
Assume a group of people who can be classified into different categories; one category is housewives. We want to know the percentage of housewives in the group, but the group is too big to examine every person. A simple approach is to sample a subset of people and count how many of them are housewives. The question then arises: how many samples are enough?
A Housewife Example
The required number of samples is not a function of the data size!
A Housewife Example
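A minimal sketch of the idea. The sample-size formula used here is the standard Hoeffding bound s = ln(2/δ)/(2ε²), which may differ in constants from the exact expression on the slide; the key point is that s depends only on the error ε and confidence δ, not on the population size.

```python
import math
import random

def estimate_fraction(population, predicate, eps=0.05, delta=0.05, seed=0):
    """Estimate the fraction of items satisfying `predicate` to within
    additive error eps, with probability >= 1 - delta.

    The sample size s = ln(2/delta) / (2*eps^2) is independent of
    len(population) -- the estimate is "not a function of data size".
    """
    s = math.ceil(math.log(2 / delta) / (2 * eps * eps))
    rng = random.Random(seed)
    sample = [population[rng.randrange(len(population))] for _ in range(s)]
    return sum(predicate(x) for x in sample) / s
```

For ε = δ = 0.05, this draws about 738 samples whether the group has a thousand people or a billion.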
A Two Cat Problem Deterministic Algorithm
A Two Cat Problem
A Two Cat Problem 1, 3, 6, 10, 15, 21, 28
With n floors, drop the first cat every √n floors; once it dies, search the remaining interval linearly with the second cat. The number of big jumps is √n and the gap between two consecutive probes is also √n, so the total is about 2√n drops. When you have two pieces of resource, split the work evenly between them.
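The √n strategy above can be sketched as follows (`breaks_at` is a hypothetical oracle telling whether a cat dropped from a given floor dies):

```python
import math

def two_cat_search(n, breaks_at):
    """Find the lowest fatal floor among 1..n using two cats and
    about 2*sqrt(n) drops; returns (critical_floor, number_of_drops).

    Returns n + 1 if no floor is fatal.
    """
    step = math.isqrt(n)
    drops = 0
    prev = 0
    floor = step
    # First cat: jump up in steps of sqrt(n) until it dies.
    while floor <= n:
        drops += 1
        if breaks_at(floor):
            break
        prev = floor
        floor += step
    else:
        floor = n + 1  # first cat survived every probed floor
    # Second cat: scan the remaining interval of length < sqrt(n) linearly.
    for f in range(prev + 1, min(floor, n + 1)):
        drops += 1
        if breaks_at(f):
            return f, drops
    return min(floor, n + 1), drops
```

With n = 100 and the critical floor at 57, the first cat uses 6 drops (10, 20, ..., 60) and the second cat at most 9 more, well under the 2√n ≈ 20 budget.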
Outline: Motivations, Inequalities and classifications, Examples, Applications
Pricing and Sublinear Algorithms: Motivation Overall picture:
Pricing and Sublinear Algorithms
Objectives: design a differentiated user-service model for profit-gain computing based on different types of users; keep the service model efficient in the big-data context, with performance guarantees
Underlying philosophy: classify users first, then use the corresponding typical user behavior, instead of the actual user usage, as the approximation and estimate
Advantages: enables prediction, fast computation, saves storage capacity
Pricing and Sublinear Algorithms: Pricing Model
Differentiated user service model, simplified to two user types in total, i.e., L = 2
N: number of users; L: number of user types
Other quantities: user type indicator, load-profiling expectation of the m-th type user, total bill gain, bill charge for a typical m-th type user
Pricing and Sublinear Algorithms: Pricing Model
Model the expense and the total net profit gain
X_ij: i-th user's energy usage at time instant j
a_p: cost coefficient for buying energy at peak hours
a_o: cost coefficient for buying energy at off-peak hours
Pricing and Sublinear Algorithms Classify users to compute α and β: Algorithm quality:
Pricing and Sublinear Algorithms
Sublinear percentage calculation: "no need to examine every user for the computation"
The sample size is not a function of N; complexity O(1)
Pricing and Sublinear Algorithms
Sublinear classification/distribution comparison: "no need of every data point for the comparison"
An existing sublinear algorithm for the L2-distance test:
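The algorithm box for the L2-distance test was lost in extraction. The classic sublinear L2 closeness tester estimates ||p − q||₂² = ||p||² + ||q||² − 2⟨p, q⟩ from collision counts among samples; the sketch below follows that collision-statistics idea and is an assumption about which tester the slide refers to:

```python
import random
from itertools import combinations

def l2_distance_squared_estimate(samples_p, samples_q):
    """Estimate ||p - q||_2^2 from m samples of each distribution.

    ||p||^2 and ||q||^2 are estimated from self-collision frequencies,
    and the inner product <p, q> from cross-collision frequencies.
    """
    m = len(samples_p)
    assert len(samples_q) == m and m >= 2
    pairs = m * (m - 1) / 2
    self_p = sum(a == b for a, b in combinations(samples_p, 2)) / pairs
    self_q = sum(a == b for a, b in combinations(samples_q, 2)) / pairs
    cross = sum(a == b for a in samples_p for b in samples_q) / (m * m)
    return self_p + self_q - 2 * cross
```

Two distributions with disjoint supports give an estimate of 2, while samples drawn from the same distribution give an estimate near 0; the number of samples needed depends on the error bound, not on the support size.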
Pricing and Sublinear Algorithms
Drawback: the confidence remains undetermined when the true L2 distance between the two tested distributions falls in the interval [ε²/2, ε²]
Proposed solution: run the existing algorithm twice
Pricing and Sublinear Algorithms
1> Run the traditional sublinear sampling and obtain labeled results as set {S1}
2> Run the traditional sublinear sampling with twice the error bound and obtain labeled results as set {S2}
3> Keep the users labeled 1 in {S1} and reject all those labeled 2
4> Keep the users labeled 2 in {S2} and reject all those labeled 1
5> Combine the retained labels into {S3}: if the same user is labeled 1 in {S1} and 2 in {S2}, decide the label at random
6> Output {S3} as the final classification result
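A sketch of steps 1–6, where `classify(user, eps)` stands in for the sublinear sampling test with error bound eps (its name and signature are assumptions, as is the fallback when neither run retains a label):

```python
import random

def two_pass_labels(users, classify, eps, seed=0):
    """Combine two runs of a 1-vs-2 classifier as in steps 1-6:
    trust label 1 from the tight run, label 2 from the doubled-bound run,
    and break conflicting labels at random."""
    rng = random.Random(seed)
    s1 = {u: classify(u, eps) for u in users}      # step 1: error bound eps
    s2 = {u: classify(u, 2 * eps) for u in users}  # step 2: doubled bound
    s3 = {}
    for u in users:
        keep1 = s1[u] == 1   # step 3: retain label 1 from {S1}
        keep2 = s2[u] == 2   # step 4: retain label 2 from {S2}
        if keep1 and keep2:  # step 5: conflict -> random decision
            s3[u] = rng.choice([1, 2])
        elif keep1:
            s3[u] = 1
        elif keep2:
            s3[u] = 2
        else:
            s3[u] = s1[u]    # assumed fallback: use the tight-bound label
    return s3                # step 6: final classification
```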
Pricing and Sublinear Algorithms Overall algorithm flow: Call AlgoPercent() to sample a small portion of users for classification Call AlgoDist() to sample a small portion of each user’s distribution data points.
Pricing and Sublinear Algorithms: Numerical Results Bounded error vs. different parameterizations: Estimation errors vs. number of sub-sampling data points from the entire distribution Performance on estimating α
Pricing and Sublinear Algorithms: Numerical Results Profit gains vs. other pricing plans; reduced computation burdens: Net profits from different pricing strategies Reduced data amount vs. overall confidence parameter
Pricing and Sublinear Algorithms: Numerical Results
Reduced computation burdens vs. varying parameters and error & confidence settings: reduced data amount vs. overall error-bound parameter
Summary
Sublinear algorithms are much more efficient than linear algorithms for massive data sets
A good sampling strategy is needed
Many applications in graph theory
References
Slides from Dr. Ronitt Rubinfeld's website, http://people.csail.mit.edu/ronitt/sublinear.html
Slides from Dr. Dana Ron's website, http://www.eng.tau.ac.il/~danar/talks.html
D. Wong, Y. Long, and F. Ergun, "A layered architecture for delay sensitive sensor networks," http://www.cse.psu.edu/~sxr48/
Dan Wang and Zhu Han, "Sublinear Algorithms for Big Data Applications," Springer, 2015
Thanks