Signal Processing and Networking for Big Data Applications
Lecture 10: Sublinear Algorithms
Zhu Han, University of Houston
Thanks to Professor Dan Wang for his slides
Outline
Motivations
Inequalities and classifications
Examples
Applications
Motivation for Sublinear-Time Algorithms
Massive datasets: world-wide web, online social networks, genome project, sales logs, census data, high-resolution images, scientific measurements
Long access time: communication bottleneck (slow connection), implicit data (an experiment per data point)
What Can We Hope For?
What can an algorithm compute if it reads only a sublinear portion of the data, or runs in sublinear time?
Some problems have exact deterministic solutions, but for most interesting problems algorithms must be approximate and randomized
Quality of approximation
Resources: number of queries, running time
Types of Approximation
Classical approximation: need to compute a value; the output should be close to the desired value. This is a classical notion; a familiar example is approximating the average or median by sampling.
Property testing: need to answer YES or NO. Intuition: only require correct answers on two sets of instances that are very different from each other.
Why is it useful?
Algorithms for big data used by big companies (ultra-fast randomized algorithms for approximate decision making)
Networking applications (counting and detecting patterns in small space)
Distributed computations (small sketches to reduce communication overhead)
Aggregate Knowledge: a startup doing streaming algorithms, acquired for $150M
Today: applications to soccer
Puzzles
5 1 8 11 9 7 6 3 4 2
Which number was missing?
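The classic one-pass answer (a sketch, assuming the puzzle is: the numbers 1..n arrive once each with exactly one missing, and we may use only O(1) extra space) compares the running sum against n(n+1)/2:

```python
def find_missing(stream, n):
    """Find the one missing number from a stream containing each of
    1..n except one, in a single pass with O(1) extra space."""
    expected = n * (n + 1) // 2   # sum of 1..n
    return expected - sum(stream)

# The slide's sequence, with n = 11:
print(find_missing([5, 1, 8, 11, 9, 7, 6, 3, 4, 2], 11))  # -> 10
```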
Puzzle #1
Puzzle #2 (Google interview question)
Answers to the puzzles
Each sample is kept with uniform probability, even when its index i exceeds the sample size s
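The "uniform probability even when i > s" remark matches reservoir sampling; here is a minimal sketch (the function name and interface are illustrative, not from the slides):

```python
import random

def reservoir_sample(stream, s):
    """Keep a uniform random sample of size s from a stream of unknown
    length, using O(s) memory: item i (1-indexed) replaces a random
    reservoir slot with probability s/i, which leaves every item
    equally likely to be in the final sample."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randrange(i)     # uniform in [0, i)
            if j < s:                   # happens with probability s/i
                reservoir[j] = item
    return reservoir
```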
Outline
Motivations
Inequalities and classifications
Examples
Applications
Inequalities
Markov's inequality
Chebyshev's inequality
Chernoff bound
Markov’s Inequality
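The slide content is an image; for reference, Markov's inequality states that for a nonnegative random variable X and any a > 0, P(X >= a) <= E[X]/a. A quick numerical sanity check (illustrative, not from the slides), using Exponential(1) samples so that E[X] = 1:

```python
import random

# Markov's inequality: for nonnegative X and a > 0, P(X >= a) <= E[X]/a.
# Empirical check with X ~ Exponential(1), so E[X] = 1.
random.seed(1)
xs = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(xs) / len(xs)
for a in (2.0, 5.0, 10.0):
    tail = sum(x >= a for x in xs) / len(xs)   # empirical P(X >= a)
    print(f"a={a}: P(X>=a) ~= {tail:.5f}, bound E[X]/a ~= {mean/a:.5f}")
```

For the exponential the true tail e^(-a) is far below the Markov bound 1/a, which shows how loose Markov can be when only the mean is known.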
Markov Inequality: Example
Markov Inequality: Example
Markov Inequality: Example
Markov + Union Bound: Example
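As a hedged illustration of how the union bound combines with Markov's inequality (not necessarily the slide's specific example): for k nonnegative variables, each with mean mu, the probability that any of them reaches a is at most k*mu/a:

```python
import random

# Union bound: P(A1 or ... or Ak) <= sum_i P(Ai).  Combined with Markov,
# the chance that ANY of k nonnegative variables with mean mu exceeds a
# is at most k * mu / a.
random.seed(3)
k, a, trials = 5, 20.0, 50_000
bad = 0
for _ in range(trials):
    xs = [random.expovariate(1.0) for _ in range(k)]  # each has mean 1
    bad += any(x >= a for x in xs)
rate = bad / trials
bound = k * 1.0 / a   # k * E[X] / a = 0.25
print(f"empirical rate {rate:.5f} <= bound {bound}")
```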
Chernoff bound
Chernoff bound (corollary)
Chernoff: Example
Chernoff: Example
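A minimal numerical illustration of the Chernoff-Hoeffding bound in its common additive form (an assumption about which variant the slides use): for n i.i.d. indicator variables with bias p, P(|mean - p| >= eps) <= 2*exp(-2*n*eps^2):

```python
import math
import random

# Chernoff-Hoeffding, additive form, for n i.i.d. coin flips of bias p:
#   P(|empirical mean - p| >= eps) <= 2 * exp(-2 * n * eps**2)
def chernoff_failure_bound(n, eps):
    return 2 * math.exp(-2 * n * eps * eps)

random.seed(2)
n, p, eps, trials = 1000, 0.3, 0.05, 2000
bad = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(n)) / n
    bad += abs(mean - p) >= eps
print(f"empirical failure rate {bad/trials:.4f} "
      f"<= bound {chernoff_failure_bound(n, eps):.4f}")
```

Note the exponential decay in n: doubling the sample size squares the failure probability, which is what makes small samples so powerful.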
Sublinear Algorithms Classification
Outline
Motivations
Inequalities and classifications
Examples
Applications
A Housewife Example
Assume there is a group of people who can be classified into different categories; one category is housewives. We want to know the percentage of housewives in the group, but the group is too large to examine every person. A simple approach is to sample a subset of the people and count how many of them are housewives. The question then arises: how many samples are enough?
A Housewife Example
The required number of samples is not a function of the data size!
A Housewife Example
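The "not a function of data size" point can be sketched with the Hoeffding-based sample size (an illustrative calculation; the slides' exact constants may differ): roughly ln(2/δ)/(2ε²) samples suffice, regardless of the population size:

```python
import math
import random

def samples_needed(eps, delta):
    """Hoeffding-based sample size: the sampled fraction is within eps
    of the true fraction with probability >= 1 - delta.  Depends only
    on eps and delta, NOT on the population size."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def estimate_fraction(population, is_target, eps=0.05, delta=0.01):
    """Estimate the fraction of the population satisfying is_target
    by sampling with replacement."""
    n = samples_needed(eps, delta)
    picks = [random.choice(population) for _ in range(n)]
    return sum(is_target(x) for x in picks) / n

# The same sample size works for 10**5 or 10**9 people:
print(samples_needed(0.05, 0.01))  # -> 1060

random.seed(4)
population = [1] * 300 + [0] * 700   # toy data: 30% are "housewives"
print(estimate_fraction(population, lambda x: x == 1))
```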
A Two-Cat Problem: Deterministic Algorithm
A Two Cat Problem
A Two-Cat Problem
Drop at floors 1, 3, 6, 10, 15, 21, 28, ...
The total number of drops is on the order of the square root of n, and the gap between two consecutive first-cat drops is also on the order of the square root of n
When you have two pieces of a resource, split the work evenly between them
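The even-split strategy can be sketched as follows (an illustrative implementation; `find_breaking_floor` and its interface are not from the slides): the first cat jumps in steps of about √n, the second cat scans the last gap linearly, giving about 2√n drops in the worst case:

```python
import math

def find_breaking_floor(n, breaks):
    """Search floors 1..n for the lowest floor where a cat 'breaks',
    using only two cats.  Phase 1 jumps ~sqrt(n) floors at a time with
    cat 1; phase 2 scans the last gap linearly with cat 2.  Each phase
    costs about sqrt(n) drops: with two copies of a resource, split
    the work evenly.  Returns (breaking floor or None, drops used)."""
    step = max(1, math.isqrt(n))
    drops = 0
    lo = 0
    f = step
    # Phase 1: coarse jumps with the first cat.
    while f <= n:
        drops += 1
        if breaks(f):
            break
        lo = f
        f += step
    hi = min(f, n)
    # Phase 2: linear scan of floors (lo, hi] with the second cat.
    answer = None
    for g in range(lo + 1, hi + 1):
        drops += 1
        if breaks(g):
            answer = g
            break
    return answer, drops

print(find_breaking_floor(100, lambda f: f >= 57))  # -> (57, 13)
```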
Outline
Motivations
Inequalities and classifications
Examples
Applications
Pricing and Sublinear Algorithms: Motivation
Overall picture:
Pricing and Sublinear Algorithms
Objectives:
Design a differentiating user-services model for computing profit gain based on different types of users
Keep the services model efficient in a big data context, with performance guarantees
Underlying philosophy: classify users first, then use the corresponding typical user behavior, instead of the actual user usage, as the approximation and estimation
Advantages:
Able to perform prediction
Fast computation
Saves storage capacity
Pricing and Sublinear Algorithms: Pricing Model
Differentiating user service model, simplified to two types of users in total (L = 2):
user type indicator; load-profiling expectation of the m-th type of user
N: number of users; L: number of user types
total bill gain; bill charge for a typical m-th type user
Pricing and Sublinear Algorithms: Pricing Model
Model the expense and the total net profit gain:
Xij: i-th user's energy usage at time instant j
ap: cost coefficient for buying energy at peak hours
ao: cost coefficient for buying energy at off-peak hours
Pricing and Sublinear Algorithms
Classify users to compute α and β
Algorithm quality:
Pricing and Sublinear Algorithms
Sublinear percentage calculation: "no need to examine every user in the computation"
The sample size is not a function of N; the complexity is O(1) in N
Pricing and Sublinear Algorithms
Sublinear classification/distribution comparison: "no need to examine every data point in the comparison"
An existing sublinear algorithm for the L2-distance test:
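One way such an L2 test can work (a sketch of the collision-based idea, not necessarily the exact algorithm the slides use) is to estimate ||p - q||² = ||p||² + ||q||² - 2⟨p, q⟩ from self- and cross-collision rates of the samples, without ever reading the full distributions:

```python
from itertools import combinations

def collision_rate(samples):
    """Fraction of unordered sample pairs that collide: an unbiased
    estimate of sum_i p_i^2, the self-collision probability."""
    pairs = hits = 0
    for a, b in combinations(samples, 2):
        pairs += 1
        hits += (a == b)
    return hits / pairs

def l2_distance_sq_estimate(samples_p, samples_q):
    """Estimate ||p - q||_2^2 = ||p||^2 + ||q||^2 - 2<p, q> from
    samples alone: self-collisions estimate the squared norms and
    cross-collisions estimate the inner product."""
    self_p = collision_rate(samples_p)
    self_q = collision_rate(samples_q)
    cross = sum(a == b for a in samples_p for b in samples_q)
    cross /= len(samples_p) * len(samples_q)
    return self_p + self_q - 2 * cross
```

Comparing the estimate against a threshold between ε²/2 and ε² then gives the accept/reject decision.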
Pricing and Sublinear Algorithms
Drawback: the confidence remains undetermined when the true L2-distance between the two tested distributions lies in the interval [ε²/2, ε²]
Proposed solution: run the existing algorithm twice
Pricing and Sublinear Algorithms
1. Run the traditional sublinear sampling and collect the labeled results as set {S1}
2. Run the traditional sublinear sampling with twice the error bound and collect the labeled results as set {S2}
3. Keep the users labeled 1 in {S1} and reject all those labeled 2
4. Keep the users labeled 2 in {S2} and reject all those labeled 1
5. Combine the retained labels into {S3}: if the same user is labeled 1 in {S1} and 2 in {S2}, decide the label uniformly at random
6. Output {S3} as the final classification result
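Steps 3-6 above, combining the two labeled sets, can be sketched as follows (the dictionary interface is an assumption for illustration):

```python
import random

def combine_labels(s1, s2):
    """Combine two sublinear classification passes: s1 and s2 map
    user -> label in {1, 2}, where s2 was run with twice the error
    bound.  Trust label-1 decisions from s1 and label-2 decisions
    from s2; break disagreements uniformly at random."""
    keep1 = {u for u, lab in s1.items() if lab == 1}  # step 3
    keep2 = {u for u, lab in s2.items() if lab == 2}  # step 4
    s3 = {}
    for u in keep1 | keep2:                           # step 5
        if u in keep1 and u in keep2:
            s3[u] = random.choice((1, 2))  # conflicting labels
        elif u in keep1:
            s3[u] = 1
        else:
            s3[u] = 2
    return s3                                         # step 6
```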
Pricing and Sublinear Algorithms
Overall algorithm flow:
Call AlgoPercent() to sample a small portion of the users for classification
Call AlgoDist() to sample a small portion of each user's distribution data points
Pricing and Sublinear Algorithms: Numerical Results
Bounded error vs. different parameterizations:
Estimation errors vs. the number of sub-sampled data points from the entire distribution
Performance on estimating α
Pricing and Sublinear Algorithms: Numerical Results
Profit gains vs. other pricing plans; reduced computation burdens:
Net profits from different pricing strategies
Reduced data amount vs. the overall confidence parameter
Pricing and Sublinear Algorithms: Numerical Results
Reduced computation burdens vs. varying parameters and error and confidence settings:
Reduced data amount vs. the overall error-bound parameter
Summary
Sublinear algorithms are much more efficient than linear algorithms for massive data sets
A good sampling strategy is needed
Many applications in graph theory
References
Slides from Dr. Ronitt Rubinfeld's website
Slides from Dr. Dana Ron's website
D. Wang, Y. Long, and F. Ergun, "A Layered Architecture for Delay Sensitive Sensor Networks"
D. Wang and Z. Han, Sublinear Algorithms for Big Data Applications, Springer, 2015
Thanks