Zhu Han University of Houston Thanks for Professor Dan Wang’s slides


Signal Processing and Networking for Big Data Applications Lecture 10: Sublinear Algorithm Zhu Han University of Houston Thanks for Professor Dan Wang’s slides

Outline: Motivations; Inequalities and classifications; Examples; Applications

Motivation for Sublinear-Time Algorithms Massive datasets: world-wide web, online social networks, genome project, sales logs, census data, high-resolution images, scientific measurements. Long access time: communication bottleneck (slow connection), implicit data (an experiment per data point).

What Can We Hope For? What can an algorithm compute if it reads only a sublinear portion of the data, or runs in sublinear time? Some problems have exact deterministic solutions. For most interesting problems, algorithms must be approximate and randomized. Two measures matter: quality of approximation, and resources (number of queries, running time).

Types of Approximation Classical approximation: need to compute a value; the output should be close to the desired value; example: average. Property testing: need to answer YES or NO. Intuition: only require correct answers on two sets of instances that are very different from each other. In cases when we need to compute some value, it is clear what we mean by "approximation": the output should be close to the desired value. This is a classical notion, and everybody has heard of approximating the average and median values by sampling.

Why is it useful? Algorithms for big data used by big companies (ultra-fast randomized algorithms for approximate decision making). Networking applications (counting and detecting patterns in small space). Distributed computations (small sketches to reduce communication overheads). Aggregate Knowledge: startup doing streaming algorithms, acquired for $150M. Today: Applications to soccer

Puzzles 5 1 8 11 9 7 6 3 4 2

Which number was missing?
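The puzzle statement itself is not in the transcript, so the following is only a sketch under one assumption: the ten numbers listed are drawn from 1 through 11 (the largest value shown) with exactly one value absent. Under that assumption a single pass with constant memory is enough, because the missing value is the difference between the expected sum and the observed sum. The function name `missing_number` is hypothetical.

```python
def missing_number(seen, n):
    """Return the one value from 1..n that is absent from `seen`.

    Assumes `seen` holds n - 1 distinct integers from 1..n.
    Uses one pass and O(1) extra memory.
    """
    expected_total = n * (n + 1) // 2  # sum of 1..n
    return expected_total - sum(seen)


numbers = [5, 1, 8, 11, 9, 7, 6, 3, 4, 2]
print(missing_number(numbers, 11))  # prints 10
```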

Puzzle #1

Puzzle #2 (Google interview question)

Answers to the puzzles Uniform probability for each sample, even if i > s
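The remark about uniform probability when i > s reads like the usual reservoir-sampling argument. Assuming (the puzzle text is not preserved here) that Puzzle #2 asks for s uniform samples from a stream of unknown length, a minimal sketch is given below. The names `reservoir_sample`, `stream`, and `rng` are illustrative and not from the slides.

```python
import random


def reservoir_sample(stream, s, rng):
    """Keep a uniform random sample of s items from a stream of unknown length.

    The i-th item (1-based) is kept with probability s / i once i > s, which
    leaves every item seen so far in the reservoir with the same probability.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i)        # uniform integer in 0..i-1
            if j < s:                   # happens with probability s / i
                reservoir[j] = item     # evict a uniformly chosen slot
    return reservoir


sample = reservoir_sample(range(1000), 5, random.Random(0))
```

Only s items are stored at any time, so the memory use does not depend on the stream length.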


Inequalities Markov inequality Chebyshev inequality Chernoff bound
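The slide bodies for these three bounds did not survive in the transcript, so the statements below are the standard textbook forms rather than the slides' exact wording. Here $X$ is a random variable with mean $\mu$ and standard deviation $\sigma$; for the Chernoff bound, $X$ is assumed to be a sum of independent random variables taking values in $[0,1]$, and $0 < \delta \le 1$.

```latex
% Markov: X nonnegative, a > 0
\Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a}

% Chebyshev: any X with finite variance, k > 0
\Pr\bigl[\,|X - \mu| \ge k\sigma\,\bigr] \;\le\; \frac{1}{k^{2}}

% Chernoff (multiplicative form): X = \sum_i X_i, X_i \in [0,1] independent
\Pr[X \ge (1+\delta)\mu] \;\le\; \exp\!\left(-\frac{\delta^{2}\mu}{3}\right),
\qquad
\Pr[X \le (1-\delta)\mu] \;\le\; \exp\!\left(-\frac{\delta^{2}\mu}{2}\right)
```

Markov needs only the mean, Chebyshev also uses the variance, and Chernoff uses independence to get exponentially small tails, which is what makes sample sizes independent of the data size possible.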

Markov’s Inequality

Markov Inequality: Example

Markov Inequality: Example

Markov Inequality: Example

Markov + Union Bound: Example

Chernoff bound

Chernoff bound (corollary)

Chernoff: Example

Chernoff: Example

Sublinear Algorithms Classification


A Housewife Example Assume that there is a group of people who can be classified into different categories. One category is housewives. We want to know the percentage of housewives in this group, but the group is too big to examine every person. A simple way is to sample a subset of people and see how many of them belong to the housewife category. This is where the question arises: how many samples are enough?

A Housewife Example The number of samples needed is not a function of the data size!
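The slide's own formula is not preserved, so the sketch below substitutes the standard Hoeffding bound as an assumption: m >= ln(2/δ) / (2ε²) uniform samples give an estimate within ε of the true fraction with probability at least 1 - δ. The helper names `samples_needed` and `estimate_fraction` are hypothetical. Note that the group size never appears in the sample-size formula.

```python
import math
import random


def samples_needed(eps, delta):
    """Hoeffding sample size: error at most eps with probability >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_fraction(population, is_member, eps, delta, rng):
    """Estimate the fraction of `population` satisfying `is_member` by sampling.

    Samples uniformly with replacement; the number of samples depends only
    on eps and delta, not on len(population).
    """
    m = samples_needed(eps, delta)
    hits = sum(1 for _ in range(m) if is_member(rng.choice(population)))
    return hits / m


print(samples_needed(0.05, 0.05))  # prints 738
```

For example, a group of one hundred thousand people and a group of one hundred million people both need the same 738 samples for ε = δ = 0.05.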

A Housewife Example

A Two Cat Problem Deterministic Algorithm

A Two Cat Problem

A Two Cat Problem Sample points: 1, 3, 6, 10, 15, 21, 28, ... The total number of samples is on the order of the square root of n, and the gap between two consecutive samples is also on the order of the square root of n. When you have two pieces of resources, split them evenly.
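The problem statement is missing from the transcript. Assuming it is the classic two-probe threshold search (find the highest of n floors from which a cat still survives, when only two cats are available), the square-root idea can be sketched as follows. For simplicity this sketch uses a fixed jump of about √n rather than the slide's growing gaps 1, 3, 6, 10, ...; both give O(√n) probes. The names `find_highest_safe_floor` and `survives` are hypothetical.

```python
import math


def find_highest_safe_floor(n, survives):
    """Return (highest safe floor in 0..n, number of probes used).

    `survives(f)` must be monotone: True up to some floor t, False above it.
    At most two probes return False (one per cat).
    """
    step = math.isqrt(n)
    if step * step < n:
        step += 1                       # step = ceil(sqrt(n))
    probes = 0
    low, high = 0, n + 1                # low is known safe, high is known unsafe
    floor = step
    while floor <= n:                   # first cat: jump by `step`
        probes += 1
        if survives(floor):
            low = floor
            floor += step
        else:
            high = floor
            break
    for f in range(low + 1, min(high, n + 1)):   # second cat: linear scan
        probes += 1
        if not survives(f):
            return f - 1, probes
        low = f
    return low, probes


floor, probes = find_highest_safe_floor(100, lambda f: f <= 73)
```

The first phase uses at most about √n probes and the second at most about √n more, which is the "split the two resources evenly" intuition.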


Pricing and Sublinear Algorithms: Motivation Overall picture:

Pricing and Sublinear Algorithms Objectives: Design a differentiated user-service model for computing profit gain based on different types of users. Enable the service model to stay efficient in a big-data context, with performance guarantees. Underlying philosophy: classify users first, then use the corresponding typical user behavior, instead of actual user usage, as the approximation and estimate. Advantages: able to perform prediction; fast computation; saves storage capacity.

Pricing and Sublinear Algorithms: Pricing Model Differentiating user service model: Simplify into 2 types of users in total, i.e., L=2: user type indicator load profiling expectation of m-th type user N: # of users L: # of user types total bill gain bill charge for typical m-th type user

Pricing and Sublinear Algorithms: Pricing Model Model the expense: Total net profit gain: Xij: the i-th user's energy usage at time instant j; ap: cost coefficient for buying energy at peak hours; ao: cost coefficient for buying energy at off-peak hours

Pricing and Sublinear Algorithms Classify users to compute α and β: Algorithm quality:

Pricing and Sublinear Algorithms Sublinear on percentage calculation: “no need of every user for the computation” Not a function of N, complexity O(1)

Pricing and Sublinear Algorithms Sublinear on classification/distribution comparison: “no need of every data point for the comparison” Existing sublinear algorithm for L2-distance test:
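The algorithm shown on the slide is not preserved in the transcript. As an illustration only, the sketch below follows the general collision-counting approach to L2 closeness testing: self-collisions within samples estimate the squared norms of the two distributions, cross-collisions estimate their inner product, and the difference estimates the squared L2 distance. All function names, the sample size, the number of rounds, and the ε²/2 threshold are assumptions for the sketch, not tuned or sourced constants.

```python
import random
from collections import Counter


def self_collisions(samples):
    """Number of pairs i < j with samples[i] == samples[j]."""
    counts = Counter(samples)
    return sum(c * (c - 1) // 2 for c in counts.values())


def cross_collisions(a, b):
    """Number of pairs (i, j) with a[i] == b[j]."""
    ca, cb = Counter(a), Counter(b)
    return sum(ca[v] * cb[v] for v in ca)


def estimate_l2_squared(draw_p, draw_q, m, rng):
    """Estimate ||p - q||_2^2 from 2m draws of each distribution (m >= 2)."""
    fp = [draw_p(rng) for _ in range(m)]
    fq = [draw_q(rng) for _ in range(m)]
    qp = [draw_p(rng) for _ in range(m)]   # fresh samples for the cross term
    qq = [draw_q(rng) for _ in range(m)]
    r = (2 * m / (m - 1)) * (self_collisions(fp) + self_collisions(fq))
    s = 2 * cross_collisions(qp, qq)
    return (r - s) / (m * m)


def l2_close(draw_p, draw_q, m, eps, rng, rounds=9):
    """Accept (True) unless a majority of rounds estimate a distance above eps^2 / 2."""
    rejects = sum(
        estimate_l2_squared(draw_p, draw_q, m, rng) > eps ** 2 / 2
        for _ in range(rounds)
    )
    return 2 * rejects <= rounds
```

The test touches only the sampled points, so its cost depends on m and the number of rounds rather than on how many data points each distribution was built from.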

Pricing and Sublinear Algorithms Drawback: the confidence remains undetermined when the L2-distance between the two tested distributions truly lies in the interval [ε²/2, ε²]. Proposed solution: run the existing algorithm twice.

Pricing and Sublinear Algorithms
1> Employ the traditional sublinear sampling and obtain labeled results as set {S1}
2> Employ the traditional sublinear sampling with twice the error bound and obtain labeled results as set {S2}
3> Keep the users labeled 1 in {S1} and reject all those labeled 2
4> Keep the users labeled 2 in {S2} and reject all those labeled 1
5> Combine the retained labels into {S3}: if the same user is labeled both 1 in {S1} and 2 in {S2}, his/her label is decided randomly
6> Output {S3} as the final classification results
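The combination in steps 3 to 6 can be sketched as below, taking the two labelings as dictionaries from user id to label (1 or 2). The slide does not say what happens to a user who is labeled 2 in {S1} and 1 in {S2}, so this sketch leaves such users unlabeled (`None`); that choice, and the name `combine_labels`, are assumptions.

```python
import random


def combine_labels(s1, s2, rng):
    """Merge two labelings: trust label 1 from s1 and label 2 from s2.

    s1, s2: dicts mapping user id -> label (1 or 2).
    Returns a dict mapping user id -> 1, 2, or None (assumption: None when
    neither retained label applies, a case the slide does not cover).
    """
    combined = {}
    for user in sorted(s1.keys() | s2.keys()):
        keep_1 = s1.get(user) == 1        # step 3: retain only label 1 from S1
        keep_2 = s2.get(user) == 2        # step 4: retain only label 2 from S2
        if keep_1 and keep_2:
            combined[user] = rng.choice([1, 2])   # step 5: conflict, pick randomly
        elif keep_1:
            combined[user] = 1
        elif keep_2:
            combined[user] = 2
        else:
            combined[user] = None
    return combined


s3 = combine_labels({"a": 1, "b": 2}, {"a": 1, "b": 2}, random.Random(0))
```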

Pricing and Sublinear Algorithms Overall algorithm flow: Call AlgoPercent() to sample a small portion of users for classification Call AlgoDist() to sample a small portion of each user’s distribution data points.

Pricing and Sublinear Algorithms: Numerical Results Bounded error vs. different parameterizations: Estimation errors vs. number of sub-sampling data points from the entire distribution Performance on estimating α

Pricing and Sublinear Algorithms: Numerical Results Profit gains vs. other pricing plans; reduced computation burdens: Net profits from different pricing strategies Reduced data amount vs. overall confidence parameter

Pricing and Sublinear Algorithms: Numerical Results reduced computation burdens vs. varying parameters and error & confidence settings: Reduced data amount vs. overall error bound parameter

Summary Sublinear algorithms are much more efficient than linear algorithms for massive data sets. A good sampling strategy is needed. Many applications in graph theory.

References Slides from Dr. Ronitt Rubinfeld’s website, http://people.csail.mit.edu/ronitt/sublinear.html. Slides from Dr. Dana Ron’s website, http://www.eng.tau.ac.il/~danar/talks.html. D. Wong, Y. Long, F. Ergun, “A layered architecture for delay sensitive sensor networks,” http://www.cse.psu.edu/~sxr48/. Dan Wang and Zhu Han, “Sublinear Algorithms for Big Data Applications,” Springer, 2015.

Thanks