Signal Processing and Networking for Big Data Applications
Lecture 10: Sublinear Algorithms
Zhu Han, University of Houston
Thanks to Professor Dan Wang for his slides
Outline: Motivations, Inequalities and classifications, Examples, Applications
Motivation for Sublinear-Time Algorithms
Massive datasets: world-wide web, online social networks, genome project, sales logs, census data, high-resolution images, scientific measurements
Long access time: communication bottleneck (slow connection), implicit data (an experiment per data point)
What Can We Hope For?
What can an algorithm compute if it reads only a sublinear portion of the data, or runs in sublinear time?
Some problems have exact deterministic solutions; for most interesting problems, algorithms must be approximate and randomized.
Quality of approximation; resources: number of queries, running time
Types of Approximation
Classical approximation: need to compute a value, and the output should be close to the desired value. This is the familiar notion, e.g., approximating the average or median by sampling.
Property testing: need to answer YES or NO. Intuition: only require correct answers on two sets of instances that are very different from each other.
Why Is It Useful?
Algorithms for big data used by big companies (ultra-fast randomized algorithms for approximate decision making)
Networking applications (counting and detecting patterns in small space)
Distributed computations (small sketches to reduce communication overheads)
Aggregate Knowledge: a startup doing streaming algorithms, acquired for $150M
Today: Applications to soccer
Puzzles 5 1 8 11 9 7 6 3 4 2
Which number was missing?
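One standard O(1)-space answer to this puzzle (the trick itself is not spelled out on the slide): if the stream holds the numbers 1..n with exactly one value left out, keep a running sum and subtract it from n(n+1)/2.

```python
def find_missing(stream, n):
    """Return the one number in 1..n absent from the stream.

    Uses O(1) extra space: compare the running sum to n*(n+1)//2.
    """
    return n * (n + 1) // 2 - sum(stream)

# The puzzle's stream: the numbers 1..11 with one value missing.
print(find_missing([5, 1, 8, 11, 9, 7, 6, 3, 4, 2], 11))  # → 10
```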
Puzzle #1
Puzzle #2 (Google interview question)
Answers to the puzzles
Each sample is kept with uniform probability, even when the stream index i exceeds the sample size s
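The answer described above is reservoir sampling: item i enters the reservoir with probability s/i, which keeps every stream item in the sample with equal probability. A minimal sketch:

```python
import random

def reservoir_sample(stream, s):
    """Keep a uniform random sample of size s from a stream of unknown length.

    Item i (1-indexed) replaces a random reservoir slot with probability s/i,
    so every item ends up in the sample with the same probability s/n.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)
        else:
            j = random.randrange(i)  # uniform in 0 .. i-1
            if j < s:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(10**6), 10)` scans the stream once and never stores more than 10 items.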
Outline: Motivations, Inequalities and classifications, Examples, Applications
Inequalities Markov inequality Chebyshev inequality Chernoff bound
Markov’s Inequality
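The formula on this slide did not survive extraction; the standard statement, for a non-negative random variable $X$ and any $a > 0$, is:

```latex
\Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a}
```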
Markov Inequality: Example
Markov Inequality: Example
Markov Inequality: Example
Markov + Union Bound: Example
Chernoff bound
Chernoff bound (corollary)
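The bound itself was lost in extraction; a commonly used multiplicative form, for $X$ a sum of independent 0/1 random variables with mean $\mu = \mathbb{E}[X]$ and $0 < \delta < 1$, is:

```latex
\Pr\bigl[\,\lvert X - \mu\rvert \ge \delta\mu\,\bigr] \;\le\; 2\,e^{-\mu\delta^{2}/3}
```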
Chernoff: Example
Chernoff: Example
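As a sanity check (not taken from the slides), one can simulate sums of fair coin flips and confirm that the empirical probability of a large deviation sits below the multiplicative bound 2·exp(−μδ²/3):

```python
import math
import random

def chernoff_check(n=1000, delta=0.2, trials=2000, seed=1):
    """Empirically compare Pr[|X - mu| >= delta*mu] against the Chernoff
    bound 2*exp(-mu*delta^2/3), for X = sum of n fair coin flips (mu = n/2)."""
    rng = random.Random(seed)
    mu = n / 2
    deviations = 0
    for _ in range(trials):
        x = sum(rng.random() < 0.5 for _ in range(n))
        if abs(x - mu) >= delta * mu:
            deviations += 1
    empirical = deviations / trials
    bound = 2 * math.exp(-mu * delta * delta / 3)
    return empirical, bound
```

With n = 1000 and δ = 0.2, the bound is about 0.0025 while the empirical frequency is essentially zero, illustrating that the bound holds (though it is not tight).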
Sublinear Algorithms Classification
Outline: Motivations, Inequalities and classifications, Examples, Applications
A Housewife Example
Assume a group of people who can be classified into different categories; one category is housewives. We want to know the percentage of housewives in the group, but the group is too big to examine every person. A simple approach is to sample a subset of people and count how many of them are housewives. The question then arises: how many samples are enough?
A Housewife Example
The required number of samples is not a function of the data size!
A Housewife Example
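A minimal sketch of the idea. The sample-size formula used here is the standard Hoeffding bound s = ln(2/δ)/(2ε²), which may differ in constants from the exact expression on the slide; the key point is that s depends only on the error ε and confidence δ, not on the population size.

```python
import math
import random

def estimate_fraction(population, predicate, eps=0.05, delta=0.05, seed=0):
    """Estimate the fraction of items satisfying `predicate` to within
    additive error eps, with probability >= 1 - delta.

    The sample size s = ln(2/delta) / (2*eps^2) is independent of
    len(population) -- the estimate is "not a function of data size".
    """
    s = math.ceil(math.log(2 / delta) / (2 * eps * eps))
    rng = random.Random(seed)
    sample = [population[rng.randrange(len(population))] for _ in range(s)]
    return sum(predicate(x) for x in sample) / s
```

For ε = δ = 0.05, this draws about 738 samples whether the group has a thousand people or a billion.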
A Two Cat Problem Deterministic Algorithm
A Two Cat Problem
A Two Cat Problem 1, 3, 6, 10, 15, 21, 28
With n floors, drop the first cat every √n floors; once it dies, search the remaining interval linearly with the second cat. The number of big jumps is √n and the gap between two consecutive probes is also √n, so the total is about 2√n drops. When you have two pieces of resource, split the work evenly between them.
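The √n strategy above can be sketched as follows (`breaks_at` is a hypothetical oracle telling whether a cat dropped from a given floor dies):

```python
import math

def two_cat_search(n, breaks_at):
    """Find the lowest fatal floor among 1..n using two cats and
    about 2*sqrt(n) drops; returns (critical_floor, number_of_drops).

    Returns n + 1 if no floor is fatal.
    """
    step = math.isqrt(n)
    drops = 0
    prev = 0
    floor = step
    # First cat: jump up in steps of sqrt(n) until it dies.
    while floor <= n:
        drops += 1
        if breaks_at(floor):
            break
        prev = floor
        floor += step
    else:
        floor = n + 1  # first cat survived every probed floor
    # Second cat: scan the remaining interval of length < sqrt(n) linearly.
    for f in range(prev + 1, min(floor, n + 1)):
        drops += 1
        if breaks_at(f):
            return f, drops
    return min(floor, n + 1), drops
```

With n = 100 and the critical floor at 57, the first cat uses 6 drops (10, 20, ..., 60) and the second cat at most 9 more, well under the 2√n ≈ 20 budget.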
Outline: Motivations, Inequalities and classifications, Examples, Applications
Pricing and Sublinear Algorithms: Motivation Overall picture:
Pricing and Sublinear Algorithms
Objectives: design a differentiated user-service model for profit-gain computing based on different types of users; keep the service model efficient in the big-data context, with performance guarantees
Underlying philosophy: classify users first, then use the corresponding typical user behavior, instead of the actual user usage, as the approximation and estimate
Advantages: enables prediction, fast computation, saves storage capacity
Pricing and Sublinear Algorithms: Pricing Model
Differentiated user service model, simplified to two user types in total, i.e., L = 2
N: number of users; L: number of user types
Other quantities: user type indicator, load-profiling expectation of the m-th type user, total bill gain, bill charge for a typical m-th type user
Pricing and Sublinear Algorithms: Pricing Model
Model the expense and the total net profit gain
X_ij: i-th user's energy usage at time instant j
a_p: cost coefficient for buying energy at peak hours
a_o: cost coefficient for buying energy at off-peak hours
Pricing and Sublinear Algorithms Classify users to compute α and β: Algorithm quality:
Pricing and Sublinear Algorithms
Sublinear percentage calculation: "no need to examine every user for the computation"
The sample size is not a function of N; complexity O(1)
Pricing and Sublinear Algorithms
Sublinear classification/distribution comparison: "no need of every data point for the comparison"
An existing sublinear algorithm for the L2-distance test:
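The algorithm box for the L2-distance test was lost in extraction. The classic sublinear L2 closeness tester estimates ||p − q||₂² = ||p||² + ||q||² − 2⟨p, q⟩ from collision counts among samples; the sketch below follows that collision-statistics idea and is an assumption about which tester the slide refers to:

```python
import random
from itertools import combinations

def l2_distance_squared_estimate(samples_p, samples_q):
    """Estimate ||p - q||_2^2 from m samples of each distribution.

    ||p||^2 and ||q||^2 are estimated from self-collision frequencies,
    and the inner product <p, q> from cross-collision frequencies.
    """
    m = len(samples_p)
    assert len(samples_q) == m and m >= 2
    pairs = m * (m - 1) / 2
    self_p = sum(a == b for a, b in combinations(samples_p, 2)) / pairs
    self_q = sum(a == b for a, b in combinations(samples_q, 2)) / pairs
    cross = sum(a == b for a in samples_p for b in samples_q) / (m * m)
    return self_p + self_q - 2 * cross
```

Two distributions with disjoint supports give an estimate of 2, while samples drawn from the same distribution give an estimate near 0; the number of samples needed depends on the error bound, not on the support size.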
Pricing and Sublinear Algorithms
Drawback: the confidence remains undetermined when the true L2 distance between the two tested distributions falls in the interval [ε²/2, ε²]
Proposed solution: run the existing algorithm twice
Pricing and Sublinear Algorithms
1> Run the traditional sublinear sampling and obtain labeled results as set {S1}
2> Run the traditional sublinear sampling with twice the error bound and obtain labeled results as set {S2}
3> Keep the users labeled 1 in {S1} and reject all those labeled 2
4> Keep the users labeled 2 in {S2} and reject all those labeled 1
5> Combine the retained labels into {S3}: if the same user is labeled 1 in {S1} and 2 in {S2}, decide the label at random
6> Output {S3} as the final classification result
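A sketch of steps 1–6, where `classify(user, eps)` stands in for the sublinear sampling test with error bound eps (its name and signature are assumptions, as is the fallback when neither run retains a label):

```python
import random

def two_pass_labels(users, classify, eps, seed=0):
    """Combine two runs of a 1-vs-2 classifier as in steps 1-6:
    trust label 1 from the tight run, label 2 from the doubled-bound run,
    and break conflicting labels at random."""
    rng = random.Random(seed)
    s1 = {u: classify(u, eps) for u in users}      # step 1: error bound eps
    s2 = {u: classify(u, 2 * eps) for u in users}  # step 2: doubled bound
    s3 = {}
    for u in users:
        keep1 = s1[u] == 1   # step 3: retain label 1 from {S1}
        keep2 = s2[u] == 2   # step 4: retain label 2 from {S2}
        if keep1 and keep2:  # step 5: conflict -> random decision
            s3[u] = rng.choice([1, 2])
        elif keep1:
            s3[u] = 1
        elif keep2:
            s3[u] = 2
        else:
            s3[u] = s1[u]    # assumed fallback: use the tight-bound label
    return s3                # step 6: final classification
```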
Pricing and Sublinear Algorithms Overall algorithm flow: Call AlgoPercent() to sample a small portion of users for classification Call AlgoDist() to sample a small portion of each user’s distribution data points.
Pricing and Sublinear Algorithms: Numerical Results Bounded error vs. different parameterizations: Estimation errors vs. number of sub-sampling data points from the entire distribution Performance on estimating α
Pricing and Sublinear Algorithms: Numerical Results Profit gains vs. other pricing plans; reduced computation burdens: Net profits from different pricing strategies Reduced data amount vs. overall confidence parameter
Pricing and Sublinear Algorithms: Numerical Results
Reduced computation burdens vs. varying parameters and error & confidence settings: reduced data amount vs. overall error-bound parameter
Summary
Sublinear algorithms are much more efficient than linear algorithms for massive data sets
A good sampling strategy is needed
Many applications in graph theory
References
Slides from Dr. Ronitt Rubinfeld's website, http://people.csail.mit.edu/ronitt/sublinear.html
Slides from Dr. Dana Ron's website, http://www.eng.tau.ac.il/~danar/talks.html
D. Wong, Y. Long, and F. Ergun, "A layered architecture for delay sensitive sensor networks," http://www.cse.psu.edu/~sxr48/
Dan Wang and Zhu Han, "Sublinear Algorithms for Big Data Applications," Springer, 2015
Thanks