Characterizing and Exploiting Reference Locality in Data Stream Applications. Feifei Li, Ching Chang, George Kollios, Azer Bestavros.

Presentation transcript:

Characterizing and Exploiting Reference Locality in Data Stream Applications. Feifei Li, Ching Chang, George Kollios, Azer Bestavros. Computer Science Department, Boston University.

Data Stream Management System. [Diagram labels: Application; Query (e.g. joins over two streams); Query Processor; Result; Memory; Data Stream Management System (DSMS); select tuples that maximize the query metrics; unselected tuples.]

Observations: Storage / computation limitation: the full contents of the tuples of interest cannot be stored in memory. Query processing with a memory constraint can be cast as a “caching” problem.

“Caching” Problem in DSMS: sliding window joins, where the window size is the memory size. Which tuples should be stored to maximize the size of the join result? Locality of reference properties (Denning & Schwartz).

Locality-Aware Algorithms. [Figure: our locality-aware algorithms compared with previous algorithms.]

Our Contributions: (1) Cast query processing with a memory constraint in a DSMS as a “caching” problem and analyze the two causes of reference locality. (2) Provide a mathematical model that characterizes the reference locality in data streams, together with a simple method to infer it. (3) Show how to improve the performance of data stream applications with locality-aware algorithms.

Reference Locality - Definition: In a data stream, recently appearing tuples have a high probability of appearing again in the near future.

Inter-Arrival Distance (IAD): a random variable that corresponds to the number of tuples separating consecutive appearances of the same tuple. [Figure: example stream annotated with IAD values 3, 0, 1, 0, 1.]

Calculate distribution of IAD, where $p_i$ is the frequency of value i in this data stream. [Figure: example stream i a c e b a i ..., in which $x_n$ and $x_{n+k}$ hold the same value, so the distance is k.]

Sources of Reference Locality: long-term popularity vs. short-term correlation (web traces, Bestavros and Crovella). For example, stock traces: the stream MS IBM MS GG IBM MS ... shows reference locality due to long-term popularity; the stream A MS A A GG MS IBM ... shows reference locality due to short-term correlation (George’s Company A listed today!).

Independent Reference Model, under the independent, identically distributed (IID) assumption. Problem: it only captures reference locality due to a skewed popularity profile.
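As a worked sketch of what the IID assumption implies, take the distance to be k when the next occurrence of the value at $x_n$ is $x_{n+k}$, and let $p_i$ be the frequency of value i. The symbol $d$ for the IAD is introduced here only for illustration. The tuple at position n has value i with probability $p_i$, the next $k-1$ tuples all differ from i with probability $(1-p_i)^{k-1}$, and the tuple at position $n+k$ equals i with probability $p_i$:

```latex
\Pr[d = k] \;=\; \sum_{i} p_i \,(1 - p_i)^{k-1}\, p_i
          \;=\; \sum_{i} p_i^{2}\,(1 - p_i)^{k-1}, \qquad k \ge 1 .
```

The right-hand side depends only on the popularity profile, which is why this model cannot express short-term correlation.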

Metrics of Reference Locality: How to distinguish the two causes of reference locality? Compare the IAD distribution of the original data stream S with that of a random permutation of S. [Figure: original data stream S and a random permutation of S.]
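The comparison described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the distance is taken to be the index gap k between $x_n$ and the next equal tuple $x_{n+k}$, so adjacent repeats have distance 1.

```python
import random
from collections import Counter


def inter_arrival_distances(stream):
    """Return one distance per repeated appearance in `stream`.

    The distance is the index gap k between a tuple and the previous
    tuple with the same value (adjacent repeats give k = 1).
    """
    last_seen = {}
    distances = []
    for index, value in enumerate(stream):
        if value in last_seen:
            distances.append(index - last_seen[value])
        last_seen[value] = index
    return distances


def iad_cdf(stream, max_distance):
    """Empirical CDF of the IAD for distances 1..max_distance."""
    distances = inter_arrival_distances(stream)
    counts = Counter(distances)
    total = len(distances)
    cdf, running = [], 0
    for k in range(1, max_distance + 1):
        running += counts.get(k, 0)
        cdf.append(running / total if total else 0.0)
    return cdf


def permuted_copy(stream, seed=0):
    """Random permutation of the stream: same popularity, no ordering."""
    shuffled = list(stream)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    # Hypothetical toy stream with bursts of repeated symbols.
    toy = ["A", "A", "A", "MS", "MS", "GG", "GG", "IBM", "IBM", "A", "A", "MS"]
    print("original:", iad_cdf(toy, 5))
    print("permuted:", iad_cdf(permuted_copy(toy), 5))
```

A permutation keeps the popularity profile but destroys ordering, so any gap between the two CDFs at small distances is attributable to short-term correlation.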

Stock Transaction Traces: daily stock transaction data from INET ATS, Inc. [Figure: Zipf-like popularity profile (log-log scale).]

Stock Transaction Traces: [Figure: CDF of IAD for the original and randomly permuted traces.] The permuted trace still has strong reference locality, due to the skewed popularity distribution.

Network OD Flow Traces: network traces of Origin-Destination (OD) flows in two major networks, US Abilene and Sprint-Europe. [Figure: Zipf-like popularity profile (log-log scale).]

Network OD Flow Traces: [Figure: CDF of IAD for the original and randomly permuted traces.]

Outline: Motivation; Reference Locality: source and metrics; A Locality-Aware Data Stream Model; Application of Locality-Aware Model (Max-subset Join, Approximate count estimation, Data summarization); Performance Study; Conclusion.

Locality-Aware Stream Model. [Figure: stream S with its recent h tuples $x_{n-h}, \dots, x_{n-1}$ and the popularity distribution P of S. Here $x_n = 5$ is copied from the recent h tuples of S, with $P(x_n = x_{n-4}) = a_4$.]

Locality-Aware Stream Model. [Figure: stream S with its recent h tuples $x_{n-h}, \dots, x_{n-1}$ and the popularity distribution P of S. Here $x_n = 2$ is drawn from the popularity profile, with $P(x_n = 2 \text{ from popularity profile}) = b \cdot p(2)$.]

Locality-Aware Stream Model: $x_n = x_{n-i}$ with probability $a_i$, and $x_n = Y$ with probability $b$, where $1 \le i \le h$, $Y$ is an IID random variable w.r.t. $P$, and $\delta(x_k, c) = 1$ if $x_k = c$ and 0 otherwise. A similar model appears for caching of web traces, for example Konstantinos Psounis et al.
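To make the model concrete, the following is a minimal generator sketch. It assumes that the probabilities sum to one, so that $b = 1 - \sum_i a_i$, and that while fewer than h tuples exist the unused probability mass falls back to a draw from P. The function name and parameters are hypothetical.

```python
import random


def generate_stream(length, a, popularity, seed=0):
    """Draw a stream from the locality-aware model (illustrative sketch).

    a          -- list [a_1, ..., a_h]; a[i-1] is the probability that
                  x_n repeats x_{n-i}
    popularity -- dict mapping value -> probability (the profile P)
    The remaining mass b = 1 - sum(a) goes to an IID draw Y from P
    (assumption: all probabilities sum to one).
    """
    h = len(a)
    b = 1.0 - sum(a)
    if b < -1e-9:
        raise ValueError("the a_i must sum to at most 1")
    rng = random.Random(seed)
    values = list(popularity)
    weights = [popularity[v] for v in values]
    stream = []
    for n in range(length):
        u = rng.random()
        chosen = None
        cumulative = 0.0
        # Try to copy one of the recent h tuples (fewer at the start).
        for i in range(1, min(h, n) + 1):
            cumulative += a[i - 1]
            if u < cumulative:
                chosen = stream[n - i]
                break
        if chosen is None:
            # Fall back to the long-term popularity profile P.
            chosen = rng.choices(values, weights=weights)[0]
        stream.append(chosen)
    return stream
```

Setting all $a_i$ to zero reduces the generator to the IID model, while larger $a_i$ produce the bursts of repeated values associated with short-term correlation.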

Infer the Model: starting from the expected value of $x_n$, apply the least squares method, minimizing over $a_1, \dots, a_h, b$. Make N observations to infer the $h+1$ parameters $a_i$ and $b$.
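A least-squares fit of this kind can be sketched as follows. The objective used here is an assumption consistent with the model, not a formula taken from the slides: tuple values are treated as numbers, the expected value is taken to be $E[x_n] = \sum_i a_i x_{n-i} + b\,E[Y]$, and the fit minimizes the squared error over the observations. No constraint such as $\sum_i a_i + b = 1$ is enforced, and all function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; more varied observations needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_locality_model(stream, h, mean_of_p):
    """Infer (a_1..a_h, b) by ordinary least squares.

    Each observation n >= h contributes the row
    [x_{n-1}, ..., x_{n-h}, E[Y]] with target x_n; the normal
    equations (A^T A) theta = A^T y are then solved directly.
    """
    rows = [[stream[n - i] for i in range(1, h + 1)] + [mean_of_p]
            for n in range(h, len(stream))]
    targets = [stream[n] for n in range(h, len(stream))]
    size = h + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    aty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(size)]
    theta = solve_linear_system(ata, aty)
    return theta[:h], theta[h]
```

On real traces the fitted values would typically need to be clipped or renormalized to be read as probabilities; that step is left out of this sketch.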

Model on Real Traces - Stock: b is the degree of reference locality due to long-term popularity; 1-b is the degree of reference locality due to short-term correlation.

Model on Real Traces - OD Flow

Utilizing Model for Prediction. [Figure: stream S with past tuples $x_{n-h}, \dots, x_{n-1}, x_n$ and a future period $x_{n+1}, x_{n+2}, \dots, x_{n+T}$ of length T.] $E_T(e)$ is the expected number of occurrences of the tuple with value e in a future period of length T. It is computed using only T+1 constants derived from the locality model of S.

Outline: Motivation; Reference Locality: source and metrics; A Locality-Aware Data Stream Model; Application of Locality-Aware Model (Max-subset Join, Approximate count estimation, Data summarization); Performance Study; Conclusion.

Approximate Sliding Window Join: sliding window joins, where the window size is the memory size. Which tuples should be stored to maximize the size of the join result?

Existing Approaches. Metric: Max-subset. Previous approaches: random load shedding: poor performance (J. Kang et al., A. Das et al.); frequency model: IID assumption (A. Das et al.); age-based model: too strict an assumption (U. Srivastava et al.); stochastic model: not universal (J. Xie et al.).

Marginal Utility. [Figure: streams S and R around tuple indices n-1 and n, with T = 5.]

Calculate Marginal Utility. [Figure: streams S and R by tuple index, with probabilities $P_1, P_2, \dots$ for the upcoming tuples.] Based on the locality model, one can derive an expression in terms of a function F, where F depends on the characteristic equation of $P_i$, which is a linear recursive sequence.

ELBA: Exact Locality-Based Algorithm. Based on the previous analysis, calculate the marginal utility of the tuples in the buffer and evict the victim with the smallest value. Expensive.

LBA: Locality-Based Algorithm. Assume T is fixed and approximate the marginal utility based on the predictive power of the locality model. It depends on only T+1 constants, which can be pre-computed.
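The eviction step that ELBA describes, and that LBA presumably shares with an approximate utility, can be sketched as below. The marginal utility itself is passed in as a function because its exact form is not given in the transcript; `admit` and `marginal_utility` are hypothetical names.

```python
def admit(buffer, new_tuple, capacity, marginal_utility):
    """Add `new_tuple` to `buffer`; evict the lowest-utility tuple if full.

    marginal_utility -- callable mapping a tuple to its (exact or
                        approximate) marginal utility; supplied by the
                        caller, since its definition is an assumption here.
    Returns the evicted tuple, or None if nothing was evicted.
    """
    buffer.append(new_tuple)
    if len(buffer) <= capacity:
        return None
    # The victim is the buffered tuple with the smallest marginal utility;
    # the newly arrived tuple is itself a candidate.
    victim_index = min(range(len(buffer)),
                       key=lambda i: marginal_utility(buffer[i]))
    return buffer.pop(victim_index)
```

A linear scan is used for clarity; a priority queue would be the natural choice if utilities stayed fixed between arrivals, which need not hold when they depend on the evolving stream.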

Space Complexity: a histogram stores both P over a domain of size D and the T+1 constants. Histogram space usage is polylogarithmic: O(poly[log N]) space for N values (A. Gilbert et al.).

Sliding window join: varying buffer size - OD Flow

Sliding window join: varying buffer size - Stock

Sliding window join: varying window size - Stock

Conclusion: The reference locality property is important for query processing with a memory constraint in data stream applications. Most real data streams have strong temporal locality, i.e. short-term correlations. How about spatial locality, i.e. correlation among different attributes of a tuple?

Thanks!

Approximate Count Estimation: derive a much tighter space bound for the Lossy Counting algorithm (G. Manku et al.) using locality-aware techniques. A tight space bound is important, as it tells us how much memory to allocate.
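For readers unfamiliar with the baseline, here is a minimal sketch of the standard Lossy Counting algorithm whose space bound is being tightened. It shows only the usual bucket-based formulation, not the locality-aware analysis; the function name is hypothetical.

```python
import math


def lossy_count(stream, epsilon):
    """Standard Lossy Counting: approximate counts with error at most epsilon * N.

    Returns (counts, n), where counts maps each surviving value to its
    estimated frequency and n is the number of tuples processed.
    Estimates never exceed the true count and undercount by at most
    epsilon * n.
    """
    width = math.ceil(1 / epsilon)      # bucket width
    entries = {}                        # value -> [count, max undercount]
    n = 0
    for value in stream:
        n += 1
        bucket = math.ceil(n / width)   # id of the current bucket
        if value in entries:
            entries[value][0] += 1
        else:
            entries[value] = [1, bucket - 1]
        if n % width == 0:
            # At each bucket boundary, drop entries that cannot be frequent.
            stale = [k for k, (count, delta) in entries.items()
                     if count + delta <= bucket]
            for key in stale:
                del entries[key]
    return {k: count for k, (count, _) in entries.items()}, n
```

The number of entries kept at any time is what the space bound measures, so a stream with strong reference locality, where few distinct values are active at once, is the case in which a tighter bound is plausible.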

Data Summarization: define entropy over a window in a data stream using locality-aware techniques, instead of the usual definition of entropy. This is important for data summarization, change detection, etc. For example:

Data Stream Entropy

Data Stream              Locality-Aware Entropy
Uniform IID              6.19
Permuted Stock Stream    5.48
Original Stock Stream    3.32

A higher degree of reference locality implies lower entropy.
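The locality-aware entropy is not spelled out on these slides. As a baseline for comparison only, the conventional definition is the Shannon entropy of the empirical value distribution inside a window, sketched below with a hypothetical function name. Because it depends only on value frequencies, it gives the same result for a window and for any permutation of that window.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the empirical distribution of `window`."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

This order-insensitivity is the limitation a locality-aware definition would need to address, since the table above distinguishes the original stock stream from its permutation.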