Finding Aggregates from Streaming Data in Single Pass Medha Atre Course Project for CS631 (Autumn 2002) under Prof. Krithi Ramamritham (IITB).

Slides:



Advertisements
Similar presentations
Access 2007 ® Use Databases How can Microsoft Access 2007 help you structure your database?
Advertisements

Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter A. Tucker SIGMOD.
CS4432: Database Systems II
Fast Algorithms For Hierarchical Range Histogram Constructions
Mining Data Streams.
1 Efficient and Robust Streaming Provisioning in VPNs Z. Morley Mao David Johnson Oliver Spatscheck Kobus van der Merwe Jia Wang.
Chapter 4 Conventional Computer Hardware Architecture
CompSci Searching & Sorting. CompSci Searching & Sorting The Plan  Searching  Sorting  Java Context.
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
Optimal Workload-Based Weighted Wavelet Synopsis
1 Foundations of Software Design Fall 2002 Marti Hearst Lecture 18: Hash Tables.
C++ Programming: Program Design Including Data Structures, Third Edition Chapter 16: Recursion.
Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang.
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
Recursion.
27 August EEE442 COMPUTER NETWORKS Test results & analysis.
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
Probabilistic Skyline Operator over sliding Windows Wan Qian HKUST DB Group.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Fast multiresolution image querying CS474/674 – Prof. Bebis.
One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.
Introduction - The Need for Data Structures Data structures organize data –This gives more efficient programs. More powerful computers encourage more complex.
Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.
Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E.
Systems analysis and design, 6th edition Dennis, wixom, and roth
1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,
: Chapter 12: Image Compression 1 Montri Karnjanadecha ac.th/~montri Image Processing.
Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.
CSCE Database Systems Chapter 15: Query Execution 1.
Database Management 9. course. Execution of queries.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
Today  Table/List operations  Parallel Arrays  Efficiency and Big ‘O’  Searching.
RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.
1 COMP3040 Tutorial 1 Analysis of algorithms. 2 Outline Motivation Analysis of algorithms Examples Practice questions.
Scheduling policies for real- time embedded systems.
Wavelet Synopses with Predefined Error Bounds: Windfalls of Duality Panagiotis Karras DB seminar, 23 March, 2006.
Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis
© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. Structure-Aware Sampling:
1 CS 350 Data Structures Chaminade University of Honolulu.
Analysis of algorithms Analysis of algorithms is the branch of computer science that studies the performance of algorithms, especially their run time.
1 Biometric Databases. 2 Overview Problems associated with Biometric databases Some practical solutions Some existing DBMS.
Hashing 8 April Example Consider a situation where we want to make a list of records for students currently doing the BSU CS degree, with each.
2005/12/021 Fast Image Retrieval Using Low Frequency DCT Coefficients Dept. of Computer Engineering Tatung University Presenter: Yo-Ping Huang ( 黃有評 )
EECB 473 Data Network Architecture and Electronics Lecture 1 Conventional Computer Hardware Architecture
B+ Trees: An IO-Aware Index Structure Lecture 13.
One-Pass Wavelet Synopses for Maximum-Error Metrics Panagiotis Karras Trondheim, August 31st, 2005 Research at HKU with Nikos Mamoulis.
Data Structures and Algorithms Searching Algorithms M. B. Fayek CUFE 2006.
FILE ORGANIZATION.
CS 241 Discussion Section (12/1/2011). Tradeoffs When do you: – Expand Increase total memory usage – Split Make smaller chunks (avoid internal fragmentation)
© 2006 Pearson Addison-Wesley. All rights reserved15 A-1 Chapter 15 External Methods.
CSC 143T 1 CSC 143 Highlights of Tables and Hashing [Chapter 11 p (Tables)] [Chapter 12 p (Hashing)]
Winter 2016CISC101 - Prof. McLeod1 CISC101 Reminders Assignment 5 is posted. Exercise 8 is very similar to what you will be doing with assignment 5. Exam.
Dense-Region Based Compact Data Cube
COMPUTER NETWORKS CS610 Lecture-27 Hammad Khalid Khan.
Data Transformation: Normalization
Top 50 Data Structures Interview Questions
The Stream Model Sliding Windows Counting 1’s
Packet Switching Datagram Approach Virtual Circuit Approach
Data Streaming in Computer Networking
Data-Streams and Histograms
Fast Approximate Query Answering over Sensor Data with Deterministic Error Guarantees Chunbin Lin Joint with Etienne Boursier, Jacque Brito, Yannis Katsis,
Database Management Systems (CS 564)
Spatial Online Sampling and Aggregation
FILE ORGANIZATION.
Winter 2018 CISC101 12/2/2018 CISC101 Reminders
Evaluation of Relational Operations: Other Techniques
Heavy Hitters in Streams and Sliding Windows
Wavelet-based histograms for selectivity estimation
Presentation transcript:

Finding Aggregates from Streaming Data in Single Pass Medha Atre Course Project for CS631 (Autumn 2002) under Prof. Krithi Ramamritham (IITB).

2 Overview The need Type of solutions Choice of solution Problems addressed How does wavelet transform work ? Implementation Results

3 The need … Huge data streams encountered at routers, telephone switches, stock exchanges etc. Necessity to analyze this data for trend-related analysis, and fraud detection. Analysis to be done as fast as possible for mission- critical tasks as detecting fraud, security breaches etc. What are the possible ways of analysis ?

4 Solutions … Offline processing – Archive whole data in real-time, and analyze it offline. (Slower w.r.t. basic motives of analysis i.e. fraud-detection and performance.) Real-time processing – Analyze the data as it arrives …  In Multiple passes – easiest method.. But slower and inefficient w.r.t. load of system.  In Single pass – Requires special implementation techniques.. But faster and efficient.

5 Real time processing of Data in Single pass Methods used – Wavelet Transform, Sampling techniques, MaxDiff algorithm. Why Wavelet Transform ? –  Storing fairly approximate “sketch” of data in smaller space.  Answering simple point and range queries with quite good approximation from stored “sketch”.  Known to perform better than other techniques and easier for implementation. Note: Comparative analysis of these techniques is outside the scope of this project.

6 Data Stream query Block diagram of implementation technique Find the Wavelet transform coeffs Select the “m” best coefficients

7 Key aspects of implementation.. Single pass over data (obviously!!) At any point while processing data only O(N) memory is used where N is the number of data items being considered. Selecting “m” best data coefficients out of N data items.. such that they give minimum error in retrieval of original value of data-key. Storing these “m” coefficients instead of all N data items (m << N).

8 How does Wavelet Transform work.. [2 279] [ 2052] D: 0 a:2 D: 2 a: 8 D: 6 a: 5 Original data [ ] Wavelet coefficients

9 Tree not stored S0 S1 S2 S3 S4S5S6 S7

10 How queries are answered … Point queries – Find the number of calls made by telephone number Find value of key 5.. Value( ) = S(0) + ½*S(1) + ½*S(3) - ½*S(6) Range queries – Find number of calls made from exchange Answer to this query is the root of tree having numbers from exchange 2422 as leaves. i.e. S1, S2 etc.

11 Brief about implementation.. Data input from a file Reading file sequentially to simulate single pass over data-stream, and not accessing previous data of file. Forming the coefficient tree in the form of linked list. Storing “m” best coefficients. A program to calculate point and range queries from coefficient and answer back to user.

12 Few points to note … Very basic implementation … cannot handle data fed in any arbitrary format. Assumptions –  Assumes incoming data in key-value pair (e.g. key is tele- number , value is number of calls made from it in last 1 hr. “ ”=>6  Incoming data stream is in ordered-aggregate form. Selection of “m” best coefficients changes according to data-stream types.

13 contd … Currently we take highest first “m” coefficient by sorting them … not the best approach. Multi-dimensional data-streams not considered (discussed in research papers referred for implementation).

14 Our results … N = 4096 ************************************************************************** VALUE OF m = N ************************************************************************** RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 1011 IS RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 3067 IS 7505 THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 2015 IS RESULT OF RANGE QUERY VALUE FROM 1016 TO 1021 IS : ************************************************************************** VALUE OF m = 50% of N ************************************************************************** RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 1011 IS THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 3067 IS THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 2015 IS RESULT OF QUERY VALUE FROM 1016 TO 1021 IS :

15 contd … ************************************************************************** VALUE OF m = 25% of N ************************************************************************** THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 1011 IS THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 3067 IS THE RESULT OF THE QUERY: SELECT VALUE WHERE KEY = 2015 IS RESULT OF QUERY VALUE FROM 1016 TO 1021 IS :

16 References Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. S. Muthukrishnan, A.C. Gilbert, Y. Kotidis, M. Strauss, Wavelet-based histograms for selectivity estimation. J. Vitter, Y. Matias, M. Wang, 1998.

Thank you