

1 Stream-based Data Management IS698 Min Song

2 Characteristics of Data Streams
• Data streams
  - Data streams: continuous, ordered, changing, fast, huge data sets
  - Traditional DBMS: data stored in finite, persistent data sets
• Characteristics
  - Huge volumes of continuous data, possibly infinite
  - Fast changing; requires fast, real-time response
  - The data stream model captures many of today's data processing needs
  - Random access is expensive: algorithms make a single linear scan (each element can be looked at only once)
  - Store only a summary of the data seen so far
  - Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing
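The "single linear scan" and "store only a summary" points can be illustrated with a minimal Python sketch. The `RunningSummary` class and `summarize` function are hypothetical names chosen for this example, not part of any system named in the slides. The summary stays constant in size however long the stream runs.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningSummary:
    """Constant-size summary of all values seen so far (hypothetical helper)."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        # Each element is inspected exactly once and is not stored.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else math.nan


def summarize(stream: Iterable[float]) -> RunningSummary:
    """Consume the stream in one pass and return the summary."""
    summary = RunningSummary()
    for value in stream:
        summary.update(value)
    return summary
```

Because `summarize` accepts any iterable, it works equally on a generator that yields values as they arrive, in which case nothing but the four summary fields is ever held in memory.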

3 Stream Data Applications
• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams
• Security monitoring
• Web logs and Web page click streams
• Massive data sets (even when saved, random access is too expensive)

4 Data Streams vs. Data Sets
Data Sets                                | Data Streams
Updates infrequent                       | Data changes constantly (sometimes additions only)
Old data required many times             | Mostly only the freshest data is used
Example: employees' personal data table  | Examples: financial tickers, data feeds from sensors, network monitoring, etc.

5 Using Traditional Database
[Diagram: User/Application, Loader; each Query returns a Result.]

6-7 Data Streams Paradigm
[Diagram: the User/Application registers a query with the Stream Query Processor and receives results. The processor has scratch space (memory and/or disk). The whole is labeled Data Stream Management System (DSMS).]

8-12 DBMS versus DSMS
DBMS                                                              | DSMS
Persistent relations                                              | Transient streams (and persistent relations)
One-time queries                                                  | Continuous queries
Random access                                                     | Sequential access
Access plan determined by query processor and physical DB design  | Unpredictable data arrival and characteristics
"Unbounded" disk store                                            | Bounded main memory

13 Challenges of Stream Data Processing
• Multiple, continuous, rapid, time-varying, ordered streams
• Main memory computations
• Queries are often continuous
  - Evaluated continuously as stream data arrives
  - Answer updated over time
• Queries are often complex
  - Beyond element-at-a-time processing
  - Beyond relational queries (scientific, data mining, OLAP)
• Multi-level/multi-dimensional processing and data mining
  - Most stream data are low-level or multi-dimensional in nature

14 Processing Stream Queries
• Query types
  - One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
  - Predefined query vs. ad-hoc query (issued on-line)
• Unbounded memory requirements
  - For real-time response, main-memory algorithms should be used
  - The memory requirement is unbounded if current tuples must be joined with future tuples
• Approximate query answering
  - With bounded memory, it is not always possible to produce exact answers
  - High-quality approximate answers are desired
  - Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
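As one concrete instance of the "random sampling" synopsis, here is a minimal sketch of reservoir sampling (the classic Algorithm R). The slides do not say which sampling scheme they have in mind, so this choice is an assumption; the function name `reservoir_sample` is hypothetical. It keeps a uniform random sample of `k` items using memory proportional to `k`, without knowing the stream length in advance.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k items from a one-pass stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            # Fill phase: keep the first k items unconditionally.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / seen.
            j = rng.randrange(seen)  # uniform integer in [0, seen)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Approximate answers (for example, an estimated average) can then be computed over the sample instead of the full stream, trading exactness for bounded memory.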

15 Methods for Approximate Query Answering
• Sliding windows
  - Only over sliding windows of recent stream data
  - An approximation, but often more desirable in applications
• Batched processing, sampling and synopses
  - Batched if update is fast but computing is slow: compute periodically, not very timely
  - Sampling if update is slow but computing is fast: compute using sample data, but not good for joins, etc.
  - Synopsis data structures: maintain a small synopsis or sketch of data; good for querying historical data
• Blocking operators, e.g., sorting, avg, min, etc.
  - Blocking if unable to produce the first output until seeing the entire input
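The sliding-window idea also shows how a blocking operator such as avg becomes non-blocking once it is restricted to recent data. The sketch below uses a count-based window; the slides do not specify count-based versus time-based windows, so that is an assumption, and `sliding_window_average` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_window_average(
    stream: Iterable[float], window_size: int
) -> Iterator[float]:
    """Yield the average of the most recent window_size values after each arrival."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window: Deque[float] = deque()
    running_total = 0.0
    for value in stream:
        window.append(value)
        running_total += value
        if len(window) > window_size:
            # Evict the oldest value so memory stays bounded by window_size.
            running_total -= window.popleft()
        yield running_total / len(window)
```

For example, `list(sliding_window_average([1, 2, 3, 4, 5], 3))` produces one output per input rather than waiting for the stream to end. A long-running version might recompute the total periodically to limit floating-point drift.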

16 Projects on DSMS (Data Stream Management System)
• Research projects and system prototypes
  - STREAM (Stanford): a general-purpose DSMS
  - Cougar (Cornell): sensors
  - Aurora (Brown/MIT): sensor monitoring, dataflow
  - Hancock (AT&T): telecom streams
  - Niagara (OGI/Wisconsin): Internet XML databases
  - OpenCQ (Georgia Tech): triggers, incremental view maintenance
  - Tapestry (Xerox): pub/sub content-based filtering
  - Telegraph (Berkeley): adaptive engine for sensors
  - Tradebot (www.tradebot.com): stock tickers & streams
  - Tribeca (Bellcore): network monitoring
  - Streaminer (UIUC): new project for stream data mining

17 Stream Data Mining vs. Stream Querying
• Stream mining: a more challenging task
  - It shares most of the difficulties of stream querying
  - Patterns are hidden and more general than querying
  - It may require exploratory analysis, not necessarily continuous queries
• Stream data mining tasks
  - Multi-dimensional on-line analysis of streams
  - Mining outliers and unusual patterns in stream data
  - Clustering data streams
  - Classification of stream data

18 Challenges for Mining Unusual Patterns in Data Streams
• Most stream data are low-level or multi-dimensional in nature and need multi-level/multi-dimensional (ML/MD) processing
• Analysis requirements
  - Multi-dimensional trends and unusual patterns
  - Capturing important changes at multiple dimensions/levels
  - Fast, real-time detection and response
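To make "fast, real-time detection" concrete, here is a deliberately simple one-dimensional sketch. It is not the multi-level, multi-dimensional analysis the slide calls for, and the technique (flagging values far from a running mean, with Welford's online variance update) is an assumption chosen for illustration. The name `flag_outliers` and the `threshold` and `warmup` parameters are hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple


def flag_outliers(
    stream: Iterable[float], threshold: float = 3.0, warmup: int = 10
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for values far from the running mean of earlier values."""
    count = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the running mean
    for index, value in enumerate(stream):
        # Test against statistics of the values seen before this one.
        if count >= max(warmup, 2):
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(value - mean) > threshold * std:
                yield index, value
        # Welford's update: constant memory, one pass.
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
```

Note that flagged values are still folded into the running statistics, so one extreme value widens the band for later ones; a production detector would need a policy for that.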

19 Summary
• Stream data analysis: a rich and largely unexplored field
  - Current research focus in the database community: DSMS system architecture, continuous query processing, supporting mechanisms
  - Stream data mining and stream OLAP analysis
    - Powerful tools for finding general and unusual patterns
    - Largely unexplored: current studies have only touched the surface
• Lots of exciting issues for further study
  - A promising one: multi-level, multi-dimensional analysis and mining of stream data

20-22 What Is a Continuous Query?
A query that is issued once and then logically runs continuously.
Example: detect abnormalities in network traffic behavior in real time, together with their cause (for example, link congestion due to hardware failure).
More examples: continuous queries are used to support load balancing and online automatic trading at a stock exchange.
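The "issued once, run continuously" behavior can be sketched as follows. This is a toy model, not the architecture of any particular DSMS: the class `ContinuousQueryProcessor`, its methods, and the `utilization` field in the usage note are all hypothetical. Each query is registered once as a predicate and is then evaluated against every record as it arrives.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Dict[str, float]
Predicate = Callable[[Record], bool]


class ContinuousQueryProcessor:
    """Toy stream query processor: queries are registered once, then run on every record."""

    def __init__(self) -> None:
        self._queries: Dict[str, Predicate] = {}

    def register(self, name: str, predicate: Predicate) -> None:
        # Registration happens once, before or while the stream is flowing.
        self._queries[name] = predicate

    def run(self, stream: Iterable[Record]) -> Iterator[Tuple[str, Record]]:
        # Results are produced incrementally as data arrives.
        for record in stream:
            for name, predicate in self._queries.items():
                if predicate(record):
                    yield name, record
```

A network-monitoring query in the spirit of the example above might be registered as `processor.register("congested", lambda r: r["utilization"] > 0.9)` and would then report every matching traffic record for as long as the stream continues.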

23 Special Challenges
• Timely online answers even for rapid data streams
• Ability of fast access to large portions of data
• Processing of multiple streams simultaneously

24 Making Things Concrete
[Diagram: BOB and ALICE, two Central Offices, and the DSMS.]
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end

25 Making Things Concrete
• Database = two streams of mobile call records
  - Outgoing(connectionID, caller, start, end)
  - Incoming(connectionID, callee, start, end)
• Query language = SQL
  - FROM clauses can refer to streams and/or relations

26 Query 1 (self-join)
Find all outgoing calls longer than 2 minutes

    SELECT O1.call_ID, O1.caller
    FROM Outgoing O1, Outgoing O2
    WHERE (O2.time - O1.time > 2
           AND O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')

• Result requires unbounded storage
• Can provide result as a data stream
• Can output after 2 minutes, without seeing the end event
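A minimal sketch of how Query 1 might be evaluated incrementally, emitting a call as soon as it has lasted more than 2 minutes rather than waiting for its end event. The assumptions here are mine: events arrive in time order, `time` is measured in minutes, every call eventually ends, and the names `CallEvent` and `long_calls` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class CallEvent(NamedTuple):
    call_id: int
    caller: str
    time: float  # minutes, assumed non-decreasing along the stream
    event: str   # "start" or "end"


def long_calls(
    outgoing: Iterable[CallEvent], threshold: float = 2.0
) -> Iterator[Tuple[int, str]]:
    """Emit (call_id, caller) once a call is known to exceed threshold minutes."""
    # call_id -> (caller, start time) for calls not yet reported or ended
    open_calls: Dict[int, Tuple[str, float]] = {}
    for ev in outgoing:
        # The arriving timestamp acts as a clock: report every open call
        # that started more than threshold minutes ago.
        overdue = [
            cid for cid, (_, start) in open_calls.items()
            if ev.time - start > threshold
        ]
        for cid in overdue:
            caller, _ = open_calls.pop(cid)
            yield cid, caller
        if ev.event == "start":
            open_calls[ev.call_id] = (ev.caller, ev.time)
        elif ev.event == "end":
            # Short calls, and calls already reported, are simply dropped.
            open_calls.pop(ev.call_id, None)
```

The working state holds only calls that are currently open and unreported, while the output itself is an append-only stream. The linear scan over open calls keeps the sketch short; an ordered structure would avoid it.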

27 Query 2 (join)
Pair up callers and callees

    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.call_ID = I.call_ID

• Can still provide result as a data stream
• Requires unbounded temporary storage ...
• ... unless streams are near-synchronized
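One common way to join two streams incrementally is a symmetric hash join, sketched below. The slides do not name a join algorithm, so this is a swapped-in illustration. It assumes one record per `call_ID` on each stream and a single merged input whose tuples are tagged `"O"` (Outgoing) or `"I"` (Incoming); the name `pair_calls` is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# ("O", call_id, caller) or ("I", call_id, callee)
Tagged = Tuple[str, int, str]


def pair_calls(merged: Iterable[Tagged]) -> Iterator[Tuple[str, str]]:
    """Emit (caller, callee) as soon as both sides of a call have arrived."""
    pending_callers: Dict[int, str] = {}  # Outgoing records awaiting a match
    pending_callees: Dict[int, str] = {}  # Incoming records awaiting a match
    for side, call_id, party in merged:
        if side == "O":
            if call_id in pending_callees:
                yield party, pending_callees.pop(call_id)
            else:
                pending_callers[call_id] = party
        elif side == "I":
            if call_id in pending_callers:
                yield pending_callers.pop(call_id), party
            else:
                pending_callees[call_id] = party
        else:
            raise ValueError(f"unknown stream tag: {side!r}")
```

The two pending tables are the temporary storage mentioned above. If the streams are near-synchronized, each record waits only briefly for its partner and the tables stay small; if one stream lags arbitrarily, they can grow without bound.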

28 Query 3 (group-by aggregation)
Total connection time for each caller

    SELECT O1.caller, sum(O2.time - O1.time)
    FROM Outgoing O1, Outgoing O2
    WHERE (O1.call_ID = O2.call_ID
           AND O1.event = 'start'
           AND O2.event = 'end')
    GROUP BY O1.caller

• Cannot provide result in an (append-only) stream. Alternatives:
  - Output stream with updates
  - Provide current value on demand
  - Keep answer in memory
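The "output stream with updates" alternative could look like the sketch below: each time a call ends, the caller's new running total is emitted and supersedes any earlier value for that caller. The tuple layout mirrors Outgoing(call_ID, caller, time, event); the function name `total_connection_time` is hypothetical, and events are assumed to arrive in time order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# (call_id, caller, time, event) with event "start" or "end"
Event = Tuple[int, str, float, str]


def total_connection_time(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit (caller, updated total) whenever one of the caller's calls ends."""
    start_times: Dict[int, float] = {}               # open calls only
    totals: Dict[str, float] = defaultdict(float)    # one entry per caller
    for call_id, caller, time, event in outgoing:
        if event == "start":
            start_times[call_id] = time
        elif event == "end" and call_id in start_times:
            totals[caller] += time - start_times.pop(call_id)
            # An update record: replaces the previous total for this caller.
            yield caller, totals[caller]
```

The `totals` dictionary is also what the other two alternatives would use: it is the answer kept in memory, and reading `totals[caller]` is the current value on demand.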

29 Conclusions
• Conventional DBMS technology is inadequate
• We need to reconsider all aspects of data management and processing in the presence of data streams