P2P Integration, Concluded, and Data Stream Processing


P2P Integration, Concluded, and Data Stream Processing Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems November 13, 2008

The Collaborative Data Sharing System (CDSS): Loosely Coupled, Highly Dynamic [Ives et al. CIDR05; SIGMOD Rec. 08]
A logical P2P network of mappings among databases:
Give peers control of, and the ability to update, their local DBs
Relate DBs by mappings and trust policies, to support collaboration
Support update exchange & reconciliation
Maintain data provenance to assess trust
(Diagram: peers A, B, and C, each with a local DBMS; in response to queries and edits, each peer publishes its deltas ∆A+/−, ∆B+/−, ∆C+/− for exchange.)

Dataflow in the CDSS [Taylor & Ives 06], [Green+ 07], [Karvounarakis & Ives 08]
Publish updates ∆Ppub from this peer P to the CDSS archive (a permanent log using P2P replication)
Import updates ∆Pother from all other peers:
Translate them through mappings with provenance: update exchange
Apply trust policies using data + provenance
Reconcile conflicts
Apply local curation, yielding the final updates ∆P for peer P

Update Exchange Maps across Structural Variations
(m1) G(i,c,n) → B(i,n)
(m2) G(i,c,n) → U(n,c)
(m3) B(i,n) → ∃c U(n,c)
(m4) B(i,c) ∧ U(n,c) → B(i,n)
Three-site CDSS setting for phylogeny (organism names & canonical names): PGUS with G(id,can,nam), PBioSQL with B(id,nam), and PuBio with U(nam,can), related by mappings m1–m4; uBio distrusts data from GUS along m2
Each site adds and deletes data (represented as an update log)
Schema mappings specify how data is logically related
Annotations called trust conditions specify what data is trusted, by whom

Update Exchange: Importing Updates to the Target Schema [Green, Karvounarakis, Ives, Tannen 07]
(m1) G(i,c,n) → B(i,n)
(m2) G(i,c,n) → U(n,c)
(m3) B(i,n) → ∃c U(n,c)
(m4) B(i,c) ∧ U(n,c) → B(i,n)
Goal: propagate updates, which may delete data, from "upstream"
Approach: condense edit logs into relations describing net effects on data:
Local contributions of new data to the system (e.g., Ul)
Local rejections of data imported from elsewhere (e.g., Ur)

An Example Program from the Mappings
Expand mappings m1–m4 to account for local contributions (Gl, Bl, Ul) and local rejections (Br, Ur):
Gl(i,c,n) → G(i,c,n)
Bl(i,n) → B(i,n)
(m1) G(i,c,n) ∧ ¬Br(i,n) → B(i,n)
Ul(n,c) → U(n,c)
(m2) G(i,c,n) ∧ ¬Ur(n,c) → U(n,c)
(m3) B(i,n) ∧ (c = f(i,n)) → U(n,c)
(m4) B(i,c) ∧ U(n,c) ∧ ¬Br(i,n) → B(i,n)
Convert this into a recursive (datalog-style) query
Compute the instance for the target peer
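The program above can be evaluated bottom-up to a fixpoint. Here is a minimal Python sketch of that semantics, assuming set-valued instances and a tuple-encoded Skolem function f for m3's existential variable; it illustrates the rules, not ORCHESTRA's incremental engine, and with Skolem terms termination must be argued case by case (it holds on this example).

```python
# Naive fixpoint evaluation of the update exchange program above.
# Relation names, rules, and the Skolem function f come from the slide;
# the evaluation strategy is an illustrative sketch only.

def f(i, n):
    """Skolem term standing in for m3's existentially quantified c."""
    return ("f", i, n)

def step(G, B, U, Gl, Bl, Ul, Br, Ur):
    """Apply every rule once; returns possibly enlarged instances."""
    G2 = set(G) | set(Gl)                      # Gl(i,c,n) -> G(i,c,n)
    B2 = set(B) | set(Bl)                      # Bl(i,n)   -> B(i,n)
    U2 = set(U) | set(Ul)                      # Ul(n,c)   -> U(n,c)
    for (i, c, n) in G2:
        if (i, n) not in Br:                   # m1, minus local rejections
            B2.add((i, n))
        if (n, c) not in Ur:                   # m2
            U2.add((n, c))
    for (i, n) in set(B2):                     # m3, via the Skolem term
        U2.add((n, f(i, n)))
    for (i, c) in set(B2):                     # m4
        for (n2, c2) in set(U2):
            if c2 == c and (i, n2) not in Br:
                B2.add((i, n2))
    return G2, B2, U2

def update_exchange(Gl, Bl, Ul, Br=frozenset(), Ur=frozenset()):
    """Iterate the rules to a fixpoint."""
    G, B, U = set(), set(), set()
    while True:
        G2, B2, U2 = step(G, B, U, Gl, Bl, Ul, Br, Ur)
        if (G2, B2, U2) == (G, B, U):
            return G, B, U
        G, B, U = G2, B2, U2

# The running example's local contributions:
G, B, U = update_exchange(Gl={(3, 5, 2)}, Bl={(3, 5)}, Ul={(2, 5)})
```

On this input the fixpoint derives B(3,2) via m4, plus the Skolem-labeled tuples U(5, f(3,5)) and U(2, f(3,2)) via m3, matching the c1, c2 labels on the provenance slides.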

Beyond the Basic Update Exchange Program Can generalize to perform incremental propagation given new updates Propagate updates downstream [Green+07] Propagate updates back to the original “base” data [Karvounarakis & Ives 08] But what if not all data is equally useful? What if some sources are more authoritative than others? We need a record of how we mapped the data (updates)

Provenance from Mappings
Given our mappings:
(m1) G(i,c,n) → B(i,n)
(m2) G(i,c,n) → U(n,c)
(m3) B(i,n) → ∃c U(n,c)
(m4) B(i,c) ∧ U(n,c) → B(i,n)
And the local contributions:
p1: B(3,5) at PBioSQL, p2: U(2,5) at PuBio, p3: G(3,5,2) at PGUS

Provenance from Mappings
Given our mappings:
(m1) G(i,c,n) → B(i,n)
(m2) G(i,c,n) → U(n,c)
(m3) B(i,n) → ∃c U(n,c)
(m4) B(i,c) ∧ U(n,c) → B(i,n)
We can record a graph of direct tuple derivations over the local contributions p1: B(3,5), p2: U(2,5), p3: G(3,5,2):
(Graph: G(3,5,2) derives B(3,5) via m1 and U(2,5) via m2; B(3,5) derives U(5,c1) via m3; B(3,5) and U(2,5) together derive B(3,2) via m4; B(3,2) derives U(2,c2) via m3.)
Can be formalized as polynomial expressions in a semiring [Green+07]
Note U(2,5) is true if p2 is correct, or if m2 is valid and p3 is correct
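The semiring view makes "trust" just one instantiation among many. A small Python sketch, assuming an ad-hoc tuple encoding of provenance expressions (not the paper's notation), evaluating the slide's polynomial for U(2,5) — p2 + m2·p3 — in two different semirings:

```python
# Provenance polynomials as semiring expressions [Green+07], in a
# small illustrative encoding: ("+", a, b), ("*", a, b), or a leaf
# naming a base tuple (p1..p3) or a mapping (m1..m4).

def evaluate(expr, val, plus, times):
    """Evaluate a provenance expression in any commutative semiring."""
    if isinstance(expr, tuple):
        op = plus if expr[0] == "+" else times
        return op(evaluate(expr[1], val, plus, times),
                  evaluate(expr[2], val, plus, times))
    return val[expr]

# Provenance of U(2,5): p2 + m2 * p3
prov_U25 = ("+", "p2", ("*", "m2", "p3"))

# Trust semiring (booleans, or/and): derivable from trusted inputs?
trusted = {"p2": False, "p3": True, "m2": True}
is_trusted = evaluate(prov_U25, trusted,
                      lambda a, b: a or b, lambda a, b: a and b)  # True

# Counting semiring (+, *): how many distinct derivations?
counts = {"p2": 1, "p3": 1, "m2": 1}
n_derivations = evaluate(prov_U25, counts,
                         lambda a, b: a + b, lambda a, b: a * b)  # 2
```

Even though p2 itself is distrusted here, U(2,5) survives because its second derivation (m2 applied to p3) is trusted.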

From Provenance (and Data), Trust Each peer’s admin assigns a priority to incoming updates, based on their provenance (and value) Examples of trust conditions for peer uBio: Distrusts data that comes from GUS along mapping m2 Trusts data derived from m4 with id < 100 with priority 2 Trusts data directly inserted by BioSQL with priority 1 System combines priorities and uses them to determine a unique consistent instance for the peer Trust composes across mappings Trust composes across transactions and transaction dependencies

Wrapping up P2P Data Integration What if schemas and mappings change in a PDMS or a CDSS? Add a new schema version as if it’s a new schema But adding or changing a mapping may affect query reformulation (PDMS) and/or the instance (CDSS) See TJ Green’s thesis proposal (Wed 11AM) for the mapping evolution problem PDMS / CDSS model is complementary to Cimple Cimple: focuses on extraction / wrapping PDMS / CDSS: focuses on making use of schema mappings to share data Partly open: Where do mappings & trust come from?

A Variation on the Relational Model: Streams An interesting class of applications exists where data is constantly changing, and we want to update our output accordingly Publish-subscribe systems Stock tickers, news headlines Data acquisition, e.g., from sensors, traffic monitoring, … In general, we want “live” output based on changing input This has been called many things: pub/sub, continuous queries, … In general, these have been eclipsed by the term “stream processing”

What’s a Stream, and What Do We Do with It? A stream is a time-varying series of values of a particular data type In STREAM, they consider instead a set of values with timestamps – how does this differ? What kinds of operations might we perform over changing data? Aggregation: Over a time window, or a series of values Last value for each key Some combination thereof Joins … But over what? What about approximation? Why might that be useful?
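For concreteness, two of the operations listed above sketched in Python over a stream of (timestamp, key, value) tuples; the function names and tick data are illustrative, not from any particular system.

```python
from collections import deque

def last_value_per_key(stream):
    """Most recent value seen for each key."""
    latest = {}
    for ts, key, value in stream:
        latest[key] = value
    return latest

def time_window_sum(stream, width):
    """Running sum of values within a sliding time window,
    emitted once per arriving tuple."""
    window, total, out = deque(), 0, []
    for ts, key, value in stream:
        window.append((ts, value))
        total += value
        while window and window[0][0] <= ts - width:   # expire old tuples
            total -= window.popleft()[1]
        out.append((ts, total))
    return out

ticks = [(1, "IBM", 10), (2, "HP", 20), (3, "IBM", 12), (9, "HP", 25)]
last_value_per_key(ticks)        # {"IBM": 12, "HP": 25}
time_window_sum(ticks, width=5)  # [(1, 10), (2, 30), (3, 42), (9, 25)]
```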

STREAM’s Model: the CQL Language An attempt to extend SQL to handle streams – not to invent a language from the ground up Thus it’s a bit quirky In CQL, everything is built around instantaneous relations, which are time-varying bags of tuples Relation-relation operators (normal SQL) Stream-relation operators (convert to relations) Relation-stream operators (convert instantaneous to streams) No stream-stream operators!

Converting between Streams & Relations Stream-to-relation operators: Sliding window: tuple-based (last N rows) or time-based (within a time range) Partitioned sliding window: groups by key, then applies a sliding window within each group Is this necessary or minimal? Relation-to-stream operators: Istream: stream-ifies any insertions over a relation Dstream: stream-ifies the deletes Rstream: stream contains the set of tuples in the relation
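A minimal Python sketch of the three relation-to-stream operators, assuming the time-varying relation is presented as a sequence of (timestamp, instantaneous instance) pairs, and that instances are sets rather than CQL's bags:

```python
# CQL's relation-to-stream operators over successive instantaneous
# relations; simplified to set semantics for illustration.

def istream(instants):
    """Emit (t, tuple) for each tuple inserted at time t."""
    prev, out = set(), []
    for t, rel in instants:
        out += [(t, tup) for tup in rel - prev]
        prev = rel
    return out

def dstream(instants):
    """Emit (t, tuple) for each tuple deleted at time t."""
    prev, out = set(), []
    for t, rel in instants:
        out += [(t, tup) for tup in prev - rel]
        prev = rel
    return out

def rstream(instants):
    """Emit every tuple of the instantaneous relation at every time."""
    return [(t, tup) for t, rel in instants for tup in rel]

instants = [(1, {"a"}), (2, {"a", "b"}), (3, {"b"})]
istream(instants)   # [(1, "a"), (2, "b")]
dstream(instants)   # [(3, "a")]
```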

Some Examples
Select * From S1 [Rows 1000], S2 [Range 2 Minutes]
Where S1.A = S2.A And S1.A > 10
Select Rstream(S.A, R.B) From S [Now], R
Where S.A = R.A

Building a Stream System
The basic data item is the element: <op, time, tuple>, where op ∈ {+, −}
Query plans need a few new (?) items:
Queues – used for hooking together operators, esp. over windows (the assumption is that pipelining is generally not possible, and we may need to drop some tuples from the queue)
Synopses – the intermediate state an operator needs to carry around; note that this is usually bounded by windows
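To make the element/synopsis vocabulary concrete, here is a rough Python sketch of a sliding-window join where each input's synopsis is its current window contents. It is illustrative only: the inputs are merged offline by timestamp (a real engine would feed operators through queues), only "+" elements are handled, and all names are made up.

```python
from collections import deque

def window_join(left, right, width, key=lambda tup: tup[0]):
    """Join two streams of (op, time, tuple) '+' elements on
    key(tuple), keeping only tuples from the last `width` time units."""
    syn_l, syn_r, out = deque(), deque(), []   # per-input synopses

    def expire(syn, now):
        while syn and syn[0][0] <= now - width:
            syn.popleft()

    tagged = sorted([("L", e) for e in left] + [("R", e) for e in right],
                    key=lambda x: x[1][1])     # merge inputs by timestamp
    for side, (op, t, tup) in tagged:
        assert op == "+"                       # deletions omitted here
        mine, other = (syn_l, syn_r) if side == "L" else (syn_r, syn_l)
        expire(syn_l, t)
        expire(syn_r, t)
        # Probe the other synopsis, then insert into our own.
        out += [(t, tup, o) for _, o in other if key(o) == key(tup)]
        mine.append((t, tup))
    return out

left = [("+", 1, ("a", 10))]
right = [("+", 2, ("a", 20)), ("+", 10, ("a", 30))]
window_join(left, right, width=5)   # only the t=2 arrival finds a match
```

Note how the synopsis state is bounded by the window: the t=10 arrival finds nothing, because ("a", 10) has already expired.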

Example Query Plan What’s different here?

Some Tricks for Performance Sharing synopses across multiple operators In a few cases, more than one operator may join with the same synopsis Can exploit punctuations or “k-constraints” Analogous to interesting orders Referential integrity k-constraint: bound of k between arrival of “many” element and its corresponding “one” element Ordered-arrival k-constraint: need window of at most k to sort Clustered-arrival k-constraint: bound on distance between items with same grouping attributes

Query Processing – “Chain Scheduling” Similar in many ways to eddies May decide to apply operators as follows: Assume we know how many tuples can be processed in a time unit Cluster groups of operators into “chains” that maximize reduction in queue size per unit time Greedily forward tuples into the most selective chain Within a chain, process in FIFO order They also do a form of join reordering
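The greedy step can be sketched as follows, assuming each operator is summarized by a (cost, selectivity) pair; the formation of the chains themselves (via the lower envelope of the operators' progress chart) is omitted, and the example chains and numbers are made up.

```python
# Greedy choice in chain scheduling, sketched very roughly. Processing
# one tuple through an operator takes `cost` time and leaves
# `selectivity` tuples on average.

def slope(chain):
    """Queued tuples removed per unit of processing time."""
    cost = sum(c for c, _ in chain)
    sel = 1.0
    for _, s in chain:
        sel *= s
    return (1.0 - sel) / cost

def pick_chain(chains_with_queues):
    """Among chains with a nonempty input queue, pick the steepest."""
    ready = [(chain, q) for chain, q in chains_with_queues if q > 0]
    if not ready:
        return None
    return max(ready, key=lambda cq: slope(cq[0]))[0]

filter_chain = [(1.0, 0.1)]               # cheap and very selective
join_chain = [(4.0, 0.8), (2.0, 0.5)]     # expensive, less selective
pick_chain([(filter_chain, 10), (join_chain, 10)])   # -> filter_chain
```

The cheap, selective filter chain wins (slope 0.9 vs. 0.1), which is exactly the memory-minimizing bias the paper describes.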

Scratching the Surface: Approximation They point out two areas where we might need to approximate output: CPU is limited, and we need to drop some stream elements according to some probabilistic metric Collect statistics via a profiler Use Hoeffding inequality to derive a sampling rate in order to maintain a confidence interval May need to do similar things if memory usage is a constraint Are there other options? When might they be useful?
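The Hoeffding-based rate can be derived by inverting the bound; exactly how STREAM's profiler applies it is more involved than this sketch.

```python
import math

# Inverting the Hoeffding inequality
#   P(|sample_mean - true_mean| >= eps) <= 2 * exp(-2 * n * eps^2)
# (for values scaled into [0, 1]) gives the sample size needed for a
# target error eps and failure probability delta, and from there a
# sampling rate for a window of known size.

def samples_needed(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def sampling_rate(eps, delta, tuples_per_window):
    """Fraction of the window that must be sampled, capped at 1."""
    return min(1.0, samples_needed(eps, delta) / tuples_per_window)

samples_needed(0.05, 0.01)   # 1060: ~a thousand samples suffice
sampling_rate(0.05, 0.01, tuples_per_window=1_000_000)
```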

STREAM in General “Logical semantics first” Starts with a basic data model: streams as timestamped sets Develops a language and semantics Heavily based on SQL Proposes a relatively straightforward implementation Interesting ideas like k-constraints Interesting approaches like chain scheduling No real consideration of distributed processing

Aurora “Implementation first; mix and match operations from past literature” Basic philosophy: most of the ideas in streams existed in previous research Sliding windows, load shedding, approximation, … So let’s borrow those ideas and focus on how to build a real system with them! Emphasis is on building a scalable, robust system Distributed implementation: Medusa

Queries in Aurora Oddly: no declarative query language in the initial version! (Added for commercial product) Queries are workflows of physical query operators (SQuAl) Many operators resemble relational algebra ops

Example Query

Some Interesting Aspects A relatively simple adaptive query optimizer Can push filtering and mapping into many operators Can reorder some operators (e.g., joins, unions) Need built-in error handling If a data source fails to respond in a certain amount of time, create a special alarm tuple This propagates through the query plan Incorporate built-in load shedding and real-time scheduling to support QoS Have a notion of combining a query over historical data with data from a stream Switches from a pull-based mode (reading from disk) to a push-based mode (reading from network)

The Medusa Processor Distributed coordinator between many Aurora nodes Scalability through federation and distribution Fail-over Load balancing

Main Components
Lookup: distributed catalog – schemas, where to find streams, where to find queries
Brain: query setup, load monitoring via I/O queues and stats; a load distribution and balancing scheme is used
Very reminiscent of Mariposa!

Load Balancing Migration – an operator can be moved from one node to another Initial implementation didn’t support moving of state The state is simply dropped, and operator processing resumes Implications on semantics? Plans to support state migration “Agoric system model to create incentives” Clients pay nodes for processing queries Nodes pay each other to handle load – pairwise contracts negotiated offline Bounded-price mechanism – price for migration of load, spec for what a node will take on Does this address the weaknesses of the Mariposa model?

Some Applications They Tried Financial services (stock ticker) Main issue is not volume, but problems with feeds Two-level alarm system, where higher-level alarm helps diagnose problems Shared computation among queries User-defined aggregation and mapping Linear road (sensor monitoring) Traffic sensors in a toll road – change toll depending on how many cars are on the road Combination of historical and continuous queries Environmental monitoring Sliding-window calculations

The Big Application? Military battalion monitoring Positions & images of friends and foes Load shedding is important Randomly drop data vs. semantic, predicate-based dropping to maintain QoS Based on a QoS utility function

Lessons Learned Historical data is important – not just stream data (Summaries?) Sometimes need synchronization for consistency “ACID for streams”? Streams can be out of order, bursty “Stream cleaning”? Adaptors (and also XML) are important … But we already knew that! Performance is critical They spent a great deal of time using microbenchmarks and optimizing