The latte Stream-Archive Query Project - Exploring Stream+Archive Data in Intelligent Transportation Systems Jin Li (with Kristin Tufte, Vassilis Papadimos, David Maier, and Robert L. Bertini) Summer, 2007 Funded by the National Science Foundation
2 Why? Stream queries provide the current, real-time state of a system. However, you might also want to know: Is today’s (vehicle) traffic better or worse than average? Are the loop sensors on I-5 NB malfunctioning? Answering these questions requires combining the live stream data with data from the archive, in particular comparing with ‘similar’ data from the archive.
3 latte architecture [Diagram: the Live Stream (ODOT) and the PORTAL Archive (PostgreSQL) both feed stream operators inside NiagaraST/latte.]
4 Outline Introduction/Background on latte Retrieving archive data latte query evaluation Demo
5 NiagaraST – Stream Query Engine. Stream query processing system, extended from the Niagara query engine (joint work with UW-Madison). Supports XML input. Support for stream queries: window query semantics and evaluation (formally define window); data semantics (punctuation); out-of-order processing; handling data skew. Example query plan: scan → select (I-5 mp 297.5) → window avg(speed)
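The interplay of windows, punctuation, and out-of-order arrival listed on this slide can be illustrated with a minimal Python sketch. It is not NiagaraST code: the class name `WindowAvg`, the 60-second tumbling window, and the integer-second timestamps are all assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window length; not taken from the slides


class WindowAvg:
    """Tumbling-window average that emits a window only once punctuation covers it."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def on_tuple(self, timestamp, speed):
        # Tuples may arrive out of order; each one is assigned to a window
        # by its own timestamp, not by its arrival position.
        window_id = timestamp // self.window_seconds
        self.sums[window_id] += speed
        self.counts[window_id] += 1

    def on_punctuation(self, timestamp):
        # Punctuation asserts that no tuple with an earlier timestamp will
        # arrive, so every window ending at or before it can be emitted.
        results = []
        for window_id in sorted(self.counts):
            if (window_id + 1) * self.window_seconds <= timestamp:
                avg = self.sums[window_id] / self.counts[window_id]
                results.append((window_id, avg))
        for window_id, _ in results:
            # Purge state for emitted windows.
            del self.sums[window_id]
            del self.counts[window_id]
        return results
```

Because a window is closed by punctuation rather than by arrival order, a late tuple for window 0 that shows up after a tuple for window 1 is still counted, provided it arrives before the punctuation that covers window 0.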
6 PORTAL – Data Archive. PORTAL: Portland Oregon Regional Transportation Archive Listing. Official transportation archive for the Portland Metropolitan Region. Relational database archive (PostgreSQL). PORTAL receives a stream of live data from ODOT: every 20 seconds it receives speed, volume, occupancy, and status from 500 loop detectors in the Portland area, archived since July 2004. Data is provided in XML format. Every 20 seconds, scripts parse the data and insert it into the database. Aggregations are performed every 5 minutes and overnight.
7 PORTAL – Data Archive (Cont.) Loop Detector Data 20 s count, lane occupancy, speed from 500 detectors – archived since July 2004 Incident Data >155,000 since 1999 Bus AVL Data Under Development DMS Data 19 VMS since 1999 Data Archive Archived loop detector data since July 2004 About 3 million records/day About 500GB, almost 3 billion rows in database
8 Data Quality Issues in ITS Data quality is a big issue (about 20% dirty data) Missing data Communication Failure Construction Cabinet Damage Corrupted Data Detectors degrade over time Calibration errors (loop spacing) Physical incidents (equipment parking) ITS expert provides constraints on data values e.g. low speed and low volume are likely to be wrong Trying to fill in the data gap intelligently
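An expert constraint of the kind mentioned on this slide could be applied as a simple filter. The sketch below is only an illustration: the dictionary field names and both thresholds (`LOW_SPEED_MPH`, `LOW_VOLUME`) are hypothetical and do not come from the slides or from PORTAL.

```python
LOW_SPEED_MPH = 5  # assumed threshold, for illustration only
LOW_VOLUME = 2     # assumed vehicles per 20-second interval, for illustration only


def is_clean(reading):
    """Return True if a loop-detector reading passes the sketch's constraints."""
    speed = reading.get("speed")
    volume = reading.get("volume")
    if speed is None or volume is None:
        return False  # missing data (e.g. communication failure)
    if speed < 0 or volume < 0:
        return False  # corrupted values
    # Expert constraint from the slide: low speed together with low volume
    # is likely to be wrong.
    return not (speed < LOW_SPEED_MPH and volume < LOW_VOLUME)


def clean(readings):
    """Keep only the readings that pass is_clean."""
    return [r for r in readings if is_clean(r)]
```

This sketch only drops suspect readings; filling the resulting gaps intelligently, as the slide proposes, is a separate step that is not shown here.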
9 Outline Introduction/Background on latte Retrieving archive data latte query evaluation Demo
10 latte architecture [Diagram: the Live Stream (ODOT) and the PORTAL Archive (PostgreSQL) both feed stream operators inside NiagaraST/latte.] For each piece of stream data, we retrieve ‘similar’ data from the archive. What does ‘similar’ mean?
11 What does similar mean? Application dependent No standard User defined Preprocessing can be hard or infeasible Similarity definitions for ITS Compare today’s data to data from five previous weekdays But …Friday traffic is very different from Wednesday traffic Compare today to five previous Wednesdays What about weather? Better: Compare today to five previous Wednesdays where the weather (rainfall) is similar to today’s weather Hard: Compare to days with traffic conditions (speed, volume) similar to today’s conditions
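Two of the similarity definitions above can be written down directly. The following Python sketch assumes a hypothetical `rainfall_by_day` mapping from dates to rainfall amounts; the tolerance and the one-year search limit are likewise assumptions.

```python
from datetime import date, timedelta


def previous_same_weekdays(today, count=5):
    """The `count` most recent earlier dates that fall on the same weekday."""
    return [today - timedelta(weeks=k) for k in range(1, count + 1)]


def similar_weather_days(today, rainfall_by_day, count=5,
                         tolerance=0.1, max_weeks_back=52):
    """Most recent same-weekday dates whose rainfall is close to today's.

    rainfall_by_day is a hypothetical {date: rainfall} lookup; days with no
    recorded rainfall are skipped.
    """
    target = rainfall_by_day[today]
    matches = []
    for k in range(1, max_weeks_back + 1):
        day = today - timedelta(weeks=k)
        rain = rainfall_by_day.get(day)
        if rain is not None and abs(rain - target) <= tolerance:
            matches.append(day)
            if len(matches) == count:
                break
    return matches
```

For example, `previous_same_weekdays(date(2007, 4, 22), 4)` yields 15 April, 8 April, 1 April, and 25 March 2007. The weather-aware variant may reach further back in time, since weeks with dissimilar rainfall are skipped.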
12 Retrieving Similar Data. Expect database retrieval to be the bottleneck. Reactive: issue a query per tuple/event; may perform poorly. Continuity of similarity definitions: if measurements at time A and time B are similar, measurements at time A+1 and time B+1 tend to be similar too. Time is continuous (weather, speed, vehicle position).
13 Retrieving Similar Data (Cont.) Predictive: prefetching based on the similarity definition (latte). Fetching database data too early requires buffering; fetching it too late means the stream will have to stall. Dynamically adapt to database load.
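The too-early versus too-late trade-off can be sketched as a small planning function. This is not the latte extractor: the function name, the `target_lead` parameter, and the convention that archive timestamps are already shifted onto the stream's clock (for example by 7 days) are assumptions for illustration. All times are in seconds.

```python
def next_prefetch(stream_hwm, archive_hwm, span, target_lead):
    """Decide which archive time range to fetch next, or None to wait.

    stream_hwm  : high-water mark of the live stream
    archive_hwm : high-water mark of archive data already fetched,
                  shifted onto the stream's clock (assumption)
    span        : how much archive time one database query covers
    target_lead : how far ahead of the stream the archive data should be
    """
    lead = archive_hwm - stream_hwm
    if lead >= target_lead:
        # Enough archive data is buffered; fetching more now would only
        # increase buffering.
        return None
    # Archive data is not far enough ahead (or is already behind, which
    # would stall the stream), so fetch the next range.
    return (archive_hwm, archive_hwm + span)
```

A larger `target_lead` trades buffer space for a lower risk of stalling the stream when the database responds slowly.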
14 Outline Introduction/Background on latte Retrieving archive data latte query evaluation Demo
15 latte query plan (inside NiagaraST/latte). Stream branch: stream scan (speed, volume, occupancy for each detector, updates received every 20 seconds) → filter (data cleaning; produces cleaned stream data) → window aggregate (average stream speed grouped by segmentId and windowId). Archive branch: db scan (segmentids) → adaptive extractor (sends SQL queries to PORTAL and receives database tuples; the database queries perform low-level (pane) aggregation) → window aggregate (average archive speed grouped by segmentId and windowId). Both branches feed a join, followed by a project that outputs windowId, segmentId, stream speed, archive speed. The adaptive extractor builds and issues queries to the db, dynamically adapts, and punctuates the ‘stream’ of data from the db.
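The final join and project steps of the plan combine the two averages that share a windowId and segmentId. A minimal Python sketch, assuming each branch's output has been collected into a dictionary keyed by `(window_id, segment_id)` (a simplification; the real operators work on streams):

```python
def join_averages(stream_avgs, archive_avgs):
    """Join stream and archive averages on (window_id, segment_id).

    Both arguments map (window_id, segment_id) -> average speed. The result
    rows are (window_id, segment_id, stream_speed, archive_speed).
    """
    rows = []
    # Only keys present on both sides produce output, as in an inner join.
    for key in sorted(stream_avgs.keys() & archive_avgs.keys()):
        window_id, segment_id = key
        rows.append((window_id, segment_id, stream_avgs[key], archive_avgs[key]))
    return rows
```

Each output row then lets a client compare the current speed on a segment with its archived counterpart for the same window.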
16 Database Access – Porthole Scan. Our ideal view of data archive access: the database continuously provides archive data from several different places in synchronization with the input stream. Assume continuity of the similarity definition. [Figure: the live stream (22 April, 4:34) feeds a sliding-window avg; archive data from the last 4 same-day-of-week days (15 April, 4:34; 8 April, 4:34; 1 April, 4: …) feeds a sliding-window aggregate; the two results are joined (⋈).]
17 Database Access – Paned Aggregation. Based on a query template (derived from a similarity specification) and the current stream time (from punctuation), the extractor builds and issues queries to the database. Pane aggregation is performed in the database queries to reduce data communication. A simplified example (curtime stands for the current stream time): SELECT floor(extract('epoch' from (timestamp - (TIMESTAMP ' :00:00' - interval '7 days')))/60) AS pane_id, seg_id, sum(speed), count(*) FROM loopdata_20sec WHERE timestamp > (curtime - interval '7 days') AND timestamp <= (curtime - interval '7 days') + interval '5 minutes' GROUP BY pane_id, seg_id [Figure: the live stream (22 April, 4:34) feeds a sliding-window avg; archive data (15 April, 4:34; 8 April, 4:34; 1 April, 4: …) is pane-aggregated in the database and then goes through a window roll-up; the two results are joined (⋈).]
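The window roll-up that follows pane aggregation can be sketched in Python. The sketch assumes the database returns rows of `(pane_id, seg_id, sum_speed, count)` with integer pane ids, as in the simplified query above, and that a sliding window is identified by the id of its last pane; that numbering and the function name are assumptions, not the latte implementation.

```python
from collections import defaultdict


def roll_up(pane_rows, panes_per_window):
    """Combine per-pane (sum, count) pairs into sliding-window averages.

    pane_rows is an iterable of (pane_id, seg_id, speed_sum, count).
    Returns {(window_id, seg_id): average speed}, where window_id is the id
    of the window's last pane (an assumption of this sketch).
    """
    panes = defaultdict(dict)
    for pane_id, seg_id, speed_sum, count in pane_rows:
        panes[seg_id][pane_id] = (speed_sum, count)

    result = {}
    for seg_id, by_pane in panes.items():
        for last in by_pane:
            first = last - panes_per_window + 1
            members = [by_pane[p] for p in range(first, last + 1) if p in by_pane]
            total = sum(s for s, _ in members)
            n = sum(c for _, c in members)
            if n:
                # Averages are not additive, but sums and counts are, which
                # is why the query ships sum(speed) and count(*) per pane.
                result[(last, seg_id)] = total / n
    return result
```

Since each pane row summarizes up to a minute of 20-second readings per segment, only a handful of rows per segment cross the database connection for each window.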
18 Database Access - Adaptation. Decides how much data each database query should fetch, and when. Dynamically adapts the granularity and the amount of prefetching for database queries, based on: the response time of database queries; the high-watermark time of archive data; the high-watermark time of stream data.
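The slides do not state the adaptation policy, so the following Python sketch is only one plausible rule built from the three inputs listed above. The doubling and halving steps, the factors 2 and 10, and the span limits are all assumptions. Here `lag` is the archive high-water mark minus the stream high-water mark, in seconds.

```python
def adapt_span(span, lag, response_time, min_span=60, max_span=3600):
    """Pick the time extent (seconds) for the next database query.

    span          : extent covered by the previous query
    lag           : archive high-water mark minus stream high-water mark
    response_time : observed response time of recent database queries
    """
    if lag < 2 * response_time:
        # Archive data is barely ahead of the stream: ask for more per
        # query so the stream is less likely to stall.
        span = span * 2
    elif lag > 10 * response_time:
        # Archive data is far ahead: use smaller queries to limit buffering.
        span = span // 2
    # Keep the span within the assumed bounds.
    return max(min_span, min(max_span, span))
```

With a 20-second response time, a lag of 10 seconds doubles a 300-second span to 600, a lag of 500 seconds halves it to 150, and a lag of 100 seconds leaves it unchanged.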
19 Future Work More on adaptive archive access Multi-query sharing
20 Demo
21 Similarity-Condition Selection and Speed Map-Display
22 Adaptivity and Database Access. Each box represents a query issued to the PORTAL database; each query requests data for a certain period of time on a certain day. The horizontal axis is time; the horizontal span of the box represents the coverage (time extent) of the database query. A marker shows the current stream time. Legend: Green: query completed, data buffered in NiagaraST. Yellow: query in progress. Empty: data purged. Pink: query in progress, data late. Red: query not started yet, data late. Lag = (database high-water mark) − (stream high-water mark)
23 Questions?