SWiM Benchmark Brainstorming
Dave Maier, Mike Stonebraker, and All of You!
With thanks to Jim Gray for suggestions
Benchmark Properties
  Streamish
  Credible
  Scalable
  Realistic Input
  Approximable
  Expressively Challenging
  Portable
  Runnable
Streamish
  Source-driven data delivery
  Rapid arrival
  Infeasible to store all? (or low value to save?)
  “Live” output (output during input)
Credible
  Motivated by a likely application
  Measures useful work
  Simple to understand
  One approach: find an existing application that is done with custom coding, abstract from it
Scalable
  Stream rate & output volume
  # of streams
  Size of stream elements?
  Number of queries
  Memory requirements
  Stored data
Realistic Input
  Streams vary
    – bursts
    – stalls
    – diurnal cycles
  Stream sources come and go
Approximable
  Best stream rate vs. best answer at a given rate vs. most queries at a given rate
  Need metric for answer quality
    – latency
    – precision
    – correctness
    – completeness
Expressively Challenging?
  Range of query types
    – full stream
    – windowed
    – historic
  Range of stream semantics
    – signal
    – snapshots
    – cyclic
    – deltas
Portable
  Representation neutral: can be done with tuples, XML, messages
  Can be implemented on a wide variety of platforms: RDBMS, stream database, web-service engine
Runnable
  Can be run in a reasonable time
    – hard to test space management
    – limit on variations and cases
  Can generate streams in a repeatable manner, controlled variability
  Can build harness for testing quality metrics
    – comparison to ideal
    – capture timings
    – hard to cheat
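The "repeatable manner, controlled variability" point can be illustrated with a seeded generator. This is only a minimal sketch: the function name `generate_arrivals`, its parameters, and the specific burst, stall, and diurnal models are all hypothetical choices made for illustration, not part of any benchmark definition.

```python
import math
import random


def generate_arrivals(seed, duration_s, base_rate,
                      burst_prob=0.01, stall_prob=0.005):
    """Yield arrival timestamps (seconds) for one synthetic stream.

    All parameter names and defaults are illustrative assumptions.
    base_rate is the mean arrival rate in events per second (must be > 0).
    """
    rng = random.Random(seed)  # private RNG: the same seed replays the same stream
    t = 0.0
    while t < duration_s:
        # Diurnal cycle: the rate swings between 0.5x and 1.5x over 24 hours.
        rate = base_rate * (1.0 + 0.5 * math.sin(2.0 * math.pi * t / 86400.0))
        roll = rng.random()
        if roll < stall_prob:
            # Stall: the source goes quiet for 5 to 30 seconds.
            t += rng.uniform(5.0, 30.0)
        elif roll < stall_prob + burst_prob:
            # Burst: the next gap is drawn at ten times the current rate.
            rate *= 10.0
        # Exponential inter-arrival gap (Poisson-style arrivals).
        t += rng.expovariate(rate)
        if t < duration_s:
            yield t
```

Running the generator twice with the same seed gives identical timestamps, so two systems under test can be fed exactly the same input, while changing the seed or the probabilities varies the workload in a controlled way.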
NEXMark Stream Benchmark
  Niagara Extension of XMark
  XMark: XML Query Benchmark
  Models an on-line auction site
    Person(id, name, , ccard, city, state)
    Auction(id, itemname, desc, initbid, reserve, expires, seller, category)
    Bid(auction, bidder, price, dt-time)
  Plus static category data
Auction Monitoring System
  [Figure: Auction Monitoring System, with labels Category Data, Bid, Auction, Person, Bid, and Streamed Results]
Queries
  Full-stream and windowed
    – single-stream
    – stream and stored
    – multi-stream
  Query 5 (Hot items): Item with the most bids in past hour, each minute.
    SELECT Rstream(auction)
    FROM (SELECT B1.auction, count(*) AS num
          FROM Bid [RANGE 60 MINUTE SLIDE 1 MINUTE] B1
          GROUP BY B1.auction)
    WHERE num >= ALL (SELECT count(*)
                      FROM Bid [RANGE 60 MINUTE SLIDE 1 MINUTE] B2
                      GROUP BY B2.auction)
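To make the intent of Query 5 concrete, here is a rough sketch of the same sliding-window computation in plain Python. It assumes bids arrive as `(timestamp_s, auction_id)` pairs sorted by timestamp and that the window at report time `t` covers `(t - range_s, t]`; the function name `hot_items` and these conventions are assumptions for illustration, not how any particular stream engine evaluates the query.

```python
from collections import Counter, deque


def hot_items(bids, range_s=3600, slide_s=60):
    """Yield (report_time, [auction ids with the most bids in the window]).

    bids: iterable of (timestamp_s, auction_id), sorted by timestamp.
    A report is produced every slide_s seconds over the last range_s seconds.
    """
    window = deque()    # bids currently inside the window, oldest first
    counts = Counter()  # bids per auction inside the window
    report_at = slide_s

    def emit(now):
        # Expire bids that have fallen out of the window (now - range_s, now].
        while window and window[0][0] <= now - range_s:
            _, old_auction = window.popleft()
            counts[old_auction] -= 1
            if counts[old_auction] == 0:
                del counts[old_auction]
        if not counts:
            return now, []
        top = max(counts.values())
        # Ties are all reported, matching the ">= ALL" comparison.
        return now, sorted(a for a, n in counts.items() if n == top)

    for ts, auction in bids:
        # Produce every report that is due before this bid arrives.
        while report_at < ts:
            yield emit(report_at)
            report_at += slide_s
        window.append((ts, auction))
        counts[auction] += 1
    # Final report covering the last bids seen.
    yield emit(report_at)
```

For example, `list(hot_items([(10, "a"), (20, "b"), (30, "a"), (70, "b"), (80, "b")]))` reports auction `a` at 60 seconds and auction `b` at 120 seconds.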
Metrics
  Quality-Latency Product
    Penalties for wrong, missing, extra tuples times average latency
    Can weight importance
  Output Matching
    Difference from ideal
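One possible reading of the Quality-Latency Product is sketched below. The slide does not define the metric precisely, so everything here is an assumption: outputs are modeled as dictionaries keyed by a tuple identifier, "wrong" means a shared key with a different value, and a hypothetical `base` term is added so that a penalty-free run is still ranked by its latency. Lower scores are better.

```python
def quality_latency_product(ideal, actual, latencies,
                            w_wrong=1.0, w_missing=1.0, w_extra=1.0,
                            base=1.0):
    """Score one run; lower is better. All names and weights are hypothetical.

    ideal, actual: dicts mapping a tuple key to its value.
    latencies: per-tuple output latencies in seconds.
    """
    # Wrong: key present in both outputs but with a different value.
    wrong = sum(1 for k in actual if k in ideal and actual[k] != ideal[k])
    # Missing: expected tuples that the system never produced.
    missing = sum(1 for k in ideal if k not in actual)
    # Extra: produced tuples that the ideal output does not contain.
    extra = sum(1 for k in actual if k not in ideal)

    # Weights let the benchmark designer rank the three error kinds.
    penalty = w_wrong * wrong + w_missing * missing + w_extra * extra
    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0

    # 'base' is an assumption: it keeps a perfect answer from scoring zero.
    return (base + penalty) * avg_latency
```

With `ideal = {1: "a", 2: "b", 3: "c"}`, `actual = {1: "a", 2: "x", 4: "d"}`, and latencies of 0.5 s and 1.5 s, there is one wrong, one missing, and one extra tuple, so the score is (1 + 3) × 1.0 = 4.0.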
Scaling
  Number of Bid streams
  Rate on Person, Auction streams
  Stored data size
  Test duration (?)
Application: TV Remote Controls
  Massive clickstream (thx to D. Schrader, NCR)
    – 140 Million households w/ TV
    – 3½ hours of viewing per day
    – 19 clicks per hour
  You do the math …
  Obvious data mining uses, but also presents operational opportunities
    – Guarantee a given number of “distinct viewings” of a commercial
    – need to correlate with schedule info (network, local station, cable co.)
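Doing the math from the three figures on the slide, under the simplifying assumption that every household matches the stated averages and that clicks are spread evenly over the day (real peaks would be higher):

```python
# Figures taken from the slide.
HOUSEHOLDS = 140_000_000        # households with a TV
VIEWING_HOURS_PER_DAY = 3.5     # hours of viewing per household per day
CLICKS_PER_HOUR = 19            # remote-control clicks per viewing hour

SECONDS_PER_DAY = 24 * 60 * 60

clicks_per_day = HOUSEHOLDS * VIEWING_HOURS_PER_DAY * CLICKS_PER_HOUR
# Assumption: clicks spread uniformly over 24 hours, so this is an average rate.
avg_clicks_per_second = clicks_per_day / SECONDS_PER_DAY

print(f"clicks per day:    {clicks_per_day:,.0f}")
print(f"clicks per second: {avg_clicks_per_second:,.0f}")
```

That works out to roughly 9.3 billion clicks per day, or about 108,000 clicks per second on average.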