1
The Future of Data Management, or The Structure of (Computer) Scientific Revolutions. EECS BEARS Conference, February 2007. Michael Franklin, UC Berkeley & Amalgamated Insight, Inc.
2
Michael Franklin EECS BEARS Conference - February 2007 The Structure Spectrum. Structured (schema-first): relational database, formatted messages. Semi-structured (schema-later): XML, tagged text/media. Unstructured (schema-never): plain text, media.
3
Structured Data Management
4
A “Modern” View of Data Management
5
Whither Structured Data? Conventional wisdom: only 20% of data is structured, and that share is decreasing due to consumer applications, enterprise search, and media applications.
6
Structured Data Management. Two reasons why this is where the future is: (1) The data integration quagmire: the perennial IT problem. Structure provides crucial cues.
7
Structured Data Management. Two reasons why this is where the future is: (1) The data integration quagmire: the perennial IT problem. Structure provides crucial cues. (2) The “Data Industrial Revolution”*: data used to be hand-crafted, now it’s machine-generated! * Credit to Prof. Joe Hellerstein for this analogy.
8
Reason 1: Data Integration. The ultimate schema-first problem. In the future, required for all applications. Structure is both an enabler and a key impediment. [Figure: wrapper, mediated schema, semantic mappings.] Courtesy of Alon Halevy.
9
Why Structure? What if you wanted to find out which actors donated to John Kerry’s 2004 presidential campaign…
10
Why Structure?
11
Why Structure? What if you wanted to find out which actors donated to John Kerry’s 2004 presidential campaign…
12
Why Structure? Text “Search” can return only what’s been previously “stored”.
13
What if you wanted to… find out the average donation of actors to each candidate? compare actor donations this campaign to the last one? find out who gave the most to each candidate? organize the information by source or age?
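Questions like these become short declarative queries once the data has a schema. The following is a minimal sketch, assuming a single hypothetical `donations` table; the table, its columns, and every row are invented for illustration and do not come from the talk. "Gave the most" is simplified here to the largest single donation.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE donations (donor TEXT, occupation TEXT, candidate TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?, ?)",
    [
        ("Donor A", "actor", "Candidate X", 1000.0),
        ("Donor B", "actor", "Candidate X", 2000.0),
        ("Donor C", "actor", "Candidate Y", 500.0),
        ("Donor D", "lawyer", "Candidate X", 250.0),
    ],
)

# Average donation of actors to each candidate.
avg_by_candidate = conn.execute(
    """
    SELECT candidate, AVG(amount)
    FROM donations
    WHERE occupation = 'actor'
    GROUP BY candidate
    ORDER BY candidate
    """
).fetchall()

# Largest single donation to each candidate, and who made it.
top_donors = conn.execute(
    """
    SELECT d.candidate, d.donor, d.amount
    FROM donations d
    WHERE d.amount = (SELECT MAX(amount) FROM donations WHERE candidate = d.candidate)
    ORDER BY d.candidate
    """
).fetchall()

print(avg_by_candidate)
print(top_donors)
```

Neither question can be answered by keyword search alone, because the averages and maxima are computed values that were never stored as text anywhere.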
14
A “Deep-Web” Query Approach
SELECT y.name, f.occupation, …
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
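To see why a join on names alone is fragile, here is a small runnable sketch of the same join shape using SQLite. The table names follow the slide; the `amount` column and all rows are hypothetical, chosen to show one missed match and one false match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table names follow the slide; the amount column and all rows are hypothetical.
conn.execute("CREATE TABLE Yahoo_Actors (name TEXT)")
conn.execute("CREATE TABLE FECInfo (name TEXT, occupation TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO Yahoo_Actors VALUES (?)", [("Jane Doe",), ("John Roe",)]
)
conn.executemany(
    "INSERT INTO FECInfo VALUES (?, ?, ?)",
    [
        ("Jane Doe", "actor", 500.0),
        ("ROE, JOHN", "actor", 1000.0),  # same person, different spelling: missed
        ("Jane Doe", "dentist", 200.0),  # different person, same name: false match
    ],
)

# Join the two sources on the name string, as in the slide's query.
matches = conn.execute(
    """
    SELECT y.name, f.occupation, f.amount
    FROM Yahoo_Actors y, FECInfo f
    WHERE y.name = f.name
    ORDER BY f.amount
    """
).fetchall()

print(matches)
```

In this made-up data the join returns the dentist and drops John Roe entirely, which is the kind of failure that shared schemas and strong identifiers (keys) are meant to prevent.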
15
Did it Work?
16
What’s Missing? Common schema; any schema; strong identifiers (keys); data independence; metadata; consistency guarantees; access control.
17
The Fundamental Tradeoff. [Figure: functionality plotted against time (and cost) for structured (schema-first), semi-structured (schema-later), and unstructured (schema-less) data.] Structure enables computers to help users manipulate and maintain the data.
18
“Flexible” Structure: Dataspaces*. Deal with all the data from an enterprise, in whatever form. Data co-existence: no integrated schema, no single warehouse. Pay-as-you-go services: keyword search is the bare minimum; data manipulation and increased consistency as you add work. * “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
19
Databases vs. Dataspaces
Dataspaces: data coexistence; autonomous sources; search, browse, approximate answer; structured query; best-effort guarantees.
Databases: single schema; centralized administration; structured query; strict integrity constraints.
20
The World of Dataspaces. [Figure: systems placed along two axes, semantic integration (high to low) and administrative proximity (near to far): desktop search, web search, virtual organization, federated DBMS, DBMS.]
21
Dataspace Technology: probabilistic databases; schema matching; judicious use of user input; approximate query answering; probabilistic reasoning; uncertainty management; data model learning; structured and unstructured search.
22
Reason 2: Data Industrial Revolution. Bell’s Law: every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect. Mainframes (1960s); minicomputers (1970s); microcomputers/PCs (1980s); Web-based computing (1990s); devices such as cell phones, PDAs, wireless sensors, and RFID (2000s). Enabling a new generation of applications for operational visibility, monitoring, and alerting.
23
Data Streams Data Flood. [Figure: data sources including clickstream, barcodes, PoS systems, sensors, RFID, telematics, inventory, phones, and transactional systems.] Exponential data growth. New challenges: continuous, interconnected, distributed, physical. Shrinking business cycles. More complex decisions.
24
Device Data Management. Devices generate streams of structured data. Widespread deployment will lead to huge data volumes. Can we develop the right infrastructure to support large-scale data streaming apps? Can we incorporate devices into existing (legacy) IT infrastructure?
25
High Fan In Systems*. A data management infrastructure for large-scale data streaming environments. Uniform declarative framework: every node is a SQL data stream processor, with stream-oriented queries at all levels. Hierarchical, stream-based views as an organizing principle: can impose a “view” over messy devices. * Design Considerations for High Fan In Systems - The HiFi Approach; CIDR 2005
26
HiFi - Taming the Data Flood. Hierarchical aggregation (spatial, temporal) across levels: receptors; dock doors, shelves; warehouses, stores; regional centers; headquarters. In-network stream query processing and storage. Fast data path vs. slow data path.
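The idea of hierarchical (spatial) aggregation can be sketched as a roll-up of per-receptor counts through each level above it. This is only an illustration under stated assumptions: the `PARENT` map, the node names, and the `roll_up` function are hypothetical, and a real deployment would run a stream query at each node rather than a single in-process loop.

```python
from collections import Counter

# Hypothetical hierarchy: each node maps to its parent (names are invented).
PARENT = {
    "reader-1": "store-A",
    "reader-2": "store-A",
    "reader-3": "store-B",
    "store-A": "region-West",
    "store-B": "region-West",
    "region-West": "headquarters",
}


def roll_up(leaf_counts):
    """Add each leaf's count to every ancestor on its path to the root."""
    totals = Counter(leaf_counts)
    for leaf, n in leaf_counts.items():
        node = PARENT.get(leaf)
        while node is not None:
            totals[node] += n
            node = PARENT.get(node)
    return dict(totals)


totals = roll_up({"reader-1": 3, "reader-2": 2, "reader-3": 4})
print(totals)
```

Each level sees only an aggregate of the level below it, which is what keeps the data volume manageable as readings flow toward headquarters.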
27
“Virtual Device (VICE) API”. The VICE API is a natural place to hide much of the complexity arising from physical devices. VICE: Virtual Device Interface [Jeffery et al., Pervasive 2006, VLDBJ 07]
28
Device Issues: example. Shelf RFID Test - Ground Truth
29
Actual RFID Readings. “Restock every time inventory goes below 5”
30
Query-based Data Cleaning. Stages: Point, Smooth
CREATE VIEW smoothed_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= count_T)
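For readers unfamiliar with windowed stream queries, the Smooth step can be approximated in ordinary Python. This is a rough sketch rather than the HiFi implementation: it assumes readings arrive as `(timestamp_sec, receptor_id, tag_id)` tuples, treats the window as tumbling (range and slide are both 5 seconds), and uses a hypothetical default for the threshold `count_t`.

```python
from collections import Counter


def smooth(readings, window_sec=5, count_t=2):
    """Keep (receptor_id, tag_id) pairs read at least count_t times per window.

    readings: iterable of (timestamp_sec, receptor_id, tag_id) tuples.
    Returns {window_index: set of (receptor_id, tag_id)}.
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        # Tumbling window: each reading falls into exactly one window.
        counts[(int(ts // window_sec), receptor_id, tag_id)] += 1

    result = {}
    for (win, receptor_id, tag_id), n in counts.items():
        if n >= count_t:  # mirrors HAVING count(*) >= count_T
            result.setdefault(win, set()).add((receptor_id, tag_id))
    return result


# Hypothetical readings: t1 is seen twice in the first window, t2 only once.
sample = [(0.5, "r1", "t1"), (1.0, "r1", "t1"), (2.0, "r1", "t2"), (6.0, "r1", "t1")]
smoothed = smooth(sample)
print(smoothed)
```

Tags seen fewer than `count_t` times in a window are treated as spurious and dropped. A production smoothing filter might instead interpolate over dropped readings, so this threshold rule should be read as one simple choice.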
31
Query-based Data Cleaning. Stages: Point, Smooth, Arbitrate
CREATE VIEW arbitrated_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= ALL
   (SELECT count(*)
    FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
    WHERE tag_id = rs.tag_id
    GROUP BY receptor_id))
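The Arbitrate query assigns each tag to the receptor that saw it most often in a window, which resolves cases where two neighbouring readers both pick up the same tag. A minimal Python sketch of that logic follows, assuming the same hypothetical `(timestamp_sec, receptor_id, tag_id)` tuple format and tumbling 5-second windows. As with `>= ALL` in the query, ties keep every receptor that shares the highest count.

```python
from collections import Counter, defaultdict


def arbitrate(readings, window_sec=5):
    """Assign each tag to the receptor(s) with the highest read count per window.

    readings: iterable of (timestamp_sec, receptor_id, tag_id) tuples.
    Returns a sorted list of (window_index, receptor_id, tag_id).
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        counts[(int(ts // window_sec), tag_id, receptor_id)] += 1

    # Highest count observed for each (window, tag) across all receptors.
    best = defaultdict(int)
    for (win, tag_id, _receptor), n in counts.items():
        best[(win, tag_id)] = max(best[(win, tag_id)], n)

    return sorted(
        (win, receptor_id, tag_id)
        for (win, tag_id, receptor_id), n in counts.items()
        if n == best[(win, tag_id)]  # mirrors HAVING count(*) >= ALL (...)
    )


# Hypothetical readings: t1 is mostly seen by shelf0; t2 is tied between shelves.
sample = [
    (0.1, "shelf0", "t1"), (0.2, "shelf0", "t1"), (0.3, "shelf0", "t1"),
    (0.4, "shelf1", "t1"),
    (1.0, "shelf0", "t2"), (1.1, "shelf1", "t2"),
]
assigned = arbitrate(sample)
print(assigned)
```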
32
After Query-based Cleaning. “Restock every time inventory goes below 5”
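The effect of cleaning on the restock rule can be illustrated with a small sketch. It assumes the rule fires once each time the per-epoch item count crosses below the threshold; the function `restock_alerts` and both count series are hypothetical and are not the measurements shown in the talk.

```python
def restock_alerts(counts_per_epoch, threshold=5):
    """Return the epochs at which the count drops below the threshold."""
    alerts = []
    below = False
    for epoch, count in enumerate(counts_per_epoch):
        # Fire only on the transition from at-or-above to below the threshold.
        if count < threshold and not below:
            alerts.append(epoch)
        below = count < threshold
    return alerts


# Hypothetical per-epoch item counts for one shelf.
raw = [6, 6, 3, 6, 6, 2, 6, 4, 4]      # dropped readings cause brief dips
cleaned = [6, 6, 6, 6, 6, 6, 6, 4, 4]  # dips removed, real drop remains

raw_alerts = restock_alerts(raw)
cleaned_alerts = restock_alerts(cleaned)
print(raw_alerts, cleaned_alerts)
```

On the raw series the rule fires three times, twice spuriously; on the cleaned series it fires once, at the real drop in inventory.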
33
SQL Abstraction Makes it Easy: “soft sensors”; quality and lineage; optimization (power, etc.); pushdown of external validation information; data archiving; imperative processing; …
34
Amalgamated Insight: The Company. Next-Generation Business Intelligence. [Figure labels: complexity, performance; centralized, distributed; event-driven, query-driven.] Database/data warehouse products: RDBMS, data warehouse appliance, in-memory accelerators. Data analytics products: reporting, analysis, predictive analytics, data mining, “operational” BI/BAM.
35
Stream Query Processing is the Key. Integrated event handling and alerting; interfaces to operational systems; drill down, replay, reports. Visibility: “What’s happening now?” Notification: “Tell me when something happens.” Learning: “Why is it happening and how to improve it?” Intelligent action: “Automatically react when things happen.”
36
Company Overview
Technology: breakthrough technology for stream query processing; proven software base, leveraging open source platform; used in demanding high-volume networked applications.
Key team members: Boyd Pearce, President and CEO; Michael Franklin, Ph.D., CTO; Michael Trigg, EVP, Marketing; Sailesh Krishnamurthy, Ph.D., Chief Architect; Robert Krauss, VP, Business Development.
Founded November 2005; headquarters in Foster City, CA; Series A financing: May 2006; 10 employees (and growing!).
37
Conclusions. Structured data is increasingly important. In fact, there will be lots more of it, and it must be processed as fast as it is created. Traditional (structured) database technology is not up to the task. Great opportunities for innovation: HiFi, Dataspaces (and Amalgamated Insight!) are examples. http://www.cs.berkeley.edu/~franklin