The Structure of (Computer) Scientific Revolutions Dow Jones Enterprise Ventures May 2006 Michael Franklin UC Berkeley & Amalgamated Insight.


1 The Structure of (Computer) Scientific Revolutions Dow Jones Enterprise Ventures May 2006 Michael Franklin UC Berkeley & Amalgamated Insight

2 Michael Franklin Dow Jones EV Summit May 2006 Data Management: Then Structured Data Processing

3 Data Management: Now

4 The Structure Spectrum
Structured data (schema-first): regular, known, conforming, … e.g., relational database.
Unstructured data (schema-never): freeform, irregular, e.g., plain text, images, audio, …
Semi-structured data (schema-later): provides structural information, but is less constrained, e.g., XML, tagged text/media.

5 Whither Structured Data? Conventional wisdom: ~20% of data is currently structured. Consumer apps, enterprise search, and media apps are placing downward pressure on this share.

6 A Contrarian View? Two reasons why structured data is where the action will be:
The “Data Industrial Revolution”: data used to be “hand-crafted”, now it’s generated by computers!
The data integration quagmire: structure provides crucial cues for making data usable.

7 The New Landscape
Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.
1960s: Mainframes
1970s: Minicomputers
1980s: Microcomputers/PCs
1990s: Web-based computing
2000s: Devices (cell phones, PDAs, wireless sensors, RFID)
Enabling a new generation of applications for operational visibility, monitoring, and alerting.

8 Data Streams → Data Flood
Sources: clickstream, barcodes, PoS systems, sensors, RFID, telematics, inventory, phones, transactional systems.
Exponential data growth. New challenges: continuous, interconnected, distributed, physical. Shrinking business cycles. More complex decisions.

9 State of the Art: Custom-coded implementations that are expensive and often unsuccessful. Can we develop the right infrastructure to support large-scale data streaming apps?

10 High Fan In Systems: A data management infrastructure for large-scale data streaming environments.
Uniform declarative framework: every node is a data stream processor that speaks SQL-ese → stream-oriented queries at all levels.
Hierarchical, stream-based views as an organizing principle: can impose a “view” over messy devices.

11 HiFi - Taming the Data Flood
Diagram labels: Receptors; Warehouses, Stores; Dock doors, Shelves; Regional Centers; Headquarters.
Hierarchical aggregation: spatial, temporal.
In-network stream query processing and storage.
Fast data path vs. slow data path.

12 Device Issues: example. Shelf RFID Test - Ground Truth

13 Actual RFID Readings: “Restock every time inventory goes below 5”
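The restock rule quoted on this slide can be sketched in a few lines to show why raw readings are a problem. This is an illustrative sketch only: the function name `restock_alerts` and both sample series are hypothetical and are not taken from the talk. It assumes an alert fires each time the observed count crosses from at or above the threshold to below it.

```python
def restock_alerts(counts, threshold=5):
    """Return the positions at which the observed count drops below threshold.

    An alert fires only on the crossing, not on every reading that stays
    below the threshold. (Hypothetical helper, for illustration.)
    """
    alerts = []
    was_below = False
    for i, count in enumerate(counts):
        is_below = count < threshold
        if is_below and not was_below:
            alerts.append(i)
        was_below = is_below
    return alerts


# Hypothetical data: the shelf really holds 10 items the whole time,
# but the reader misses tags in some epochs.
true_counts = [10, 10, 10, 10, 10, 10]
raw_counts = [10, 4, 10, 3, 9, 10]

print(restock_alerts(true_counts))  # no alerts
print(restock_alerts(raw_counts))   # spurious alerts caused by dropped readings
```

With dropped readings, the same rule that is silent on the ground truth fires repeatedly, which is the motivation for cleaning the stream before applying application logic.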

14 Query-based Data Cleaning
Stages: Point, Smooth
CREATE VIEW smoothed_rfid_stream AS
  (SELECT receptor_id, tag_id
   FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
   GROUP BY receptor_id, tag_id
   HAVING count(*) >= count_T)
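As a rough illustration of what the smoothing view computes, the sketch below groups readings into non-overlapping 5-second windows (range and slide are both 5 seconds in the query) and keeps a (receptor, tag) pair only if it was read at least `count_t` times in that window. The function name `smooth`, the tuple layout, and the default threshold are assumptions made for this example; the stream engine behind the slide's query is not modeled.

```python
from collections import Counter


def smooth(readings, window=5.0, count_t=2):
    """Windowed smoothing of RFID readings (illustrative sketch).

    readings: iterable of (timestamp_sec, receptor_id, tag_id).
    Returns {window_index: set of (receptor_id, tag_id)} containing the
    pairs seen at least count_t times within that window.
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        # Range 5 s with slide 5 s means each reading falls in one window.
        counts[(int(ts // window), receptor_id, tag_id)] += 1

    result = {}
    for (w, receptor_id, tag_id), n in counts.items():
        if n >= count_t:
            result.setdefault(w, set()).add((receptor_id, tag_id))
    return result


# Hypothetical readings: tag "a" is read twice in window 0, once in window 1.
sample = [(0.5, "r1", "a"), (1.5, "r1", "a"), (2.0, "r1", "b"), (6.0, "r1", "a")]
print(smooth(sample))
```

Pairs that clear the threshold are reported once per window, so a tag that was missed in a few individual reads still counts as present, while a pair seen fewer than `count_t` times is dropped.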

15 Query-based Data Cleaning
Stages: Point, Smooth, Arbitrate
CREATE VIEW arbitrated_rfid_stream AS
  (SELECT receptor_id, tag_id
   FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
   GROUP BY receptor_id, tag_id
   HAVING count(*) >= ALL
     (SELECT count(*)
      FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
      WHERE tag_id = rs.tag_id
      GROUP BY receptor_id))
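The arbitration step resolves a tag that is picked up by more than one receptor in the same window. A minimal sketch of the `HAVING count(*) >= ALL (...)` logic, for a single window, might look like the following. The name `arbitrate` and the sample receptor names are hypothetical; as in the query, ties are kept for every receptor that shares the maximum count.

```python
from collections import Counter, defaultdict


def arbitrate(window_readings):
    """Assign each tag to the receptor(s) that read it most often.

    window_readings: iterable of (receptor_id, tag_id) pairs observed in
    one window. Returns the set of (receptor_id, tag_id) pairs whose count
    is at least as large as every other receptor's count for that tag.
    """
    counts = Counter(window_readings)

    # Highest per-receptor count seen for each tag.
    best = defaultdict(int)
    for (receptor_id, tag_id), n in counts.items():
        best[tag_id] = max(best[tag_id], n)

    return {(r, t) for (r, t), n in counts.items() if n == best[t]}


# Hypothetical window: tag "a" is read 3 times by shelf1 and once by shelf2.
window = [("shelf1", "a")] * 3 + [("shelf2", "a")] + [("shelf2", "b")] * 2
print(arbitrate(window))
```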

16 After Query-based Cleaning: “Restock every time inventory goes below 5”

17 Once you have the right abstractions… “Soft sensors”; quality and lineage; optimization (power, etc.); pushdown of external validation information; data archiving; model-based sensing; imperative processing; …

18 Data Integration: Integration is the ultimate schema-first problem. Structure is both a key enabler and a key impediment here.

19 Search vs. Query: What if you wanted to find out which actors donated to John Kerry’s presidential campaign?

20 Search vs. Query

21 Search vs. Query: What if you wanted to find out which actors donated to John Kerry’s presidential campaign?

22 Search vs. Query: “Search” can return only what’s been previously “stored”.

23 Also…
What if you wanted to find out the average donation of actors to each candidate?
What if you wanted to compare actor donations this campaign to the last one?
What if you wanted to find out who gave the most to each candidate?
What if you wanted to know where the information came from, and how old it was?

24 A “Deep-Web” Query Approach
SELECT y.name, f.occupation, …
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
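The join above depends entirely on `y.name = f.name` matching exactly across two independently maintained sources. The sketch below illustrates that sensitivity with made-up records. Everything in it is an assumption for illustration: the record layouts, the fictional names, the "LAST, FIRST" formatting in the second source, and the `normalize` helper. It does not reproduce the result shown in the talk.

```python
def join_on_name(actors, donations):
    """Equi-join two record lists on an exact match of the 'name' field."""
    by_name = {}
    for d in donations:
        by_name.setdefault(d["name"], []).append(d)
    return [
        (a["name"], d["occupation"])
        for a in actors
        for d in by_name.get(a["name"], [])
    ]


def normalize(name):
    """Hypothetical normalizer: 'LAST, FIRST' -> 'first last', lowercased."""
    name = name.strip()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.lower().split())


# Made-up records; the two sources format the same person differently.
actors = [{"name": "Jane Doe"}]
donations = [{"name": "DOE, JANE", "occupation": "ACTOR"}]

print(join_on_name(actors, donations))  # exact match finds nothing

# Normalizing both sides first lets the same join find the pair.
actors_n = [{"name": normalize(a["name"])} for a in actors]
donations_n = [dict(d, name=normalize(d["name"])) for d in donations]
print(join_on_name(actors_n, donations_n))
```

A declarative join is only as good as the agreement between the sources on keys, which is where structural and semantic integration work comes in.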

25 “Yahoo Actors” JOIN “FECInfo”. Q: Did it work?

26 The Fundamental Tradeoff
Chart: level of functionality vs. time (and cost), for structured (schema-first), semi-structured (schema-later), and unstructured (schema-less) data.
Structure enables computers to help users manipulate and maintain the data.

27 Dataspaces*
Deal with all the data from an enterprise, in whatever form.
Data co-existence: no integrated schema, no single warehouse.
Pay-as-you-go services: keyword search is the bare minimum; data manipulation and increased consistency come as you add work.
* “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.

28 Dataspaces vs. Databases
Dataspaces: data coexistence; autonomous sources; search, browse, approximate answer; best-effort guarantees.
Databases: single schema; centralized administration; structured query; strict integrity constraints.

29 The World of Dataspaces
Chart: semantic integration (high to low) vs. administrative proximity (near to far). Systems plotted: Desktop Search, Web Search, Virtual Organization, Federated DBMS, DBMS.

30 Conclusions
Structured data is not going away. In fact, there will be lots more of it, and it must be processed as fast as it is created.
Structure is crucial for successful data integration and manipulation. Much effort will be expended to add structural information to text and media.
Traditional (structured) database technology is not up to the task.
Great opportunities for innovation. HiFi and Dataspaces are examples.

