The Structure of (Computer) Scientific Revolutions
Michael Franklin, UC Berkeley & Amalgamated Insight
Dow Jones Enterprise Ventures, May 2006


Michael Franklin Dow Jones EV Summit May 2006
Data Management: Then
Structured data processing.

Data Management: Now

The Structure Spectrum
Structured data (schema-first): regular, known, conforming, …; e.g., a relational database.
Unstructured data (schema-never): freeform, irregular; e.g., plain text, images, audio, …
Semi-structured data (schema-later): provides structural information, but is less constrained; e.g., XML, tagged text/media.

Whither Structured Data?
Conventional wisdom: ~20% of data is currently structured.
Consumer apps, enterprise search, and media apps are placing downward pressure on that share.

A Contrarian View?
Two reasons why structured data is where the action will be:
1. The “Data Industrial Revolution”: data used to be “hand-crafted”; now it is generated by computers.
2. The data integration quagmire: structure provides crucial cues for making data usable.

The New Landscape
Bell’s Law: every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.
1960s: Mainframes
1970s: Minicomputers
1980s: Microcomputers/PCs
1990s: Web-based computing
2000s: Devices (cell phones, PDAs, wireless sensors, RFID)
These devices enable a new generation of applications for operational visibility, monitoring, and alerting.

Data Streams → Data Flood
Sources: clickstream, barcodes, PoS systems, sensors, RFID, telematics, inventory, phones, transactional systems.
Exponential data growth.
New challenges: continuous, interconnected, distributed, physical.
Shrinking business cycles.
More complex decisions.

State of the Art
Custom-coded implementations that are expensive and often unsuccessful.
Can we develop the right infrastructure to support large-scale data streaming apps?

High Fan-In Systems
A data management infrastructure for large-scale data streaming environments.
Uniform declarative framework: every node is a data stream processor that speaks SQL-ese → stream-oriented queries at all levels.
Hierarchical, stream-based views as an organizing principle: can impose a “view” over messy devices.

HiFi: Taming the Data Flood
[Figure: hierarchy of levels. Receptors; dock doors, shelves; warehouses, stores; regional centers; headquarters.]
Hierarchical aggregation: spatial and temporal.
In-network stream query processing and storage.
Fast data path vs. slow data path.
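As a rough illustration of spatial hierarchical aggregation, the sketch below rolls per-receptor tag counts up a tree so that each level sees the total for its subtree. The node names, the `PARENT` map, and the `roll_up` function are all hypothetical; the slide does not specify how HiFi represents its hierarchy.

```python
from collections import defaultdict

# Hypothetical hierarchy: each node names its parent; None marks the root.
PARENT = {
    "shelf-1": "store-A",
    "shelf-2": "store-A",
    "dock-1": "store-B",
    "store-A": "region-West",
    "store-B": "region-West",
    "region-West": "headquarters",
    "headquarters": None,
}


def roll_up(leaf_counts, parent=PARENT):
    """Add each leaf count to the leaf and to every ancestor up to the root."""
    totals = defaultdict(int)
    for node, count in leaf_counts.items():
        while node is not None:
            totals[node] += count
            node = parent[node]
    return dict(totals)


# Assumed tag counts observed at three edge receptors in one time window.
totals = roll_up({"shelf-1": 12, "shelf-2": 7, "dock-1": 30})
```

Each level then holds a coarser summary of the same data, which is one plausible reading of "hierarchical, stream-based views".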

Device Issues: Example
[Figure: shelf RFID test, ground truth.]

Actual RFID Readings
[Figure: actual RFID readings.]
“Restock every time inventory goes below 5”

Query-based Data Cleaning
Stages: Point, Smooth

    CREATE VIEW smoothed_rfid_stream AS
      (SELECT receptor_id, tag_id
       FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
       GROUP BY receptor_id, tag_id
       HAVING count(*) >= count_T)
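The Smooth view above keeps a (receptor, tag) pair only if it was read at least `count_T` times within a 5-second window. Because the range and the slide are equal, the windows do not overlap. The following is a minimal Python sketch of that logic, not the stream engine's implementation; the `smooth` function, the tuple layout, and the sample readings are assumptions for illustration.

```python
from collections import Counter


def smooth(readings, window_sec=5, count_threshold=2):
    """Keep (receptor_id, tag_id) pairs seen >= count_threshold times per window.

    readings: iterable of (timestamp_sec, receptor_id, tag_id) tuples.
    Returns {window_index: set of (receptor_id, tag_id)}.
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        window = int(ts // window_sec)  # non-overlapping window: range == slide
        counts[(window, receptor_id, tag_id)] += 1

    kept = {}
    for (window, receptor_id, tag_id), n in counts.items():
        if n >= count_threshold:  # plays the role of HAVING count(*) >= count_T
            kept.setdefault(window, set()).add((receptor_id, tag_id))
    return kept


# Hypothetical readings: tagA is read repeatedly, tagB only once.
readings = [
    (0.5, "r1", "tagA"), (1.5, "r1", "tagA"), (3.0, "r1", "tagA"),
    (2.0, "r1", "tagB"),
    (6.0, "r1", "tagA"), (7.0, "r1", "tagA"),
]
smoothed = smooth(readings)
```

With a threshold of 2, the single stray read of `tagB` is filtered out while `tagA` survives in both windows.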

Query-based Data Cleaning
Stages: Point, Smooth, Arbitrate

    CREATE VIEW arbitrated_rfid_stream AS
      (SELECT receptor_id, tag_id
       FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
       GROUP BY receptor_id, tag_id
       HAVING count(*) >= ALL
         (SELECT count(*)
          FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
          WHERE tag_id = rs.tag_id
          GROUP BY receptor_id))
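The Arbitrate view resolves tags that are seen by more than one receptor: within each window, a tag is attributed to the receptor that read it most often, and ties keep every tied receptor because the comparison is `>= ALL`. Here is a hedged Python sketch of that rule, operating directly on reading tuples rather than on a smoothed stream; the `arbitrate` function and the sample data are hypothetical.

```python
from collections import Counter


def arbitrate(readings, window_sec=5):
    """Return {(window, receptor_id, tag_id)} where the receptor has the
    highest read count for that tag in that window (ties are all kept)."""
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        counts[(int(ts // window_sec), receptor_id, tag_id)] += 1

    # Highest count any receptor achieved for each (window, tag).
    best = {}
    for (window, _receptor_id, tag_id), n in counts.items():
        key = (window, tag_id)
        best[key] = max(best.get(key, 0), n)

    # count >= ALL other counts is the same as count == maximum.
    return {
        (window, receptor_id, tag_id)
        for (window, receptor_id, tag_id), n in counts.items()
        if n == best[(window, tag_id)]
    }


# Hypothetical readings: r1 sees tagA three times, r2 sees it once;
# tagB is seen once by each receptor (a tie).
readings = [
    (0.5, "r1", "tagA"), (1.0, "r1", "tagA"), (2.0, "r1", "tagA"),
    (2.5, "r2", "tagA"),
    (1.0, "r1", "tagB"), (3.0, "r2", "tagB"),
]
arbitrated = arbitrate(readings)
```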

After Query-based Cleaning
[Figure: RFID readings after query-based cleaning.]
“Restock every time inventory goes below 5”
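To see why cleaning matters for a rule like "restock every time inventory goes below 5", consider a small sketch that fires an alert whenever the observed count crosses from at least 5 to below 5. The `restock_alerts` function and both count series are invented for illustration and are not taken from the slide's data.

```python
def restock_alerts(inventory_by_window, threshold=5):
    """Return the windows in which inventory drops from >= threshold to below it."""
    alerts = []
    previous = None
    for window, count in inventory_by_window:
        crossed = previous is None or previous >= threshold
        if count < threshold and crossed:
            alerts.append(window)
        previous = count
    return alerts


# Hypothetical per-window tag counts for one shelf.
cleaned = [(0, 10), (1, 8), (2, 4), (3, 4), (4, 9)]
noisy = [(0, 10), (1, 3), (2, 9), (3, 2), (4, 9)]  # dropped reads mimic low stock

cleaned_alerts = restock_alerts(cleaned)
noisy_alerts = restock_alerts(noisy)
```

Under these assumed inputs, the noisy series triggers two alerts while the cleaned series triggers one.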

Once you have the right abstractions…
“Soft sensors”
Quality and lineage
Optimization (power, etc.)
Pushdown of external validation information
Data archiving
Model-based sensing
Imperative processing
…

Data Integration
Integration is the ultimate schema-first problem.
Structure is both a key enabler and a key impediment here.

Search vs. Query
What if you wanted to find out which actors donated to John Kerry’s presidential campaign?

Search vs. Query
“Search” can return only what’s been previously “stored”.

Also…
What if you wanted to find out the average donation of actors to each candidate?
What if you wanted to compare actor donations this campaign to the last one?
What if you wanted to find out who gave the most to each candidate?
What if you wanted to know where the information came from, and how old it was?
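Questions like these are aggregate queries, which need structured records rather than a list of matching pages. As a small sketch, the example below uses Python's built-in `sqlite3` module with a hypothetical `donations` table and placeholder rows to compute the average and largest actor donation per candidate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and placeholder rows; not real donation data.
conn.execute(
    "CREATE TABLE donations "
    "(donor TEXT, occupation TEXT, candidate TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?, ?)",
    [
        ("Donor 1", "actor", "Candidate X", 1000.0),
        ("Donor 2", "actor", "Candidate X", 2000.0),
        ("Donor 3", "actor", "Candidate Y", 500.0),
        ("Donor 4", "lawyer", "Candidate Y", 9000.0),
    ],
)

# Average and largest donation from actors, grouped by candidate.
stats_by_candidate = conn.execute(
    "SELECT candidate, AVG(amount), MAX(amount) "
    "FROM donations WHERE occupation = 'actor' "
    "GROUP BY candidate ORDER BY candidate"
).fetchall()
conn.close()
```

Keyword search alone cannot produce these numbers because they are computed from the records rather than stored anywhere.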

A “Deep-Web” Query Approach

    SELECT y.name, f.occupation, …
    FROM Yahoo_Actors y, FECInfo f
    WHERE y.name = f.name
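The join above can be tried out on toy data. The sketch below loads two small in-memory tables with `sqlite3`; the table names come from the slide, but the columns beyond `name` and `occupation`, and every row, are placeholders invented for illustration. One row deliberately uses a different name format to show that an exact-equality join only matches identically formatted strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns and rows standing in for the two web sources.
conn.executescript(
    """
    CREATE TABLE Yahoo_Actors (name TEXT);
    CREATE TABLE FECInfo (name TEXT, occupation TEXT, candidate TEXT, amount REAL);
    INSERT INTO Yahoo_Actors VALUES ('Pat Example'), ('Sam Sample');
    INSERT INTO FECInfo VALUES
        ('Pat Example', 'Actor', 'Candidate X', 500.0),
        ('SAMPLE, SAM', 'Actor', 'Candidate X', 250.0),
        ('Lee Other', 'Engineer', 'Candidate X', 100.0);
    """
)

rows = conn.execute(
    "SELECT y.name, f.occupation, f.amount "
    "FROM Yahoo_Actors y, FECInfo f "
    "WHERE y.name = f.name"
).fetchall()
conn.close()
```

Only the row whose name is spelled identically in both tables joins; the `'SAMPLE, SAM'` record is missed.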

“Yahoo Actors” JOIN “FECInfo”
Q: Did it work?

The Fundamental Tradeoff
[Figure: level of functionality plotted against time (and cost) for structured (schema-first), semi-structured (schema-later), and unstructured (schema-less) data.]
Structure enables computers to help users manipulate and maintain the data.

Dataspaces*
Deal with all the data from an enterprise, in whatever form.
Data co-existence: no integrated schema, no single warehouse.
Pay-as-you-go services: keyword search is the bare minimum; data manipulation and increased consistency come as you add work.
* “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.

Dataspaces vs. Databases
Dataspaces: data coexistence; autonomous sources; search, browse, approximate answer; best-effort guarantees.
Databases: single schema; centralized administration; structured query; strict integrity constraints.

The World of Dataspaces
[Figure: systems placed on two axes, semantic integration (low to high) and administrative proximity (near to far): desktop search, web search, virtual organization, federated DBMS, DBMS.]

Conclusions
Structured data is not going away. In fact, there will be lots more of it, and it must be processed as fast as it is created.
Structure is crucial for successful data integration and manipulation.
Much effort will be expended to add structural information to text and media.
Traditional (structured) database technology is not up to the task.
There are great opportunities for innovation; HiFi and Dataspaces are examples.