The Sibdata Revolution September 2009 Nick Roussopoulos DCS & UMIACS & Univ. of Maryland.



Data Management: Past to Current
- Structured data
- Structured architectures

Data Management: Huh???

The Landscape
Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.
- Mainframes: 1960s
- Minicomputers: 1970s
- Microcomputers/PCs: 1980s
- Web-based computing: 1990s
- Devices (smart phones, PDAs, wireless sensors, RFID): 2000s
These enable a new generation of applications that mandate new data management methods and tools.

Data Then and Now
- The “Data Industrial Revolution”: data used to be “hand-crafted”; now it is generated by computers.
- The data integration quagmire: 40 years of continuous successes (sic) and still a long way to the end.
- Structure provides crucial understanding for making data usable and leads to discovery and innovation.

Data Streaming → Data Explosion
- Data sources: clickstream, barcodes, PoS systems, sensors, RFID, telematics, inventory, phones, transactional systems
- Exponential data growth
- New challenges: continuous, interconnected, distributed, physical
- Shrinking business cycles
- More complex decisions

The Structure Spectrum
- Structured data (schema-first): regular, known, conforming, ...; e.g., relational databases
- Unstructured data (schema-never): freeform, irregular; e.g., plain text, images, audio, ...
- Semi-structured data (schema-later): provides structural information but is less constrained; e.g., XML, tagged text/media

Data Integration
- Integration is the ultimate schema-first problem.
- It requires complete understanding and disambiguation.
- Structure (semantics) is both a key enabler and a key impediment here.

Structured Data: How Much?
- Conventional wisdom: ~20% of data is currently structured.
- Consumer apps, enterprise search, and multimedia apps are placing downward pressure on this share.

State of the Art: Integration-in-the-large
- Teamwork, huge and expensive effort, excruciating pain
- Extremely long time lag between data generation and availability
- Custom-coded implementations that are often unsuccessful
- Clearing house of already discovered knowledge (the high overhead comes from disambiguating the semantics of the heterogeneous data)

Future: Integration-in-the-small
- End-user, limited in scope, requires training
- Continuous as the data sources and equipment evolve
- End-user tools are needed
- Small cost, enormous opportunity for discovery and innovation

Sibling Data
- Aggregation and naming of disparate data regardless of location
- Includes actual data, references to external data, queries that generate data, and programs to process data
- May include other sibdata
- Open vs. closed
  - Open: continuous accumulation
  - Closed: fixed snapshot (archival)
- Location-independent semantics
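The container described on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Sibdata` and `Reference`, the `is_open` flag, and the `flatten` helper are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Reference:
    """Pointer to data that lives outside the sibdata (hypothetical type)."""
    uri: str


@dataclass
class Sibdata:
    """A named aggregate of disparate items (illustrative sketch only).

    Items may be actual data, Reference objects, captured queries,
    programs (callables), or other Sibdata instances.
    """
    name: str
    is_open: bool = True  # open: keeps accumulating; closed: fixed snapshot
    items: List[object] = field(default_factory=list)

    def add(self, item: object) -> None:
        if not self.is_open:
            raise ValueError(f"sibdata {self.name!r} is closed")
        self.items.append(item)

    def close(self) -> None:
        """Freeze this container. Nested sibdata are not frozen here;
        a real design would have to decide how closing propagates."""
        self.is_open = False

    def flatten(self) -> Iterator[object]:
        """Yield every non-sibdata item, descending into nested sibdata."""
        for item in self.items:
            if isinstance(item, Sibdata):
                yield from item.flatten()
            else:
                yield item
```

A closed sibdata rejects further `add` calls, which mirrors the "fixed snapshot (archival)" case; an open one keeps accumulating.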

Web search results

Content vs URL
Content  moore.com/

Deep-Web Queries
SELECT m.title FROM Yahoo_Movies m WHERE m.title LIKE '%Moore%';

Result vs. Query
- Results are associated with the time the query was run.
- Queries can be captured in sibdata and executed at will; such a sibdata is open and captures a different result each time it executes.
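The difference between storing a result and storing a query can be shown with a small sketch. It uses Python's standard `sqlite3` module as a stand-in data source; the `CapturedQuery` class and the `movies` table are assumptions made for illustration, not part of the talk.

```python
import sqlite3


class CapturedQuery:
    """A query kept inside a sibdata and re-executed on demand (hypothetical)."""

    def __init__(self, conn: sqlite3.Connection, sql: str, params: tuple = ()):
        self.conn = conn
        self.sql = sql
        self.params = params

    def run(self) -> list:
        # Each call reflects the state of the source at execution time.
        return self.conn.execute(self.sql, self.params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.execute("INSERT INTO movies VALUES ('A')")

query = CapturedQuery(conn, "SELECT title FROM movies ORDER BY title")

snapshot = query.run()  # a result: tied to the moment the query ran
conn.execute("INSERT INTO movies VALUES ('B')")
later = query.run()     # the same captured query now sees the new row
```

Here `snapshot` behaves like a closed sibdata, while `query` behaves like an open one: running it again after the source changes returns a different answer.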

Queries to Relational Databases
Yahoo_Actors

Sibdata
- Deal with all the data from everywhere and in whatever form they come
- Data co-existence: no integrated schema, no single warehouse
- Expand-as-you-go: integrate little by little as you need
- ETL: data mapping and integrating as you add more data

Sibdata Properties
- Lightweight: metadata captures the encapsulation, name, and provenance data
- Location-independent: accessible from anywhere
- Isolated: generated with no interference
- Durable: persists until dropped
- Secure: guarantees the security defined by the creators and sources; composes multiple levels of security for its components

Comparison to Transactions
- Transactions
  - Grouping of many actions into an atomic transaction: ACID properties
  - Substrate: database
- Sibdata
  - Grouping of data into an atomic sibdata: LLADS properties
  - Substrate: actions/transactions/data generators

Sibdata Infrastructure

Sibdata Servers
- Establish a global sibdata ID and name
- Create and maintain metadata with provenance, users, security, etc.
- Provide a searchable catalog
- Provide storage for non-sib-compliant data sources
- Fault tolerance (replication)

Sib Protocols
- Establish the Sibdata protocol
  - Concurrency-consistency issues (?)
  - Sharing of data
  - Naming conventions
  - Dispute resolution
  - Distributed logging
- Security
  - Using chits
  - Group and multi-valued ownership and visibility

User Interface
- Simple OS support
- Query languages
- Graphical languages
- ETL tools
- Extra functionality
  - High-dimensional indexing
  - Mining

Conclusions
- Need to build the Sib infrastructure
- Refine the sibdata semantics
- Refine the security protocols
  - For data aggregates
  - User groups
- Great opportunities for innovation

Presentations & Project
- 3 x 7 students = 21 presentations, ~2 per lecture
- Lecture dates: Sep 15, 22, 29; Oct 6, 13, 20, 27; Nov 3, 10, 17, 24; Dec 1, 8
- Project: proposal due Sep 29
- Discussion: at every lecture, be prepared to give a 2-3 minute progress report (papers found, etc.)

Network Data Independence (Hellerstein, Berkeley)
- Physical data independence
  - Decoupling data from layout (no hard-coded applications)
  - Permits reorganization of data without affecting the apps
  - Declarative query languages using the schema
- Distributed databases
  - Transparency hides location from the user, who acts as if accessing a centralized database
  - Limited number of sites: cannot accommodate mobility and constant change of the configuration

Pillars of Data Independence
- Indexes: offer indirection, allowing modification of the underlying structure
- Schema-based, declarative query languages and optimization
(Figure labels: table R 1 occurrence file)

Sibdata Independence
- Encapsulation of dissimilar data
- Data can be moved, rearranged, altered
- Additional indices built on top of a sibdata become part of the sibdata
- Naming and provenance data are fixed: they do not change to the outside world
- Containment information (sibdata encapsulation within other sibdata) is guaranteed

DHT (Chord)
- Data-centric distribution according to content: total data independence
- Very large number of distributed servers
- Configuration changes rapidly (although this may not really be that important)
- Fault tolerance (extra machines)
- Limited to single-key searches (not range or join queries)
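The single-key lookup that a Chord-style DHT offers can be sketched as a hash ring in which each key is owned by the first node at or after the key's identifier. This is a simplified, single-process illustration: it omits Chord's finger tables and network routing, and the `Ring` class and node names are hypothetical.

```python
import hashlib
from bisect import bisect_left
from typing import Iterable


def ring_id(name: str) -> int:
    """Map a node name or data key to a 160-bit identifier via SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


class Ring:
    """Toy identifier ring: every key belongs to its successor node."""

    def __init__(self, nodes: Iterable[str]):
        self.by_id = {ring_id(n): n for n in nodes}
        self.ids = sorted(self.by_id)

    def successor(self, key: str) -> str:
        # First node id >= key id, wrapping around to the smallest id.
        i = bisect_left(self.ids, ring_id(key))
        return self.by_id[self.ids[i % len(self.ids)]]
```

Because placement depends only on hashed identifiers, removing a node relocates only the keys that node owned. The same hashing destroys key order, which is why range and join queries need extra machinery.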

Network Names & Services
- Internet Indirection Infrastructure (i3)
  - Triggers (id, r), where id is a global ID and r is an address to which packets are forwarded
  - When a mobile user moves to r’, he modifies his trigger to (id, r’)
  - It also supports 1-to-n mappings (anycast)
- Content Distribution Networks (Akamai)
  - Replicate heavy data (images, videos) to multiple sites and redirect user accesses to the closer ones (indirection via location independence)
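The trigger indirection can be sketched as a table from global IDs to forwarding addresses. The `TriggerTable` class and its method names are hypothetical, and the delivery policy for several addresses under one id (forward to all of them) is an assumption of this sketch, not a statement about how i3 itself resolves 1-to-n mappings.

```python
from typing import Dict, List, Set, Tuple


class TriggerTable:
    """Toy rendezvous table: senders address an id, never a location."""

    def __init__(self) -> None:
        self.triggers: Dict[str, Set[str]] = {}

    def insert(self, ident: str, addr: str) -> None:
        """Register trigger (ident, addr)."""
        self.triggers.setdefault(ident, set()).add(addr)

    def remove(self, ident: str, addr: str) -> None:
        self.triggers.get(ident, set()).discard(addr)

    def move(self, ident: str, old_addr: str, new_addr: str) -> None:
        """A mobile receiver replaces (ident, r) with (ident, r')."""
        self.remove(ident, old_addr)
        self.insert(ident, new_addr)

    def forward(self, ident: str, packet: str) -> List[Tuple[str, str]]:
        """Return (address, packet) pairs for every matching trigger."""
        return [(a, packet) for a in sorted(self.triggers.get(ident, ()))]
```

The sender keeps using the same id before and after `move`; only the receiver's trigger changes, which is the location independence the slide refers to.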

Relevant DB Technologies
- Distributed aggregation
  - Monitoring networks (collecting stats)
  - Computing synopses and passing them along
- Adaptive execution plans
  - Feedback to the execution
  - Commutative tasks to avoid extended delays
- Range search over DHT
  - Trie hashing
  - Still limited
- P2P & mobile databases
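"Computing synopses and passing them along" can be illustrated with mergeable partial aggregates: each node summarizes its own readings, merges the summaries received from its children, and forwards one small record upward. The `Synopsis` class and the `(readings, children)` tree encoding are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Synopsis:
    """Fixed-size summary that can be merged in any order."""
    count: int
    total: float
    lo: float
    hi: float

    @classmethod
    def of(cls, values: Sequence[float]) -> "Synopsis":
        # Assumes every node holds at least one reading.
        return cls(len(values), sum(values), min(values), max(values))

    def merge(self, other: "Synopsis") -> "Synopsis":
        return Synopsis(
            self.count + other.count,
            self.total + other.total,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count


def aggregate(node) -> Synopsis:
    """node is (local_readings, children); returns the subtree's synopsis."""
    readings, children = node
    summary = Synopsis.of(readings)
    for child in children:
        summary = summary.merge(aggregate(child))
    return summary
```

Only one `Synopsis` travels over each link regardless of how many readings sit below it, which is the bandwidth argument for in-network aggregation.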

Pier: A P2P in situ Query Engine
- Goals
  - Massively distributed processing
  - Scalability
  - Relaxed consistency (best effort)
- Architecture
  - P2P
  - Built on top of a DHT
  - Multicast to all related nodes (lscan)
  - Pipelining the intermediate results

Pier Joins
- Stored in DHT
  - Namespace = relation (NR, NS)
  - resourceID = primary key (PK)
  - instanceID = tuple # if not a PK
- Assume R and S are already DHT hashed using and
- Symmetric join, building phase
  - lscan NR and NS; eliminate unqualified tuples and unneeded attributes
  - Rehash all of the above tuples using namespace NQ, resourceID = R.pkey*S.pkey
  - Tuples are tagged with relation name
- Symmetric join, probing phase
  - Probing in parallel with building (with callbacks), locally
  - Satisfying tuples are either sent to the Qsite or DHT-ed for the pipelined op
- Consumes a lot of bandwidth
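The building and probing phases can be sketched as a pipelined symmetric hash join. The sketch runs in one process and stands in for a single node that receives rehashed, relation-tagged tuples; the function name, the `(relation, tuple)` stream encoding, and the key extractors are assumptions, not Pier's actual interface.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator, Tuple


def symmetric_hash_join(
    stream: Iterable[Tuple[str, tuple]],
    key_r: Callable[[tuple], Hashable],
    key_s: Callable[[tuple], Hashable],
) -> Iterator[Tuple[tuple, tuple]]:
    """Join tuples tagged "R" or "S" as they arrive, yielding (r, s) pairs."""
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for rel, tup in stream:
        key = key_r(tup) if rel == "R" else key_s(tup)
        other = "S" if rel == "R" else "R"
        # Build: remember the tuple for future arrivals from the other side.
        tables[rel][key].append(tup)
        # Probe: match against everything already seen from the other side.
        for match in tables[other].get(key, []):
            yield (tup, match) if rel == "R" else (match, tup)
```

Results appear as soon as both halves of a match have arrived, so the join pipelines. Every qualifying tuple of both relations still has to be shipped, which matches the slide's note about bandwidth.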

Better Joins
- Fetch Matches
  - Hash only S
  - lscan R and fetch NS tuples
- Rewriting the join using 2-way semijoins
  - Project R & S on their PK and joining attribute
  - Do a symmetric join on these projections
- Rewriting the join using Bloom filters
  - Create and DHT the Bloom filters
  - Do lscan and access the Bloom filter to eliminate non-joinable tuples
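The Bloom-filter rewrite can be sketched as follows: summarize the join keys of one relation in a small bit vector, then drop tuples of the other relation whose key is definitely absent before shipping anything. The `BloomFilter` class, its default sizes, and the `prefilter` helper are illustrative assumptions.

```python
import hashlib
from typing import Callable, Hashable, Iterable, Iterator, List


class BloomFilter:
    """Compact set summary: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # Python int used as a bit vector

    def _positions(self, item: Hashable) -> Iterator[int]:
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: Hashable) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: Hashable) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))


def prefilter(
    tuples: Iterable[tuple],
    other_keys: BloomFilter,
    key: Callable[[tuple], Hashable],
) -> List[tuple]:
    """Keep only tuples whose join key may exist on the other side."""
    return [t for t in tuples if key(t) in other_keys]
```

Tuples that pass the filter still go through the real join, so a false positive costs only some wasted bandwidth and never a wrong answer.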

Conclusions for Pier
- P2P brings massive parallelism
- Repetitive data comparison over a DHT brings along massive waste of bandwidth
- Smarter in situ distillation (2-way semijoins, Bloom filters) works better