Sam Madden With a cast of many….

BIG Data
MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Example: Medical Costs
MGH Cancer Center "Super-Database"
• Largest cancer database in the world (173,301 patients)
• Based on national tumor registry
• Cross-linked with death registry
• Includes billing, reports, labs, imagery, genome SNPs
Question: What are the factors driving costs for lung cancer patients?
Some results:
• No correlation of cost with:
– Stage of presentation
– Survival
• Strong correlation of cost with oncologist!
Source: Dr. James Michaelson, PhD, MGH, Harvard Medical School

Challenge: Making Data Accessible
Beyond scalable platforms (Main Memory DBs, Column Oriented DBs, Map Reduce, Super Duper Indexes):
• What does the data look like?
• How do I correlate it with other data sets?
• How do I present it to users/execs?
• Where are these anomalies and outliers coming from?

Introducing Datahub
Challenge: Making Data Accessible
[Figure: Octocat, the GitHub mascot + DB Technology = Datahub]

Introducing Datahub
• Data Commons
• Selective Sharing and Access Control
• Easy to Find, Combine, Clean Data Sets
• Secure, Hosted Data Storage ("Database Service")
• Ability to Browse, Visualize, and Query Data in situ

Lots of other places to find data!
Aka … just a bunch of zip files
Versus open, linked data (Tim Berners-Lee taxonomy):
★ make your stuff available on the Web under an open license
★★ make it available as structured data
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
Datahub: "five-star" integrated, browse-able, and query-able repository of linked data

Datahub Interface
Anant Bhardwaj

"Wrangling" Features
Wrangler: Interactive Visual Specification of Data Transformation Scripts. Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer

Data Wrangling

Post-Wrangling

More Datahub Interface
• Versions
• Browsing and Visualization

MIT Living Lab: A Dogfood Eating Exercise
Goal: allow the MIT community to access, selectively share, and use data about itself, using DataHub.
The MIT Data Hub combines organizational data, personal data, and public data:
• MIT data: ID card swipes, network packets, expense reports, medical data, payroll, parking garages, buses and cars, course catalogs, registrar, benefits, on-campus events/seminars
• Infrastructure: energy, HVAC, maintenance, etc.
• Academic/Research: publications, presentations, research data…
• Personal Data: location/GPS, calendar, video/pictures, exercise/physio data, application usage, meetings…
• Relevant Linked Data: local transit/transport data, crime data, nearby restaurants, events, etc.

What Will Data Hub Enable at MIT?
• Campus "Quantification"
– Is going to class correlated with better grades?
– Which dining facilities are most popular amongst different groups?
• Transportation planning:
– Bus utilization and on-demand routing
– Parking lot utilization
– Carpool finding, etc.
• Health + Medical:
– Campus-wide public health, e.g., flu tracking
– Observing who is missing class, depressed
– Health signals: exercise and eating habits; partners
– Outpatient care
• Research:
– Expert finding
– Data sharing between groups

Challenges: It's Not All Fuzzy Stuff
Platform Challenges:
• How to efficiently store thousands or millions of databases?
• How to anonymize data, control access, etc.?
• How to keep data private while allowing queries over it?
Challenges in Improving Interaction with Databases:
• Data Cleaning and Integration
• Interactive Data Presentation
• Understanding Why Results are the Way They Are
• How to Leverage Experts in an Organization
Projects highlighted on the slide: Monomi, MapD, Scorpion
We also don't want our research to be like this guy [image]

Private Data Problem
• Confidential data leaks
• 2012: hackers extracted 6.5 million hashed passwords from the DB of LinkedIn
Threat: passive DB server attacks (hackers, system administrator)
[Diagram: User 1, User 2, User 3 → Application → SQL → DB Server (Datahub) holding sensitive content]

How to protect data confidentiality?
• Encrypt data: the server may not be able to process queries!
• Compute on encrypted data!
• Without giving the server the encryption key!
General approach has been proposed several times…
[Diagram: Client sends [request] to the DB Server holding sensitive content and receives [result]]

Monomi / CryptDB
1. Process SQL queries on encrypted data: hide the DB from sys. admins, outsource the DB to the cloud
2. Modest overhead
3. No changes to DBMS (e.g., Postgres, MySQL) and no changes to applications
Threat 1: passive DB server attacks
[Diagram: User 1, User 2, User 3 → Application → SQL → DB Server holding sensitive content]
w/ Raluca Popa, Stephen Tu, Hari Balakrishnan, Frans Kaashoek, Nickolai Zeldovich

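SQL Queries on Encrypted Data: Example
Application issues: SELECT * FROM emp WHERE salary = 100
Proxy rewrites to: SELECT * FROM table1 WHERE col3 = x5a8c34
Server stores table1 (emp) with columns col1/rank, col2/name, col3/salary
• Randomized encryption of col3/salary: x4be219, x95c623, x2ea887, x17cea7
• Deterministic encryption of col3/salary: x934bc1, x5a8c34, x84a21c
Rows returned to the proxy: those whose deterministic ciphertext equals x5a8c34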
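The equality rewrite on this slide can be sketched in a few lines. This is a minimal illustration, not CryptDB's implementation: a truncated HMAC stands in for deterministic encryption (it cannot be decrypted, unlike a real deterministic cipher), random bytes stand in for randomized ciphertexts, and the key, the `det_token` and `rewrite_equality` helpers, and the sample salaries are all hypothetical.

```python
import hashlib
import hmac
import os
import sqlite3

KEY = b"demo-key-held-by-proxy"  # hypothetical key; never sent to the server


def det_token(value: int) -> str:
    """Deterministic keyed token: equal plaintexts always map to equal tokens."""
    digest = hmac.new(KEY, str(value).encode(), hashlib.sha256).hexdigest()
    return "x" + digest[:12]


# Untrusted server: sees only anonymized table/column names and opaque values.
server = sqlite3.connect(":memory:")
server.execute("CREATE TABLE table1 (col1 TEXT, col2 TEXT, col3 TEXT)")
for salary in [60, 100, 800, 100]:  # hypothetical plaintext salaries
    server.execute(
        "INSERT INTO table1 VALUES (?, ?, ?)",
        # Random bytes stand in for randomized ciphertexts of rank and name.
        (os.urandom(3).hex(), os.urandom(3).hex(), det_token(salary)),
    )


def rewrite_equality(salary: int):
    """Proxy step: rewrite 'FROM emp WHERE salary = ?' against the encrypted schema."""
    return "SELECT col3 FROM table1 WHERE col3 = ?", (det_token(salary),)


sql, params = rewrite_equality(100)
matches = server.execute(sql, params).fetchall()
print(len(matches), "matching encrypted rows")
```

The server can evaluate the equality predicate because equal salaries produce equal tokens, which is also exactly what a deterministic scheme leaks.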

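SQL Queries on Encrypted Data: Range Example
Application issues: SELECT * FROM emp WHERE salary ≥ 100
Proxy rewrites to: SELECT * FROM table1 WHERE col3 ≥ x638e54
Server stores table1 (emp) with columns col1/rank, col2/name, col3/salary
• Deterministic encryption of col3/salary: x934bc1, x5a8c34, x84a21c
• OPE (order-preserving) encryption of col3/salary: x638e54, x922eb4, x1eab81
Rows returned to the proxy: x638e54, x922eb4, x638e54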
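To make the order-preserving idea concrete, here is a toy encoding that maps small non-negative integers to a strictly increasing sequence by summing keyed pseudorandom positive gaps. It is an assumption-laden sketch for illustration only: it is not the OPE scheme used by CryptDB or Monomi, it offers no real security, and the key, function names, and salaries are hypothetical.

```python
import hashlib
import hmac

KEY = b"demo-ope-key"  # hypothetical key held by the proxy


def _gap(i: int) -> int:
    """Keyed pseudorandom gap (always >= 1) associated with plaintext i."""
    digest = hmac.new(KEY, str(i).encode(), hashlib.sha256).digest()
    return 1 + int.from_bytes(digest[:2], "big")


def ope_encode(value: int) -> int:
    """Toy order-preserving encoding: a < b implies ope_encode(a) < ope_encode(b)."""
    if value < 0:
        raise ValueError("toy scheme only supports non-negative integers")
    return sum(_gap(i) for i in range(value + 1))


salaries = [60, 100, 800]                    # hypothetical plaintexts
column = [ope_encode(s) for s in salaries]   # what the server would store

# The server evaluates the range predicate directly on encoded values.
threshold = ope_encode(100)
selected = [s for s, c in zip(salaries, column) if c >= threshold]
print(selected)
```

Because every gap is at least 1, the encoding is strictly increasing, so the server's comparison gives the same answer as a comparison on plaintext. The cost is that the server learns the relative order of all values.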

Monomi: Protecting Data in Datahub
• Extensions to CryptDB to efficiently support OLAP queries
• Show how to run all of TPC-H, rather than just 4 of 22 queries
– Key insight: split queries, run as much as possible on the untrusted DBMS, compute the remainder on the trusted client
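The split-execution idea can be illustrated with a small sketch: the untrusted server filters rows using a deterministic token, and the trusted client decrypts the surviving values and finishes the aggregate. This is only an illustration of the split under stated assumptions, not Monomi's planner or its encryption schemes; the HMAC-based token, the toy XOR cipher, the key, the table layout, and the sample rows are all hypothetical.

```python
import hashlib
import hmac
import os
import sqlite3

KEY = b"demo-client-key"  # hypothetical; stays on the trusted client


def det_token(text: str) -> str:
    """Deterministic token so the server can test equality without the plaintext."""
    return hmac.new(KEY, b"det:" + text.encode(), hashlib.sha256).hexdigest()[:16]


def _pad(nonce: bytes) -> bytes:
    """8-byte keystream derived from the key and a per-value nonce (toy cipher)."""
    return hmac.new(KEY, b"rnd:" + nonce, hashlib.sha256).digest()[:8]


def rnd_encrypt(value: int) -> bytes:
    """Toy randomized encryption: a fresh nonce makes equal values look different."""
    nonce = os.urandom(8)
    body = bytes(a ^ b for a, b in zip(value.to_bytes(8, "big"), _pad(nonce)))
    return nonce + body


def rnd_decrypt(blob: bytes) -> int:
    nonce, body = blob[:8], blob[8:]
    return int.from_bytes(bytes(a ^ b for a, b in zip(body, _pad(nonce))), "big")


# Untrusted server: stores tokens and ciphertexts only.
server = sqlite3.connect(":memory:")
server.execute("CREATE TABLE t (dept_tok TEXT, salary_ct BLOB)")
for dept, salary in [("eng", 100), ("eng", 140), ("ops", 90)]:  # hypothetical rows
    server.execute("INSERT INTO t VALUES (?, ?)", (det_token(dept), rnd_encrypt(salary)))


def avg_salary(dept: str) -> float:
    # Server-side part: the selection runs on encrypted data.
    rows = server.execute(
        "SELECT salary_ct FROM t WHERE dept_tok = ?", (det_token(dept),)
    ).fetchall()
    # Client-side remainder: decrypt and aggregate locally.
    values = [rnd_decrypt(ct) for (ct,) in rows]
    return sum(values) / len(values)


result = avg_salary("eng")
print(result)
```

Pushing the filter to the server keeps the amount of data shipped to the client small, which is where the benefit of splitting comes from in this sketch.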

Monomi vs Plaintext
TPC-H SF10, Postgres
[Chart: Monomi runtime vs plaintext]
Takeaway: median overhead 1.24x
See Stephen explain how it really works right after this talk!

Many Open Problems
• Understanding performance more broadly
• How to reason about security of non-randomized schemes?
• Auditing, information flow, etc.

DataHub Research Challenges
Platform Challenges:
• How to efficiently store thousands or millions of databases?
• How to anonymize data, control access, etc.?
• How to keep data private while allowing queries over it?
Challenges in Improving Interaction with Databases:
• Data Cleaning and Integration
• Interactive Data Presentation
• Understanding Why Results are the Way They Are
• How to Leverage Experts in an Organization
Projects highlighted on the slide: Monomi, MapD, Scorpion

Interactive Large-Scale Visualization using a GPU Database

The Need for Interactive Analytics
• DataHub needs to support browsing massive data sets
• Browsing is best supported through visualization
• This requires ad-hoc analytics with millisecond response times

MapD: GPU Accelerated SQL Database
• Key insight: GPUs have enough memory that a cluster of them can store substantial amounts of data
• Not an accelerator, but a full-blown query processor!
• Massive parallelism enables interactive browsing interfaces
– 4x GPUs can provide > 1 TB/sec of bandwidth
– 12 Tflops compute
– Order-of-magnitude speedups over CPUs, when data is on the GPU
• "Shared nothing" arrangement
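A quick back-of-the-envelope calculation shows why the bandwidth figure matters for interactivity. The sketch below assumes a purely bandwidth-bound full scan and takes the slide's "> 1 TB/sec across 4 GPUs" as exactly 1 TB/sec; the 10 GB working-set size is a hypothetical example, not a number from the talk.

```python
def scan_seconds(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to read data_bytes once at the given bandwidth."""
    return data_bytes / bandwidth_bytes_per_s


GB = 1e9
TB = 1e12

aggregate_bw = 1 * TB          # slide figure, treated as exactly 1 TB/sec
per_gpu_bw = aggregate_bw / 4  # implied per-GPU share across 4 GPUs

working_set = 10 * GB          # hypothetical amount of column data scanned
t = scan_seconds(working_set, aggregate_bw)
print(f"{t * 1000:.0f} ms to scan {working_set / GB:.0f} GB")
```

Under these assumptions a full scan of 10 GB takes on the order of ten milliseconds, which is consistent with the interactive response times the talk asks for.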

Demo

Next Steps
• Scale out to many nodes, automate layout algorithms
• Add various advanced analytics (e.g., machine learning algorithms)
• Generalize visualization beyond maps

Visual Provenance: Scorpion
• Visualization of data is the most common form of big data analysis
• Common problem: outliers
• Would be nice to have a tool that identifies why outliers exist

Definition of Why
Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.
[Diagram: i = input data, p = predicate, outlier group, output visualization]
Example: removing the records that match the predicate makes US no longer an outlier.
What are common properties of those records? {Bill Gates, Steve Ballmer}
p: Company = MSFT
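The definition can be made concrete with a small check: compute the aggregate per group, drop the records matching a candidate predicate, and see whether the flagged group is still an outlier. The records, the wealth figures, and the "more than five times the mean of the other groups" outlier test below are all made up for illustration; the talk does not specify how outliers are flagged.

```python
from statistics import mean

# Hypothetical input records; wealth values are invented units.
rows = [
    {"country": "US", "name": "exec_1", "company": "MSFT", "wealth": 900},
    {"country": "US", "name": "exec_2", "company": "MSFT", "wealth": 700},
    {"country": "US", "name": "person_a", "company": "Other", "wealth": 10},
    {"country": "US", "name": "person_b", "company": "Other", "wealth": 12},
    {"country": "CA", "name": "person_c", "company": "Other", "wealth": 11},
    {"country": "CA", "name": "person_d", "company": "Other", "wealth": 9},
    {"country": "UK", "name": "person_e", "company": "Other", "wealth": 10},
    {"country": "UK", "name": "person_f", "company": "Other", "wealth": 12},
]


def avg_by_country(records):
    """The aggregation behind the visualization: AVG(wealth) GROUP BY country."""
    groups = {}
    for r in records:
        groups.setdefault(r["country"], []).append(r["wealth"])
    return {country: mean(values) for country, values in groups.items()}


def is_outlier(group, averages, factor=5.0):
    """Assumed outlier test: group average exceeds factor x mean of other groups."""
    others = [v for g, v in averages.items() if g != group]
    return averages[group] > factor * mean(others)


def predicate(record):
    """Candidate explanation p: Company = MSFT."""
    return record["company"] == "MSFT"


before = avg_by_country(rows)
after = avg_by_country([r for r in rows if not predicate(r)])
print(is_outlier("US", before), is_outlier("US", after))
```

With this data the US average drops from 405.5 to 11 once the matching records are removed, so the predicate counts as an explanation in the sense of the slide.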

Why is this hard?
• Exponential search space over records, attributes
• In general, each candidate predicate requires re-running aggregation
• Desire for simple, understandable predicates and a general-purpose visualization framework
[Slide builds show AVG(rows) recomputed for successive candidate selections: 2.7, 2.9, 2.2, 3.3, 3.1, …]
See Eugene explain how it really works this afternoon!
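A brute-force search makes both costs visible: the number of candidate predicates grows with every attribute and value, and each candidate forces the aggregate to be recomputed. The sketch below is a naive enumeration, not Scorpion's algorithm; the sensor readings, the attribute names, and the "expected average" target are hypothetical.

```python
from itertools import combinations, product
from statistics import mean

# Hypothetical input rows; "v" is the value being averaged.
rows = [
    {"sensor": "s1", "hour": 1, "v": 20},
    {"sensor": "s1", "hour": 2, "v": 21},
    {"sensor": "s2", "hour": 1, "v": 90},
    {"sensor": "s2", "hour": 2, "v": 95},
    {"sensor": "s3", "hour": 1, "v": 19},
    {"sensor": "s3", "hour": 2, "v": 20},
]


def candidate_predicates(records, attrs):
    """Yield every conjunction of equality tests over every subset of attributes."""
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            domains = [sorted({r[a] for r in records}) for a in subset]
            for values in product(*domains):
                yield dict(zip(subset, values))


def best_predicate(records, attrs, target):
    """Try every candidate, re-running AVG after removing the matching rows."""
    evaluations = 0
    best = None
    for pred in candidate_predicates(records, attrs):
        kept = [r["v"] for r in records
                if any(r[a] != val for a, val in pred.items())]
        if not kept:
            continue  # predicate removes everything; not a useful explanation
        evaluations += 1
        score = abs(mean(kept) - target)  # distance from the expected average
        if best is None or score < best[0]:
            best = (score, pred)
    return best[1], evaluations


pred, evaluations = best_predicate(rows, ["sensor", "hour"], target=20)
print(pred, evaluations)
```

Even this tiny table with two attributes needs 11 aggregate evaluations; adding attributes or distinct values multiplies the count, which is the blow-up the slide refers to.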

Next Steps
• A general-purpose visualization language for expressing visualizations with provenance support
• References to the underlying data set

Conclusion
• Big Data is a cry for help from non-DB people
• Lots of exciting work on scalable systems
• DB community should be doing a much better job of helping users use data
– We risk losing mindshare
• Datahub aims to make data easy to find, visualize, and query, securely and efficiently
• Many fascinating, hard problems! (Monomi, MapD, Scorpion)