A COURSE ON PROBABILISTIC DATABASES June, 2014Probabilistic Databases - Dan Suciu 1.

Slides:



Advertisements
Similar presentations
Uncertainty in Data Integration Ai Jing
Advertisements

Three-Step Database Design
1 Probability and the Web Ken Baclawski Northeastern University VIStology, Inc.
Lower Bounds for Exact Model Counting and Applications in Probabilistic Databases Paul Beame Jerry Li Sudeepa Roy Dan Suciu University of Washington.
Model Counting of Query Expressions: Limitations of Propositional Methods Paul Beame 1 Jerry Li 2 Sudeepa Roy 1 Dan Suciu 1 1 University of Washington.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization Christopher Re and Dan Suciu University of Washington 1.
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David.
Jianxin Li, Chengfei Liu, Rui Zhou Swinburne University of Technology, Australia Wei Wang University of New South Wales, Australia Top-k Keyword Search.
Probabilistic Information Retrieval Part I: Survey Alexander Dekhtyar department of Computer Science University of Maryland.
RankSQL: Supporting Ranking Queries in RDBMS Chengkai Li (UIUC) Mohamed A. Soliman (Univ. of Waterloo) Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (Univ.
Representing and Querying Correlated Tuples in Probabilistic Databases
Cleaning Uncertain Data with Quality Guarantees Reynold Cheng, Jinchuan Chen, Xike Xie 2008 VLDB Presented by SHAO Yufeng.
Online Filtering, Smoothing & Probabilistic Modeling of Streaming Data In short, Applying probabilistic models to Streams Bhargav Kanagal & Amol Deshpande.
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
LAHAR: Extracting Events from Probabilistic Streams Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu University of Washington.
PAPER BY : CHRISTOPHER R’E NILESH DALVI DAN SUCIU International Conference on Data Engineering (ICDE), 2007 PRESENTED BY : JITENDRA GUPTA.
Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu University of Washington.
“Lineage/Provenance” Workgroup Report Birgitta, Amol, Ihab, Thomas, Anish, Martin, Matthias.
Creating Probabilistic Databases from IE Models Olga Mykytiuk, 21 July 2011 M.Theobald.
Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington.
Efficient Query Evaluation on Probabilistic Databases
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DATABASE SYSTEMS GROUP DEPARTMENT INSTITUTE FOR INFORMATICS Probabilistic Similarity Queries in Uncertain Databases.
Exact Model Counting: limitations of SAT-solver based methods Paul Beame Jerry Li Sudeepa Roy Dan Suciu University of Washington [UAI 13], [ICDT 14]
Lineage Processing over Correlated Probabilistic Databases Bhargav Kanagal Amol Deshpande University of Maryland.
Sensitivity Analysis & Explanations for Robust Query Evaluation in Probabilistic Databases Bhargav Kanagal, Jian Li & Amol Deshpande.
A Probabilistic Framework for Information Integration and Retrieval on the Semantic Web by Livia Predoiu, Heiner Stuckenschmidt Institute of Computer Science,
1 Probabilistic/Uncertain Data Management -- III Slides based on the Suciu/Dalvi SIGMOD’05 tutorial 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic.
1 Management of Probabilistic Data: Foundations and Challenges Nilesh Dalvi and Dan Suciu Univerisity of Washington.
Probabilistic Information Retrieval Part II: In Depth Alexander Dekhtyar Department of Computer Science University of Maryland.
MystiQ The HusQies* *Nilesh Dalvi, Brian Harris, Chris Re, Dan Suciu University of Washington.
1 Probabilistic/Uncertain Data Management -- IV 1.Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’ Sen, Deshpande. “Representing.
Relational Model & Relational Algebra. 2 Relational Model u Terminology of relational model. u How tables are used to represent data. u Connection between.
Sensor Data Management: Challenges and (some) Solutions Amol Deshpande, University of Maryland.
Unifying Data and Domain Knowledge Using Virtual Views IBM T.J. Watson Research Center Lipyeow Lim, Haixun Wang, Min Wang, VLDB Summarized.
DBrev: Dreaming of a Database Revolution Gjergji Kasneci, Jurgen Van Gael, Thore Graepel Microsoft Research Cambridge, UK.
Lecture 1: Introduction Faculty of Computer Science Technion – Israel Institute of Technology Spring 2015.
Performing Bayesian Inference by Weighted Model Counting Tian Sang, Paul Beame, and Henry Kautz Department of Computer Science & Engineering University.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.
A Survey Based Seminar: Data Cleaning & Uncertain Data Management Speaker: Shawn Yang Supervisor: Dr. Reynold Cheng Prof. David Cheung
Structured Querying of Web Text A Technical Challenge Kulsawasd Jitkajornwanich University of Texas at Arlington CSE6339 Web Mining.
Querying Business Processes Under Models of Uncertainty Daniel Deutch, Tova Milo Tel-Aviv University ERP HR System eComm CRM Logistics Customer Bank Supplier.
Structured Querying of Web Text: A Technical Challenge Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko University of Washington.
9/7/2012ISC329 Isabelle Bichindaritz1 The Relational Database Model.
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Event Detection and Notification in the World-Wide Sensor Web Magdalena Balazinska with Evan Welbourne, Garret Cole, Nodira Khoussainova, Julie Letchner,
CSC271 Database Systems Lecture # 7. Summary: Previous Lecture  Relational keys  Integrity constraints  Views.
Model Counting of Query Expressions: Limitations of Propositional Methods Paul Beame 1 Jerry Li 2 Sudeepa Roy 1 Dan Suciu 1 1 University of Washington.
A Unified Approach to Ranking in Probabilistic Databases Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA VLDB
Scrubbing Query Results from Probabilistic Databases Jianwen Chen, Ling Feng, Wenwei Xue.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Modified by Veeranjaneyulu Sadhanala.
Chapter 13: Query Processing
A Course on Probabilistic Databases
Probabilistic Data Management
Approximate Lineage for Probabilistic Databases
Probabilistic Data Management
Queries with Difference on Probabilistic Databases
Data Integration with Dependent Sources
Lecture 16: Probabilistic Databases
CS 188: Artificial Intelligence
Probabilistic Databases
Knowledge Compilation: Representations and Lower Bounds
Probabilistic Ranking of Database Query Results
Probabilistic Databases with MarkoViews
Presentation transcript:

A COURSE ON PROBABILISTIC DATABASES June, 2014Probabilistic Databases - Dan Suciu 1

Probabilistic Databases Data: standard relational data, plus probabilities that measure the degree of uncertainty Queries: standard SQL queries, whose answers are annotated with output probabilities June, 2014Probabilistic Databases - Dan Suciu 2

A Little History of Probabilistic DBs Early days Wong’82 Shoshani’82 Cavallo&Pittarelli’87 Barbara’92 Lakshmanan’97,’01 Fuhr&Roellke’97 Zimanyi’97 Recent work Stanford (Trio) UW (MystiQ) Cornell (MayBMS) Oxford (MayBMS) U.of Maryland IBM Almaden (MCDB) Rice (MCDB) U. of Waterloo UBC U. of Florida Purdue University U. of Wisconsin June, 2014Probabilistic Databases - Dan Suciu Main challenge: Query Evaluation (=Probabilistic Inference) 3

Why? Many applications need to manage uncertain data Information extraction Knowledge representation Fuzzy matching Business intelligence Data integration Scientific data management Data anonymization June, 2014Probabilistic Databases - Dan Suciu 4

What? Probabilistic Databases extend Relational Databases with probabilities Combine Formal Logic with Probabilistic Inference Requires a new thinking for both databases and probabilistic inference June, 2014Probabilistic Databases - Dan Suciu 5

This Course: Query Evaluation June, 2014Probabilistic Databases - Dan Suciu 6 Based on the book:

Part 1 Part 2 Part 3 Part 4 1. Motivating Applications 2. The Probabilistic Data ModelChapter 2 3. Extensional Query PlansChapter The Complexity of Query EvaluationChapter 3 5. Extensional EvaluationChapter Intensional EvaluationChapter 5 7. Conclusions Outline of the Tutorial June, 2014Probabilistic Databases - Dan Suciu 7

What You Will Learn Background: Relational data model: tables, queries, relational algebra PTIME, NP, #P Model counting: DPLL, OBDD, FBDD, d-DNNF In detail: Extensional plans, extensional evaluation, running them in postgres The landscape of query complexity: from PTIME to #P-complete, Query compilation: Read-Once Formulas, OBDD, FBDD, d-DNNF Less detail: The #P-hardness proof, complexity of BDDs Omitted: Richer data models: BID, GM, XML, continuous random values) Approximate query evaluation, Ranking query answers June, 2014Probabilistic Databases - Dan Suciu 8

Related Work. See book, plus: These references are not in the book Wegener: Branching programs and binary decision diagrams: theory and applications, 2000 Dalvi, S.: The dichotomy of probabilistic inference for unions of conjunctive queries, JACM’2012 Huang, Darwiche: DPLL with a Trace: From SAT to Knowledge Compilation, IJCAI 2005 Beame, Li, Roy, S.: Lower Bounds for Exact Model Counting and Applications in Probabilistic Databases, UAI’13 Gatterbauer, S.: Oblivious Bounds on the Probability of Boolean Functions, under review The applications are from: Ré, Letchner, Balazinska, S: Event queries on correlated probabilistic streams. SIGMOD Conference 2008 Gupta, Sarawagi: Creating Probabilistic Databases from Information Extraction Models. VLDB 2006 Stoyanovich, Davidson, Milo, Tannen: Deriving probabilistic databases with inference ensembles. ICDE 2011 Beskales, Soliman, Ilyas, Ben-David: Modeling and Querying Possible Repairs in Duplicate Detection. PVLDB 2009 Kumar, Ré: Probabilistic Management of OCR Data using an RDBMS. PVLDB 2011 June, 2014Probabilistic Databases - Dan Suciu 9

A COURSE ON PROBABILISTIC DATABASES Lecture 1: Motivating Applications June, 2014Probabilistic Databases - Dan Suciu 10

Outline 1. Motivating Applications 2. The Probabilistic Data ModelChapter 2 3. Extensional Query PlansChapter The Complexity of Query EvaluationChapter 3 5. Extensional EvaluationChapter Intensional EvaluationChapter 5 7. Conclusions June, 2014Probabilistic Databases - Dan Suciu 11 Part 1 Part 2 Part 3 Part 4

Example 1: Information Extraction June, 2014Probabilistic Databases - Dan Suciu 52-A Goregaon West Mumbai CRF Standard DB: keep the most likely extraction Probabilistic DB: keep most/all extractions to increase recall 12 [Gupta’2006] Key finding: the probabilities given by CRFs correlate well with the precision of the extraction.

Example 2: Modeling Missing Data June, 2014Probabilistic Databases - Dan Suciu Standard DB: NULL Probabilistic DB: distribution on possible values 13 [Stoyanovich’2011] Key technique: Meta Rule Semi-Lattice for inferring missing attributes.

Example 3: Data Cleaning June, 2014Probabilistic Databases - Dan Suciu Standard DB cleaning data means choosing one possible repair Probabilistic DB keep many/all possible repairs 14 [Beskales’2009] Challenge: Representing multiple repairs. [Beskaes’2009] restrict to hierarchical repairs.

Example 4: OCR June, 2014Probabilistic Databases - Dan Suciu They use OCRopus from Google Books: output is a stohastic automaton Traditionally: retain only the Maximum Apriori Estimate (MAP) With a probabilistic database: may retain several alternative recognitions: increase recall. 15 [Kumar’2011] SELECT DocId, Loss FROM Claims WHERE Year = 2010 AND DocData LIKE '%Ford%’;

Summary of Applications Structured, but uncertain data Modeled as probabilistic data Answers to SQL queries annotated with probabilities June, 2014Probabilistic Databases - Dan Suciu 16 Probabilistic database: Combine data management with probabilistic inference