Probabilistic answers to relational queries (PARQ)
Octavian Udrea, Yu Deng, Edward Hung, V. S. Subrahmanian

Content
• Motivation and goals
• Running example
• Technical preliminaries
• CPO model
• CPO integration
• CPO inference algorithms
• Experimental results
• Ongoing work

Motivation
• Query algebras do not take semantics into account when computing answers
• Data is not always precise
  - Ambiguity, insufficient information
• Goal: Use probabilistic ontologies to improve query answer recall and quality

The probabilistic solution
• Compute and return answers with high probability (> p_thr)
• Keep probabilities hidden from the user
• Problems
  - How do we assign a probability to each data item?
  - How do we choose p_thr?
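A minimal sketch of the thresholding step described above, assuming candidate answers already carry a probability. The function name, the dictionary layout, and the second candidate answer with its 0.25 score are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def answers_above_threshold(scored: Dict[str, float], p_thr: float) -> List[str]:
    """Return the answers whose probability exceeds p_thr, most probable first.

    The probabilities are used for filtering and ordering only; they are not
    part of the returned value, so they stay hidden from the user.
    """
    kept = [(p, answer) for answer, p in scored.items() if p > p_thr]
    # Sort by descending probability, then alphabetically for a stable order.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [answer for _, answer in kept]


# Hypothetical candidates for "What type of board meeting is being discussed?"
candidates = {"board of directors meeting": 0.75, "advisory board meeting": 0.25}
print(answers_above_threshold(candidates, p_thr=0.5))
```

The comparison is strict, matching the "> p_thr" condition on the slide; how p_thr should be chosen is left open here, as it is in the talk.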

Concepts
• Constraint probabilistic ontologies
  - Is-a graph with edges labeled with probabilities
  - Including conditional probabilities
  - Disjoint decompositions
• Ontologies associated with terms in a data source
  - Attributes in a relation/XML
  - Propositional entities in text sources

Running example
Fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."

Example: decompositions
Fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."

Example: probability labels
Fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."

Example: conditional probabilities
Fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."

Running example: Sample queries
• "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."
• What type of board meeting is being discussed?
  - Since Ed Masters is present, there is a 75% probability it is a board of directors meeting
• What type of financial unit is referenced?
  - Since the subject is marketing policy, there is a 65% probability it is the Financial Review Board

Technical preliminaries: POB
• POB schema:
  - C is a finite set of classes
  - (C, =>) is a directed acyclic graph
  - me produces clusters (disjoint decompositions) for each node
  - The probability labeling maps each edge in (C, =>) to a positive rational number in [0,1]
• Example: me(OrganizationUnit) = {{Committee, Board, Team, Department}, {Legal, Executive, Financial, Marketing}}

Back to the example
Fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."

Constraint probabilities
• Simple constraints:
  - Only for entities NOT represented in the current ontology
  - Nil constraint:
• Constraint probabilities: a pair (p, γ), with p in [0,1] and γ a conjunction of simple constraints

Labeling
• Labeling should not be arbitrary
  - Invalid labeling may lead to time-consuming consistency algorithms
  - And to ambiguity in interpreting query answers
• Valid labeling:
  - No constraint refers to the entities associated with this ontology
  - There is exactly one nil constraint probability on each edge

The CPO model
• CPO (C, =>, me, φ):
  - C is a finite set of classes
  - (C, =>) is a directed acyclic graph
  - me produces clusters (disjoint decompositions) for each node
  - φ is a valid labeling for (C, =>)
• Note there is no condition on the probabilities... yet!

CPO enhanced data sources
• Associate CPOs with some attributes of a relation.
• Associate CPOs with elements in an XML data store.
• Associate CPOs with some keywords for text files.
• CPO_k
  - At most k probabilities on each edge
  - CPO_1 is a POB

Answering queries
• "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."
• What type of board meeting is being discussed?
  - Since Ed Masters is present, there is a 75% probability it is a board of directors meeting
• Goal: Associate probabilities with possible answers.

Probability path
Fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."

Probability path
• if: c => c_1 => c_2 => … => c_k => d
  - f is a function defined on the chain
• f selects one probability on each edge
• is the set of constraints selected by f along with the probabilities

CPO consistency
• CPO (C, =>, me, φ)
• An arbitrary universe of objects O
• Interpretation ε is a mapping from C to 2^O
• ε is a taxonomic model iff:
  - We assign objects to each class
  - Objects cannot be shared between classes in the same cluster
  - => edges imply subset relations on the sets of objects assigned to each class
  - If A => B is labeled with probability p, at least a fraction p of the objects in A are also assigned to B
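The two structural conditions of a taxonomic model can be sketched as below. This is an illustration only: it assumes an edge (c, d) reads "c is a subclass of d", represents the interpretation ε as a dictionary from class names to Python sets, and deliberately leaves out the probability condition. The function name and data layout are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def satisfies_structure(
    interp: Dict[str, Set[int]],
    edges: Iterable[Tuple[str, str]],
    clusters: Iterable[List[str]],
) -> bool:
    """Check the subset and disjointness conditions of a taxonomic model."""
    # Edge condition: an edge c => d requires interp[c] to be a subset of interp[d].
    for c, d in edges:
        if not interp[c] <= interp[d]:
            return False
    # Cluster condition: classes in the same cluster share no objects.
    for cluster in clusters:
        seen: Set[int] = set()
        for cls in cluster:
            if seen & interp[cls]:
                return False
            seen |= interp[cls]
    return True


interp = {"Unit": {1, 2, 3}, "Board": {1}, "Team": {2}}
edges = [("Board", "Unit"), ("Team", "Unit")]
print(satisfies_structure(interp, edges, [["Board", "Team"]]))
```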

CPO consistency (cont’d)
• A CPO is consistent iff it has a taxonomic probabilistic model
• Deciding if a CPO is consistent is NP-complete
  - The weight formula satisfiability problem.
  - A non-deterministic algorithm for consistency checking is straightforward.

Consistency approach
• Identify a subclass of CPOs for which we can check consistency
• Two parts:
  - Pseudoconsistency – this was done for POBs
  - Well-structuredness – particular to CPOs

Pseudoconsistent CPO
• CPO
• No two classes in the same cluster have a common subclass
• The graph is rooted
• Any two distinct immediate subclasses of c either:
  - Have no common subclass
  - Have a greatest common subclass different from them
• No cycles
• If c inherits from multiple clusters, all paths from descendants of c to the root go through c
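As an illustration of the first condition (no two classes in the same cluster have a common subclass), here is a minimal sketch. It assumes the is-a graph is given as a dictionary from each class to its immediate subclasses, and it counts a class as its own subclass. The names are hypothetical, and the other pseudoconsistency conditions are not covered.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set


def descendants(cls: str, children: Dict[str, List[str]]) -> Set[str]:
    """All classes reachable downward from cls, including cls itself."""
    seen = {cls}
    stack = [cls]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def clusters_share_no_subclass(
    clusters: Iterable[Iterable[str]], children: Dict[str, List[str]]
) -> bool:
    """True if no two classes of any one cluster have a common subclass."""
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            if descendants(a, children) & descendants(b, children):
                return False
    return True
```

The pairwise check recomputes descendant sets and is meant for readability, not for matching the complexity bounds quoted later in the talk.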

Pseudoconsistency

Weight factor
• A set P of not-nil constraint probabilities
  - If P is the empty set, w_f(P) = 0
  - If P = {(p, γ)}, w_f(P) = p
  - w_f(P ∪ Q) = w_f(P) + w_f(Q) – w_f(P) * w_f(Q)
• Intuitive meaning: how many objects from class A do I have to assign to class B and satisfy the constraints?
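The three rules above can be applied by folding the union rule over the probabilities one at a time. The sketch below assumes the set P is passed as a plain list of its probabilities p, with the constraints γ dropped because they do not enter the arithmetic; the function name is hypothetical.

```python
from functools import reduce
from typing import Iterable


def weight_factor(probabilities: Iterable[float]) -> float:
    """Compute w_f for a set of non-nil constraint probabilities.

    Starts from w_f(empty set) = 0 and repeatedly applies
    w_f(P ∪ {(p, γ)}) = w_f(P) + p - w_f(P) * p.
    """
    return reduce(lambda w, p: w + p - w * p, probabilities, 0.0)


print(weight_factor([]))
print(weight_factor([0.5]))
print(weight_factor([0.5, 0.5]))
```

Worked by hand for two probabilities of 0.5: 0.5 + 0.5 - 0.5 * 0.5 = 0.75. The fold gives the same value as 1 minus the product of the (1 - p) terms.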

More weight factors
• CPO
• c => d an edge
• We write:
• We define:
• Result: Conditions of taxonomic interpretation can be satisfied by selecting at most w(c,d) * |O_d| objects from d into c.

Well-structured CPO
• Conditional constraints on edges from the same cluster must be disjoint
  - Otherwise, it is impossible to compute a weight factor for the cluster edges.
• The sum of the weight factors for edges in a cluster is ≤ 1
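A minimal sketch of the second condition, assuming the per-edge weights w(c, d) have already been computed and are supplied as a dictionary keyed by (subclass, superclass). The function name, the data layout, and the floating-point tolerance are assumptions; the disjointness condition on constraints is not checked here.

```python
from typing import Dict, List, Tuple

Cluster = Tuple[str, List[str]]  # (superclass, subclasses forming one cluster)


def overweight_clusters(
    clusters: List[Cluster],
    weights: Dict[Tuple[str, str], float],
    tol: float = 1e-9,
) -> List[Cluster]:
    """Return the clusters whose edge weights sum to more than 1."""
    bad = []
    for parent, members in clusters:
        total = sum(weights[(child, parent)] for child in members)
        if total > 1.0 + tol:
            bad.append((parent, members))
    return bad
```

An empty result means every cluster passes the sum condition.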

Well-structuredness

Consistent CPOs revisited
• A pseudoconsistent and well-structured CPO is consistent
  - Pseudoconsistency accounts for most of the conditions in the taxonomic interpretation
  - Well-structuredness accounts for the assignment of objects to subclasses

Consistency checking algorithm
• Pseudoconsistency is O(n^2 e) and well-structuredness is O(n^2 k^2)
  - n – the number of classes
  - e – the number of edges
  - k – the order of the CPO
• Algorithm based on:
  - Topological sort
  - Dijkstra and derivatives
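The talk does not say which topological sort is used. As one possibility, the sketch below uses Kahn's algorithm to order the classes and to detect cycles, which covers the "No cycles" condition of pseudoconsistency. The function name and the edge-list representation are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple


def topological_order(
    classes: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> Optional[List[str]]:
    """Return the classes in topological order, or None if the graph has a cycle."""
    classes = list(classes)
    indegree: Dict[str, int] = {c: 0 for c in classes}
    successors: Dict[str, List[str]] = {c: [] for c in classes}
    for c, d in edges:
        successors[c].append(d)
        indegree[d] += 1
    # Start from the classes that have no incoming edge.
    queue = deque(c for c in classes if indegree[c] == 0)
    order: List[str] = []
    while queue:
        c = queue.popleft()
        order.append(c)
        for d in successors[c]:
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    # Classes on a cycle never reach indegree 0, so they are missing from order.
    return order if len(order) == len(classes) else None
```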

CPO enhanced algebras
• CPO enhanced algebras formally defined for:
  - Relational data sources
  - XML data stores
  - Selection, projection, product, join, etc.
• Ongoing work: RDF enhanced query algebra
  - Directly related to RDF extraction from text.

CPO integration: motivation
From ACME corp. to EVIL corp.: "During your last FO board meeting, the rising costs of quality assurance were not addressed. We would like to include this in our next auditing committee meeting...."
[Figure: ACME corp. CPO and EVIL corp. CPO]

Merging CPOs
Two scenarios:
• One data source that refers to similar entities but from different application domains.
  - Example: ACME – EVIL correspondence
• Queries across multiple data sources
  - Example: Two different CPOs associated with distinct relations during a join query.

Interoperation constraints
• Since the CPOs being merged refer to similar entities, some classes may be equivalent
  - Equality constraints c_1 :=: c_2
  - Possibility: immediate subclassing constraints
  - Not really used – hardly feasible

The integration problem
• Two CPOs S_1 = (C_1, =>_1, me_1, φ_1), S_2 = (C_2, =>_2, me_2, φ_2)
• Set of interoperation constraints I
• An integration witness is another CPO S = (C, =>, me, φ) that satisfies S_1, S_2 and I

Integration witness
• Every class c in C_1 ∪ C_2
  - Appears in C, OR
  - c :=: d appears in I and d ∈ C
  - i.e. no classes get "lost"
• Similarly, no edges are lost
• No constraints are lost
  - If the same constraint carries a probability on the same edge in both CPOs, take a probability p between the two
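The "no classes get lost" condition can be checked directly. The sketch below assumes class sets are Python sets of names and equality constraints are (c, d) pairs. It only follows direct equalities, not chains of them, and the function name is hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def lost_classes(
    c1: Set[str],
    c2: Set[str],
    merged: Set[str],
    equalities: Iterable[Tuple[str, str]],
) -> Set[str]:
    """Return classes of C_1 ∪ C_2 that are neither in C nor equal to a class in C."""
    equal_to: Dict[str, Set[str]] = {}
    for a, b in equalities:
        # Treat c :=: d as symmetric.
        equal_to.setdefault(a, set()).add(b)
        equal_to.setdefault(b, set()).add(a)
    lost = set()
    for c in c1 | c2:
        if c in merged:
            continue
        if equal_to.get(c, set()) & merged:
            continue
        lost.add(c)
    return lost
```

An empty result means the candidate class set C passes this one condition; the edge and constraint conditions would need analogous checks.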

Integration witness (cont’d)
• Immediate subclassing constraints add edges to S
• No cluster can be split as a result of merging
• S is pseudoconsistent and well-structured (if it’s not, it’s of no use)
  - Open problem: If it is not, how can we minimally change it such that it has these properties?

CPOmerge algorithm
• CPOmerge produces an integration witness if one exists
• O(n^3) – costly
• In practice, much more efficient through:
  - Caching
  - Some properties are preserved if the original ontologies are pseudoconsistent and well-structured

Who writes the interop constraints?
• User – not feasible
• How to infer them?
• Intuitive solution: If enough neighbors are in equality constraints, then infer that the respective nodes should be equivalent.
  - But we still need some equivalence constraints to get started – use lexical distance
  - How many neighbors are "enough"?

ICI – Simple solution
• Neighbor: parent, immediate child, or sibling from the same cluster
• We define p_e for a pair of classes c, d, where:
  - n_e – number of neighbors in equality constraints
  - n_{c,d} – number of neighbors of c, d
• Why? Number of equal neighbors / total number of neighbors (including self).
• Always < 1
• ICI algorithm: if p_e exceeds a threshold, assume c and d are equal
  - Start with lexical distance
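The defining formula for p_e did not survive in the transcript. The sketch below is one reading of the verbal description: equal neighbors divided by total neighbors plus one for "self", which keeps the score strictly below 1. The "+ 1" and both function names are assumptions, not the authors' definition.

```python
def ici_score(n_e: int, n_cd: int) -> float:
    """Assumed score p_e = n_e / (n_cd + 1); the +1 counts 'self'."""
    if n_e < 0 or n_cd < n_e:
        raise ValueError("need 0 <= n_e <= n_cd")
    return n_e / (n_cd + 1)


def infer_equal(n_e: int, n_cd: int, threshold: float) -> bool:
    """Assume c and d are equal when the score exceeds the threshold."""
    return ici_score(n_e, n_cd) > threshold


print(ici_score(3, 3))
print(infer_equal(3, 3, threshold=0.7))
```

With three neighbors, all of them in equality constraints, this reading gives 3 / 4 = 0.75.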

Give me a CPO…
• Very little work so far on probabilistic ontologies.
  - Nothing resembling CPOs around
• How do we infer them:
  - How do we build disjoint decompositions?
  - How do we infer probabilities?

Building disjoint decompositions
• Take regular ontologies from the Web
  - Many sources: daml.org, SchemaWeb, OntoBroker
• Modify CPOmerge to ignore labeling
• The merge result will contain disjoint decompositions
• Equality constraints can be inferred through ICI

Infer probabilities – simple methods
• Simple methods:
  - Distribute probabilities uniformly within each cluster
  - For each cluster L in me(c), d => c,
  - For any distance function (lexical or otherwise)
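The uniform method can be sketched as follows, assuming that "uniformly" means each edge d => c from a cluster L in me(c) receives probability 1/|L|. The function name and the dictionary layout for me are hypothetical; exact fractions are used so each cluster sums to exactly 1.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def uniform_labels(me: Dict[str, List[List[str]]]) -> Dict[Tuple[str, str], Fraction]:
    """Label every edge (d, c), with d in a cluster L of me(c), with 1/|L|."""
    labels: Dict[Tuple[str, str], Fraction] = {}
    for parent, clusters in me.items():
        for cluster in clusters:
            for child in cluster:
                labels[(child, parent)] = Fraction(1, len(cluster))
    return labels


# Clusters of OrganizationUnit from the POB example earlier in the talk.
me = {
    "OrganizationUnit": [
        ["Committee", "Board", "Team", "Department"],
        ["Legal", "Executive", "Financial", "Marketing"],
    ]
}
labels = uniform_labels(me)
print(labels[("Board", "OrganizationUnit")])
```

Because the labels in a cluster sum to 1, a labeling built this way stays within the "sum ≤ 1" bound when these labels are the only weights on the cluster's edges.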

Advanced methods
• Probabilistic relational models with structural uncertainty
  - Work by Dr. Getoor et al.
• Classification approach
  - Feature extraction determines entities of interest
  - Create conditional probabilities on those entities
• User feedback approach
  - General, applicable to any of the above (ongoing work)

Experimental setup
• Java implementation
• CPO enhanced relational DB
  - Movies database maintained by Dr. Wiederhold
  - IMDB data
• IMDB to estimate recall
• Classifications from the Web to build initial CPO

Consistency check & inference

Recall

Precision

Answer quality

Query running time

ICI quality

Bottom line
• Clear improvement in query answer quality
  - Some time penalty, but reasonable
• Very little user intervention
• CPOs are suited for a wide variety of data sources
  - Potentially, they can be used to convey semantics across heterogeneous data sources

Current experimental setup
• DBLP data
  - Over 60 years of scientific publications
  - XML data set
• CPOs from complex ontologies
  - DBLP classification
  - ACM classification of subjects

Goals (1)
• Determine the efficiency of advanced CPO inference methods
• Experimentally determine the best approach in terms of minimizing user feedback

Goals (2)
• Use CPOs with RDF databases
  - For extracting RDF from text as a means of using semantic information
  - For answering queries from RDF databases
• Benefits:
  - Probabilistic model is clearly formalized
  - Proven improvement in answer quality
• Experimentally determine what the probability threshold may be for various domains

Thank you