Learning to Create Data-Integration Queries Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira,

Presentation transcript:

Learning to Create Data-Integration Queries
Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha
VLDB 2008
Seminar presented by Noel Gunasekar, CSE Department, SUNY Buffalo

Learning to Create Data-Integration Queries: Outline
- Introduction: Motivation, Example, Existing Solutions
- Q-System Solution: Q-System Architecture, Query and Query Answers, Executing Query, Learning from Feedback
- Conclusion: Experimental Results, Future Work

Introduction

Motivation
Non-expert users need to pose queries across multiple data resources.
- Non-expert user: not familiar with query languages
- Multiple resources: databases, data warehouses, virtual integrated schemas

Bio-Science Field
- Many "standardized" databases with overlapping and cross-referenced information
- Each site is independently extended, corrected, and analyzed
- Differing levels of data quality and confidence
Protein databases:
- Protein DataBase (PDB): information and service listings at Brookhaven National Laboratory [BNL]
- PIR: Protein Identification Resource database at [JHU]
- PRF: Protein Research Foundation database at GenomeNet
- SwissProt: protein database at ExPASy [Switzerland]

Example
A life sciences researcher queries data sources such as genomics, disease studies, and pharmacology:
“What are the proteins and genes associated with the disease Narcolepsy?”

Existing Solution
- Keyword-based queries posed through web forms
- Match the keywords against terms in the tuples and form the query by joining the databases on foreign keys
- The cost of a query is fixed and does not take the context of the query into account

Proposed Solution - Q System
- Automatically generate web forms for a given set of keywords
- Pose queries across multiple data resources using the generated web form

Proposed Solution - Q System
- Create a reusable web form: the user (author) gives keywords (protein, gene, disease) to the Q System, which produces a reusable web form for querying.
- Use the web form for querying: users (the author and others) supply parameters to the reusable web form and receive query results.

Q System Architecture

Architecture of Q System
Four components:
- Initial Schema Loader
- Query Template Creation
- Query Execution
- Learning through Feedback

Architecture of Q System

Initial Setup: Schema Loader
Input: a set of data sources, each with its own schema
- Foreign keys and links
- Schema mappings
- Record links
Output: schema graph

Initial Setup: Example Schema Graph
- Node: databases and their attributes (UniProt database, Entrez GeneInfo database, term)
- Edge: relation based on foreign keys or cross-references (UniProt to PIR)
- Cost: reliability, completeness
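A schema graph of the kind described on this slide can be sketched as a small weighted, undirected graph. The Python below is only an illustration: the class name `SchemaGraph`, the edges other than UniProt to PIR, and all cost values are assumptions made up for the example, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaGraph:
    """Undirected graph: nodes are relations/attributes, edges are join links with a cost."""
    edges: dict = field(default_factory=dict)  # frozenset({u, v}) -> cost

    def add_edge(self, u, v, cost):
        # A lower cost stands for a more reliable / more complete link.
        self.edges[frozenset((u, v))] = cost

    def nodes(self):
        return sorted({n for e in self.edges for n in e})

    def neighbors(self, u):
        return sorted(next(iter(e - {u})) for e in self.edges if u in e)

    def cost(self, u, v):
        return self.edges[frozenset((u, v))]


# Hypothetical links and costs, for illustration only.
g = SchemaGraph()
g.add_edge("UniProt", "PIR", 0.1)
g.add_edge("UniProt", "EntrezGeneInfo", 0.2)
g.add_edge("EntrezGeneInfo", "Term", 0.15)
print(g.neighbors("UniProt"))
```

Keying edges by `frozenset` makes the link symmetric, so `g.cost("PIR", "UniProt")` and `g.cost("UniProt", "PIR")` return the same value.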

Query Template Creation

Query Template Creation - Example
Input: “protein”, “plasma membrane”, “gene” and “disease”
Output: (shown as a figure on the slide)

Query Template Creation - Example
Find trees in the schema graph that connect the red nodes.
- Schema graph nodes: a, b, c, d, e, f
- Query keywords: a, e, f
- Q1: tree over a, b, d, e, f (Rank = 1, Cost = 0.4)
- Q2: tree over a, b, c, d, e, f (Rank = 2, Cost = 0.41)

Query Formulation
Trees can be easily written as executable queries.
- Steiner tree: nodes a, b, d, e, f
- Conjunctive query: a(x,y), b(y,z), d(z,w), e(w,u), f(w,v)
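The translation from a tree to the conjunctive query on this slide can be sketched in a few lines of Python. The sketch assumes every relation is binary, that the tree is rooted at a with edges a-b, b-d, d-e and d-f (inferred from the shared variables in the slide's query), and that a child joins on its parent's second attribute; the function name is hypothetical.

```python
def tree_to_conjunctive_query(root, children, var_names):
    """Emit one atom R(in_var, out_var) per relation; a child's in_var is its parent's out_var."""
    names = iter(var_names)
    atoms = []

    def visit(node, in_var):
        out_var = next(names)  # fresh variable shared with this node's children
        atoms.append(f"{node}({in_var},{out_var})")
        for child in children.get(node, []):
            visit(child, out_var)

    visit(root, next(names))
    return ",".join(atoms)


# Tree assumed from the slide: a - b - d, with d joined to both e and f.
children = {"a": ["b"], "b": ["d"], "d": ["e", "f"]}
query = tree_to_conjunctive_query("a", children, "xyzwuv")
print(query)  # a(x,y),b(y,z),d(z,w),e(w,u),f(w,v)
```

Each shared variable corresponds to one join edge of the tree, which is why e and f both start with w, the variable they share with d.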

View Refinement

Web-Form

Query Execution

Input: Web-Form

Output: Result Answers
- Result tuples are labeled with the queries that produced them: Q1; Q1,2; Q2
- The system determines the “producer” queries using provenance
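One way to picture the "producer" bookkeeping is a map from each answer tuple to the set of queries that returned it. The sketch below is an assumption about the idea only, not about how ORCHESTRA records provenance; the function name and the sample tuples are hypothetical.

```python
from collections import defaultdict


def merge_with_provenance(results_by_query):
    """Union the answer tuples of all queries, recording which queries produced each tuple."""
    provenance = defaultdict(set)
    for query_id, tuples in results_by_query.items():
        for t in tuples:
            provenance[t].add(query_id)
    return dict(provenance)


# Hypothetical answers for two queries generated from the same keywords.
results = {
    "Q1": [("protA", "gene1"), ("protB", "gene2")],
    "Q2": [("protB", "gene2"), ("protC", "gene3")],
}
prov = merge_with_provenance(results)
for answer, producers in prov.items():
    print(answer, sorted(producers))
```

A tuple returned by both queries carries both labels, which corresponds to the "Q1,2" group on the slide.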

Query Execution
Query processing engine with support for:
- Querying remote data sources
- Recording data provenance
Solution: ORCHESTRA

Orchestra Project
- The ORCHESTRA project focuses on the challenges of data-sharing scenarios in the sciences.
- Bioinformatics scenario: many "standardized" databases with overlapping information, similar but not identical data, and differing levels of data quality and confidence.
- Each site is independently extended, corrected, and analyzed.
- The ORCHESTRA collaborative data sharing system (CDSS) focuses on how to support reconciliation across different schemas and among disagreeing users.

Orchestra Project - Data Provenance

Learning through Feedback

Learning through Feedback
Input: ranked results + provenance (result tuples labeled Q1; Q1,2; Q2)

Learning through Feedback
The user provides feedback on the ranked results (result tuples labeled Q1; Q1,2; Q2)

Query Formulation - Recap
Find trees in the schema graph that connect the red nodes.
- Schema graph nodes: a, b, c, d, e, f
- Query keywords: a, e, f
- Q1: tree over a, b, d, e, f (Rank = 1, Cost = 0.4)
- Q2: tree over a, b, c, d, e, f (Rank = 2, Cost = 0.41)

Learning through Feedback
Change weights so that Q2 is “cheaper” than Q1.
- Before the update: Rank = 1, Cost = 0.4; Rank = 2, Cost = (value missing)
- After the update: Rank = 2, Cost = 0.4; Rank = 1, Cost = (value missing)

Iteration!

Q-System: Challenges
- Computation of ranked queries, which in turn produce ranked tuples: K-Best Steiner Tree Generation
- Predicting new query rankings based on user feedback over tuples, and also generalizing feedback: Learning
- Maintaining associations between tuples and queries: Query answers with provenance
- Everything at interactive speed!

Cost of a Query
- Query cost = sum of edge costs in the tree.
- Edge cost = sum of weights of features defined over it.
- Features are properties of the edges, e.g., the nodes connected.
- Each feature has a corresponding weight.
- Feature example (Term, Synonym; weight w8): f = 1 if the edge connects the Term and Synonym tables, else 0.
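The two sums on this slide can be written out directly. The sketch below assumes binary features stored per edge; the feature names, the second edge, and the weight 0.05 are hypothetical, while 0.06 and 0.01 are borrowed from the w8 and w25 values on the later "Learning: Update Weights" slide.

```python
def edge_cost(features, weights):
    """Edge cost = sum of the weights of the features that are active (value 1) on the edge."""
    return sum(weights[name] * value for name, value in features.items())


def query_cost(tree_edges, edge_features, weights):
    """Query cost = sum of the edge costs over the edges of the tree."""
    return sum(edge_cost(edge_features[e], weights) for e in tree_edges)


# Hypothetical feature names; 0.06 and 0.01 mirror w8 and w25 from the weight-update slide.
weights = {"connects_Term_Synonym": 0.06, "default": 0.01, "connects_Synonym_Gene": 0.05}
edge_features = {
    ("Term", "Synonym"): {"connects_Term_Synonym": 1, "default": 1},
    ("Synonym", "Gene"): {"connects_Synonym_Gene": 1, "connects_Term_Synonym": 0},
}
total = query_cost(list(edge_features), edge_features, weights)
print(round(total, 2))  # 0.12
```

Because the cost is linear in the weights, changing a single weight changes the cost of every query whose tree uses an edge with that feature, which is what lets feedback on one query generalize to others.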

Steiner Trees: Finding Lowest-Cost Queries
- A Steiner tree is a tree of minimal cost in a graph (G) that includes all the required nodes (S).
- The cost of a Steiner tree is the sum of the costs of the edges present in the tree.
- The Steiner tree problem generalizes the Minimum Spanning Tree (MST) problem [equivalent when S = all vertices in G].
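For a graph as small as the six-node example, a minimum Steiner tree can be found by brute force: try every subset of the non-required nodes and keep the cheapest spanning tree of the induced subgraph. This is a sketch for intuition only, not the ILP or heuristic used by the Q System, and it is exponential in the number of optional nodes. The edge list and the costs (given in hundredths to keep the arithmetic exact) are assumptions chosen so that the two candidate trees cost 0.40 and 0.41.

```python
from itertools import combinations


def mst(nodes, edges):
    """Kruskal on the subgraph induced by `nodes`; returns (cost, tree) or None if disconnected."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    cost, tree = 0, []
    for (u, v), c in sorted(edges.items(), key=lambda kv: kv[1]):
        if u in parent and v in parent and find(u) != find(v):
            parent[find(u)] = find(v)
            cost += c
            tree.append((u, v))
    return (cost, tree) if len(tree) == len(nodes) - 1 else None


def steiner_tree(edges, terminals):
    """Exhaustive search over subsets of optional (non-terminal) nodes."""
    optional = sorted({n for e in edges for n in e} - set(terminals))
    best = None
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            result = mst(set(terminals) | set(extra), edges)
            if result is not None and (best is None or result[0] < best[0]):
                best = result
    return best


# Hypothetical edges; costs are in hundredths (10 means 0.10).
edges = {("a", "b"): 10, ("b", "d"): 10, ("d", "e"): 10, ("d", "f"): 10,
         ("a", "c"): 5, ("c", "b"): 6}
cost, tree = steiner_tree(edges, ["a", "e", "f"])
print(cost, tree)
```

With S equal to all six nodes there are no optional nodes left, and the search reduces to a single minimum spanning tree computation, matching the MST remark on the slide.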

K-Best Steiner Tree Algorithms
- Exact (practical for ~100 nodes and edges): Integer Linear Program (ILP) based formulation for finding the K best Steiner trees in a graph. The ILP uses ideas from multi-commodity network flows.
- Approximate (for 100s+ nodes and edges): novel Shortest Paths Complete Subgraph Heuristic. Significantly faster; in practice, it often gives the optimal solution.

Multi-Commodity Flow Problem

MIP for min-cost Steiner Tree

MIP for K min-cost Steiner Tree

Constraints
- C1: Flow of commodity k starts at root r
- C2: Flow of commodity k terminates at node k
- C3: Conservation of flow at Steiner nodes
- C4: Flow on an edge is allowed only if that edge is included (Yij = 1)
- C5: Non-negativity constraint
- C6: Defines value for Y
- C7: Ensures no incoming active edge into the root
- C8: Ensures that all nodes have at most one incoming active edge
- C9: Flow of at least one commodity on all edges in I
- C10: Ensures no flow on edges in X
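The equations themselves are not preserved in this transcript. As a reading aid, a standard multi-commodity flow formulation that matches the constraint descriptions could look like the following. Everything here is an assumption rather than the authors' exact program: f^k_{ij} denotes the flow of commodity k on directed edge (i,j), c_{ij} the edge cost, Y_{ij} the edge indicator, r the root, S the set of required nodes, E the edge set, and I and X the sets of edges forced into and out of the tree when enumerating the K best solutions.

```latex
\begin{align*}
\min\ & \sum_{(i,j)\in E} c_{ij}\, Y_{ij} \\
\text{C1:}\quad & \sum_{j} f^{k}_{rj} - \sum_{j} f^{k}_{jr} = 1
    && \forall k \in S\setminus\{r\} \\
\text{C2:}\quad & \sum_{j} f^{k}_{jk} - \sum_{j} f^{k}_{kj} = 1
    && \forall k \in S\setminus\{r\} \\
\text{C3:}\quad & \sum_{j} f^{k}_{ji} - \sum_{j} f^{k}_{ij} = 0
    && \forall k \in S\setminus\{r\},\ \forall i \notin \{r, k\} \\
\text{C4:}\quad & f^{k}_{ij} \le Y_{ij}
    && \forall k,\ \forall (i,j)\in E \\
\text{C5:}\quad & f^{k}_{ij} \ge 0
    && \forall k,\ \forall (i,j)\in E \\
\text{C6:}\quad & Y_{ij} \in \{0, 1\}
    && \forall (i,j)\in E \\
\text{C7:}\quad & \sum_{i} Y_{ir} = 0 \\
\text{C8:}\quad & \sum_{i} Y_{ij} \le 1
    && \forall j \\
\text{C9:}\quad & \sum_{k} f^{k}_{ij} \ge 1
    && \forall (i,j)\in I \\
\text{C10:}\quad & f^{k}_{ij} = 0
    && \forall k,\ \forall (i,j)\in X
\end{align*}
```

In this reading, C1 to C8 describe a single min-cost Steiner tree, while C9 and C10 restrict the search to trees that contain every edge in I and no edge in X, which is how successive solutions can be separated when generating the K best trees.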

Finding K-Best Steiner Trees
The 2-best Steiner trees connecting the terminal nodes:
- Rank = 1, Cost = 0.4: tree over a, b, d, e, f
- Rank = 2, Cost = 0.41: tree over a, b, c, d, e, f

K-Best Steiner Tree Algorithms
Approximate (for 100s+ nodes and edges): novel Shortest Paths Complete Subgraph Heuristic.
- Send “m” shortest path graph as input.
- Shortest path between each pair of nodes in S.
- Significantly faster; in practice, it often gives the optimal solution.
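The idea of working on shortest paths between the required nodes can be illustrated with the classic distance-network approximation: compute the shortest path between each pair of nodes in S, build a complete graph over S with those distances, take its minimum spanning tree, and expand each chosen link back into original edges. This is a simplified stand-in, not the authors' heuristic as specified: it uses one shortest path per pair and a spanning tree instead of an exact solver, and it skips the final pruning pass a full implementation would need. The graph and its costs (in hundredths) are hypothetical.

```python
import heapq
from itertools import combinations


def dijkstra(adj, source):
    """Shortest distances and predecessor links from `source`."""
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, c in adj[u]:
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    return dist, prev


def approx_steiner(edges, terminals):
    adj, cost_of = {}, {}
    for (u, v), c in edges.items():
        adj.setdefault(u, []).append((v, c))
        adj.setdefault(v, []).append((u, c))
        cost_of[frozenset((u, v))] = c
    paths = {t: dijkstra(adj, t) for t in terminals}
    # Complete graph over the terminals, weighted by shortest-path distance.
    pairs = sorted(combinations(terminals, 2), key=lambda p: paths[p[0]][0][p[1]])
    comp = {t: t for t in terminals}

    def find(t):
        while comp[t] != t:
            t = comp[t]
        return t

    tree = set()
    for s, t in pairs:  # Kruskal over the terminal-to-terminal links
        if find(s) != find(t):
            comp[find(s)] = find(t)
            prev, node = paths[s][1], t
            while node != s:  # expand the link into original graph edges
                tree.add(frozenset((prev[node], node)))
                node = prev[node]
    return tree, sum(cost_of[e] for e in tree)


edges = {("a", "b"): 10, ("b", "d"): 10, ("d", "e"): 10, ("d", "f"): 10,
         ("a", "c"): 5, ("c", "b"): 6}
tree, cost = approx_steiner(edges, ["a", "e", "f"])
print(sorted(tuple(sorted(e)) for e in tree), cost)
```

On this small graph the approximation happens to return the same tree as an exhaustive search, in line with the slide's remark that the heuristic often gives the optimal solution in practice.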

Q Challenge: Getting User Feedback
- T: tree for query Q (nodes a, b, d, e, f)
- T*: tree for query Q* (nodes a, b, c, d, e, f)
- T and T* differ in 3 edges. This difference is termed the loss: L(T, T*)
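The loss can be read as the number of edges that appear in exactly one of the two trees. The sketch below assumes that reading (a symmetric difference of edge sets) and assumes concrete edge lists for T and T*, since the slide's figure is not preserved.

```python
def edge_set(edges):
    """Treat edges as undirected so that (u, v) and (v, u) compare equal."""
    return {frozenset(e) for e in edges}


def tree_loss(tree, other):
    """L(T, T*): number of edges present in exactly one of the two trees."""
    return len(edge_set(tree) ^ edge_set(other))


# Assumed edge lists: T* reaches b from a through c instead of directly.
T = [("a", "b"), ("b", "d"), ("d", "e"), ("d", "f")]
T_star = [("a", "c"), ("c", "b"), ("b", "d"), ("d", "e"), ("d", "f")]
loss = tree_loss(T, T_star)
print(loss)  # 3
```

Under these assumed trees the differing edges are a-b, a-c and c-b, which gives the value of 3 stated on the slide.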

Learning: Update Weights
Edge cost update for the edge Term2Term, Term(T1):
- Before: w8 = 0.06, w25 = 0.01, edge cost = 0.07
- After: w8 = 0.04, w25 = 0.01, edge cost = 0.05
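The slide shows the effect of an update but not the rule behind it. As an illustration only, a perceptron-style step can reproduce the numbers: lower the weight of each feature that occurs more often in the tree the user prefers, and raise the weight of each feature that occurs more often in the tree the system predicted. The update rule, the step size of 0.02, the feature w3, and the edge names are all assumptions chosen to match the slide's values.

```python
from collections import Counter


def feature_counts(tree_edges, edge_features):
    """How many times each feature is active over the edges of a tree."""
    counts = Counter()
    for e in tree_edges:
        counts.update(edge_features[e])
    return counts


def update_weights(weights, preferred, predicted, edge_features, step):
    """Make the preferred tree cheaper relative to the predicted one."""
    good = feature_counts(preferred, edge_features)
    bad = feature_counts(predicted, edge_features)
    new = dict(weights)
    for name in set(good) | set(bad):
        new[name] = weights[name] - step * (good[name] - bad[name])
    return new


# Hypothetical setup: w8 and w25 are active on the preferred tree's edge.
edge_features = {"Term2Term-Term": ["w8", "w25"], "other-edge": ["w25", "w3"]}
weights = {"w8": 0.06, "w25": 0.01, "w3": 0.05}
new_weights = update_weights(weights, ["Term2Term-Term"], ["other-edge"],
                             edge_features, step=0.02)
new_edge_cost = new_weights["w8"] + new_weights["w25"]
print(round(new_weights["w8"], 2), round(new_weights["w25"], 2), round(new_edge_cost, 2))
```

Here w25 is active in both trees, so it stays at 0.01, while w8 drops from 0.06 to 0.04 and the edge cost falls from 0.07 to 0.05, as on the slide.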

Re-ranked Steiner Trees
[Figure: after the weight update, the two Steiner trees (over a, b, c, d, e, f and over a, b, d, e, f) are shown again at Rank 1 and Rank 2]

Experimental Results: The Key Questions
- Can the algorithm start with uninitialized weights and learn the expert (“gold standard”) ranking of queries?
- Can the results be generated at interactive speeds?
- Does the approach scale to larger graphs?

Results: Learning Expert Weights
- Graph: start with the BioGuide bio sources, with 28 vertices and 96 edges.
- Goal: learn the queries corresponding to the expert-set weights in BioGuide.
- Methodology: all weights are set to default, then a sequence of 25 searches is run. For each search, user feedback identifies and promotes a tuple from the gold standard answer.
- After 40-60% of the searches with feedback, the system finds the top query immediately.
- For each individual search, a single feedback is enough to learn the top query.
[Plot: number of gold queries absent in top-3 predictions]

Results: Time to Generate K-best Queries
- Schema graph of size (28, 96) from BioGuide (Boulakia et al., 2007).
- It is possible to generate the top query in < 1 sec and the top 5 queries in about 2 sec, all within interactive range.
- Query execution is pipelined.
[Table: K vs. Time (s); values not preserved in the transcript]

Results: Scalability to Larger Graphs
- Larger schema graph of size (408, 1366) from real sources: GUS, GO, BioSQL.
- It is possible to do K-best inference in larger graphs quickly and with little or no loss (none in this case).
[Table: K, Speedup, Error, by number of queries (K); values not preserved in the transcript]

Experimental Results: “Gold Standard”
- Uses BioGuide, a biomedical information integration system.
- BioGuide generates the schema graph based on keywords.
- The edge costs in the schema graph are manually assigned by experts.
- This expert-given schema graph is called the “gold standard”.
- The experiment compares the results produced by the Q System with the results produced by the gold standard.

Learning against Expert Cost
- Started with an expert query template: “What are the related proteins in [DB1] and genes in [DB2] associated with disease Narcolepsy in [DB3]?”
- 25 queries were formed by instantiating the template.
- For each query, the lowest-cost Steiner tree is computed.
- The “gold standard” is used as the feedback, and learning is done from it.

Future Work
- Work on other approximation algorithms for computing the Steiner trees
- Evaluation against real biological applications
- Incorporating data-level keyword matches