AnHai Doan, Alon Halevy, Zachary Ives. Principles of Data Integration, Chapter 16: Keyword Search.


Keyword Search over Structured Data
- Anyone who has used a computer knows how to use keyword search
  - No need to understand logic or query languages
  - No need to understand (or have) structure in the data
- Database-style queries are more precise, but:
  - They are more difficult for users to specify
  - They require a schema to query over!
- Constructing a mediated, queryable schema is one of the major challenges in getting a data integration system deployed
- Can we use keyword search to help?

The Foundations
- Keyword search was studied in the database context before being extended to data integration
- We'll start with these foundations before looking at what is different in the integration context:
  - How we model a database and the keyword search problem
  - How we process keyword searches and efficiently return the top-scoring (top-k) results

Outline
- Basic concepts
  - Data graph
  - Keyword matching and scoring models
- Algorithms for ranked results
- Keyword search for data integration

The Data Graph
Captures relationships, and their strengths, among data and metadata items.
- Nodes: classes, tables, attributes, field values
  - May be weighted, representing authoritativeness, quality, correctness, etc.
- Edges: is-a and has-a relationships, foreign keys, hyperlinks, record links, schema alignments, possible joins, ...
  - May be weighted, representing strength of the connection, probability of a match, etc.
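A data graph of this kind can be represented very simply. The following is a minimal sketch, not an implementation from the book; the node names follow the gene-term example used in later slides, and the dict-based storage is an illustrative choice.

```python
from dataclasses import dataclass, field

@dataclass
class DataGraph:
    node_weights: dict = field(default_factory=dict)  # node -> cost (e.g., inverse authoritativeness)
    edges: dict = field(default_factory=dict)         # (u, v) -> connection cost

    def add_node(self, name, weight=0.0):
        self.node_weights[name] = weight

    def add_edge(self, u, v, weight=1.0):
        # stored in both directions, since search may traverse either way
        self.edges[(u, v)] = weight
        self.edges[(v, u)] = weight

    def neighbors(self, u):
        return [v for (a, v) in self.edges if a == u]

g = DataGraph()
g.add_node("Term", 0.1)
g.add_node("Term2Ontology", 0.2)
g.add_edge("Term", "Term2Ontology", 0.5)  # e.g., a foreign-key edge
```

Keyword search algorithms then operate over this structure, treating weights as costs to be minimized.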

Querying the Data Graph  Queries are expressed as sets of keywords  We match keywords to nodes, then seek to find a way to “connect” the matches in a tree  The lowest-cost tree connecting a set of nodes is called a Steiner tree  Formally, we want the top-k Steiner trees  … However, this is NP-hard in the size of the graph!

Data Graph Example: Gene Terms, Classifications, Publications
- Blue nodes represent tables: genetic terms, record link to ontology, record link to publications, etc.
- Pink nodes represent attributes (columns)
- Brown rectangles represent field values
- Edges represent foreign keys, membership, etc.
[Figure: data graph over tables Term, Term2Ontology, Entry2Pub, Pubs, Entry, and Standard abbrevs, with attributes such as acc, name, go_id, entry_ac, pub_id, and title, and field values such as GO:00059 and "plasma membrane"]

Querying the Data Graph
[Figure: the same data graph, with the keywords "membrane" and "publication" matched to nodes. The Standard abbrevs table is an index to tables, not part of the results.]
- Relational query 1 tree: Term, Term2Ontology, Entry2Pub, Pubs
- Relational query 2 tree: Term, Term2Ontology, Entry, Pubs

Trees to Ranked Results
Each query Steiner tree becomes a conjunctive query:
- Return the matching attributes and the keys of the matching relations
- Nodes become relation atoms, variables, and bound values
- Edges become join predicates, inclusion constraints, etc.
- Keyword matches to value nodes become selection predicates
Query tree 1 becomes:
q1(A, P, T) :- Term(A, "plasma membrane"), Term2Ontology(A, E), Entry2Pub(E, P), Pubs(P, T)
Computing and executing this query yields results:
- Assign a score to each result, based on the weights in the query and the similarity scores from approximate joins or matches
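The conjunctive query q1 is just a multi-way join. As a sketch, it can be executed over an in-memory SQLite database; the table and column names follow the Datalog query above, while the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Term(acc TEXT, term TEXT);
    CREATE TABLE Term2Ontology(acc TEXT, entry_ac TEXT);
    CREATE TABLE Entry2Pub(entry_ac TEXT, pub_id TEXT);
    CREATE TABLE Pubs(pub_id TEXT, title TEXT);

    INSERT INTO Term VALUES ('GO:00059', 'plasma membrane');
    INSERT INTO Term2Ontology VALUES ('GO:00059', 'E1');
    INSERT INTO Entry2Pub VALUES ('E1', 'P1');
    INSERT INTO Pubs VALUES ('P1', 'Membrane proteins');
""")

# q1(A, P, T) :- Term(A, 'plasma membrane'), Term2Ontology(A, E),
#                Entry2Pub(E, P), Pubs(P, T)
rows = cur.execute("""
    SELECT t.acc, p.pub_id, p.title
    FROM Term t
    JOIN Term2Ontology o ON o.acc = t.acc
    JOIN Entry2Pub e ON e.entry_ac = o.entry_ac
    JOIN Pubs p ON p.pub_id = e.pub_id
    WHERE t.term = 'plasma membrane'
""").fetchall()
print(rows)  # [('GO:00059', 'P1', 'Membrane proteins')]
```

The keyword match becomes the WHERE predicate, and each tree edge becomes one JOIN condition, exactly as the bullet points describe.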

Where Do Weights Come from?
Node weights:
- Expert scores
- PageRank and other authoritativeness scores
- Data quality metrics
Edge weights:
- String similarity metrics (edit distance, TF*IDF, etc.)
- Schema matching scores
- Probabilistic matches
In some systems the weights are all learned.

Scoring Query Results
- The next issue: how to compose the scores in a query tree
- Weights are treated as costs or dissimilarities: we want the k lowest-cost trees
- Two common scoring models exist:
  - Sum the edge weights in the query tree
    - The tree may have a required root (in some models), or not
    - If there are node weights, they can be moved onto extra edges (see text)
  - Sum the costs of the root-to-leaf paths
    - This is for trees with required roots
    - Root-to-leaf paths may overlap, so certain edges get double-counted, but each path is scored independently
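The difference between the two models is easiest to see on a small rooted tree. The tree and weights below are invented; the shared edge (r, a) is what separates the two scores.

```python
# parent -> child edges of a tiny rooted query tree, with costs
tree = {("r", "a"): 1.0, ("a", "x"): 2.0, ("a", "y"): 3.0}
parent = {"a": "r", "x": "a", "y": "a"}
leaves = ["x", "y"]

# Model 1: sum each edge weight exactly once
edge_sum = sum(tree.values())  # 1 + 2 + 3 = 6.0

# Model 2: sum the cost of every root-to-leaf path;
# the shared edge (r, a) is counted once per path that uses it
def path_cost(leaf):
    cost, node = 0.0, leaf
    while node in parent:  # walk up to the root
        cost += tree[(parent[node], node)]
        node = parent[node]
    return cost

path_sum = sum(path_cost(l) for l in leaves)  # (1+2) + (1+3) = 7.0
print(edge_sum, path_sum)
```

Under model 2 the tree scores worse (7.0 vs. 6.0) precisely because the edge into the shared subtree is double-counted.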

Outline
- Basic concepts
- Algorithms for ranked results
- Keyword search for data integration

Top-k Answers
- The challenge: efficiently computing the top-k-scoring answers, at scale
- Two general classes of algorithms:
  - Graph expansion, where the score is based on edge weights
    - Model data + schema as a single graph
    - Use a heuristic search strategy to explore from keyword matches to find trees
  - Threshold-based merging, where the score is a function of field values
    - Given a scoring function that depends on multiple attributes, how do we merge the results?
- Often combinations of the two are used

Graph Expansion
Basic process:
- Use an inverted index to find matches between keywords and graph nodes
- Iteratively search from the matches until we find trees
[Figure: fragment of the gene-term data graph, with the keywords "membrane" and "title" matched to nodes]

What Is the Expansion Process?
Assumptions here:
- The query result will be a rooted tree; the root is based on the direction of foreign keys
- The scoring model is the sum of edge weights (see text for other cases)
Two main heuristics:
- Backward expansion
  - Create a "cluster" for each leaf node
  - Expand by following foreign keys backwards, lowest-cost-first
  - Repeat until the clusters intersect
- Bidirectional expansion
  - Also keep a "cluster" for the root node
  - Expand the clusters in a prioritized way
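Backward expansion can be sketched as one Dijkstra-style frontier per keyword match, stopping when some node has been reached by every cluster. This is a simplified, undirected sketch (real systems follow foreign-key direction); the graph below is an invented miniature of the gene-term example, and the first meeting node is only a heuristic answer, not a guaranteed optimum.

```python
import heapq

# adjacency lists of (neighbor, cost) pairs
graph = {
    "membrane": [("Term", 0.1)],
    "title":    [("Pubs", 0.1)],
    "Term":     [("membrane", 0.1), ("Term2Ontology", 1.0)],
    "Term2Ontology": [("Term", 1.0), ("Entry2Pub", 1.0)],
    "Entry2Pub": [("Term2Ontology", 1.0), ("Pubs", 1.0)],
    "Pubs":     [("Entry2Pub", 1.0), ("title", 0.1)],
}

def backward_expand(matches):
    # dist[i][n] = cheapest cost found so far from match i to node n
    dist = [{m: 0.0} for m in matches]
    frontier = [(0.0, i, m) for i, m in enumerate(matches)]
    heapq.heapify(frontier)
    while frontier:
        d, i, node = heapq.heappop(frontier)  # cheapest frontier entry first
        if d > dist[i].get(node, float("inf")):
            continue  # stale heap entry
        # have all clusters reached this node? then the trees can meet here
        if all(node in dist[j] for j in range(len(matches))):
            return node, sum(dist[j][node] for j in range(len(matches)))
        for nbr, w in graph.get(node, []):
            if d + w < dist[i].get(nbr, float("inf")):
                dist[i][nbr] = d + w
                heapq.heappush(frontier, (d + w, i, nbr))
    return None

root, cost = backward_expand(["membrane", "title"])
print(root, round(cost, 3))
```

On this toy graph the two clusters meet at Entry2Pub, yielding the connecting tree membrane-Term-Term2Ontology-Entry2Pub-Pubs-title.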

Querying the Data Graph
[Figure: the gene-term data graph again, showing the expansion outward from the matched "membrane", "publication", and "title" nodes]

Graph vs. Attribute-Based Scores
- The previous strategy focuses on finding different subgraphs to identify the tuples to return
  - It assumes the costs are defined from edge weights
  - It uses prioritized exploration to find connections
- But part of the score may be defined in terms of the values of specific attributes in the query:
  score = ... + weight_1 * T_1.attrib_1 + weight_2 * T_2.attrib_2 + ...
- Assume we have an index of "partial tuples" in sort order of each such attribute
- ... and a way of computing the remaining results, e.g., by joining the partial tuples with others

Threshold-Based Merging with Random Access
- Given multiple sorted indices L_1, ..., L_m over the same "stream of tuples," try to return the k best-cost tuples with the fewest I/Os
- Assume the cost function t(x_1, x_2, ..., x_m) is monotone, i.e., t(x_1, x_2, ..., x_m) ≤ t(x_1', x_2', ..., x_m') whenever x_i ≤ x_i' for every i
- Assume we can retrieve/compute the tuples containing each x_i
[Figure: indices L_1 (on x_1), L_2 (on x_2), ..., L_m (on x_m) feed a threshold-based merge, which outputs the k best-ranked results under cost = t(x_1, x_2, ..., x_m)]

The Basic Thresholding Algorithm with Random Access (Sketch)
In parallel, read each of the indices L_i:
- For each x_i retrieved from L_i, retrieve the tuple R
  - Obtain the full set of result tuples containing R; this may involve computing a join query with R
  - Compute the score t(R') for each such full tuple R'
  - If t(R') is one of the k best scores, remember R' and t(R') (break ties arbitrarily)
- For each index L_i, let x_i be the lowest value read from that index so far
- Set a threshold value τ = t(x_1, x_2, ..., x_m)
- Once we have seen k objects whose score is at least τ, halt and return the k highest-scoring tuples that have been remembered
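The sketch above can be written out directly. This version runs on the restaurant tables from the example slides that follow, with the scoring function t(rating, price) = 0.5 * rating + 0.5 * (5 - price); here random access is simulated by a dict lookup, and round-by-round bookkeeping is simplified relative to the slides' walkthrough.

```python
# full data: name -> (rating, price)
full = {
    "Alma de Cuba": (4, 3), "Moshulu": (4, 4), "Sotto Varalli": (3.5, 3),
    "McGillin's": (4, 2), "Di Nardo's Seafood": (3, 2),
}

def score(rating, inv_price):          # inv_price = 5 - price
    return 0.5 * rating + 0.5 * inv_price

# sorted indices: by rating descending, and by (5 - price) descending
L_rating = sorted(full, key=lambda n: -full[n][0])
L_price = sorted(full, key=lambda n: -(5 - full[n][1]))

def threshold_algorithm(k):
    seen = {}
    for x_r, x_p in zip(L_rating, L_price):   # one sorted access per index per round
        for name in (x_r, x_p):
            rating, price = full[name]        # random access for the full tuple
            seen[name] = score(rating, 5 - price)
        # threshold from the last value read in each index
        tau = score(full[x_r][0], 5 - full[x_p][1])
        best = sorted(seen.items(), key=lambda kv: -kv[1])[:k]
        if len(best) == k and best[-1][1] >= tau:
            return best                       # k seen tuples meet the threshold
    return sorted(seen.items(), key=lambda kv: -kv[1])[:k]

result = threshold_algorithm(2)
print(result)
```

Because the threshold τ bounds the score of every unseen tuple, the algorithm can halt without scanning the indices to the end.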

An Example: Tables and Indices

Full data:
  name               | location              | rating | price
  Alma de Cuba       | 1523 Walnut St.       | 4      | 3
  Moshulu            | 401 S. Columbus Blvd. | 4      | 4
  Sotto Varalli      | 231 S. Broad St.      | 3.5    | 3
  McGillin's         | 1310 Drury St.        | 4      | 2
  Di Nardo's Seafood | 312 Race St.          | 3      | 2

L_rating: index by rating
  rating | name
  4      | Alma de Cuba
  4      | Moshulu
  4      | McGillin's
  3.5    | Sotto Varalli
  3      | Di Nardo's Seafood

L_price: index by (5 - price)
  (5 - price) | name
  3           | McGillin's
  3           | Di Nardo's Seafood
  2           | Alma de Cuba
  2           | Sotto Varalli
  1           | Moshulu

Reading and Merging Results
Cost formula: t(rating, price) = 0.5 * rating + 0.5 * (5 - price)
Read the first entry of each index: Alma de Cuba from L_rating (rating 4) and McGillin's from L_price (5 - price = 3).
- t_alma = 0.5*4 + 0.5*2 = 3
- t_mcgillins = 0.5*4 + 0.5*3 = 3.5
- τ = 0.5*4 + 0.5*3 = 3.5
- No tuples above τ!

Reading and Merging Results (continued)
Read the next entry of each index: Moshulu from L_rating (rating 4) and Di Nardo's Seafood from L_price (5 - price = 3).
- t_moshulu = 0.5*4 + 0.5*1 = 2.5
- t_dinardo's = 0.5*3 + 0.5*3 = 3
- τ = 0.5*4 + 0.5*3 = 3.5
- Still no tuples above τ!

Reading and Merging Results (continued)
Read the next entry of each index: McGillin's from L_rating (rating 4) and Alma de Cuba from L_price (5 - price = 2). These tuples have already been read!

Reading and Merging Results (continued)
Read the next entry of each index: Sotto Varalli, from both.
- t_sotto = 0.5*3.5 + 0.5*2 = 2.75
- τ = 0.5*3.5 + 0.5*2 = 2.75

Reading and Merging Results (continued)
With τ = 0.5*3.5 + 0.5*2 = 2.75, the top-scoring tuples seen so far (McGillin's at 3.5 and Alma de Cuba at 3) meet the threshold, so for k = 2 the algorithm halts and returns them.

Summary of Top-k Algorithms
- Algorithms for producing top-k results seek to minimize the amount of computation and I/O
  - Graph-based methods start with leaf and root nodes and do a prioritized search
  - Threshold-based algorithms seek to minimize the amount of full computation that needs to happen
    - They require a way of accessing subresults by each score component, in decreasing order of that component
- These are the main building blocks of keyword search over databases, and they are sometimes used in combination

Outline
- Basic concepts
- Algorithms for ranked results
- Keyword search for data integration

Extending Keyword Search from Databases to Data Integration
Integration poses several new challenges:
1. Data is distributed
   - This requires techniques such as those from Chapter 8 and from earlier in this section
2. We cannot assume the edges in the data graph are already known and encoded as foreign keys, etc.
   - In the integration setting we may need to infer them automatically, using schema matching (Chapter 5) and record linking (Chapter 4)
3. Relations from different sources may represent different viewpoints and may not be mutually consistent
   - Query answers should reflect the user's assessment of the sources
   - We may need to use learning for this

Scalable Automatic Edge Inference
In a scalable way, we may need to:
- Discover data values that might be useful to join
  - We can look at value overlap
  - This is an "embarrassingly parallel" task, easily computable on a cluster
- Discover semantically compatible relationships
  - Essentially a schema matching problem
- Combine evidence from the above two
  - Roughly the same problem as within a modern schema matching tool
  - Use standard techniques from Chapters 4-5, but consider interactions with the query cost model and the learning model
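Value-overlap join discovery can be sketched as scoring every pair of columns by the Jaccard similarity of their value sets. The columns below are invented; each pair's comparison is independent of the others, which is what makes the task embarrassingly parallel.

```python
# (table, column) -> set of values appearing in that column
columns = {
    ("Term", "acc"): {"GO:00059", "GO:00060", "GO:00071"},
    ("Term2Ontology", "acc"): {"GO:00059", "GO:00071"},
    ("Pubs", "title"): {"Membrane proteins", "Ion channels"},
}

def jaccard(a, b):
    """Jaccard similarity of two value sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# score every column pair, highest overlap first
names = list(columns)
candidates = sorted(
    ((c1, c2, jaccard(columns[c1], columns[c2]))
     for i, c1 in enumerate(names) for c2 in names[i + 1:]),
    key=lambda x: -x[2],
)
for c1, c2, sim in candidates:
    print(c1, c2, round(sim, 2))
```

The highest-scoring pair (the two acc columns, Jaccard 2/3) becomes a candidate join edge in the data graph, to be weighed against schema-matching evidence.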

Learning to Adjust Weights
- We may want to learn which sources are most relevant, and which edges in the graph are valid or invalid
- Basic idea: introduce a feedback loop:
  find edges in the graph → create a query from the search → compute the top-ranked results → collect user feedback → learn from the feedback (and repeat)

Example Query Results & User Feedback

How Do We Learn about Edge and Node Weights from Feedback on Data?
- We need data provenance (Chapter 14) to "explain" the relationship between each output tuple and the queries that generated it
- The score components (e.g., schema matcher values) need to be represented as features for a machine learning algorithm
- We need an online learning algorithm that can take the feedback and adjust the weights
  - Typically based on perceptrons or support vector machines
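A perceptron-style update is the simplest instance of such an online learner. In this sketch, each result's provenance is assumed to be summarized as a feature vector (e.g., one feature per schema-matcher score or edge type used to derive it); the features and feedback labels below are invented.

```python
def perceptron_update(weights, features, label, lr=0.1):
    """label is +1 (user approved the result) or -1 (user rejected it)."""
    predicted = 1 if sum(w * f for w, f in zip(weights, features)) > 0 else -1
    if predicted != label:  # adjust weights only on mistakes
        weights = [w + lr * label * f for w, f in zip(weights, features)]
    return weights

w = [0.0, 0.0]  # one weight per provenance feature
feedback = [
    ([1.0, 0.2], 1),   # result derived mostly via feature 0: approved
    ([0.1, 1.0], -1),  # result derived mostly via feature 1: rejected
    ([1.0, 0.2], 1),   # already predicted correctly, so no update
]
for features, label in feedback:
    w = perceptron_update(w, features, label)
print(w)
```

After a few rounds the weight on the edge type behind rejected results goes negative, so results derived through it sink in the ranking, which is the behavior the slide describes.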

Keyword Search Wrap-up
- Keyword search represents an interesting point between Web search and conventional data integration
  - Queries can be posed with little or no administrator work (mediated schemas, mappings, etc.)
  - Trade-offs: ranked results only, results may have heterogeneous schemas, and quality will be more variable
- It builds on the model and techniques used for keyword search in databases
  - But it needs support for automatic inference of edges, plus learning of where mistakes were made!