Improving Query Results using Answer Corroboration
Amélie Marian, Rutgers University

Motivations
- Queries on databases traditionally return exact answers: the (set of) tuples that match the query exactly.
- Queries in information retrieval traditionally return the best documents containing the answer: a (list of) documents within which users have to locate the relevant information.
- Both query models are insufficient for today's information needs.
  - New models have been used and studied: top-k queries, question answering (QA).
  - But these models consider answers individually (except for some QA systems).

Data Corroboration
- Data sources cannot be fully trusted.
  - Low-quality data (e.g., data integration, user-input data)
  - Web data (anybody can say anything on the web)
- Non-exact query models: top-k answers are requested.
- Repeated information lends more credence to the quality of the information.
  - Aggregate similar information, and increase its score.

Outline
- Answer Corroboration for Data Cleaning (joint work with Yannis Kotidis and Divesh Srivastava)
  - Motivations
  - Multiple Join Path Framework
  - Our Approach
  - Experimental Evaluation
- Answer Corroboration for Web Search
  - Motivations
  - Our Approach
  - Query Interface

Motivating Example
[Schema figure: four applications (Sales, Ordering, Provisioning, Inventory) with fields such as TN, BAN, CustName, ORN, PON, SubPON, and CircuitID, linked by join edges]
- TN: Telephone Number; ORN: Order Number; BAN: Billing Account Number; PON: Provisioning Order Number; SubPON: Related PON
- Query: What is the Circuit ID associated with a Telephone Number that appears in Sales?

Motivations
- Data applications with overlapping features
  - Data integration
  - Web sources
- Data quality issues (duplicates, nulls, default values, data inconsistencies)
  - Data-entry problems
  - Data integration problems

Contributions
- Multiple Join Path (MJP) framework
  - Quantifies answer quality
  - Takes corroborating evidence into account
  - Agglomerative scoring of answers
- Answer computation techniques
  - Designed for MJP scoring methodologies
  - Several output options (top-k, top-few)
- Experimental evaluation on real data
  - VIP integration platform
  - Quality of answers
  - Efficiency of our techniques

Multiple Join Path Framework: Problem Definition
- Queries of the form: "Given X = a, find the value of Y."
- Examples:
  - Given the telephone number of a customer, find the ID of the circuit to which the telephone line is attached. (One answer expected.)
  - Given a circuit ID, find the names of the customers whose telephones are attached to that circuit. (Possibly several answers.)

Schema Graph
- Directed acyclic graph
- Nodes are field names
- Intra-application edges link fields in the same application
- Inter-application edges link fields across applications
- All (non-source, non-sink) nodes in the schema graph are (possibly approximate) primary or foreign keys of their applications
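To make the structure concrete, here is a minimal sketch of one possible representation; FieldNode, SchemaGraph, and the edge-kind tagging are illustrative choices, not the paper's actual data structures:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FieldNode:
    application: str   # e.g., "Sales"
    name: str          # e.g., "TN"

@dataclass
class SchemaGraph:
    # adjacency list: source node -> [(target node, edge kind)]
    edges: dict = field(default_factory=dict)

    def add_edge(self, src: FieldNode, dst: FieldNode) -> None:
        # An edge is intra-application if both fields live in the same
        # application, inter-application otherwise.
        kind = "intra" if src.application == dst.application else "inter"
        self.edges.setdefault(src, []).append((dst, kind))

# Fragment of the motivating schema:
g = SchemaGraph()
g.add_edge(FieldNode("Sales", "TN"), FieldNode("Inventory", "TN"))             # inter
g.add_edge(FieldNode("Inventory", "TN"), FieldNode("Inventory", "CircuitID"))  # intra
```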

Data Graph
- Given a specific value of the source node X, what are the values of the sink node Y?
- Considers all join paths from X to Y in the schema graph
[Data-graph figure: one X value has no corresponding Sales.BAN; two paths lead to the answer c1]

Scoring Answers
- Which are the correct values?
  - Unclean data
  - No a priori knowledge
- Technique to score data edges
  - What is the probability that the association the edge makes between field values is correct?
- Probabilistic interpretation of data edge scores to score full join paths
  - Edge score aggregation
  - Independent of the length of the path

Scoring Data Edges
- Rely on functional dependencies (we are considering fields that are keys)
- Data edge scores model the error in the data
- Intra-application edge, for fields A and B within the same application: querying the application with value a for A instantiates values b_1, ..., b_n for B; since the functional dependency A -> B admits a single correct value, each association is scored score(a -> b_i) = 1/n (and symmetrically for B -> A)
- Inter-application edge: score equals 1, unless approximate matching is used
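A small executable rendering of this edge scoring, assuming the uniform 1/n interpretation of the functional-dependency argument above; the helper name and the uniform split are illustrative assumptions:

```python
def intra_edge_scores(values):
    """Given the values b_1, ..., b_n instantiated for field B when an
    application is queried with value a for field A, assign each the
    score 1/n: under the functional dependency A -> B only one of the
    returned values can be correct."""
    distinct = set(values)
    n = len(distinct)
    return {b: 1.0 / n for b in distinct} if n else {}

# e.g., unclean data yields two candidate BANs for one telephone number:
print(intra_edge_scores(["b1", "b2"]))   # {'b1': 0.5, 'b2': 0.5}
```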

Scoring Data Paths
- A single data path is scored using a simple sequential composition of its data edge probabilities: a path X -> a -> b -> Y with edge scores 0.5, 0.8, and 0.6 has pathScore = 0.5 * 0.8 * 0.6 = 0.24
- Data paths leading to the same answer are scored using parallel composition, under an independence assumption: two paths with scores 0.24 and 0.2 corroborate each other, giving answerScore = 1 - (1 - 0.24) * (1 - 0.2) = 0.392
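Both compositions in executable form, reproducing the slide's numbers; a minimal sketch under the stated independence assumption:

```python
from math import prod

def path_score(edge_scores):
    """Sequential composition: a path is correct only if every one of
    its data edges is correct."""
    return prod(edge_scores)

def answer_score(path_scores):
    """Parallel composition: an answer reached by several independent
    paths is wrong only if all of its paths are wrong."""
    return 1.0 - prod(1.0 - p for p in path_scores)

p1 = path_score([0.5, 0.8, 0.6])   # 0.24, the X -> a -> b -> Y path
print(answer_score([p1, 0.2]))     # 1 - (1-0.24)*(1-0.2) = 0.392
```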

Identifying Answers
- Only interested in the best answers
- Standard top-k techniques do not apply
  - Answer scores can always be increased by new information
  - We keep score range information
  - Return top answers when identified, possibly without their complete scores (similar to NRA by Fagin et al.)
- Two return strategies
  - Top-k
  - Top-few (weaker stop condition)
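A sketch of an NRA-style stop condition over score ranges; the (lower, upper) bound representation and the unseen_upper parameter are assumptions used for illustration:

```python
def can_return_top_k(bounds, k, unseen_upper=0.0):
    """bounds: answer -> (lower, upper) bounds on its final corroborated
    score. Return True once the k answers with the best lower bounds
    cannot be overtaken by any other seen candidate, nor by a still
    unseen answer whose score is at most unseen_upper; the top-k set
    can then be returned even with incomplete scores."""
    ranked = sorted(bounds.items(), key=lambda kv: kv[1][0], reverse=True)
    if len(ranked) < k:
        return False
    kth_lower = ranked[k - 1][1][0]
    rest_upper = max([ub for _, (_, ub) in ranked[k:]] + [unseen_upper])
    return kth_lower >= rest_upper
```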

Computing Answers
- Take advantage of early pruning: only interested in the best answers
- Incremental data graph computation
  - Probes to each application
  - Cost model is the number of probes
- Standard graph search techniques (DFS, BFS) do not take advantage of score information
- We propose a technique based on the notion of maximum benefit

Maximum Benefit
- The benefit computation for a path uses two components:
  - Known scores of the explored data edges
  - The best way to augment an answer's score, using the residual benefit of unexplored schema edges
- Our strategy makes choices that aim at maximizing this benefit metric
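An illustrative reading of this strategy, with an assumed (known score, residual benefit) representation for partially explored paths; the paper's actual benefit metric may differ:

```python
def choose_next_probe(frontier):
    """frontier: list of (known_score, residual_benefit) pairs, one per
    candidate probe. known_score is the product of the scores of the
    data edges explored so far; residual_benefit is an upper bound on
    what the unexplored schema edges could still contribute. Greedily
    pick the probe whose best-case (maximum) benefit is largest."""
    benefits = [known * residual for known, residual in frontier]
    return max(range(len(frontier)), key=benefits.__getitem__)

# A shorter, higher-scoring prefix wins over a longer, weaker one:
print(choose_next_probe([(0.9, 0.8), (0.24, 1.0)]))   # 0
```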

VIP Experimental Platform
- Integration platform developed at AT&T
- 30 legacy systems
- Real data
- Developed as a platform for resolving disputes between applications that are due to data inconsistencies
- Front-end web interface

VIP Queries
- Random sample of 150 user queries
- Analysis shows that queries can be classified according to the number of answers they retrieve:
  - noAnswer (nA): 56 queries
  - anyAnswer (aA): 94 queries
    - oneLarge (oL): 47 queries
    - manyLarge (mL): 4 queries
    - manySmall (mS): 8 queries
    - heavyHitters (hH): 10 queries that returned between 128 and 257 answers per query

VIP Schema Graph
[Figure: for the 94 queries with answers, the number of paths leading to an answer vs. the number of paths leading to the top-1 answer]
- Not considering all paths may lead to missing top-1 answers

Number of Parallel Paths Contributing to the Top-1 Answer
[Figure] Average of 10 parallel paths per answer, of which 2.5 are significant

Cost of Execution
[Figure: experimental results on execution cost, measured in number of probes]

Related Work (Data Cleaning)
- Keyword search in DBMSs (BANKS, DBXplorer, DISCOVER, ObjectRank)
  - Query is a set of keywords; top-k query model; DB as a data graph
  - Do not agglomerate scores
- Top-k query evaluation (TA, MPro, Upper)
  - Consider each tuple as an entity; wait for exact answers (except for NRA)
  - Do not agglomerate scores
- Probabilistic ranking of DB results
  - Queries are not selective and have large answer sets
- We take corroborative evidence into account to rank query results

Contributions
- Multiple Join Path framework
  - Uses corroborating evidence to identify high-quality results
  - Looks at all paths in the schema graph
- Scoring mechanism
  - Probabilistic interpretation
  - Takes schema information into account
- Techniques to compute answers
  - Take agglomerative scoring into account
  - Top-k and top-few

Outline
- Answer Corroboration for Data Cleaning
  - Motivations
  - Multiple Join Path Framework
  - Our Approach
  - Experimental Evaluation
- Answer Corroboration for Web Search
  - Motivations
  - Our Approach
  - Challenges

Motivations
- Information on web sources is unreliable: erroneous, misleading, biased, outdated
- Users check many web sites to confirm the information: data corroboration
- Can we do that automatically to save time?

Example: What is the gas mileage of my Honda Civic?
- Query: "honda civic 2005 gas mileage" on MSN Search
- Is the top hit, the carhybrids.com site, trustworthy?
- Is the Honda web site unbiased?
- Are all these values referring to the correct model year?
- Users may check several web sites to get an answer

Example: Aggregating Results using Data Corroboration
- Combine similar values
- Use the frequency of the answer as the ranking measure (out of the first 10 pages; one page had no answer)
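A minimal sketch of this aggregation, assuming a hypothetical extract_answers component (the IE/QA step discussed below) and a normalize function that merges similar values:

```python
from collections import Counter

def corroborate(pages, extract_answers, normalize=str.lower):
    """Rank candidate answers by the number of retrieved pages that
    support them. Each page votes at most once per distinct value."""
    votes = Counter()
    for page in pages:
        for value in {normalize(v) for v in extract_answers(page)}:
            votes[value] += 1
    return votes.most_common()

# Toy usage with a stub extractor over page texts:
pages = ["... 51 MPG ...", "... 51 mpg ...", "... 47 mpg ..."]
extract = lambda text: [w + " mpg" for w in text.split() if w.isdigit()]
print(corroborate(pages, extract))   # [('51 mpg', 2), ('47 mpg', 1)]
```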

Challenges
- Designing a meaningful ranking function (a sketch combining these components follows this list)
  - Frequency of the answer in the result set
  - Importance of the web pages containing the answer, as measured by the search engine (e.g., PageRank)
  - Importance of the answer within the page
    - Use of formatting information within the page
    - Proximity of the answer to the query terms
    - Multiple answers per page
  - Similarity of the page with other pages: a dampening factor
    - Reduce the impact of copy-paste sites
    - Reduce the impact of pages from the same domain
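One way these components might combine into a single answer score, strictly as a sketch: the multiplicative form, the default weighting, and the geometric per-domain dampening are assumptions, not the talk's settled design (the next slides note the scoring is still being refined):

```python
def corroborative_score(occurrences, dampening=0.5):
    """occurrences: list of (engine_score, in_page_score, domain) tuples,
    one per page supporting a given answer. engine_score reflects the
    search engine's ranking of the page; in_page_score reflects the
    answer's prominence within it. Repeated pages from the same domain
    are geometrically dampened to blunt copy-paste sites."""
    seen = {}
    total = 0.0
    for engine_score, in_page_score, domain in occurrences:
        repeat = seen.get(domain, 0)
        seen[domain] = repeat + 1
        total += engine_score * in_page_score * (dampening ** repeat)
    return total

# Two pages from one domain count for less than two distinct domains:
print(corroborative_score([(0.9, 1.0, "a.com"), (0.8, 1.0, "a.com")]))  # ~1.3
print(corroborative_score([(0.9, 1.0, "a.com"), (0.8, 1.0, "b.com")]))  # ~1.7
```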

Challenges (cont.)
- Selecting the result set (web pages)
  - How deep in the search engine results do we go?
  - Low-ranked pages will not contribute much to the score: use top-k pruning techniques
- Extracting information from the web pages
  - Use existing Information Extraction (IE) and Question Answering (QA) techniques

Current Work
- Focus on numerical queries
  - Analysis of MSN queries shows that they have a higher clickthrough rate than general queries
  - Answers are easier to identify in the text
- Scoring function
  - Currently a simple aggregation of individual parameter scores
  - Working on a probabilistic approach
- Number of pages accessed
  - Dynamic selection based on score information

Evaluation
- 15 million query logs from MSN
- Focus on:
  - Queries with high clickthrough rates
  - Numerical-value queries (for now)
- Compare clickthrough with best-ranked sites to measure precision and recall
- User studies

Interface
[Screenshot of the corroborative query interface]

Related Work
- Web search
  - Our interface is built on top of a standard search engine
- Question answering systems (START, AskMSR, MULDER)
  - Some have used answer frequency to increase scores (AskMSR, MULDER)
  - We are considering more complex scoring mechanisms
- Information extraction (Snowball)
  - We can use existing techniques to identify information within a page
  - Our problem is much simpler than standard IE
- Top-k queries (TA, Upper, MPro)
  - We need pruning functionality to stop retrieving web search results

Conclusions
- Large amounts of low-quality data: users have to rummage through a lot of information
- Data corroboration can improve the quality of query results
  - Has not been used much in practice
  - Makes sense in many applications
- Standard ranking techniques have to be modified to handle corroborative scoring
  - Standard ranking scores each answer individually; corroborative ranking combines answers
  - Pruning conditions in top-k queries do not work on corroborative answers