Light-weight Domain-based Form Assistant: Querying Web Databases On The Fly
Authors: Z. Zhang, B. He, K. C.-C. Chang (Univ. of Illinois at Urbana-Champaign)


Light-weight Domain-based Form Assistant: Querying Web Databases On The Fly
Authors: Z. Zhang, B. He, K. C.-C. Chang (Univ. of Illinois at Urbana-Champaign)
Published in: Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
Presented by: Bruce Vincent, CSE-718 Seminar, April 25, 2008

Outline
- Overview
- Problem Description, Motivating Example
- System Architecture
- Design Approaches
- Query Modeling and Translation
- Dynamic Predicate Mapping
- Implementation - Form Assistant Toolkit
- Experiments
- Related Work

Problem Description: the “Deep Web”
- Estimated to contain 450,000 online databases (2004)
- Sometimes referred to as the “Invisible Web” or “Hidden Web”
- Much of it is accessible only through query forms instead of static URL links
- Common domains include books, cars, and airfares

Problem Description
- Often it can be useful to query multiple alternative sources in the same domain
- Automating this entails several components
  - One key component is dynamic query translation
- The software toolkit “Form Assistant” is designed to provide potential translations of user queries for alternative sources
  - e.g., a user-entered Amazon form query is automatically translated to a potential Barnes & Noble form query

Problem Description
Goals of the query translator:
- Source-generality: built-in translation must generally cope with new or “unseen” sources
- Domain-portability: the translator must be easily customizable with domain-specific knowledge, and thus deployable for new domains

Motivating Example
- Source query Q_s on source form S (e.g., Amazon)
- Target query form T (e.g., Barnes & Noble)

Motivating Example
- Source query Q_s on source form S
- Target query form T
- Query Translation produces:
  - Union query Q_t*: [figure: target forms filled in with “Tom Clancy”, combined by ∪]
  - Filter: σ title contain “red storm” and price 12

System Architecture
Form Assistant (FA)
- Inputs: source query Q_s; target query form QI
- Form Extractor
- Attribute Matcher: syntax-based schema matching
- Predicate Mapper: type-based search-driven mapping
- Query Rewriter: constraint-based query rewriting
- Domain knowledge: domain-specific thesaurus; domain-specific type handlers
- Output: target query Q_t*

Design Approaches
- Query Modeling: Vocabulary and Syntax
- Query Translation
- Dynamic Predicate Mapping

Query Modeling: Vocabulary
Predicate templates: {P1, P2, P3, P4, P5}
Example: [figure: query form with its fields labeled P1, P3, P4, P2, P5]

Query Modeling
Example Vocabulary (predicate templates):
- P1 = [author; contain; $au]
- P2 = [title; contain; $ti]
- P3 = [subject; contain; $su]
- P4 = [isbn; contain; $isbn]
- P5 = [price; between; $s, $e]
Example Syntax (valid conjunctive forms):
- F1 = P1 ∧ P5
- F2 = P2 ∧ P5
- F3 = P3 ∧ P5
- F4 = P4 ∧ P5
- F5 = P1
- F6 = P2
- F7 = P3
- F8 = P4
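The vocabulary and syntax on this slide can be written down as plain data. The following is a minimal Python sketch, not the Form Assistant's actual representation: the class name `PredicateTemplate`, the `VALID_FORMS` table, and the helper `matching_form` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredicateTemplate:
    """One predicate template [attribute; operator; parameters] (hypothetical class)."""
    attribute: str
    operator: str
    parameters: tuple  # placeholder names such as "$au"


# Vocabulary: the five predicate templates listed on the slide.
P1 = PredicateTemplate("author", "contain", ("$au",))
P2 = PredicateTemplate("title", "contain", ("$ti",))
P3 = PredicateTemplate("subject", "contain", ("$su",))
P4 = PredicateTemplate("isbn", "contain", ("$isbn",))
P5 = PredicateTemplate("price", "between", ("$s", "$e"))

# Syntax: each valid form is the set of templates that may be filled in together.
VALID_FORMS = {
    "F1": frozenset({P1, P5}),
    "F2": frozenset({P2, P5}),
    "F3": frozenset({P3, P5}),
    "F4": frozenset({P4, P5}),
    "F5": frozenset({P1}),
    "F6": frozenset({P2}),
    "F7": frozenset({P3}),
    "F8": frozenset({P4}),
}


def matching_form(templates):
    """Return the name of the valid form that uses exactly these templates, or None."""
    wanted = frozenset(templates)
    for name, form in VALID_FORMS.items():
        if form == wanted:
            return name
    return None
```

With this table, `matching_form([P1, P5])` finds F1, while a combination such as author plus title has no valid form, so a query using both cannot be submitted to this target as a single form query.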

Query Modeling
Example Vocabulary Instantiations:
- p1 = [author; contain; Tom Clancy]
- p2 = [title; contain; red storm]
- p5^1 = [price; between; 0-25]
- p5^2 = [price; between; 25-45]
Corresponding Form Queries:
- f1 = p1 ∧ p5^1
- f2 = p1 ∧ p5^2
Resultant Union Query:
- Q_t = f1 ∪ f2
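To make the union query concrete, here is a small Python sketch that evaluates it over an in-memory list. Everything outside the slide is an assumption: the helper names, the inclusive reading of "between", and the sample records, which are invented purely for illustration.

```python
def contains(field, text):
    """Predicate [field; contain; text], case-insensitive (assumed semantics)."""
    return lambda record: text.lower() in record[field].lower()


def between(field, low, high):
    """Predicate [field; between; low-high], inclusive on both ends (assumed)."""
    return lambda record: low <= record[field] <= high


def conjunction(*predicates):
    """A form query: every predicate must hold."""
    return lambda record: all(p(record) for p in predicates)


def union_query(*form_queries):
    """A union query: at least one form query must hold."""
    return lambda record: any(f(record) for f in form_queries)


# Instantiated predicates from the slide.
p1 = contains("author", "Tom Clancy")
p5_1 = between("price", 0, 25)
p5_2 = between("price", 25, 45)

f1 = conjunction(p1, p5_1)
f2 = conjunction(p1, p5_2)
q_t = union_query(f1, f2)

# Invented sample records, not data from the paper.
BOOKS = [
    {"author": "Tom Clancy", "title": "Sample Title A", "price": 9.99},
    {"author": "Tom Clancy", "title": "Sample Title B", "price": 39.50},
    {"author": "Tom Clancy", "title": "Sample Title C", "price": 80.00},
    {"author": "Another Author", "title": "Sample Title D", "price": 12.00},
]


def run(query, books=BOOKS):
    """Return the titles of all records that satisfy the query."""
    return [book["title"] for book in books if query(book)]
```

On the sample records, `run(q_t)` keeps the two Tom Clancy titles priced within 0-45 and drops both the 80.00 title and the other author's book. The title predicate p2 is not part of f1 or f2, which fits the motivating example, where the title condition appears in the filter applied after the union query.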

Query Modeling: Syntax
Valid combinations of predicate templates: {F1, F2, F3, F4, F5, F6, F7, F8}
Example (‘v’ indicates ‘valid’):

               F1  F2  F3  F4  F5  F6  F7  F8
P1 (author)    v               v
P2 (title)         v               v
P3 (subject)           v               v
P4 (isbn)                  v               v
P5 (price)     v   v   v   v

[Figure: forms F1 and F2 filled in for “Tom Clancy”]

Query Translation
- Based on semantic closeness of query predicates: finds the minimal subsuming translation C_min
- Benefits of this approach:
  - No false positives
  - Minimizes false negatives
  - Has clear semantics, independent of DB content
  - Modular translation

Query Translation
Example (figure: price ranges on a number line):
- s: 0-35
- t1: 0-25
- t2: 25-45
- t3: 45-65
- t1 ∨ t2: 0-45
- t1 ∨ t2 ∨ t3: 0-65
- Which of these is C_min?

Query Translation
Definition: Given source query Q_s and target query form T, a query Q_t* is a “minimal subsuming translation” w.r.t. T if:
1. Q_t* is a valid query w.r.t. T
2. Q_t* subsumes Q_s, i.e., for any database instance D_i, Q_s(D_i) ⊆ Q_t*(D_i)
3. Q_t* is minimal, i.e., there is no query Q_t, not equivalent to Q_t*, such that Q_t satisfies (1) and (2) above and Q_t* subsumes Q_t

Query Translation
Example: Consider source query Q_s in the first example and three target queries Q_t1, Q_t2, Q_t3:
- Q_t1 = (f1: p1 ∧ p5^1) ∪ (f2: p1 ∧ p5^2)
- Q_t2 = f2
- Q_t3 = f3: p1
where p1 = [author; contain; Tom Clancy], p5^1 = [price; between; 0-25], p5^2 = [price; between; 25-45]
- Q_t1 and Q_t3 subsume Q_s, while Q_t2 does not
  - Q_t2 misses price range 0-25
  - Thus it can’t be the best translation C_min
- Prune Q_t3 because it subsumes Q_t1
- That leaves Q_t1 as C_min
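The pruning argument in this example can be replayed mechanically. The sketch below is a simplification under stated assumptions, not the paper's algorithm: all three candidates share the same author predicate, so each query is reduced to the price intervals it accepts, with Q_t3 (no price constraint) modeled as an unbounded interval and the source price range taken as 0-35. The function names are hypothetical.

```python
import math


def covers(intervals, target):
    """True if the union of the closed intervals contains the closed interval `target`."""
    low, high = target
    reach = low
    for start, end in sorted(intervals):
        if start > reach:
            break  # a gap before `reach` can no longer be filled
        reach = max(reach, end)
    return reach >= high


def subsumes(a, b):
    """True if interval set `a` accepts every price that interval set `b` accepts."""
    return all(covers(a, interval) for interval in b)


def minimal_subsuming(source, candidates):
    """Keep candidates that subsume the source and do not strictly subsume another such candidate."""
    subsuming = {name: ivs for name, ivs in candidates.items() if covers(ivs, source)}
    minimal = []
    for name, ivs in subsuming.items():
        strictly_larger = any(
            other != name and subsumes(ivs, other_ivs) and not subsumes(other_ivs, ivs)
            for other, other_ivs in subsuming.items()
        )
        if not strictly_larger:
            minimal.append(name)
    return minimal


SOURCE = (0, 35)  # assumed source price range
CANDIDATES = {
    "Q_t1": [(0, 25), (25, 45)],        # f1 ∪ f2
    "Q_t2": [(25, 45)],                 # f2 only
    "Q_t3": [(-math.inf, math.inf)],    # f3: author only, no price constraint
}
```

Here `minimal_subsuming(SOURCE, CANDIDATES)` rejects Q_t2 because it does not cover 0-25, and prunes Q_t3 because it strictly subsumes Q_t1, leaving Q_t1 as on the slide.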

Dynamic Predicate Mapping
- Tasks:
  - Choose operator
  - Fill in values
- Objective: a minimal subsuming mapping between source and target
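For a numeric attribute, "fill in values" can be illustrated with a simple interval selection. This is a hedged sketch of one plain technique, not the paper's type-based search-driven mapping: it assumes the target form offers a fixed list of non-overlapping price ranges and picks every range that overlaps the source range. The function name and the option list are hypothetical.

```python
def map_range_predicate(source, target_options):
    """Pick the target ranges that overlap `source`; return None if they leave a gap.

    Assumes the target ranges do not overlap each other, in which case the
    selected ranges form the smallest union that still covers the source range.
    """
    low, high = source
    chosen = sorted(
        option for option in target_options
        if option[0] < high and option[1] > low  # overlap of positive length
    )
    reach = low
    for start, end in chosen:
        if start > reach:
            return None  # part of the source range has no target range
        reach = max(reach, end)
    return chosen if reach >= high else None


# Hypothetical selectable price ranges on a target form.
TARGET_PRICE_OPTIONS = [(0, 25), (25, 45), (45, 65)]
```

For a source range of 0-35, the sketch selects 0-25 and 25-45 and skips 45-65, which corresponds to a union of two target predicates. When the target ranges cannot cover the source range at all, it returns `None` rather than a partial mapping, since a partial mapping would not subsume the source predicate.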

Dynamic Predicate Mapping
Example: [figure: Predicate Mapping, showing an input predicate and an output combined by ∪]

System Architecture (reminder)
Form Assistant (FA)
- Inputs: source query Q_s; target query form QI
- Form Extractor
- Attribute Matcher: syntax-based schema matching
- Predicate Mapper: type-based search-driven mapping
- Query Rewriter: constraint-based query rewriting
- Domain knowledge: domain-specific thesaurus; domain-specific type handlers
- Output: target query Q_t*

Implementation – Form Assistant Toolkit
- Form Extractor
  - Parses HTML into query predicate templates [attr; op; val]
  - Details are discussed in a different paper [3.] by the same research group
- Attribute Matcher (1:1)
  - Identifies semantically corresponding attributes between forms
  - Customized with a domain thesaurus (indexes synonyms for commonly used concepts)
  - Stems words (e.g., “children” -> “child”) and removes stop words (e.g., “the”)
  - Matches attributes by value type and synonyms
- Predicate Mapper (discussed in previous slides)
- Query Rewriter
  - Well-studied problem: find the minimal subsuming query of a given predicate-mapped query (uses the approach of [5.] by Papakonstantinou et al.)
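The Attribute Matcher steps listed above (stop-word removal, stemming, thesaurus lookup, 1:1 matching) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the stop-word list, the crude stemming rule, the thesaurus entries, and the greedy first-match strategy are all invented here, and the value-type comparison mentioned on the slide is left out.

```python
# All three tables are hypothetical examples, not the toolkit's actual resources.
STOP_WORDS = {"the", "of", "a", "an", "by", "for"}
IRREGULAR_STEMS = {"children": "child"}
THESAURUS = {"writer": "author", "cost": "price"}  # synonym -> canonical concept


def stem(word):
    """Very crude stemmer: irregular-form lookup, then strip a plural 's'."""
    if word in IRREGULAR_STEMS:
        return IRREGULAR_STEMS[word]
    if word.endswith("s") and len(word) > 3 and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(label):
    """Lowercase a form label, drop stop words, stem, and map synonyms to concepts."""
    words = [w for w in label.lower().replace(":", " ").split() if w not in STOP_WORDS]
    return frozenset(THESAURUS.get(stem(w), stem(w)) for w in words)


def match_attributes(source_labels, target_labels):
    """Greedy 1:1 matching: pair each source label with the first unused
    target label that shares at least one normalized word."""
    matches = {}
    used = set()
    for source in source_labels:
        for target in target_labels:
            if target not in used and normalize(source) & normalize(target):
                matches[source] = target
                used.add(target)
                break
    return matches
```

For example, `match_attributes(["Author", "Price"], ["Cost:", "Name of the writer"])` pairs "Author" with "Name of the writer" and "Price" with "Cost:" through the thesaurus entries. A real matcher would need a proper stemmer and a conflict-aware assignment rather than the first match found.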

Experiments
- Datasets
  - 447 Deep Web sources (query forms) in 8 domains
  - 3 “Basic” domains, each with a custom thesaurus in FA: Books, Airfares, Automobiles
  - 5 “New” domains (for tests, these don’t have a thesaurus): Car Rentals, Jobs, Hotels, Movies, Music/Records
- Test Approach
  - Run the FA to translate 120 form queries
  - Each translation test corresponds to a random pairing of sources within a domain
  - Count correct mappings in the translation suggested by FA
  - Indicates the amount of user effort the Form Assistant has saved

Experiments
Results: Accuracy Distributions [figure: one chart for the Basic dataset, one for the New dataset]
- X: % correct predicate translations; Y: % tested query forms
- Forms with all 1:1 mappings had 87% perfect accuracy for the Basic dataset and 85% perfect for the New dataset (good domain flexibility)
- Forms having complex mappings: 76%, 70% “near perfect” (Y>80%)
- FA did not attempt complex (n:m) mappings, such as a full name in the source mapping to separate first and last names in the target

Experiments
Accuracy ratio: correct results per 1:1 query [figure: Basic, 3 domains; New, 5 domains]
- Raw: includes some forms whose form extraction step had errors
- Perfect: manually forces all form extractions to be correct
- Average accuracy improves with a perfectly correct extraction step:
  - Basic dataset: 90.4% improves to 96.1%
  - New dataset: 81.1% improves to 86.7%

Experiments
Example Error in Form Extraction
- The delta.com form has a link to an alternative reservation page, “One-way & multi-city reservations”
  - Wrongly interpreted by the Form Extractor as an input field label (attribute)

Experiments
Error Distribution: % of errors caused by each component
- Attribute Matching: 18%
- Form Extraction: 40%
- Predicate Mapping: 42%
- Fewest errors are due to Attribute Matching
- Most errors are due to Predicate Mapping
- The cited reason for PM errors is insufficient domain knowledge
  - Example failure: source subject value “computer science” didn’t properly map to target subject value “programming languages”
  - Improvement could entail a better domain-specific ontology and type handlers

Related Work
From the same research group:
- Complex Matchings (n:m)
  - Defines the “Type Recognizer” used in Form Assistant’s Attribute Matcher, and discusses complex n:m matchings not attempted by Form Assistant:
  - [1.] Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. B. He, K. C.-C. Chang, and J. Han. In Proceedings of the 2004 ACM SIGKDD Conference (KDD 2004) (Full Paper), Seattle, Washington, August 2004
- MetaQuerier System
  - Fuller system for both exploring (to find) and integrating (to query) Deep Web databases:
  - [2.] Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. K. C.-C. Chang, B. He, and Z. Zhang. In Proceedings of the Second Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California, January 2005

Related Work
From the same research group:
- Form Extraction
  - As used by the implementation of Form Assistant:
  - [3.] Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. Z. Zhang, B. He, and K. C.-C. Chang. In Proceedings of the 2004 ACM SIGMOD Conference (SIGMOD 2004), Paris, France, June 2004
- Thorough analysis of the Deep Web
  - Interesting survey of web databases and query interfaces:
  - [4.] Accessing the Deep Web: A Survey. B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Communications of the ACM (CACM), 50(5):94-101, May 2007
- Public Datasets
  - Cached real-world query form web pages (used in experiments)
  - Additional Deep Web integration resources

Related Work
- Query Rewriting
  - As used by the implementation of Form Assistant:
  - [5.] Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. A query translation scheme for rapid implementation of wrappers. In Proceedings of the Fourth International Conference on Deductive and Object-Oriented Databases, Singapore, December 1995.

Thank you!