Toward Large Scale Integration

Presentation transcript:

Toward Large Scale Integration Building a MetaQuerier over Databases on the Web 14/05/2019 Sagar Khushalani

What is the ‘Deep Web’?
The Deep Web is World Wide Web content that is not part of the surface Web, i.e., content that is not indexed by standard search engines.

What is the ‘Deep Web’?
Terabytes of content are hidden behind database interfaces
This content is invisible to crawlers, and therefore invisible to users
Difficulties:
  Finding the right database(s)
  Querying the database(s)

Overview
  MetaQuerier
  Goals, Features and Challenges
  System Architecture
  Database Crawler
  Interface Extraction
  Schema Matching (SM)
  Introduction to DCM (Dual Correlation Mining)
  Data Preparation
  Complex Matching
  Correlation Measure
  Putting it together
  Ensemble
  Feedback
  Conclusion

Goals & Challenges
Goals:
  Make the “deep web” accessible (find)
  Make it usable (query)
Challenges:
  What are the query capabilities of a source?
  How to mediate queries?
Factors that help:
  Regularity across sources in the same domain
  New sources are influenced by existing sources

MetaQuerier: System Architecture

Database Crawler
Functionality: Automatically find databases by identifying query interfaces
Insight: Query interfaces are often found close to the root page of a web site
Approach:
  Site Collector
  Shallow Crawler
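
The shallow-crawling idea can be illustrated with a small sketch. It is not the authors' crawler: the in-memory `site` dictionary, the `has_form` flag and the `max_depth` cut-off are assumptions made here so the example runs without network access.

```python
from collections import deque

def shallow_crawl(site, root, max_depth=3):
    """Breadth-first crawl from `root`, following links at most `max_depth` deep.

    `site` is a hypothetical stand-in for a web site:
    {url: {"links": [...], "has_form": bool}}.
    Returns the URLs of visited pages that contain a query form.
    """
    found, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        page = site.get(url)
        if page is None:
            continue  # link to a page we know nothing about
        if page["has_form"]:
            found.append(url)
        if depth == max_depth:
            continue  # do not expand links below the depth limit
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found

# Toy site: two forms near the root, one buried three links deep.
site = {
    "/": {"links": ["/search", "/about"], "has_form": False},
    "/search": {"links": ["/advanced"], "has_form": True},
    "/about": {"links": ["/team"], "has_form": False},
    "/advanced": {"links": [], "has_form": True},
    "/team": {"links": ["/team/deep"], "has_form": False},
    "/team/deep": {"links": [], "has_form": True},
}
forms_found = shallow_crawl(site, "/", max_depth=2)
```

With a depth limit of 2 the crawl visits five of the six pages and still finds both forms that sit near the root; the form three links deep is missed, which is the trade-off a shallow crawl accepts.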

Interface Extraction
Functionality: Extract the query templates of a given interface as 3-tuples
Insight: HTML query forms have a certain hidden syntax. This can be used to convert the problem to a text-parsing problem and to build parse trees.
Approach: The authors use ‘visual language’ parsing

Schema Matching: Functionality
Attributes from different interfaces need to be analysed
Equivalent attributes need to be combined (query mediation)
To find equivalent attributes or attribute groups, we need schema matching

Schema Matching: Procedure

Schema Matching: Approach
Simple matching vs. complex matching
Current schema matching methods compare only two schemas at a time and cannot work with attribute groups
  E.g.: {passengers} = {adults, seniors, children, infants}
Complex matching allows context information to be used
Grouping attributes and synonym attributes

Motivation: Example
User Amy wants to buy 2 tickets to fly from Anchorage to Baltimore: one for herself and one for her child
Website 1 attributes: From, To, Number of passengers
Website 2 attributes: Origin, Destination, Adults, Children, Infants, Seniors

Example: Complex Matching
E.g.:
  {adults, seniors, children} = {number of tickets}
  {from} = {leaving from}
Grouping: {adults, children, seniors} (attributes that are used together)
Synonym: {number of tickets} = {adults, seniors, children} (attribute groups that mean the same thing)
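
To make the role of a complex matching in query mediation concrete, here is a minimal sketch based on Amy's example. The `SYNONYM` and `GROUP` tables are written by hand for illustration, and turning a group into a single value by summing is an assumption of this sketch; the slides only state that the attribute sets are equivalent.

```python
# Hypothetical matchings between Website 2 and Website 1 (hand-written here).
SYNONYM = {"origin": "from", "destination": "to"}  # 1:1 matchings
GROUP = {"number of passengers": ("adults", "children", "infants", "seniors")}  # 1:n matching

def translate(site2_query):
    """Rewrite a query in Website 2's vocabulary into Website 1's vocabulary."""
    site1_query = {SYNONYM[k]: v for k, v in site2_query.items() if k in SYNONYM}
    for target, members in GROUP.items():
        # Assumption: the grouped counts add up to the single attribute.
        site1_query[target] = sum(site2_query.get(m, 0) for m in members)
    return site1_query

# Amy: one adult and one child, Anchorage to Baltimore.
amy = {"origin": "Anchorage", "destination": "Baltimore", "adults": 1, "children": 1}
site1 = translate(amy)
```

The translated query asks Website 1 for 2 passengers, which is what Amy would have typed there by hand.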

Data Preparation
HTML forms are not “minable”
Pre-processing: the data needs to be “prepared”
  Extracting the form
  Attribute type recognition
  Syntactic merging

Data Preparation: Form Extraction
Reads a web page with query forms and extracts the attribute names
  E.g. Title of Book: <name = “title of book”, domain = any>
Method: Parsing

Data Preparation: Type Recognition
Confusion: “departing” can mean:
  City of departure
  Time of departure
Entities are distinguished by both attribute name and type, thus avoiding confusion caused by homonyms
Type Recognizer

Data Preparation: Syntactic Merging
Name-based merging: merge two attributes if their names are similar
  Only merge when A is a variation of B (e.g. “title” and “title of book”) and B is used more frequently than A
Domain-based merging: merge two attributes if their domain values are similar
  Only string-type attributes are considered
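
Name-based merging can be sketched as follows. The slide does not define what counts as a “variation”; the word-subset test in `is_variation` is an assumption chosen for this example, as are the function names and the toy attribute list.

```python
from collections import Counter

def is_variation(a, b):
    """Assumed test: one name's words are a subset of the other's."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return a != b and (wa <= wb or wb <= wa)

def name_based_merge(names):
    """Map each attribute name to the more frequent name it is a variation of.

    `names` lists every attribute occurrence seen across the interfaces.
    A name with no strictly more frequent variation maps to itself.
    """
    freq = Counter(names)
    canonical = {}
    for a in freq:
        better = [b for b in freq if is_variation(a, b) and freq[b] > freq[a]]
        canonical[a] = max(better, key=lambda b: freq[b]) if better else a
    return canonical

observed = ["title", "title", "title", "title of book",
            "author", "author name", "author"]
mapping = name_based_merge(observed)
```

Here “title of book” is merged into the more frequent “title” and “author name” into “author”, while the frequent names stay unchanged.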

Correlation Measure
What is correlation? A test/score based on the contingency table.
Two types of correlation:
  Positive (mp)
  Negative (mn)

Correlation Measure
H-measure: H(Ap, Aq) = (f01 * f10) / (f+1 * f1+)
Negative correlation mn:
  mn = H(Ap, Aq)
Positive correlation mp:
  If f11/f++ < Td: mp = 1 - H(Ap, Aq)
  Else: mp = 0
Td is a threshold parameter
Fig: Contingency table for Ap, Aq
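
The H-measure and the negative correlation mn can be computed directly from the contingency counts. In this sketch each schema is a set of attribute names, and the reading of the subscripts (f10 = Ap present and Aq absent, f1+ = all schemas containing Ap, f+1 = all schemas containing Aq) is an assumption, since the contingency-table figure is not part of the transcript. Returning 0.0 when an attribute never occurs is also a choice made here. The positive measure mp is left out.

```python
def contingency(schemas, ap, aq):
    """Count how often attributes ap and aq occur together or apart."""
    f11 = sum(1 for s in schemas if ap in s and aq in s)
    f10 = sum(1 for s in schemas if ap in s and aq not in s)
    f01 = sum(1 for s in schemas if ap not in s and aq in s)
    f00 = len(schemas) - f11 - f10 - f01
    return f11, f10, f01, f00

def h_measure(schemas, ap, aq):
    """H(Ap, Aq) = (f01 * f10) / (f+1 * f1+)."""
    f11, f10, f01, _ = contingency(schemas, ap, aq)
    f1_plus = f11 + f10  # schemas that contain ap
    f_plus1 = f11 + f01  # schemas that contain aq
    if f1_plus == 0 or f_plus1 == 0:
        return 0.0  # assumption: undefined case treated as no correlation
    return (f01 * f10) / (f_plus1 * f1_plus)

def negative_correlation(schemas, ap, aq):
    """mn = H(Ap, Aq)."""
    return h_measure(schemas, ap, aq)

schemas = [
    {"from", "to", "passengers"},
    {"from", "to", "adults", "children"},
    {"from", "to", "adults", "children"},
    {"from", "to", "passengers"},
]
mn_synonyms = negative_correlation(schemas, "passengers", "adults")
mn_grouped = negative_correlation(schemas, "adults", "children")
```

In this toy data “passengers” and “adults” never co-occur, so their mn is 1.0, while “adults” and “children” always co-occur, so their mn is 0.0.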

Matching: Discovery
Step I: Group Discovery
Positively correlated attributes can form potential groups
However, a group that has no negative correlation with any other group is of no use for matching
  E.g.: {lastname, firstname}: if all sources have this group, there is no need for matching

Matching: Discovery
Step II: Complex Matching Discovery
Matching discovery works on attribute groups
Negative correlation should exist between synonym groups

Matching: Ranking
Each matching needs a score so that matchings can be compared
The score of a matching is the maximal negative correlation over all pairs of attribute groups in the matching:
  Cmax(Mj, mn) = max mn(Gjr, Gjt) for all Gjr, Gjt in Mj where r != t
The score, combined with semantic subsumption, allows matchings to be ranked
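
The ranking score can be sketched in a few lines. The group-level measure `mn` is passed in as a function because the slides do not say how negative correlation is computed between groups; the `toy_scores` values below are made up for illustration, and the tie-breaking by semantic subsumption is not modelled.

```python
from itertools import combinations

def c_max(matching, mn):
    """Score of a matching: the largest mn over all pairs of its groups."""
    return max(mn(g1, g2) for g1, g2 in combinations(matching, 2))

# Attribute groups as frozensets so they can be used as dictionary keys.
p = frozenset({"passengers"})
ac = frozenset({"adults", "children"})
a = frozenset({"author"})
s = frozenset({"subject"})
fl = frozenset({"first name", "last name"})

# Illustrative numbers only, not taken from the paper.
toy_scores = {
    frozenset({p, ac}): 0.95,
    frozenset({a, fl}): 0.8,
    frozenset({s, fl}): 0.3,
}

def toy_mn(g1, g2):
    """Hypothetical group-level negative correlation: a table lookup."""
    return toy_scores[frozenset({g1, g2})]

matchings = [[s, fl], [a, fl], [p, ac]]
ranked = sorted(matchings, key=lambda m: c_max(m, toy_mn), reverse=True)
```

Sorting by the score in descending order puts the matching with the strongest negative correlation between its groups first.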

Matching: Selection
Step III: Complex Matching Selection
Complex matching can create false matchings
  E.g.: {author} = {first name, last name} and {subject} = {first name, last name}
Why? These two matchings conflict: they cannot both be correct, since together they would imply {author} = {subject}
Solution: remove conflicts based on negative correlation

Final Algorithm
Find the highest-ranked matching Mt in each iteration and add it to the result set
Remove the remaining matchings that are inconsistent with Mt
Return the final set of matchings
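
The selection loop above can be sketched as a greedy procedure. The conflict test is supplied by the caller; `share_attribute`, which treats two matchings as inconsistent when they have any attribute in common, is an assumption of this sketch, and the scores are illustrative.

```python
def select_matchings(scored, conflicts):
    """Greedy selection over (matching, score) pairs.

    Repeatedly take the highest-scored remaining matching and drop every
    remaining matching that conflicts with it.
    """
    remaining = sorted(scored, key=lambda pair: pair[1], reverse=True)
    selected = []
    while remaining:
        best, _ = remaining.pop(0)
        selected.append(best)
        remaining = [(m, sc) for m, sc in remaining if not conflicts(best, m)]
    return selected

def share_attribute(m1, m2):
    """Assumed conflict test: the two matchings have an attribute in common."""
    attrs1 = set().union(*m1)
    attrs2 = set().union(*m2)
    return bool(attrs1 & attrs2)

m_author = (frozenset({"author"}), frozenset({"first name", "last name"}))
m_subject = (frozenset({"subject"}), frozenset({"first name", "last name"}))
m_pass = (frozenset({"passengers"}), frozenset({"adults", "children"}))

scored = [(m_subject, 0.3), (m_author, 0.8), (m_pass, 0.95)]
chosen = select_matchings(scored, share_attribute)
```

The lower-scored {subject} = {first name, last name} is discarded because it conflicts with the higher-scored {author} matching, while the unrelated passengers matching is kept.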

Putting it together
As shown before, MetaQuerier consists of various subsystems
The subsystems were developed concurrently. However, this approach raises some questions:
  What level of accuracy is good enough?
  Is it possible to make a subsystem more accurate?
  Each subsystem depends on the previous ones: is one subsystem’s accuracy enough for the next subsystem to function?

Putting it together
Observation:
  Joining the subsystems may require higher accuracy from each individual subsystem
  Information from later subsystems can be fed back to earlier ones to increase accuracy
Solution:
  To sustain the accuracy of SM: Ensemble
  To improve the accuracy of IE: Feedback

Ensemble
Procedure:
  Execute the matcher on a smaller, random sample of the input schemas
  Each sampled set of schemas is called a trial
  Run multiple matchers, where each matcher is executed over an independent trial of schemas
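
A minimal sketch of the ensemble procedure follows. The slide does not say how the results of the trials are combined; majority voting is used here purely as an assumption. The `toy_matcher`, the parameter names and the default values are likewise hypothetical.

```python
import random
from collections import Counter

def ensemble(schemas, matcher, trials=5, sample_size=3, seed=0):
    """Run `matcher` on several random trials and keep majority-voted matchings.

    Each trial is a random sample of `sample_size` schemas. A matching is
    kept if more than half of the trials report it (assumed combination rule).
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(trials):
        trial = rng.sample(schemas, sample_size)
        votes.update(set(matcher(trial)))  # one vote per matching per trial
    return [m for m, v in votes.items() if v > trials / 2]

def toy_matcher(trial):
    """Hypothetical matcher: reports one matching if both attributes are seen."""
    has_p = any("passengers" in s for s in trial)
    has_a = any("adults" in s for s in trial)
    return [("passengers", "adults")] if has_p and has_a else []

schemas = [
    {"from", "to", "passengers"},
    {"from", "to", "passengers"},
    {"from", "to", "adults", "children"},
    {"from", "to", "adults", "children"},
]
result = ensemble(schemas, toy_matcher, trials=5, sample_size=3)
```

Any sample of three of these four schemas contains both kinds of interface, so every trial reports the matching and it survives the vote.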

Feedback: Domain Statistics
Observation: IE can take feedback from SM to improve its accuracy
The information that is passed back is obtained by voting among similar query interfaces, and is known as a domain statistic
There are three types of domain statistics:
  1. Type of attributes
  2. Frequency of attributes
  3. Correlation of attributes

Feedback: Attribute Types
Type of attributes: Commonly occurring attributes have a type that can be used for parsing
  E.g., ISBN is a numeric attribute

Feedback: Attribute Frequency
Frequency of attributes: Certain attributes are very common in a particular domain, e.g. “last name”
“Last Name” is a much more common attribute in a query interface than “e.g. Mike”
This tells the IE system that “Last Name” is probably the right attribute for the text box

Feedback: Correlation
Suppose that “adults” and “children” have a positive correlation, and both have a negative correlation with “passengers”
If “children” has definitely been identified as an attribute, then IE will choose “adults” as the other attribute and ignore “passengers”

Feedback: Domain Statistics
How to combine these three rules (type, frequency, correlation)?
The authors use the following strategy: 3 -> 2 -> 1 (correlation, then frequency, then type)

Review
  What is MetaQuerier?
  System Architecture
  Database Crawler
  Interface Extraction
  Schema Matching
  Data Preparation
  Correlation Measure
  Matching
  Putting it together: issues and solutions

Questions?
Sources:
  Attributed Multi-set Grammar: http://portal.acm.org/citation.cfm?id=864839
  Grammar: http://en.wikipedia.org/wiki/Grammar_(computer_science)#Formal_definition
  Deep Web: http://en.wikipedia.org/wiki/Deep_Web
Papers by the authors:
  http://eagle.cs.uiuc.edu/pubs/2004/parsing-sigmod04-zhc-mar04.pdf
  http://eagle.cs.uiuc.edu/pubs/2003/unifiedschema-sigmod03-hc-mar03.pdf
  http://eagle.cs.uiuc.edu/pubs/2004/dwsurvey-sigmodrecord-chlpz-aug04.pdf
  http://eagle.cs.uiuc.edu/pubs/2004/complexmatching-sigkdd04-hch-jun04.pdf