WebIQ: Learning from the Web to Match Deep-Web Query Interfaces Wensheng Wu Database & Information Systems Group University of Illinois, Urbana Joint work.

Presentation transcript:

WebIQ: Learning from the Web to Match Deep-Web Query Interfaces Wensheng Wu Database & Information Systems Group University of Illinois, Urbana Joint work with AnHai Doan & Clement Yu ICDE, April 2006

2 Search Problems on the Deep Web united.com airtravel.com delta.com Find round-trip flights from Chicago to New York under $500

3 Solution: Build Data Integration Systems Find round-trip flights from Chicago to New York under $500 united.com airtravel.com delta.com Global query interface: comparison shopping systems “on steroids”

4 Current State of Affairs
Very active in both research communities & industry
Research
– multidisciplinary efforts: Database, Web, KDD & AI
– 10+ research groups in US, Asia & Europe
– focuses: source discovery, schema matching & integration, query processing, data extraction
Industry
– Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, …

5 Key Task: Schema Matching 1-1 match Complex match

6 Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
– data integration
– data warehousing
– peer data management
– ontology merging
– view integration
– personal information management
Schema matching across Web sources
– 30+ papers generated in past few years
– Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKUST [VLDB-04], Utah [WebDB-05], …

7 Schema Matching is Still Very Difficult Must rely on properties of attributes, e.g., label & instances Often there is little in common between matching attributes Many attributes do not even have instances! 1-1 match Complex match

8 Matching Performance Greatly Hampered by Pervasive Lack of Attribute Instances
28.1% ~ 74.6% of attributes have no instances
Extremely challenging to match these attributes
– e.g., does departure city match from city or departure date?
Also difficult to match attributes with dissimilar instances
– e.g., airline (with American airlines) vs. carrier (with European ones)

9 Our Solution: Exploit the Web Discover instances from the Web –e.g., Chicago, New York, etc. for departure city & from city Borrow instances from other attributes & validate via Web –e.g., check if Air Canada is an instance of carrier with the Web

10 Key Idea: Question-Answering from AI
Search Web via search engines, e.g., Google
… but search engines do not understand natural language questions
Idea: form extraction queries as sentences to be completed
“Trick” search engine to complete sentences with instances
Example extraction query: “departure cities such as” (attribute label: departure city)
Extraction Patterns (L = attribute label, Ls = its plural, NP = noun phrase)
– Ls such as NP1, …, NPn
– such Ls as NP1, …, NPn
– NP1, …, NPn, and other Ls
– Ls including NP1, …, NPn
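The extraction patterns above can be turned into quoted search-engine queries with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function names and the naive pluralization rule are assumptions made for illustration.

```python
def pluralize(label: str) -> str:
    """Naive English plural of the label's last word (an assumption, not from the slides)."""
    if len(label) > 1 and label.endswith("y") and label[-2] not in "aeiou":
        return label[:-1] + "ies"
    if label.endswith(("s", "x", "ch", "sh")):
        return label + "es"
    return label + "s"


def extraction_queries(label: str) -> list[str]:
    """Build one quoted phrase query per extraction pattern on the slide."""
    ls = pluralize(label)
    return [
        f'"{ls} such as"',    # Ls such as NP1, ..., NPn
        f'"such {ls} as"',    # such Ls as NP1, ..., NPn
        f'"and other {ls}"',  # NP1, ..., NPn, and other Ls
        f'"{ls} including"',  # Ls including NP1, ..., NPn
    ]


queries = extraction_queries("departure city")
```

For the label departure city, the first query is the slide's example, "departure cities such as".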

11 Key Idea: Question-Answering from AI
Search Google & obtain snippets:
– “other departure cities such as Boston, Chicago and LAX available …”
– extraction query: “departure cities such as”; completion: “Boston, Chicago and LAX”
Extract instance candidates from snippets: Boston, Chicago, LAX
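Candidate extraction from a snippet can be sketched as follows. This is an illustrative approximation only: the slides do not say how noun phrases are recognized, so the sketch assumes an instance is a run of capitalized words, and the function name `extract_candidates` is hypothetical.

```python
import re


def extract_candidates(snippet: str, cue: str) -> list[str]:
    """Return the list items that complete `cue` inside `snippet`."""
    match = re.search(re.escape(cue) + r"\s+(.*)", snippet, flags=re.IGNORECASE)
    if not match:
        return []
    candidates = []
    # Split the completion on commas and on the conjunctions "and" / "or".
    for part in re.split(r",|\band\b|\bor\b", match.group(1)):
        words = part.split()
        run = []
        for word in words:
            # Assumption: an instance is a leading run of capitalized words.
            if word[0].isupper():
                run.append(word.strip(".;:"))
            else:
                break
        if run:
            candidates.append(" ".join(run))
        if len(run) < len(words):
            break  # a lowercase word means the enumeration has ended
    return candidates


snippet = "other departure cities such as Boston, Chicago and LAX available"
found = extract_candidates(snippet, "departure cities such as")
```

On the slide's snippet this yields Boston, Chicago and LAX. A real system would need a proper noun-phrase chunker, since capitalization alone misses lowercase instances and over-matches sentence-initial words.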

12 But Not Every Candidate is a True Instance
Reason 1: Extraction queries may not be perfect
Reason 2: Web content is inherently noisy
Example:
– attribute: city
– extraction query: “and other cities”
– extracted candidate: 150
→ need to perform instance verification

13 Instance Verification: Outlier Detection
Goal: Remove statistical outliers (among candidates)
Step 1: Pre-processing
– recognize types of instances via pattern matching & 80% rule
– types: numeric & string
– discard all candidates not of determined type
– e.g., most of the instance candidates for city are strings, so remove 150
Step 2: Type-specific detection
– perform discordance tests
– test statistics, e.g.,
– # of words: abnormal if more than 5 words in person name
– % of numeric characters: US zip code contains only digits
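The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the type test is a plain digit check rather than the pattern matching the slide mentions, and a fixed word-count cutoff stands in for a statistical discordance test. All function names are hypothetical.

```python
def is_numeric(candidate: str) -> bool:
    """Treat digit strings with optional commas and one decimal point as numeric."""
    cleaned = candidate.replace(",", "").replace(".", "", 1)
    return cleaned.isdigit()


def filter_by_type(candidates: list[str], majority: float = 0.8) -> list[str]:
    """Step 1: if at least `majority` of candidates share a type, drop the rest."""
    numeric = [c for c in candidates if is_numeric(c)]
    strings = [c for c in candidates if not is_numeric(c)]
    if candidates and len(numeric) >= majority * len(candidates):
        return numeric
    if candidates and len(strings) >= majority * len(candidates):
        return strings
    return list(candidates)  # no dominant type: keep everything


def drop_long_strings(candidates: list[str], max_words: int = 5) -> list[str]:
    """Step 2 stand-in: flag string candidates with more than `max_words` words."""
    return [c for c in candidates if len(c.split()) <= max_words]


cities = filter_by_type(["Boston", "Chicago", "New York", "Denver", "150"])
```

With four string candidates out of five, the 80% rule fixes the type as string and 150 is discarded, matching the slide's city example.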

14 Instance Verification: Web Validation
Goal: Further semantic-level validation
Idea: Exploit co-occurrence statistics of label & instances
– “Make: Honda; Model: Accord”
– “a variety of makes such as Honda, Mitsubishi”
Form validation queries using validation patterns
– e.g., “make Honda”, “makes such as Honda”
Validation Patterns (V + x), where V is the validation phrase and x the candidate
– L x
– L x
– Ls such as x
– such Ls as x
– x and other Ls
– Ls including x

15 Instance Verification: Web Validation
Possible measure: NumHits(V+x)
– e.g., NumHits(“cities such as Los Angeles”) = 26M
Potential problem: bias towards popular instances
Use PMI(V, x), point-wise mutual information:
PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x))
Example:
– V = “cities such as”, candidates: California, Los Angeles
– NumHits(V+California) = 29
– PMI(V, Los Angeles) = 3000 * PMI(V, California)
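The PMI score is a single ratio of hit counts. The sketch below assumes the three counts have already been fetched from a search engine; the function name `pmi` and its zero-count convention are illustrative choices, not taken from the slides.

```python
def pmi(hits_joint: int, hits_phrase: int, hits_candidate: int) -> float:
    """PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x)).

    hits_joint     -- NumHits(V+x), pages containing the validation phrase followed by x
    hits_phrase    -- NumHits(V), pages containing the validation phrase alone
    hits_candidate -- NumHits(x), pages containing the candidate alone
    """
    if hits_phrase == 0 or hits_candidate == 0:
        return 0.0  # assumption: treat an unseen phrase or candidate as no evidence
    return hits_joint / (hits_phrase * hits_candidate)


# Hypothetical counts: a candidate that appears everywhere on the Web but
# rarely after the validation phrase gets a low score despite its popularity.
popular_non_instance = pmi(hits_joint=30, hits_phrase=1_000, hits_candidate=1_000_000)
rarer_true_instance = pmi(hits_joint=300, hits_phrase=1_000, hits_candidate=10_000)
```

Dividing by NumHits(x) is what removes the popularity bias: the raw joint count alone would favor whichever candidate is mentioned most often overall.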

16 Validate Instances from Other Attributes
Method 1: Discover k more instances from Web
– then check for borrowed one (Aer Lingus for Airline)
→ problem: very likely Aer Lingus not among discovered instances
Method 2: Compare validation score with that of a known instance
→ problem: score for Aer Lingus may be much lower, how to decide?
Key observation: compare also to scores of non-instances
– e.g., Economy (with respect to Airline)

17 Train Validation-Based Instance Classifier
Naïve Bayes classifier with validation-based features
Validation phrases: V1: Airlines such as; V2: Airline

Example      M1   M2   +/-
Air Canada   .5   .3   +
American     .8   .1   +
Economy
First Class
Delta        .6   .3   +
United       .9   .4   +
Jan

Thresholds: t1 = .45, t2 = .075

Example      f1   f2   +/-
Delta        1    1    +
United       1    1    +
Jan

P(C|X) ~ P(C) P(X|C)
P(+) = P(-) = ½
P(f1=1|+) = 3/4
P(f1=1|-) = 1/4
…
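The classifier on this slide can be sketched as a Naive Bayes model over thresholded validation scores. The sketch assumes add-one (Laplace) smoothing, which is consistent with the slide's P(f1=1|+) = 3/4 for two positive training examples but is not stated there. The feature values for the negative examples Economy and First Class did not survive in the transcript, so the (0, 0) rows below are hypothetical placeholders, as are all function names.

```python
from fractions import Fraction

T1, T2 = 0.45, 0.075  # thresholds t1, t2 from the slide


def binarize(scores, thresholds=(T1, T2)):
    """Turn validation scores (M1, M2) into binary features (f1, f2)."""
    return tuple(int(s > t) for s, t in zip(scores, thresholds))


def train(examples):
    """Estimate P(C) and P(f_i = 1 | C) with add-one smoothing (an assumption)."""
    priors, likelihoods = {}, {}
    n_features = len(examples[0][0])
    for label in sorted({y for _, y in examples}):
        rows = [f for f, y in examples if y == label]
        priors[label] = Fraction(len(rows), len(examples))
        likelihoods[label] = [
            Fraction(sum(r[i] for r in rows) + 1, len(rows) + 2)
            for i in range(n_features)
        ]
    return priors, likelihoods


def classify(features, priors, likelihoods):
    """Pick the class maximizing P(C) * prod_i P(f_i | C)."""
    def score(label):
        result = priors[label]
        for f, p in zip(features, likelihoods[label]):
            result *= p if f == 1 else 1 - p
        return result
    return max(priors, key=score)


training = [
    (binarize((0.5, 0.3)), "+"),  # Air Canada, scores from the slide
    (binarize((0.8, 0.1)), "+"),  # American, scores from the slide
    ((0, 0), "-"),                # Economy: hypothetical feature values
    ((0, 0), "-"),                # First Class: hypothetical feature values
]
priors, likelihoods = train(training)
delta_label = classify(binarize((0.6, 0.3)), priors, likelihoods)  # Delta
```

Under these assumptions the estimates reproduce the slide's P(+) = P(-) = 1/2, P(f1=1|+) = 3/4 and P(f1=1|-) = 1/4, and Delta is classified as an instance.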

18 Validate Instances via Deep Web Handle attributes that are difficult to validate via the Web, e.g., from Disadvantage: ambiguity when no results are found

19 Architecture of Assisted Matching System Instance acquisition Interface matcher Source interfaces with augmented instances Attribute matches

20 Empirical Evaluation
Five domains: Airfare, Automobile, Book, Job, Real Estate
– per-domain statistics (table): # schemas, # attributes per schema, % of attributes with no instances, average depth of schemas
Experiments:
– Baseline: IceQ [Wu et al., SIGMOD-04]
– Web assistance
Performance metrics:
– precision (P), recall (R), & F1 (= 2PR/(P+R))
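The three metrics can be computed from a set of predicted attribute matches and a set of gold-standard matches. This is a generic sketch: the function name and the attribute pairs in the example are made up for illustration and are not data from the evaluation.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Return (P, R, F1) with F1 = 2PR / (P + R); zero when undefined."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical matches between attributes of two airfare interfaces.
gold = {("departure city", "from city"), ("airline", "carrier")}
predicted = {("departure city", "from city"), ("departure city", "departure date")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

Here one of two predicted matches is correct and one of two gold matches is found, so P, R and F1 all equal 0.5.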

21 Matching Accuracy Web assistance boosts accuracy (F1) from 89.5 to 97.5

22 Overhead Analysis Reasonable overhead: 6~11 minutes across domains

23 Conclusion
Search problems on the Deep Web are increasingly crucial!
Novel QA-based approach to learning attribute instances
Incorporation into a state-of-the-art matching system
Extensive evaluation over varied real-world domains
→ More details: search for Wensheng Wu on Google