Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa, August 27, 2010

Outline: Introduction, Problem Definition, An Example Scenario, Model of Schemas, Main Approach, Experiment, Conclusion

Introduction
Deep web data source:
– Query interface vs. backend database
– Input schema vs. output schema (input attributes vs. output attributes)
An example of a deep web data source: SNP500 Cancer

Motivation
Vast amounts of information are hidden in the deep web
Research on searching and integrating the deep web
Challenge: interdependence between deep web data sources
Key issue: discovering the input-output semantic relations of the deep web
Critical need: automatic or semi-automatic integration

An Example Query (Multiple Interdependent Deep Web Data Sources)

Overall Context
Discover data source metadata
Generate query plans for search
Query caching mechanism
Fault tolerance mechanism

Problem Definition
Schema matching
– Finding the semantic correspondence between attributes
Three types of schema matching
– Input schema matching: provides a unified interface for users
– Output schema matching: supports query mediation and data integration
– Input-output schema matching: enables search across multiple deep web data sources
Goal: input-output schema matching

Model of Schemas
Input schema
– Describes the input attributes on the query interface
– An input attribute corresponds to a text input box on the interface and is represented by:
  – Label: the text surrounding the attribute
  – Instance set
We focus on text-query-based interfaces

Model of Schemas (cont’d)
Output schema
– Describes the output attributes on the output webpage
A hierarchical model
– Related attributes are in a table or a separate block
– Leaf node: an attribute in the schema
– Internal node: a group of attributes
– An output attribute is described by L (label), I (instance set), P (parent’s label), and S (siblings)
Web Source Example

Main Approach
Task
– Identify the input-output semantic mapping of multiple data sources
Two components
– Finding instances for input attributes
  – From the query interface
  – From the output webpages of data sources
– Schema matching via clustering
  – Attributes that map to each other are grouped together

Discovering Instances
Observation:
– Help webpages are provided by deep web data sources
  – They are reached through links on the query interface
  – They are meant to help users query the data source
  – They contain useful instances
– Web Source Example
A method for discovering instances for an input attribute with label L

Discovering Instances (cont’d)
Identifying potential help webpages
– Useful links on the interface lead to help webpages
– Useful links are identified by keywords: help, search hints, sample, about, how, …
Locating instances in help webpages
– Instances are surrounded by meaningful keywords: such as, :, (), e.g., for example, for instance, like, label L
– Sentences that contain these keywords are extracted
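
The keyword-based steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the whole-phrase matching rule, and the naive sentence splitter are assumptions, and the punctuation cues ":" and "()" from the slide are left out for brevity.

```python
import re

# Keyword lists are taken from the slide; how they are matched is an assumption.
LINK_KEYWORDS = ("help", "search hints", "sample", "about", "how")
CUE_PHRASES = ("such as", "e.g.", "for example", "for instance", "like")


def _phrase_pattern(phrases):
    """Compile a case-insensitive pattern that matches any phrase as a whole unit."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def is_help_link(anchor_text):
    """Return True if the link text contains one of the help keywords."""
    return _phrase_pattern(LINK_KEYWORDS).search(anchor_text) is not None


def candidate_sentences(page_text, label):
    """Return the sentences that contain a cue phrase or the attribute label."""
    # Naive split: sentence-ending punctuation, whitespace, then a capital letter.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", page_text.strip())
    pattern = _phrase_pattern(CUE_PHRASES + (label,))
    return [s for s in sentences if pattern.search(s)]
```

Matching whole phrases rather than substrings keeps "how" from firing on a link such as "Show results".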

Discovering Instances (cont’d)
Discovering potential instances
– Idea: biological terms are rarely used in other domains
– A large number of documents are collected from six domains: economics, science, politics, arts, sports, history
– The document frequency of each term is computed over this collection
– Each term in the extracted sentences is processed
  – A term is a potential instance if its document frequency is below a threshold
  – Each potential instance is validated through the interface
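
A small sketch of the document-frequency filter described above. The tokenizer, the function names, and the use of a raw document count as the threshold are assumptions; the final validation step (submitting each candidate through the query interface) is not shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequency(documents):
    """Count, for every term, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))  # a set, so each document counts once
    return df


def potential_instances(sentences, df, threshold):
    """Return terms whose document frequency is below the threshold."""
    candidates = []
    for sentence in sentences:
        for term in tokenize(sentence):
            if df[term] < threshold and term not in candidates:
                candidates.append(term)
    return candidates
```

With a realistic background collection, common words have a high document frequency and are filtered out, while domain-specific identifiers remain.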

Discovering Instances (cont’d)
Output webpages: another source of instances
– Sometimes no instance is provided by an interface
– The quantity of instances is small
– Multiple data sources are interdependent
– Instances are borrowed from output attributes
A dynamic algorithm for learning instances from output webpages

Discovering Instances (cont’d)
Step 1: Initial input instances are discovered from help webpages
Step 2: Output attributes and their instances are obtained by querying with the instances of the input attributes
Step 3:
– For each input attribute, instances are borrowed from output attributes
– Output attributes with higher semantic similarity have higher priority
– Go to Step 2
Stopping criteria
– The instance sets of all input attributes are larger than a threshold
– No more output attributes or instances are discovered
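
The loop above might be sketched as below. Everything here is illustrative: `fetch_outputs` and `similarity` are hypothetical callables standing in for querying the data sources and for the semantic similarity measure, and `min_similarity` and `max_rounds` are extra safeguards that the slide does not mention.

```python
def learn_instances(input_attrs, seed_instances, fetch_outputs, similarity,
                    threshold=2, min_similarity=0.5, max_rounds=10):
    """Grow the instance set of each input attribute from output attributes.

    seed_instances: dict mapping an input attribute to its help-page instances.
    fetch_outputs:  callable(instances) -> dict of output attribute -> instance set.
    similarity:     callable(input_attr, output_attr) -> float.
    """
    # Step 1: start from the instances found on help webpages.
    instances = {a: set(seed_instances.get(a, ())) for a in input_attrs}
    for _ in range(max_rounds):
        # Stop when every instance set is larger than the threshold.
        if all(len(v) > threshold for v in instances.values()):
            break
        # Step 2: query the sources with the current instances.
        outputs = fetch_outputs(instances)
        changed = False
        # Step 3: borrow from the most similar output attribute first.
        for attr in input_attrs:
            if len(instances[attr]) > threshold:
                continue
            ranked = sorted(outputs, key=lambda o: similarity(attr, o), reverse=True)
            for out in ranked:
                if similarity(attr, out) < min_similarity:
                    break  # remaining output attributes are even less similar
                new = outputs[out] - instances[attr]
                if new:
                    instances[attr] |= new
                    changed = True
                    break
        # Stop when nothing new was discovered in this round.
        if not changed:
            break
    return instances
```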

Main Approach: Learning Instances from Output Webpages (cont’d)
Web Source Example

Similarity Evaluation
A criterion for evaluating the semantic similarity between two attributes. Its components:
– Similarity of label
– Similarity of type
– Similarity of value
– Similarity of domain
– Similarity of parent
– Similarity of sibling

Similarity Evaluation (cont’d)
Similarity of label
– Linguistic similarity
– Vector space model
  – For two labels s and t, each label is modeled by a vector
  – The similarity is computed with the cosine function
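
As a rough illustration of the vector space model for labels, the sketch below builds a term-count vector per label and takes the cosine of the two vectors. Using raw term counts as weights and a simple alphanumeric tokenizer are assumptions; the slide does not say how the vectors are weighted.

```python
import math
import re
from collections import Counter


def label_vector(label):
    """Model a label as a bag-of-words vector of term counts."""
    return Counter(re.findall(r"[a-z0-9]+", label.lower()))


def cosine_similarity(s, t):
    """Cosine of the angle between the vectors of labels s and t."""
    u, v = label_vector(s), label_vector(t)
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0
```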

Similarity Evaluation (cont’d)
Similarity of type
– Type similarity is 1 for attributes of the same type
– Types: string and numeric
Similarity of value
– Best Match algorithm
– The pair of instances with the largest similarity is matched iteratively
Similarity of domain
– For numeric attributes
– Overlap in the ranges of instances
Parent similarity
– Linguistic similarity
Sibling similarity
– Best Match
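
One possible reading of the Best Match and range-overlap measures is sketched below. Several details are assumptions rather than the authors' definitions: `difflib.SequenceMatcher` stands in for the instance-level similarity, the matched scores are averaged over the larger set, and the overlap is normalized by the combined range.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Assumed instance-level similarity: case-insensitive sequence ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match_similarity(set_a, set_b, sim=string_similarity):
    """Repeatedly match the most similar remaining pair, then average."""
    left, right = list(set_a), list(set_b)
    if not left or not right:
        return 0.0
    denominator = max(len(left), len(right))
    total = 0.0
    while left and right:
        score, a, b = max(((sim(a, b), a, b) for a in left for b in right),
                          key=lambda item: item[0])
        total += score
        left.remove(a)
        right.remove(b)
    return total / denominator


def range_overlap(values_a, values_b):
    """Overlap of two numeric ranges as a fraction of their combined range."""
    low = max(min(values_a), min(values_b))
    high = min(max(values_a), max(values_b))
    span = max(max(values_a), max(values_b)) - min(min(values_a), min(values_b))
    if span == 0:
        return 1.0  # both ranges are the same single point
    return max(0.0, high - low) / span
```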

Schema Matching
Schema matching is based on a clustering process
– Initially, each attribute is a cluster
– The two clusters with the largest similarity are merged repeatedly
– The repetition stops when the largest similarity is smaller than a threshold
– The similarity between two clusters is the average similarity of the attributes in the two clusters
Attribute mapping
– Attributes in each cluster are mapped to each other
– An input attribute and an output attribute in the same cluster reveal an input-output relation
– A cluster may contain more than one attribute from the same data source
  – These attributes are mapped to the other attributes in the cluster
  – The former are called simple attributes
  – The latter are called composite attributes
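
The clustering process above corresponds to agglomerative clustering with average linkage and a similarity cutoff. A minimal sketch, assuming a hypothetical pairwise `similarity` function over attributes is already available:

```python
from itertools import combinations


def cluster_attributes(attributes, similarity, threshold):
    """Merge the two most similar clusters until the best score drops below threshold."""
    clusters = [[a] for a in attributes]  # initially, each attribute is a cluster

    def cluster_similarity(c1, c2):
        # Average similarity over all attribute pairs across the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(similarity(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        best_score, i, j = max(
            (cluster_similarity(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

Each returned cluster is then read as a group of attributes mapped to each other, as described in the attribute mapping bullets.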

Experimental Evaluation
Data set
– 11 data sources with 24 query interfaces
– Data about SNPs, genes, proteins, and related information
Instances discovered from interfaces

Evaluation Metrics
Precision
– The percentage of correct mappings among all mappings identified by our algorithm
Recall
– The percentage of correct mappings identified by our algorithm among all mappings in the data set
F-measure
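
The metrics can be computed as in the sketch below. The slide does not show the F-measure formula, so the balanced form (harmonic mean of precision and recall) is assumed here; representing a mapping as a pair of attribute names is also an assumption.

```python
def evaluate(found, gold):
    """Return (precision, recall, F-measure) for a set of identified mappings."""
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Assumed balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```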

Experiment (cont’d)
All types of schema matching
Accuracy of all types of schema matching

Experiment (cont’d)
Data sources are divided into two sets:
– Simple set: data sources that have only simple attributes
– Composite set: data sources that contain composite attributes
Attributes are divided into two sets:
– String attributes vs. numeric attributes

Conclusion
An algorithm for automatic input-output schema matching on biological deep web data sources
– Uses query instances
– Uses output from related data sources
A clustering approach is used to identify the semantic mapping of attributes
Our algorithm achieves good performance on biological data sets

Questions & Comments?