
Presentation transcript:

This slide was presented in a talk yesterday. The approach I will present now may be one way to tackle this problem.

Multi-level semantic labelling of numerical values Sebastian Neumaier,1 Jürgen Umbrich,1 Josiane Xavier Parreira,2 Axel Polleres1 1. Vienna University of Economics and Business, Vienna, Austria 2. Siemens AG Österreich, Vienna, Austria

Motivation: Open Data. Available information is (partially) structured and tabular [1]: 3-star, 2-star, 1-star. A few years ago, many public institutions and governments started to open up some of their data and publish it at central points. These datasets certainly contain interesting and useful information, such as transport, geography, economics, or science data. However, when we looked into these datasets in more detail, we found out… 82 data portals, 160K datasets. Exploiting the potential: improve existing 3-star (CSV) Open Data; propose semantic labels/context for the content of columns; integrate and interlink data. [1] Umbrich, J., Neumaier, S., Polleres, A.: Quality assessment & evolution of open data portals. In: IEEE International Conference on Open and Big Data, Rome, Italy (2015)

Example. Annotations: dbp:capacity, dpo:City, dpo:Stadium, dpo:Country. Table (header: stadium name): Emirates Stadium, 60361, London, England; Villa Park, 42785, Birmingham; Ewood Park, 31154, Blackburn; … Of course, we are not the first to work with tabular data.

But: Web/HTML tables differ from typical Open Data tables. Domain: e.g., public administration data, statistical data, weather data, elections, … Structure: OD tables contain a large number of numerical columns.

Example (Cont’d). Header: stadium, capacity, city, country. Rows: Emirates Stadium, 60361, London, England; Villa Park, 42785, Birmingham; Ewood Park, 31154, Blackburn; … A realistic Open Data CSV would look more like this, which makes the previously shown task much harder.

Example (Cont’d). Header: TOTAL, DISTRICT_CODE, ISO_2. Rows: Emirates Stadium, 60361, SW1A 0AA, GB; Villa Park, 42785, B23 7QG; Ewood Park, 31154, B26 6QA; … A realistic Open Data CSV would look more like this, which makes the previously shown task much harder.

Why not use numeric values? Identifying the most likely semantic label for a bag of numerical values. Deliberately ignore surroundings. Header: TOTAL, DISTRICT_CODE, ISO_2. Rows: Emirates Stadium, 60361, SW1A 0AA, GB; Villa Park, 42785, B23 7QG; Ewood Park, 31154, B26 6QA; …

Why not use numeric values? Identifying the most likely semantic label for a bag of numerical values. Deliberately ignore surroundings. Rows: Emirates Stadium, 60361, SW1A 0AA, GB; Villa Park, 42785, B23 7QG; Ewood Park, 31154, B26 6QA; …

Why not use numeric values? Identifying the most likely semantic label for a bag of numerical values. Deliberately ignore surroundings. Values: 60361, 42785, 31154, …

Why not use numeric values? Identifying the most likely semantic label for a bag of numerical values. Deliberately ignore surroundings. Labels: capacity, <a stadium>, <country England>. Values: 60361, 42785, 31154, …

Our Approach. (1) Hierarchical clustering over an RDF knowledge base to build a background knowledge graph (BKG): nodes consist of typical numerical values, annotated with context information, i.e., grouped by properties and their shared domain (subject) pairs. (2) k-nearest neighbors search. (3) Aggregation of the results at different levels to find the most likely context: property, type, context.

1. Background Knowledge Graph. Find properties with numerical range. Hierarchical clustering approach. Two hierarchical layers: type hierarchy (using OWL classes); property-object hierarchy (shared property-object pairs).

2. k-Nearest neighbor search. Mapping bags of numerical values to a vector space (feature vector).

2. k-Nearest neighbor search (cont’d). Compute and rank the k nearest neighbours for the input values: 1) input: [ 187, 201, 199, 198, 195, 199, 203, … ]; 2) mapping; 3) compute distance to neighbours; 4) select the k nearest.
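
The four steps on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it skips the explicit feature-vector mapping and compares bags of values directly with the two-sample Kolmogorov-Smirnov statistic (the distance function named as best in the evaluation). The function names and the toy background nodes are hypothetical.

```python
from bisect import bisect_right

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    # The largest gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def k_nearest(bag, nodes, k):
    """Return the k (distance, label) pairs closest to the input bag."""
    ranked = sorted((ks_distance(bag, values), label) for label, values in nodes.items())
    return ranked[:k]

# Hypothetical background nodes (label -> typical values), for illustration only.
nodes = {
    "height (cm)": [170, 182, 195, 201, 188, 176],
    "stadium capacity": [60361, 42785, 31154],
    "year": [1990, 2001, 2010, 2015],
}
query = [187, 201, 199, 198, 195, 199, 203]
neighbours = k_nearest(query, nodes, k=2)
print(neighbours[0][1])  # label of the closest node
```

Comparing whole distributions rather than single values is what lets a bag of numbers be matched without looking at headers or neighbouring columns.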

3. Result Aggregation

Evaluation: Setup. Data: DBpedia 3.9, 50 most frequent numerical properties. Aggregation levels: property, type, p-o level. Aggregation function: majority vote and average distance. Evaluation of different distance functions; best: Kolmogorov-Smirnov (KS) distance.
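
The two aggregation functions named on this slide could look roughly like the sketch below. It is one plausible reading, not the authors' code: "average distance" is interpreted here as choosing the label whose neighbours have the smallest mean distance, and the neighbour records and their field names are hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(neighbours, level):
    """Pick the label that occurs most often among the neighbours at one level."""
    counts = Counter(n[level] for n in neighbours)
    return counts.most_common(1)[0][0]

def lowest_average_distance(neighbours, level):
    """Pick the label whose neighbours have the smallest mean distance."""
    by_label = defaultdict(list)
    for n in neighbours:
        by_label[n[level]].append(n["distance"])
    return min(by_label, key=lambda label: sum(by_label[label]) / len(by_label[label]))

# Hypothetical k-nearest-neighbour result, one record per neighbour.
neighbours = [
    {"property": "height", "type": "BasketballPlayer", "distance": 0.10},
    {"property": "height", "type": "Athlete", "distance": 0.30},
    {"property": "elevation", "type": "Mountain", "distance": 0.15},
]
print(majority_vote(neighbours, "property"))
print(lowest_average_distance(neighbours, "property"))
```

On this toy input the two functions disagree, which illustrates why the choice of aggregation function is worth evaluating separately at each level.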

Evaluation. 33657 test nodes. Majority vote is slightly better. More neighbors are also better. Top-5 already shows clearly better results than top-1. 9% of test nodes are contained 1-1 in the knowledge graph! [Figure: accuracy results]

Experimental OD column labelling. Data from two selected Open Data portals; 1170 CSV tables; manual inspection of the top 100 tables. Lessons learned: missing domain knowledge; timeline data; combine with (existing) complementary approaches.

Conclusions. Semantic labelling of numerical values; hierarchical unsupervised background knowledge (BK); we can assign fine-grained semantic labels if there is enough evidence in BK; complementary to existing approaches. Future work: find and integrate more background knowledge; solve the domain mismatch between Open Data and existing KGs; potentially applicable in other use cases. Sebastian Neumaier, WU Vienna, Institute for Information Business. email: sebastian.neumaier@wu.ac.at url: https://sebneumaier.wordpress.com/ twitter: @sebneum

Backup Slides

BKG: Type hierarchy. Represents the rdfs:subClassOf relation for all available types. Construction: find properties with numerical range; collect entities and all of their p-o pairs; materialize the OWL class hierarchy; form a cluster for each type, containing all entity information.
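
A rough sketch of the construction steps on this slide, assuming that "materializing" the class hierarchy means each entity also counts towards every superclass of its type. The helper names, the toy rdfs:subClassOf map, and the "Example Tower" entity with its value are hypothetical.

```python
from collections import defaultdict

def ancestors(cls, sub_class_of):
    """Return cls together with all of its transitive superclasses."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(sub_class_of.get(current, []))
    return seen

def type_clusters(entities, sub_class_of):
    """Form one cluster of numerical values per type, including inherited members."""
    clusters = defaultdict(list)
    for _name, (cls, value) in entities.items():
        for c in ancestors(cls, sub_class_of):
            clusters[c].append(value)
    return clusters

# Toy rdfs:subClassOf relation: class -> direct superclasses.
sub_class_of = {"Stadium": ["Building"], "Building": ["Place"]}
# Entity -> (type, numerical value of one property).
entities = {
    "Emirates Stadium": ("Stadium", 60361),
    "Villa Park": ("Stadium", 42785),
    "Example Tower": ("Building", 310),  # hypothetical entity and value
}
clusters = type_clusters(entities, sub_class_of)
print(sorted(clusters["Place"]))
```

The more general a type, the larger and more mixed its cluster becomes, which is why the finer-grained levels of the hierarchy are useful for labelling.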

BKG: p-o hierarchy. Hierarchical clustering approach. Start with one node cluster. Build candidates: property-object constraint: all subjects share the same property-object pair; size constraint: candidate nodes are larger than 1% of the parent node and smaller than 99%. Sort candidates by their distance. Select the candidate with the largest distance, then select further non-overlapping candidates.
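
One splitting step of this procedure might be sketched as below. Several points are assumptions: node size is measured as the number of subjects, the distance of a candidate is supplied by the caller (the slide does not say what it is computed against), and the function names, toy data, and distance values are hypothetical.

```python
from collections import defaultdict

def build_candidates(parent, min_frac=0.01, max_frac=0.99):
    """Group subjects by shared (property, object) pair and apply the size constraint.

    parent maps each subject to its set of (property, object) pairs.
    """
    groups = defaultdict(set)
    for subject, pairs in parent.items():
        for pair in pairs:
            groups[pair].add(subject)
    n = len(parent)
    return {
        pair: subjects
        for pair, subjects in groups.items()
        if min_frac * n < len(subjects) < max_frac * n
    }

def select_non_overlapping(candidates, distance):
    """Greedily take candidates by decreasing distance, skipping overlapping ones."""
    chosen, used = [], set()
    for pair in sorted(candidates, key=distance, reverse=True):
        if candidates[pair].isdisjoint(used):
            chosen.append(pair)
            used |= candidates[pair]
    return chosen

parent = {
    "Emirates Stadium": {("type", "Stadium"), ("country", "England"), ("city", "London")},
    "Villa Park": {("type", "Stadium"), ("country", "England"), ("city", "Birmingham")},
    "Ewood Park": {("type", "Stadium"), ("country", "England"), ("city", "Blackburn")},
    "Camp Nou": {("type", "Stadium"), ("country", "Spain"), ("city", "Barcelona")},
}
candidates = build_candidates(parent)

# Hypothetical, made-up distances per candidate.
toy_distance = {
    ("country", "Spain"): 0.9, ("country", "England"): 0.6, ("city", "London"): 0.4,
    ("city", "Barcelona"): 0.3, ("city", "Birmingham"): 0.2, ("city", "Blackburn"): 0.1,
}
children = select_non_overlapping(candidates, toy_distance.get)
print(children)
```

The pair shared by every subject is dropped by the size constraint because it would not split the parent node at all.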

Evaluation: Setup 30 GB RAM 3 different knowledge bases: DBpedia properties:

Evaluation: Test/train data. Train/test split: 80/20; 20% of the subjects for each property are used as test data. Test context graph: built similarly to the background construction, but without constraints; randomly select leaf nodes.
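
The per-property 80/20 split could be done along these lines. This is a sketch only: the function name, the fixed seed, and the placeholder subject names are hypothetical, and the slide does not say how the 20% were sampled.

```python
import random

def split_subjects(subjects_by_property, test_fraction=0.2, seed=0):
    """Hold out a fraction of the subjects of each property as test data."""
    rng = random.Random(seed)
    train, test = {}, {}
    for prop, subjects in subjects_by_property.items():
        shuffled = sorted(subjects)  # sort first so the shuffle is reproducible
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * test_fraction)
        test[prop] = shuffled[:cut]
        train[prop] = shuffled[cut:]
    return train, test

# Placeholder subjects for a single property.
data = {"capacity": [f"stadium_{i}" for i in range(10)]}
train, test = split_subjects(data)
print(len(train["capacity"]), len(test["capacity"]))
```

Splitting per property rather than over the whole knowledge base keeps every property represented in both the training and the test portion.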

Evaluation: Distance Function. 1787 test nodes. Best: Kolmogorov-Smirnov (KS) distance. exact = correct property, type and p-o; prop = correct property; type = correct type; stype = correct super type.

Open Data (OD) tables. Open Data tables typically contain a large portion of numerical columns; missing headers and/or non-textual headers. BabelNet. Solutions that solely focus on textual “cues” are only partially applicable for mapping such data sources.

Lessons Learned from OD tables. Missing domain knowledge: Open Data is potentially very domain specific; mismatch between knowledge bases like DBpedia and Open Data; enrich the knowledge graph with domain knowledge (e.g., extracted from Open Data tables). Timeline data: values for different time points are not in DBpedia; detect time dependency and regroup/transform tables. Include complementary approaches (deliberately excluded in this paper): linguistic clues and string similarity measures (cf. approach by Pham et al.).