Linked Data Profiling Andrejs Abele UNLP PhD Day Supervisor: Paul Buitelaar.

Presentation transcript:

Linked Data Profiling Andrejs Abele UNLP PhD Day Supervisor: Paul Buitelaar

Overview
- Motivation
- My approach
- Experiments
- Future work

Motivation
- Linked Data is hard for humans to understand
- Only a small number of datasets provide a human-readable overview or comprehensive metadata
- When adding a new dataset to the LOD cloud, connections have to be identified to as many other relevant LOD datasets as possible
- The LOD Cloud Diagram relies on human classification

Domain identification method using DBpedia
Topic Extraction -> Domain Identification -> Domain

Experiments
1. Extract classes and properties and run SVM
2. Identify domain by using the DBpedia category structure
3. Run SVM on extracted DBpedia concepts (terms that were linked to DBpedia)
4. Run SVM on extracted DBpedia categories
5. Run SVM on a combination of classes and properties + DBpedia concepts
6. Run SVM on a combination of classes and properties + DBpedia categories
7. Manually reclassify datasets based only on literals
8. Manually calculate the best maximal accuracy

Datasets
LOD cloud datasets (annotated in the LOD Cloud Diagram): 342 datasets, 9 domains
- Media (8)
- Linguistics (13)
- Publication (88)
- Social_Networking (41)
- Geography (19)
- Government (65)
- Cross_domain (23)
- User_generated (53)
- Life_science (32)
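
The domain sizes above are unevenly distributed, which matters when reading the accuracy figures on later slides. The following is a minimal sketch that sums the counts from the slide and computes the accuracy of a trivial classifier that always predicts the largest domain. This majority baseline is not part of the presentation; it is added here only as a point of comparison.

```python
# Dataset counts per domain, copied from the slide above.
domain_counts = {
    "Media": 8,
    "Linguistics": 13,
    "Publication": 88,
    "Social_Networking": 41,
    "Geography": 19,
    "Government": 65,
    "Cross_domain": 23,
    "User_generated": 53,
    "Life_science": 32,
}

total = sum(domain_counts.values())
largest_domain = max(domain_counts, key=domain_counts.get)

# Accuracy of a hypothetical classifier that always answers `largest_domain`.
majority_baseline = domain_counts[largest_domain] / total

print(total, largest_domain, f"{majority_baseline:.2%}")
```

The counts add up to the 342 datasets stated on the slide, so the list appears complete.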

Experiment 1
1. Extract URIs of properties and classes from datasets
2. Use classes and properties as features
3. Classify using a Support Vector Machine classifier (C-SVC)
4. Use Precision and Recall as metrics
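
Steps 1 and 2 can be sketched as follows. The slide does not say how the URIs are encoded, so this sketch assumes a binary presence/absence vector per dataset. The function names and the two toy datasets are hypothetical.

```python
def build_vocabulary(datasets):
    """Map every distinct class/property URI to a column index."""
    all_uris = sorted({uri for uris in datasets.values() for uri in uris})
    return {uri: index for index, uri in enumerate(all_uris)}


def to_binary_vector(uris, vocabulary):
    """Return a 0/1 vector marking which vocabulary URIs the dataset uses."""
    vector = [0] * len(vocabulary)
    for uri in uris:
        if uri in vocabulary:
            vector[vocabulary[uri]] = 1
    return vector


# Hypothetical toy input: dataset name -> set of class and property URIs.
toy_datasets = {
    "dataset_a": {
        "http://xmlns.com/foaf/0.1/Person",
        "http://xmlns.com/foaf/0.1/knows",
    },
    "dataset_b": {
        "http://purl.org/dc/terms/title",
        "http://xmlns.com/foaf/0.1/Person",
    },
}

vocabulary = build_vocabulary(toy_datasets)
vectors = {
    name: to_binary_vector(uris, vocabulary)
    for name, uris in toy_datasets.items()
}
print(vectors)
```

Vectors of this shape, paired with the domain labels, are the kind of input a C-SVC implementation such as LIBSVM or scikit-learn's `sklearn.svm.SVC` expects. Which implementation and kernel were used is not stated in the slides.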

Precision and Recall for different domains using SVM
Correctly classified: 249
Processed: 342
Accuracy: %
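
For reference, per-domain precision and recall can be computed from gold and predicted labels as in the sketch below. The label lists are invented toy data, not results from the presentation, and the function name is hypothetical.

```python
from collections import Counter


def precision_recall(gold, predicted):
    """Return {label: (precision, recall)} for single-label predictions."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        if gold_label == predicted_label:
            true_pos[gold_label] += 1
        else:
            false_pos[predicted_label] += 1
            false_neg[gold_label] += 1

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = true_pos[label] + false_pos[label]
        gold_count = true_pos[label] + false_neg[label]
        # Report 0.0 when a label was never predicted or never occurs in gold.
        precision = true_pos[label] / predicted_count if predicted_count else 0.0
        recall = true_pos[label] / gold_count if gold_count else 0.0
        scores[label] = (precision, recall)
    return scores


# Invented toy labels, for illustration only.
gold = ["Government", "Government", "Media", "Geography"]
predicted = ["Government", "Media", "Media", "Government"]

scores = precision_recall(gold, predicted)
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(scores, accuracy)
```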

Experiment 2
1. Extract literals
2. Calculate TF
3. Select literals containing the top 100 terms (based on TF)
4. Extract topics
5. Get categories
6. Calculate the distance between the top 50 categories and the 7 predefined domains (max distance 5)
7. Select the domain with the shortest distance
8. Use Precision and Recall as metrics
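
Steps 6 and 7 can be read as a bounded shortest-path search over the category hierarchy. The sketch below makes several assumptions the slide does not confirm: distance is counted in hops along category-to-broader-category links, the search is breadth-first, and the dataset's distance to a domain is the minimum over its categories. The toy graph and all function names are hypothetical.

```python
from collections import deque


def category_distance(graph, start, target, max_distance=5):
    """Hops from start to target via BFS, or None if beyond max_distance."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_distance:
            continue  # do not expand past the distance limit
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


def closest_domain(graph, categories, domain_categories, max_distance=5):
    """Pick the domain whose category is nearest to any dataset category."""
    best = None  # (distance, domain)
    for domain, domain_category in domain_categories.items():
        for category in categories:
            d = category_distance(graph, category, domain_category, max_distance)
            if d is not None and (best is None or d < best[0]):
                best = (d, domain)
    return best[1] if best else None


# Hypothetical toy hierarchy: category -> broader categories.
toy_graph = {
    "Category:Rivers": ["Category:Bodies_of_water"],
    "Category:Bodies_of_water": ["Category:Geography"],
    "Category:Ministries": ["Category:Government"],
}
toy_domains = {
    "Geography": "Category:Geography",
    "Government": "Category:Government",
}

print(closest_domain(toy_graph, ["Category:Rivers"], toy_domains))
```

A dataset whose categories all lie farther than the limit from every domain gets no answer (`None`) in this sketch.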

Precision and Recall for different domains using DBpedia classification
Correctly classified: 71
Processed: 342
Contained dataset information: 306
Contains first 7 domains: 235
Accuracy: 23.20%
Accuracy if only the 7 domains are counted: 30.21%
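
The two accuracy figures on this slide are consistent with dividing the number of correctly classified datasets by the 306 datasets that contained information, and by the 235 datasets in the first 7 domains, respectively. The short check below reproduces them. The helper name is hypothetical.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


correct = 71
with_information = 306      # datasets that contained dataset information
first_seven_domains = 235   # datasets belonging to the 7 mapped domains

overall = accuracy_percent(correct, with_information)
restricted = accuracy_percent(correct, first_seven_domains)
print(overall, restricted)
```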

Domain mapping
- Publications: http://dbpedia.org/page/Category:Publications
- Life Sciences: http://dbpedia.org/page/Category:Biology
- Cross-Domain: (none)
- Social Networking: http://dbpedia.org/page/Category:Social_networks
- Geography: http://dbpedia.org/page/Category:Geography
- Government: http://dbpedia.org/page/Category:Government
- Media: http://dbpedia.org/page/Category:Digital_technology
- User-Generated Content: (none)
- Linguistics: http://dbpedia.org/page/Category:Linguistics

Experiment 3
1. Extract literals
2. Calculate TF
3. Select literals containing the top 100 terms (based on TF)
4. Extract topics
5. Use topics (DBpedia concepts) as features
6. Classify using a Support Vector Machine classifier (C-SVC)
7. Use Precision and Recall as metrics
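
Steps 2 and 3 can be sketched as below. The slides do not describe tokenization or whether stop words are removed, so this sketch assumes lowercase alphabetic tokens and raw term counts as TF. The function names and the sample literals are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately simple assumption."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(literals, n=100):
    """The n most frequent terms across all literals (raw counts as TF)."""
    counts = Counter(token for literal in literals for token in tokenize(literal))
    return [term for term, _ in counts.most_common(n)]


def select_literals(literals, n=100):
    """Keep only literals that contain at least one of the top-n terms."""
    terms = set(top_terms(literals, n))
    return [literal for literal in literals if terms & set(tokenize(literal))]


# Invented sample literals, for illustration only.
sample_literals = ["River Corrib", "River Shannon", "Galway city", "Annual report"]

selected = select_literals(sample_literals, n=1)
print(selected)
```

With `n=1` only literals containing the single most frequent term survive; the presentation uses the top 100 terms. The selected literals are then passed to topic extraction (step 4), which is not covered by this sketch.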

Precision and Recall for different domains using SVM and DBpedia topics
Correctly classified: 154
Processed: 342
Accuracy: %
34 datasets contain no identified topics:
- Cross_domain = 1
- Geography = 2
- Government = 6
- Life_sciences = 2
- Linguistics = 5
- Publications = 8
- Social_networking = 6
- User_generated = 4

Experiment 4
1. Extract literals
2. Calculate TF
3. Select literals containing the top 100 terms (based on TF)
4. Extract topics
5. Get categories
6. Use categories (DBpedia categories) as features
7. Classify using a Support Vector Machine classifier (C-SVC)
8. Use Precision and Recall as metrics

Precision and Recall for different domains using SVM and DBpedia categories
Correctly classified: 138
Processed: 342
Accuracy: %
34 datasets contain no identified topics:
- Cross_domain = 1
- Geography = 2
- Government = 6
- Life_sciences = 2
- Linguistics = 5
- Publications = 8
- Social_networking = 6
- User_generated = 4

Experiment 5
1. Extract URIs of properties and classes from datasets
2. Use classes and properties as features
3. Extract literals
4. Calculate TF
5. Select literals containing the top 100 terms (based on TF)
6. Extract topics
7. Use topics (DBpedia concepts), classes, and properties as features
8. Classify using a Support Vector Machine classifier (C-SVC)
9. Use Precision and Recall as metrics
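
One way to combine the two feature sources in step 7 is to merge them into a single feature set per dataset while keeping their origin distinguishable. The prefixing scheme below is an assumption made for illustration, as are the function name and the toy URIs; the slides do not say how the features were merged.

```python
def combine_features(schema_uris, concept_uris):
    """Union of schema and concept features, tagged by their source."""
    features = {f"schema:{uri}" for uri in schema_uris}
    features |= {f"concept:{uri}" for uri in concept_uris}
    return features


# Hypothetical toy input for a single dataset.
schema_uris = {
    "http://xmlns.com/foaf/0.1/Person",
    "http://xmlns.com/foaf/0.1/knows",
}
concept_uris = {"http://dbpedia.org/resource/Social_network"}

combined = sorted(combine_features(schema_uris, concept_uris))
print(combined)
```

Tagging by source means a URI that happened to occur both as a schema element and as an extracted concept would yield two separate features rather than one.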

Precision and Recall for different domains using SVM and DBpedia concepts
Correctly classified: 215
Processed: 342
Accuracy: %

Experiment 6
1. Extract URIs of properties and classes from datasets
2. Use classes and properties as features
3. Extract literals
4. Calculate TF
5. Select literals containing the top 100 terms (based on TF)
6. Extract topics
7. Get categories
8. Use categories, classes, and properties as features
9. Classify using a Support Vector Machine classifier (C-SVC)
10. Use Precision and Recall as metrics

Precision and Recall for different domains using SVM and DBpedia categories
Correctly classified: 138
Processed: 342
Accuracy: %

Experiment 7
1. Extract literals
2. Classify datasets based only on literals, without any other information

Data analysis
- There are datasets where the algorithm agreed with the original classification, but I, as a human annotator, had too little information to decide. Example literal: VULCAN VENTURES INC SEC ( via Linked Edgar ( No guarantee of correctness! USE AT YOUR OWN RISK!
- This dataset was identified as Government, but for me there was too little information.
- My annotation did not significantly improve the results, because for many datasets I could not assign any of the predefined domains during annotation.

Experiment 8
1. Extract literals
2. Classify datasets based only on literals, without any other information
3. Manually analyse the list of domains provided by my approach
4. Analyse whether the domain list provides any insight connected to the dataset
Results
Number of datasets: 336
Possibly correct answers: 261
Percentage: 77.68%

Future work
1. Identify better categories: Ontologies, Technology, People, …
2. Create a better domain mapping for my approach
3. Create a hybrid approach (identify type of dataset, identify domain of content)

Thank you!