Harnessing manpower for creating semantics (doctoral dissertation) Jakub Šimko Institute of Informatics and Software Engineering,


Harnessing manpower for creating semantics (doctoral dissertation) Jakub Šimko, Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava. Supervised by: prof. Mária Bieliková. July 4th, 2013

Thesis overview

Thesis Goals 1. Create new, GWAP-based approaches to semantics creation, particularly for specific domains 2. Bring in generally applicable improvements to GWAP design, focusing on selected problems

Semantics acquisition Semantics needed everywhere Resource metadata acquisition ◦ Resource types: texts, multimedia, websites Domain modelling ◦ Concept identification, Relationships identification, labelling, Interconnecting of datasets

Semantics acquisition approaches (compared by output quality and output quantity) ◦ Automated: quick, inexpensive (once created), scalable [3,4] ◦ Crowdsourcing: human based, scalable, no specific problems, but we still need to pay [5,6] ◦ Expert: expensive, essential for certain tasks [1,2]

Games with a purpose Cheap (once they are created) Difficult to create Often used for semantics acquisition tasks [6,7]

ESP Game: image metadata acquisition. What is in the image? Player 1: water, sky, bridge, Mostar. Player 2: night, river, bridge, Bosnia. The players must blindly match. Banned words: blue, towers [7]
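The blind-matching rule of the ESP Game can be illustrated with a minimal sketch. It assumes that the first four words on the slide belong to Player 1 and the last four to Player 2, and that a label is accepted as soon as both players have typed it and it is not banned. The function name `first_match` is hypothetical, and the real game runs in real time rather than on finished lists.

```python
def first_match(guesses_a, guesses_b, banned):
    """Return the first label typed by player A that player B also typed,
    ignoring banned words; return None if the players never agree."""
    banned = {word.lower() for word in banned}
    # Labels from player B that are allowed to count as a match.
    allowed_b = {guess.lower() for guess in guesses_b} - banned
    for guess in guesses_a:
        label = guess.lower()
        if label in allowed_b:
            return label
    return None


# Assumed split of the slide's example guesses between the two players.
player_1 = ["water", "sky", "bridge", "Mostar"]
player_2 = ["night", "river", "bridge", "Bosnia"]

label = first_match(player_1, player_2, banned=["blue", "towers"])
print(label)
```

In this sketch the agreed label becomes image metadata, while banned words force players towards less obvious labels.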

Our taxonomy of GWAPs

Our GWAP design dimensions

Existing GWAPs in our design space

Little Search Game (negative search game). Search query: „Star -movie -war -death“. Decrease in the number of results = points. Logs are processed into a term network
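The scoring idea can be sketched as follows, under loud assumptions: a toy in-memory corpus stands in for the web search engine, points equal the raw drop in result count, and each logged negative term simply adds weight to an edge of the term network. All function names here are hypothetical, not the game's actual implementation.

```python
def result_count(documents, positive, negatives):
    """Count documents containing `positive` and none of `negatives`."""
    excluded = set(negatives)
    count = 0
    for doc in documents:
        words = set(doc.lower().split())
        if positive in words and not (words & excluded):
            count += 1
    return count


def score_query(documents, positive, negatives):
    """Points = number of results removed by the negative terms."""
    baseline = result_count(documents, positive, [])
    return baseline - result_count(documents, positive, negatives)


def term_network(query_log):
    """Aggregate logged (task term, negative terms) pairs into weighted edges."""
    edges = {}
    for positive, negatives in query_log:
        for negative in negatives:
            key = (positive, negative)
            edges[key] = edges.get(key, 0) + 1
    return edges


# Toy corpus, purely illustrative.
docs = [
    "star wars movie",
    "star trek movie",
    "death star plans",
    "star chart of the night sky",
]
points = score_query(docs, "star", ["movie", "war", "death"])
network = term_network([("star", ["movie", "war"]), ("star", ["movie"])])
print(points, network)
```

Terms that remove many results are the ones that co-occur strongly with the task term, which is why the logs can plausibly be read as relationship suggestions.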

LSG evaluation: term network soundness. Recorded data: 300 players, queries, 3200 suggested relationships; 400 nodes, 560 edges. Method: a posteriori evaluation by a group of judges; H: term-term relationship is sound. Results: 91% correctness

Hidden term relationships

Hidden term relationships – reality

LSG evaluation: hidden relationships. Data: 400 nodes, 560 edges; most used word lists: 800, 5000, Web search index (Bing). Method: co-occurrence of terms in LSG relationships; co-occurrence of random term pairs; noise level identification. Results: medium-sized corpus ◦ Noise level: 0.35 ◦ Hidden relationship ratio: 40%

PexAce: image annotation game

PexAce: image annotation Annotations Currently disclosed pair

PexAce: image annotation Annotation “tooltip”

General domain: Deployment. Corel 5K dataset: photos + tags + our tags. 107 players, 814 games, images, annotations, tags. Gold standard comparison ◦ Precision 73% and Recall 26% A posteriori evaluation ◦ 3 independent judges ◦ 94% of tags were correct
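The gold standard comparison rests on two ratios, sketched below on invented tags (not Corel 5K data). Precision is the share of game tags found in the reference tags, recall is the share of reference tags recovered by the game. The function name `precision_recall` is hypothetical.

```python
def precision_recall(game_tags, gold_tags):
    """Compare tags produced by the game with reference (gold) tags."""
    game, gold = set(game_tags), set(gold_tags)
    hits = len(game & gold)
    precision = hits / len(game) if game else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Invented example for one image.
game = ["sky", "beach", "boat", "sunset"]
gold = ["sky", "boat", "sea", "sand", "people", "clouds"]

precision, recall = precision_recall(game, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A tag such as "sunset" counts against precision here even if it is true of the image, which is one possible reason why the judges' a posteriori correctness can exceed the precision measured against the reference tags.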

Personal images What if we change the image corpus to personal albums? ◦ Players like that more ◦ They provide specific annotations (metadata) Potential problem? Validation ◦ We can hardly apply cross-player validation of tags

„Benevolent“ artifact validation model Original mutual player supervision Less strict heuristics Annotations decomposed to votes: P - players, T- terms, I - Images
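Decomposing annotations into (player, term, image) votes could look like the sketch below. The acceptance rule, a term is kept for an image once at least `min_players` distinct players used it, is an assumption made for illustration; the slide does not state the actual, less strict heuristics. The name `validated_tags` is hypothetical.

```python
from collections import defaultdict


def validated_tags(votes, min_players=2):
    """Return (image, term) pairs supported by enough distinct players.

    `votes` is an iterable of (player, term, image) triples, one triple
    per term extracted from a player's annotation.
    """
    supporters = defaultdict(set)
    for player, term, image in votes:
        supporters[(image, term.lower())].add(player)
    return {
        key for key, players in supporters.items()
        if len(players) >= min_players
    }


# Invented votes: P - players, T - terms, I - images.
votes = [
    ("p1", "Beach", "img1"),
    ("p2", "beach", "img1"),
    ("p1", "dog", "img1"),
    ("p1", "dog", "img1"),  # a repeated vote by the same player adds no support
]
accepted = validated_tags(votes)
print(accepted)
```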

Personal images: Experiments. Two social groups, in each: ◦ 2 players, 1 judge ◦ A set of 48 images in albums: Portraits, Groups, Situational and Non-person (other) ◦ One group was aware of the purpose, the other was not. Each player played 3 games. Each image was featured twice for a single player. Measured properties of tags ◦ Correctness ◦ Specificity ◦ Understandability ◦ Type of tag (person, event, place, other)

Personal images: Experiments. [Results table; cell values not preserved. Columns: Aware (253 tags) and Unaware (108 tags), each with Corr., Spec., Und. Rows: Portraits, Groups, Situations, Other, Average.] Tag types: Persons (56%), Events (21%), Places (14%), Other (11%)

Artifact validation and cold start problem „How can a result of a human intelligence task be automatically evaluated?“ GWAPs use: ◦ Mutual player supervision ◦ Approximate or exact automated evaluation (case dependent) Threat to multiplayer validation schemes: “The requirement is to have multiple players online at the same time, sometimes with a requirement that they cannot communicate.” Keep the games single-player

Helper artifacts: a new artifact validation principle Helper artifacts: ◦ Decouple scoring from task solving, instead motivate players to solve tasks to help themselves in the progress of the game ◦ E.g. in PexAce, a player may win the game well enough even without the annotations

GWAP player competences 1. Quantify player skills: player model (e.g. player’s expertise for each sub-domain) 2. Apply the model in a) Solution filtering (e.g. vote weighting) b) Task assignment (e.g. match task subdomain to expertise areas) 3. Speed up the process and/or retrieve higher quality results
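Vote weighting, named above as an example of solution filtering, might be sketched like this. The per-player expertise scores, the default weight for unknown players and the function name `weighted_winner` are all assumptions for illustration.

```python
from collections import defaultdict


def weighted_winner(votes, expertise, default=0.5):
    """Pick the answer with the largest expertise-weighted support.

    `votes` is a non-empty list of (player, answer) pairs; `expertise`
    maps a player to a weight for the task's sub-domain (hypothetical).
    """
    totals = defaultdict(float)
    for player, answer in votes:
        totals[answer] += expertise.get(player, default)
    return max(totals, key=totals.get)


votes = [("ann", "jazz"), ("bob", "rock"), ("cyd", "rock")]
expertise = {"ann": 0.9, "bob": 0.3, "cyd": 0.4}

winner = weighted_winner(votes, expertise)
print(winner)
```

With equal weights the majority answer would win; with the assumed expertise scores a single competent player outweighs two weaker ones.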

PexAce dataset: Usefulness (delivery of correct artifacts) Consensus ratio (agreement with other players) Correlation: 0.496
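The slide does not say which correlation coefficient was used. Assuming it is Pearson's, the computation over per-player usefulness and consensus ratio could be sketched as below; the sample values are invented and are not the PexAce data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented per-player values: usefulness and consensus ratio.
usefulness = [0.9, 0.7, 0.8, 0.4, 0.6]
consensus = [0.8, 0.5, 0.9, 0.5, 0.4]

r = pearson(usefulness, consensus)
print(round(r, 3))
```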

CityLights: music tag validation. Validation question: “Which of these tag groups characterizes the music track you hear?” 1. Rockabilly, USA, 60ties 2. Seasonal, rich oldies, xmas 3. February 08 love, oldies, 60 musik. Tag support value: increases (+) when the player selects the group; decreases (-) when the player doesn’t select the group or rules out the tag. Wrong and correct tags bubble out. Positive and negative thresholds
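The support-value mechanism can be sketched as a running score per tag. The step sizes, the threshold values and the function names below are assumptions; the slide only states the direction of each change and that positive and negative thresholds exist.

```python
def apply_answer(support, groups, selected, step=1.0):
    """Raise support for tags in the selected group, lower it for the rest."""
    for index, group in enumerate(groups):
        delta = step if index == selected else -step
        for tag in group:
            support[tag] = support.get(tag, 0.0) + delta


def rule_out(support, tag, penalty=2.0):
    """Explicit negative feedback: the player rules a single tag out."""
    support[tag] = support.get(tag, 0.0) - penalty


def classify(support, positive=3.0, negative=-3.0):
    """Tags crossing a threshold 'bubble out' as correct or wrong."""
    verdicts = {}
    for tag, value in support.items():
        if value >= positive:
            verdicts[tag] = "correct"
        elif value <= negative:
            verdicts[tag] = "wrong"
        else:
            verdicts[tag] = "undecided"
    return verdicts


groups = [["rockabilly", "usa"], ["seasonal", "xmas"], ["love", "60 musik"]]
support = {}
for _ in range(3):  # three players pick the first group
    apply_answer(support, groups, selected=0)
rule_out(support, "usa")

verdicts = classify(support)
print(verdicts)
```

Here "usa" stays undecided because one explicit rule-out offsets most of its implicit support, which shows how the two feedback types interact under these assumed weights.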

CityLights: experiments LastFM datasets 875 games, 4933 questions, 1492 tags Feedback actions per tag: ◦ implicit ◦ 5.29 explicit Optimized parameter configuration ◦ 68% correctness, 51% confidence ◦ no false negatives

Competence through confidence. Betting mechanism within a GWAP. Through the bet height, players express their confidence. CityLights case: bet height aligns with impact on support value. Good for new players, for whom no confidence model is known yet
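One way to align bet height with impact on the support value is a linear scaling, sketched below. The linear form, the maximum bet and the function name `bet_weighted_delta` are assumptions; the slide only says that the two are aligned.

```python
def bet_weighted_delta(selected, bet, max_bet=10, base_step=1.0):
    """Scale a support-value change by the player's bet (assumed linear).

    `selected` tells whether the player picked the tag's group; a larger
    bet means more confidence and therefore a larger change.
    """
    if not 0 < bet <= max_bet:
        raise ValueError("bet must be in the range (0, max_bet]")
    magnitude = base_step * bet / max_bet
    return magnitude if selected else -magnitude


confident = bet_weighted_delta(True, bet=10)
hesitant = bet_weighted_delta(False, bet=5)
print(confident, hesitant)
```

Because the weight comes from the bet itself, this works for a new player without any history, which matches the slide's argument.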


References
1. J. A. Gulla and V. Sugumaran. An interactive ontology learning workbench for non-experts. In Proceedings of the 2nd international workshop on Ontologies and information systems for the semantic web, ONISW ’08, pages 9–16, New York, NY, USA, ACM.
2. K. Maleewong, C. Anutariya, and V. Wuwongse. A semantic argumentation approach to collaborative ontology engineering. In Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services, iiWAS ’09, pages 56–63, New York, NY, USA, ACM.
3. L. Mcdowell and M. Cafarella. Ontology-driven, unsupervised instance population. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):218–236, Sept.
4. M. Jačala and J. Tvarožek. Named entity disambiguation based on explicit semantics. In Proc. of the 38th int. conf. on Current Trends in Theory and Practice of Computer Science, SOFSEM’12, pages 456–466, Berlin, Heidelberg, Springer-Verlag.
5. M. Sabou, K. Bontcheva, and A. Scharl. Crowdsourcing research opportunities: lessons from natural language processing. In Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies, i-KNOW ’12, pages 17:1–17:8, New York, NY, USA, ACM.
6. A. J. Quinn and B. B. Bederson. Human computation: a survey and taxonomy of a growing field. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, pages 1403–1412, New York, NY, USA, ACM.
7. L. von Ahn and L. Dabbish. Designing games with a purpose. Commun. ACM, 51(8):58–67, 2008.

Selected publications
◦ Šimko, Jakub - Tvarožek, Michal - Bieliková, Mária: Semantics Discovery via Human Computation Games. In: International Journal on Semantic Web and Information Systems. Vol. 7, No. 3 (2011).
◦ Šimko, J., Tvarožek, M., Bieliková, M.: Human Computation: Single-player Annotation Game for Image Metadata. International Journal on Human-Computer Studies. [accepted].
◦ Dulačka, Peter - Šimko, Jakub - Bieliková, Mária: Validation of Music Metadata via Game with a Purpose. In: I-Semantics 2012: Proceedings of the 8th International Conference on Semantic Systems, 5th - 7th of September 2012, Graz, Austria. New York: ACM.
◦ Šimko, Jakub - Bieliková, Mária: Games with a Purpose: User Generated Valid Metadata for Personal Archives. In: SMAP 2011: Proceedings of Sixth International Workshop on Semantic Media Adaptation and Personalization, 1-2 December 2011, Vigo, Pontevedra, Spain. Los Alamitos: IEEE Computer Society.
◦ Šimko, Jakub - Tvarožek, Michal - Bieliková, Mária: Little Search Game: Term Network Acquisition via a Human Computation Game. In: HT 2011: Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, June 6-9, 2011, Eindhoven, The Netherlands. New York: ACM.
◦ Šimko, Jakub - Bieliková, Mária: Personal Image Tagging: a Game-based Approach. In: I-Semantics 2012: Proceedings of the 8th International Conference on Semantic Systems, 5th - 7th of September 2012, Graz, Austria. New York: ACM.

LSG evaluation: relationship types. Data: 400 nodes, 560 edges; ConceptNet lightweight dataset. Method: identify relationship types ◦ A posteriori (2 judges) ◦ Reference dataset. Results: not all LSG relationships were present in ConceptNet. Dominant relationship types: unlabelled, hasProperty, hasA, atLocation

TermBlaster: towards specific domain. Specific domain. No text typing. Experiments: 38 players, 732 rounds, 6 task terms, 15 relationships each. 71% correct, 21% „hidden relationships“