Querying RDF Data with Text Annotated Graphs Lushan Han, Tim Finin, Anupam Joshi and Doreen Cheng SSDBM’15  2015-07-01

Slides:



Advertisements
Similar presentations
Lukas Blunschi Claudio Jossen Donald Kossmann Magdalini Mori Kurt Stockinger.
Advertisements

SWIMs: From Structured Summaries to Integrated Knowledge Base
Linked data: P redicting missing properties Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik,
Hermes: News Personalization Using Semantic Web Technologies
Chapter 3 Querying RDF stores with SPARQL. TL;DR We will want to query large RDF datasets, e.g. LOD SPARQL is the SQL of RDF SPARQL is a language to query.
Leveraging Community-built Knowledge For Type Coercion In Question Answering Aditya Kalyanpur, J William Murdock, James Fan and Chris Welty Mehdi AllahyariSpring.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Research topics Semantic Web - Spring 2007 Computer Engineering Department Sharif University of Technology.
Search Engines and Information Retrieval
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Semantic Search Jiawei Rong Authors Semantic Search, in Proc. Of WWW Author R. Guhua (IBM) Rob McCool (Stanford University) Eric Miller.
March 17, 2008SAC WT Hermes: a Semantic Web-Based News Decision Support System* Flavius Frasincar Erasmus University Rotterdam.
Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Chapter 5: Information Retrieval and Web Search
Ontology Learning and Population from Text: Algorithms, Evaluation and Applications Chapters Presented by Sole.
A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data † Kno.e.sis Center Wright State University Dayton OH, USA.
RDF (Resource Description Framework) Why?. XML XML is a metalanguage that allows users to define markup XML separates content and structure from formatting.
LOD 123: Making the semantic web easier to use Tim Finin University of Maryland, Baltimore County Joint work with Lushan Han, Varish Mulwad, Anupam Joshi.
Semantic Web outlook and trends May The Past 24 Odd Years 1984 Lenat’s Cyc vision 1989 TBL’s Web vision 1991 DARPA Knowledge Sharing Effort 1996.
Search Engines and Information Retrieval Chapter 1.
An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.
1 The BT Digital Library A case study in intelligent content management Paul Warren
Attribute Extraction and Scoring: A Probabilistic Approach Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang Microsoft Research Asia Speaker: Bo.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
Survey of Semantic Annotation Platforms
Tables to Linked Data Zareen Syed, Tim Finin, Varish Mulwad and Anupam Joshi University of Maryland, Baltimore County
Author: William Tunstall-Pedoe Presenter: Bahareh Sarrafzadeh CS 886 Spring 2015.
Semantic Search: different meanings. Semantic search: different meanings Definition 1: Semantic search as the problem of searching documents beyond the.
Populating A Knowledge Base From Text Clay Fink, Tim Finin, Christine Piatko and Jim Mayfield.
SemSearch: A Search Engine for the Semantic Web Yuangui Lei, Victoria Uren, Enrico Motta Knowledge Media Institute The Open University EKAW 2006 Presented.
Introduction to the Semantic Web. Questions What is the Semantic Web? Why do we want it? How will we do it? Who will do it? When will it be done?
WORD SENSE DISAMBIGUATION STUDY ON WORD NET ONTOLOGY Akilan Velmurugan Computer Networks – CS 790G.
Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.
Chapter 6: Information Retrieval and Web Search
LOD for the Rest of Us Tim Finin, Anupam Joshi, Varish Mulwad and Lushan Han University of Maryland, Baltimore County 15 March 2012
Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Natural Language Questions for the Web of Data Mohamed Yahya 1, Klaus Berberich 1, Shady Elbassuoni 2 Maya Ramanath 3, Volker Tresp 4, Gerhard Weikum 1.
The Semantic Web: there and back again
Natural Language Questions for the Web of Data 1 Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany 2 Shady Elbassuoni.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min Weakly-Supervised Acquisition of Open- Domain Classes and Class Attributes from Web.
Using linked data to interpret tables Varish Mulwad September 14,
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
Shridhar Bhalerao CMSC 601 Finding Implicit Relations in the Semantic Web.
Named Entity Disambiguation on an Ontology Enriched by Wikipedia Hien Thanh Nguyen 1, Tru Hoang Cao 2 1 Ton Duc Thang University, Vietnam 2 Ho Chi Minh.
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
Linked Data Profiling Andrejs Abele National University of Ireland, Galway Supervisor: Paul Buitelaar.
Chapter 3 Querying RDF stores with SPARQL
CityStateMayorPopulation BaltimoreMDS.C.Rawlings-Blake637,418 SeattleWAM.McGinn617,334 BostonMAT.Menino645,169 RaleighNCC.Meeker405,791 We are laying a.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
And the Watson Plugin for the NeOn Toolkit. IST NeOn-project.org The Semantic Web is growing… #SW Pages.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Linked Data for the Rest of Us Tim Finin, Varish Mulwad and Lushan Han University of Maryland, Baltimore County 12 January 2012
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Semantic Interoperability in GIS N. L. Sarda Suman Somavarapu.
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Of 24 lecture 11: ontology – mediation, merging & aligning.
2016/9/301 Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou Proceeding.
Introduction to the Semantic Web. Questions What is the Semantic Web? Why do we want it? How will we do it? Who will do it? When will it be done?
Web IR: Recent Trends; Future of Web Search
Wikitology Wikipedia as an Ontology
Extracting Semantic Concept Relations
Property consolidation for entity browsing
DBpedia 2014 Liang Zheng 9.22.
CS246: Information Retrieval
Chaitali Gupta, Madhusudhan Govindaraju
Template-based Question Answering over RDF Data
Presentation transcript:

Querying RDF Data with Text Annotated Graphs Lushan Han, Tim Finin, Anupam Joshi and Doreen Cheng SSDBM’15 

Who wrote Tom Sawyer?

Things, not Strings Companies moving from information retrieval to questions answering Need some NLP (e.g. named entity recognition) and good knowledge bases Google is a good example – 2014: Freebase: 1.2B facts about 43M entities – 2015: Google knowledge graph, updated by text IE Others have big KBs: Microsoft (Satori), IBM (Watson), Apple, Wolfram (Alpha), Facebook (Open Graph) DBpedia is an open source KB in RDF that’s based on Wikipedia

Semantic web technologies allow machines to share data and knowledge using common web language and protocols. ~ Linked Open Data Cloud : 31B facts in 295 datasets interlinked by 504M assertions on ckan.netckan.net LOD is the new Cyc: a common source of background knowledge Uses Semantic Web Technology to publish shared data & knowledge Data is inter- linked to support inte- gration and fusion of knowledge

Linked Open Data Cloud B facts in 1000 datasets using 3000 vocabu- laries with 100K classes and 60K properties

DBpedia as Wikipedia LOD DBpedia is an important LOD example – Structured data from Infoboxes in Wikipedia + raw Wikipedia data – RDF in custom ontologies & others, e.g., Yago terms Major integration point for LOD cloud Datasets for 22 languages English DBPedia has: – ~400M triples – 65M facts (e.g., Alan_Turing born ) about 3.8M entities (person, place, organization…)

LOD is Hard for People to Query Querying LOD requires a lot of a user – Understand RDF model – Master SPARQL, a formal query language – Explore large number of ontology terms, hundred classes and thousand properties – Deal with term heterogeneity (Place vs. PopulatedPlace) – Know relevant URIs (~3M entities !) Querying a large LOD is overwhelming Natural language query systems still a research goal

Goal Develop a system allowing a user with a basic understanding of RDF to query DBpedia and ultimately distributed LOD collections – To explore what data is in the system – To get answers to questions – To create SPARQL queries for reuse or adaptation Desiderata – Easy to learn and to use – Good accuracy, e.g., precision and recall – Fast

Related Work Natural Language Interfaces (NLI) systems – ORAKEL user interactions, small, closed-domain ontologies – FREyA user interactions – PANTO automatic, small, closed-domain ontologies – PowerAqua automatic, open domain, no SPARQL All have difficulties in understanding complex questions and rely on lexical matching to find candidate terms Keyword interface (and Watson?) – No support for complex queries

Key Idea Reduce problem complexity by having user: – Enter a simple graph, and – Annotate the nodes and arcs with words and phrases

Schema-Agnostic Query Interface Nodes denote entities and links binary relations between them Entities described by two freely chosen phrases: its name or value and its concept in the query context A ? marks output entities, a *marks ones to ignore Compromise between NLI and SPARQL – Users provide compositional structure of question – Freedom to use any phrases in annotating structure Where was the author of the Adventures of Tom Sawyer born?

Instance data vs. Schema data We don’t exploit schema axioms (Actor ⊆ Person) Two key datasets – Relation dataset: all relations between instances – Type dataset: all type definitions for instances Integrate RDF literal data types into five that are familiar to users – ˆNumber, ˆDate, ˆYear, ˆText and ˆLiteral – ˆLiteral is the super type of the other four

Automatically enrich set of types Automatically deduce types from relations – Infer attribute types from data type properties, population, “ ” => ˆPopulation – Infer classes from object properties, director, => ˜Director 15

Counting Co-occurrence

Concept Association Knowledge Measured by directed PMI value

Translation – Step One finding semantically similar ontology terms For each concept or relation in semantic graph, generate k most semantically similar candidate ontology classes or properties

Semantic similarity of words Words occurring in the same context are similar doctor/physician > doctor/nurse > doctor/lawyer > doctor/sandwich 3B word corpus Stanford WebBase crawl, cleaned, POS tagged, lemmatized Vocabulary: 29K terms – Content words (noun, verb, adjective, adverb) occurring frequently + some phrases ± 4 widow is program_VB in both java_NN and python_NN and run_VB

SVD to Reduce dimensionality LSA Similarity – 29k x 29k POS tagged term co-occurrence matrix – Replace frequency counts by logs – Perform SVD, retaining retain 300 largest singular values – Semantic similarity = vector cosine similarity Add knowledge from WordNet – Address polysemy issue of LSA similarity – Boost LSA score using synset, hypernym, derivation and other relations

TOEFL Synonym Evaluation Our LSA model is better on some tasks than Google’s word2vec TOEFL synonym task: pick best synonym from a list of four for 80 words ± 1 model± 4 modelword2vec Correctly answered OOV wordshalfheartedly tranquility Accuracy92.4%96.2%84.8% provision stipulation interrelation jurisdiction interpretation

From word to text similarity Basic align and penalize approach ① align words to maximize word similarity ② compute average word similarity for pairs ③ penalize unaligned terms, known antonyms Preprocessing: POS tag, lemmatization, REs to identify number and dates, stopword removal Word similarity wrapper for numbers, time expressions, pronouns and OOV words Penatly component for antonyms, etc.

A&P align and penalize example align 1 => 2align 2 => 1 Allow multiple words to align with one

Interpretation goodness metric is degree to which its ontology terms associate like corresponding user terms connect in SAQ Translation – Step Two disambiguation algorithm Joint disambiguation Resolve direction Compute reasonableness for a single link

Translation of semantic graph query to SPARQL is straightforward given mappings SPARQL Generation Concepts Place => Place Author => Writer Book => Book Relations born in => birthPlace wrote => author PREFIX dbo: SELECT DISTINCT ?x, ?y WHERE { ?0 a dbo:Book; rdfs:label ?label0. ?label0 bif:contains '"Tom Sawyer"'. ?x a dbo:Writer. ?y a dbo:Place. {?0 dbo:author ?x}. {?x dbo:birthPlace ?y}.} PREFIX dbo: SELECT DISTINCT ?x, ?y WHERE { ?0 a dbo:Book; rdfs:label ?label0. ?label0 bif:contains '"Tom Sawyer"'. ?x a dbo:Writer. ?y a dbo:Place. {?0 dbo:author ?x}. {?x dbo:birthPlace ?y}.}

Evaluation 30 test questions from 2011 Workshop on Question Answering over Linked Data answerable using DBpedia Three human subjects unfamiliar with DBpedia translated the test questions into SAQ queries Compare system to FREyA and PowerAqua systems that both require human-crafted domain knowledge. FREyA also requires user to resolve ambiguity.

Other Evaluations DBLP Database Evaluation – Generate SQL to answer questions on DBLP data, – Which UCSD authors from have papers in CIKM? 2013 and 2014 SemEval tasks – Algorithms to measure semantic similarity of short text sequences => top scorer overall 2015 SAQ Challenge Evaluation – Top system in Schema-Agnostic Query Evaluation Challenge at 2015 Extended Semantic Web Conf.

Conclusion and Future Work Baseline system works well for DBpedia and DBLP KBs Ongoing and future work – Better Web interface – Allow user feedback and advice – Add entity matching (which George Bush?) – Extend and scale to a distributed LOD collection For more information, see