Creating and Exploiting a Web of (Semantic) Data, Tim Finin Zareen Syed and Anupam Joshi University of Maryland, Baltimore County James Mayfield, Paul.

Presentation transcript:

Creating and Exploiting a Web of (Semantic) Data
Tim Finin, Zareen Syed and Anupam Joshi, University of Maryland, Baltimore County
James Mayfield, Paul McNamee and Christine Piatko, JHU Human Language Technology Center of Excellence

Overview
  – Introduction
  – Recent Semantic Web trends
  – Leveraging Linked Data on the Web
  – Applications to human language understanding
  – Conclusion

The Age of Big Data
Massive amounts of data are available today.
Advances in human language processing have been driven by the availability of unstructured data, text and speech.
Increasingly, large amounts of structured and semi-structured data are also online.
Much of this is available in the Semantic Web language RDF, fostering integration.
We can exploit this data to enhance language understanding systems.

Twenty years ago…
Tim Berners-Lee’s 1989 WWW proposal described a web of relationships among named objects, unifying many information management tasks.
Capsule history:
  – Guha’s MCF (~94)
  – XML+MCF=>RDF (~96)
  – RDF+OO=>RDFS (~99)
  – RDFS+KR=>DAML+OIL (00)
  – W3C’s SW activity (01)
  – W3C’s OWL (03)
  – SPARQL, RDFa (08)

Ten years ago…
The W3C started developing standards for the Semantic Web.
The vision, technology and use cases are still evolving.
Moving from a web of documents to a web of data.

Web of documents

Web of (Linked) Data

One month ago…
4.5 billion facts in RDF in the Linked Data Collection

A Linked Data story
Wikipedia as a source of knowledge
  – Wikis are a great way to collaborate on building up knowledge resources
Wikipedia as an ontology
  – Every Wikipedia page is a concept or object
Wikipedia as RDF data
  – Map this ontology into RDF
DBpedia as the lynchpin for Linked Data
  – Exploit its breadth of coverage to integrate things

Populating Freebase KB

Underlying Powerset’s KB

Mined by TrueKnowledge

Wikipedia as an ontology
Using Wikipedia as an ontology:
  – each article (~3M) is an ontology concept or instance
  – terms are linked via the category system (~200k), infobox template use, inter-article links, infobox links
  – article history contains metadata for trust, provenance, etc.
It’s a consensus ontology with broad coverage.
Created and maintained by a diverse community for free!
Multilingual.
Very current.
Overall content quality is high.

Wikipedia as an ontology
Problems with treating Wikipedia as an ontology:
  – Uncategorized and miscategorized articles
  – Many ‘administrative’ categories (e.g., articles needing revision) and useless ones (e.g., 1949 births)
  – Multiple infobox templates for the same class
  – Multiple infobox attribute names for the same property
  – No datatypes or domains for infobox attribute values
  – etc.
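The attribute-name problem above can be illustrated with a small sketch. It assumes a hand-written alias table (`ATTRIBUTE_ALIASES`, hypothetical) and uses the DBpedia resource and property namespaces for URIs; the article title and values in the usage note are made up, and this is not the extraction code used by DBpedia or Wikitology.

```python
# Hypothetical alias table: maps variant infobox attribute names to one property name.
ATTRIBUTE_ALIASES = {
    "birth_date": "birthDate",
    "birthdate": "birthDate",
    "date_of_birth": "birthDate",
    "birth_place": "birthPlace",
    "birthplace": "birthPlace",
}

RESOURCE_NS = "http://dbpedia.org/resource/"
PROPERTY_NS = "http://dbpedia.org/property/"


def infobox_to_triples(article_title, infobox):
    """Return sorted (subject, predicate, object) triples for one infobox.

    Attribute names are lower-cased, spaces become underscores, and known
    variants are collapsed to a single property name. Values stay plain
    strings, since infoboxes carry no datatypes.
    """
    subject = RESOURCE_NS + article_title.replace(" ", "_")
    triples = set()
    for raw_name, value in infobox.items():
        key = raw_name.strip().lower().replace(" ", "_")
        prop = ATTRIBUTE_ALIASES.get(key, key)
        triples.add((subject, PROPERTY_NS + prop, value.strip()))
    return sorted(triples)
```

For example, `infobox_to_triples("Example Person", {"Birth place": "Example City", "birthplace": "Example City"})` yields a single triple, because both attribute names collapse to `birthPlace`.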

4.5 billion triples for free
The full public LOD dataset has about 4.5 billion triples as of March 2009.
Linking assertions are spotty, but probably include on the order of 10M equivalences.
Availability:
  – download the data in RDF
  – query it via public SPARQL servers
  – load it as an Amazon EC2 public dataset
  – launch it and the required software as an Amazon public AMI image

Wikitology
We’ve been exploring a different approach to derive an ontology from Wikipedia through a series of use cases:
  – Identifying user context in a collaboration system from documents viewed (2006)
  – Improve IR accuracy by adding Wikitology tags to documents (2007)
  – ACE: cross document co-reference resolution for named entities in text (2008)
  – TAC KBP: Knowledge Base population from text (2009)
  – Improve Web search engine by tagging documents and queries (2009)

Wikitology 2.0 (2008)
[Diagram labels: WordNet, Yago, Human input & editing, Databases, Freebase KB, RDF, text, graphs]

ACE entity co-reference task
In 2008 we participated in the NIST ACE task with the JHU Human Language Technology Center of Excellence.
Given 10K English and 10K Arabic documents, find all ‘named entities’ (people, organizations).
Cluster them into sets that refer to the same entity:
  – “Dr. Rice” mentioned in one doc is the same as “Secretary of State” in another doc
  – Distinguish Michael Jordan of the Bulls from Michael Jordan of Berkeley

HLTCOE ACE approach
BBN’s Serif system produces text annotated with named entities (people or organizations), e.g., Dr. Rice, Ms. Rice, the secretary, she, secretary Rice.
Featurizers score pairs of entities for co-reference (CNN E32, AFP E19, ).
A machine learning system combines the evidence.
A simple clustering algorithm identifies clusters.
[Diagram labels: NLP, ML, clust, FEAT, Documents, KB entities]
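The slide does not say which clustering algorithm was used. As one possible reading of "a simple clustering algorithm", the sketch below merges any two entities whose combined pair score reaches a threshold (single-link clustering via union-find). The function name, the threshold value, and the entity identifiers in the usage note are all assumptions made for illustration.

```python
def cluster_entities(entities, pair_scores, threshold=0.5):
    """Group entities into co-reference clusters.

    entities: iterable of entity ids.
    pair_scores: dict mapping (id_a, id_b) to a co-reference score; both ids
        must appear in `entities`.
    Any pair scoring at or above `threshold` ends up in the same cluster
    (single-link). Returns a sorted list of sorted id lists.
    """
    entities = list(entities)
    parent = {e: e for e in entities}

    def find(e):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for e in entities:
        clusters.setdefault(find(e), set()).add(e)
    return sorted(sorted(members) for members in clusters.values())
```

With hypothetical ids and scores, `cluster_entities(["CNN-E32", "AFP-E19", "ABC-E2"], {("CNN-E32", "AFP-E19"): 0.9, ("AFP-E19", "ABC-E2"): 0.1})` puts the first two entities in one cluster and leaves the third on its own.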

Wikitology tagging
Using BBN’s Serif system, we produced an entity document for each entity. Each document included the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions.
We tagged the entity documents using Wikitology, producing vectors of (1) terms and (2) categories for each entity.
We used the vectors to compute features measuring entity pair similarity/dissimilarity.
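One common way to compare two such tag vectors is cosine similarity. The sketch below assumes each vector is a dict from tag name to a non-negative weight; that representation and the weights in the usage note are illustrative assumptions, not Wikitology's actual data format.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {tag: weight} dicts.

    Returns 0.0 when either vector is empty or has zero length.
    """
    dot = sum(weight * vec_b.get(tag, 0.0) for tag, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For instance, two entities tagged `{"Webster_Hubbell": 1.0, "Whitewater_controversy": 0.5}` and `{"Webster_Hubbell": 0.8}` (hypothetical weights) share one tag and get a similarity strictly between 0 and 1, while entities with no tags in common score 0.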

Wikitology Entity Document & Tags

Wikitology entity document
ABC LDC2000T44-E2
Name: Webb Hubbell
Type & subtype: PER Individual
Mention heads: NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell" PRO: "he" "him" "his"
Words surrounding mentions: abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years

Wikitology article tag vector
Webster_Hubbell
Hubbell_Trading_Post_National_Historic_Site
United_States_v._Hubbell
Hubbell_Center
Whitewater_controversy

Wikitology category tag vector
Clinton_administration_controversies
American_political_scandals
Living_people
_births
People_from_Arkansas
Arkansas_politicians
American_tax_evaders
Arkansas_lawyers

Top ten features (by F1)

Prec.   Recall  F1      Feature description
90.8%   76.6%   83.1%   some NAM mention has an exact match
92.9%   71.6%   80.9%   Dice score of NAM strings (based on the intersection of NAM strings, not words or n-grams of NAM strings)
95.1%   65.0%   77.2%   the/a longest NAM mention is an exact match
86.9%   66.2%   75.1%   Similarity based on cosine similarity of Wikitology Article Medium article tag vector
86.1%   65.4%   74.3%   Similarity based on cosine similarity of Wikitology Article Long article tag vector
64.8%   82.9%   72.8%   Dice score of character bigrams from the 'longest' NAM string
95.9%   56.2%   70.9%   all NAM mentions have an exact match in the other pair
85.3%   52.5%   65.0%   Similarity based on a match of entities' top Wikitology article tag
85.3%   52.3%   64.8%   Similarity based on a match of entities' top Wikitology article tag
85.7%   32.9%   47.5%   Pair has a known alias
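Two of the features listed are Dice scores, one over whole NAM strings and one over character bigrams of the longest NAM string. A minimal sketch of both follows, assuming set-based (not multiset) counting and case-insensitive bigrams; the original featurizers may differ on either point, and the function names are hypothetical.

```python
def dice(set_a, set_b):
    """Dice coefficient of two sets: 2 * |A ∩ B| / (|A| + |B|).

    Returns 0.0 when both sets are empty.
    """
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def char_bigrams(text):
    """Set of lower-cased character bigrams of a string."""
    lowered = text.lower()
    return {lowered[i:i + 2] for i in range(len(lowered) - 1)}


def longest(names):
    """Longest string in a non-empty collection (ties broken alphabetically)."""
    return max(sorted(names), key=len)


def nam_string_dice(names_a, names_b):
    """Dice score over whole NAM strings of two entities."""
    return dice(set(names_a), set(names_b))


def longest_nam_bigram_dice(names_a, names_b):
    """Dice score over character bigrams of each entity's longest NAM string."""
    return dice(char_bigrams(longest(names_a)), char_bigrams(longest(names_b)))
```

For example, an entity with NAM strings `{"Hubbell", "Webb Hubbell"}` compared against one with `{"Webb Hubbell"}` shares one of three strings in total, giving a string-level Dice score of 2/3, while the bigram score on the longest strings is 1.0 because both longest strings are identical.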

Knowledge Base Population
The 2009 NIST Text Analysis Conference (TAC) includes a Knowledge Base Population track.
Goal: discover information about named entities (people, organizations, places) and incorporate it into a KB.
TAC KBP has two related tasks:
  – Entity linking: doc. entity mention -> KB entity
  – Slot filling: given a document entity mention, find missing slot values in a large corpus

Wikitology Planned Extensions
Make greater use of data from Linked Open Data (LOD) resources: DBpedia, Geonames, Freebase.
Replace ad hoc processing of RDF data in Lucene with a triple store.
Add additional graphs (e.g., derived from infobox links) and develop algorithms to exploit them.
Develop better hybrid query creation tools.

Wikitology 3.0 (2009)
[Diagram labels: Application Specific Algorithms, Wikitology Code, IR collection, Articles, Relational Database, Triple Store, RDF reasoner, Infobox Graph, Page Link Graph, Category Links Graph, Linked Semantic Web data & ontologies]

Wikipedia’s social network
Wikipedia has an implicit ‘social network’ that can help disambiguate mentions of a person entity.
This provides evidence useful in disambiguating entity mentions and mapping them to known KB entities.
The same can be done for entities that are organizations or places.

WSN Data
We extracted 213K people from DBpedia’s Infobox dataset, ~30K of which participate in an infobox link to another person.
We extracted 875K people from Freebase, 616K of which were linked to Wikipedia pages, 431K of which are in one of 4.8M person-person article links.
Consider a document that mentions two people: George Bush and Mr. Quayle.

Which Bush & which Quayle?
Six George Bushes, nine male Quayles

A simple closeness metric
Let N(i) = {two-hop neighbors of entity i}
Assoc(i, j) = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|
Assoc(i, j) > 0 for six of the 56 possible pairs:
  0.43  George_H._W._Bush -- Dan_Quayle
  0.24  George_W._Bush -- Dan_Quayle
  0.18  George_Bush_(biblical_scholar) -- Dan_Quayle
  0.02  George_Bush_(biblical_scholar) -- James_C._Quayle
  0.02  George_H._W._Bush -- Anthony_Quayle
  0.01  George_H._W._Bush -- James_C._Quayle
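The metric is a Jaccard coefficient over neighbor sets, which can be sketched in a few lines. The sketch assumes the link graph is an adjacency dict of sets, and that "two-hop neighbors" means every entity reachable in one or two links, excluding the entity itself; the slide does not spell out either detail, and the function names are hypothetical.

```python
def two_hop_neighbors(graph, entity):
    """Entities reachable from `entity` in one or two links, excluding itself.

    graph: dict mapping an entity to the set of entities it links to.
    """
    one_hop = graph.get(entity, set())
    reached = set(one_hop)
    for neighbor in one_hop:
        reached |= graph.get(neighbor, set())
    reached.discard(entity)
    return reached


def assoc(graph, i, j):
    """Assoc(i, j) = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|, or 0.0 if both sets are empty."""
    n_i = two_hop_neighbors(graph, i)
    n_j = two_hop_neighbors(graph, j)
    union = n_i | n_j
    if not union:
        return 0.0
    return len(n_i & n_j) / len(union)


def rank_candidate_pairs(graph, candidates_a, candidates_b):
    """Score every candidate pair and return the non-zero ones, best first."""
    scored = [(assoc(graph, a, b), a, b) for a in candidates_a for b in candidates_b]
    return sorted((s for s in scored if s[0] > 0), reverse=True)
```

Given the candidate entities for "George Bush" and "Mr. Quayle" as the two candidate lists, `rank_candidate_pairs` would produce a ranking of the kind shown above, with the highest-scoring pair taken as the most coherent joint interpretation.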

Application to TAC KBP
Entity network data extracted from DBpedia and Wikipedia provides evidence to support KBP tasks:
  – Mapping document mentions into infobox entities
  – Mapping potential slot fillers into infobox entities
  – Evaluating the coherence of entities as potential slot fillers

Conclusion
Wikipedia is increasingly being used as a knowledge source of choice.
Useful KBs can be extracted from it and related resources (e.g., DBpedia, Freebase).
Linked Open Data significantly enriches the KBs.
Hybrid systems like Wikitology, combining IR, RDF, and custom graph algorithms, are promising.
Wikitology performed well in the 2008 ACE task.
The 2009 TAC KBP task will be a good evaluation opportunity.