Computing FOAF Co-reference Relations with Rules and Machine Learning Jennifer Sleeman and Tim Finin University of Maryland, Baltimore County The Third.

Presentation transcript:

Computing FOAF Co-reference Relations with Rules and Machine Learning
Jennifer Sleeman and Tim Finin
University of Maryland, Baltimore County
The Third International Workshop on Social Data on the Web, November

FOAF
- The Friend of a Friend (FOAF) vocabulary describes people and their relationships
  - One of the oldest and most widely used ontologies
- Does not include a globally unique identifier
  - Inverse functional properties (IFPs) help
- Multiple foaf instances referring to the same person are common
  - Increasingly so with more linked data
introduction · foaf co-reference · approach · methodology · evaluation · conclusions

Linking data
- Data integration requires linking instances from different data sets
- Linking foaf instances is a common and typical use case
- Sindice reports 23 foaf instances all referring to Sir Tim Berners-Lee
  - Probably more than my query revealed
  - Only a handful are linked via owl:sameAs
  - Automatically linking foaf instances is not always easy

Example 1
Bijan Parsia Bijan Bijan Parsia Parsia f49a c5fa76dc0edb8e82f8fe04fd56bc9 Bijan Parsia Bijan Parsia bparsia bparsia
Common properties but can we say this is the same person…

Example 2
James A. Hendler James Hendler 0b62d e64be c79a811b3e82fd52 Jim Hendler Jim Hendler Tetherless World Constellation Chair jhendler jhendler
Aliases and slight name variations…

Example 3
a31a78661b5c746feff39a9db6e4e2cc5cf David Wood dw2 David Wood Dr. David Wood prototypo 37c8d030d4e615d05f31625b a3f4e214e piprototypo
What if mbox_sha1sums are different?

Example 3 cont.
David Wood
Which David Wood was a mindswapper?

Example 5
a31a78661b5c746feff39a9db6e4e2cc5cf jgolbeck jgolbeck Jennifer Golbeck Jennifer Jennifer Golbeck Golbeck
Could jgolbeck and Jennifer Golbeck be the same person …

Example 5 cont.
Jennifer Golbeck Jennifer Golbeck
Which profile is most recent/relevant?

Our Contributions
- Treating foaf smushing as entity co-reference
- Using machine learning to train a classifier for recognizing co-referent foaf instances
- Combining this with rule-based evidence
- Using narrower RDF properties to express co-reference, avoiding overuse of owl:sameAs
- Using a greedy algorithm for iteratively clustering co-referent entities and re-evaluating their potential co-reference relations

Co-Reference in FOAF
- Approach the problem like cross-document co-reference resolution in text
- Match pairs of FOAF agents
- Use rules and properties
- Assign new properties to represent coref and notCoref relationships
- Cluster co-referent pairs

Cross-Document Co-reference Resolution
- Determine when two documents mention the same entity
  - Are two documents that talk about “George Bush” talking about the same George Bush?
  - Is a document mentioning “Mahmoud Abbas” referring to the same person as one mentioning “Muhammed Abbas”? What about “Abu Abbas”? “Abu Mazen”?
- Drawing appropriate inferences from multiple documents demands cross-document co-reference resolution
2008 NIST Text Analysis Conference

TAC KBP: Entity Linking
Given an entity mention in an article, find the link to the right Wikipedia entity if one exists (NIST TAC Knowledge Base Population Track)

Mention: John Williams
Text: Richard Kaufman goes a long way back with John Williams. Trained as a classical violinist, Californian Kaufman started doing session work in the Hollywood studios in the 1970s. One of his movies was Jaws, with Williams conducting his score in recording sessions in
Candidate entities:
  John Williams       author
  J. Lloyd Williams   botanist
  John Williams       politician    1955-
  John J. Williams    US Senator
  John Williams       Archbishop
  John Williams       composer      1932-
  Jonathan Williams   poet          1929-

Mention: Michael Phelps
Text: Debbie Phelps, the mother of swimming star Michael Phelps, who won a record eight gold medals in Beijing, is the author of a new memoir,...
Text: Michael Phelps is the scientist most often identified as the inventor of PET, a technique that permits the imaging of biological processes in the organ systems of living individuals. Phelps has...
Candidate entities:
  Michael Phelps   swimmer        1985-
  Michael Phelps   biophysicist   1939-

Smushing
- Smushing is the traditional term used for recognizing that two “blank nodes” refer to the same thing and merging them
- Past work on smushing has exploited IFPs (e.g., foaf:mbox), heuristic similarity metrics and custom SPARQL queries
- owl:sameAs is often used to relate smushed nodes, enabling a reasoner to effect the merging
- rdfs:seeAlso is used to find related foaf data

Smushing
[Figure: RDF graph with labels foaf:Person, rdf:type, foaf:mbox, foaf:knows, foaf:nick "bar", owl:sameAs, foaf:mbox]

Smushing
[Figure: RDF graph with labels foaf:Person, rdf:type, foaf:knows, foaf:nick "bar", foaf:mbox]

owl:sameAs considered harmful
- Known problems
  - Temporally qualified data (Ding vs. Ding)
  - Noisy data (Clinton vs. Clinton)
  - Referentially opaque contexts (John likes the Morning Star beautiful)
- Halpin et al. (2010) suggest a vocabulary for similarity relations (similarity.owl)
- We use two weaker predicates: coref & notCoref
  - Defer the sameAs problem to applications

Co-Reference in FOAF
- coref: transitive, symmetric and reflexive; has sameAs as subproperty
- notCoref: symmetric and irreflexive but not transitive; has differentFrom as subproperty

    :coref a owl:TransitiveProperty, owl:SymmetricProperty, owl:ReflexiveProperty.
    owl:sameAs rdfs:subPropertyOf :coref.
    :notCoref a owl:SymmetricProperty, owl:IrreflexiveProperty.
    owl:differentFrom rdfs:subPropertyOf :notCoref.
    {?a :notCoref ?b. ?b :coref ?c.} => {?a :notCoref ?c}.
    {?a foaf:knows ?b.} => {?a :notCoref ?b}.

The :coref and :notCoref properties that we use instead of owl:sameAs
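The interaction between the two predicates can be sketched in Python. This is only an illustration of the rule {?a :notCoref ?b. ?b :coref ?c} => {?a :notCoref ?c}, not the authors' reasoner: the function names `coref_classes` and `propagate_not_coref` and the entity identifiers are hypothetical.

```python
from itertools import product


def coref_classes(entities, coref_pairs):
    """Group entities into equivalence classes under :coref
    (treated as symmetric, transitive and reflexive)."""
    parent = {e: e for e in entities}

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]  # path halving
            e = parent[e]
        return e

    for a, b in coref_pairs:
        parent[find(a)] = find(b)

    classes = {}
    for e in entities:
        classes.setdefault(find(e), set()).add(e)
    return list(classes.values())


def propagate_not_coref(classes, not_coref_pairs):
    """Extend each asserted :notCoref pair to every member of the
    two :coref classes involved; pairs are unordered (symmetric)."""
    class_of = {e: frozenset(c) for c in classes for e in c}
    inferred = set()
    for a, b in not_coref_pairs:
        if class_of[a] == class_of[b]:
            # :notCoref is irreflexive, so this evidence is contradictory.
            raise ValueError(f"{a} and {b} are asserted both coref and notCoref")
        for x, y in product(class_of[a], class_of[b]):
            inferred.add(frozenset((x, y)))
    return inferred


classes = coref_classes(["p1", "p2", "p3", "p4"], [("p1", "p2")])
inferred = propagate_not_coref(classes, [("p2", "p3")])
```

Because p1 and p2 are co-referent, the single assertion that p2 is not p3 also separates p1 from p3, while p4 stays undetermined.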

Batch Approach
- Given a potentially large set of foaf instances
- Generate candidate pairs
- Evaluate each pair for co-reference
  - Using rules and classifier independently
  - Each results in a {coref, notCoref, unknown} decision
  - Trust rules over classifier
- Designate pairs as co-referent
- Create clusters
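The "trust rules over classifier" step can be written as a small decision function. This is a minimal sketch under the assumption that both components emit one of the three decisions listed above; the `Decision` enum and `combine` function are hypothetical names, not part of the described system.

```python
from enum import Enum


class Decision(Enum):
    COREF = "coref"
    NOT_COREF = "notCoref"
    UNKNOWN = "unknown"


def combine(rule_decision, classifier_decision):
    """Prefer the rule-based decision; use the classifier's
    prediction only when the rules draw no conclusion."""
    if rule_decision is not Decision.UNKNOWN:
        return rule_decision
    return classifier_decision


final = combine(Decision.UNKNOWN, Decision.COREF)
```

Since the rules leave most pairs undetermined, the classifier ends up deciding the majority of pairs, and the rules act as an override where deductive evidence exists.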

Ingest
- Extract triples from FOAF profiles
- Add each foaf agent as new entity in database
- Entity URLs followed in foaf:knows graph to get additional information

Approach: System Architecture
[Figure: pipeline with stages ingestion, candidate pair generation, rule-based reasoning, machine learning, and co-referent designation and clustering. Other labels: model generation; abstract entity generation; potential pairs: reduces classifier workload; deductive decisions; predictions; clusters form new abstract entities]

Candidate Pairs
- Filter pairs to reduce the matching set
- Use simple string matching predicates
  - Dice score for 3-grams
- Apply both to values of common properties and also to cross-property values
- Experiment 2: ~30% reduction
- Reductions vary based on data set
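A candidate filter of this kind could look like the following sketch. It assumes a set-based Dice coefficient over lowercased character 3-grams and a hypothetical threshold of 0.5; the slides do not give the exact tokenization or cut-off, and the function names and sample values are illustrative only.

```python
from itertools import combinations


def trigrams(text):
    """Set of character 3-grams of the lowercased string."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def dice(a, b):
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over 3-gram sets.
    Strings shorter than three characters score 0.0."""
    x, y = trigrams(a), trigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def candidate_pairs(instances, threshold=0.5):
    """Keep a pair if any value of one instance is similar to any
    value of the other, which covers both common-property and
    cross-property comparisons. The threshold is an assumption."""
    pairs = []
    for a, b in combinations(sorted(instances), 2):
        values_a = instances[a].values()
        values_b = instances[b].values()
        if any(dice(va, vb) >= threshold for va in values_a for vb in values_b):
            pairs.append((a, b))
    return pairs


instances = {
    "a": {"name": "Jim Hendler"},
    "b": {"nick": "jim hendler"},
    "c": {"name": "David Wood"},
}
pairs = candidate_pairs(instances)
```

Here the name of instance a matches the nick of instance b across properties, so only that pair is passed on to the rules and the classifier.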

Input data sources
- FOAF profiles extracted from Swoogle
- Also used URLs extracted from tests conducted in previous work
[Figure: Distribution of URLs from Experiment 2]

Methodology: Rule-based Model
- Rules conclude that two instances are co-referent, not co-referent or draw no conclusion (the most common outcome)
- Basic co-reference rules:

    {?p a owl:IFP. ?a ?p ?x. ?b ?p ?x} => {?a :coref ?b}.
    {?p a owl:FP. ?a ?p ?x. ?a ?p ?y} => {?x :coref ?y}.
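The inverse functional property rule amounts to grouping subjects by shared (property, value) pairs. The sketch below assumes triples are plain Python tuples and that the set of IFPs is supplied by the caller; `ifp_coref_pairs` and the sample identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def ifp_coref_pairs(triples, ifps):
    """Return pairs of subjects that share a value for any
    inverse functional property listed in `ifps`."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects_by_value[(p, o)].add(s)

    pairs = set()
    for subjects in subjects_by_value.values():
        for a, b in combinations(sorted(subjects), 2):
            pairs.add((a, b))
    return pairs


triples = [
    ("ex:p1", "foaf:mbox_sha1sum", "hash-1"),
    ("ex:p2", "foaf:mbox_sha1sum", "hash-1"),
    ("ex:p3", "foaf:mbox_sha1sum", "hash-2"),
    ("ex:p1", "foaf:name", "Jim Hendler"),
    ("ex:p3", "foaf:name", "Jim Hendler"),
]
coref = ifp_coref_pairs(triples, {"foaf:mbox_sha1sum"})
```

A shared foaf:name is deliberately ignored here because it is not inverse functional; name similarity is left to the classifier features.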

Methodology: Rule-based Model
- In text processing, very similar name mentions in a document are more likely to be co-referent
- This heuristic is also used in disambiguating name mentions in citations in a single paper or Web page
- A similar heuristic is useful for a “knows graph” extracted from a single foaf profile

    {?a foaf:knows ?b. ?a foaf:knows ?c. ?b neq ?c} => {?b :notCoref ?c}.
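The knows-graph rule can be sketched as follows: distinct people listed in the same person's foaf:knows set are marked as not co-referent. The tuple representation and the name `knows_not_coref` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def knows_not_coref(triples):
    """Pairs of distinct entities known by the same subject,
    returned as unordered pairs because :notCoref is symmetric."""
    friends = defaultdict(set)
    for s, p, o in triples:
        if p == "foaf:knows":
            friends[s].add(o)

    pairs = set()
    for known in friends.values():
        for b, c in combinations(sorted(known), 2):
            pairs.add(frozenset((b, c)))
    return pairs


triples = [
    ("ex:me", "foaf:knows", "ex:f1"),
    ("ex:me", "foaf:knows", "ex:f2"),
    ("ex:other", "foaf:knows", "ex:f3"),
]
not_coref = knows_not_coref(triples)
```

Friends that appear in different profiles, such as ex:f1 and ex:f3 above, are left undetermined by this rule.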

Methodology – Vector Model
- Support Vector Machine with a linear kernel
- Features:
  - Match/no match of any IFPs
  - Distance measures over common property values (Levenshtein & 3-gram Dice score)
  - Alias and entity mention resolution
  - Property-specific feature comparison
  - Knows graph comparisons: Jaccard coefficient of similarity of foaf names of one-hop neighbors
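Three of the listed features can be computed with the standard library alone, as in the sketch below. The record layout (`name`, `ifp_values`, `knows_names`), the normalization of the edit distance, and the function names are assumptions; the slides do not specify how the features were encoded. The resulting vectors would then be passed to a linear-kernel SVM implementation of choice.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def jaccard(x, y):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| of two collections."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def pair_features(a, b):
    """Feature vector for a candidate pair: IFP match flag,
    normalized name similarity, and one-hop knows-graph overlap."""
    longest = max(len(a["name"]), len(b["name"])) or 1
    return [
        1.0 if a["ifp_values"] & b["ifp_values"] else 0.0,
        1.0 - levenshtein(a["name"], b["name"]) / longest,
        jaccard(a["knows_names"], b["knows_names"]),
    ]


a = {"name": "Jim Hendler", "ifp_values": {"hash-1"},
     "knows_names": {"Tim Finin", "Bijan Parsia"}}
b = {"name": "Jim Hendler", "ifp_values": {"hash-2"},
     "knows_names": {"Tim Finin"}}
features = pair_features(a, b)
```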

Methodology: Clustering
- Pairs form clusters
- Clusters used as part of system evaluation
- Can result in:
  - Entity to Entity pairing
  - Cluster to Entity pairing
  - Cluster to Cluster pairing
- Greedy process with a confidence threshold
- Use rule-based model to eliminate known non-coreferent pairs
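One way to realize a greedy, threshold-based clustering with rule-based vetoes is sketched below. It is an assumption-laden illustration rather than the authors' algorithm: the merge order (highest confidence first), the threshold of 0.8, and the name `greedy_cluster` are all hypothetical.

```python
def greedy_cluster(entities, scored_pairs, not_coref, threshold=0.8):
    """Merge clusters greedily, highest-confidence pair first.
    `scored_pairs` holds (score, a, b) tuples; `not_coref` holds
    frozenset pairs that the rules declared non-co-referent."""
    cluster_of = {e: frozenset([e]) for e in entities}
    for score, a, b in sorted(scored_pairs, reverse=True):
        if score < threshold:
            break  # remaining pairs are below the confidence threshold
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # already in the same cluster
        # Veto the merge if any cross-cluster pair is known notCoref.
        if any(frozenset((x, y)) in not_coref for x in ca for y in cb):
            continue
        merged = ca | cb
        for e in merged:
            cluster_of[e] = merged
    return set(cluster_of.values())


clusters = greedy_cluster(
    ["a", "b", "c", "d"],
    [(0.95, "a", "b"), (0.90, "b", "c"), (0.85, "c", "d"), (0.30, "a", "d")],
    not_coref={frozenset(("a", "d"))},
)
```

The example covers entity-to-entity pairing (a with b), cluster-to-entity pairing (the a-b cluster with c), and a vetoed merge: d stays separate because the rules say a and d are not co-referent.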

Methodology – Clustering
Instance matching can result in new cluster formation and cluster matching can result in merged clusters.

Evaluation
- Two experiments
  - E1: 50,000 triples, over 500 entity mentions, 600 classes used for training
  - E2: 250,000 triples, over 3500 entity mentions, over 1800 classes for training
- 10-fold cross-validation tests

Evaluation
- For E1: 900 pairs non-match, majority undetermined
- E2: results shown below

Pairs   Rule                 Conclusion
        differentFrom        Undetermined
47184   inverse functional   Undetermined
2402    inverse functional   Co-referent
        knows graph          Undetermined
        sameAs               Undetermined
        knows                Not Co-referent

Evaluation
- Results promising
- During our E2 clustering phase, the first phase reached 90% accuracy
- In the second phase, no new relationships among pairs were found; cluster-to-cluster pairing occurred
[Figure: Classification results using 10-fold validation]

Evaluation
- Retrieving additional FOAF profiles based on the knows graph
  - Quickly retrieves a large number of entities
  - Tightly linked
    - reduced diversity of analyzed data
    - more entities that are co-referent
- Future experiments: a diversity filter spanning domains

Future Work
- Evaluating the contribution of each rule and SVM feature to performance
- Other ML approaches, e.g., Markov logic, EM
- Exploiting better clustering algorithms
- Adding more features, e.g., non-foaf vocabulary, non-RDF data (e.g., hosting site)
- Applying the approach to other RDF instances
- Scalability:
  - Providing a non-batch, streaming service
  - Offering a coref Web service

Conclusions
- We can treat instance linking as co-reference resolution and exploit the in-document and cross-document distinction
- Good results with an ensemble approach combining rules and an SVM classifier
- Apply clustering to form groups of co-referent relations and reprocess
- Promising initial results

Introduction
- Determine if two entities are co-referent
- Co-reference is common in human language, database address records, and even official government records
- Use of name and other properties as evidence of co-reference
- Merging information
- More complete representation

Introduction
- Semantic Web
- Co-referent relations
- Experimented using an ensemble approach
- Used both RDF and OWL rules and a Support Vector Machine to classify pairs of FOAF instances
- Introduced new coref and notCoref predicates to convey co-referent relationships among relations

Example 4
Richard Stallman rms ed575e3a6d0c0de f8308f72ecd 1f3af991f2b053b34178c093fe96725cfa Richard Stallman I'm the president of the Free Software Foundation. … rms …
mbox_sha1sums are different but this is the same person…

Example 6
Yarden Katz jordan milesdavis Jordan Katz jordan acba421f117a4b32dbb14eb1971f6a21d9f17deb
Names are different but nicknames are the same…

Methodology: input data
[Figure: FOAF property distribution from earlier data (March 2010)]