Wikitology Wikipedia as an Ontology Zareen Syed and Anupam Joshi University of Maryland, Baltimore County James Mayfield, Paul McNamee and Christine Piatko.

Presentation transcript:

Wikitology Wikipedia as an Ontology Zareen Syed and Anupam Joshi University of Maryland, Baltimore County James Mayfield, Paul McNamee and Christine Piatko JHU Human Language Technology Center of Excellence Tim Finin, UMBC

Overview Introduction Wikipedia as an ontology Applications Discussion Conclusion introduction  wikitology  applications  discussion  conclusion

Wikis and Knowledge
Wikis are a great way to collaborate on knowledge encoding
– Wikipedia is an archetype for this, but there are many examples
Ongoing research is exploring how to integrate this with structured knowledge
– DBpedia, Semantic MediaWiki, Freebase, etc.
I’ll describe an approach we’ve taken and experiments in using it
– We came at this from an IR/HLT perspective

Wikipedia data in RDF

Populating Freebase KB

Populating Powerset’s KB

AskWiki uses Wikipedia for QA

With sometimes surprising results

TrueKnowledge mines Wikipedia

Wikipedia pages as tags

Wikitology
We are exploring an approach to deriving an ontology from Wikipedia that is useful in a variety of language processing tasks

Our original problem (2006)
Problem: describe what an analyst has been working on to support collaboration
Idea: track documents she reads and map these to terms in an ontology, aggregate to produce a short list of topics
Approach: use Wikipedia articles as ontology terms, use document-article similarity for the mapping, and spreading activation for aggregation

What’s a document about?
Two common approaches:
(1) Select words and phrases using TF-IDF that characterize the document
(2) Map document to a list of terms from a controlled vocabulary or ontology
(1) is flexible and does not require creating and maintaining an ontology
(2) can tie documents to a rich knowledge base
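Approach (1) can be illustrated with a minimal sketch. The toy corpus, the whitespace tokenizer, and the function name `tfidf_keywords` are assumptions made for illustration, not part of the Wikitology system; the weighting is the plain tf × log(N/df) form.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    scores = {term: count * math.log(n_docs / df[term])
              for term, count in tf.items()}
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Hypothetical three-document corpus.
docs = [
    "crop rotation improves soil and crop yield",
    "the stock market fell as the bank failed",
    "the soil bank stores seeds",
]
print(tfidf_keywords(docs, 0, top_k=1))
```

Terms that are frequent in one document but rare in the collection rise to the top; no ontology is needed, which is the flexibility the slide attributes to approach (1).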

Wikitology!
Using Wikipedia as an ontology offers the best of both approaches
– Each article (~3M) is a concept in the ontology
– Terms are linked via Wikipedia’s category system (~200k) and inter-article links
– Lots of structured and semi-structured data
It’s a consensus ontology created and maintained by a diverse community
Broad coverage, multilingual, very current
Overall content quality is high

Wikitology features
Terms have unique IDs (URLs) and are “self describing” for people
Underlying graphs provide structure and associations: categories, article links, disambiguation, aliases (redirects), …
Article history contains useful meta-data for trust, provenance, controversy, …
External sources provide more info (e.g., Google’s PageRank)
Annotated with structured data from DBpedia, Freebase, Geonames & LOD

Problems as an Ontology
Treating Wikipedia as an ontology reveals many problems
Uncategorized and miscategorized articles
Single document in too many categories:
– George W. Bush is included in about 30 categories
Links between articles belonging to very different categories
– John F. Kennedy has a link to “coincidence theory”, which belongs to the Mathematical Analysis/Topology/Fixed Points categories

Problems as an Ontology
Article links in text are not “typed”
Uneven category articulation
– Some categories are underrepresented whereas others have many articles
Administrative categories, e.g.
– Clean up from Sep 2006
– Articles with unsourced statements
Over-linking, e.g.
– A mention of United States linked to the page United_states
– Mentions of 1949 linked to the year 1949

Problems as an Ontology
Wikipedia’s infobox templates have great potential but have several problems
Multiple templates for same class
Multiple attribute names for same property
– E.g., six attributes for a person’s birth date
Attributes lack domains or datatypes
– E.g., value can be string or link

Wikitology 1, 2, 3
We’ve addressed some of these problems in developing Wikitology
The development has been driven by several use cases and applications

Wikitology Use Cases
Identifying user context in a collaboration system from documents viewed (2006)
Improve IR accuracy by adding Wikitology tags to documents (2007)
Cross-document co-reference resolution for named entities in text (2008)
Knowledge Base population from text (2009)
Improve Web search engine by tagging documents and queries (2009)

Wikitology 1.0 (2007)
Structured Data
– Specialized concepts (article titles)
– Generalized concepts (category titles)
– Inter-category and inter-article links as relations between concepts
– Article-category links as relations between specialized and generalized concepts
Un-Structured Data
– Article text
Algorithms to remove useless categories and links, infer categories, and select, rank and aggregate concepts using the hybrid knowledge base
[Diagram labels: Human input & editing; text; graphs]

Experiments
Goal: given one or more documents, compute a ranked list of the top Wikipedia articles and/or categories that describe them
Basic metric: document similarity between Wikipedia article and document(s)
Variations: role of categories, eliminating uninteresting articles, use of spreading activation, using similarity scores, weighting links, number of spreading activation pulses, individual or set of query documents, etc.

Method 1
Using Wikipedia article text & categories to predict concepts
[Diagram, built up over three slides: Input: query doc(s) -> cosine similarity -> similar Wikipedia articles -> Wikipedia category graph -> rank categories by (1) links, (2) cosine similarity -> Output]
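The Method 1 pipeline can be sketched roughly as follows. This is a toy illustration under stated assumptions: raw term counts stand in for the Lucene-based similarity, the article texts and category assignments are hypothetical, and `rank_categories` is an illustrative name. Categories are ordered first by how many of the similar articles link to them, then by the summed cosine similarity of those articles.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_categories(query, articles, article_categories, top_n=2):
    """Rank the categories of the top_n articles most similar to the query."""
    q = Counter(query.lower().split())
    sims = {title: cosine(q, Counter(text.lower().split()))
            for title, text in articles.items()}
    top = sorted(sims, key=lambda t: (-sims[t], t))[:top_n]
    links = Counter()            # number of similar articles in each category
    score = defaultdict(float)   # summed similarity of those articles
    for title in top:
        for cat in article_categories[title]:
            links[cat] += 1
            score[cat] += sims[title]
    return sorted(links, key=lambda c: (-links[c], -score[c], c))

# Hypothetical miniature "Wikipedia".
articles = {
    "Crop_rotation": "crop rotation soil farming",
    "Green_manure": "green manure soil farming crop",
    "Turing_test": "machine intelligence test",
}
article_categories = {
    "Crop_rotation": ["Agriculture", "Agronomy"],
    "Green_manure": ["Agriculture", "Organic_farming"],
    "Turing_test": ["Artificial_intelligence"],
}
print(rank_categories("soil crop farming", articles, article_categories))
```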

Method 2
Using spreading activation on category link graph to get aggregated concepts
[Diagram: Input: query doc(s) -> similar to (cosine similarity) -> Wikipedia category graph -> spreading activation (input function, output function) -> Output: ranked concepts based on final activation score]

Method 3
Using spreading activation on article link graph
[Diagram: Input: query doc(s) -> similar to -> Wikipedia article links graph -> spreading activation (node input function, node output function) -> Output: ranked concepts based on final activation score]
Threshold: ignore spreading activation to articles with less than 0.4 cosine similarity score
Edge weights: cosine similarity between linked articles
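A minimal sketch of spreading activation over a weighted article link graph follows. The slides do not spell out the node input and output functions, so this sketch assumes the simplest choice: each pulse, every active node sends its activation multiplied by the edge weight to its neighbors, and each node adds what it receives. The graph, the starting activations, and the name `spread_activation` are hypothetical.

```python
def spread_activation(initial, edges, pulses=2, min_weight=0.4):
    """Propagate activation over a weighted, undirected article link graph.

    initial: {node: starting activation, e.g. cosine similarity to the query}
    edges:   {(node_a, node_b): weight, e.g. cosine similarity of the articles}
    Returns nodes ranked by final activation score.
    """
    # Build an adjacency list, dropping edges below the weight threshold.
    neighbors = {}
    for (a, b), w in edges.items():
        if w < min_weight:
            continue
        neighbors.setdefault(a, []).append((b, w))
        neighbors.setdefault(b, []).append((a, w))

    activation = dict(initial)
    for _ in range(pulses):
        incoming = {}
        for node, act in activation.items():
            for other, w in neighbors.get(node, []):
                incoming[other] = incoming.get(other, 0.0) + act * w
        # Assumed node input function: add the weighted incoming activation.
        for node, extra in incoming.items():
            activation[node] = activation.get(node, 0.0) + extra
    return sorted(activation, key=lambda n: (-activation[n], n))

# Hypothetical query matches and article links.
initial = {"Crop_rotation": 0.8, "Green_manure": 0.6}
edges = {
    ("Crop_rotation", "Organic_farming"): 0.7,
    ("Green_manure", "Organic_farming"): 0.6,
    ("Crop_rotation", "Turing_test"): 0.1,  # below threshold, ignored
}
print(spread_activation(initial, edges))
```

In this toy graph the article linked from both query matches ends up ranked first, which is the aggregation effect the method relies on; the weakly related article never receives activation.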

Evaluation
An initial informal evaluation compared results against our own judgments
– Used to select promising combinations of ideas and parameter settings
Formal evaluation:
– Selected Wikipedia articles for testing; removed them from the Lucene index and graphs
– For each, used the methods to predict categories and linked articles
– Compared results to known categories and linked articles using precision and recall
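The held-out comparison can be sketched as set-based precision and recall. The predicted and known category lists below are hypothetical, and `precision_recall` is an illustrative helper rather than the evaluation code used in the experiments.

```python
def precision_recall(predicted, actual):
    """Precision and recall of predicted labels against the known labels."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical held-out article: predicted vs. known categories.
predicted = ["Agriculture", "Agronomy", "Skills", "Crops"]
known = ["Agriculture", "Crops", "Sustainable_agriculture"]
p, r = precision_recall(predicted, known)
print(p, r)
```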

Example Prediction for Set of Test Documents
Test Document Titles in the Set (Wikipedia Articles): Crop_rotation, Permaculture, Beneficial_insects, Neem, Lady_Bird, Principles_of_Organic_Agriculture, Rhizobia, Biointensive, Intercropping, Green_manure
Method 1 (Ranking Categories Directly): Agriculture, Sustainable_technologies, Crops, Agronomy, Permaculture
Method 2 (2 pulses, Spreading Activation on Category Links Graph): Skills, Applied_sciences, Land_management, Food_industry, Agriculture
Method 3 (2 pulses, Spreading Activation on Article Links Graph): Organic_farming, Sustainable_agriculture, Organic_gardening, Agriculture, Companion_planting
[Callout on slide: Concept not in the Category Hierarchy]

Category prediction evaluation
Spreading activation with two pulses worked best
Only considering articles with similarity > 0.5 was a good threshold

Article prediction evaluation
Spreading activation with one pulse worked best
Only considering articles with similarity > 0.5 was a good threshold

Improving IR performance
Improving IR performance for a collection by adding semantic terms to documents
Query with blind relevance feedback may benefit from the semantic terms
Initial evaluation with NIST TREC 2005 collection in collaboration with Paul McNamee, JHU HLTCOE
Ongoing: integration into RiverGlass MORAG search engine

Improving IR performance
Doc: FT (3/9/92)
... Alan Turing, described as a brilliant mathematician and a key figure in the breaking of the Nazis' Enigma codes. Prof IJ Good says it is as well that British security was unaware of Turing's homosexuality, otherwise he might have been fired 'and we might have lost the war'. In 1950 Turing wrote the seminal paper 'Computing Machinery And Intelligence', but in 1954 killed himself...
Added semantic terms: Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recursion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence

Evaluation
Mixed results on NIST evaluation
– Slightly worse on mean average precision
– Slightly better for precision at 10
[Chart legend: base; base + rf; concepts + rf]

Information Extraction
Problem: resolve entities found by a named entity recognition system across documents to KB entries
ACE 2008: the NIST-run Automatic Content Extraction evaluation is focused on this task
– We were part of a team led by JHU Human Language Technology Center of Excellence
– Use Wikitology to map document entities to KB entities

Wikitology 2.0 (2008)
[Diagram labels: WordNet; Yago; Human input & editing; Databases; Freebase KB; RDF; text; graphs]

Named Entity Recognition Timothy F. Geithner, who as president of the New York Federal Reserve Bank oversaw many of the nation’s most powerful financial institutions, stunned the group with the audacity of his answer. He proposed asking Congress to give the president broad power to guarantee all the debt in the banking system, according to two participants, including Michele Davis, then an assistant Treasury secretary.

Open Calais Free NER service that returns results in RDF

Global Coreference Task
Start with entities and relations produced by a within-document extraction system
– Produce ‘Global’ clusters for PERSON and ORGANIZATION entities
– Only evaluate over instances of entities with a name
Challenges:
– Very limited development data: ACE released 49 files in English, none in Arabic; MITRE released English ACE05 corpus, but annotation is noisy and data has few ambiguous entities
– Within-document mistakes are propagated to cross-document system
– 10K document evaluation set required work on scalability of approaches
Examples: William Wallace (living British Lord) vs. William Wallace (of Braveheart fame); Abu Abbas aka Muhammad Zaydan aka Muhammad Abbas

Global Coreference Resolution Approach
Serif for intra-document processing
Entity Filtering
– Collect all pairs of SERIF entities
– Filter entity pairs with heuristics (e.g., string similarity of mentions) to get a high-recall set of pairs significantly smaller than the n^2 possible pairs
Feature generation
Training
– Train SVM to identify coreferent pairs
Entity Clustering
– Cluster predicted pairs
– Each connected component forms a global entity
Relation Identification
– Every pair of SERIF-identified relations whose types are identical and whose endpoints are coreferent are deemed to be coreferent
Example:
Document Entities: E1: Abu Abbas was arrested …; E2: Palestinian President Mahmoud Abbas ...; E3: … election of Abu Mazen; E4: … president George Bush
Filtered Pairs: E1, E2 (shared word); E1, E3 (shared word); E2, E3 (known alias)
Features: E1, E2: character overlap: 5; E1, E2: distinct Freebase entities: true; E1, E3: character overlap: 3; E1, E3: distinct Freebase entities: false; ….
Entity Clusters: Abu Mazen, Mahmoud Abbas, Muhammed Abbas, Abu Abbas, Palestinian Leader, convicted terrorist

Wikitology tagging
Using Serif’s output, we produced an entity document for each entity. It included the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions
We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity
We used the vectors to compute features measuring entity pair similarity/dissimilarity

Entity Document & Tags
Entity document:
ABC LDC2000T44-E2
Webb Hubbell
PER Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
NAM: "Mr. " "friend" "income"
PRO: "he" "him" "his"
Context words: abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
Wikitology article tag vector: Webster_Hubbell, Hubbell_Trading_Post National Historic Site, United_States_v._Hubbell, Hubbell_Center, Whitewater_controversy
Wikitology category tag vector: Clinton_administration_controversies, American_political_scandals, Living_people, _births, People_from_Arkansas, Arkansas_politicians, American_tax_evaders, Arkansas_lawyers 0.167

Wikitology derived features
Seven features measured entity similarity using cosine similarity of various length article or category vectors
Five features measured entity dissimilarity:
– two PER entities match different Wikitology persons
– two entities match Wikitology tags in a disambiguation set
– two ORG entities match different Wikitology organizations
– two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2)
– two ORG entities match different Wikitology orgs, weighted by 1-abs(score1-score2)
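The weighted dissimilarity features can be sketched as below. The slide gives only the weight 1-abs(score1-score2); returning 0.0 when both entities match the same Wikitology article, and the `(article, score)` input shape, are assumptions made for this illustration.

```python
def weighted_mismatch(match1, match2):
    """Dissimilarity evidence for two entities of the same type.

    Each match is (top Wikitology article, match score in [0, 1]).
    """
    (article1, score1), (article2, score2) = match1, match2
    if article1 == article2:
        return 0.0  # assumed: no dissimilarity evidence when the tags agree
    # Different top articles: the evidence is strongest when both matches
    # are about equally confident.
    return 1 - abs(score1 - score2)

# Hypothetical match scores for two PER entities.
print(weighted_mismatch(("George_W._Bush", 0.9), ("George_H._W._Bush", 0.9)))
print(weighted_mismatch(("George_W._Bush", 0.9), ("George_H._W._Bush", 0.4)))
```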

COE Features
Character-level features
– Exact Match of NAM mentions: longest mention exact match; some mention exact match; multiple mention exact match; all mention exact match
– Partial Match: Dice score, character bigrams; Dice score, longest mention character bigrams; last word of longest string match
– Matching nominals and pronominals: exact match; multiple exact match; all match; Dice score of mention strings
Document-level features
– Words: Dice score, words in document; Dice score, words around mentions; cosine score, words in document; cosine score, words around mentions
– Entities: Dice score, entities in document; Dice score, entities around mentions
Metadata features
– Speech/text
– News/non-news
– Same document
– Social context features: heuristic; probabilistic
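As an illustration of the partial-match features, here is a minimal Dice score over character bigrams. It uses sets of distinct, lowercased bigrams; whether the original features used sets or multisets, or normalized case, is not stated, so treat those choices as assumptions.

```python
def char_bigrams(s):
    """Set of distinct lowercase character bigrams in a string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over character bigrams."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

print(dice("Abbas", "Abbas"))
print(dice("night", "nacht"))
```

The same `dice` form applies to the word-level and entity-level features by swapping the bigram sets for sets of words or entities.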

More COE Features
KB features - instances
– Known alias (also derived aliases from test collection)
– BBN name match
– Famous singleton
KB features - semantic match
– Entity type match
– Sex match
– Number match
– Occupation match
– Fuzzy occupation match
– Nationality match
– Spouse match
– Parent match
– Sibling match
KB features - ontology
– Wikitology: top Wikitology category matches; top Wikitology article matches; different top Wikitology person; different top Wikitology organization; top Wikitology categories in disambiguation set
– Reuters topics: cosine score, words in document; cosine score, words around mentions
– Thesaurus concepts: cosine score, words in document; cosine score, words around mentions

Clustering
Approach
– Assign score to each entity pair (SVM or heuristic)
– Eliminate pairs whose score does not exceed threshold (0.95 for SVM runs)
– Identify connected components in resulting graph
Large clusters
– AP (good)
– Clinton (bad; conflates William and Hillary)
– Sources of large clusters varied: connected components clustering; SERIF errors; insufficient features to distinguish separate entities
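The thresholding and connected-components steps can be sketched with a small union-find. The pair scores below are hypothetical stand-ins for SVM outputs, and `cluster_entities` is an illustrative name, not the COE implementation.

```python
def cluster_entities(entities, pair_scores, threshold=0.95):
    """Group entities into connected components of the thresholded pair graph."""
    parent = {e: e for e in entities}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score > threshold:          # keep only pairs exceeding the threshold
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for e in entities:
        clusters.setdefault(find(e), set()).add(e)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical scores for four document entities.
entities = ["E1", "E2", "E3", "E4"]
pair_scores = {("E1", "E2"): 0.40, ("E1", "E3"): 0.30, ("E2", "E3"): 0.99}
print(cluster_entities(entities, pair_scores))
```

Because components are transitive, a single spurious high-scoring pair is enough to merge two otherwise separate clusters, which is one way an over-large cluster can arise.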

Features with High F1 scores
Recall that F1 = 2*P*R/(P+R)
Variants of exact name match, in general, especially: a name mention in one entity exactly matches one in the other (83.1%)
Cosine similarity of the vectors of top Wikitology article matches (75.1%)
Top Wikitology article for the two entities matched (38.1%)
An entity contained a mention that was a known alias of a mention found in the other (47.5%)
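For reference, a small sketch of the F1 formula with a guard for the zero case. The sample precision and recall values are illustrative only and are not taken from the evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2*P*R/(P+R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A very precise feature that fires rarely still gets a modest F1.
print(round(f1_score(1.0, 0.5), 3))
print(round(f1_score(0.95, 0.10), 3))
```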

Feature Ablation
A post hoc feature ablation evaluation showed contribution of KB features

High Precision Features
High precision/low recall features are useful when applicable
Features with precision > 95% include:
– A name mentioned by each entity matches exactly one person in Wikipedia
– The entities have the same parent
– The entities have the same spouse
– All name mentions have an exact match across the two entities
– Longest named mention has exact match

Knowledge Base Population
The 2009 NIST Text Analysis Conference (TAC) will include a new Knowledge Base Population track
Goal: discover information about named entities (people, organizations, places) and incorporate it into a KB
TAC KBP has two related tasks:
– Entity linking: doc. entity mention -> KB entity
– Slot filling: given a document entity mention, find missing slot values in large corpus

KBs and IE are Symbiotic
[Diagram: Knowledge Base <-> Information Extraction from Text]
KB info helps interpret text
IE helps populate KBs

Planned Extensions
Make greater use of data from Linked Open Data (LOD) resources: DBpedia, Geonames, Freebase
Replace ad hoc processing of RDF data in Lucene with a triple store
Add additional graphs (e.g., derived from infobox links) and develop algorithms to exploit them
Develop better hybrid query creation tools

Wikitology 3.0 (2009)
[Diagram labels: Articles; IR collection; Infobox Graph; Page Link Graph; Category Links Graph; Relational Database; Triple Store; RDF reasoner; Linked Semantic Web data & ontologies; Wikitology Code; Application Specific Algorithms]

Challenges
Wikitology tagging is expensive
– ~3 seconds/document
– ACE English: ~150K entities (~24 hr on Bluegrit)
– A spreading activation algorithm on the underlying graphs improves accuracy at even more cost
Exploit the RDF metadata and data and the underlying graphs
– Requires reasoning and graph processing
Extract entities from Wiki text to find more relations
– More graph processing

Wikipedia’s social network Wikipedia has an implicit ‘social network’ that can help disambiguate PER mentions Resolving PER mentions in a short document to KB people who are linked in the KB is good The same can be done for the network of ORG and GPE entities

WSN Data
We extracted 213K people from DBpedia’s Infobox dataset, ~30K of which participate in an infobox link to another person
We extracted 875K people from Freebase, 616K of which were linked to Wikipedia pages, 431K of which are in one of 4.8M person-person article links
Consider a document that mentions two people: George Bush and Mr. Quayle

Which Bush & which Quayle?
Six George Bushes; nine male Quayles

A simple closeness metric
Let Si = {two-hop neighbors of candidate i}
Cij = |intersection(Si, Sj)| / |union(Si, Sj)|
Cij > 0 for six of the 56 possible pairs:
0.43 George_H._W._Bush -- Dan_Quayle
0.24 George_W._Bush -- Dan_Quayle
0.18 George_Bush_(biblical_scholar) -- Dan_Quayle
0.02 George_Bush_(biblical_scholar) -- James_C._Quayle
0.02 George_H._W._Bush -- Anthony_Quayle
0.01 George_H._W._Bush -- James_C._Quayle
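The closeness metric is a Jaccard overlap of two-hop neighborhoods and can be sketched as below. The link graph and node names are hypothetical placeholders, and two details are assumptions: the two-hop set here includes one-hop neighbors and excludes the node itself.

```python
def two_hop_neighbors(graph, node):
    """Nodes reachable from `node` in one or two hops, excluding `node`."""
    one_hop = graph.get(node, set())
    reached = set(one_hop)
    for n in one_hop:
        reached |= graph.get(n, set())
    reached.discard(node)
    return reached

def closeness(graph, a, b):
    """Cij = |Si ∩ Sj| / |Si ∪ Sj| over two-hop neighbor sets."""
    sa, sb = two_hop_neighbors(graph, a), two_hop_neighbors(graph, b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

# Hypothetical person-person link graph with two "Bush" candidates.
graph = {
    "Bush_1": {"Quayle_1", "Person_X"},
    "Quayle_1": {"Bush_1", "Person_X"},
    "Person_X": {"Bush_1", "Quayle_1"},
    "Bush_2": {"Scholar_Y"},
    "Scholar_Y": {"Bush_2"},
}
candidates = ["Bush_1", "Bush_2"]
best = max(candidates, key=lambda c: closeness(graph, c, "Quayle_1"))
print(best, closeness(graph, best, "Quayle_1"))
```

Scoring every candidate pair this way and keeping the highest-scoring combination is one simple way to use the implicit social network for disambiguation.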

Application to TAC KBP
Using entity network data extracted from DBpedia and Wikipedia provides evidence to support KBP tasks:
– Mapping document mentions into infobox entities
– Mapping potential slot fillers into infobox entities
– Evaluating the coherence of entities as potential slot fillers

Next Steps
Construct a Web-based API and demo system to facilitate experimentation
Process Wikitology updates in real-time
Exploit machine learning to classify pages and improve performance
Better use of cluster using Hadoop, etc.
Exploit cell technology for spreading activation and other graph-based algorithms
– e.g., recognize people by the graph of relations they are part of

DBpedia ontology
DBpedia 3.2 (Nov 2008) added a manually constructed ontology with
– 170 classes in a subsumption hierarchy
– 880K instances
– 940 properties with domain and range
A partial, manual mapping was constructed from infobox attributes to these terms
Current domain and range constraints are “loose”
Namespace:
Place 248,000
Person 214,000
Work 193,000
Species 90,000
Org. 76,000
Building 23,000

Person 56 properties

Organisation 50 properties

Place 110 properties

Exploiting Linked Data

Conclusion
Our initial applications show that the Wikitology idea has merit
Wikipedia is increasingly being used as a knowledge source of choice
Easily extendable to other wikis and collaborative KBs, e.g., Intellipedia
Serious use may require exploiting cluster machines and cell processing
We need to move beyond Wikipedia to exploit the LOD cloud