
Presentation transcript:

Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
6. Web Content Analytics
7. Wrap-Up
Entity Discovery in Web Contents
Entity-centric Search & Analytics
KB-enhanced Sentiment Analysis

Disambiguating Names to Entities
When Page played Kashmir at Knebworth, his Les Paul was uniquely tuned.
Images taken from Wikipedia under CC BY-SA 3.0

Disambiguating Names to Entities
When Page played Kashmir at Knebworth, his Les Paul was uniquely tuned.
Many possible combinations of candidate entities.
Images taken from Wikipedia under CC BY-SA


Common Features for Disambiguation
When Page played Kashmir at Knebworth, his Les Paul was uniquely tuned.
- Prior: How often did “Kashmir” link to this entity in Wikipedia? (values shown: 91%, 5%)
- Context: How well do entity keyphrases and context tokens overlap? (Led Zeppelin, Jimmy Page, Knebworth Festival, … versus India, Pakistan, Pashmina, …)
- Coherence: Are the disambiguated entities related?
Images taken from Wikipedia under CC BY-SA 3.0

Mention-Entity Popularity Weights
Collect hyperlink anchor-text / link-target pairs from:
- Wikipedia redirects
- Wikipedia links between articles and Interwiki links
- Web links pointing to Wikipedia articles
- query-and-click logs
- …
Build statistics to estimate P[entity | name]
Need dictionary with entities' names:
- full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corp.
- short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …
- nicknames & aliases: Terminator, City of Angels, Evil Empire, …
- acronyms: LA, UCLA, MS, MSFT
- role names: the Austrian action hero, Californian governor, CEO of MS, …
- …
- plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her.
[Mihalcea/Tarau 2007, Spitkovsky/Chang 2012]
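The popularity prior P[entity | name] described above can be estimated by plain counting over anchor-text / link-target pairs. The following is a minimal sketch under that reading; the function name `build_prior` and the toy pairs are invented for illustration and do not come from the slides.

```python
from collections import Counter, defaultdict


def build_prior(anchor_pairs):
    """Estimate P[entity | name] from (anchor text, link target) pairs."""
    counts = defaultdict(Counter)
    for name, entity in anchor_pairs:
        counts[name][entity] += 1
    prior = {}
    for name, entity_counts in counts.items():
        total = sum(entity_counts.values())
        # Relative frequency of each link target for this anchor text.
        prior[name] = {e: c / total for e, c in entity_counts.items()}
    return prior


# Toy data, invented for illustration only.
pairs = [("Kashmir", "Kashmir_(region)")] * 9 + [("Kashmir", "Kashmir_(song)")]
prior = build_prior(pairs)
print(prior["Kashmir"])
```

In practice the pairs would come from the sources listed on the slide (redirects, inter-article links, Web links, click logs), and the names would usually be normalized before counting.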

Mention-Entity Context
Keyphrase sources: Link Anchor Texts, Category Names, Citation Titles, Titles of Linking Articles
Example keyphrases from the figure: Knebworth Festival Led Zeppelin Remasters John Paul Jones Mellotron

Mention-Entity Context
- Keyphrases (kp) commonly occur only partially
  Keyphrase: “Songs written by Robert Plant”
  Text: Kashmir was written by Page and Plant. (the matched span is the cover)
- Account for partial matches and for the weight of the contained tokens w
- To score an entity, all keyphrase scores are summed

Mention-Entity Context
Token weights:
- Global IDF of a keyphrase token w in Wikipedia
- Mutual Information of a token w and an associated entity: how often does the token occur in the keyphrase set of an entity?
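A rough sketch of how token weights and partial keyphrase matches could combine into an entity score. This is a simplification, not the scoring formula from the slides (which did not survive extraction): it uses a smoothed IDF as the only token weight, ignores mutual information, and ignores how tightly the matched tokens cluster in the text. All names and the toy keyphrase sets are hypothetical.

```python
import math


def idf_weights(keyphrase_sets):
    """Smoothed IDF per token; each entity's keyphrase set counts as one document."""
    n = len(keyphrase_sets)
    df = {}
    for phrases in keyphrase_sets.values():
        tokens = {t for p in phrases for t in p.lower().split()}
        for t in tokens:
            df[t] = df.get(t, 0) + 1
    return {t: math.log(1 + n / d) for t, d in df.items()}


def keyphrase_score(phrase, context_tokens, weight):
    """Weighted fraction of the keyphrase's tokens found in the context."""
    tokens = phrase.lower().split()
    total = sum(weight.get(t, 0.0) for t in tokens)
    if total == 0:
        return 0.0
    matched = sum(weight.get(t, 0.0) for t in tokens if t in context_tokens)
    return matched / total


def entity_score(phrases, context_tokens, weight):
    """Sum of all keyphrase scores for one candidate entity."""
    return sum(keyphrase_score(p, context_tokens, weight) for p in phrases)


# Toy keyphrase sets, invented for illustration only.
KEYPHRASES = {
    "Kashmir_(song)": ["Songs written by Robert Plant", "Led Zeppelin"],
    "Kashmir_(region)": ["India Pakistan border", "Pashmina wool"],
}
context = set("kashmir was written by page and plant".split())
weights = idf_weights(KEYPHRASES)
song_score = entity_score(KEYPHRASES["Kashmir_(song)"], context, weights)
region_score = entity_score(KEYPHRASES["Kashmir_(region)"], context, weights)
print(song_score, region_score)
```

The keyphrase “Songs written by Robert Plant” matches only partially ("written", "by", "plant"), yet still contributes, which is the point of accounting for partial matches.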

Entity-Entity Coherence
- Precompute overlap of incoming links for entities e1 and e2
- Alternatively compute overlap of anchor texts for e1 and e2, or overlap of keyphrases, or similarity of bag-of-words, or …
- Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances)
- For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance
Overview by [Ceccarelli et al.: CIKM 2013]
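Two ways to turn "overlap of incoming links" into a number, sketched under assumptions: a plain Jaccard index, and the normalized link-based relatedness commonly attributed to Milne and Witten. The slides do not state which measure is used; the in-link sets and the article count below are invented.

```python
import math


def jaccard(a, b):
    """Jaccard index of two in-link sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def milne_witten(a, b, n_articles):
    """Link-based relatedness; assumes n_articles exceeds both set sizes."""
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    big, small = max(len(a), len(b)), min(len(a), len(b))
    denom = math.log(n_articles) - math.log(small)
    if denom <= 0:
        return 0.0
    distance = (math.log(big) - math.log(overlap)) / denom
    return max(0.0, 1.0 - distance)


# Hypothetical in-link sets (article ids) for two entities.
inlinks_e1 = {1, 2, 3, 4}
inlinks_e2 = {3, 4, 5}
print(jaccard(inlinks_e1, inlinks_e2))
print(milne_witten(inlinks_e1, inlinks_e2, n_articles=1000))
```

Either score can be precomputed for all candidate pairs and looked up during joint disambiguation.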

Joint Mapping: Prob. Factor Graph
Collective Learning with Probabilistic Factor Graphs [Kulkarni et al.: KDD’09]:
- model P[m|e] by similarity and P[e1|e2] by coherence
- consider likelihood of P[e1 … ek | m1 … mk]
- factorize by all m-e pairs and e1-e2 pairs
- use MCMC, hill-climbing, LP etc. for solution
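To make the joint objective concrete, here is a brute-force sketch that scores every combination of candidates by summing mention-entity similarities and pairwise entity coherence. It only illustrates the factorization into m-e and e1-e2 terms; it is not the inference procedure of the cited paper, the scores are not calibrated probabilities, and exhaustive search is feasible only for tiny inputs. All names and numbers are invented.

```python
from itertools import combinations, product


def joint_map(candidates, sim, coh):
    """Return the assignment maximizing sum of sim(m,e) plus pairwise coherence."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(sim[(m, e)] for m, e in zip(mentions, assignment))
        score += sum(coh.get(frozenset(pair), 0.0)
                     for pair in combinations(assignment, 2))
        if score > best_score:
            best, best_score = dict(zip(mentions, assignment)), score
    return best, best_score


# Toy inputs, invented for illustration only.
CANDIDATES = {
    "Page": ["Jimmy_Page", "Larry_Page"],
    "Kashmir": ["Kashmir_(song)", "Kashmir_(region)"],
}
SIM = {
    ("Page", "Jimmy_Page"): 0.4, ("Page", "Larry_Page"): 0.6,
    ("Kashmir", "Kashmir_(song)"): 0.3, ("Kashmir", "Kashmir_(region)"): 0.7,
}
COH = {frozenset({"Jimmy_Page", "Kashmir_(song)"}): 0.9}

best_assignment, best_value = joint_map(CANDIDATES, SIM, COH)
print(best_assignment, best_value)
```

Taken in isolation, each mention would pick its locally best candidate (Larry_Page, Kashmir_(region)); the coherence term flips both decisions, which is why joint inference matters.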

Joint Mapping: Dense Subgraph [J. Hoffart et al.: EMNLP’11]
- Compute dense subgraph such that each m is connected to exactly one e (or at most one e)
- NP-hard, so approximation algorithms are used
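One possible greedy approximation, sketched under assumptions: repeatedly drop the candidate entity with the smallest weighted degree, as long as its mention keeps at least one candidate. This is a simplified illustration in the spirit of the dense-subgraph idea, not a faithful reimplementation of the cited algorithm; the function name, edge weights, and the assumption that coherence edges connect two distinct entities are all mine.

```python
def greedy_dense_subgraph(candidates, me_weight, ee_weight):
    """Prune low-degree entities until every mention has exactly one candidate."""
    alive = {m: set(es) for m, es in candidates.items()}

    def weighted_degree(entity):
        # Mention-entity edges that are still present.
        deg = sum(w for (m, e), w in me_weight.items()
                  if e == entity and e in alive[m])
        # Entity-entity edges whose other endpoint is still present.
        for pair, w in ee_weight.items():
            if entity in pair:
                (other,) = pair - {entity}
                if any(other in es for es in alive.values()):
                    deg += w
        return deg

    while True:
        removable = {e for es in alive.values() if len(es) > 1 for e in es}
        if not removable:
            break
        victim = min(sorted(removable), key=weighted_degree)
        for es in alive.values():
            if len(es) > 1:
                es.discard(victim)
    return {m: next(iter(es)) for m, es in alive.items()}


# Toy graph, invented for illustration only.
CANDS = {
    "Page": ["Jimmy_Page", "Larry_Page"],
    "Kashmir": ["Kashmir_(song)", "Kashmir_(region)"],
}
ME = {
    ("Page", "Jimmy_Page"): 0.4, ("Page", "Larry_Page"): 0.6,
    ("Kashmir", "Kashmir_(song)"): 0.3, ("Kashmir", "Kashmir_(region)"): 0.7,
}
EE = {frozenset({"Jimmy_Page", "Kashmir_(song)"}): 0.9}

mapping = greedy_dense_subgraph(CANDS, ME, EE)
print(mapping)
```

On this toy graph the greedy pruning removes Larry_Page first (degree 0.6), then Kashmir_(region) (degree 0.7), leaving the coherent pair.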

Random Walks Algorithm
- for each mention run random walks with restart (like personalized PageRank with jumps to start mention(s))
- rank candidate entities by stationary visiting probability
- very efficient, decent accuracy
- can be improved by judicious selection of mention order
[Guo & Barbosa: CIKM 2014]
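A small power-iteration sketch of a random walk with restart from a single start mention. The graph, the restart probability of 0.15, and the fixed iteration count are assumptions chosen for illustration; the cited work may build the graph and choose parameters differently.

```python
def random_walk_with_restart(graph, start, restart=0.15, iterations=100):
    """Approximate stationary visiting probabilities by power iteration.

    graph maps each node to a dict {neighbor: edge weight}; every neighbor
    must itself be a key of graph.
    """
    prob = {n: 1.0 if n == start else 0.0 for n in graph}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in graph}
        for node, p in prob.items():
            out = graph[node]
            total = sum(out.values())
            if total == 0:
                # Dangling node: send its walking mass back to the start.
                nxt[start] += (1 - restart) * p
                continue
            for neighbor, w in out.items():
                nxt[neighbor] += (1 - restart) * p * w / total
        # Restart jump; total probability mass stays at 1.
        nxt[start] += restart
        prob = nxt
    return prob


# Toy graph, invented for illustration only.
GRAPH = {
    "m:Kashmir": {"Kashmir_(song)": 1.0, "Kashmir_(region)": 1.0},
    "Kashmir_(song)": {"Led_Zeppelin": 1.0, "m:Kashmir": 1.0},
    "Kashmir_(region)": {"m:Kashmir": 1.0},
    "Led_Zeppelin": {"Kashmir_(song)": 1.0},
}
probs = random_walk_with_restart(GRAPH, "m:Kashmir")
ranked = sorted(["Kashmir_(song)", "Kashmir_(region)"], key=probs.get, reverse=True)
print(ranked)
```

In this toy graph the song ranks first only because walk mass returns to it via Led_Zeppelin; a realistic graph would also contain the other mentions and their candidates.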

Coherence-aware Feature Engineering [Cucerzan: EMNLP 2007; Milne/Witten: CIKM 2008, Art.Int. 2013]
- Avoid explicit coherence computation by turning other mentions' candidate entities into features
- sim(m,e) uses these features in context(m)
- special case: consider only unambiguous mentions or high-confidence entities (in proximity of m)

TagMe: NED with Light-Weight Coherence [P. Ferragina et al.: CIKM’10, WWW’13]
- Reduce combinatorial complexity by using avg. coherence of other mentions' candidate entities
- For score(m,e): average coherence(e_i, e) · popularity(e_i | m_j) over all e_i ∈ cand(m_j), then sum up over all m_j ≠ m („voting“):
$$\mathrm{score}(m,e) = \sum_{m_j \neq m} \; \operatorname*{avg}_{e_i \in \mathrm{cand}(m_j)} \mathrm{coherence}(e_i, e) \cdot \mathrm{popularity}(e_i \mid m_j)$$
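The voting scheme on this slide translates almost directly into code. The sketch below assumes the reading "average over the candidates of each other mention, then sum over the other mentions"; the function name, the coherence lookup, and all numbers are invented for illustration.

```python
def voting_score(mention, entity, candidates, popularity, coherence):
    """Sum, over all other mentions, of their candidates' average vote for entity."""
    score = 0.0
    for other, cands in candidates.items():
        if other == mention or not cands:
            continue
        vote = sum(coherence(e_i, entity) * popularity[(other, e_i)]
                   for e_i in cands) / len(cands)
        score += vote
    return score


# Toy inputs, invented for illustration only.
CAND = {
    "Page": ["Jimmy_Page", "Larry_Page"],
    "Kashmir": ["Kashmir_(song)", "Kashmir_(region)"],
}
POP = {("Page", "Jimmy_Page"): 0.4, ("Page", "Larry_Page"): 0.6}
COH_TABLE = {
    frozenset({"Jimmy_Page", "Kashmir_(song)"}): 0.9,
    frozenset({"Larry_Page", "Kashmir_(region)"}): 0.1,
}


def coh(a, b):
    return COH_TABLE.get(frozenset((a, b)), 0.0)


song = voting_score("Kashmir", "Kashmir_(song)", CAND, POP, coh)
region = voting_score("Kashmir", "Kashmir_(region)", CAND, POP, coh)
print(song, region)
```

Each score needs only one pass over the other mentions' candidates, so no combination of assignments is ever enumerated.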

Long-Tail and Emerging Entities
Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song.
Candidate entities:
- last.fm /Nick_Cave/Weeping_Song
- wikipedia.org /Weeping_(song)
- wikipedia.org/ Nick_Cave
- last.fm /Nick_Cave/O_Children
- last.fm /Nick_Cave/Hallelujah
- wikipedia /Hallelujah_(L_Cohen)
- wikipedia /Hallelujah_Chorus
- wikipedia /Children_(2011 film)
- wikipedia.org/ Good_Luck_Cave

Keyphrases, Not Links
Long-tail entities not in Wikipedia: persons, songs, products

KORE: Keyphrase Overlap RElatedness [J. Hoffart et al.: CIKM’12]
Intuition: Related entities have highly overlapping keyphrase sets.
Keyphrase examples from the figure: Knebworth Festival, Led Zeppelin, song, Rock guitarist, …, Physical Graffiti, …
Pro:
- Independent of links
- Good quality
Con:
- Computationally intensive due to partial overlap
- Can be addressed using locality sensitive hashing
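A toy illustration of relatedness from partially overlapping keyphrase sets. This is not the published KORE formula: it uses unweighted token-level Jaccard between phrases and a symmetric best-match average, purely to show why partial overlap makes the computation quadratic in the number of keyphrases. The keyphrase sets are invented.

```python
def phrase_overlap(p, q):
    """Token-level Jaccard between two keyphrases (allows partial matches)."""
    a, b = set(p.lower().split()), set(q.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def keyphrase_relatedness(phrases_e, phrases_f):
    """Symmetric average of each phrase's best partial match in the other set."""
    if not phrases_e or not phrases_f:
        return 0.0
    best_e = sum(max(phrase_overlap(p, q) for q in phrases_f)
                 for p in phrases_e) / len(phrases_e)
    best_f = sum(max(phrase_overlap(q, p) for p in phrases_e)
                 for q in phrases_f) / len(phrases_f)
    return (best_e + best_f) / 2


# Toy keyphrase sets, invented for illustration only.
jimmy_page = ["Led Zeppelin", "rock guitarist", "Knebworth Festival"]
kashmir_song = ["Led Zeppelin song", "Physical Graffiti", "Knebworth Festival"]
kashmir_region = ["India", "Pakistan", "Pashmina"]

rel_song = keyphrase_relatedness(jimmy_page, kashmir_song)
rel_region = keyphrase_relatedness(jimmy_page, kashmir_region)
print(rel_song, rel_region)
```

Every phrase of one entity is compared with every phrase of the other, which is the cost that hashing-based approximations aim to avoid.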

Long-Tail and Emerging Entities
Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song.
Candidate entities:
- last.fm /Nick_Cave/Weeping_Song
- wikipedia.org /Weeping_(song)
- wikipedia.org/ Nick_Cave
- last.fm /Nick_Cave/O_Children
- last.fm /Nick_Cave/Hallelujah
- wikipedia /Hallelujah_(L_Cohen)
- wikipedia /Hallelujah_Chorus
- wikipedia /Children_(2011 film)
- wikipedia.org/ Good_Luck_Cave
Keyphrases of the candidates (from the figure): Gunung Mulu National Park Sarawak Chamber largest underground chamber eerie violin Bad Seeds No More Shall We Part Bad Seeds No More Shall We Part Murder Songs Leonard Cohen Rufus Wainwright Shrek and Fiona Nick Cave & Bad Seeds Harry Potter 7 movie haunting choir Nick Cave Murder Songs P.J. Harvey Nick and Blixa duet Messiah oratorio George Frideric Handel Dan Heymann apartheid system South Korean film

“Washington’s Prism program was revealed by the whistleblower Snowden.”

Harvesting Emerging Entity Keyphrases [J. Hoffart et al.: WWW 2014]
- Entity keyphrases: extracted from annotations of entities referred to by the name “Prism”, i.e. PRISM (TV network), PRISM (website), Prism (album)
- Name keyphrases: extracted from any document mentioning “PRISM”

Harvesting Emerging Entity Keyphrases [J. Hoffart et al.: WWW 2014]
- Entity keyphrases: extracted from annotations of entities referred to by the name “Prism”, i.e. PRISM (TV network), PRISM (website), Prism (album)
- Name keyphrases: extracted from any document mentioning “PRISM”
- Emerging entity keyphrases

Extracting Keyphrases from Text
The PRISM program collects a wide range of data from a number of companies, e.g. Google and Facebook. The leaked National Security Agency (NSA) documents were obtained by the Guardian. ...
Keyphrases are defined by POS pattern filters for named entities and technical terms.
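A minimal sketch of a POS pattern filter, assuming the input is already tagged with Penn Treebank style tags. The pattern used here (maximal runs of adjectives and nouns that end in a noun) is an assumption for illustration; the slides do not spell out the actual patterns. The tags in the example were assigned by hand.

```python
def extract_keyphrases(tagged_tokens):
    """Collect maximal adjective/noun runs that end in a noun."""
    phrases, run = [], []
    # A sentinel with a non-matching tag flushes the final run.
    for token, tag in list(tagged_tokens) + [("", "END")]:
        if tag.startswith("NN") or tag == "JJ":
            run.append((token, tag))
            continue
        # Trim trailing adjectives so that the phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append(" ".join(t for t, _ in run))
        run = []
    return phrases


# Hand-tagged example sentence (tags are an assumption, not tagger output).
tagged = [
    ("The", "DT"), ("leaked", "VBN"), ("National", "NNP"), ("Security", "NNP"),
    ("Agency", "NNP"), ("documents", "NNS"), ("were", "VBD"),
    ("obtained", "VBN"), ("by", "IN"), ("the", "DT"), ("Guardian", "NNP"),
]
keyphrases = extract_keyphrases(tagged)
print(keyphrases)
```

A real pipeline would obtain the tags from a POS tagger and would likely use separate patterns for named entities and technical terms.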

Enriching Existing Entities
- Existing entity keyphrases harvested from Wikipedia
- Enrich by context of high-confidence disambiguations in input texts
Example (Knowledge Base entity US Government): “White House”, 0.4; “Obama”, 0.4; “US President”, 0.3; “PRISM”, 0.3

Discovering Emerging Entities
News article clusters over time (June 6, June 7, June 8).
Iterate over slices:
- Harvest Entity Keyphrases
- Identify Emerging Entities
- Add new Entities to the Knowledge Base

NERD Online Tools
- J. Hoffart et al.: EMNLP 2011, VLDB
- P. Ferragina, U. Scaiella: CIKM
- R. Isele, C. Bizer: VLDB
- Reuters Open Calais
- Alchemy API
- S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD
- D. Milne, I. Witten: CIKM
- L. Ratinov, D. Roth, D. Downey, M. Anderson: ACL
- D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, S. Trani: CIKM
- A. Moro, A. Raganato, R. Navigli: TACL
Some use the Stanford NER tagger for detecting mentions.

Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
6. Web Content Analytics
7. Wrap-Up
Entity Discovery in Web Contents √
Entity-centric Search & Analytics
KB-enhanced Sentiment Analysis

[H. Bast et al.: SIGIR 2014]


Use Case: Semantic Search over News
stics.mpi-inf.mpg.de

Use Case: Analytics over News
stics.mpi-inf.mpg.de/stats

Use Case: Semantic Culturomics [Huet et al.: AKBC’13]
Based on entity recognition & semantic classes of KB over archive of Le Monde, Age

Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
6. Web Content Analytics
7. Wrap-Up
Entity Discovery in Web Contents √
Entity-centric Search & Analytics √
KB-enhanced Sentiment Analysis

Knowledge-enhanced Sentiment Analysis
“The bar at the Hilton was hot but the beer was cold.”
Goal:
1. (Recognize and disambiguate entities)
2. Identify overall sentiment
3. Understand which entities and aspects contribute to the sentiment, and how they contribute
(Color legend: objective, positive, negative)

Knowledge-enhanced Sentiment Analysis
“The bar at the Hilton was hot but the beer was cold.”
Problem: No single, unambiguous sentiment term like good/nice/bad/horrible
Solution: Common-sense and factual knowledge
(Color legend: objective, positive, negative)

Polarity of Synsets
SentiWordNet provides objectivity and polarity scores for WordNet synsets [Baccianella et al., LREC 2010]
- cold#1: having a low or inadequate temperature or feeling a sensation of coldness […] "a cold climate"; "a cold room"; "dinner has gotten cold"; "a cold beer" (P: 0 O: 0.25 N: 0.75)
- hot#1: used of physical heat; having a high or higher than desirable temperature; "hot stove"; "hot water" (P: 0 O: 1 N: 0)
“The bar at the Hilton was hot but the beer was cold.”
(Color legend: objective, positive, negative)
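To see why word-level polarity falls short for the example sentence, the sketch below hard-codes the two synset scores quoted on the slide and sums positive minus negative scores per token. The lexicon dictionary and the scoring rule are simplifying assumptions; this does not call any SentiWordNet API and skips word-sense selection entirely.

```python
# Scores copied from the slide for the first senses of "cold" and "hot".
LEXICON = {
    "cold": {"pos": 0.0, "obj": 0.25, "neg": 0.75},
    "hot": {"pos": 0.0, "obj": 1.0, "neg": 0.0},
}


def sentence_polarity(tokens, lexicon):
    """Sum of (positive - negative) scores over all tokens found in the lexicon."""
    return sum(lexicon[t]["pos"] - lexicon[t]["neg"] for t in tokens if t in lexicon)


tokens = "the bar at the hilton was hot but the beer was cold".split()
polarity = sentence_polarity(tokens, LEXICON)
print(polarity)
```

The sentence comes out negative solely because of "cold", even though a cold beer is desirable and a hot bar is not, so single-word polarity gets both aspects wrong here.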

Understanding Entities and Aspects
“The bar at the Hilton was hot but the beer was cold.”
Goal:
1. (Recognize and disambiguate entities)
2. Identify overall sentiment
3. Understand which entities and aspects contribute to the sentiment, and how they contribute
(Color legend: objective, positive, negative)


Polarity of Multi-Word Sentiment Terms
SenticNet provides polarity for multi-word sentiment terms [Cambria et al., AAAI 2014]
- Cold beer: P: 1.0 O: 0.0 N: 0
- Hot bar: P: 0.0 O: 0.2 N: 0.8
“The bar at the Hilton was hot but the beer was cold.”
(Color legend: objective, positive, negative)

Understanding Entities and Aspects
Hot bar?
- hot#1: used of physical heat; having a high or higher than desirable temperature; "hot stove"; "hot water" (P: 0 O: 1 N: 0)
- hot#11: very popular or successful; "one of the hot young talents"; "cabbage patch dolls were hot last season" (P: O: N: 0)
“The bar at the Hilton was hot but the beer was cold.”
(Color legend: objective, positive, negative)

Disambiguation of Sentiment Terms [Weichselbraun et al., Knowledge-Based Systems 2014]
- Address ambiguity of sentiment terms, e.g. hot
  - He is a hot young talent.
  - The bar is hot and stuffy.
- Link ambiguous sentiment terms to
  - ConceptNet (vector space term similarity)
  - WordNet (graph similarity)
- Context terms are assigned a probability for creating a positive or negative sentiment for the ambiguous term, e.g. hot:
  - talent (P: 0.9)
  - stuffy (N: 0.8)
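The context-term idea can be sketched as a table lookup: each context term carries a probability of pushing the ambiguous term toward a positive or negative reading. The two table entries are the ones given on the slide; the decision rule (compare summed positive and negative evidence) and the function name are assumptions for illustration.

```python
# Context-term probabilities for "hot", as given on the slide.
CONTEXT_POLARITY = {
    "hot": {"talent": ("pos", 0.9), "stuffy": ("neg", 0.8)},
}


def disambiguate_sentiment(term, tokens, table):
    """Pick the polarity of an ambiguous term from its context terms."""
    pos = neg = 0.0
    for token in tokens:
        label, p = table.get(term, {}).get(token, (None, 0.0))
        if label == "pos":
            pos += p
        elif label == "neg":
            neg += p
    if pos == neg:
        return "unknown"
    return "positive" if pos > neg else "negative"


first = disambiguate_sentiment("hot", "he is a hot young talent".split(), CONTEXT_POLARITY)
second = disambiguate_sentiment("hot", "the bar is hot and stuffy".split(), CONTEXT_POLARITY)
print(first, second)
```

In the cited approach the table would be learned and extended through ConceptNet and WordNet similarity rather than written by hand.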

Take-Home Lessons
- NERD is key for contextual knowledge
- High-quality NERD uses joint inference over various features: popularity + similarity + coherence
- State-of-the-art tools available & beneficial
  - Maturing now, but still room for improvement, especially on efficiency, scalability & robustness
- Connecting unstructured texts to knowledge bases opens up new possibilities
  - Semantic Search & Analytics already benefits
- Sentiment analysis needs KBs and disambiguation
  - To identify companies and products as well as their aspects
  - To understand opinion-bearing words

Open Problems and Grand Challenges
- Robust disambiguation of entities, relations and classes
  - Relevant for question answering & question-to-query translation
  - Key building block for KB building and maintenance
- Entity name disambiguation in difficult situations
  - Short and noisy texts about long-tail entities in social media
- Efficient interactive & high-throughput batch NERD
  - a day's news, a month's publications, a decade's archive
- Effective entity-centric document retrieval and exploration
  - Understand impact of KB on ranking and exploring documents and knowledge
- Fully automatic linking of Web and news texts to continuously updated KBs with high accuracy