Preeti Bhargava, Nemanja Spasojevic, Guoning Hu

Presentation transcript:

High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data Preeti Bhargava, Nemanja Spasojevic, Guoning Hu Applied Data Science, Lithium Technologies Email: team-relevance@klout.com

Problem There is a lot of unstructured text in social networks; only once you annotate it does the text become useful for various IR applications. The goal is to annotate an example sentence with the correct disambiguation of its ambiguous terms.

Applications Tweets & other user generated text User profile (interests & expertise) URL recommendations Content personalization Process social media texts

Challenges Ambiguity Multi-lingual content High throughput and lightweight approach 0.5B documents daily (~1-2ms per tweet) commodity hardware (REST API, MR) Shallow NLP approach (no POS) Dense annotations (efficient information retrieval) The biggest challenges when running an entity linking system in production come from ambiguity, the requirement to support multilingual content, and constrained resources given the volume of data processed. Some constraints impose a simplistic approach, such as not using part-of-speech tagging. Finally, we want to annotate/link as densely as possible, since IR tasks are more efficient and accurate when the underlying data is rich and dense.

Knowledge Base Freebase entities (top 1 million by importance)* Balance coverage and relevance with respect to common social media text 2 special entities: NIL (‘the’ -> NIL) MISC (‘1979 USA Basketball Team’ -> MISC) In this study we use Freebase as our knowledge base of choice. We consider the 1 million most important entities, where importance balances coverage and relevance, and in addition we have 2 special entities, NIL and MISC. * Prantik Bhattacharyya and Nemanja Spasojevic. Global entity ranking across multiple languages. Poster WWW 2017 Companion

Data Set Internally Developed Open Data Set Densely Annotated Wikipedia Text (DAWT)1,2: high precision and dense link coverage on average 4.8 times more links than original Wiki articles 6 languages The data set we used for training and derivation of the data resources was DAWT (Densely Annotated Wikipedia Text): essentially Wikipedia texts with denser links, covering 6 languages. The data set is open; if you are interested in finding out more, drop by the Wiki Workshop poster session. https://github.com/klout/opendata/tree/master/wiki_annotation Nemanja Spasojevic, Preeti Bhargava, and Guoning Hu. 2017. DAWT: Densely Annotated Wikipedia Texts across multiple languages. WWW 2017 Wiki workshop (Wiki’17)

Text Processing pipeline Here is an illustration of what each stage looks like.

Text Processing pipeline For this research, the two most important stages are entity extraction and entity disambiguation & linking.

Entity Extraction Entity Extraction – candidate mention dictionary consider n-grams (n ∈ [1,6]) phrases choose longest phrase within candidate dictionary Note that we do not use POS tagging for entity extraction but rely on a pre-calculated entity dictionary. We use a greedy algorithm to extract the longest n-grams in the text that map to the candidate mention dictionary.

Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Greedy strategy for entity extraction (longest): here is an example where we pick the longest valid entity. Candidates:
Google -> {045c7b}
CEO -> {0dq_5}
Eric -> {03f078w, 0q9nx}
Eric Schmidt -> {03f078w}
Google CEO -> {}
CEO Eric -> {}
Google CEO Eric -> {}
CEO Eric Schmidt -> {}
Eric Schmidt said -> {}
and so on …
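The greedy longest-match extraction above can be sketched as follows; this is a minimal illustration, assuming text is already tokenized, and the candidate dictionary with its Freebase machine IDs is made up for the example rather than the production dictionary.

```python
# Sketch of greedy longest-match entity extraction over a candidate
# mention dictionary (illustrative entries with Freebase-style ids).
CANDIDATES = {
    "google": {"045c7b"},
    "ceo": {"0dq_5"},
    "eric": {"03f078w", "0q9nx"},
    "eric schmidt": {"03f078w"},
    "apple": {"0k8z"},  # illustrative id
}

def extract_mentions(tokens, max_n=6):
    """Greedily match the longest n-gram (n in [1, max_n]) in the dictionary."""
    mentions = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in CANDIDATES:
                mentions.append((phrase, CANDIDATES[phrase]))
                i += n  # consume the matched span
                break
        else:
            i += 1  # no candidate starts here; move on
    return mentions

text = "Google CEO Eric Schmidt said that competition between Apple and Google"
# picks 'google', 'ceo', 'eric schmidt', 'apple', 'google'
print([m for m, _ in extract_mentions(text.split())])
```

Note how "Eric Schmidt" beats the single-token "Eric", matching the slide's walkthrough.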

Entity Disambiguation Two-pass algorithm: disambiguates and links a set of easy mentions leverages these easy entities and several features to disambiguate and link the remaining hard mentions For entity disambiguation we use a two-pass algorithm: in the first pass we disambiguate all easy mentions, and in the second pass the easy entities and several other features help disambiguate the hard mentions.

First PASS Use Mention-Entity Co-occurrence prior probability: Only one candidate entity High prior probability given mention (> 0.9) Two candidate entities, one being NIL/MISC, with high prior probability given mention (> 0.75) Example of Mention-Entity Co-occurrence prior probability: Dielectrics 0b7kg:0.4863,_nil_:0.3836,_misc_:0.1301 lost village _nil_:0.7826,05gxzw:0.2029,_misc_:0.0145 Tesla _nil_:0.3621,05d1y:0.327,0dr90d:0.1601,036wfx:0.0805,03rhvb:0.0303 tesla _nil_:0.5345,03rhvb:0.4655 In the first pass we use the mention-entity co-occurrence prior probability.
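The first-pass "easy mention" rules can be sketched as a small function over the prior table; the table shape and the function name are assumptions for illustration, not the system's actual API.

```python
# Sketch of the first-pass "easy mention" rules. `priors` maps each
# candidate entity id (including the special _nil_/_misc_ entities)
# to P(entity | mention).
NIL, MISC = "_nil_", "_misc_"

def resolve_easy(priors):
    """Return the linked entity for an easy mention, or None if it is hard."""
    if len(priors) == 1:  # only one candidate entity
        return next(iter(priors))
    top_entity, top_p = max(priors.items(), key=lambda kv: kv[1])
    if top_p > 0.9:  # high prior probability given the mention
        return top_entity
    non_special = [e for e in priors if e not in (NIL, MISC)]
    # two candidates, one being NIL/MISC, with a fairly high prior
    if len(priors) == 2 and len(non_special) == 1 and top_p > 0.75:
        return top_entity
    return None  # hard mention, deferred to the second pass

# 'tesla' from the table above stays ambiguous (0.5345 vs 0.4655)
print(resolve_easy({NIL: 0.5345, "03rhvb": 0.4655}))  # None
```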

Second PASS Build context Document – easy entities Entity – position, easy entities within window Build feature set: Context independent Mention-Entity-Co-occurrence Mention-Entity-Jaccard Entity-Importance Context dependent Entity-Entity-Co-occurrence Entity-Entity-Topic-Similarity In the second pass we build a context from the easy entities and compute context-independent and context-dependent features for each remaining candidate.

Mention Entity Cooccurrence Example of Mention-Entity Co-occurrence prior probability: dielectrics 0b7kg:0.4863,_nil_:0.3836,_misc_:0.1301 lost village _nil_:0.7826,05gxzw:0.2029,_misc_:0.0145 Tesla _nil_:0.3621,05d1y:0.327,0dr90d:0.1601,036wfx:0.0805,03rhvb:0.0303 tesla _nil_:0.5345,03rhvb:0.4655 Example: P(05d1y|’Tesla’) = 0.327

Mention Entity JACCARD Captures alignment of the representative entity mention with the observed mention. Mention-entity Jaccard similarity captures the token-level similarity of the entity's representative surface form compared to the mentioned form. Example: ‘Tesla’ vs ‘Tesla Motors’ => 0.5
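The token-level Jaccard feature can be sketched in a few lines; whitespace tokenization and lowercasing are assumptions for the example.

```python
def mention_entity_jaccard(entity_form: str, mention: str) -> float:
    """Token-level Jaccard similarity between an entity's representative
    surface form and the observed mention."""
    a = set(entity_form.lower().split())
    b = set(mention.lower().split())
    return len(a & b) / len(a | b)

print(mention_entity_jaccard("Tesla Motors", "Tesla"))  # 0.5, as on the slide
```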

Entity Importance Captures the global importance of an entity as perceived by casual observers. Another signal is importance, where we used the same importance as mentioned earlier in the talk. The rank is scaled so that it maps to a percentile in the 0 to 1 range, where 1.0 is most important.

Entity Entity Cooccurrence Average co-occurrence of a candidate entity with the disambiguated easy entities in the context window.
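This feature can be sketched as an average over pairwise co-occurrence scores; the co-occurrence table and entity ids below are made up for illustration.

```python
# Illustrative symmetric co-occurrence scores between entity ids.
COOCCUR = {
    frozenset({"05d1y", "045c7b"}): 0.6,
    frozenset({"03rhvb", "045c7b"}): 0.1,
}

def entity_entity_cooccurrence(candidate, easy_entities):
    """Average co-occurrence of `candidate` with the easy entities
    already disambiguated in its context window."""
    if not easy_entities:
        return 0.0
    scores = [COOCCUR.get(frozenset({candidate, e}), 0.0) for e in easy_entities]
    return sum(scores) / len(scores)

print(entity_entity_cooccurrence("05d1y", ["045c7b", "03rhvb"]))
```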

Entity-Entity Topic Semantic Similarity Inverse of the minimum semantic distance between the candidate entity’s topics and the entities from the easy entity window. In this case we use an in-house topical ontology that captures a hierarchy of topics. Each entity has a mapping to topics, based on which we calculate semantic similarity as the inverse of the distance between the topics of the given entities. Example: sim(‘Apple’, ‘Google’) = 1 / 4 = 0.25 sim(‘Apple’, ‘Food’) = 1 / 5 = 0.2
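The inverse-distance idea can be sketched over a toy topic hierarchy; the ontology here is invented for illustration (the real in-house ontology is not public), but it reproduces the slide's "distance 4 gives similarity 0.25" pattern.

```python
# Toy topic hierarchy: child -> parent (illustrative, not the real ontology).
PARENT = {
    "Smartphones": "Consumer Electronics",
    "Consumer Electronics": "Technology",
    "Search Engines": "Internet",
    "Internet": "Technology",
    "Fruit": "Food",
}

def path_to_root(topic):
    path = [topic]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def topic_distance(a, b):
    """Length of the shortest path between two topics via their ancestors."""
    pa, pb = path_to_root(a), path_to_root(b)
    for i, t in enumerate(pa):
        if t in pb:
            return i + pb.index(t)
    return float("inf")  # no common ancestor

def topic_similarity(a, b):
    d = topic_distance(a, b)
    return 1.0 if d == 0 else 1.0 / d

# Smartphones and Search Engines meet at Technology: distance 2 + 2 = 4
print(topic_similarity("Smartphones", "Search Engines"))  # 0.25
```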

Disambiguation Use an ensemble of two classifiers: a Decision Tree classifier labels each feature vector as ‘True’ or ‘False’; final scores are generated using weights from a Logistic Regression classifier. Final Disambiguation: If only one candidate entity is labeled ‘True’, pick it. If multiple candidate entities are labeled ‘True’, the highest-scoring one wins. If all candidate entities are labeled ‘False’, use the highest-scoring one only if it has a large score margin over the next one.
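The final decision rule can be sketched as follows, taking the ensemble's outputs as given; the margin threshold value is an assumption for illustration, not a number from the paper.

```python
MARGIN = 0.3  # illustrative score-margin threshold, not from the paper

def pick_entity(candidates):
    """Final disambiguation over (entity_id, labeled_true, score) triples
    produced by the decision tree + logistic regression ensemble."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    positives = [c for c in ranked if c[1]]
    if positives:
        return positives[0][0]  # one or more 'True': highest-scoring wins
    # all labeled 'False': accept the top candidate only with a large margin
    if len(ranked) == 1 or ranked[0][2] - ranked[1][2] > MARGIN:
        return ranked[0][0]
    return None

print(pick_entity([("05d1y", True, 0.8), ("03rhvb", False, 0.9)]))  # 05d1y
```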

Disambiguation Example Let’s look at a real example of the output of the algorithm, step by step.


Disambiguation Example First pass: use the Mention-Entity Co-occurrence prior probability: Only one candidate entity High prior probability given mention (> 0.9) Two candidate entities, one being NIL/MISC, with high prior probability given mention (> 0.75)

Disambiguation Example Final Disambiguation: If only one candidate entity is labeled ‘True’, pick it. If multiple candidate entities are labeled ‘True’, the highest-scoring one wins. If all candidate entities are labeled ‘False’, use the highest-scoring one only if it has a large score margin over the next one.


Evaluation Ground truth test set: 20 English Wikipedia articles (18,773 mentions) Measured Precision, Recall, F-score, Accuracy

Evaluation Mention-Entity Co-occurrence based features have the biggest impact. Context helps (especially for longer texts).

Evaluation – Per Language

Language Coverage Comparisons Systems compared: Lithium EDL, Google Cloud NL API, Open Calais, AIDA. Languages compared: English, Arabic, Spanish, French, German, Japanese. (The per-system language support matrix was shown as a table on the slide.)

Coverage Comparisons Lithium EDL linked 75% more entities than Google NL (precision adjusted lower bound) Lithium EDL linked 104% more entities than Open Calais (precision adjusted lower bound) Finally, one of the objectives for our system was large coverage. For the languages available, we detect at least 75% more entities than Google Natural Language.

Example Comparisons

Runtime Comparisons Text preprocessing stage of the Lithium pipeline is about 30,000-50,000 times faster than AIDA Disambiguation runtime per unique entity extracted of the Lithium pipeline is about 3.5 times faster than AIDA AIDA extracts 2.8 times fewer entities per 50kb of text

Conclusion Presented an EDL algorithm that uses several context-dependent and context-independent features. The Lithium EDL system recognizes several types of entities (professional titles, sports, activities, etc.) in addition to named entities (people, places, organizations, etc.), linking 75% more entities than state-of-the-art systems. The EDL algorithm is language-agnostic and currently supports 6 different languages, making it applicable to real-world data. High throughput and lightweight: 3.5 times faster than state-of-the-art systems such as AIDA.

Questions? E-mail: team-relevance@klout.com