To Link or Not to Link? A Study on End-to-End Tweet Entity Linking
Stephen Guo, Ming-Wei Chang, Emre Kıcıman
Motivation
  Microblogs are data gold mines!
    Twitter reports that it alone captures over 340M short messages per day
  Many applications build on information extraction from tweets
    Election results (Tumasjan et al., 2010)
    Disease spreading (Paul and Dredze, 2011)
    Tracking product feedback and sentiment (Asur and Huberman, 2010)
    ...
  Existing tools (for example, NER) are often too limited
    Stanford NER achieves 44% F1 on a tweet data set [Ritter et al., 2011]
Entity Linking (Wikifier) in Tweets
  Oh Yes!! giants vs packers game now!! Touchdown!!
  Q1: Which phrases should be linked? (mention detection)
  Q2: Which Wikipedia page should each selected phrase link to? (disambiguation)
Contributions
  Proposed a new evaluation scheme for entity linking
    A natural evaluation scheme for microblogs
  A system that performs significantly better on tweets than other systems
    Learns to detect mentions and perform linking jointly
    Outperforms TagMe [Ferragina & Scaiella 2010] and [Cucerzan 07] by 15% F1
  What we have learned
    Mention detection is a difficult problem
    Entity information can help mention detection
Outline
  Task Definition (again!)
  Two-stage versus Joint
  Model + Features
  Results + Analysis
What should be linked?
  Oh Yes!! giants vs packers game now!! Touchdown!!
  Comparing different Wikifiers is a tough problem [Cornolti, WWW 2013]
  Really, there is no good definition of what should be linked
Our Scenario
  What are people saying about the movie “The Town” on Twitter?
  Assume our customers are only interested in entities of certain types
    Movies; Video Games; Sports Teams; ...
    Type information can be inferred directly from the corresponding Wikipedia page
    Now it is fair to compare different systems
  We assume the types PER, LOC, ORG, BOOK, TVSHOW, MOVIE
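One way to read this type-restricted evaluation is sketched below: links whose Wikipedia page is not of a target type are dropped from both the gold and the predicted sets before precision, recall, and F1 are computed. This is only an illustration under assumptions; the `(tweet_id, page)` link representation, the function names, the page titles, and the `page_type` lookup are all hypothetical and not taken from the talk.

```python
from typing import Dict, Set, Tuple

# Type inventory listed on the slide.
TARGET_TYPES = {"PER", "LOC", "ORG", "BOOK", "TVSHOW", "MOVIE"}

# Hypothetical representation: a link is (tweet id, Wikipedia page title).
Link = Tuple[str, str]


def restrict(links: Set[Link], page_type: Dict[str, str]) -> Set[Link]:
    """Keep only links whose Wikipedia page has one of the target types."""
    return {(t, p) for (t, p) in links if page_type.get(p) in TARGET_TYPES}


def typed_f1(gold: Set[Link], pred: Set[Link], page_type: Dict[str, str]) -> float:
    """F1 over links, after discarding pages of uninteresting types."""
    gold = restrict(gold, page_type)
    pred = restrict(pred, page_type)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) page types and links.
page_type = {
    "The Town (2010 film)": "MOVIE",
    "Ben Affleck": "PER",
    "Touchdown": "OTHER",
}
gold = {("t1", "The Town (2010 film)"), ("t1", "Ben Affleck")}
pred = {("t1", "The Town (2010 film)"), ("t1", "Touchdown")}
score = typed_f1(gold, pred, page_type)
```

In this toy case the predicted link to "Touchdown" is ignored rather than penalized, because its page type is outside the target set; a system is only judged on the entity types the customer cares about.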
The Desired Results
  Oh Yes!! giants vs packers game now!! Touchdown!!
Terminology
  Oh Yes!! giants vs packers game now!! Touchdown!!
  Mention Candidates Entity Mentions Assignment
Related Work
  Wikifier [Cucerzan, 2007; Milne and Witten, 2008, ...]
    Given a document, create Wikipedia-like links
    Very difficult to evaluate/compare
    Mention detection and disambiguation are often treated separately
  NER [Li et al., 2012; Ritter et al., 2011, ...]
    No linking
    Limited types
  KBP [Ji et al., 2010; Ji et al., 2011, ...]
    Focus on the disambiguation aspect
What approach should we use?
  Task: wikify entities of certain types (all named entities)
  Approach 1:
    Train a general named entity recognizer for those types
    Link to entities from the output of the first stage
  Approach 2:
    Learn to jointly detect mentions and disambiguate entities
    Take advantage of Wikipedia information
    Incorporate type information into our model
  Advanced model
  Limited Types; Adaptation
The Necessity of the Joint Approach
  The town is so so good, Don’t worry Ben, we already forgave you for Gigli
  Q: Is “the town” a mention?
  Deep analysis with knowledge is required
    Gigli is a Ben Affleck movie that was poorly reviewed
    Ben Affleck is the lead actor in the movie “The Town”
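The effect described on this slide can be sketched as a joint assignment problem: each mention candidate is either linked to one of its entities or left unlinked, and the total score combines per-candidate scores with entity-entity relatedness. The sketch below uses exhaustive search over all assignments; every score in it is made up for illustration, and the `NIL` convention, function names, and score tables are assumptions rather than the authors' model.

```python
from itertools import product

NIL = None  # hypothetical convention: the candidate is not linked


def best_assignment(candidates, local_score, pair_score):
    """Exhaustively search all joint assignments and return the best one.

    candidates: one list of entity options per mention candidate.
    local_score(entity): score of linking a candidate to that entity.
    pair_score(a, b): relatedness bonus for two linked entities.
    """
    options = [[NIL] + list(entities) for entities in candidates]
    best, best_score = None, float("-inf")
    for assignment in product(*options):
        linked = [e for e in assignment if e is not NIL]
        score = sum(local_score(e) for e in linked)
        score += sum(
            pair_score(a, b)
            for i, a in enumerate(linked)
            for b in linked[i + 1:]
        )
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


# Made-up scores: "the town" looks like an ordinary phrase on its own.
LOCAL = {"The Town (2010 film)": -0.5, "Ben Affleck": -0.3, "Gigli": 0.4}
RELATED = {
    frozenset(["Ben Affleck", "Gigli"]): 0.6,
    frozenset(["Ben Affleck", "The Town (2010 film)"]): 0.6,
}

# Candidates for "the town", "ben", "gigli", in that order.
cands = [["The Town (2010 film)"], ["Ben Affleck"], ["Gigli"]]

joint, joint_score = best_assignment(
    cands, LOCAL.get, lambda a, b: RELATED.get(frozenset([a, b]), 0.0)
)
independent, _ = best_assignment(cands, LOCAL.get, lambda a, b: 0.0)
```

With the relatedness term switched off, only "Gigli" is linked; with it switched on, linking "Ben" and "the town" as well gives the highest total, which mirrors the argument that entity knowledge can decide whether a phrase is a mention at all. Exhaustive search is exponential in the number of candidates, so it is only practical for short texts such as tweets.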
Features
  Oh Yes!! giants vs packers game now!! Touchdown!!
  Mention Specific Features
  Mention, Entity Pair Features
  Second Order Features
  Type Features
Mention Specific Features
View Count
  Wikipedia page-view statistics
    A log exists for every hour
    Very valuable data
  View count is useful
    Sometimes the most linked entity in Wikipedia is not the most popular one
  “jersey shore” ==> ?
    Jersey Shore    links: 441    views:
    Jersey Shore (TV_series)    links: 324    views:
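A minimal sketch of why the two statistics can disagree: ranking the same candidates by incoming-link count and by page-view count may pick different entities. The page names and the view counts below are made-up toy values (the view counts on the slide are not available), and the log scaling is an assumption about how such a count might be turned into a feature, not something stated in the talk.

```python
import math
from typing import Dict, List


def most_popular(candidates: List[str], stat: Dict[str, int]) -> str:
    """Return the candidate with the highest value of the given statistic."""
    return max(candidates, key=lambda entity: stat.get(entity, 0))


def log_count_feature(entity: str, stat: Dict[str, int]) -> float:
    """Assumed feature form: log(1 + count), which dampens very large counts."""
    return math.log1p(stat.get(entity, 0))


# Toy data: "page_a" has more incoming links, "page_b" has more views.
link_counts = {"page_a": 441, "page_b": 324}
view_counts = {"page_a": 100, "page_b": 5000}  # hypothetical values

candidates = ["page_a", "page_b"]
by_links = most_popular(candidates, link_counts)
by_views = most_popular(candidates, view_counts)
```

Here `by_links` and `by_views` differ, which is the situation the slide describes: link structure reflects how Wikipedia editors connect pages, while view counts reflect what readers are currently interested in.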
Second Order Features
Type Features
  The content of Wikipedia differs from that of Twitter
    Wikipedia is informational; tweets are actionable
    Misspelled words: “watchin, watchn, ...”
  We want to find contexts for PER, LOC, ORG, ... in tweets
    Step 1: Train a system
    Step 2: Label 10 million unlabeled tweets with it
    Step 3: Collect popular contextual words for each type
    Step 4: Train a new system with one new feature
      Check whether the context matches the type
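Steps 3 and 4 can be sketched roughly as follows, assuming the automatically labeled tweets are available as token lists with typed mention spans. Ranking context words by raw frequency is an assumption (the talk only says "popular"), and the data layout and function names are hypothetical.

```python
from collections import Counter, defaultdict


def mine_context_words(tagged_tweets, top_k=10):
    """Collect the most frequent words right before/after mentions of each type.

    tagged_tweets: iterable of (tokens, mentions), where mentions is a list of
    (start, end, entity_type) with `end` exclusive. This layout is assumed.
    """
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for tokens, mentions in tagged_tweets:
        for start, end, etype in mentions:
            if start > 0:
                before[etype][tokens[start - 1].lower()] += 1
            if end < len(tokens):
                after[etype][tokens[end].lower()] += 1
    top_before = {t: [w for w, _ in c.most_common(top_k)] for t, c in before.items()}
    top_after = {t: [w for w, _ in c.most_common(top_k)] for t, c in after.items()}
    return top_before, top_after


def context_matches_type(tokens, start, end, etype, top_before, top_after):
    """Binary feature: does a neighboring word appear in the mined list for etype?"""
    prev_ok = start > 0 and tokens[start - 1].lower() in top_before.get(etype, [])
    next_ok = end < len(tokens) and tokens[end].lower() in top_after.get(etype, [])
    return prev_ok or next_ok


# Tiny made-up sample standing in for the automatically labeled tweets.
tagged = [
    (["watchin", "glee", "finale"], [(1, 2, "TVSHOW")]),
    (["watching", "glee", "tonight"], [(1, 2, "TVSHOW")]),
    (["watchin", "lost", "finale"], [(1, 2, "TVSHOW")]),
]
top_before, top_after = mine_context_words(tagged, top_k=1)
```

Because the word lists are mined from tweets rather than from Wikipedia text, they can pick up Twitter-specific spellings such as "watchin" that a Wikipedia-trained context model would never see.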
Mining Contextual Words
  Entity Type: Person
    Words appearing before the mention: wr, dominating, rip, quarterback, singer, featuring, defender, rb, minister, actress, twitition, secretary
    Words appearing after the mention: tarde, format, noite, suffers, dire, admits, senators, urges, performs, joins
  Entity Type: TV Show
    Words appearing before the mention: sbs, assistir, assistindo, otm, watching, nw, watchn, viagra, watchin, ver
    Words appearing after the mention: skit, performances, premieres, finale, parody, marathon, season, episodes, spoilers, sketch
Procedure
  Testing, step 1
    Given a tweet, tokenize it, remove symbols, and segment hashtags
  Testing, step 2
    For every k-gram in the tweet, do a table lookup to find mention candidates and the entities they can link to
  Testing, step 3
    Construct features and output the assignment with the trained model
  Learning: Structural SVM; Inference: exact or beam search
  A rule-based system for categorizing Wikipedia pages
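Testing steps 1 and 2 can be illustrated with a short sketch: normalize the tweet, then look up every k-gram in a phrase-to-entities table. The table contents, the `max_k` limit, and the function names are hypothetical; hashtag segmentation is not implemented here (a hashtag is simply kept as one word), and the real lookup table would be built from Wikipedia.

```python
import re
from typing import Dict, List, Tuple

Candidate = Tuple[int, int, str, List[str]]  # (start, end, phrase, entities)


def tokenize(tweet: str) -> List[str]:
    """Lowercase the tweet and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", tweet.lower())


def mention_candidates(
    tokens: List[str], anchor_table: Dict[str, List[str]], max_k: int = 3
) -> List[Candidate]:
    """Look up every k-gram (k <= max_k) and return those found in the table."""
    found = []
    for k in range(1, max_k + 1):
        for i in range(len(tokens) - k + 1):
            phrase = " ".join(tokens[i:i + k])
            if phrase in anchor_table:
                found.append((i, i + k, phrase, anchor_table[phrase]))
    return found


# Hypothetical lookup table mapping surface phrases to candidate entities.
anchor_table = {
    "giants": ["New York Giants", "San Francisco Giants"],
    "packers": ["Green Bay Packers"],
}

tokens = tokenize("Oh Yes!! giants vs packers game now!! Touchdown!!")
cands = mention_candidates(tokens, anchor_table)
```

The output of this stage is only a set of candidates; deciding which of them are real mentions, and which entity each one refers to, is left to the trained model in step 3.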
Data
  Train %    Test %    Test %
  We sample two sets of tweets
    Train, Test 1 from [Ritter 2011]
    Test 2 from Twitter with entertainment keywords “director, actress”……
  is very high
    Many, many algorithms focus on disambiguation
    However, if the mentions are correctly extracted, the system is already very good
Main Results
  TagMe [Ferragina & Scaiella 2010] and Cucerzan [Cucerzan 07]
  Cucerzan is designed for well-written documents
  We have a more principled way to handle mention detection than TagMe
Impact of Features
  Entity information helps mention detection
  Mining contextual words helps a bit
  Capturing entity-entity relations also improves the model
  Feature Type        Test 1
  Base + Cap. Rate    45.6
Conclusion & Discussions
  We provide an experimental study on tweets
    Jointly detect mentions and disambiguate
    A structured learning approach
  What have we learned?
    Mention detection is a difficult problem
    Entity information could potentially help mention detection
  Future work
    Explore the connections between the joint approaches and the two-stage approaches [Illinois, ACL 2011; AIDA, VLDB 2011]
    A more principled way to handle context