
1 High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data
Preeti Bhargava, Nemanja Spasojevic, Guoning Hu
Applied Data Science, Lithium Technologies

2 Problem There is a lot of unstructured text in social networks; only once the text is annotated does it become useful for various IR applications. The goal is to annotate the example sentence with the correct disambiguation of its ambiguous terms.

3 Applications Tweets & other user-generated text
User profiles (interests & expertise) URL recommendations Content personalization Processing of social media texts

4 Challenges Ambiguity Multi-lingual content
High throughput and lightweight approach: 0.5B documents daily (~1-2 ms per tweet), commodity hardware (REST API, MR) Shallow NLP approach (no POS) Dense annotations (efficient information retrieval) The biggest challenges when running an entity linking system in production are ambiguity, the requirement to support multilingual content, constrained resources given the volume of data processed, the fact that these constraints force a simpler approach (for example, no part-of-speech tagging), and the need to annotate/link as densely as possible, since IR tasks are more efficient and accurate when the underlying data is rich and dense.

5 Knowledge Base Freebase entities (top 1 million by importance)*
Balance coverage and relevance with respect to common social media text 2 special entities: NIL (‘the’ -> NIL) MISC (‘1979 USA Basketball Team’ -> MISC) In this study we use Freebase as the knowledge base of choice. We consider the 1 million most important entities, where importance balances coverage and relevance, and in addition we have 2 special entities, NIL and MISC. * Prantik Bhattacharyya and Nemanja Spasojevic. Global entity ranking across multiple languages. Poster, WWW 2017 Companion

6 Data Set Internally Developed Open Data Set
Densely Annotated Wikipedia Text (DAWT)1,2: high precision and dense link coverage, on average 4.8 times more links than the original Wiki articles, 6 languages. The data set we used for training and for deriving the data resources was DAWT (Densely Annotated Wikipedia Text); it is essentially Wikipedia text with denser links, covering 6 languages. The data set is open; if you are interested in finding out more about it, drop by the Wiki Workshop poster session. Nemanja Spasojevic, Preeti Bhargava, and Guoning Hu. DAWT: Densely Annotated Wikipedia Texts across multiple languages. WWW 2017 Wiki Workshop (Wiki’17)

7 Text Processing pipeline
Here is an illustration of what each stage looks like.

8 Text Processing pipeline
For this research, the two most important stages are entity extraction and entity disambiguation & linking.

9 Entity Extraction Entity Extraction – candidate mention dictionary
consider n-gram phrases (n ∈ [1,6]) choose the longest phrase present in the candidate dictionary Note that we do not use POS tagging for entity extraction but rely on a pre-calculated entity dictionary. We use a greedy algorithm to extract the longest n-grams from the text that map to entries in the candidate mention dictionary; a sketch of this matching is shown below.
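A minimal sketch of this greedy longest-match extraction, assuming a pre-built candidate mention dictionary that maps surface forms to Freebase machine IDs; the dictionary contents, tokenizer, and IDs below are illustrative, not the production resource.

```python
# Greedy longest-match mention extraction against a candidate dictionary.
CANDIDATES = {
    "google": ["045c7b"],
    "ceo": ["0dq_5"],
    "eric": ["03f078w", "0q9nx"],
    "eric schmidt": ["03f078w"],
    "apple": ["0k8z", "014j1m"],
}
MAX_N = 6  # consider n-grams with n in [1, 6]

def extract_mentions(text):
    tokens = text.lower().split()
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # try the longest phrase first, then shrink
        for n in range(min(MAX_N, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if CANDIDATES.get(phrase):
                match = (phrase, CANDIDATES[phrase])
                i += n
                break
        if match:
            mentions.append(match)
        else:
            i += 1
    return mentions

print(extract_mentions(
    "Google CEO Eric Schmidt said that competition between Apple and Google"))
# [('google', ['045c7b']), ('ceo', ['0dq_5']), ('eric schmidt', ['03f078w']),
#  ('apple', ['0k8z', '014j1m']), ('google', ['045c7b'])]
```

Because the longest dictionary phrase wins, "Eric Schmidt" is extracted as a single mention instead of "Eric" alone, which mirrors the step-by-step example on the next slides.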

10 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Google -> {045c7b} Here is a step-by-step example where we pick the longest valid entity.

11 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Google -> {045c7b} Google CEO -> {} - Greedy strategy for entity extraction (longest)

12 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Google -> {045c7b} Google CEO -> {} Google CEO Eric -> {} - Greedy strategy for entity extraction (longest)

13 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Candidates Google -> {045c7b} CEO -> {0dq_5} Google CEO -> {} Google CEO Eric -> {} - Greedy strategy for entity extraction (longest)

14 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Candidates Google -> {045c7b} CEO -> {0dq_5} Google CEO -> {} CEO Eric -> {} Google CEO Eric -> {} - Greedy strategy for entity extraction (longest)

15 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Candidates Google -> {045c7b} CEO -> {0dq_5} Google CEO -> {} CEO Eric -> {} Google CEO Eric -> {} - Greedy strategy for entity extraction (longest) CEO Eric Schmidt -> {}

16 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Candidates Candidates Google -> {045c7b} CEO -> {0dq_5} Eric -> {03f078w, 0q9nx} Google CEO -> {} CEO Eric -> {} Google CEO Eric -> {} - Greedy strategy for entity extraction (longest) CEO Eric Schmidt -> {}

17 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Candidates Candidates Google -> {045c7b} CEO -> {0dq_5} Eric -> {03f078w, 0q9nx} Google CEO -> {} CEO Eric -> {} Eric Schmidt -> {03f078w} Google CEO Eric -> {} - Greedy strategy for entity extraction (longest) CEO Eric Schmidt -> {}

18 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Candidates Candidates Google -> {045c7b} CEO -> {0dq_5} Eric -> {03f078w, 0q9nx} Google CEO -> {} CEO Eric -> {} Eric Schmidt -> {03f078w} Google CEO Eric -> {} - Greedy strategy for entity extraction (longest) CEO Eric Schmidt -> {} Eric Schmidt said -> {}

19 Entity Extraction Google CEO Eric Schmidt said that competition between Apple and Google … Candidates Candidates Candidates Google -> {045c7b} CEO -> {0dq_5} Eric -> {03f078w, 0q9nx} Google CEO -> {} CEO Eric -> {} Eric Schmidt -> {03f078w} Google CEO Eric -> {} - Greedy strategy for entity extraction (longest) CEO Eric Schmidt -> {} Eric Schmidt said -> {} and so on …

20 Entity Disambiguation
Two-pass algorithm: disambiguates and links a set of easy mentions leverages these easy entities and several features to disambiguate and link the remaining hard mentions For entity disambiguation we use a two-pass algorithm. In the first pass we disambiguate all easy mentions; in the second pass the easy entities and several other features help disambiguate the hard mentions.

21 First PASS Use Mention-Entity Co-occurrence prior probability:
Only one candidate entity High prior probability given mention (> 0.9) Two candidate entities, one being NIL/MISC - high prior probability given mention (> 0.75) Example of Mention-Entity Co-occurrence prior probability:
Dielectrics 0b7kg:0.4863,_nil_:0.3836,_misc_:0.1301
lost village _nil_:0.7826,05gxzw:0.2029,_misc_:0.0145
Tesla _nil_:0.3621,05d1y:0.327,0dr90d:0.1601,036wfx:0.0805,03rhvb:0.0303
tesla _nil_:0.5345,03rhvb:0.4655
In the first pass we use the mention-entity co-occurrence prior probability; a sketch of these rules is shown below.
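A rough sketch of the first-pass rules above, assuming the priors are available as {mention: {entity: probability}}; the thresholds come from the slide, the "barack obama" entry is hypothetical and added only so that an "easy" case is visible.

```python
# First-pass disambiguation sketch: resolve "easy" mentions directly from
# the mention-entity co-occurrence priors.
PRIORS = {
    "lost village": {"_nil_": 0.7826, "05gxzw": 0.2029, "_misc_": 0.0145},
    "tesla": {"_nil_": 0.5345, "03rhvb": 0.4655},
    "barack obama": {"02mjmr": 0.97, "_nil_": 0.03},  # hypothetical easy case
}
SPECIAL = {"_nil_", "_misc_"}

def first_pass(mention):
    """Return the resolved entity for an easy mention, or None if it is hard."""
    priors = PRIORS.get(mention.lower(), {})
    if not priors:
        return None
    best, best_p = max(priors.items(), key=lambda kv: kv[1])
    if len(priors) == 1:
        return best                       # only one candidate entity
    if best_p > 0.9:
        return best                       # dominant prior
    if len(priors) == 2 and SPECIAL & set(priors) and best_p > 0.75:
        return best                       # two candidates, one NIL/MISC
    return None                           # left for the second pass

print(first_pass("barack obama"))  # '02mjmr' -> resolved in the first pass
print(first_pass("tesla"))         # None    -> hard mention, second pass
```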

22 Second PASS Build context: Document - easy entities
Entity - position, easy entities within window Build feature set: Context independent: Mention-Entity-Co-occurrence Mention-Entity-Jaccard Entity-Importance Context dependent: Entity-Entity-Co-occurrence Entity-Entity-Topic-Similarity In the second pass we build a context from the disambiguated easy entities, compute the feature set above for each remaining candidate, and use it to disambiguate the hard mentions; a sketch of the context window is shown below.
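A simplified sketch of how a hard mention's context might be assembled: the disambiguated easy entities whose token positions fall inside a fixed window around the mention. The window size and data structures are assumptions, not values from the talk.

```python
# Collect disambiguated "easy" entities within a token window around a hard
# mention; these form the context used by the context-dependent features.
WINDOW = 10  # tokens on each side; illustrative value

def context_entities(hard_mention_pos, easy_entities):
    """easy_entities: list of (entity_id, token_position) from the first pass."""
    return [
        entity_id
        for entity_id, pos in easy_entities
        if abs(pos - hard_mention_pos) <= WINDOW
    ]

easy = [("045c7b", 0), ("03f078w", 2), ("0k8z", 8)]  # Google, Eric Schmidt, Apple
print(context_entities(10, easy))  # ['045c7b', '03f078w', '0k8z']
```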

23 Mention-Entity Co-occurrence
Example of Mention-Entity Co-occurrence prior probability:
dielectrics 0b7kg:0.4863,_none_:0.3836,_misc_:0.1301
lost village _none_:0.7826,05gxzw:0.2029,_misc_:0.0145
Tesla _none_:0.3621,05d1y:0.327,0dr90d:0.1601,036wfx:0.0805,03rhvb:0.0303
tesla _none_:0.5345,03rhvb:0.4655
Example: P(05d1y|’Tesla’) = 0.327 This feature captures the prior probability of a candidate entity given the observed mention.
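For illustration, the prior rows shown on the slide can be parsed into a lookup table and queried for P(entity | mention); the row format here is assumed from the slide's notation.

```python
# Parse the prior rows into {mention: {entity: probability}} and look up
# P(entity | mention). The row format is assumed from the slide.
ROWS = {
    "Tesla": "_none_:0.3621,05d1y:0.327,0dr90d:0.1601,036wfx:0.0805,03rhvb:0.0303",
    "tesla": "_none_:0.5345,03rhvb:0.4655",
}

def parse_priors(row):
    return {e: float(p) for e, p in (pair.split(":") for pair in row.split(","))}

priors = {mention: parse_priors(row) for mention, row in ROWS.items()}
print(priors["Tesla"]["05d1y"])  # 0.327, i.e. P(05d1y | 'Tesla')
```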

24 Mention-Entity Jaccard
Captures the alignment of the entity's representative mention with the observed mention. The mention-entity Jaccard similarity measures token-level similarity between the entity's representative surface form and the mention form. Example: ‘Tesla’ vs ‘Tesla Motors’ => 0.5
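A small sketch of the token-level Jaccard similarity between the entity's representative surface form and the observed mention; whitespace tokenization is an assumption.

```python
# Token-level Jaccard similarity between an entity's representative surface
# form and the observed mention.
def mention_entity_jaccard(representative_form, observed_mention):
    a = set(representative_form.lower().split())
    b = set(observed_mention.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(mention_entity_jaccard("Tesla Motors", "Tesla"))  # 0.5, as on the slide
```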

25 Entity Importance Captures the global importance of an entity as perceived by casual observers. Another signal is entity importance, where we use the same importance ranking mentioned earlier in the talk. The rank is scaled so that it maps to a percentile score in [0, 1], where 1.0 is the most important.
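One possible way to turn the global importance rank over the 1-million-entity knowledge base into a [0, 1] percentile score with 1.0 as most important; the exact scaling used in the system is not specified in the talk, so this mapping is an assumption.

```python
# Map a global importance rank (1 = most important) over the 1M-entity
# knowledge base to a percentile-style score in [0, 1].
KB_SIZE = 1_000_000

def importance_score(rank):
    return 1.0 - (rank - 1) / KB_SIZE

print(importance_score(1))        # 1.0   -> most important entity
print(importance_score(500_000))  # ~0.5  -> middle of the ranking
```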

26 Entity-Entity Co-occurrence
Average co-occurrence of a candidate entity with the disambiguated easy entities in the context window.
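A sketch of this feature, assuming a precomputed pairwise co-occurrence score between entities; the co-occurrence table and its values below are illustrative.

```python
# Average co-occurrence of a candidate entity with the disambiguated easy
# entities in its context window. Scores are illustrative.
COOCCURRENCE = {
    ("0dr90d", "045c7b"): 0.02,   # Tesla Motors ~ Google
    ("0dr90d", "03f078w"): 0.01,  # Tesla Motors ~ Eric Schmidt
}

def cooccurrence(a, b):
    return COOCCURRENCE.get((a, b)) or COOCCURRENCE.get((b, a)) or 0.0

def entity_entity_cooccurrence(candidate, easy_entities_in_window):
    if not easy_entities_in_window:
        return 0.0
    scores = [cooccurrence(candidate, e) for e in easy_entities_in_window]
    return sum(scores) / len(scores)

print(entity_entity_cooccurrence("0dr90d", ["045c7b", "03f078w"]))  # ~0.015
```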

27 Entity-Entity Topic Semantic Similarity
Inverse of the minimum semantic distance between the candidate entity's topics and the topics of entities in the easy-entity window. Here we use an in-house topical ontology which captures a hierarchy of topics. Each entity is mapped to topics, from which we calculate semantic similarity as the inverse of the distance between the topics of the given entities. Example: sim(‘Apple’, ‘Google’) = 1 / 4 = 0.25 sim(‘Apple’, ‘Food’) = 1 / 5 = 0.2
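A sketch of the inverse-distance similarity, assuming each entity maps to a topic node and distance is the number of hops between topics in the ontology tree. The tiny ontology fragment and entity-to-topic mapping below are purely illustrative and were chosen only so the example numbers from the slide come out.

```python
# Illustrative topic ontology (child -> parent) and entity -> topic mapping.
PARENT = {
    "technology": "root", "lifestyle": "root",
    "consumer_electronics": "technology", "internet": "technology",
    "smartphones": "consumer_electronics", "search_engines": "internet",
    "food": "lifestyle",
}
ENTITY_TOPIC = {"Apple": "smartphones", "Google": "search_engines", "Food": "food"}

def path_to_root(topic):
    path = [topic]
    while topic in PARENT:
        topic = PARENT[topic]
        path.append(topic)
    return path

def topic_distance(a, b):
    pa, pb = path_to_root(a), path_to_root(b)
    depth = {t: i for i, t in enumerate(pa)}
    for j, t in enumerate(pb):
        if t in depth:                  # lowest common ancestor
            return depth[t] + j
    return len(pa) + len(pb)

def topic_similarity(entity_a, entity_b):
    d = topic_distance(ENTITY_TOPIC[entity_a], ENTITY_TOPIC[entity_b])
    return 1.0 / d if d else 1.0

print(topic_similarity("Apple", "Google"))  # 0.25 = 1/4
print(topic_similarity("Apple", "Food"))    # 0.2  = 1/5
```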

28 Disambiguation Use an ensemble of two classifiers:
A Decision Tree classifier labels each feature vector as ‘True’ or ‘False’. Final scores are generated using the weights learned by the Logistic Regression classifier. Final Disambiguation: If only one candidate entity is labeled ‘True’, it wins. If multiple candidate entities are labeled ‘True’, the highest-scoring one wins. If all candidate entities are labeled ‘False’, use the highest-scoring one only if it has a large score margin over the next one.
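A sketch of the final selection logic, assuming each candidate already carries a boolean label from the decision tree and a score from the logistic-regression weights. The margin threshold is an assumption, not a value from the talk.

```python
# Final disambiguation over a mention's candidates. Each candidate is
# (entity_id, label_from_decision_tree, score_from_logistic_regression).
MARGIN = 0.2  # "large score margin"; the actual threshold is an assumption

def disambiguate(candidates):
    true_cands = [c for c in candidates if c[1]]
    if len(true_cands) == 1:
        return true_cands[0][0]                        # single 'True' wins
    if len(true_cands) > 1:
        return max(true_cands, key=lambda c: c[2])[0]  # highest score wins
    # all labeled 'False': accept the top candidate only with a large margin
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    if len(ranked) > 1 and ranked[0][2] - ranked[1][2] >= MARGIN:
        return ranked[0][0]
    return None                                        # no confident link

print(disambiguate([("0dr90d", True, 0.81), ("03rhvb", False, 0.44)]))   # '0dr90d'
print(disambiguate([("0dr90d", False, 0.48), ("03rhvb", False, 0.44)]))  # None
```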

29 Disambiguation Example
Let's look at a real example of the algorithm's output, step by step.

30 Disambiguation Example
The example output continues, step by step.

31 Disambiguation Example
Use the Mention-Entity Co-occurrence prior probability: Only one candidate entity High prior probability given mention (> 0.9) Two candidate entities, one being NIL/MISC - high prior probability given mention (> 0.75) Continuing the example, the easy mentions are resolved with the first-pass rules.

32 Disambiguation Example
Final Disambiguation: If only one candidate entity is labeled ‘True’, it wins. If multiple candidate entities are labeled ‘True’, the highest-scoring one wins. If all candidate entities are labeled ‘False’, use the highest-scoring one only if it has a large score margin over the next one. Continuing the example, the final disambiguation rules select the winning entities.

33 Disambiguation Example
This completes the step-by-step example of the algorithm's output.

34 Evaluation Ground truth test set: 20 English Wikipedia documents (18,773 mentions) Measured Precision, Recall, F-score, Accuracy

35 Evaluation Mention-Entity Co-occurrence based features have the biggest impact Context helps (especially for longer texts)

36 Evaluation – Per Language

37 Language Coverage Comparisons
Systems compared: Lithium EDL, Google Cloud NL API, Open Calais, AIDA
Languages covered: English, Arabic, Spanish, French, German, Japanese

38 Coverage Comparisons Lithium EDL linked 75% more entities than Google NL (precision-adjusted lower bound) Lithium EDL linked 104% more entities than Open Calais (precision-adjusted lower bound) Finally, one of the objectives for our system was large coverage. Compared to Google, for the languages available, we detect at least 75% more entities than Google Natural Language.

39 Example Comparisons

40 Runtime Comparisons The text preprocessing stage of the Lithium pipeline is about 30,000 times faster than AIDA The disambiguation runtime per unique entity extracted of the Lithium pipeline is about 3.5 times faster than AIDA AIDA extracts 2.8 times fewer entities per 50 KB of text

41 Conclusion Presented an EDL algorithm that uses several context-dependent and context-independent features The Lithium EDL system recognizes several types of entities (professional titles, sports, activities, etc.) in addition to named entities (people, places, organizations, etc.) It links 75% more entities than state-of-the-art systems The EDL algorithm is language-agnostic and currently supports 6 different languages, making it applicable to real-world data It is high throughput and lightweight: 3.5 times faster than state-of-the-art systems such as AIDA

42 E-mail: team-relevance@klout.com
Questions?

