Presentation is loading. Please wait.

Presentation is loading. Please wait.

Entity-oriented filtering of large streams John R. Frank Ian Soboroff Max Kleiman-Weiner Dan A. Roberts.

Similar presentations


Presentation on theme: "Entity-oriented filtering of large streams John R. Frank Ian Soboroff Max Kleiman-Weiner Dan A. Roberts."— Presentation transcript:

1 Entity-oriented filtering of large streams John R. Frank jrf@mit.edu Ian Soboroff ian.soboroff@nist.gov Max Kleiman-Weiner maxkw@mit.edu Dan A. Roberts drob@mit.edu http://trec-kba.org

2 Date: Tue, 13 Mar 2012 02:45:40 +0000 From: Google Alerts Subject: Google Alert - "John R. Frank" === Web - 2 new results for ["John R. Frank"] === John R. Frank SPOKANE, Wash. - John R. Frank, 55, died March 4, 2012, in Coeur d' Alene, Idaho. Survivors include: his wife, Miki; daughter, Patricia Frank;... In Memory of John R Frank Biography. John R. Frank, age 55, passed away at Sacred Heart Medical Center in Spokane, WA, on March 4, 2012. John was born in Hutchison, KS,...

3 2012 Task: Filtering to Recommend Citations 1)Initialize with a target WP entity state of WP from Jan 2012 2)Iterate over stream of text items Oct-Dec 2011: train on labels 3)For each, output relevance between 0, 1 Jan-Apr 2012: labels hidden Content Stream 462M texts, 40% English 4,973 hourly chunks of a 10 5 docs/hour News, blogs, forums, and link shortening Content Stream 462M texts, 40% English 4,973 hourly chunks of a 10 5 docs/hour News, blogs, forums, and link shortening Your KBA System Entities in Wikipedia or another Knowledge Base Automatically recommend new edits

4 s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/

5

6 Accelerate? rate of assimilation << stream size # editors << # entities << # mentions (definition of a “large” KB)

7 How many days must a news article wait before being cited in Wikipedia?

8 Complex entity with many relationships and attributes.

9 Has many interests, including trying to takeover UK soccer teams. His empire includes many entities… Note: Usmanov not mentioned in this text! Citation #18 Elaborate link trails…

10 Example KBA Rating Task Published: March 31, 2012 Impact of Thoughts on Water By Denis Gorce-Bourge Water covers 70% of our Blue planet and our body is made of about 70% water. Masaru Emoto is a Japanese Photographer and scientist. He is known over the world for his remarkable work on water and its deep connection with individual and collective consciousness. For decades, Masaru took pictures of frozen crystals of water and tested the direct influence of the environment on the quality of those crystals. Pollution has a direct impact on the beauty of a frozen crystal but as well words, music and thoughts. He tested the quality of water crystals by exposing it to various conditions: to written words like hate and violence and Love and gratitude. The results were just astonishing. The crystal exposed to Love and gratitude was beautiful and perfectly formed where the other one was severely degraded. He demonstrated as well the impact of Heavy Metal music versus Mozart or Beethoven and how the vibration of music impacts water. The very shape of water crystals is modified by violence, aggression, and negative words.

11 Example KBA Rating Task Published: March 31, 2012 Impact of Thoughts on Water By Denis Gorce-Bourge Water covers 70% of our Blue planet and our body is made of about 70% water. Masaru Emoto is a Japanese Photographer and scientist. He is known over the world for his remarkable work on water and its deep connection with individual and collective consciousness. For decades, Masaru took pictures of frozen crystals of water and tested the direct influence of the environment on the quality of those crystals. Pollution has a direct impact on the beauty of a frozen crystal but as well words, music and thoughts. He tested the quality of water crystals by exposing it to various conditions: to written words like hate and violence and Love and gratitude. The results were just astonishing. The crystal exposed to Love and gratitude was beautiful and perfectly formed where the other one was severely degraded. He demonstrated as well the impact of Heavy Metal music versus Mozart or Beethoven and how the vibration of music impacts water. The very shape of water crystals is modified by violence, aggression, and negative words.

12

13 97.6% +/- 1.4% (N=5365) coref 69.5% +/- 2.7% (N=1352)central 70.9% +/- 2.0% (N=2403)relevant 58.4% +/- 3.4% (N=884)neutral 84.9% +/- 2.0% (N=2599)garbage 82.6% +/- 1.8% (N=3200)centralrelevant 89.0% +/- 1.7% (N=3551)centralrelevantneutral Interannotator Agreement

14 IR: User task centric Variation in interpretation Scores  cascading lists Constructionist, emergence NLP: Data parsing centric Universal annotation Scores  probabilities Reductionist TRECing the continental divide between NLP and IR

15

16 string matching task generator 91% recall 15% precision 26% F 1

17

18 KBA 2013 More entity types with an emphasis on temporality in the stream. Target EntitiesKBCentrally Relevant Training DataAnnotation People and Organizations Wikipedia or maybe Freebase Citation worthyJudgments from early stream High recall on all mentioning docs. Pharmaceutical Compounds Merck KB?Reporting of Adverse Drug Reaction (ADR) (an event) (same)Focus recall on first person reporting & negative reactions? Event-type Entities Defined by a cluster of entities and possibly a Type-of-Event from a taxonomy WP/FB for cluster of entities, possibly also event itself. Provides causality info Judgments on docs for that Type-of-Event but different specific event. Find training data from TDT? Use citations in Category:Current _events? Judge post-hoc?

19 KBX Cold Start queries focused on: nil entities related to target cluster and/or causality of event Pool top-K filtered docs, or use each KBA run as separate KBP input. (1000x filter) Must coordinate choice of KBA target entities with desired content of KBs for Cold Start queries. KBA Stream Corpus 2012 (or the new Stream Corpus 2013) 462M texts, 40% English 4,973 hourly chunks of a 10 5 docs/hour News, blogs, forums, and link shortening KBA Stream Corpus 2012 (or the new Stream Corpus 2013) 462M texts, 40% English 4,973 hourly chunks of a 10 5 docs/hour News, blogs, forums, and link shortening Output KB Clusters of related entities and/or event-type entities Clusters of related entities and/or event-type entities KBP KBA

20 Sponsors: Thank You. Diffeo

21 Thanks for your time. John R. Frank jrf@mit.edu http://trec-kba.org

22 Lessons Learned (in progress) 97% coref, but 70% rating agreement  hard – one-in-twenty WP citations non-mentioning – Definition of “citable” varies across entities – Tension between IR “rating” and NLP “labeling” Lost >1 teams from challenges of crunching big data – Learn from Kaggle: score a run every day – AWS is really useful. KB feature mining & ML beat string matching – Too much training data? – Must exercise temporality in the stream… spikes & events.

23 NLP derives logical structures from messages. Might leverage O(n 2 ) and more to explore the problem space. Limits corpus size to ~10 6 docs. IR filters relevant messages from large streams. Limit algorithmic complexity to O(n) and simpler by forcing large corpora, 10 8 docs and higher. KBP KB A Future of Large Knowledge Bases Future of Large Knowledge Bases volume depth As NLP and IR converge, relevance concepts are gaining structure and logical inference is spreading across documents.

24 Users query the tagged text and occasionally create more structure Traditional Doc DB pipeline of NLP taggers add metadata to doc DB create structure once docs one-shot NLP pipeline (traditional approach)

25 Doc DB docs End users and adaptive tagging algorithms access the same content in the DB Paradigm Shift: persistent, adaptive NLP in the database

26 Methods in the Madness mean edit interval (days) mean mention interval (hours)

27 Has many interests, including trying to takeover UK soccer teams. His empire includes many entities… Elaborate link trails…


Download ppt "Entity-oriented filtering of large streams John R. Frank Ian Soboroff Max Kleiman-Weiner Dan A. Roberts."

Similar presentations


Ads by Google