Massimo Poesio Lecture 7: Wikipedia for Text Analytics


1 Massimo Poesio Lecture 7: Wikipedia for Text Analytics

2 WIKIPEDIA The free encyclopedia that anyone can edit
Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Wikipedia's articles have been written collaboratively by volunteers around the world. Almost all of its articles can be edited by anyone who can access the Wikipedia website.

3 WIKIPEDIA Wikipedia is:
1. domain independent: it has a large coverage
2. up-to-date: it can be used to process current information
3. multilingual: it can be used to process information in many languages

4 Elements of a Wikipedia article: Title, Abstract, Infoboxes, Geo-coordinates, Categories, Images, Links (to other wiki pages, to the web, to other languages), Redirects, Disambiguations

5 WIKIPEDIA FOR TEXT ANALYTICS
Wikipedia has proven an extremely useful resource for text analytics, being used for:
Text classification / clustering
Enriching documents through 'Wikification'
NER
Relation extraction
…

6 Wikipedia as Thesaurus for text classification / clustering
Unlike standard ontologies such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus. However, it is more:
Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia)
Accurate: a study by Giles (2005) found Wikipedia can compete with Encyclopædia Britannica in accuracy*
Up to date: current and emerging concepts are absorbed in a timely fashion
* Giles, J. (2005). Internet encyclopaedias go head to head. Nature 438: 900–901.

7 Wikipedia as Thesaurus
Moreover, Wikipedia has a well-formed structure:
Each article describes a single concept.
The title of the article is a short, well-formed phrase, like a term in a traditional thesaurus.

8 Wikipedia Article that describes the Concept Artificial intelligence

9 Wikipedia as Thesaurus
Moreover, Wikipedia has a well-formed structure:
Each article describes a single concept.
The title of the article is a short, well-formed phrase, like a term in a traditional thesaurus.
Equivalent concepts are grouped together by redirect links.

10 AI is redirected to its equivalent concept Artificial Intelligence

11 Wikipedia as Thesaurus
Moreover, Wikipedia has a well-formed structure:
Each article describes a single concept.
The title of the article is a short, well-formed phrase, like a term in a traditional thesaurus.
Equivalent concepts are grouped together by redirect links.
It contains a hierarchical categorization system, in which each article belongs to at least one category.

12 The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society

13 Wikipedia as Thesaurus
Moreover, Wikipedia has a well-formed structure:
Each article describes a single concept.
The title of the article is a short, well-formed phrase, like a term in a traditional thesaurus.
Equivalent concepts are grouped together by redirect links.
It contains a hierarchical categorization system, in which each article belongs to at least one category.
Polysemous concepts are disambiguated by disambiguation pages.

14 The different meanings that Artificial intelligence may refer to are listed in its disambiguation page.

15 WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING
Objective: use information in Wikipedia to improve the performance of text classifiers / clustering systems. A number of possibilities:
Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
Use WIKIFICATION to enrich documents
Use the Wikipedia category system as a category repertoire

16 Using Wikipedia Categories for text classification

17 WIKIPEDIA FOR TEXT CLASSIFICATION
Automatic identification of the topic/category of a text (e.g., computer science, psychology): books, learning objects, …
[Figure: the sentence "The United States was involved in the Cold War." is mapped to weighted Wikipedia concepts such as United States (0.3793) and Cold War (0.3111), with low weights for Vietnam War (0.0023), World War I (0.0023), Ronald Reagan (0.0027), Communism (0.0027) and Mikhail Gorbachev (0.0023), grouped under the categories Wars Involving the United States and Global Conflicts]

18 USING WIKIPEDIA FOR TEXT CLASSIFICATION
Either directly use Wikipedia categories, or map one's own categories to Wikipedia categories. Then use the documents associated with those categories as training documents, as in the sketch below.
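A minimal sketch of this idea, with toy stand-ins for the Wikipedia article texts (in practice they would be fetched from a dump or the API); the articles of each category serve directly as training documents for a tf-idf + Naive Bayes classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for Wikipedia article texts grouped by category.
articles_by_category = {
    "Global conflicts": [
        "The Cold War was a period of geopolitical tension between blocs.",
        "World War I was a global conflict that began in 1914.",
    ],
    "Artificial intelligence": [
        "Machine learning studies algorithms that improve with data.",
        "Natural language processing analyses human language with computers.",
    ],
}

# The Wikipedia articles act as the training documents for each category.
texts, labels = [], []
for category, articles in articles_by_category.items():
    texts.extend(articles)
    labels.extend([category] * len(articles))

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(texts, labels)

print(classifier.predict(["The United States was involved in the Cold War."]))
```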

19 TEXT WIKIFICATION Wikification = adding links to Wikipedia pages to documents

20 WIKIFICATION
Text: "Giotto was called to work in Padua, and also in Rimini"
Wikipedia: the mentions (Giotto, Padua, Rimini) are linked to the corresponding Wikipedia articles

21 Wikification pipeline

22 Keyword Extraction Finding important words/phrases in raw text
Two-stage process (a sketch of the pipeline follows):
1. Candidate extraction. Typical methods: n-grams, noun phrases
2. Candidate ranking: rank the candidates by importance. Typical methods: unsupervised (information-theoretic), supervised (machine learning using positional and linguistic features)
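A minimal sketch of the two stages, using word n-grams for candidate extraction and plain tf-idf for unsupervised ranking (actual systems vary in both choices):

```python
import re
from collections import Counter
from math import log

def extract_candidates(text, max_n=3):
    """Stage 1: all word n-grams up to length max_n as candidate keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def rank_candidates(text, collection):
    """Stage 2: rank the candidates by tf-idf against a background collection."""
    term_freq = Counter(extract_candidates(text))
    n_docs = len(collection)
    scores = {}
    for cand, tf in term_freq.items():
        df = sum(1 for doc in collection if cand in doc.lower())
        scores[cand] = tf * log((1 + n_docs) / (1 + df))  # smoothed idf
    return sorted(scores, key=scores.get, reverse=True)

background = ["The Cold War shaped United States foreign policy.",
              "Padua is a city in northern Italy."]
query = "The United States was involved in the Cold War."
print(rank_candidates(query, background)[:5])
```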

23 Keyword Extraction using Wikipedia
1. Candidate extraction
Semi-controlled vocabulary: Wikipedia article titles and anchor texts (surface forms), e.g. "USA", "U.S." = "United States of America"
More than 2,000,000 terms/phrases
The vocabulary is broad (e.g., "the" and "a" are included)

24 Keyword Extraction using Wikipedia
2. Candidate ranking
tf * idf, with Wikipedia articles as the document collection
Chi-squared independence of phrase and text: the degree to which the phrase appears more often than expected by chance
Keyphraseness (defined on the next slide)

25 Our own approach (cfr. Milne & Witten 2008, 2012; Ratinov et al. 2011)
Use a Wikipedia dump to compute two statistics:
KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
Two versions of the system:
UNSUPERVISED: uses the statistics only
SUPERVISED: uses distant learning to create training data

26 KEYPHRASENESS the probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten's prior link probability)
Example: the term "Georgia" is found as a link in 22,631 Wikipedia articles and appears in 75,000 Wikipedia articles, so keyphraseness = 22631/75000 ≈ 0.30
Cfr. the term "the", whose keyphraseness is far lower: it appears in almost every article but is almost never a link

27 COMMONNESS the probability that a term t is a link to a SPECIFIC Wikipedia article a
For example, the surface form "Georgia" was found to be linked to a1 = "University_of_Georgia" 166 times, to "Republic_of_Georgia" … times, and to "Georgia_(United_States)" 5 times, so commonness(t, a1) = 166/(166 + … + 5)
(both statistics are sketched in code below)
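A minimal sketch of both statistics, assuming the link counts have already been extracted from a dump into the two dictionaries below; the counts are those from the slides, except the Republic_of_Georgia count, which is missing in the source and replaced by a labeled placeholder:

```python
from collections import Counter

# Link statistics extracted from a Wikipedia dump (illustrative values).
link_counts = {
    "Georgia": Counter({
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 100,   # placeholder: real count not given
        "Georgia_(United_States)": 5,
    }),
}
# How often each surface form occurs as a link vs. anywhere in the text.
occurrences = {"Georgia": {"as_link": 22631, "total": 75000}}

def keyphraseness(term):
    """Prior probability that the term occurs as a link at all."""
    occ = occurrences[term]
    return occ["as_link"] / occ["total"]

def commonness(term, article):
    """Probability that the term, when linked, points to this article."""
    counts = link_counts[term]
    return counts[article] / sum(counts.values())

print(keyphraseness("Georgia"))                        # ~0.30
print(commonness("Georgia", "University_of_Georgia"))
```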

28 Extracting dictionaries and statistics from a Wikipedia dump
Parsing, in three phases:
1. Identify articles of relevance
2. Extract (among other things) the set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
3. Extract the set of LINKS [[article|surface_form]], e.g. [[Pedanius Dioscorides|Dioscorides]]
A sketch of the link-extraction phase follows.
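A minimal sketch of the link-extraction phase over raw wikitext; real dumps need a proper parser (templates, nesting, files), so this regular expression is illustrative only:

```python
import re
from collections import Counter

# Matches [[target]] and [[target|surface form]] wiki links.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_links(wikitext):
    """Yield (target_article, surface_form) pairs from raw wikitext."""
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip().replace(" ", "_")
        surface = (match.group(2) or match.group(1)).strip()
        yield target, surface

text = "De Materia Medica was written by [[Pedanius Dioscorides|Dioscorides]] in [[Rome]]."
link_stats = Counter(extract_links(text))
print(link_stats)  # counts of (Pedanius_Dioscorides, Dioscorides) and (Rome, Rome)
```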

29 The Wikipedia Dump from July 2011
… pages in total
12,525,583 links, specifying surface word / target / frequency, ranked by frequency
For example, the mention "Georgia" is linked to "University_of_Georgia" 166 times, to "Republic_of_Georgia" … times, and to "Georgia_(United_States)" 5 times

30 Some statistics (all Wikidumps from July 2011)
Page Type        English      Italian    Polish
Redirected       4,465,652    323,591    134,148
List_of            138,581        836      5,021
Disambiguation     176,721      6,193      4,553
Relevant         4,361,020    917,354    920,486
Total           11,459,639  1,654,258  1,200,313

31 Surface forms, titles, articles
Some definitions and figures:
surface form: the occurrence of a mention inside an article
target article: the Wikipedia article a surface form is linked to

Dictionary      English     Italian    Polish
Titles          4,361,020   917,354    920,486
Surface forms   8,829,624   2,484,045  2,482,104
Files             745,724      72,126  n/a
Links          10,871,741   2,917,235  2,937,981

(Files in Polish are arranged in a repository different from English/Italian)

32 The Unsupervised Approach
Use keyphraseness to identify candidate terms: retain terms whose keyphraseness is above a certain threshold (currently 0.01)
Use commonness to rank the candidate links: retain the top 10
(a sketch follows)
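A minimal sketch of the unsupervised wikifier, reusing the statistics from the earlier keyphraseness/commonness sketch (the Republic_of_Georgia count is again a placeholder):

```python
from collections import Counter

# Statistics as in the earlier keyphraseness/commonness sketch.
occurrences = {"Georgia": {"as_link": 22631, "total": 75000}}
link_counts = {"Georgia": Counter({"University_of_Georgia": 166,
                                   "Republic_of_Georgia": 100,  # placeholder
                                   "Georgia_(United_States)": 5})}

def wikify_unsupervised(terms, threshold=0.01, top_k=10):
    """Keep terms whose keyphraseness clears the threshold, link each to
    its most common target article, and return the top_k by commonness."""
    links = []
    for term in terms:
        occ = occurrences.get(term)
        if not occ or occ["as_link"] / occ["total"] < threshold:
            continue  # keyphraseness too low: unlikely to be a keyword
        counts = link_counts[term]
        article, freq = counts.most_common(1)[0]
        links.append((term, article, freq / sum(counts.values())))
    links.sort(key=lambda link: link[2], reverse=True)
    return links[:top_k]

print(wikify_unsupervised(["Georgia", "the"]))
```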

33 The Supervised Approach
Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten's NORMALIZED LINK DISTANCE; a sketch follows)
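A sketch of a Milne & Witten (2008)-style relatedness measure, computed from the sets of articles linking to two candidate pages; the toy in-link sets are made up for illustration:

```python
from math import log

def relatedness(inlinks_a, inlinks_b, n_articles):
    """One minus the normalized link distance between two articles'
    in-link sets (cfr. Milne & Witten 2008)."""
    a, b = set(inlinks_a), set(inlinks_b)
    shared = a & b
    if not shared:
        return 0.0
    distance = ((log(max(len(a), len(b))) - log(len(shared)))
                / (log(n_articles) - log(min(len(a), len(b)))))
    return max(0.0, 1.0 - distance)

# Toy in-link sets, made up for illustration.
inlinks = {
    "Padua": {"Veneto", "Italy", "Giotto_di_Bondone", "University_of_Padua"},
    "Rimini": {"Italy", "Giotto_di_Bondone", "Emilia-Romagna"},
}
print(relatedness(inlinks["Padua"], inlinks["Rimini"], n_articles=4_000_000))
```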

34 Training a supervised wikifier
Using WIKIPEDIA ITSELF as source of training materials (see next)

35 Results on standard datasets
Approach             AQUAINT  WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

36 Wikifying queries: the Bridgeman datasets
BAL data sets:
1049-query set: 1 annotator, up to 3 manual annotations, 1 automatic annotation
100-query set: 3 annotators, each up to 3 manual annotations

37 Results on Bridgeman 1000: Y3
Correct candidate is  Results
First candidate         64.77
Among first 2           71.59
Among first 3           75.42
Among first 4           77.18
Among first 5           78.32
Accuracy up by 17 points (36%)

38 Results for the GALATEAS languages and Arabic
Language  Wikipedia size  Results (on Wikipedia subset)
English   4M articles     84.37
Italian   1M              79.64
French    1.4M            76-77
German    1.6M            72-73
Dutch     …               70-71
Polish    900K            60.81
Arabic    200K            80.78

39 The GALATEAS D2W web services
Available as open source
Deployed within LinguaGrid
API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
Tested on 1.5M queries; achieves a throughput of 600 characters per second
Integrated with the LangLog tool

40 Use of the service in LangLog
(See Domoina’s demo)

41 Other applications The UK Data Archive

42 WIKIPEDIA FOR NER [The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

43 WIKIPEDIA FOR NER http://en.wikipedia.org/wiki/FCC:
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

44 WIKIPEDIA FOR NER Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action.

45 WIKIPEDIA

46 WIKIPEDIA FOR NER Wikipedia has been used in NER systems
As a source of features for normal NER
To automatically create training materials (DISTANT LEARNING)
To go beyond NE tagging towards proper ENTITY DISAMBIGUATION

47 Distant learning Automatically extract examples
Positive examples from the mention-to-link mapping in Wikipedia pages
Negative examples from similar mentions with other links
Use the positive and negative examples to train a model

48 The Supervised Approach: Using Wikipedia links to generate training data
Example: Giotto was called to work in Padua, and also in Rimini. (sentence taken from Wikipedia text, with links available)
Candidate targets: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
Dataset:
+1 Giotto was called to work -- Giotto_di_Bondone
-1 Giotto was called to work -- Giotto_Griffiths
-1 Giotto was called to work -- Giotto_Bizzarrini
(a sketch of this example-generation step follows)
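A minimal sketch of this example-generation step, with the candidate dictionary inlined (in practice it would come from the link statistics extracted from the dump):

```python
# Candidate target articles per surface form, extracted from Wikipedia links.
candidates = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}

def distant_examples(sentence, mention, true_target):
    """The link chosen by the Wikipedia editor is the positive example;
    every other candidate for the same mention is a negative example."""
    for target in candidates[mention]:
        label = +1 if target == true_target else -1
        yield label, sentence, target

sentence = "Giotto was called to work in Padua, and also in Rimini."
for example in distant_examples(sentence, "Giotto", "Giotto_di_Bondone"):
    print(example)
```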

49 MORE ADVANCED USES OF WIKIPEDIA
As a source of ONTOLOGICAL KNOWLEDGE: DBPEDIA

50 SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
Taxonomic information: the category structure
Attributes: infoboxes, text

51 Wikipedia category network

52 Deriving a taxonomy from Wikipedia (AAAI 2007)
Start with the category tree

53 Deriving a taxonomy from Wikipedia (AAAI 2007)
Induce a subsumption hierarchy

54 INFOBOXES Collaborative content, semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in …
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
...
For many applications, we may want to integrate semi-structured data that comes from multiple independent sources. Wikipedia is a great example of collaborative content where we find a rich and interesting collection of data from multiple authors. Especially interesting is the semi-structured data found in the infoboxes which accompany many articles. The underlying representation of the infoboxes in wikisource consists of attribute/value pairs. Note that the values may include wikilinks, which are references to other Wikipedia entries. These connections among infoboxes form a graph. A sketch of extracting these attribute/value pairs follows.
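A minimal sketch of turning such an infobox into attribute/value pairs; this toy line-based parser ignores nested templates, so treat it as illustrative only:

```python
import re

def parse_infobox(wikitext):
    """Split '| key = value' lines of an infobox into a dict."""
    fields = {}
    for line in wikitext.splitlines():
        match = re.match(r"\|\s*(\w+)\s*=\s*(.*)", line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields

infobox = """{{Infobox Writer
| name = Edgar Allan Poe
| movement = [[Romanticism]], [[Dark romanticism]]
| spouse = [[Virginia Eliza Clemm Poe]]
}}"""
print(parse_infobox(infobox))
```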

55 DBPEDIA DBpedia.org is an effort to:
extract structured information from Wikipedia
make this information available on the Web under an open license
interlink the DBpedia dataset with other datasets on the Web

56 The DBpedia Dataset
1,600,000 concepts, including:
58,000 persons
70,000 places
35,000 music albums
12,000 films
described by 91 million triples, using 8,141 different properties
557,000 links to pictures
1,300,000 links to external web pages
207,000 Wikipedia categories
75,000 YAGO categories

57 REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.

58 Extracting Infobox Data (RDF Representation):
dbpedia:native_name "Calgary";
dbpedia:altitude "1048";
dbpedia:population_city "988193";
dbpedia:population_metro "…";
mayor_name dbpedia:Dave_Bronconnier;
governing_body dbpedia:Calgary_City_Council;
...
(a sketch of building such triples programmatically follows)
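A minimal sketch of building such triples programmatically with the rdflib library (the exact namespace URIs are assumptions for illustration):

```python
from rdflib import Graph, Literal, Namespace

# Assumed DBpedia-style namespaces, for illustration only.
DBP = Namespace("http://dbpedia.org/property/")
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
calgary = DBR["Calgary"]
# Attribute/value pairs extracted from the infobox become triples.
g.add((calgary, DBP["native_name"], Literal("Calgary")))
g.add((calgary, DBP["altitude"], Literal("1048")))
g.add((calgary, DBP["mayor_name"], DBR["Dave_Bronconnier"]))

print(g.serialize(format="turtle"))
```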

59 SPARQL SPARQL is a query language for RDF. RDF is a directed, labeled graph data format for representing information on the Web. The SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.

60 The DBpedia SPARQL Endpoint
hosted on an OpenLink Virtuoso server
can answer SPARQL queries like (a sample query follows):
Give me all sitcoms that are set in NYC
All tennis players from Moscow
All films by Quentin Tarantino
All German musicians that were born in Berlin in the 19th century
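A minimal sketch of one such query against the public endpoint, using the SPARQLWrapper library; the dbo:director property and the ontology prefixes are assumptions, since DBpedia's schema has changed over the years:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
# "All films by Quentin Tarantino" -- property name is an assumption.
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?film WHERE {
        ?film dbo:director dbr:Quentin_Tarantino .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["film"]["value"])
```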

61

62 WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
Other initiatives: Citizen Science, Cognition and Language Laboratory, …
This has been taken advantage of in AI:
Open Mind Commonsense (Singh): collecting facts
Semantic Wikis

63 WEB COLLABORATION PROJECTS
Open Mind Common Sense – Singh
Crater mapping – Kanefsky
Learner / Learner2 / 1001 Paraphrases – Chklovski
FACTory – CyCORP
Hot or Not – 8 Days
ESP / Phetch / Verbosity / Peekaboom – von Ahn
Galaxy Zoo – Oxford University

64 OPEN MIND COMMONSENSE

65

66 CONCEPT NET

67 GAMES WITH A PURPOSE Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

68 GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
GWAPs do not rely on altruism or financial incentives to entice people to perform certain actions
The key property of games is that PEOPLE WANT TO PLAY THEM

69 EXAMPLES OF GWAP
Games at www.gwap.com: ESP, Verbosity, TagATune
Other games: Peekaboom, Phetch

70 ESP The first GWAP, developed by von Ahn and his group (2003/2004)
The problem: obtain accurate descriptions of images, to be used
to train image search engines
to develop machine learning approaches to vision
The goal: label the majority of the images on the Web

71 ESP: the game

72 ESP: THE GAME Two partners are picked at random from the large number of players online
They are not told who their partner is, and cannot communicate with them
They are both shown the same image
The goal: guess how their partner will describe the image, and type that description; hence the name, the ESP game
If any string typed by one player matches a string typed by the other player, they score points
(a sketch of the matching logic follows)
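A minimal sketch of the agreement rule, assuming each player's guesses arrive turn by turn as lists of strings; the taboo filter mirrors the real game's taboo words, and scoring details are omitted:

```python
def play_round(guesses_a, guesses_b, taboo=()):
    """Return a label both players typed (ignoring case and taboo
    words), or None if they never agreed."""
    seen_a, seen_b = set(), set()
    for a, b in zip(guesses_a, guesses_b):   # guesses arrive turn by turn
        seen_a.add(a.lower())
        seen_b.add(b.lower())
        agreed = (seen_a & seen_b) - set(taboo)
        if agreed:
            return agreed.pop()              # match: both players score
    return None

# The two players agree (on "puppy" or "grass") at the third guess.
print(play_round(["dog", "puppy", "grass"], ["animal", "grass", "puppy"]))
```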

73 THE TASK

74 SCORING BY MATCHING

75 SOME STATISTICS In the 4 months between August 9th, 2003 and December 10th, 2003:
13,630 players
1.2 million labels for 293,760 images
80% of players played more than once
By 2008: 200,000 players, 50 million labels

76 QUALITY OF THE LABELS
For IMAGE SEARCH: choose 10 labels among those produced and look at which images are returned
Comparison with labels produced by participants in an experiment: 15 participants, 20 images among the 1,000 with more than 5 labels; 83% of game labels were also produced by participants
Manual assessment of labels ('would you use these labels to describe this image?'): 15 participants, 20 images; 85% of words rated useful

77 GOOGLE IMAGE LABELLER

78 THE TASK

79 RESULTS

80 PHRASE DETECTIVES

81 PHRASE DETECTIVES: THE TASKS
Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

82 NAME THE CULPRIT

83 READINGS
Mihalcea, R. & Csomai, A. (2007). Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
Nguyen, V. & Poesio, M. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
Lungley, D., Trevisan, M., Nguyen, V., Althobaiti, M. & Poesio, M. GALATEAS D2W: A multi-lingual disambiguation to Wikipedia web service. Proceedings of ENRICH.
Nastase, V. & Strube, M. (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.

84 READINGS
von Ahn, L. & Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
Poesio, M., Chamberlain, J., Kruschwitz, U., Robaldo, L. & Ducceschi, L. Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.

