Thesauruses for Natural Language Processing Adam Kilgarriff Lexicography MasterClass and University of Brighton

Outline
–Definition
–Uses for NLP
–WASPS thesaurus
–web thesauruses
–Argument: words not word senses
–Evaluation proposals
–Cyborgs

What is a thesaurus? A resource that groups words according to similarity.

Manual and automatic
Manual
–Roget, WordNets, many publishers
Automatic
–Sparck Jones (1960s), Grefenstette (1994), Lin (1998), Lee (1999)
–aka distributional
–two words are similar if they occur in the same contexts
Are they comparable?

Thesauruses in NLP
sparse data: does x go with y?
–don’t know, they have never been seen together
New question: does x+friends go with y+friends?
–indirect evidence for x and y
–the thesaurus tells us who the friends are
–“backing off”
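
A minimal sketch of this backing-off idea, assuming a hypothetical co-occurrence table (`cooc`) and a hypothetical neighbour list (`neighbours`); in practice these would come from corpus counts and a distributional thesaurus such as WASPS:

```python
# Sketch: is there evidence that x goes with y? If the pair itself is unseen,
# pool counts over x's and y's thesaurus neighbours ("friends") instead.
from typing import Dict, List, Tuple

def backed_off_count(x: str, y: str,
                     cooc: Dict[Tuple[str, str], int],
                     neighbours: Dict[str, List[str]]) -> int:
    """Direct count if available, otherwise indirect evidence via friends."""
    direct = cooc.get((x, y), 0)
    if direct:
        return direct
    x_friends = [x] + neighbours.get(x, [])
    y_friends = [y] + neighbours.get(y, [])
    return sum(cooc.get((a, b), 0) for a in x_friends for b in y_friends)

# Toy data: "catch pike" is unseen, but carp and bream (pike's friends) are caught.
cooc = {("catch", "carp"): 3, ("catch", "bream"): 2}
neighbours = {"pike": ["carp", "bream", "cod"]}
print(backed_off_count("catch", "pike", cooc, neighbours))  # 5
```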

Relevant in:
Parsing
–PP-attachment
–conjunction scope
Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

Speech understanding
He’s as headstrong as an alleg***** in the upwaters of the Yangtze
allegory? in upwaters? No
alligator? in upwaters? No
allegory+friends in upwaters? No
alligator+friends in upwaters? Yes

PP-attachment
investigate stromatolite with microscope/speckles
–microscope: verb attachment
–speckles: noun attachment
inspect jasper with spectrometer
–which?

PP attachment (cont)
compare frequencies of –
both zero? Try –
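
A rough sketch of that comparison, with hypothetical count tables `vpn` (verb, preposition, noun) and `npn` (noun, preposition, noun) and a hypothetical thesaurus neighbour list; when both direct counts are zero, it backs off to the words' friends:

```python
# Sketch: PP attachment by comparing triple frequencies, backing off to
# thesaurus neighbours when neither triple has been seen directly.
def attach(verb, noun1, prep, noun2, vpn, npn, neighbours):
    """Return 'verb' or 'noun' attachment for `verb noun1 prep noun2`."""
    def evidence(table, head):
        direct = table.get((head, prep, noun2), 0)
        if direct:
            return direct
        heads = [head] + neighbours.get(head, [])
        objs = [noun2] + neighbours.get(noun2, [])
        return sum(table.get((h, prep, o), 0) for h in heads for o in objs)

    v_score = evidence(vpn, verb)   # attach the PP to the verb?
    n_score = evidence(npn, noun1)  # or to the object noun?
    return "verb" if v_score >= n_score else "noun"

# Toy data for "inspect jasper with spectrometer".
vpn = {("investigate", "with", "microscope"): 4}
npn = {("stromatolite", "with", "speckles"): 2}
neighbours = {"inspect": ["investigate", "examine"], "spectrometer": ["microscope"]}
print(attach("inspect", "jasper", "with", "spectrometer", vpn, npn, neighbours))  # verb
```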

Conjunction scope
Compare
–old boots and shoes
–old boots and apples
Are the shoes old? Are the apples old?
Hypothesis:
–wide scope only when words are similar
hard problem: thesaurus might help
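
A sketch of that hypothesis as a rule, where `similarity` stands in for a thesaurus similarity score (the scores below are invented):

```python
# Sketch: "old boots and shoes" vs "old boots and apples" -- give the adjective
# wide scope only when the conjoined nouns are thesaurus-similar.
def wide_scope(noun1: str, noun2: str, similarity, threshold: float = 0.3) -> bool:
    """Does the adjective on noun1 also modify noun2?"""
    return similarity(noun1, noun2) >= threshold

toy = {frozenset({"boot", "shoe"}): 0.7, frozenset({"boot", "apple"}): 0.05}
sim = lambda a, b: toy.get(frozenset({a, b}), 0.0)
print(wide_scope("boot", "shoe", sim))   # True: the shoes are old too
print(wide_scope("boot", "apple", sim))  # False: only the boots are old
```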

Bridging anaphor resolution
–Maria bought a large apple. The fruit was red and crisp.
fruit and apple co-refer
How to find co-referring terms?
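
One simple answer is to pick the most thesaurus-similar earlier mention; a sketch with an invented similarity lookup:

```python
# Sketch: resolve the bridging description "the fruit" to the earlier mention
# it is most thesaurus-similar to.
def resolve_bridging(anaphor, candidates, similarity):
    return max(candidates, key=lambda c: similarity(anaphor, c))

toy = {frozenset({"fruit", "apple"}): 0.8}
sim = lambda a, b: toy.get(frozenset({a, b}), 0.0)
print(resolve_bridging("fruit", ["Maria", "apple"], sim))  # apple
```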

Text cohesion
words on same theme –same segment
change in theme of words –new segment
same theme: same thesaurus class
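
A sketch of segmenting on that basis, assuming a hypothetical `word_class` mapping from words to thesaurus classes:

```python
# Sketch: start a new segment wherever adjacent sentences share no thesaurus class.
def segment_boundaries(sentences, word_class):
    def classes(sent):
        return {word_class[w] for w in sent if w in word_class}
    return [i for i in range(1, len(sentences))
            if not classes(sentences[i - 1]) & classes(sentences[i])]

word_class = {"pike": "FISH", "carp": "FISH", "rod": "TACKLE",
              "alligator": "REPTILE", "crocodile": "REPTILE"}
sents = [["caught", "pike"], ["carp", "rod"], ["alligator", "basked"]]
print(segment_boundaries(sents, word_class))  # [2]: theme changes before sentence 3
```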

Word Sense Disambiguation (WSD)
pike: fish or weapon
–We caught a pike this afternoon
probably no direct evidence for
–catch pike
probably is direct evidence for
–catch {pike,carp,bream,cod,haddock,…}
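
A sketch of WSD along those lines: score each sense by how often the verb is seen with any member of that sense's thesaurus class (the class lists and counts below are toy values):

```python
# Sketch: "We caught a pike" -- which sense's thesaurus class does "catch" go with?
def disambiguate(context_verb, sense_classes, verb_obj_counts):
    def score(members):
        return sum(verb_obj_counts.get((context_verb, m), 0) for m in members)
    return max(sense_classes, key=lambda sense: score(sense_classes[sense]))

sense_classes = {"fish":   ["pike", "carp", "bream", "cod", "haddock"],
                 "weapon": ["pike", "spear", "lance", "halberd"]}
verb_obj_counts = {("catch", "carp"): 7, ("catch", "cod"): 12}
print(disambiguate("catch", sense_classes, verb_obj_counts))  # fish
```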

WordNet, Roget widely used for all the above

The WASPS thesaurus
–credit: David Tugwell
–EPSRC grant K8931
POS-tag, lemmatise and parse the BNC (100M words)
Find all grammatical relations –70 million triples
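
WASPS used its own tagging and parsing pipeline over the BNC; purely as an illustrative stand-in, the sketch below uses spaCy to show the shape of the (lemma, grammatical relation, lemma) triples being collected:

```python
# Sketch: extract (head lemma, grammatical relation, dependant lemma) triples.
# spaCy is only a stand-in here for the tagger/lemmatiser/parser WASPS used.
import spacy

nlp = spacy.load("en_core_web_sm")
RELATIONS = {"nsubj", "dobj", "amod", "prep", "pobj"}

def triples(text):
    for tok in nlp(text):
        if tok.dep_ in RELATIONS:
            yield (tok.head.lemma_, tok.dep_, tok.lemma_)

print(list(triples("We caught a large pike this afternoon.")))
# e.g. [('catch', 'nsubj', 'we'), ('pike', 'amod', 'large'), ('catch', 'dobj', 'pike')]
```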

WASPS thesaurus (cont)
Similarity:
–one point of similarity between beer and wine
count all points of similarity between all pairs of words
weight according to frequencies
–product of MI: Lin (1998)
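
The weighting is only gestured at on the slide; one concrete reading, in the spirit of Lin (1998), scores each shared triple by the mutual-information weight it carries for both words and normalises by the words' total weights. A sketch with invented MI values:

```python
# Sketch of a Lin (1998)-style similarity over MI-weighted grammatical-relation
# features. features[word] maps (relation, other word) -> positive MI weight;
# the numbers are invented for illustration.
def lin_similarity(w1, w2, features):
    f1, f2 = features[w1], features[w2]
    shared = set(f1) & set(f2)                     # points of similarity
    numerator = sum(f1[f] + f2[f] for f in shared)
    denominator = sum(f1.values()) + sum(f2.values())
    return numerator / denominator if denominator else 0.0

features = {
    "beer": {("object-of", "drink"): 3.2, ("object-of", "brew"): 4.1, ("modifier", "cold"): 1.5},
    "wine": {("object-of", "drink"): 3.0, ("object-of", "pour"): 2.2, ("modifier", "red"): 2.8},
}
print(round(lin_similarity("beer", "wine", features), 3))  # 0.369
```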

Word Sketches
one-page summary of a word’s grammatical and collocational behaviour
demo: the Sketch Engine
–input any corpus
–generate word sketches and thesaurus
–just available now

Nearest neighbours to zebra

Nearest neighbours zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy
pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

VERBS
measure determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust
boil simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

ADJECTIVES
hypnotic haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky
pink purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

Nearest neighbours
crane winch swan heron
winch crane heron tern
heron mast crane gull
tractor rigging gull swan
truck pump tern crane
swan tractor curlew flamingo

no clustering (tho’ could be done)
no hierarchy (tho’ could be done)
rhythm
all on the web:
–registration required

The web
an enormous linguist’s playground
–Computational Linguistics Special Issue, Kilgarriff and Grefenstette (eds) 29 (3) (coming soon)

Google sets
Input: zebra giraffe buffalo
Output: kudu hyena impala leopard hippo waterbuck elephant cheetah eland

Google sets
Input: harbin beijing nanking
Output: shanghai chengdu guangzhou hangzhou changchun zhejiang kunming dalian jinan fuzhou
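
The same kind of set expansion can be sketched with a distributional thesaurus: rank words by how many of the seed words list them as nearest neighbours (the neighbour lists below are invented):

```python
# Sketch: Google Sets-style expansion from thesaurus nearest-neighbour lists.
from collections import Counter

def expand(seeds, neighbours, n=5):
    votes = Counter()
    for seed in seeds:
        for w in neighbours.get(seed, []):
            if w not in seeds:
                votes[w] += 1          # one vote per seed that lists w
    return [w for w, _ in votes.most_common(n)]

neighbours = {
    "zebra":   ["giraffe", "buffalo", "antelope", "hippo", "cheetah"],
    "giraffe": ["zebra", "antelope", "elephant", "hippo"],
    "buffalo": ["zebra", "antelope", "elephant", "bison"],
}
print(expand(["zebra", "giraffe", "buffalo"], neighbours))
# antelope comes first: it is a neighbour of all three seeds
```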

Tree structure
Roget
–all human knowledge as tree structure
–1000 top categories
subdivisions
–like this
»etc

Directories and thesauruses
Yahoo, Open Directory Project
–all human activity as a tree structure, plus a corpus at every node
–gather corpus, identify domain vocabulary
Gonzalo and colleagues, Madrid, CL Special Issue
Agirre and colleagues, ‘topic signatures’

Words and word senses
automatic thesauruses
–words
manual thesauruses
–simple hierarchy is appealing
–homonyms
–“aha! objects must be word senses”

Problems
Theoretical
Practical

Theoretical

Wittgenstein Don’t ask for the meaning, ask for the use

Practical

Problems
Practical
–a thesaurus is a tool
–if the tool organises word senses, you must do WSD before you can use it
–WSD: state of the art, optimal conditions: 80%
“To use this tool, first replace one fifth of your input with junk”

Avoid word senses
This word has three meanings/senses
This word has three kinds of use
–well founded
–empirical
–we can study it

sorry, Roget

sorry, AI

AI model for NLP:
–NLP turns text into meanings
–AI reasons over meanings
–word meanings are concepts in an ontology
–a Roget-like thesaurus is (to a good approximation) an ontology
–Guarino: “cleansing” WordNet
If a thesaurus groups words in their various uses (not meanings)
–not the sort of thing AI can reason over

sorry, AI
“linguistic expressions prompt for meanings rather than express meanings”
–Fauconnier and Turner 2003
It would be nice if … But …

Evaluation
manual thesauruses
–not done
automatic thesauruses: attempts
–pseudo-disambiguation (Lee 1999)
–with ref to manual ones (Lin 1998)
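
The pseudo-disambiguation set-up can be sketched roughly as follows (in the spirit of Lee 1999, not her exact protocol): hold out attested verb-object pairs, pair each noun with an unattested confounder, and count how often the thesaurus-smoothed score prefers the attested noun. The `score` function and the test items here are hypothetical:

```python
# Sketch: pseudo-disambiguation evaluation of a thesaurus-smoothed model.
def pseudo_disambiguation_accuracy(test_items, score):
    """test_items: (verb, attested_noun, confounder_noun) triples."""
    correct = sum(score(v, good) > score(v, bad) for v, good, bad in test_items)
    return correct / len(test_items)

items = [("drink", "beer", "tractor"), ("catch", "pike", "allegory")]
toy_score = lambda v, n: {("drink", "beer"): 2.0, ("catch", "pike"): 1.5}.get((v, n), 0.0)
print(pseudo_disambiguation_accuracy(items, toy_score))  # 1.0
```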

Task-based evaluation

Parsing
–PP-attachment
–conjunction scope
Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

What is performance at the task
–with no thesaurus
–with Roget
–with WordNet
–with WASPS

Plans set up evaluation tasks theseval web-based thesaurus –Open Directory Project hierarchies campaign

Cyborgs
Robots: will they take over?
Rod Brooks’s answer:
–Wrong question: greatest advances are in what the human+computer ensemble can do

Cyborgs A creature that is partly human and partly machine –Macmillan English Dictionary

Cyborgs and the Information Society
The thesaurus-making agent is part human (for precision), part computer (for recall).

Summary: Thesauruses for NLP
–Definition
–Uses for NLP
–WASPS thesaurus
–web thesauruses
–Argument: words not word senses
–Evaluation proposals
–Cyborgs

Thesaurus-makers of the future?