Thesauruses for Natural Language Processing

Thesauruses for Natural Language Processing Adam Kilgarriff Lexicography MasterClass and University of Brighton

Outline
Definition
Uses for NLP
WASPS thesaurus
Web thesauruses
Argument: words not word senses
Evaluation proposals
Cyborgs

What is a thesaurus?
A resource that groups words according to similarity

Manual and automatic
Manual
  Roget, WordNets, many publishers
Automatic
  Sparck Jones (1960s), Grefenstette (1994), Lin (1998), Lee (1999)
  aka distributional
  two words are similar if they occur in the same contexts
Are they comparable?

Thesauruses in NLP
Sparse data
Does x go with y?
  don’t know, they have never been seen together
New question: does x+friends go with y+friends?
  indirect evidence for x and y
  thesaurus tells us who the friends are
  “backing off”
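
The backing-off idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts, the thesaurus entries and the names `PAIR_COUNTS`, `THESAURUS`, `with_friends` and `goes_with` are all invented here and do not come from the talk.

```python
from collections import Counter

# Invented co-occurrence counts and thesaurus entries, for illustration only.
PAIR_COUNTS = Counter({
    ("crocodile", "river"): 4,
    ("caiman", "upwaters"): 2,
    ("fable", "chapter"): 3,
})
THESAURUS = {
    "alligator": ["crocodile", "caiman"],
    "allegory": ["fable", "parable"],
    "upwaters": ["river", "headwaters"],
}

def with_friends(word):
    """Return the word together with its thesaurus neighbours."""
    return [word] + THESAURUS.get(word, [])

def goes_with(x, y):
    """Return ('direct', n) if x and y were seen together; otherwise back
    off to counts over x+friends and y+friends and return ('backoff', n)."""
    direct = PAIR_COUNTS[(x, y)]
    if direct > 0:
        return ("direct", direct)
    backed_off = sum(PAIR_COUNTS[(a, b)]
                     for a in with_friends(x)
                     for b in with_friends(y))
    return ("backoff", backed_off)
```

With these toy counts, `goes_with("alligator", "upwaters")` finds no direct evidence but does find indirect evidence through the friends, while `goes_with("allegory", "upwaters")` finds none either way.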

Relevant in:
Parsing
  PP-attachment
  conjunction scope
Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

Speech understanding
He’s as headstrong as an alleg***** in the upwaters of the Yangtze
allegory? in upwaters? No
alligator? in upwaters? No
allegory+friends in upwaters? No
alligator+friends in upwaters? Yes

PP-attachment
investigate stromatolite with microscope/speckles
  microscope: verb attachment
  speckles: noun attachment
inspect jasper with spectrometer
  which?

PP attachment (cont)
Compare frequencies of
  <inspect, with, spectrometer>
  <jasper, with, spectrometer>
Both zero? Try
  <inspect+friends, with, spectrometer+friends>
  <jasper+friends, with, spectrometer+friends>
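
A minimal sketch of this comparison follows, assuming triples are stored as simple counts. The triple counts, the `FRIENDS` table and the function names are hypothetical; the sketch only mirrors the two steps on the slide (compare direct frequencies, then back off to friends when both are zero).

```python
from collections import Counter

# Invented <head, preposition, object> counts; not real corpus figures.
TRIPLES = Counter({
    ("examine", "with", "microscope"): 5,
    ("analyse", "with", "spectrometer"): 3,
    ("quartz", "with", "inclusions"): 2,
})
# Invented thesaurus entries ("friends").
FRIENDS = {
    "inspect": ["examine", "analyse"],
    "jasper": ["quartz", "agate"],
    "spectrometer": ["microscope"],
}

def expand(word):
    """Return the word plus its friends."""
    return [word] + FRIENDS.get(word, [])

def triple_count(head, prep, obj, use_friends=False):
    """Count <head, prep, obj>, optionally summing over friends of head and obj."""
    heads = expand(head) if use_friends else [head]
    objs = expand(obj) if use_friends else [obj]
    return sum(TRIPLES[(h, prep, o)] for h in heads for o in objs)

def attach(verb, noun, prep, obj):
    """Return 'verb', 'noun' or 'unknown' for 'verb noun prep obj'."""
    v = triple_count(verb, prep, obj)
    n = triple_count(noun, prep, obj)
    if v == 0 and n == 0:
        # Both zero: back off to the +friends triples.
        v = triple_count(verb, prep, obj, use_friends=True)
        n = triple_count(noun, prep, obj, use_friends=True)
    if v == n:
        return "unknown"
    return "verb" if v > n else "noun"
```

With these toy counts, `attach("inspect", "jasper", "with", "spectrometer")` has no direct evidence and is decided by the friends of inspect and spectrometer.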

Conjunction scope
Compare
  old boots and shoes: are the shoes old?
  old boots and apples: are the apples old?
Hypothesis: wide scope only when words are similar
Hard problem: thesaurus might help
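
The hypothesis on this slide could be tried out roughly as follows. The neighbour lists and the names `NEIGHBOURS`, `similar` and `conjunction_scope` are assumptions made for this sketch; a real test would take the neighbours from a thesaurus and would need a better similarity criterion than simple list membership.

```python
# Invented nearest-neighbour lists; a real thesaurus would supply these.
NEIGHBOURS = {
    "boots": {"shoes", "sandals", "trainers"},
    "shoes": {"boots", "sandals", "slippers"},
    "apples": {"pears", "oranges", "plums"},
}

def similar(a, b):
    """Treat two words as similar if either lists the other as a neighbour."""
    return b in NEIGHBOURS.get(a, set()) or a in NEIGHBOURS.get(b, set())

def conjunction_scope(noun1, noun2):
    """Guess the scope of a modifier in 'MODIFIER noun1 and noun2':
    'wide' (covers both nouns) only when the nouns are similar."""
    return "wide" if similar(noun1, noun2) else "narrow"
```

Under these toy lists the modifier in "old boots and shoes" gets wide scope and the one in "old boots and apples" gets narrow scope.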

Bridging anaphor resolution
Maria bought a large apple. The fruit was red and crisp.
fruit and apple co-refer
How to find co-referring terms?

Text cohesion
Words on same theme: same segment
Change in theme of words: new segment
Same theme: same thesaurus class
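
As a rough sketch of how thesaurus classes might drive segmentation, the code below starts a new segment whenever the dominant class of a sentence changes. The class table and the names `THESAURUS_CLASS`, `dominant_class` and `segment` are invented for illustration; this is not a method described in the talk.

```python
from collections import Counter

# Invented word-to-class table standing in for a thesaurus.
THESAURUS_CLASS = {
    "apple": "food", "fruit": "food", "crisp": "food",
    "bank": "finance", "loan": "finance", "interest": "finance",
}

def dominant_class(words):
    """Return the most frequent thesaurus class among the words, or None."""
    counts = Counter(THESAURUS_CLASS[w] for w in words if w in THESAURUS_CLASS)
    return counts.most_common(1)[0][0] if counts else None

def segment(sentences):
    """Group tokenised sentences; a change of dominant class opens a new segment."""
    segments = []
    previous = None
    for sentence in sentences:
        theme = dominant_class(sentence)
        if segments and (theme is None or theme == previous):
            segments[-1].append(sentence)  # same theme (or no evidence): same segment
        else:
            segments.append([sentence])    # change in theme: new segment
        if theme is not None:
            previous = theme
    return segments
```

For four sentences whose themes run food, food, finance, finance, `segment` returns two segments of two sentences each.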

Word Sense Disambiguation (WSD)
pike: fish or weapon
We caught a pike this afternoon
Probably no direct evidence for catch pike
Probably is direct evidence for catch {pike, carp, bream, cod, haddock, …}
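
The pike example can be sketched as a sum of verb-object counts over each candidate class. Everything in the block is an assumption made for illustration: the counts, the two class lists and the names `VERB_OBJECT`, `CLASSES`, `class_evidence` and `disambiguate`.

```python
from collections import Counter

# Invented verb-object counts; note there is no direct count for (catch, pike).
VERB_OBJECT = Counter({
    ("catch", "carp"): 7,
    ("catch", "cod"): 5,
    ("wield", "spear"): 2,
})
# Invented thesaurus classes containing the ambiguous word.
CLASSES = {
    "fish": ["pike", "carp", "bream", "cod", "haddock"],
    "weapon": ["pike", "spear", "lance", "halberd"],
}

def class_evidence(verb, noun):
    """For each class containing noun, sum the counts of verb with any member."""
    return {
        label: sum(VERB_OBJECT[(verb, member)] for member in members)
        for label, members in CLASSES.items()
        if noun in members
    }

def disambiguate(verb, noun):
    """Return the class with the most evidence, or None if there is none."""
    evidence = class_evidence(verb, noun)
    if not evidence or max(evidence.values()) == 0:
        return None
    return max(evidence, key=evidence.get)
```

Here `disambiguate("catch", "pike")` picks the fish class purely on the strength of catch carp and catch cod.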

WordNet, Roget widely used for all the above

The WASPS thesaurus
Credit: David Tugwell, EPSRC grant K8931
POS-tag, lemmatise and parse the BNC (100M words)
Find all grammatical relations
  <obj, climb, bank>
  <modifier, big, bank>
  <subject, bank, refuse>
70 million triples

WASPS thesaurus (cont)
Similarity:
  <obj, drink, beer>
  <obj, drink, wine>
  one point of similarity between beer and wine
Count all points of similarity between all pairs of words
Weight according to frequencies
  product of MI: Lin (1998)
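
One possible reading of this slide is sketched below: each shared (relation, head) context is a point of similarity, weighted by the product of the two words' mutual information with that context. This is an assumption about what "product of MI" means, not the WASPS implementation and not necessarily the measure published in Lin (1998). The triples are invented, and pointwise MI is clipped at zero for simplicity.

```python
import math
from collections import Counter

# Invented <relation, head, word> triples standing in for parsed corpus output.
TRIPLES = [
    ("obj", "drink", "beer"), ("obj", "drink", "beer"),
    ("obj", "drink", "wine"), ("obj", "brew", "beer"),
    ("obj", "pour", "wine"), ("obj", "pour", "beer"),
    ("obj", "climb", "bank"), ("obj", "rob", "bank"),
]

pair_freq = Counter(((rel, head), word) for rel, head, word in TRIPLES)
ctx_freq = Counter((rel, head) for rel, head, _ in TRIPLES)
word_freq = Counter(word for _, _, word in TRIPLES)
total = len(TRIPLES)

def mi(word, ctx):
    """Pointwise mutual information of word and context, clipped at zero."""
    joint = pair_freq[(ctx, word)]
    if joint == 0:
        return 0.0
    return max(0.0, math.log2(joint * total / (word_freq[word] * ctx_freq[ctx])))

def shared_contexts(w1, w2):
    """Contexts in which both words occur: the 'points of similarity'."""
    return [c for c in ctx_freq if pair_freq[(c, w1)] and pair_freq[(c, w2)]]

def similarity(w1, w2):
    """Sum over shared contexts of the product of the two MI scores."""
    return sum(mi(w1, c) * mi(w2, c) for c in shared_contexts(w1, w2))
```

With these toy triples, beer and wine share the contexts <obj, drink> and <obj, pour> and so score above zero, while beer and bank share none and score zero.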

Word Sketches
One-page summary of a word’s grammatical and collocational behaviour
Demo: http://wasps.itri.bton.ac.uk
The Sketch Engine
  input any corpus
  generate word sketches and thesaurus
  just available now

Nearest neighbours
zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy
pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

VERBS
measure determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust
boil simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

ADJECTIVES
hypnotic haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky
pink purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

Nearest neighbours
crane: winch swan heron tern mast gull tractor rigging truck pump curlew flamingo

No clustering (though could be done)
No hierarchy (though could be done)
rhythm
All on the web: http://wasps.itri.bton.ac.uk (registration required)

The web
An enormous linguist’s playground
Computational Linguistics Special Issue, Kilgarriff and Grefenstette (eds), 29 (3) (coming soon)

Google sets
http://labs.google.com/sets
Input: zebra giraffe buffalo
Output: kudu hyena impala leopard hippo waterbuck elephant cheetah eland

Google sets
http://labs.google.com/sets
Input: harbin beijing nanking
Output: shanghai chengdu guangzhou hangzhou changchun zhejiang kunming dalian jinan fuzhou

Tree structure
Roget
  all human knowledge as tree structure
  1000 top categories
  subdivisions like this etc

Directories and thesauruses
Yahoo, http://www.yahoo.com
Open directory project, http://dmoz.org
All human activity as tree structure
Plus corpus at every node
Gather corpus, identify domain vocabulary
  Gonzalo and colleagues, Madrid, CL Special Issue
  Agirre and colleagues, ‘topic signatures’

Words and word senses
Automatic thesauruses
  words
Manual thesauruses
  simple hierarchy is appealing
  homonyms
  “aha! objects must be word senses”

Problems Theoretical Practical

Theoretical

Wittgenstein
Don’t ask for the meaning, ask for the use

Practical

Problems: Practical
A thesaurus is a tool
If the tool organises word senses, you must do WSD before you can use it
WSD: state of the art, optimal conditions: 80%
“To use this tool, first replace one fifth of your input with junk”

Avoid word senses
This word has three meanings/senses
This word has three kinds of use
  well founded
  empirical
  we can study it

Sorry, Roget

Sorry, AI
AI model for NLP:
  NLP turns text into meanings
  AI reasons over meanings
  word meanings are concepts in an ontology
  a Roget-like thesaurus is (to a good approximation) an ontology
  Guarino: “cleansing” WordNet
If a thesaurus groups words in their various uses (not meanings), it is not the sort of thing AI can reason over

Sorry, AI
“linguistic expressions prompt for meanings rather than express meanings” (Fauconnier and Turner 2003)
It would be nice if …
But …

Evaluation
Manual thesauruses: not done
Automatic thesauruses: attempts
  pseudo-disambiguation (Lee 1999)
  with reference to manual ones (Lin 1998)

Task-based evaluation
Parsing
  PP-attachment
  conjunction scope
Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

What is performance at the task
  with no thesaurus
  with Roget
  with WordNet
  with WASPS

Plans
Set up evaluation tasks
theseval
Web-based thesaurus campaign
Open Directory Project hierarchies campaign

Cyborgs
Robots: will they take over?
Rod Brooks’s answer:
  Wrong question: greatest advances are in what the human+computer ensemble can do

Cyborgs
“A creature that is partly human and partly machine” (Macmillan English Dictionary)

Cyborgs and the Information Society
The thesaurus-making agent is part human (for precision), part computer (for recall).

Summary: Thesauruses for NLP
Definition
Uses for NLP
WASPS thesaurus
Web thesauruses
Argument: words not word senses
Evaluation proposals
Cyborgs

Thesaurus-makers of the future?