1 The Long Road from Text to Meaning Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.

Slides:



Advertisements
Similar presentations
Grammar is to Meaning as the Law if to Good Behaviour Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Advertisements

The Cambridge Learner Corpus, English Profile, the Sketch Engine and the Kelly Project Adam Kilgarriff Lexical Computing Ltd
Finding multiwords of more than two words Adam Kilgarriff, Pavel Rychly, Vojtech Kovar, Vıt Baisa Lexical Computing Ltd; Masaryk Univ., Cz.
1 Corpora for all Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Linking Dictionary and Corpus Adam Kilgarriff Lexicography MasterClass Ltd Lexical Computing Ltd University of Sussex UK.
1 Corpora for the coming decade Adam Kilgarriff. Dublin June 2009 Kilgarriff: Corpora for the coming decade2 How should they be different?  Bigger 
Between Corpus and Dictionary Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds, Sussex.
L EARNERS ’ D ICTIONARY Deny A. Kwary
Augmenting online dictionary entries with corpus data for Search Engine Optimisation Holger Hvelplund, 1 Adam Kilgarriff, 2 Vincent Lannoy, 1 Patrick White.
1 Chinese WordSketch Online, corpus-based summaries of word usage.
Thesauruses for Natural Language Processing Adam Kilgarriff Lexicography MasterClass and University of Brighton.
How do we work in a virtual multilingual classroom? A virtual multilingual classroom with Moodle and Apertium Cultural and Linguistic Practices in the.
The Sketch Engine -What is The Sketch Engine? -What is a corpus? -Looking at the BASE and the BAWE corpora. -How can this help.
January 12, Statistical NLP: Lecture 2 Introduction to Statistical NLP.
Natural Language and Speech Processing Creation of computational models of the understanding and the generation of natural language. Different fields coming.
Today Listening test Corpus linguistics talk, Part 3 News task NEOs Life on Mars.
1/7 INFO60021 Natural Language Processing Harold Somers Professor of Language Engineering.
Talking about your homework News story? –What made you choose…? One of your words? –What made you choose…? (Give your vocabulary books to another student.
Are Linguists Dinosaurs? 1.Statistical language processors seem to be doing away with the need for linguists. –Why do we need linguists when a machine.
1 Corpora for the coming decade Adam Kilgarriff Lexical Computing Ltd.
Today Writing: using the comma –Writing task Corpus linguistics talk, Part 2 Re-organize groups –Group news discussion.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Corpus Linguistics: session 2 Corpus Linguistics (2): The Tools of the Trade 669o4zt
1 Evaluating word sketches Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
First International Sketch Grammar Workshop Ljubljana 3-4 February 2010.
1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
Word senses Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds, Sussex.
GDEX: Automatically finding good dictionary examples in a corpus Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý Lexical Computing.
1 Corpora, Dictionaries, and points in between in the age of the web Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of.
Researching language with computers Paul Thompson.
1 Chinese WordSketch Engine Online, corpus-based summaries of word usage.
1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
Why We Need Corpora and the Sketch Engine Adam Kilgarriff Lexical Computing Ltd, UK Universities of Leeds and Sussex.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Without data, nothing Adam Kilgarriff Lexical Computing Ltd University of Leeds.
Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
The Current State of FrameNet CLFNG June 26, 2006 Fillmore.
TALC Applying some Developments in Corpus Building Technology to Language Teaching and Learning TALC 2006 Paris.
1 Using Corpora in Language Research -also Introduction to the Sketch Engine (WS15) part 1 Adam Kilgarriff Lexical Computing Ltd Universities of Leeds.
1 Evaluating word sketches and corpora Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Corpus Evaluation Adam Kilgarriff Lexical Computing Ltd Corpus evaluationPortsmouth Nov
1 Word senses: a computational response Adam Kilgarriff Auckland 2012Kilgarriff: Word senses: a computational response.
Using Corpora in Language Research Adam Kilgarriff Lexical Computing Ltd Universities of Leeds January 2013Adam Kilgarriff.
Malta, May 2010Kilgarriff: Corpora by Web Services1 Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities.
인공지능 연구실 황명진 FSNLP Introduction. 2 The beginning Linguistic science 의 4 부분 –Cognitive side of how human acquire, produce, and understand.
Natural Language Processing Spring 2007 V. “Juggy” Jagannathan.
CL 2005, Birmingham Web as Corpus Workshop Intro: Adam Kilgarriff 1 Web as Corpus Workshop Co-chairs: Marco Baroni Adam Kilgarriff Sebastian Hoffman.
The Sketch Engine as Infrastructure for Large Scale Text Collections for Humanities Research Adam Kilgarriff Lexical Computing Ltd. & Univ of Leeds, UK.
1 Collocationality and how to measure it Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
Using Corpora in Linguistics and Lexicography Adam Kilgarriff Lexical Computing Ltd Universities of Leeds, Sussex, UK.
Do we need lexicographers? Prospects for automatic lexicography Adam Kilgarriff Lexical Computing Ltd University of Leeds UK.
Ontologies and Terminology and how they relate to lexicography Adam Kilgarriff Auckland 20121Kilgarriff: Ontologies and Terminology.
1 Word senses: a computational response Adam Kilgarriff.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Introduction Chapter 1 Foundations of statistical natural language processing.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
Sketch engine for Chinese Discussion notes. Wordsketch, subsequently Sketch Engine Was developed by Kilgarriff et al at Brighton Gives automatic, corpus-based.
Grammar is to Meaning as the Law if to Good Behaviour Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Learners' Dictionaries Oxford1948 Longman1978 Collins COBUILD1987 Macmillan2002 Macmillan2008 (bilingualized) Merriam-Webster2008 Jackson, Howard
GDEX: Automatically finding good dictionary examples in a corpus Auckland 2012Kilgarriff: GDEX1.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
1 Word senses: a computational response Adam Kilgarriff.
What is a Corpus? What is not a corpus?  the Web  collection of citations  a text Definition of a corpus “A corpus is a collection of pieces of language.
GDEX: Automatically finding good dictionary examples in a corpus Kivik 2013Kilgarriff: GDEX1.
Use of Concordancers A corpus (plural corpora) – a large collection of texts, written or spoken, stored on a computer. A concordancer – a computer programme.
GDEX: Automatically finding good dictionary examples in a corpus.
Evaluating word sketches and corpora
Thesauruses for Natural Language Processing
Corpora, Language Technology and Maltese
Presentation transcript:

1 The Long Road from Text to Meaning Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex

May 2007 Adam Kilgarriff 2 Overview Research programme Examples: Corpus lexicography Word sketching Collocationality Thesauruses The long road

May 2007 Adam Kilgarriff 3 What is language?

May 2007 Adam Kilgarriff 4 What is language? In our heads

May 2007 Adam Kilgarriff 5 What is language? In our heads In texts and sound signals

May 2007 Adam Kilgarriff 6 What is language? In our heads In texts and sound signals Both

May 2007 Adam Kilgarriff 7 Methodology Study language in our heads Competence Chomsky “rationalist” (Descartes, Leibniz)

May 2007 Adam Kilgarriff 8 Methodology Study language in our heads Competence Chomsky “rationalist” (Descartes, Leibniz) Odd method for objective science Practical problems: coverage, arbitrariness

May 2007 Adam Kilgarriff 9 Methodology Study text “empiricist” (Locke, Hume) Physics: forces, matter Chemistry: chemicals, bonds Language: text, speech signals

May 2007 Adam Kilgarriff 10 It goes against the grain What is important about a sentence? its meaning Corpus methodology: Throw away individual sentence meaning Find patterns

May 2007 Adam Kilgarriff 11 Computer power Corpora bigger and bigger data sets Language technology tools lemmatizers, POS-taggers, parsers Machine learning, pattern-finding 18 years of rapid ascent

May 2007 Adam Kilgarriff 12 A virtuous circle Pattern finding Lexicon Linguistic processing Corpus Part-of-speech tagging Parsing Lemmatizing More data → gets richer each time round

May 2007 Adam Kilgarriff 13 Example: corpus lexicography - four ages

May 2007 Adam Kilgarriff 14 Age 1: Pre-computer Oxford English Dictionary: 20 million index cards

May 2007 Adam Kilgarriff 15 Age 2: KWIC Concordances From 1980 Computerised

May 2007 Adam Kilgarriff 16 Age 2: KWIC Concordance

May 2007 Adam Kilgarriff 17 Age 2: KWIC Concordances From 1980 Computerised COBUILD project was innovator the coloured-pens method

May 2007 Adam Kilgarriff 18 1 political association 4 person in an agreement/dispute 2 social event 5 to be party to something... 3 group of people The coloured pens method

May 2007 Adam Kilgarriff 19 Age 2: limitations as corpora get bigger: too much data 50 lines for a word: read all 500 lines: could read all, takes a long time 5000 lines: no

May 2007 Adam Kilgarriff 20 Age 3: Collocation statistics Problem: too much data - how to summarise? Solution: list of words occurring in neighbourhood of headword, with frequencies Sorted by salience

May 2007 Adam Kilgarriff 21 Collocation listing For collocates of save (>5 hits), window 1-5 words to right of nodeword word forestslife $1.2dollars livescosts enormousthousands annuallyface jobsestimated moneyyour

May 2007 Adam Kilgarriff 22 Age 4: The word sketch A corpus-derived one-page summary of a word’s grammatical and collocational behaviour

May 2007 Adam Kilgarriff 23 Age 4: The word sketch Large well-balanced corpus Parse to find subjects, objects, heads, modifiers etc One list for each grammatical relation Statistics to sort each list, as before

May 2007 Adam Kilgarriff 24 Macmillan English Dictionary For Advanced Learners Ed: Rundell, 2002

May 2007 Adam Kilgarriff 25 Euralex 2002

May 2007 Adam Kilgarriff 26 Euralex 2002 Can I have them for my language please

May 2007 Adam Kilgarriff 27 The Sketch Engine Input: any corpus, any language Lemmatised, part-of-speech tagged specification of grammatical relations Word sketches integrated with corpus query system Developer: Pavel Rychly, Brno

May 2007 Adam Kilgarriff 28 Users: Dictionary publishers Oxford UP, Collins, Chambers, Macmillan Universities Teaching, research Framenet Language teaching Self-registration for free trial account

May 2007 Adam Kilgarriff 29 Collocationality Which words are most ‘collocational’ Dictionary publishers Where to put ‘collocation boxes’ Language learners

May 2007 Adam Kilgarriff 30 VerbFreqMLE Prob x log = entropy Take Gain Offer See Enjoy ……… Clarify ……… Total Calculation of entropy for advantage (object relation).

May 2007 Adam Kilgarriff 31

May 2007 Adam Kilgarriff 32 place (17881), attention (8476), door (8426), care (4884), step (4277), advantage (3730), rise (3334), attempt (2825), impression (2596), notice (2462), chapter (2318), mistake (2205), breath (2140), hold (1949), birth (1016), living (953), indication (812), tribute (720), debut (714), button (661), eyebrow (649), anniversary (637), mention (615), glimpse (531), suicide (486), toll (472), refuge (470), spokesman (453), sigh (436), birthday (429), wicket (412), appendix (410), pardon (399), precaution (396), temptation (374), goodbye (372), fuss (366), resemblance (350), goodness (288), precedence (285), havoc (270), tennis (266), comeback (260), farewell (228), prominence (228), go-ahead (202), sip (198),

May 2007 Adam Kilgarriff 33 Thesaurus a resource that groups words according to similarity

May 2007 Adam Kilgarriff 34 Manual and automatic Manual Roget, WordNets, many publishers Automatic Sparck Jones (1960s), Lin (1998) aka distributional two words are similar if they occur in same contexts Are they comparable?

May 2007 Adam Kilgarriff 35 Thesauruses in NLP sparse data

May 2007 Adam Kilgarriff 36 Thesauruses in NLP sparse data does x go with y? don’t know, they have never been seen together New question: does x+friends go with y+friends indirect evidence for x and y thesaurus tells us who friends are “backing off”

May 2007 Adam Kilgarriff 37 Relevant in: Parsing PP-attachment conjunction scope Bridging anaphors Text cohesion Word sense disambiguation (WSD) Speech understanding Spelling correction

May 2007 Adam Kilgarriff 38 Conjunction scope Compare old boots and shoes old boots and apples

May 2007 Adam Kilgarriff 39 Conjunction scope Compare old boots and shoes old boots and apples Are the shoes old?

May 2007 Adam Kilgarriff 40 Conjunction scope Compare old boots and shoes old boots and apples Are the shoes old? Are the apples old?

May 2007 Adam Kilgarriff 41 Conjunction scope Compare old boots and shoes old boots and apples Are the shoes old? Are the apples old? Hypothesis: wide scope only when words are similar

May 2007 Adam Kilgarriff 42 (demo)

May 2007 Adam Kilgarriff 43 Words and word senses automatic thesauruses words

May 2007 Adam Kilgarriff 44 Words and word senses automatic thesauruses words manual thesauruses simple hierarchy is appealing homonyms

May 2007 Adam Kilgarriff 45 Words and word senses automatic thesauruses words manual thesauruses simple hierarchy is appealing homonyms “aha! objects must be word senses”

May 2007 Adam Kilgarriff 46 Problems Theoretical Practical

May 2007 Adam Kilgarriff 47 Theoretical

May 2007 Adam Kilgarriff 48

May 2007 Adam Kilgarriff 49

May 2007 Adam Kilgarriff 50 Wittgenstein Don’t ask for the meaning, ask for the use

May 2007 Adam Kilgarriff 51 Practical

May 2007 Adam Kilgarriff 52 Problems Practical a thesaurus is a tool if the tool organises words senses you must do WSD before you can use it WSD: state of the art, optimal conditions: 80%

May 2007 Adam Kilgarriff 53 Problems “To use this tool, first replace one fifth of your input with junk”

May 2007 Adam Kilgarriff 54 Avoid word senses

May 2007 Adam Kilgarriff 55 Avoid word senses This word has three meanings/senses

May 2007 Adam Kilgarriff 56 Avoid word senses This word has three meanings/senses This word has three kinds of use well founded empirical we can build on it

May 2007 Adam Kilgarriff 57 Avoid word senses This word has three meanings/senses This word has three kinds of use well founded empirical we can build on it

May 2007 Adam Kilgarriff 58 sorry, roget

May 2007 Adam Kilgarriff 59 sorry, AI

May 2007 Adam Kilgarriff 60 sorry, AI AI model for NLP: NLP turns text into meanings AI reasons over meanings word meanings are concepts in an ontology a Roget-like thesaurus is (to a good approximation) an ontology Guarino: “cleansing” WordNet If a thesaurus groups words in their various uses (not meanings) not the sort of thing AI can reason over

May 2007 Adam Kilgarriff 61 “The Future of Search” Berkeley, 4 May 2007 NLP panel Pell, Hodjat, Russell The AI dream for 50 years Interpret Search on meanings Implicit: word senses

May 2007 Adam Kilgarriff 62 Quine “No entity without identity” Foundations for sound science Identity conditions for word senses? No word uses? Yes

May 2007 Adam Kilgarriff 63 “linguistics expressions prompt for meanings rather than express meanings” Fauconnier and Turner 2003 It would be nice if … But …

May 2007 Adam Kilgarriff 64 A virtuous circle Pattern finding Lexicon Linguistic processing Corpus Part-of-speech tagging Parsing Lemmatizing More data → gets richer each time round Word sketching Thesaurus

May 2007 Adam Kilgarriff 65 The long journey from text towards meaning Raw text Pure meaning Rationalists Empiricists

May 2007 Adam Kilgarriff 66 The long journey from text towards meaning Raw text Pure meaning Rationalists Empiricists lemmatizer POS-tagger parser thesaurus thematic relations/frame elements

May 2007 Adam Kilgarriff 67 Thank you