Modern Lexicography – Developments, Prospects, and Problems Patrick Hanks Research Institute of Information and Language Processing University of Wolverhampton.


1 Modern Lexicography – Developments, Prospects, and Problems Patrick Hanks Research Institute of Information and Language Processing University of Wolverhampton patrick.w.hanks@gmail.com EFNIL, Budapest, 25 October, 2012

2 Outline of the talk Technology and Lexicography – During the Renaissance – Now Philosophy, linguistics, and lexicography – During the Enlightenment – Now Lexicography of the future – The corpus revolution – Presenting the facts to the public – Can (should) natural language be regulated? 2

3 PART 1: Lexicography and technology Lexicography as we know it today is possible because of two technological developments during the Renaissance: The invention of printing (Gutenberg, Mainz, c. 1440) – Enabling many copies of a work to be disseminated rapidly, regardless of its size, bulk, and complexity. The invention of modern typography by Nicolas Jenson (Venice, 1470) And the scholarship of Aldus Manutius (1449-1515) in Venice Manutius collected Latin and Greek manuscripts from all over Europe and had them typeset and printed. But in the past 10 years this kind of lexicography has become obsolete! It has been superseded by a new kind of technology – text processing by computer. I will discuss this in part 3. 3

4 The typography of Gutenberg’s Bible (c. 1455) 4

5 Nicolas Jenson’s Roman Antiqua typeface (c. 1468) 5

6 Palsgrave (1530) 6

7 R. Estienne (1531) 7

8 Calepino: Basle edition, 1550 8

9 Promptorium Parvulorum in print (Pynson 1499) 9

10 Present-day technology: lexicographical evidence
hazard, verb.
1. No one at this stage is prepared to hazard a guess at the outcome of the poll on
2. name -- Chicken.” “Not Hen Chicken?” I hazarded, as this humorous diminutive was part
3. the wall. Stifling a giggle, she hazarded a guess that the wardrobe would be full
4. It seemed sensible to hazard that a man of this standing would have
5. can result in lost profits. When staff hazard a guess as to the price of goods – or
6. them as Part I and Part 2. One might hazard a guess that Part I was concerned with
7. North American standards. He does not hazard any opinions on how costs depend on the
8. ecoming proficient. Perhaps we can now hazard an attempt at defining `a good reader'.
9. builder, nor an architect, I can only hazard a guess. During construction in the mid-19
10. hair and eyes like her mother. I would hazard a guess and say she would be, at the time
11. Where do your art materials live? We hazard a guess that they're lurking in a shoebox
12. excitement than others, and I would hazard a guess that, even if they've never played
13. age and some movies date. I would hazard the guess that The Graduate belongs in
14. What the connection is we can only hazard a guess at but it confirms all our worst
15. have been lost and commandos were not hazarded in foolish risks, although often taking
16. shapes and colours from which we hazard the inference that a leaping dog is in
17. and a principle strong enough to hazard lives for, America cannot hope to lead
18. of the farmer is not revealed; we may hazard the guess that he was William Hardeley,
19. to begin restoring. But I'd hazard a guess that if you restore the directory
20. from time to time admire people who hazard their entire company on one major throw
21. supreme grade of evil'. It may be hazarded that it was this inevitable alliance with
22. his achievement, such as it was, and hazarded the opinion that he might best be remembered
23. in those stations' heyday, but I hazard a guess that considerably more passengers
24. the day's racing. In fact I would hazard a guess that one, if not both of these
25. of society itself. Indeed, one could hazard a further, and more general, observation
26. The Phillips curve. Although Phillips hazarded some theoretical conjectures concerning
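Concordance lines of this kind (keyword in context, or KWIC) are easy to produce by machine. The following is only a minimal sketch, not the tool used for the talk: the function name kwic, the width parameter, and the three-sentence sample text (adapted from the lines above) are assumptions for illustration.

```python
import re

def kwic(text, keyword_pattern, width=30):
    """Return one keyword-in-context line per match of keyword_pattern."""
    lines = []
    for m in re.finditer(keyword_pattern, text, flags=re.IGNORECASE):
        # Up to `width` characters of context on each side of the keyword.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Illustrative sample text only, adapted from the concordance lines above.
sample = ("No one is prepared to hazard a guess at the outcome. "
          "She hazarded a guess that the wardrobe would be full. "
          "He does not hazard any opinions on costs.")

concordance = kwic(sample, r"\bhazard(?:s|ed|ing)?\b", width=25)
for line in concordance:
    print(line)
```

A real concordancer would work on a tokenized, lemmatized corpus rather than a regular expression over raw text; the sketch only shows the alignment idea.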

11 PART 2: Philosophy, linguistics, and lexicography Do words have meaning? If not in words, where do meanings reside? – Nowhere! – Meanings are ephemeral interpersonal events, not stable objects with a ‘residence’. – But then how can anyone know what anyone else means? – What do philosophers say? – What do linguists say? And what is the nature of linguistic creativity? 11

12 Do words have meaning? Let’s think of a word: blow What does blow mean? 12

13 The meaning potential of a word What’s the meaning of blow? – What the wind does? A disappointment? Something you do with your fist? With your nose? With a whistle? Spend a lot of money? … What’s the meaning of blow up? – Destroying a building? What you do to a balloon? Lose your temper? … – All of these things and more! Words are hopelessly ambiguous. – A checklist of word meanings cannot, for principled reasons, be exhaustive. – But put a word in context, and ambiguity is reduced or eliminated. 13

14 Meaning potentials If words don’t have meanings, how come dictionaries have been so successful? Strictly speaking, dictionaries list meaning potentials, not meanings. – The distinction is subtle but the theoretical consequences are far-reaching – When consulting a dictionary, human beings use their imaginations to put words in a relevant context – a context for which they are already primed (Hoey 2005) – Computer programs and logic-based theories are not so primed. 14

15 Philosophical background H. P. Grice (1957, 1975) argued that meanings are not just in the head – they are events; interactions between people: – between speaker (S) and hearer (H); – (and with displacement in time) between writer and reader For this to work, S and H must share a body of linguistic conventions having the same meanings. Grice did not specify what the conventions are. – He left that task to linguists and lexicographers – So far, we seem to have let him down rather badly 15

16 Lexis and grammar Are the conventions that underlie conversational co-operation conventions of grammar (syntax)? – Only partly. Discussed in more detail in Hanks (2012): ‘How people use words to make meanings’. Perhaps the conventions that we rely on for conversational co-operation are words, with meanings as given in dictionaries? – But two decades of research in Word Sense Disambiguation by computational linguists (using LDOCE and other existing lexical resources) is now seen as a failure (Ide and Wilks 2005) – maybe, at least in part, because dictionaries don’t say enough about phraseology. Something else is needed. 16

17 Firth and Sinclair “We must separate from the mush of general goings-on those features of repeated events which appear to be part of a patterned process.” —J. R. Firth (1950) 17

18 Idiomaticity vs. Open Choice “The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments.” —Sinclair 1991. Corpus, Concordance, Collocation, p. 110 “Tending towards open choice is what we can dub the terminological tendency, which is the tendency for a word to have a fixed meaning in reference to the world.... tending towards idiomaticity is the phraseological tendency, where words tend to go together and make meanings by their combinations.” —Sinclair 2004. Trust the Text, p. 29 18

19 The importance of context “More often than not, activation of a particular meaning depends on the co-occurrence of two or more lexical items” – Sinclair – The study of collocations is still in its infancy – Empirical measurement of word co-occurrences (collocations) only became possible with very large corpora (i.e. since the early 1990s) – Problem with small corpora (Brown, LOB, ICE): Impossible to distinguish significant collocations from chance – We now have very large corpora – billions of words of texts Contemporary corpora, historical corpora, domain corpora, … – But serious analysis of corpus data has hardly started – It requires both new tools and revision of received theories 19

20 Idiom and Open Choice The range of collocational norms varies greatly from word to word What do you abandon? – a car, [ NO DET ] ship, an old fridge, a plan, a theory, a baby, a dog ( = as a pet), your wife and children, … – Very open choice in the direct object slot. What do you hazard? – The direct object slot is idiomatically highly constrained: just one word (guess) accounts for over 50% of uses of this verb 20
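The "over 50%" figure can be checked against the 26 concordance lines for hazard shown on slide 10. The sketch below assumes a hand annotation of what fills the object slot in each line; that annotation is one reading of the lines (including the labels QUOTE and THAT-CLAUSE for non-nominal complements), not the speaker's own coding.

```python
from collections import Counter

# Object-slot filler for each of the 26 concordance lines, in order.
# Hand-annotated for illustration; lemmatized; an assumption, not source data.
objects = [
    "guess", "QUOTE", "guess", "THAT-CLAUSE", "guess", "guess", "opinion",
    "attempt", "guess", "guess", "guess", "guess", "guess", "guess",
    "commando", "inference", "life", "guess", "guess", "company",
    "THAT-CLAUSE", "opinion", "guess", "guess", "observation", "conjecture",
]

counts = Counter(objects)
share_of_guess = counts["guess"] / len(objects)

# List fillers from most to least frequent with their share of all uses.
for filler, n in counts.most_common():
    print(f"{filler:12} {n:2}  {n / len(objects):.0%}")
```

On this reading, guess fills the slot in 14 of the 26 lines, which is consistent with the claim on the slide; a sample of 26 lines is of course far too small for a dictionary entry.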

21 Exploiting the norm I hazarded various Stuartesque destinations like Florida, Bali, Crete and Western Turkey. —Julian Barnes Is it normal to hazard destinations or locations? – No. This is an exploitation of a norm. We need a theory (and an artefact) that distinguishes the normal, conventional, idiomatic phraseology of each word from exploitations of those phraseological norms 21

22 Extended context (Several exploitations here) Stuart needlessly scraped a fetid plastic comb over his cranium. —‘Where are you going? You know, just in case I need to get in touch.’ —‘State secret. Even Gillie doesn’t know. Just told her to take light clothes.’ He was still smirking, so I presumed that some juvenile guessing game was required of me. I hazarded various Stuartesque destinations like Florida, Bali, Crete and Western Turkey, each of which was greeted by a smug nod of negativity. I essayed all the Disneylands of the world and a selection of tarmacked spice islands; I patronised him with Marbella, applauded him with Zanzibar, tried aiming straight with Santorini. I got nowhere. 22

23 PART 3: Lexicography of the future Will draw on prototype theory (Rosch 1972) Will aim to map cognitive prototypes (meanings, beliefs, etc., associated with each word) onto phraseological prototypes of those words in use There will be an emphasis on analysing statistically significant collocations 23

24 James Murray (1878) predicts the need for corpus data “The editor and his assistants have to spend precious hours searching for examples of common everyday words. Thus, in the slips, we have 50 citations for abusion, but for abuse, not five.” – James Murray, Presidential address to the Philological Society, 1878 24

25 The need for a pattern dictionary To record all and only the normal patterns of use for each word – Not meanings – Not all possible patterns A pattern dictionary will be a benchmark against which actual usage can be measured Meanings, implicatures, translations, and whatever-else-you-like are attached to patterns (not to isolated words) – A word is no more than an entry point to a set of patterns 25

26 What is a pattern dictionary? A semantically driven syntagmatic inventory of normal word uses and meanings (implicatures). – Based on analysis of significant colligations and a statistically valid random sample. – Shows comparative frequency of each pattern of a polysemous word. Meanings are associated with patterns, not with words. – The colligational preferences of a word are part of its patterns. Created by means of a painstaking technique called Corpus Pattern Analysis (CPA). 26

27 Norms and exploitations A pattern dictionary aims to record all and only the normal uses of each word. – Exploitation of norms is a subject for separate analysis. – Types of ‘exploitation’ include creative metaphor, ellipsis, and (in particular) anomalous realizations. Consider: The goat ate the newspaper. The verb eat has a preference for nouns of semantic type [[Food]] in the direct object clause role. ‘[[Animate]] eat [[Document]]’ is not a normal pattern of English. Compare John devoured the newspaper. ‘[[Human]] devour [[Document]]’ is a normal pattern of English. It is a conventional metaphor. 27
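The goat and newspaper contrast can be sketched as a lookup: a clause counts as a norm if its subject and object types fit a recorded pattern, and as an exploitation otherwise. Everything in the sketch is a toy assumption: the names SEMANTIC_TYPE, SUPERTYPE, NORMS and classify are hypothetical, as are the supertype link from [[Human]] to [[Animate]] and the '[[Animate]] eat/devour [[Food]]' patterns. Only eat's preference for [[Food]] and the '[[Human]] devour [[Document]]' norm come from the slide.

```python
# Hypothetical toy lexicon: noun -> semantic type.
SEMANTIC_TYPE = {
    "goat": "Animate", "John": "Human",
    "newspaper": "Document", "grass": "Food",
}

# Hypothetical fragment of a type hierarchy: child -> parent.
SUPERTYPE = {"Human": "Animate"}

# Normal patterns per verb as (subject type, object type) pairs.
NORMS = {
    "eat": [("Animate", "Food")],
    "devour": [("Animate", "Food"), ("Human", "Document")],
}

def is_a(sem_type, ancestor):
    """True if sem_type equals ancestor or inherits from it."""
    while sem_type is not None:
        if sem_type == ancestor:
            return True
        sem_type = SUPERTYPE.get(sem_type)
    return False

def classify(subject, verb, obj):
    """Label a subject-verb-object clause as 'norm' or 'exploitation'."""
    s, o = SEMANTIC_TYPE[subject], SEMANTIC_TYPE[obj]
    for subj_type, obj_type in NORMS[verb]:
        if is_a(s, subj_type) and is_a(o, obj_type):
            return "norm"
    return "exploitation"

print(classify("goat", "eat", "newspaper"))    # exploitation
print(classify("John", "devour", "newspaper")) # norm
```

The sketch treats everything outside the recorded patterns as an exploitation; in practice some unmatched uses will simply be errors or gaps in the inventory.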

28 Specifically,... The Pattern Dictionary of English Verbs aims to list all normal patterns of each verb lemma in BNC. – with practical applications and theoretical consequences (see later). A benchmark for comparative studies of and identification of norms in other corpora – by time period: historical corpora, future corpora – by region: e.g. American English – by domain, e.g. ‘[[Human]] abate [[Problem = Nuisance]]’ is a domain-specific norm in the domain of legal jargon; abate is not normally a transitive verb. 28

29 A typical Pattern Dictionary entry
irritate
PATTERN 1 (90%): [[Anything]] irritate [[Human]]
IMPLICATURE: [[Anything]] causes [[Human]] to feel mildly annoyed.
PATTERN 2 (8%): [[Phys Obj | Stuff]] irritate [[Body Part]]
IMPLICATURE: [[Phys Obj | Stuff]] causes [[Body Part]] to become inflamed and somewhat painful.
Notes:
1. Both these patterns are transitive but they have different meanings. They are distinguished by the semantic types of the nouns.
2. Getting the right level of semantic generalization for each noun is hard. It must select normal, prototypical uses – not all possible uses.
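An entry like this is, in effect, a small data structure: each pattern pairs typed argument slots with a frequency and an implicature. The sketch below is one possible encoding, not the format actually used by PDEV; the class Pattern, its field names, and the function match are hypothetical, while the types, percentages, and implicatures are copied from the entry above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    number: int
    share: float          # proportion of sampled uses, as given in the entry
    subject: frozenset    # semantic types admitted in the subject slot
    obj: frozenset        # semantic types admitted in the direct object slot
    implicature: str

IRRITATE = [
    Pattern(1, 0.90, frozenset({"Anything"}), frozenset({"Human"}),
            "[[Anything]] causes [[Human]] to feel mildly annoyed."),
    Pattern(2, 0.08, frozenset({"Phys Obj", "Stuff"}), frozenset({"Body Part"}),
            "[[Phys Obj | Stuff]] causes [[Body Part]] to become inflamed "
            "and somewhat painful."),
]

def match(patterns, subject_type, object_type):
    """Return the first pattern whose slots admit the given types, else None."""
    for p in patterns:
        # [[Anything]] is treated as a wildcard for the subject slot.
        subject_ok = "Anything" in p.subject or subject_type in p.subject
        if subject_ok and object_type in p.obj:
            return p
    return None

hit = match(IRRITATE, "Stuff", "Body Part")
print(hit.number, hit.implicature)
```

The design choice worth noting is that the meaning (implicature) hangs off the pattern, not off the verb, so disambiguation reduces to finding which pattern the arguments fit.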

30 Semantic type vs. contextual role Mr Woods sentenced Bailey to seven years | life imprisonment PATTERN: [[Human 1]] sentence [[Human 2]] {to [[Time Period | Punishment]]} Semantic type: [[Human]] Contextual roles: [[Human 1 = Judge]], [[Human 2 = Convicted Criminal]], seven years [[Time Period = Punishment in jail]] – Semantic type is an intrinsic semantic property of a lexical item. – Contextual role is extrinsic; the meaning is imposed (activated, selected) by the context in which the word is used. 30

31 Nouns and verbs The analytical apparatus required for nouns is different in kind from that required for predicators (verbs, adjectives). – Nouns are grouped into lexical sets in relation to the predicators that they normally collocate with. – The lexical sets are normally united by a semantic type. – A shallow ontology of nouns (grouped by their semantic type) is therefore part of the apparatus of a pattern dictionary. – Semantic types in real texts are more complex than might be expected at first sight or from invented examples. – Lexical sets include alternations, parts, and properties of types 31

32 What would an empirically well-founded ontology be like? (1) A hierarchy of about 250 semantic types (not more) Representing the intrinsic conceptual semantic properties of words – [[Eventuality]] and [[Entity]] at the top – [[Eventuality]] = [[Event | State of Affairs]] – [[Entity]] = [[Physical Object | Abstract Object]] Each semantic type is governed by corpus evidence of colligations, e.g.: [[Human]]s and [[Animal]]s eat, run, sleep, etc. [[Human]]s and [[Institution]]s think, say, negotiate, etc. So snakes (for this purpose) are not animals The hierarchy of [[Artefact]]s has many members, because different artefacts are used for different purposes (= with different verbs). Ref. James Pustejovsky, 1995. The Generative Lexicon. 32

33 What would an empirically well-founded ontology be like? (2) It would have to take account of verb-specific lexical alternations (parts and properties). For example, Pattern 2 (of 8) for calm, verb, is: [[Human 1 | Event]] calm [[Human 2]] – Alternation of Human (2): [[Animal]] – Parts of Human (2): nerves [[Body Part | Psyche Part]] – Attributes: [[Possessive Determiner]] fear, anxiety, agitation,.... [[Emotion]] 33

34 Argument alternation and focus 1. Straightforward alternations: – People negotiate, governments negotiate, … – Humans eat, horses eat, dogs eat, alligators eat … – Horses gallop, humans gallop [ambiguous] 2. Another function of argument variation is focus: – repair one’s car, repair the fender, repair the damage – treat a person, treat his ankle, treat the injured, treat their injuries – The meaning of treat here contrasts with the meaning in treat a person well/badly The presence or absence of a manner adverbial is all-important 34

35 How to Measure Collocations? Various statistical tools are available, e.g.: Mutual Information (“MI”; Church and Hanks 1990) – tends to favour content words as collocates t-score – tends to favour function words as collocates. Sketch Engine (Kilgarriff, Rychlý, et al., 2004) – measures salience scores for pairs of collocates in pre-determined colligational patterns Take your pick – but measuring must be done, one way or the other, if we are to have any hope of understanding the nature of meaning in language and getting our dictionaries to report accurately how words are used – because a natural language is a fuzzy, variable, analogical, unstable system for making meanings 35
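The contrast between the two measures can be seen in a few lines. The sketch uses the commonly cited formulations, MI = log2(f(x,y) · N / (f(x) · f(y))) and t = (f(x,y) − f(x) · f(y) / N) / √f(x,y); it ignores the collocation window and is not Sketch Engine's salience score. The corpus size and all frequency counts are invented for illustration.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (log base 2) of a word pair."""
    return math.log2((f_xy * n) / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected co-occurrence, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

N = 1_000_000  # hypothetical corpus size in tokens

# Hypothetical counts: a node word (f = 50) with a rare content collocate
# and with a very frequent function-word collocate.
content = dict(f_xy=20, f_x=50, f_y=200, n=N)
function = dict(f_xy=30, f_x=50, f_y=60_000, n=N)

mi_content, mi_function = mutual_information(**content), mutual_information(**function)
t_content, t_function = t_score(**content), t_score(**function)

print(f"content pair:  MI={mi_content:.2f}  t={t_content:.2f}")
print(f"function pair: MI={mi_function:.2f}  t={t_function:.2f}")
```

With these invented counts MI ranks the content-word pair far higher, while the t-score ranks the function-word pair slightly higher, which is the tendency noted on the slide.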

36 The Pattern Dictionary and FrameNet PDEV is corpus-driven (ruthlessly empirical) and proceeds word by word, investigating syntagmatic criteria for distinguishing different meanings of polysemous words, in a “semantically shallow” way. FrameNet proceeds frame by frame. It: expresses the deep semantics of situations (frames); proceeds frame by frame, not word by word; analyses situations in terms of frame elements; studies meaning differences and similarities between different words in a frame; does not explicitly study meaning differences of polysemous words; does not analyse corpus data systematically, but goes fishing in corpora for examples in support of hypotheses; has problems grouping words into frames, and misses some; has no established inventory of frames; has no criteria for completeness of a lexical entry. 36

37 Construction Grammar (1) Focus on meaning, not just well-formedness. Challenges reductionist theories of language Meaning is (in part) associated with constructions. Anything from a word to a clause can be a construction. – Example: ‘she slept her way to the top.’ – Sleep is not normally a goal-achievement verb. – But in this sentence, it is coerced into being one by the construction “[V] one’s way to [[Status]]”. – This meaning is not arrived at by a concatenation of the meanings of the lexical items of which the sentence is composed. 37

38 Construction Grammar (2) So far so good – but Construction Grammar is in the speculative tradition. It is not based on analysis of evidence. It is based largely on made-up examples, many of which are bizarre, e.g. The gardener watered the flowers flat. Corpus evidence shows that the verb water does not normally participate in the resultative construction. A distinction between normal usage and exploitation of norms must be made. – Abnormal examples are conducive to distortions in the theory. – CG needs corpus analysis. – Some sort of synthesis between PG and CG is desirable. 38

39 Theoretical consequences and practical applications (1) Pedagogical: Anyone acquiring a language must learn competence in two kinds of rule-governed linguistic behaviour: – How to use words normally – How to exploit the norms (creative metaphors, ellipsis, etc.) A pattern dictionary gives comparative frequency of patterns. – A lexical syllabus will focus on statistically significant patterns of use. In error analysis: what norm was aimed at? – If learners are exploiting norms creatively, do you (the teacher) really want them to? 39

40 Theoretical consequences and practical applications (2) For theoretical linguistics: Are some grammars better than others for representing how words are used to make meanings? ‘S → NP VP’: confuses language with predicate logic The third argument (‘adjunct’, ‘adverbial’): – Not well analysed in generative grammar (or, indeed, any other grammar) – CPA shows that a new grammar of adverbials is needed. Metaphor analysis: – CPA distinguishes conventional metaphors from exploitations. 40

41 Theoretical consequences and practical applications (3) For computational linguistics and AI: Improving machine translation – Getting the right pattern is more likely to select the right translation. Parsing and word-class tagging: – CLAWS achieves ~90% accuracy in word-class tagging in BNC – CPA reveals some systematic errors in CLAWS tagging. Anaphora resolution: – He found a glass of water on the table and drank it. – ‘[[Animate]] drink [[Liquid]]’ selects water as a direct object of drink 41
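The anaphora example can be sketched as filtering candidate antecedents by the object type that the verb's pattern prefers. This is a toy illustration only: the names SEMANTIC_TYPE, OBJECT_PREFERENCE and resolve_pronoun are hypothetical, and the types assigned to glass and table are assumptions; only '[[Animate]] drink [[Liquid]]' comes from the slide.

```python
# Hypothetical toy lexicon: noun -> semantic type.
SEMANTIC_TYPE = {"glass": "Container", "water": "Liquid", "table": "Furniture"}

# Preferred direct-object type per verb, from '[[Animate]] drink [[Liquid]]'.
OBJECT_PREFERENCE = {"drink": "Liquid"}

def resolve_pronoun(verb, candidates):
    """Return the only candidate matching the verb's object type, else None."""
    wanted = OBJECT_PREFERENCE[verb]
    matches = [c for c in candidates if SEMANTIC_TYPE.get(c) == wanted]
    # Refuse to guess when the pattern does not single out one antecedent.
    return matches[0] if len(matches) == 1 else None

# "He found a glass of water on the table and drank it."
antecedent = resolve_pronoun("drink", ["glass", "water", "table"])
print(antecedent)
```

A real resolver would combine such preferences with syntactic and discourse cues, since people also say that they drank a glass.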

42 Presenting the facts to the public (2) Dictionaries of the future will be electronic products – Space constraints removed – leading to a danger of verbosity They will pay more attention to phraseology and collocation Language communities will still need lexicographers to analyse the lexical content of corpora, Internet data, conversation, etc., and to identify the phraseological conventions on which successful communication depends You can’t just plonk language learners down in front of a concordance (corpus data) and expect them to work out what is going on. The data needs an interpretation. 42

43 Phraseological Lexicography and Computational Linguistics At present NLP applications such as machine translation are having great success with “knowledge-poor” statistical methods. – Sooner or later the pendulum will swing back: lexicographical methods will be needed to augment the raw statistical approach According to Ken Church, in 1987 the single most productive contribution to the NLP text-to-speech generation system at AT&T Bell Labs came from the IPA transcriptions in Collins dictionaries Can we expect a similar contribution from phraseological lexicography to computational message understanding? 43

44 Phraseological Lexicography and the Semantic Web Semantic Web: the original dream: – “Web technology must not discriminate between the scribbled draft and the polished performance.” –Berners-Lee, Hendler, and Lassila, in Scientific American 2001 At present Semantic Web research is very far from being able to interpret polished performances, let alone scribbled drafts – It confines itself to identifying names, dates, addresses, and appointments, and to processing tags that have been added to elements in text. – It is “the apotheosis of annotation – but what are its semantics?” (asks Yorick Wilks) Realizing the dream will require lexicographic input – a radical new kind of lexicography, one possibility for which I have tried to outline in this presentation. 44

45 A model presentation OED3 is a model of electronic presentation – but its lexicographical principles are old: they are (rightly) those of the Renaissance and the Enlightenment – These principles need revision in the light of corpus evidence – But you interfere with a national monument at your peril – One of many unacknowledged theoretical problems is a confusion between the (stipulative) meaning of scientific concepts and the meanings of words in natural language. Dictionaries of the future will be based on the principles of Wittgenstein, Rosch, Putnam, Grice, Firth, and Sinclair. 45

46 Can (should) natural language be regulated? (1) Johnson’s dictionary (1755) was based on citations from “the best authorities”. “Those who have been persuaded to think well of my design require that it should fix our language... “When we see men grow old and die... we laugh at the elixir that promises to prolong life to a thousand years; and with equal justice may the lexicographer be derided who, being able to produce no example of a nation that has preserved their words and phrases from mutability, shall imagine that his dictionary can embalm his language and secure it from corruption and decay.” —Preface, Dictionary, 1755 46

47 Can (should) natural language be regulated? (2) Johnson’s liberal empirical descriptivism is OK for English. But what about other language situations, e.g. – Norwegian (institutionalized diglossia) – Czech (Every literate user of Czech must be able to use standard literary Czech, as well as his or her local dialect – but standard literary Czech is not a natural language) – Greek? (katharevousa is obsolescent) – Languages without a strong literary convention, e.g. Bantu languages, such as Northern Sotho, Zulu, Luganda. An element of prescriptivism seems to be inevitable here. – What about French? What is the role of the Académie Française in this brave new world of computational language processing? These are subjects on which I am not qualified to speak. 47

48 Thanks To you, for listening, To the late John Sinclair and the (still extant) James Pustejovsky, who have inspired this approach, To the Academy of Sciences of the Czech Republic (project T100300419) and the Czech Ministry of Education (National Research Program II project 2C06009), who, in part, funded the pilot study on which PDEV is based, And to Karel Pala, Pavel Rychlý, Adam Rambousek, and Adam Kilgarriff, who have created tools that make this kind of analysis possible 48

49 Invitation to browse the Pattern Dictionary Fire up a Firefox browser window. VISIT: http://nlp.fi.muni.cz/projects/cpa Pattern Dictionary of English Verbs: http://deb.fi.muni.cz/pdev/ 49
