Albert Gatt Corpora and Statistical Methods – Lecture 3.


Morphology and productivity Part 2

Morphology
- Many languages have multiple word forms related to a single base form (root form).
- Lexeme = base form from which related forms are produced.
- Three classes of productive morphological processes:
  - Inflection
  - Derivation
  - Compounding

Inflection
- Addition of prefixes and suffixes that:
  - leave core meaning intact
  - leave grammatical category intact
  - add/alter some features of meaning (especially relevant to syntax)
- Examples:
  - -s to form plural nouns
  - -ed to form past tense

Derivation
- Addition of prefixes and suffixes which:
  - result in a more radical change in meaning
  - often result in a change of syntactic category
- Examples:
  - English -ly (ADJ → ADV): wide-ly
  - English -en (ADJ → V): weak-en
  - English -able (V → ADJ): accept-able

Compounding
- Combination of two independent words into a new word.
- NB:
  - the new word can be orthographically one or several words
  - compounding can cause recognisable changes in phonology
  - the new compound has a new meaning (not necessarily 100% compositional)
- Example: English N-N compounds: disk drive, mad cow disease, credit crunch

Regular vs. irregular
- Inflectional and derivational rules often have exceptions.
- E.g. past tense in English:
  - regular: -ed suffix
  - irregular: bring – brought, ring – rang, etc.
- Sub-regularities are observable: -ing/-ink verbs in English seem to display a particular pattern: rang, sank, …
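The regular/irregular split above can be sketched as a lookup followed by a default rule. This is a minimal illustration, not a full morphological generator: the `IRREGULAR_PAST` table and the `past_tense` function are hypothetical names, the table holds only the slide's examples, and spelling changes such as consonant doubling (stop, stopped) are deliberately ignored.

```python
# Hypothetical mini-table of irregular forms (only the examples from the slide).
IRREGULAR_PAST = {"bring": "brought", "ring": "rang", "sink": "sank"}

def past_tense(verb):
    """Return a stored irregular form if there is one, else apply the -ed rule."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]   # exception: listed form wins
    if verb.endswith("e"):
        return verb + "d"             # bake -> baked
    return verb + "ed"                # walk -> walked

forms = [past_tense(v) for v in ("walk", "ring", "bake")]
```

Checking the exception list before the rule is what lets a small set of stored forms coexist with an open-ended regular pattern.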

Productive vs. non-productive
- Some morphological processes or categories seem to have greater potential to form new words than others.
- E.g. English -able, -ness
- Compare to English -th: warmth, strength… (much less productive)

Classical approaches to productivity
- Jackendoff (1975): unproductive rules are called redundancy rules:
  - e.g. warmth is listed in the English speaker’s (mental) lexicon as a single word
  - the redundancy rule captures the knowledge that it can be split into warm+th
  - the rule as such isn’t really “active”, i.e. forms are not produced online
- Contrast with productive rules:
  - e.g. many adjectives with -able are produced “online”, not stored

Features of classical approaches
1. They rely on a binary distinction (un/productive).
2. Productive rules are typically regular, and sub-regularities are not considered much (Dressler 2003).
3. Most of these approaches do not look at corpus data.
Related psycholinguistic model: Pinker’s (1997) dual-route model of morphological processing.

Corpus-based approaches
- View productivity as a gradable phenomenon:
  - some forms become ingrained through frequent usage
  - the category can still be productive to some extent
- Productivity is estimated in terms of a category’s potential to produce new forms.
- Can account for sub-regularities: the productivity of a category is due to many factors, including analogy to existing words.

The continuum
- Productive processes tend to:
  - be compositional
  - result in a lot of new words
- [Diagram: a continuum running from “productive morphological process” (ADJ+ness → Noun) to “lexicalised word” (ADJ+th → Noun)]

Practical application (I)
- No finite lexicon can contain all words of a language at a certain time.
- Productive processes can be exploited to parse new/unseen lexical items:
  - this is helped by the compositionality of productive processes
- Can also help to distinguish creative neologism from systematic rule-application. Compare:
  - well-defined, well-intentioned, well-specified: lots of adjectives with a well- prefix
  - YouTube: a one-off
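The idea of parsing unseen items through productive affixes can be sketched as follows. Everything here is an illustrative assumption: `KNOWN_BASES` is a hypothetical mini-lexicon, the affix lists contain only affixes mentioned in these slides, and `analyse` uses plain string matching rather than a real morphological analyser.

```python
# Hypothetical mini-lexicon of bases the system already knows.
KNOWN_BASES = {"defined", "intentioned", "specified", "accept", "good"}
# Affixes assumed to be productive (taken from the slide examples).
PRODUCTIVE_PREFIXES = ("well-",)
PRODUCTIVE_SUFFIXES = ("able", "ness")

def analyse(word):
    """Return (prefix, base, suffix) if a listed affix plus a known base
    accounts for the word, else None."""
    for prefix in PRODUCTIVE_PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest in KNOWN_BASES:
            return (prefix, rest, "")
    for suffix in PRODUCTIVE_SUFFIXES:
        base = word[:-len(suffix)]
        if word.endswith(suffix) and base in KNOWN_BASES:
            return ("", base, suffix)
    return None  # no systematic analysis: possibly a one-off coinage

results = {w: analyse(w) for w in ("well-specified", "acceptable", "YouTube")}
```

A word that receives an analysis can be treated as a systematic rule application, while a word like YouTube falls through to `None` and would need separate handling.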

Practical application (II)
- Polarity/sentiment analysis: the aim is to identify the overall positive/negative slant of a text concerning a topic.
- Moilanen and Pulman (2008) obtain improvements by considering adjectives formed with well- vs. -infested etc.

Theoretical implications
- Raises interesting questions about the relationship between corpus-based measures and psycholinguistic data.
- The likelihood of a morphological process being applied depends on style, genre, speech community…
- Can give an indication of language change over time (some processes are fossilised, others become more productive).

Statistical measures of productivity (Baayen 2006)

What we need
- A measure of productivity of a process/category C should reflect:
  - our intuitions about how frequently we encounter C
  - how easily native speakers can form new words using C
- Is it easier to produce a noun with -th (like warmth) or one with -ness (like goodness)?

Realised productivity (RP)
- Given a morphological category C, RP gives a rough indication of the past utility of C in forming new words.
- Measured as the number of distinct types formed using C in a corpus of size N.
- E.g. regular past tense -ed displays many more types than sub-regular forms such as keep-kept/sleep-slept.
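As a rough sketch, RP can be computed by counting distinct types. The code assumes a crude membership test (a type belongs to C if it ends in a given suffix), which over-generates on real text; the function name and the toy token list are hypothetical.

```python
from collections import Counter

def realised_productivity(tokens, suffix):
    """Number of distinct types ending in `suffix` (crude stand-in for category C)."""
    counts = Counter(t.lower() for t in tokens)
    return sum(1 for w in counts if w.endswith(suffix) and len(w) > len(suffix))

# Toy token list, for illustration only (not corpus data).
tokens = "the walked talked walked kept slept jumped kept kept".split()
rp_ed = realised_productivity(tokens, "ed")  # types: walked, talked, jumped
```

Repeated tokens such as the second "walked" do not raise the count, which is the point of measuring types rather than tokens.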

Realised productivity cont/d
- Why types, not tokens?
  - Productive processes have lots of types which are hapaxes, or are very infrequent.
  - Words formed from irregular processes tend to be very frequent.
- Some limitations:
  - a high RP for a category does not imply that it will keep forming lots of new words
  - RP is heavily dependent on corpus size

Expanding productivity (P*)
- P* gives a rough indication of the rate of expansion of C.
- Focuses on the number of hapaxes produced using C in the corpus.
- aka hapax-conditioned productivity
- NB: P* is still heavily dependent on corpus size!

$$P^{*} = \frac{\text{no. of types formed using } C \text{ which occur only once in } N \text{ tokens}}{\text{no. of hapaxes in the corpus}}$$
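A minimal sketch of the P* ratio, under the same crude assumption that category membership is decided by a suffix match. The function name and the toy token list are hypothetical.

```python
from collections import Counter

def expanding_productivity(tokens, suffix):
    """Hapaxes belonging to C divided by all hapaxes in the corpus."""
    counts = Counter(t.lower() for t in tokens)
    hapaxes = [w for w, c in counts.items() if c == 1]
    if not hapaxes:
        return 0.0  # no hapaxes at all: ratio undefined, reported as 0 here
    in_category = [w for w in hapaxes
                   if w.endswith(suffix) and len(w) > len(suffix)]
    return len(in_category) / len(hapaxes)

# Toy token list, for illustration only (not corpus data).
tokens = ("warmth warmth warmth strength strength "
          "kindness sadness goodness goodness the").split()
p_star_ness = expanding_productivity(tokens, "ness")  # 2 of the 3 hapaxes
p_star_th = expanding_productivity(tokens, "th")      # no -th hapaxes
```

Returning 0.0 for a corpus without hapaxes is a design choice for the sketch; raising an error would be equally defensible.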

Potential productivity (P)
- Gives an indication of how likely a category C is to form new words in future.
- I.e. it indicates whether C may already be saturated.
- aka category-conditioned productivity

$$P = \frac{\text{no. of types in } C \text{ which occur only once in a corpus of } N \text{ tokens}}{\text{no. of tokens of category } C}$$
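A minimal sketch of the P ratio, again assuming that a suffix match is an acceptable stand-in for membership in C. The function name and the toy token list are hypothetical.

```python
from collections import Counter

def potential_productivity(tokens, suffix):
    """Hapaxes belonging to C divided by the token count of C."""
    counts = Counter(t.lower() for t in tokens)
    in_category = {w: c for w, c in counts.items()
                   if w.endswith(suffix) and len(w) > len(suffix)}
    n_category_tokens = sum(in_category.values())
    if n_category_tokens == 0:
        return 0.0  # category not attested: ratio undefined, reported as 0 here
    n_hapaxes = sum(1 for c in in_category.values() if c == 1)
    return n_hapaxes / n_category_tokens

# Toy token list, for illustration only (not corpus data).
tokens = ("warmth warmth warmth strength strength "
          "kindness sadness goodness goodness the").split()
p_ness = potential_productivity(tokens, "ness")  # 2 hapaxes / 4 tokens
p_th = potential_productivity(tokens, "th")      # 0 hapaxes / 5 tokens
```

In this toy list the -th category has more tokens than -ness but no hapaxes, so its P is zero.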

Some more on P
- Unlike RP and P*, P is not very sensitive to corpus size as such.
- However, it is very sensitive to the frequency of the category:
  - e.g. if C is realised only once in a corpus of size N, then P = 1!
- Recent empirical work has shown that RP and P* correlate very strongly, but both exhibit a weak correlation with P (Vegnaduzzo 2009):
  - pattern non-X has high RP and P*, but low P
  - pattern X-ish has low RP and P*, but high P

In graphics (after Baayen 2006)
[Figure: growth curve for a specific category. x-axis: corpus size; y-axis: no. of types. The slope of the tangent represents the growth rate.]
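The growth curve can be approximated by counting how many distinct types of the category have been seen after each stretch of text. This is a sketch under assumptions: the function names are hypothetical, category membership is a suffix match, and the tangent slope is replaced by a finite difference between sampled points.

```python
def growth_curve(tokens, suffix, step):
    """Sample (tokens read so far, types of C seen so far) every `step` tokens."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        word = token.lower()
        if word.endswith(suffix) and len(word) > len(suffix):
            seen.add(word)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve

def slopes(curve):
    """Finite-difference growth rate between consecutive sample points."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(curve, curve[1:])]

# Toy token list, for illustration only (not corpus data).
tokens = "kindness the sadness kindness a goodness of sadness".split()
curve = growth_curve(tokens, "ness", step=2)
rates = slopes(curve)
```

A slope that falls towards zero as more text is read suggests that few new types are still being added for that category.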

P vs. RP and P*
- A category can have low RP and P*, but high P.
  - High P corresponds to the “ease” with which new words can be formed using the category.
- Even though a category has high RP, it may have reached “saturation”, and so have low P.

The psycholinguistic connection
1. Rule vs. direct access:
   - To produce a word (e.g. illegal), you can either retrieve it directly from storage or apply the rule on the fly.
   - Evidence suggests that the frequency of the base form relative to the derived form is related to which of the two alternatives applies.
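One way to make the frequency comparison concrete is a ratio of derived-form frequency to base-form frequency. The sketch below assumes a direction the slide does not state: that a derived form more frequent than its base favours direct access, and otherwise the rule is applied. The function name, the threshold of 1.0, the placeholder words and the counts are all hypothetical.

```python
def likely_route(freq_derived, freq_base, threshold=1.0):
    """Guess the access route from the derived/base frequency ratio.

    Assumption (not from the slides): ratio above `threshold` -> direct access.
    """
    if freq_base == 0:
        return "direct access"  # no attested base to compose from
    ratio = freq_derived / freq_base
    return "direct access" if ratio > threshold else "rule"

# Made-up counts for placeholder words, for illustration only.
examples = {"word_a": (900, 600), "word_b": (2, 500)}
routes = {w: likely_route(d, b) for w, (d, b) in examples.items()}
```

The threshold is exposed as a parameter because any cut-off on a continuous ratio is arbitrary, which fits the gradient view of productivity taken in the corpus-based approaches.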

The psycholinguistic connection
2. Complexity-based affix ordering:
   - Corpus research: more productive affixes follow less productive ones in word formation.
   - It seems that more highly predictable (low productivity) affixes are processed first.
   - High productivity may also imply less likelihood of entering into further derivational processes.

Works cited
- S. Vegnaduzzo (2009). Morphological productivity rankings of complex adjectives. Proc. NAACL-HLT Workshop on Computational Approaches to Linguistic Creativity.
- K. Moilanen and S. Pulman (2008). The good, the bad and the unknown: Morphosyllabic sentiment tagging of unseen words. Proc. ACL 2008.
- Baayen 2006: linked from web page.