Linguistics: Levels of description
Speech and language
  Language as communication
  Speech vs. text
    –Speech is primary
    –Text is derived
    –Text is not “written speech”
    –Speech is not (usually) spoken text
    –Obviously they are related
Levels of description
  Smallest linguistic “unit” is the phoneme (speech) or (by analogy) the grapheme (text)
  Phonemes combine to form words, or more exactly, morphemes
  Morpheme: smallest meaningful unit of language
  Words combine to form sentences (or utterances) according to the rules of syntax
  Form is related to meaning via semantics
  Pragmatics deals with how language use relates to the real world
Phonetics
  Study of speech sounds
  Humans are the only species that has developed language
    –No dedicated speech organs as such
  Not all sounds are speech sounds, even though some of them do convey meaning
  Speech sounds combine in arbitrary ways to form words
Phonetics
  Articulatory phonetics: concerned with how speech sounds are produced
  Acoustic phonetics: concerned with physical properties of the speech signal
  Auditory phonetics: concerned with how speech sounds are perceived
  All are of course related
Possible speech sounds
  Range of sounds possible in human languages
  Consonants vs vowels
  Most consonants are pulmonic egressive
  Consonant sound is determined by place and manner of articulation, plus voicing, and some other features
  Vowel sound is determined by tongue height and position (front/back) plus lip shape (round/spread)
Phonemes
  Huge number of possible distinctions, but not all are significant in any given language
  Differences that are used to distinguish words are phonemic
  Phoneme: group of (similar) sounds perceived by speakers as “the same”
  Other differences are allophonic: they separate allophones of the same phoneme
  Phonemic distinction in one language may be allophonic in another
    –e.g. aspiration of /p/ is allophonic in English but phonemic in Hindi
  Terminology: -etic ~ -emic ~ allo- ~ -ology
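The test "is this difference used to distinguish words?" can be sketched as a search for minimal pairs, i.e. words whose transcriptions differ in exactly one segment. This is only an illustration: the lexicon `LEXICON`, its broad transcriptions, and the function name `minimal_pairs` are assumptions made for the example, not part of the slides.

```python
from itertools import combinations

# Hypothetical toy lexicon: spelling -> broad phonemic transcription,
# one tuple element per segment.
LEXICON = {
    "pat": ("p", "æ", "t"),
    "bat": ("b", "æ", "t"),
    "pad": ("p", "æ", "d"),
    "pit": ("p", "ɪ", "t"),
    "spin": ("s", "p", "ɪ", "n"),
}

def minimal_pairs(lexicon):
    """Return (word1, word2, sound1, sound2) for every pair of words
    whose transcriptions have equal length and differ in one segment."""
    pairs = []
    for (w1, t1), (w2, t2) in combinations(lexicon.items(), 2):
        if len(t1) != len(t2):
            continue
        diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
        if len(diffs) == 1:
            pairs.append((w1, w2, diffs[0][0], diffs[0][1]))
    return pairs

for w1, w2, s1, s2 in minimal_pairs(LEXICON):
    print(f"{w1} ~ {w2}: /{s1}/ vs /{s2}/ is a phonemic contrast")
```

A pair such as pat ~ bat is evidence that /p/ and /b/ are separate phonemes in this toy data; a difference that never separates two words would be a candidate allophonic difference.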
Prosody
  Besides individual speech sounds, other features of speech can carry meaning:
    –Length, volume, pitch
    –Intonation (pitch)
      Can be syntactic or lexical (in some languages)
    –Stress (combination of all three)
      Lexical or semantic/pragmatic
Writing and text
  Various writing systems worldwide
  Most familiar is alphabetic
    –Ideally each letter represents a sound (phoneme)
    –Rarely 1:1 mapping
      Phoneme can have different spellings
      Individual letter can represent different phonemes
      Some phonemes represented by combination of letters (not always contiguous)
  Other possibilities: consonantal, syllabic, ideographic, and various combinations
Graphemes
  Latin alphabet has 26 letters
  But English has ~44 phonemes (the exact count depends on accent and analysis)
  Phoneme can have different spellings
    –/s/ can be ‘s’, ‘c’, ‘sc’, ‘ss’, …
  Individual letter can represent different phonemes
    –‘c’ can be /s/ or /k/
  Some phonemes represented by combination of letters
    –/θ/ ‘th’, /ʃ/ ‘sh’
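The many-to-many relation between graphemes and phonemes can be made concrete with two lookup tables built from a handful of aligned examples. This is a minimal sketch: the example words, their alignments, and the names `EXAMPLES` and `build_maps` are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical hand-aligned examples: (word, grapheme, phoneme).
EXAMPLES = [
    ("sit", "s", "s"),
    ("city", "c", "s"),
    ("scene", "sc", "s"),
    ("miss", "ss", "s"),
    ("cat", "c", "k"),
    ("thin", "th", "θ"),
    ("ship", "sh", "ʃ"),
]

def build_maps(examples):
    """Return (phoneme -> set of spellings, grapheme -> set of phonemes)."""
    spellings = defaultdict(set)
    sounds = defaultdict(set)
    for _word, grapheme, phoneme in examples:
        spellings[phoneme].add(grapheme)
        sounds[grapheme].add(phoneme)
    return spellings, sounds

spellings, sounds = build_maps(EXAMPLES)
print(sorted(spellings["s"]))  # the spellings of /s/ seen in EXAMPLES
print(sorted(sounds["c"]))     # the phonemes that 'c' stands for in EXAMPLES
```

Any phoneme with more than one spelling, or any grapheme with more than one phoneme, shows that the mapping is not 1:1.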
Morphology
  Smallest meaningful unit of language is the morpheme
  Some words are single morphemes (meaning can’t be broken down), but many words have constituent parts
  Words usually consist of a root plus affix(es), though some words can have multiple roots
  Lexeme: abstract notion of a group of word forms that belong together
    –lexeme ~ root ~ base form ~ dictionary (citation) form
Role of morphology
  Commonly made distinction: inflectional vs derivational
  Inflectional morphology is grammatical
    –number, tense, case, gender
  Derivational morphology concerns word building
    –part-of-speech derivation
    –words with related meaning
Morphological processes
  Affixes: prefix, suffix, infix, circumfix
  Umlaut, ablaut
  Gemination, (partial) reduplication
  Root and pattern
  Stress (or tone) change
  Sandhi
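Several of these processes can be pictured as simple string operations. The sketch below covers only affixation and full reduplication and ignores all spelling and sound changes. The function names are invented for the example, and the infix rule (insert after the first character) is a simplifying assumption that only suits roots beginning with a single consonant.

```python
def add_prefix(root, affix):
    """Prefix: affix + root, e.g. un + happy."""
    return affix + root

def add_suffix(root, affix):
    """Suffix: root + affix, e.g. kind + ness."""
    return root + affix

def add_circumfix(root, before, after):
    """Circumfix: material on both sides, e.g. German ge- ... -t."""
    return before + root + after

def add_infix(root, affix):
    """Infix: affix placed after the first character (simplified rule)."""
    return root[0] + affix + root[1:]

def reduplicate(root, separator="-"):
    """Full reduplication: the whole root is repeated."""
    return root + separator + root

print(add_prefix("happy", "un"))           # English
print(add_suffix("kind", "ness"))          # English
print(add_circumfix("spiel", "ge", "t"))   # German past participle
print(add_infix("sulat", "um"))            # Tagalog
print(reduplicate("buku"))                 # Indonesian plural
```

Processes such as umlaut, ablaut, root-and-pattern and sandhi change material inside the root or at boundaries, so they do not reduce to concatenation in this way.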
Language typology
  Based on extent to which morphological processes play a role
  Agglutinative – morphological affixes can be stacked up almost indefinitely
    –Implies that list of “possible words” is infinite
  Isolating (analytic) – little or no affixation
  Extent of morphology can interact with syntax: highly inflected languages often have freer word order
Morphemes
  Morphemes are associated with meaning
  (Like phonemes) the mapping between form and morpheme is not 1:1
  Single morpheme can have various allomorphs
    –Allomorphic variation usually conditioned, either intrinsically, or extrinsically (phonotactics, morphosyntax)
    –Can be “free variation”
  Single form can represent different morphemes
  Often rules of allomorphic variation are systematic
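A familiar case of systematic, phonologically conditioned allomorphy is the regular English plural, pronounced /ɪz/ after sibilants, /s/ after other voiceless sounds, and /z/ elsewhere. The sketch below assumes broad transcriptions given as tuples of segments; the sets `SIBILANTS` and `VOICELESS` and the function names are assumptions made for the example.

```python
# Segment classes used by the rule (broad, illustrative inventories).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(final_sound):
    """Choose the allomorph of the plural morpheme from the stem-final sound."""
    if final_sound in SIBILANTS:
        return "ɪz"
    if final_sound in VOICELESS:
        return "s"
    return "z"

def pluralise(transcription):
    """Append the conditioned plural allomorph to a transcription tuple."""
    return transcription + (plural_allomorph(transcription[-1]),)

print(pluralise(("k", "æ", "t")))   # cat
print(pluralise(("d", "ɒ", "g")))   # dog
print(pluralise(("b", "ʌ", "s")))   # bus
```

One morpheme (plural) surfaces as three forms, and the choice is fully predictable from the neighbouring sound.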
Inflectional morphology
  Grammatical in nature
  Does not carry meaning, other than grammatical meaning
  Highly systematic, though there may be irregularities and exceptions
    –Simplifies lexicon, only exceptions need to be listed
    –Unknown words may be guessable
  Language-specific and sometimes idiosyncratic
  (Mostly) helpful in parsing
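The idea that only exceptions need to be listed can be sketched as an exception table consulted before a few general rules. The example uses English past-tense spelling; the table `IRREGULAR_PAST` and the function `past_tense` are invented for illustration, and the rules deliberately ignore consonant doubling (stop, stopped).

```python
# Only irregular forms are stored; everything else is produced by rule.
IRREGULAR_PAST = {"go": "went", "sing": "sang", "take": "took"}
VOWELS = "aeiou"

def past_tense(verb):
    """Return the past-tense spelling of a verb (toy rules, no doubling)."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in VOWELS:
        return verb[:-1] + "ied"
    return verb + "ed"

for verb in ["walk", "love", "try", "play", "go", "blick"]:
    print(verb, "->", past_tense(verb))
```

The made-up verb "blick" still receives a form, which is the sense in which unknown words are guessable.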
Derivational morphology
  Lexical in nature
  Can carry meaning
  Fairly systematic, and predictable up to a point
    –Simplifies description of lexicon: regularly derived words need not be listed
    –Unknown words may be guessable
  But …
    –Apparent derivations have specialised meaning
    –Some derivations missing
  Languages often have parallel derivations which may be translatable
Issues for NLP
  Need a scheme to handle morphology
  Can involve ambiguity which must be resolved in analysis
  Can contribute to syntactic analysis
    –Morphological analysis identifies the lexeme plus grammatical information associated with inflections
  And vice versa
    –Morphological ambiguity may be resolved by syntactic context
  For many applications it is enough to deal with just lexemes rather than word forms plus grammatical information: stemming
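The two-way relation between morphology and syntax can be sketched with a toy analyser that returns every (lexeme, part of speech, feature) reading of a word form, followed by a filter that uses the preceding word to pick one. Everything here is a stand-in: the lexicon `LEXEMES`, the context table `EXPECTED_AFTER`, and both function names are assumptions for the example, and real systems use far richer context.

```python
# Hypothetical lexicon: lexeme -> parts of speech it can have.
LEXEMES = {"walk": {"N", "V"}, "dog": {"N"}}

# Hypothetical syntactic cue: part of speech expected after a given word.
EXPECTED_AFTER = {"the": "N", "she": "V"}

def analyse(word):
    """Return all (lexeme, pos, features) readings of a word form."""
    analyses = []
    if word in LEXEMES:
        for pos in ("N", "V"):
            if pos in LEXEMES[word]:
                analyses.append((word, pos, "base"))
    if word.endswith("s") and word[:-1] in LEXEMES:
        stem = word[:-1]
        if "N" in LEXEMES[stem]:
            analyses.append((stem, "N", "plural"))
        if "V" in LEXEMES[stem]:
            analyses.append((stem, "V", "3sg-present"))
    return analyses

def disambiguate(previous_word, word):
    """Keep only the readings compatible with the preceding word, if known."""
    candidates = analyse(word)
    expected = EXPECTED_AFTER.get(previous_word)
    if expected is None:
        return candidates
    return [a for a in candidates if a[1] == expected]

print(analyse("walks"))
print(disambiguate("the", "walks"))
print(disambiguate("she", "walks"))
```

The form "walks" is ambiguous in isolation (plural noun or third-person verb), and the context narrows it to one reading, while the analysis in turn supplies the number and tense information a parser needs.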
Morphological processing
  Stemming
  String-handling approaches
    –Regular expressions
    –Mapping onto finite-state automata
  2-level morphology
    –Mapping between surface form and lexical representation
  Related issues of what is in lexicon
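A regular-expression approach can be sketched as an ordered list of rewrite rules that map a surface form onto a lexical representation of the shape stem+s, from which a stem is then read off. This is a toy rule set for English -s endings only, written for illustration; it is not a full two-level system, and the names `RULES`, `to_lexical` and `stem` are assumptions.

```python
import re

# Ordered (pattern, template) rules; the first full match wins.
RULES = [
    (re.compile(r"(\w+)ies"), r"\1y+s"),                # flies -> fly+s
    (re.compile(r"(\w+(?:s|x|z|ch|sh))es"), r"\1+s"),   # boxes -> box+s
    (re.compile(r"(\w+[^s])s"), r"\1+s"),               # cats  -> cat+s
]

def to_lexical(surface):
    """Map a surface form to a lexical form 'stem+s', or return it unchanged."""
    for pattern, template in RULES:
        match = pattern.fullmatch(surface)
        if match:
            return match.expand(template)
    return surface

def stem(surface):
    """Stemming: keep only the part before the morpheme boundary."""
    return to_lexical(surface).split("+")[0]

for word in ["flies", "boxes", "cats", "glass", "walk"]:
    print(word, "->", to_lexical(word), "->", stem(word))
```

Without a lexicon the rules overgenerate: "ties" would come out as ty+s and "horses" as hors+s, which is one reason the question of what is stored in the lexicon matters.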