Download presentation
1
LIN 3098 Corpus Linguistics – Lecture 4
Albert Gatt
2
LIN 3098 -- Corpus Linguistics
In this lecture Levels of annotation Corpus typology classification based on type and levels of annotation multilingual corpora LIN Corpus Linguistics
3
Levels of corpus annotation (cont/d)
Part 1 Levels of corpus annotation (cont/d)
4
Levels of linguistic annotation
part-of-speech (word-level) lemmatisation (word-level) parsing (phrase & sentence-level) semantics (multi-level) semantic relationships between words and phrases semantic features of words discourse features (supra-sentence level) phonetic transcription prosody LIN Corpus Linguistics
5
LIN 3098 -- Corpus Linguistics
Lemmatisation Groups morphological variants of a word under the head word: mexa’ (walk) imxejt (I walked) imxejna (we walked) nimxu (we walk) ... Increasingly common these days. Together , these form a lemma LIN Corpus Linguistics
6
Lemmatisation example: the SUSANNE corpus
Format: word + tag + lemma A05: VVDv said say Every word in the corpus is on separate line. Extremely useful for lexicography Corpus file:sentence.word POS tag actual word head word (lemma) LIN Corpus Linguistics
7
Automatic morphological analysis
For some languages, there are reasonably good lemmatisers/ morphological analysers: Examples for English: morpha: built at the University of Sussex EngTwol: commercial, by LingSoft. LIN Corpus Linguistics
8
LIN 3098 -- Corpus Linguistics
Engtwol output undeniable: "undeniable" <DER:ble> A ABS (derived with –ble suffix) adjective (A) absolute (ABS) form This is a rule-based analyser. There are others which use corpus-derived statistical patterns. LIN Corpus Linguistics
9
Semantic annotation I: Two types
markup of semantic relations (e.g. predicate-argument structure) currently used in parsed corpora, to mark up function-argument structures etc. markup of features of word meaning (mainly, word senses) has origins in content analysis to arrive at conclusions about how prominent particular concepts are Now used in a lot of work on word sense disambiguation LIN Corpus Linguistics
10
Example of type 1 semantic markup (Penn Treebank)
(S (NPSBJ1 Chris) (VP wants (S (NPSBJ *1) (VP to (VP throw (NP the ball)))))) Predicate Argument Structure: wants(Chris, throw(Chris, ball)) Empty embedded subject linked to NP subject no. 1 LIN Corpus Linguistics
11
Semantic markup type 2: lexical features
Most common type: word-sense tagged corpora Main idea: disambiguate a word in context by tagging its sense Often uses WordNet (Miller et al 1993) WordNet is a lexical taxonomy which represents lexical relations within a large number of words. including hyponymy (IS-A) relations etc For each entry, all the (supposed) senses of the word are given. Main use: identify senses of words in context, mark them up with a pointer to a wordnet sense. LIN Corpus Linguistics
12
WordNet senses: Move (noun)
(377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer") (70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire") (57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility") (30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path") (5) move -- ((game) a player's turn to take some action permitted by the rules of the game) LIN Corpus Linguistics
13
LIN 3098 -- Corpus Linguistics
WordNet senses: Move (verb) (130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell") (60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant") (52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right") (20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another") LIN Corpus Linguistics
14
LIN 3098 -- Corpus Linguistics
Check it out! Wordnet is freely available for download: LIN Corpus Linguistics
15
Word sense annotation: other uses
tagging words with their semantic field (Wilson 1996) plant life men’s clothing … tagging words with their “emotional” content (Campbell & Pennebaker 2002) based on a dictionary: social processes negative emotions This approach underlies Pennebaker’s Linguistic Inquiry and WordCount (LIWC) system, analyses a text and comes up with a profile of its personal/emotional content relates this to some features of its author (gender, age…) LIN Corpus Linguistics
16
LIN 3098 -- Corpus Linguistics
Discourse annotation Most common: text-level things such as paragraphs Less common: anaphoric NPs and reference (cf. example from lecture 3) Even less common: annotation of words which function as discourse cues (Stenstrom 1984): apology (sorry), hedges (sort of), etc annotation of rhetorical structure LIN Corpus Linguistics
17
Discourse: Annotating rhetorical structure (I)
Rhetorical Structure Theory (Mann and Thompson 1988): views text as made up of “discourse units” units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc CONCESSION/CONTRAST relation: [Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis]. Second unit is the main one (nucleus) First unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact. Recent corpus (Marcu et al 2003) is annotated with this information. LIN Corpus Linguistics
18
Phonetic transcription
Not many phonetically transcribed corpora. MARSEC corpus is one of the best known. This is a version of the Lancaster/IBM Spoken English Corpus. Several databases of transcribed speech, however. Mostly used for statistical speech technology applications (e.g. text-to-speech synthesis). LIN Corpus Linguistics
19
Annotating suprasegmentals
Aims: capture suprasegmental features such as stress, intonation and pauses in spoken speech. Some transcription systems exist TOBI (American) Tonic Stress Marker (TSM; British) define ways of annotating suprasegmentals such as start/end of tone group; simultaneous speech, rise-fall tone, falling tone, etc… LIN Corpus Linguistics
20
Problem-oriented tagging
If you’re interested in a particular problem, and no corpus exists, build your own! Many corpora define problem-specific annotation schemes. LIN Corpus Linguistics
21
Example: the TUNA Corpus
Problem: How do people refer to objects using definite NPs? Main interest: visual properties (colour, size etc) Focus: semantics of definite NPs, i.e. what people choose to include in their description. Method: experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene” annotation of descriptions based on semantics LIN Corpus Linguistics
22
TUNA Corpus: description
<DESCRIPTION NUM="SINGULAR"> <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE> <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE> <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE> </DESCRIPTION> Red sofa, bigger version. Features of the corpus: represents the “target” referent also represents the “distractors” (from which the target must be distinguished) semantically transparent: annotation goes beyond language LIN Corpus Linguistics
23
Part 2 Multilingual corpora
24
Why multilingual corpora?
comparative studies syntax morphology … the cornerstone of most research in automatic machine translation nowadays most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text. LIN Corpus Linguistics
25
LIN 3098 -- Corpus Linguistics
Parallel corpora Represents a text in its original language (L1), with a translation in another language (L2) long history: Medieval polyglot bibles were among the first “parallel” corpora Alignment: Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level… Sentence-level alignment can be achieved automatically with very high accuracy! LIN Corpus Linguistics
26
Example: SMULTRON corpus
Developed and released in Relatively small Aligned texts in English, Swedish and German E.g. Sophie’s World is one of the texts Annotated with syntax, POS, morphology Comes with a tool to view parallel syntactic trees. LIN Corpus Linguistics
27
SMULTRON example: English (Sophie’s World)
<s id=“s3”> <terminals> <t id="s3_1" word="Sophie" pos="NNP" morph="--"/> <t id="s3_2" word="Amundsen" pos="NNP" morph="--"/> <t id="s3_3" word="was" pos="VBD" morph="--"/> <t id="s3_4" word="on" pos="IN" morph="--"/> <t id="s3_5" word="her" pos="PRP$" morph="--"/> <t id="s3_6" word="way" pos="NN" morph="--"/> <t id="s3_7" word="home" pos="RB" morph="--"/> <t id="s3_8" word="from" pos="IN" morph="--"/> <t id="s3_9" word="school" pos="NN" morph="--"/> <t id="s3_10" word="." pos="." morph="--"/> </terminals> </s> This shows terminal nodes only. Corpus Also represents syntactic non-terminals (NP, VP etc) LIN Corpus Linguistics
28
SMULTRON: Same sentence in German
<s id=“3”> <terminals> <t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie " /> <t id="s3_2“ word="Amundsen" pos="NE" morph="--" lemma="Amundsen“ /> <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/> <t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" /> <t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" /> <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg“ /> <t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" /> <t id="s3_8" word="der" pos="ART" morph="--" lemma="die" /> <t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" /> <t id="s3_10" word="." pos="$." morph="--" lemma="--" /> </terminals> </s> Note: richer morphology, representation of lemmas, … LIN Corpus Linguistics
29
LIN 3098 -- Corpus Linguistics
Translation corpora Not parallel. Have different texts in two or more different languages, of the same genre. Examples: PAROLE corpus is a translation corpus for EU languages LIN Corpus Linguistics
30
Why translation corpora?
Parallel corpora, by definition, contain translation (L2) can give rise to errors artificiality and translation quality can be an issue e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads “like a translation” Problem can be overcome if the texts used are professionally translated. Translation corpora have texts in two or more languages, “in the original”. Data is more natural. LIN Corpus Linguistics
31
LIN 3098 -- Corpus Linguistics
Summary We have now concluded our initial incursion into: corpus construction corpus annotation corpus typology Next up: using corpora for linguistic research LIN Corpus Linguistics
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.