Published by Erika Townsend. Modified over 6 years ago.
1
Updating the LINDSEI for Pragmatic Analysis and Evaluation
2
Outline
  Background on LINDSEI
  Positive Design Aspects
  Shortcomings of the Annotation Scheme
    Mix of Tasks
    Mix of Conventions
    Absence of Punctuation/Structuring
  The Proposed Format
  Adding/Added Feature Information
  Examples of Analysis Potential
3
Background on LINDSEI
  LINDSEI = Louvain International Database of Spoken English Interlanguage
  NNS data from 11 language backgrounds
  started in 1995
  ver. 1 released in 2002, ver. 2 in 2009
  CD-ROM with simple database interface
  NS counterpart: LOCNEC (Louvain Corpus of Native English Conversation)
4
Positive Design Aspects
transcription conventions generally well adhered to
  apparently few inconsistencies
  few ‘idiosyncratic’ additions/applications
  relatively easy to convert
unfilled pauses generally marked for relevant speaker, i.e. turn-initially
attempts at
  encoding data in Unicode (UTF-16)
  representing some additional relevant phonological/prosodic features, i.e. strong forms of articles
5
Shortcomings of the Annotation Scheme (I) – Mix of Tasks
3 task sub-genres within one file
  set topic for ‘warm-up’
    3 choices
    too much variation in topics
  free discussion
  picture description
contravenes Sinclair’s ‘integrity principle’
  “The integrity and representativeness of complete artefacts is far more important than the difficulty of reconciling texts of different dimensions” (2005)
better to keep ‘sub-genres’ separate
6
Shortcomings of the Annotation Scheme (II) – Mix of Conventions
(pseudo-)SGML tags for
  speaker turns (<A>…</A>, <B>…</B>)
  unclear words/passages (<X>, <XX>, <XXX>)
  uncertain words/endings (<?>)
  foreign words (<foreign></foreign>)
‘CA conventions’
  pauses (., .., ...)
  truncation (=)
  lengthening (:)
  article pronunciation ([ei] for [eɪ] & [i:] for [ðiː])
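The mix of conventions listed above is regular enough to be converted mechanically. The following is a minimal sketch, not the conversion routine actually used for the corpus. The function name `convert_legacy`, the list of filled pauses, and the pause type "long" are assumptions; only the pause types "short" and "medium" and the `<phon sounds_like="ðiː" />` element appear in the slides.

```python
import re

# Assumed mapping from the number of dots to a pause type.
# "short" and "medium" occur in the slides; "long" is a guess.
PAUSE_TYPES = {1: "short", 2: "medium", 3: "long"}


def convert_legacy(text: str) -> str:
    """Rewrite a few legacy transcription marks as empty XML elements."""
    # Strong form of the definite article: the[i:]
    text = re.sub(r"\bthe\[i:\]", 'the <phon sounds_like="ðiː" />', text)
    # Filled pauses in parentheses, e.g. (em); the list is an assumption.
    text = re.sub(r"\((em|er|erm|eh|mm)\)", r"\1", text)

    # Unfilled pauses: stand-alone runs of one to three dots.
    def pause(match: re.Match) -> str:
        return f'<pause type="{PAUSE_TYPES[len(match.group(1))]}" />'

    return re.sub(r"(?<!\S)(\.{1,3})(?!\S)", pause, text)


legacy = "it's about a teacher who (em) . worked .. in the[i:] . slum . school"
converted = convert_legacy(legacy)
print(converted)
```

Speaker tags, truncation and lengthening marks are left untouched here; they would need further rules of the same kind.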
7
Shortcomings of the Annotation Scheme (III) – Absence of Punctuation/Structuring
leads to ‘run-on’ text, e.g.
  I'll talk about one of the films that I think is a good film the name of the film is . dangerous mind it's about a teacher who (em) . worked .. in the[i:] . slum . school […]
no easy option for identifying, distinguishing, & understanding different functional units
no easy option for distinguishing different syntax types/functions & grammar-related errors
issues in ‘countability’
  number of functional units? (required for accurate norming)
  distinction initial/medial/final pauses/hesitation phenomena
  etc.
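Why the number of functional units matters for norming can be illustrated with a small calculation. This is a sketch under stated assumptions: the function name `normed_frequency`, the base of 100 units, and all figures are hypothetical and do not come from LINDSEI.

```python
def normed_frequency(raw_count: int, n_units: int, per: int = 100) -> float:
    """Return the number of occurrences per `per` functional units."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return raw_count * per / n_units


# Hypothetical figures: two learners with the same raw pause count
# but a different number of functional units.
learner_a = normed_frequency(raw_count=20, n_units=80)
learner_b = normed_frequency(raw_count=20, n_units=200)
print(learner_a, learner_b)
```

With identical raw counts the two normed values differ, which is only visible once unit boundaries are marked and the units can be counted.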
8
The Proposed Format (I) – ‘Simple XML’
encoded in UTF-8
  more efficient/space saving than UTF-16
  standard for XML
  allows representing phonological & non-English info adequately
simplified ‘hierarchy’
  low number of ‘container’ elements
  more empty ones
more consistency in representing annotation features (avoiding mix)
  easier to trap/avoid annotation errors
display can be controlled via style sheets
  enhanced readability/usability/searchability
9
The Proposed Format (I) – Structure
<?xml version="1.0"?>
(optional style sheet for rendering)
<dialogue id="CH013_free_interview" corpus="LINDSEI" sub-corpus="CH" lang="en">
  <turn speaker="interviewer_CH013" n="1">
    [unit 1]
    [unit 2]
  </turn>
  <turn speaker="learner_CH013" n="2">
    [unit 3]
    [unit 4]
  </turn>
</dialogue>

XML declaration
dialogue container
  dialogue id; based on sub-genre
  corpus id
  sub-corpus id
turns, including speaker details/roles
c-units, generally with empty <punct type="…" /> elements; text interspersed with freely definable empty elements, e.g. comments, etc.
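A file with this structure can be read with any standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` to list the turns and the units they contain. The unit contents, the use of `dm`, `q-wh` and `frag` as element names, and the `punct` type values "stop" and "query" are placeholders invented for illustration; only the dialogue and turn attributes are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the skeleton on the slide.
SAMPLE = """<?xml version="1.0"?>
<dialogue id="CH013_free_interview" corpus="LINDSEI" sub-corpus="CH" lang="en">
  <turn speaker="interviewer_CH013" n="1">
    <dm>okay <punct type="stop" /></dm>
    <q-wh>what would you like to talk about <punct type="query" /></q-wh>
  </turn>
  <turn speaker="learner_CH013" n="2">
    <frag>a film <punct type="stop" /></frag>
  </turn>
</dialogue>"""


def summarise(dialogue: ET.Element) -> list:
    """Return (speaker, turn number, unit tags) for every turn."""
    return [
        (turn.get("speaker"), turn.get("n"), [unit.tag for unit in turn])
        for turn in dialogue.iter("turn")
    ]


root = ET.fromstring(SAMPLE)
summary = summarise(root)
for speaker, number, units in summary:
    print(number, speaker, units)
```

Because every turn and unit is a proper element, counting units per speaker reduces to counting child elements.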
10
The Proposed Format (II) – Annotation Details
attributes for: speech-act, polarity, semantics, & ‘IFIDs’
syntax ‘container’ elements
consistent use of empty elements & relevant attributes for other information
contingent attribute status for abandoned or interrupted units
11
The Proposed Format (III) – Main Markup Features
syntax categories:
  yes( response), no( response), d(iscourse )m(arker), q-wh, q-yn, imp(erative), decl(arative), frag(ment), exclam(ation), (term of )address
speech acts:
  taxonomy comprises 70+ categories
  some combinable
  see for details
  still evolving
12
The Proposed Format (IV) – Main Markup Features
modes:
  Searle’s ‘IFIDs’
  semantico-pragmatic key-phrase patterns, i.e. interactional/interpersonal markers
  generic; evolving
topics:
  semantic key-phrase patterns
  generic + domain-specific; evolving
backchannel info integrated into interlocutor’s turn
13
Adding/Added Feature Information (I)
theoretically unlimited options for adding freely definable empty elements, but overuse may decrease readability
currently used:
  comments related to
    grammar (e.g. if you are <comment type="grammar" content="missing indefinite determiner" /> student)
    idiom (e.g. … the <phon sounds_like="ðiː" /> not very far <comment type="idiom" content="unusual pre-modification" /> place …)
    vocabulary (e.g. … the palace is […] forbidden for the <pause type="short" /> populates <comment type="vocab" content="should be people or general population" /> …)
  pronunciation in IPA
    original marked article strong forms (e.g. … the <phon sounds_like="ðiː" /> not very far […])
    potentially other idiosyncrasies
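Because comments are empty elements with a `type` attribute, they can be tallied directly and stripped again to recover the plain text. The sketch below uses the three comment examples from the slide; the `<sample>` and `<unit>` wrappers and the function names are hypothetical and serve only to make the snippet well-formed.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Slide examples wrapped in hypothetical <sample>/<unit> containers.
SNIPPET = """<sample>
<unit>if you are <comment type="grammar" content="missing indefinite determiner" /> student</unit>
<unit>the <phon sounds_like="ðiː" /> not very far <comment type="idiom" content="unusual pre-modification" /> place</unit>
<unit>the palace is […] forbidden for the <pause type="short" /> populates <comment type="vocab" content="should be people or general population" /></unit>
</sample>"""


def comment_profile(node: ET.Element) -> Counter:
    """Count <comment> elements by their type attribute."""
    return Counter(c.get("type") for c in node.iter("comment"))


def plain_text(unit: ET.Element) -> str:
    """Return the words of a unit with all empty elements removed."""
    return " ".join("".join(unit.itertext()).split())


root = ET.fromstring(SNIPPET)
profile = comment_profile(root)
first_unit_text = plain_text(root.find("unit"))
print(profile)
print(first_unit_text)
```

The same tally, restricted to one comment type or one speaker, gives the kind of error profile sketched in the analysis examples.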
14
Adding/Added Feature Information (II)
lengthening (e.g. because er i heard something like that she <phon type="lengthening" /> <punc type="incomplete" />)
corrections (e.g. some people have the habit <pause type="medium" /> em <pause type="short" /> of em swimming every <correction orig="everyday" /> day […])
disfluency-related annotation
  some information already present in mode attribute for repetitions
  more explicit info can be added manually, as for Quan & Weisser (2015), e.g.
    when I <pause type="medium" /> when I <recycling type="2WR" start="ADV" count="1" /> got her house
    we are qualified for <pause type="medium" /> to <replacement type="" target="PP" /> be a teacher
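Candidates for manual recycling annotation can be pre-selected automatically by looking for immediately repeated word sequences. The following is a rough sketch and not the procedure of Quan & Weisser (2015); the function names and the maximum length of three words are assumptions, as is the reading of "2WR" as a two-word recycling.

```python
import re


def tokens_of(unit: str) -> list:
    """Lower-case the unit, drop all tags, and split into words."""
    return re.sub(r"<[^>]*>", " ", unit).lower().split()


def find_recyclings(tokens: list, max_len: int = 3) -> list:
    """Return (start index, length) for each immediately repeated sequence."""
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            first = tokens[i:i + n]
            second = tokens[i + n:i + 2 * n]
            if len(first) == n and first == second:
                hits.append((i, n))
                i += n  # continue after the first copy
                break
        else:
            i += 1  # no repetition starts at this position
    return hits


unit = 'when I <pause type="medium" /> when I got her house'
candidates = find_recyclings(tokens_of(unit))
print(candidates)
```

Each hit only flags a position for inspection; attributes such as `start` or `target` still have to be assigned by an annotator.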
15
Examples of Analysis Potential I – Interactional Speaker Profiling
potential ‘insecurity’ markers
‘initiative’/topic control indicators
‘empty’ interaction signals
16
Examples of Analysis Potential II – Error Profiling
problems with singular/plural distinction
problems with determiners
17
Examples of Analysis Potential III – Ngram/Lexical Bundle Analysis
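Once units are delimited, n-grams can be counted inside each unit so that no bundle spans a unit boundary. This is a minimal sketch with hypothetical function names; the two units are taken from the run-on example in the shortcomings section, with a unit boundary and the removal of pause marks assumed for illustration.

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> list:
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bundle_counts(units: list, n: int = 2, min_freq: int = 2) -> dict:
    """Count n-grams within units and keep those reaching min_freq."""
    counts = Counter()
    for unit in units:
        counts.update(ngrams(unit.lower().split(), n))
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


# Assumed segmentation of the run-on example into two units.
units = [
    "I'll talk about one of the films that I think is a good film",
    "the name of the film is dangerous mind",
]
bundles = bundle_counts(units, n=2, min_freq=2)
print(bundles)
```

In run-on text without unit boundaries the bigram "film the" would also be counted, although it straddles two units.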
18
References
Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward. (1999). Longman Grammar of Spoken and Written English. London: Longman.
Gilquin, Gaëtanelle, De Cock, Sylvie, & Granger, Sylviane. (Eds.). Louvain International Database of Spoken Interlanguage (LINDSEI). Louvain: UCL Presses Universitaires de Louvain.
Quan, Lihong & Weisser, Martin. (2015). A study of ‘self-repair’ operations in conversation by Chinese English learners. System, 49, pp. 39–49. DOI: /j.system
Searle, John. (1969). Speech Acts: an Essay in the Philosophy of Language. Cambridge: CUP.
Sinclair, John. (2005). Corpus and Text – Basic Principles. In Wynne, M. (Ed.). Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books, pp. 1–16. Available online from
Weisser, Martin. (2016a). Practical Corpus Linguistics: an Introduction to Corpus-Based Language Analysis. Oxford: Wiley-Blackwell.
Weisser, Martin. (2016b). DART – the Dialogue Annotation and Research Tool. Corpus Linguistics and Linguistic Theory, 12(2), pp DOI: /cllt
Weisser, Martin. (forthcoming 2016). Profiling Agents & Callers: a Dual Comparison Across Speaker Roles and British vs. American English. In Pickering, L., Friginal, E., & Staples, S. (Eds.). Talking at Work: Corpus-based Explorations of Workplace Discourse. London: Palgrave Macmillan.