UAM CorpusTool: An Overview
Debopam Das
Discourse Research Group, Department of Linguistics, Simon Fraser University
Feb 5, 2014
Outline
  - UAM CorpusTool (O’Donnell, 2008)
      - Tool description
      - A short tutorial
  - Annotating signals of coherence relations with UAM CorpusTool
UAM CorpusTool
  - Created by Mick O’Donnell in 2008
  - Replaces the earlier Systemic Coder, which only allowed coding of single documents at a single layer
  - Available at [URL missing]
  - Runs on Windows and Mac OS
  - “… primarily aimed at the linguist or computational linguist who does not program, and would rather spend their time annotating text than learning how to use the system.” (O’Donnell, 2008: 13)
UAM CorpusTool
  - Annotate documents: text type, writer characteristics, register, etc.
  - Annotate segments
      - Tagging sections of a text by function (abstract, introduction, body, conclusion)
      - Tagging sentences (active/passive; simple/complex) or clauses (relative/imperative/non-finite)
      - Semantic or pragmatic annotation (synonymy/antonymy; speech acts)
      - Tagging POS (noun, verb, adjective)
      - Automatic grammar analysis (English only) using the Stanford parser
      - Rhetorical structure annotation
Annotation in UAM CorpusTool: Main Steps
  1. Start a new project
  2. Add one or more annotation layers
      - You can use a pre-built annotation scheme or design your own
  3. Add files
      - Import .txt files and incorporate them
  4. Annotate
Annotation in UAM CorpusTool: Main Window [screenshot]
Annotation in UAM CorpusTool: Annotation Scheme [screenshots]
Annotation in UAM CorpusTool: Document Coding [screenshot]
Annotation in UAM CorpusTool: Segment Coding [screenshot]
Other Components
  - Search
  - Autocode
  - Statistics
  - Explore
  - Options
  - Help
Annotating Signals of Coherence Relations
  - Goal: annotate signals of coherence relations
  - Signals of coherence relations
      - E.g., "John is tall, but Mary is short."
      - One straightforward signal: the discourse marker ‘but’
      - Two further signals are also present:
          - Antonyms (tall ~ short)
          - Parallel syntactic constructions (subj – copula – adj)
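To make the example concrete, here is a minimal Python sketch of how one relation carrying several signals might be recorded. The class names, field names, and the relation label "Contrast" are illustrative assumptions; the slides do not name the relation or prescribe a data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Signal:
    kind: str       # hypothetical label, e.g. "discourse marker"
    evidence: str   # the words or pattern that realise the signal


@dataclass
class RelationInstance:
    text: str
    relation: str   # assumed label; not stated in the slides
    signals: List[Signal] = field(default_factory=list)


example = RelationInstance(
    text="John is tall, but Mary is short.",
    relation="Contrast",
    signals=[
        Signal("discourse marker", "but"),
        Signal("antonymy", "tall ~ short"),
        Signal("parallel syntax", "subj - copula - adj"),
    ],
)

# One relation, several co-occurring signals.
for s in example.signals:
    print(f"{s.kind}: {s.evidence}")
```

The point of the sketch is only that a single relation may need more than one signal tag, which is why the tool requirements later in the talk ask for multiple annotations per element.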
Annotating Signals of Coherence Relations
  - Annotate the RST Discourse Treebank (Carlson et al., 2002)
      - Contains 385 articles from The Wall Street Journal
      - The articles are already annotated for rhetorical (coherence) relations
      - Approx. 22,000 discourse units and 17,000 relations in total
Annotating Signals of Coherence Relations
Requirements from an annotation tool
  - Importability: relevant data can be imported into the tool
  - Annotation scheme: support for a three-level hierarchical taxonomy
  - Customizability: easy access to the annotation scheme for editing
  - Multiple annotations: two or more tags for a single element
  - Convertibility: XML output
  - Simplicity: no advanced computational knowledge required; graphical interface
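Three of these requirements (a three-level taxonomy, multiple tags per element, XML output) can be illustrated with a short Python sketch. The taxonomy labels, the helper names `path_for` and `to_xml`, and the XML element and attribute names are all hypothetical; this is not the scheme used in the project and not UAM CorpusTool's own file format.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-level taxonomy: class -> type -> specific signal.
TAXONOMY = {
    "lexical": {"discourse marker": ["but"]},
    "semantic": {"lexical relation": ["antonymy"]},
    "syntactic": {"construction": ["parallel syntax"]},
}


def path_for(label):
    """Return the (level 1, level 2, level 3) path for a leaf label."""
    for level1, types in TAXONOMY.items():
        for level2, leaves in types.items():
            if label in leaves:
                return (level1, level2, label)
    raise KeyError(label)


def to_xml(segment_text, labels):
    """Serialise one segment carrying several tags as an XML string."""
    seg = ET.Element("segment")
    ET.SubElement(seg, "text").text = segment_text
    for label in labels:
        l1, l2, l3 = path_for(label)
        ET.SubElement(seg, "signal", {"class": l1, "type": l2, "name": l3})
    return ET.tostring(seg, encoding="unicode")


xml_out = to_xml(
    "John is tall, but Mary is short.",
    ["but", "antonymy", "parallel syntax"],
)
print(xml_out)
```

Keeping the taxonomy as plain nested data is one way to meet the customizability requirement as well, since adding or renaming a category does not touch the serialisation code.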
Signalling Annotation with UAM CorpusTool
Problem with importing data
  - UAM CorpusTool supports RST annotation and can directly import RST files
  - However, it cannot add layered annotation on top of the RST-level structure
Solution to the problem
  - Convert the RST base files from LISP to text format
  - Import the converted files
  - This retains the discourse structures and all relational information
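The slides do not show how the LISP-to-text conversion was done. As a rough illustration only, the Python sketch below parses a small LISP-style tree and writes it out as indented plain-text lines that keep node labels and relation names. The sample tree, the `_!…_!` text delimiters, and the output layout are assumptions, simplified for the example.

```python
import re

# Text leaves wrapped in _!..._! stay whole; otherwise split on parentheses and whitespace.
TOKEN = re.compile(r"_!.*?_!|[()]|[^\s()]+", re.S)


def parse(src):
    """Parse a LISP-style string into nested Python lists."""
    stack = [[]]
    for tok in TOKEN.findall(src):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    return stack[0][0]


def clean(tok):
    """Drop the assumed _!..._! wrapper around text leaves."""
    if tok.startswith("_!") and tok.endswith("_!") and len(tok) >= 4:
        return tok[2:-2]
    return tok


def to_lines(node, depth=0):
    """Flatten the tree into indented lines, one line per tree node."""
    attrs, kids = [], []
    for item in node[1:]:
        if isinstance(item, str):
            attrs.append(clean(item))
        elif all(isinstance(x, str) for x in item):
            attrs.append(" ".join(clean(x) for x in item))
        else:
            kids.append(item)
    lines = ["  " * depth + f"{node[0]} [{'; '.join(attrs)}]"]
    for kid in kids:
        lines.extend(to_lines(kid, depth + 1))
    return lines


# Hypothetical input, not taken from the treebank.
SAMPLE = """
( Root (span 1 2)
  ( Nucleus (leaf 1) (rel2par Contrast) (text _!John is tall,_!) )
  ( Nucleus (leaf 2) (rel2par Contrast) (text _!but Mary is short._!) )
)
"""

lines = to_lines(parse(SAMPLE))
print("\n".join(lines))
```

Because every node keeps its label and attributes on its own line, the tree shape and the relation names survive the conversion, which is the property the solution above relies on.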
Signalling Annotation with UAM CorpusTool: How did we do the rest?
Signalling Annotation with UAM CorpusTool: Annotation Scheme [screenshot]
Signalling Annotation with UAM CorpusTool: Annotation Window [screenshot]
References
  - Carlson, L., Marcu, D., & Okurowski, M. E. (2002). RST Discourse Treebank, LDC2002T07 [Corpus]. Philadelphia, PA: Linguistic Data Consortium.
  - O'Donnell, M. (2008). The UAM CorpusTool: Software for corpus annotation and exploration. Paper presented at the XXVI Congreso de AESLA, Almeria, Spain.
Thank You!