Annotating the HKCSE Pragmatically Martin Weisser Visiting Professor School of English and Education Guangdong University of Foreign Studies mail: martinweisser.org

Outline
The Conversion Process
Pre-processing Requirements
Annotation & Post-processing
Searching & Exploring the Corpus
Conclusion

The Conversion Process I – Issues
how to convert to DART XML format?
– identify original conventions
    some documented in Cheng et al. (2008)
    some undocumented
– use tone unit marking?
    unfortunately, tone units in Brazil's system for 'discourse intonation' ≠ C-units
    → no 'sentence' intonation directly inferable
– remove prosodic information, apart from stress and tone movements, to ensure readability
– handle overlap
    exact extent not marked or inferable
    → better to delete
– etc.

The Conversion Process II – Original Format

The Conversion Process III – the Conversion Editor
original input file conversion result view conversion script editor save output

The Conversion Process IV – Conversion Results
converted to DART XML format
retained stress marking
converted & moved tone marking
converted 'non-speech' to comments
added gender attribute
added speaker type attribute
moved pauses to next turn
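The kind of result listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the actual DART XML format: the element name `turn`, the attribute names `n`, `speaker`, `gender` and `type`, the value `hkc`, and the use of capitals for stress in the sample text are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def turn_to_xml(n, speaker, gender, speaker_type, text, non_speech=None):
    """Build one turn element (all tag and attribute names are hypothetical)."""
    turn = ET.Element(
        "turn",
        {"n": str(n), "speaker": speaker, "gender": gender, "type": speaker_type},
    )
    if non_speech:
        # A non-speech event becomes an XML comment in front of the turn text.
        note = ET.Comment(f" {non_speech} ")
        note.tail = text
        turn.append(note)
    else:
        turn.text = text
    return ET.tostring(turn, encoding="unicode")


# Hypothetical input values, not taken from the HKCSE.
example = turn_to_xml(1, "a", "f", "hkc", "i THINK so", non_speech="laugh")
print(example)
```

Keeping gender and speaker type as attributes on the turn is what would later allow searches that combine speech acts with these properties.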

Pre-processing Requirements
creating new resources in/for DART
– adapt DART modules to handle mixed case
– 'synthesise' domain-specific lexicon
– create domain-specific topic 'thesaurus'
pre-processing
– fix conversion errors
– identify/mark incomplete words
– split turns
– add punctuation, partly based on original prosodic features
– etc.
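One of the pre-processing steps above, marking incomplete words, could look roughly like the following sketch. It assumes, purely for illustration, that a truncated word is written with a trailing hyphen and that it should be wrapped in a `<trunc>` element; neither convention is confirmed by the slides.

```python
import re

# Hypothetical convention: a word fragment ends in "-" and is followed by
# whitespace or the end of the unit (so "well-known" is left alone).
INCOMPLETE = re.compile(r"\b(\w+)-(?=\s|$)")


def mark_incomplete(text):
    """Wrap truncated words in a hypothetical <trunc> element."""
    return INCOMPLETE.sub(r"<trunc>\1</trunc>", text)


print(mark_incomplete("i was go- going to the well-known place"))
```

A rule this simple would still need manual checking, since a trailing hyphen may be used for other purposes in a given transcription scheme.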

Annotation & Post-processing I – Steps
annotation in DART
– fully automated
– less than 80 sec for 24 files
    ~72,100 words
    ~10,300 C-units
post-processing to fix potential errors on the levels of
– syntax: potentially missing syntax rules
– pragmatics: missing inferencing rules or modes ('IFIDs')
– semantics: incorrectly identified topics
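To put the figures on this slide into perspective, a short calculation helps. The input numbers are the approximate values given above, so the results are rough estimates only; the variable names are chosen for this example.

```python
# Approximate figures from the slide.
words = 72_100
c_units = 10_300
files = 24
max_seconds = 80  # "less than 80 sec", so this is an upper bound

words_per_c_unit = words / c_units        # average length of a C-unit
min_words_per_second = words / max_seconds  # lower bound on throughput
words_per_file = words / files            # average file size
max_seconds_per_file = max_seconds / files  # upper bound per file

print(f"{words_per_c_unit:.1f} words per C-unit")
print(f"at least {min_words_per_second:.0f} words per second")
print(f"about {words_per_file:.0f} words per file")
print(f"under {max_seconds_per_file:.2f} seconds per file")
```

On these numbers, a C-unit averages about seven words, and each file of roughly 3,000 words is annotated in a little over three seconds at most.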

Annotation & Post-processing II – Annotation Result
identified syntactic category
automatically split off DM (discourse marker)
annotated identifiable speech acts
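Splitting off an initial discourse marker can be illustrated with a minimal sketch. The marker list and the function below are assumptions for the example and do not reproduce DART's actual rules; a real system also has to decide whether an item such as "so" is functioning as a discourse marker at all.

```python
# Hypothetical, deliberately tiny list of discourse markers.
DISCOURSE_MARKERS = ("you know", "i mean", "well", "so", "okay", "right")


def split_dm(unit):
    """Return (marker, remainder); marker is None if no initial DM is found."""
    lowered = unit.lower()
    # Try longer markers first so multi-word markers are not cut short.
    for dm in sorted(DISCOURSE_MARKERS, key=len, reverse=True):
        if lowered == dm or lowered.startswith(dm + " "):
            return unit[: len(dm)], unit[len(dm):].strip()
    return None, unit


print(split_dm("well i think so"))
print(split_dm("wellington is far"))
```

Requiring a following space (or the end of the unit) keeps words that merely begin with a marker string from being split.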

Searching the Corpus
easily searchable via DART
– speech act stats hyperlinked to concordancer
– formulaic patterns or disfluencies via n-grams
– manual searches in concordancer for
    specific speech acts
    syntactic categories + speech acts
    speech acts + speaker types
    speech acts + gender
    responses to questions
    searches for specific tone features
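The n-gram approach to finding formulaic patterns mentioned above can be sketched as follows. This is a generic frequency count written for illustration, not DART's n-gram module; the function names and the sample units are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(units, n=2, min_count=2):
    """Count n-grams across units and keep those seen at least min_count times."""
    counts = Counter()
    for unit in units:
        counts.update(ngrams(unit.lower().split(), n))
    return [(" ".join(g), c) for g, c in counts.most_common() if c >= min_count]


# Invented sample units, not taken from the HKCSE.
units = ["thank you very much", "thank you very much indeed", "i i think so"]
print(frequent_ngrams(units, n=3))
```

Recurring sequences such as "thank you very" point to formulaic language, while immediate repetitions such as "i i" in a bigram list are one simple cue for disfluencies.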

Conclusion
DART annotation enriches the HKCSE through
– adding syntactic and pragmatic annotation
– ability to analyse features based on (functional) C-units, rather than intonation units
– new search options based on the above features

References
Cheng, W., Greaves, C. and Warren, M. 2008. A Corpus-driven Study of Discourse Intonation: the Hong Kong Corpus of Spoken English (prosodic). Amsterdam/Philadelphia: John Benjamins.
Weisser, M. Annotating Dialogue Corpora Semi-Automatically: a Corpus-Linguistic Approach to Pragmatics. Unpublished Habilitation (professorial) thesis, University of Bayreuth.
Weisser, M. 2012; forthcoming. Pragmatic annotation. In: Aijmer, K. & Rühlemann, C. (Eds.). Corpus Pragmatics: a Handbook. Cambridge: CUP.
Weisser, M. The DART Manual.
Weisser, M. (in progress). DART – the Dialogue Annotation and Research Tool.