Member of the Leibniz Association
Querying Spoken Language Corpora
Thomas Schmidt, IDS Mannheim


Outline
1) Background: EXMARaLDA, FOLKER, AGD, DGD2
2) Transcription: Data models, data formats, TEI
3) Corpora: Recordings, transcripts, metadata
4) Query requirements
5) Query technologies
6) Demo
7) Future directions

Background
EXMARaLDA: System for building and querying spoken language corpora
– Used in many individual projects and at the HZSK CLARIN Centre
– Transcription editor, corpus management tool, query tool EXAKT
FOLKER: Transcription tool
– Same technical basis, optimised for the Research and Teaching Corpus of Spoken German (FOLK)

Background (continued)
Archive for Spoken German (AGD): central archive for oral corpora in Germany, IDS Mannheim
– Dialect corpora, conversation corpora
Database for Spoken German (DGD2): access (browsing and query) for AGD data

Model: Single timeline, multiple tiers
– Annotation tuples: text label + timeline reference
– Timeline: fully ordered, reference to a recording
– Tiers: collections of annotations of a specific category and a specific speaker; annotations in a tier do not overlap
– Annotation Graph Framework (Bird/Liberman 2001)
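The model above can be sketched as a small data structure. This is only an illustration under assumptions: the class names `Annotation` and `Tier`, the use of integer indices into the timeline, and the sample data are hypothetical and not taken from EXMARaLDA itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    start: int   # index of a timeline point
    end: int     # index of a later timeline point
    label: str   # text label


@dataclass
class Tier:
    speaker: str
    category: str
    annotations: list = field(default_factory=list)

    def add(self, ann: Annotation) -> None:
        if ann.start >= ann.end:
            raise ValueError("annotation must span at least one timeline interval")
        for other in self.annotations:
            # Two intervals overlap if each starts before the other ends.
            if ann.start < other.end and other.start < ann.end:
                raise ValueError("annotations in a tier must not overlap")
        self.annotations.append(ann)


# Fully ordered timeline: offsets (in seconds) into one recording.
timeline = [0.0, 0.8, 1.5, 2.1]

tier_a = Tier(speaker="A", category="verbal")
tier_a.add(Annotation(0, 2, "so you go left"))

tier_b = Tier(speaker="B", category="verbal")
# Overlaps speaker A in time, which is fine because it lives in its own tier.
tier_b.add(Annotation(1, 3, "aye"))
```

Overlap between speakers is expressed by two tiers pointing at the same timeline points, while the no-overlap rule is enforced only within a tier.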

EXMARaLDA Basic Transcription
– (Flat) hierarchy of events in tiers
– Use of ID and IDREFS to encode temporal relations
– No additional markup, no deep semantics
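To make the ID/IDREF idea concrete, here is a minimal sketch that resolves event references against a timeline. The XML fragment is a simplified, hand-written illustration; its element and attribute names are assumptions loosely modelled on a basic transcription and should not be read as the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: timeline points carry IDs,
# events refer to them through their start/end attributes.
SAMPLE = """
<basic-transcription>
  <common-timeline>
    <tli id="T0" time="0.0"/>
    <tli id="T1" time="0.8"/>
    <tli id="T2" time="1.5"/>
  </common-timeline>
  <tier id="TIE0" speaker="A" category="v">
    <event start="T0" end="T1">so you go </event>
    <event start="T1" end="T2">left </event>
  </tier>
  <tier id="TIE1" speaker="B" category="v">
    <event start="T1" end="T2">aye </event>
  </tier>
</basic-transcription>
"""

root = ET.fromstring(SAMPLE)

# Map each timeline ID to its offset in the recording.
times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}

# Resolve the references and collect (start, end, speaker, text) tuples.
events = []
for tier in root.iter("tier"):
    for ev in tier.iter("event"):
        events.append(
            (times[ev.get("start")], times[ev.get("end")],
             tier.get("speaker"), ev.text.strip())
        )
events.sort()

for start, end, speaker, text in events:
    print(f"{start:4.1f}-{end:4.1f}  {speaker}: {text}")
```

The temporal relation between A's "left" and B's "aye" is visible only after the references are resolved. Nothing in the document order expresses it.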

[Slide: EXMARaLDA, ELAN, Praat]

Data formats
Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations.
– XML format for data exchange between seven tools with STMT (single timeline, multiple tiers) data models
– improves interoperability for data creation
Drawbacks
– no document order (non-linear, non-hierarchical)
– what is the full text / the primary data / the character data?
– no explicit representation of dependencies
– temporal structure, not linguistic structure
Bad for querying?

STMT to OHCO (ordered hierarchy of content objects) transformation
– Segment chain = any temporally connected chain of annotations within one tier
– Assumption: all other hierarchical structure lies beneath the level of segment chains
– Correspondence: segment chain
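The segment chain definition can be sketched as a short grouping routine. It assumes that "temporally connected" means the end point of one annotation is the start point of the next; the function name `segment_chains` and the sample tier are hypothetical.

```python
def segment_chains(annotations):
    """Group the (start, end, label) tuples of ONE tier into chains.

    Assumption: two annotations are connected when the end point of
    the first equals the start point of the second.
    """
    chains = []
    for ann in sorted(annotations):
        if chains and chains[-1][-1][1] == ann[0]:
            chains[-1].append(ann)   # continues the current chain
        else:
            chains.append([ann])     # gap on the timeline: new chain
    return chains


# One tier, deliberately unordered, with a gap between points 3 and 4.
tier = [(0, 1, "so"), (1, 2, "you go"), (4, 5, "left"), (2, 3, "((laughs))")]
chains = segment_chains(tier)

for chain in chains:
    print(" ".join(label for _, _, label in chain))
```

Each chain then becomes one ordered unit in document order, which gives the time-based data a linear text that hierarchical markup can be applied to.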

Unparsed (EXAKT) vs. parsed (DGD2)
Free annotation (EXAKT) vs. token annotation (DGD2)

Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1)
Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription Of Speech

Transcripts, recordings, metadata
Interaction metadata
– date, genre, place, degree of formality, etc.
– pertains to a (set of) transcription(s)
Speaker metadata
– age, sex, language biography, speech impediments, etc.
– pertains to (a) part(s) of a transcription
Audio and video recordings
– for checking transcription quality
– for obtaining information not encoded in transcripts
Transcripts
– not (the) primary data!
– a convenient index into the recording?
– selective, theory-dependent, …

Corpora
– AGD Corpora: 8 mill. tokens
– CGN Corpus: 9 mill. tokens
– BNC Spoken: 10 mill. tokens
– MICASE: 2 mill. tokens
– Most other corpora: < 1 mill. tokens
(At least) one order of magnitude smaller than written corpora
Query speed is (not that) important

In informal conversation in Northern Scotland, older female speakers tend to use aye as a backchannel signal with a rising intonation
– Situational context: interaction metadata
– Speaker metadata
– Text data / surface form: transcript text
– Interactional context: temporal transcript structure
– Prosodic properties: recording
Requirement #1: Access to all types of context
Requirement #2: (Manual) postprocessing of query results

After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated
– special word tokens (incomplete words, semi-lexical material, …)
– non-word tokens (pauses, non-verbal articulations, …)
– temporal measurements (pause length)
Requirement #3: Queries for special tokens
Requirement #4: Queries with special properties (numerical values, repetition)
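A query of this kind can be sketched over a plain token list. Everything here is an assumption made for illustration: the token dictionaries, the type names `cutoff`, `pause` and `word`, and the reading of "repeated" as "the next word starts with the cut-off form".

```python
def cutoff_repetitions(tokens, min_pause=0.3):
    """Return indices of cut-off words that are followed by a pause longer
    than min_pause seconds and then by a word repeating the cut-off form."""
    hits = []
    for i in range(len(tokens) - 2):
        cut, pause, nxt = tokens[i:i + 3]
        if (cut["type"] == "cutoff"
                and pause["type"] == "pause"
                and pause["duration"] > min_pause       # numerical property
                and nxt["type"] == "word"
                and nxt["form"].lower().startswith(cut["form"].lower())):
            hits.append(i)
    return hits


# Hypothetical token sequence with two cut-offs, only one with a long pause.
tokens = [
    {"type": "word", "form": "the"},
    {"type": "cutoff", "form": "chur"},
    {"type": "pause", "duration": 0.5},
    {"type": "word", "form": "church"},
    {"type": "cutoff", "form": "is"},
    {"type": "pause", "duration": 0.2},
    {"type": "word", "form": "is"},
]

hits = cutoff_repetitions(tokens)
print(hits)
```

The sketch needs three things a plain word search does not offer: token types other than words, a numerical comparison on the pause, and a comparison between two matched tokens.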

Filled pauses are less frequent in overlapping speech than at the beginning of turns
Requirement #5: Queries for position in temporal structure
Modal particles and modal adverbs often occur near one another in an utterance
vs.
Filled pauses occur more frequently near another speaker's backchannel
Requirement #6: Multiple distance measures, query scopes […]
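Position in the temporal structure and distance across speakers can both be computed from time intervals. The following is a sketch under assumptions: tokens are `(start, end, form)` tuples in seconds, the set `FILLED` of filled-pause forms is hypothetical, and the sample data is invented.

```python
FILLED = {"äh", "ähm"}   # assumed inventory of filled-pause forms


def overlaps(a, b):
    """True if two (start, end, ...) intervals share some stretch of time."""
    return a[0] < b[1] and b[0] < a[1]


def gap_seconds(a, b):
    """Temporal distance between two intervals; 0.0 if they overlap or touch."""
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))


# Hypothetical tokens of two speakers on a shared timeline.
speaker_a = [(0.0, 0.4, "äh"), (0.4, 1.6, "das ist"),
             (3.0, 3.3, "äh"), (3.3, 4.0, "links")]
speaker_b = [(3.1, 3.6, "ja")]

filled = [t for t in speaker_a if t[2] in FILLED]

# Requirement #5: filled pauses produced while the other speaker is talking.
in_overlap = [t for t in filled if any(overlaps(t, o) for o in speaker_b)]

# Requirement #6: distance to the nearest token of the other speaker,
# measured in seconds rather than in tokens.
nearest = [min(gap_seconds(t, o) for o in speaker_b) for t in filled]
```

A token-count distance works within one utterance, but across speakers only a time-based measure such as `gap_seconds` is well defined, which is why more than one distance measure is needed.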

Requirements
– Access to all types of context
– Manual post-processing of query results
– Queries for special tokens
– Queries with special properties
– Queries for position in temporal structure
– Multiple distance measures, query scopes
– …

[Diagram: Corpus (recordings, metadata, transcripts), query, query result, context, postprocessing]

EXAKT
– Regular expression on full text of
– (XPath on with markup)
– (XSL on transcripts)
DGD2
– Oracle full text on documents
– SQL on with attributes
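The regular-expression approach can be illustrated with a small keyword-in-context search over the full text of one stretch of transcript. This is a generic sketch, not EXAKT's implementation; the function name `kwic`, the `width` parameter and the sample text are assumptions.

```python
import re


def kwic(text, pattern, width=20):
    """Return (left context, match, right context) for every regex match."""
    rows = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


# Hypothetical full text of one segment chain.
text = "ja also äh ich gehe dann ähm nach links"

# Find the filled pauses "äh" and "ähm" as whole words.
rows = kwic(text, r"\bähm?\b", width=8)

for left, match, right in rows:
    print(f"{left:>8} [{match}] {right}")
```

A concordance like this covers the surface form only. Metadata, temporal structure and the recording itself have to be attached to each hit afterwards to meet the requirements listed earlier.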

Demo 1: EXAKT with HaMaTaC corpus
HaMaTaC: Hamburg Map Task Corpus
– advanced L2 learners of German
– solving a map task
– orthographic transcription with lemma, POS, disfluency annotation

Demo 2: DGD2 with FOLK Corpus
FOLK: Research & Teaching Corpus of Spoken German

Future directions
– Support a real query language: CQL
– CQPWeb as a test case
– User survey DGD2 (approaching 2000 users!)
– …
– TEI as common ground
  for different spoken language corpora query platforms?
  for querying spoken and written data side-by-side?