Querying Spoken Language Corpora. Thomas Schmidt, IDS Mannheim, Mitglied der Leibniz-Gemeinschaft (member of the Leibniz Association).

1 Querying Spoken Language Corpora. Thomas Schmidt, IDS Mannheim

2 Outline
1) Background: EXMARaLDA, FOLKER, AGD, DGD2
2) Transcription: Data models, data formats, TEI
3) Corpora: Recordings, transcripts, metadata
4) Query requirements
5) Query technologies
6) Demo
7) Future directions

3 Background
EXMARaLDA: System for building and querying spoken language corpora
– Used in many individual projects and at the HZSK CLARIN Centre
– Transcription editor, corpus management tool, query tool EXAKT
FOLKER: Transcription tool
– Same technical basis, optimised for the Research and Teaching Corpus of Spoken German (FOLK)

4 Background
Archive for Spoken German (AGD): central archive for oral corpora in Germany, at IDS Mannheim
– Dialect corpora, conversation corpora
Database for Spoken German (DGD2): access (browsing and query) to AGD data

5 Model: Single timeline, multiple tiers
– Annotation tuples: text label + timeline reference
– Timeline: fully ordered, with reference to a recording
– Tiers: collections of annotations of a specific category and a specific speaker; annotations in a tier do not overlap
– Annotation Graph Framework (Bird/Liberman 2001)
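
The single-timeline, multiple-tier model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names (`Annotation`, `Tier`, `Transcription`, `position`, `add`) are hypothetical and are not taken from EXMARaLDA or any other tool mentioned in the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Annotation:
    start: str   # id of the timeline point where the annotation begins
    end: str     # id of the timeline point where the annotation ends
    label: str   # the text label


@dataclass
class Tier:
    speaker: str
    category: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Transcription:
    timeline: List[str]                        # fully ordered timeline point ids
    tiers: List[Tier] = field(default_factory=list)

    def position(self, point_id: str) -> int:
        # The order of the list is the order of the timeline.
        return self.timeline.index(point_id)

    def add(self, tier: Tier, ann: Annotation) -> None:
        start, end = self.position(ann.start), self.position(ann.end)
        if start >= end:
            raise ValueError("annotation must end after it starts")
        for other in tier.annotations:
            o_start, o_end = self.position(other.start), self.position(other.end)
            # Two intervals overlap if each one starts before the other ends.
            if start < o_end and o_start < end:
                raise ValueError("annotations in a tier must not overlap")
        tier.annotations.append(ann)


# Hypothetical example: one speaker, one verbal tier, four timeline points.
transcription = Transcription(timeline=["T0", "T1", "T2", "T3"])
verbal = Tier(speaker="A", category="v")
transcription.tiers.append(verbal)
transcription.add(verbal, Annotation("T0", "T1", "aye"))
transcription.add(verbal, Annotation("T1", "T3", "that is right"))
```

Annotations refer to the timeline only by point id, so several tiers (and several speakers) can share the same points; that sharing is what encodes simultaneity in this kind of model.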

6 EXMARaLDA Basic Transcription
– (Flat) hierarchy of events in tiers
– Use of ID and IDREFS to encode temporal relations
– No additional markup, no deep semantics

7 EXMARaLDA ELAN

8 EXMARaLDA ELAN Praat

9 Data formats
Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations.
– XML format for data exchange between seven tools with STMT (single timeline, multiple tiers) data models
– improves interoperability for data creation
Drawbacks
– no document order (non-linear, non-hierarchical)
– what is the full text / the primary data / the character data?
– no explicit representation of dependencies
– temporal structure, not linguistic structure
Bad for querying?

10 STMT to OHCO transformation

11 STMT to OHCO transformation
– Segment chain = any temporally connected chain of annotations within one tier
– Assumption: all other hierarchical structure lies beneath the level of segment chains
– Correspondence: segment chain
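
The segment chain definition above can be sketched as a small grouping function. It is a minimal illustration, not the transformation used in EXMARaLDA: the tuple layout `(start, end, label)` and the function name `segment_chains` are assumptions, and "temporally connected" is read here as "one annotation ends exactly where the next one starts".

```python
def segment_chains(annotations):
    """Group the annotations of ONE tier into segment chains.

    annotations: iterable of (start, end, label) tuples, where start and end
    are positions on the shared timeline (hypothetical layout).
    Returns a list of chains, each chain a list of annotations.
    """
    chains = []
    for ann in sorted(annotations, key=lambda a: a[0]):
        # Extend the current chain if this annotation starts where the
        # previous one ended; otherwise open a new chain.
        if chains and chains[-1][-1][1] == ann[0]:
            chains[-1].append(ann)
        else:
            chains.append([ann])
    return chains


tier = [(0, 1, "so"), (1, 2, "you go"), (3, 4, "left")]
chains = segment_chains(tier)
```

In this example the first two annotations share timeline point 1 and form one chain, while the third starts after a gap and forms its own chain. Each chain could then serve as the root of a hierarchical unit in an ordered, tree-shaped document.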

13 Unparsed (EXAKT) Parsed (DGD2)

14 Free annotation (EXAKT) Token annotation (DGD2)

15 Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1)
Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription Of Speech

16 Transcripts, recordings, metadata
Interaction metadata
– date, genre, place, degree of formality, etc.
– pertains to a (set of) transcription(s)
Speaker metadata
– age, sex, language biography, speech impediments, etc.
– pertains to (a) part(s) of a transcription
Audio and video recordings
– for checking transcription quality
– for obtaining information not encoded in transcripts
Transcripts
– not (the) primary data!
– a convenient index into the recording?
– selective, theory-dependent, …

17 Corpora

18 Corpora
– AGD corpora: 8 million tokens
– CGN corpus: 9 million tokens
– BNC Spoken: 10 million tokens
– MICASE: 2 million tokens
– Most other corpora: < 1 million tokens
(At least) one order of magnitude smaller than written corpora
Query speed is (not that) important

19 In informal conversation in Northern Scotland, older female speakers tend to use aye as a backchannel signal with a rising intonation
– Situational context: interaction metadata
– Speaker metadata
– Text data / surface form: transcript text
– Interactional context: temporal transcript structure
– Prosodic properties: recording
Requirement #1: Access to all types of context
Requirement #2: (Manual) postprocessing of query results

20 After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated
– special word tokens (incomplete words, semi-lexical material, …)
– non-word tokens (pauses, non-verbal articulations, …)
– temporal measurements (pause length)
Requirement #3: Queries for special tokens
Requirement #4: Queries with special properties (numerical values, repetition)
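
To make requirements #3 and #4 concrete, the hypothesis on this slide can be written as a query over a token sequence. The sketch below is only illustrative: the token dictionaries (`type`, `text`, `cutoff`, `duration`) are a hypothetical representation, and "repeated" is interpreted, as an assumption, to mean that the word after the pause begins with the cut-off fragment.

```python
def cutoff_repetitions(tokens, min_pause=0.3):
    """Return the indices of cut-off words that are followed by a pause
    longer than min_pause seconds and then by a word that starts with
    the cut-off fragment (hypothetical token layout)."""
    hits = []
    for i in range(len(tokens) - 2):
        first, pause, following = tokens[i], tokens[i + 1], tokens[i + 2]
        if (
            first["type"] == "word"
            and first.get("cutoff", False)                 # special word token
            and pause["type"] == "pause"                   # non-word token
            and pause["duration"] > min_pause              # numerical condition
            and following["type"] == "word"
            and following["text"].lower().startswith(first["text"].lower())
        ):
            hits.append(i)
    return hits


tokens = [
    {"type": "word", "text": "then"},
    {"type": "word", "text": "tur", "cutoff": True},
    {"type": "pause", "duration": 0.5},
    {"type": "word", "text": "turn"},
    {"type": "word", "text": "lef", "cutoff": True},
    {"type": "pause", "duration": 0.2},
    {"type": "word", "text": "left"},
]
hits = cutoff_repetitions(tokens)
```

Only the first cut-off word qualifies here, because the second pause is shorter than 0.3 seconds. The query needs a token type, a numerical comparison, and a comparison between two matched tokens, which is more than a plain string search over the transcript text offers.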

21 Filled pauses are less frequent in overlapping speech than at the beginning of turns
Modal particles and modal adverbs often occur near one another in an utterance
vs.
Filled pauses occur more frequently near another speaker's backchannel
Requirement #5: Queries for position in temporal structure
Requirement #6: Multiple distance measures, query scopes
[…]
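
A query for position in the temporal structure (requirement #5) reduces to interval comparisons on the timeline. The following is a minimal sketch, assuming that tokens and the other speaker's segments are available as `(start, end)` pairs in seconds; the function name `in_overlap` and the sample values are hypothetical.

```python
def in_overlap(token, other_speaker_segments):
    """True if the token's time interval overlaps any segment
    produced by another speaker."""
    start, end = token
    return any(
        start < seg_end and seg_start < end
        for seg_start, seg_end in other_speaker_segments
    )


# Hypothetical timings: speaker B talks from 1.0 to 2.0 and from 5.0 to 5.4.
speaker_b = [(1.0, 2.0), (5.0, 5.4)]
filled_pause_in_overlap = in_overlap((1.8, 2.1), speaker_b)
filled_pause_alone = in_overlap((3.0, 3.3), speaker_b)
```

The same comparison with a tolerance added to each side would give a temporal distance measure ("within n seconds of another speaker's backchannel"), as opposed to a distance counted in tokens within one utterance.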

22 Requirements
– Access to all types of context
– Manual post-processing of query results
– Queries for special tokens
– Queries with special properties
– Queries for position in temporal structure
– Multiple distance measures, query scopes
– …

23 [Diagram labels: Recordings, Metadata, Transcripts, Corpus, Query, Query result, Context, Postprocessing]

24 EXAKT
– Regular expression on full text of
– (XPath on with markup)
– (XSL on transcripts)
DGD2
– Oracle full text on documents
– SQL on with attributes
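
The difference between a regular expression over the full text and a path query over markup can be illustrated with the Python standard library. This is a sketch under assumptions, not EXAKT's or DGD2's implementation: the sample document and its element and attribute names (`u`, `w`, `pause`, `who`, `pos`, `dur`) are hypothetical, and `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, hand-made sample; not an actual EXMARaLDA or DGD2 file.
SAMPLE = """<transcript>
  <u who="A"><w pos="ITJ">aye</w><pause dur="0.5"/><w pos="NN">map</w></u>
  <u who="B"><w pos="ITJ">aye</w><w pos="ADV">there</w></u>
</transcript>"""

root = ET.fromstring(SAMPLE)


def full_text(utterance):
    # Concatenate the word tokens of one utterance into a plain string.
    return " ".join(w.text for w in utterance.iter("w"))


def regex_hits(pattern):
    """Search the plain text of each utterance with a regular expression."""
    hits = []
    for u in root.iter("u"):
        for match in re.finditer(pattern, full_text(u)):
            hits.append((u.get("who"), match.group(0)))
    return hits


def xpath_hits(path):
    """Select word elements through the markup with a path expression."""
    return [w.text for w in root.findall(path)]


surface_hits = regex_hits(r"\baye\b")
markup_hits = xpath_hits(".//u[@who='A']/w[@pos='NN']")
```

The regular expression sees only surface forms, whereas the path query can restrict results by speaker and by token annotation, which is the kind of distinction the slide draws between full-text search and queries over markup or attributes.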

25 Demo 1: EXAKT with HaMaTaC corpus
HaMaTaC: Hamburg Map Task Corpus
– advanced L2 learners of German
– solving a map task
– orthographic transcription with lemma, POS, disfluency annotation

26 Demo 2: DGD2 with FOLK Corpus
FOLK: Research & Teaching Corpus of Spoken German

27 Future directions
– Support a real query language: CQL
– CQPWeb as a test case
– User survey DGD2 (approaching 2000 users!)
– …
– TEI as common ground: for different spoken language corpora query platforms? for querying spoken and written data side-by-side?
