Published by Abraham Shepherd. Modified over 9 years ago.
A quick walk through phonetic databases
Read English
  – TIMIT
  – Boston University Radio News
Spontaneous English
  – Switchboard ICSI transcriptions
  – Buckeye Corpus (VIC)
TIMIT
Read phonetically balanced sentences
  – Good coverage of different phonetic environments
  – Does not exhibit the more radical reductions and dysfluencies seen in spontaneous speech
Transcribers started from forced alignments, then realigned
Roughly 5 hours of speech
  – 630 speakers, 8 dialects, 10 sentences apiece
Uses ARPAbet symbols
  – Separate stop/closure symbols
  – Symbol for epenthetic stop
Cost: $100 for non-1993 LDC members
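To make the time-aligned labels concrete, here is a minimal sketch of reading a TIMIT-style phone label file, assuming the usual layout of one segment per line as "start_sample end_sample label" with audio sampled at 16 kHz. The names `Segment` and `parse_phn` are hypothetical helpers, and the sample values in `EXAMPLE` are invented for illustration rather than taken from the corpus.

```python
from typing import List, NamedTuple

SAMPLE_RATE_HZ = 16000  # TIMIT audio is sampled at 16 kHz


class Segment(NamedTuple):
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    label: str      # ARPAbet-style phone label


def parse_phn(text: str) -> List[Segment]:
    """Parse lines of 'start_sample end_sample label' into Segments."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        segments.append(
            Segment(start / SAMPLE_RATE_HZ, end / SAMPLE_RATE_HZ, label)
        )
    return segments


# Invented example: silence, then a /t/ closure and its release, then a vowel.
EXAMPLE = """\
0 1600 h#
1600 2400 tcl
2400 3200 t
3200 4800 iy
"""

segments = parse_phn(EXAMPLE)
for seg in segments:
    print(f"{seg.label:>4} {seg.start_s:.3f}-{seg.end_s:.3f} s")
```

Note how the stop appears as two segments, a closure (`tcl`) and a release (`t`), which is the "separate stop/closure symbols" convention listed above.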
BU Radio Corpus
Radio announcers reading news
  – 4 male, 3 female; reading in both “non-studio” and “studio” voices
Originally intended for speech synthesis work
  – Marked with prosody in addition to phonetics
  – Marked with ARPAbet (similar to TIMIT)
> 7 hours of speech
Cost: $400 for non-1996/1997 LDC members
Switchboard ICSI Transcriptions
Spontaneous speech, many dialect regions
  – Transcribed “segmented turns,” some of which may be cutoffs, from 2-party conversations
  – 4 hours of speech transcribed
2 stages:
  – Initial 1 hour phonetically transcribed
  – Hours 2-4 phonetic markers, syllable boundaries -- back aligned with phonetic markers
Similar phoneset to TIMIT
  – No separate closure/release
  – Voiced hesitations (pn/pv)
Cost: possibly free, possibly $2k for non-1993/7
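Because this phoneset has no separate closure/release symbols, comparing against TIMIT-style labels needs some mapping between the two. The sketch below shows one hypothetical way to do that by collapsing TIMIT closure symbols into plain stops. The function name `merge_closures` and the choice to map an unreleased closure to its stop are assumptions made for illustration, not a documented conversion used by either corpus.

```python
from typing import List

# TIMIT closure symbols and the stop each one belongs to.
CLOSURE_TO_STOP = {
    "bcl": "b", "dcl": "d", "gcl": "g",
    "pcl": "p", "tcl": "t", "kcl": "k",
}


def merge_closures(labels: List[str]) -> List[str]:
    """Collapse closure (+ matching release) pairs into a single stop label.

    Assumption: a closure with no matching release still counts as the stop.
    Affricates are not handled specially, so 'tcl ch' becomes 't ch'.
    """
    merged = []
    i = 0
    while i < len(labels):
        stop = CLOSURE_TO_STOP.get(labels[i])
        if stop is None:
            # Not a closure symbol: keep the label unchanged.
            merged.append(labels[i])
            i += 1
            continue
        merged.append(stop)
        # Skip the release if it immediately follows its own closure.
        if i + 1 < len(labels) and labels[i + 1] == stop:
            i += 2
        else:
            i += 1
    return merged


print(merge_closures(["s", "tcl", "t", "aa", "pcl"]))
```

Segment boundaries are ignored here; a version that also merges the time spans would take the closure's start and the release's end.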
VIC (Buckeye) Corpus
Spontaneous interview speech
  – Age, gender balanced
  – All speakers from Ohio
Currently in transcription
  – NIH grant involving Keith, me, and Mark Pitt
  – 10 hours completed, 30 hours total
Based on ARPAbet with a few additions
  – Nasalized vowels, glottal stop replacing /t/,…
Cost: free (to us) -- might need to work out licensing but shouldn’t be an issue.
Evaluating with Corpora
Clear thing to do is to start with TIMIT
  – Facilitates comparison with other things
However, we should really try to insert spontaneous data into research ASAP
  – Maybe move to some combination of TIMIT/SWB/VIC?
Only talked about (American) English
  – Other languages in year 4?
  – Chin has done some work in Mandarin?
CASS corpus: phonetically transcribed, but available?