
1 Understanding Spoken Language using Statistical and Computational Methods Steven Greenberg International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 http://www.icsi.berkeley.edu/~steveng (contains electronic versions of papers and links to data) Patterns of Speech Sounds in Unscripted Communication - Production, Perception, Phonology. Akademie Sankelmark, October 8-11, 2000

2 OR ….

3 How I Learned to Stop Worrying and Use The Canonical Form

4 Disclaimer I am a Phonetician - NOT! (many thanks for the invite)

5 No Scientist is an Island … IMPORTANT COLLEAGUES PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH (SWITCHBOARD) Candace Cardinal, Rachel Coulston, Dan Ellis, Eric Fosler, Joy Hollenback, John Ohala, Colleen Richey STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION Eric Fosler, Leah Hitchcock, Joy Hollenback ARTICULATORY-ACOUSTIC BASIS OF CONSONANT RECOGNITION Leah Hitchcock, Rosaria Silipo AUTOMATIC PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH Shawn Chang, Lokendra Shastri

6 Germane Publications http://www.icsi.berkeley.edu/~steveng STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco. Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany. Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation, Speech Communication, 29, 159-176. Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32. Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus, in Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27. PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77. Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8. Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations, Proceedings of Eurospeech, Budapest. AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing. Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.

7 Prologue

8 Language - The Traditional Perspective The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization. (Diagram label: phonetic orthography)

9 Language - A Syllable-Centric Perspective A more empirical perspective of spoken language focuses on the syllable as the interface between “sound” and “meaning.” Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic

10 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES

11 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)

12 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL

13 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i. e., dictionary) fashion 80-90% of the time

14 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i. e., dictionary) fashion 80-90% of the time –Nuclei and codas are expressed canonically only 60% of the time

15 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i. e., dictionary) fashion 80-90% of the time –Nuclei and codas are expressed canonically only 60% of the time –Nuclei tend to be realized as vowels different from the canonical form

16 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i. e., dictionary) fashion 80-90% of the time –Nuclei and codas are expressed canonically only 60% of the time –Nuclei tend to be realized as vowels different from the canonical form –Codas are often deleted entirely

17 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i. e., dictionary) fashion 80-90% of the time –Nuclei and codas are expressed canonically only 60% of the time –Nuclei tend to be realized as vowels different from the canonical form –Codas are often deleted entirely –Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position

18 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i. e., dictionary) fashion 80-90% of the time –Nuclei and codas are expressed canonically only 60% of the time –Nuclei tend to be realized as vowels different from the canonical form –Codas are often deleted entirely –Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position –Therefore, it is important to model spoken language at the syllabic level

19 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i.e., dictionary) fashion 80-90% of the time –Nuclei and codas are expressed canonically only 60% of the time –Nuclei tend to be realized as vowels different from the canonical form –Codas are often deleted entirely –Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position –Therefore, it is important to model spoken language at the syllabic level THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY

20 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i.e., dictionary) fashion 80-90% of the time –Nuclei and codas are expressed canonically only 60% of the time –Nuclei tend to be realized as vowels different from the canonical form –Codas are often deleted entirely –Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position –Therefore, it is important to model spoken language at the syllabic level THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY –It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material

21 Take Home Messages SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i.e., dictionary) fashion 80-90% of the time –Nuclei and codas are expressed canonically only 60% of the time –Nuclei tend to be realized as vowels different from the canonical form –Codas are often deleted entirely –Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position –Therefore, it is important to model spoken language at the syllabic level THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY –It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material

22 Take Home Messages PHONETIC PROPERTIES OF SPONTANEOUS SPEECH REFLECT INFORMATION CONTENT Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.

23 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH

24 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material

25 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF: –Phonetic segments

26 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF: –Phonetic segments –Words

27 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF: –Phonetic segments –Words –Syllables

28 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF: –Phonetic segments –Words –Syllables –Articulatory-acoustic features

29 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF: –Phonetic segments –Words –Syllables –Articulatory-acoustic features PERCEPTUAL EVIDENCE –The articulatory-acoustic basis of consonant recognition

30 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF: –Phonetic segments –Words –Syllables –Articulatory-acoustic features PERCEPTUAL EVIDENCE –The articulatory-acoustic basis of consonant recognition –Not all articulatory-acoustic features are created equal - place-of-articulation cues appear to be most important for consonant recognition

31 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF: –Phonetic segments –Words –Syllables –Articulatory-acoustic features PERCEPTUAL EVIDENCE –The articulatory-acoustic basis of consonant recognition –Not all articulatory-acoustic features are created equal - place-of-articulation cues appear to be most important for consonant recognition COMPUTATIONAL METHODS –Automatic methods for phonetic transcription based on articulatory-acoustic features

32 Road Map PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH –Provides the basis for the statistical analyses of spontaneous material A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF: –Phonetic segments –Words –Syllables –Articulatory-acoustic features PERCEPTUAL EVIDENCE –The articulatory-acoustic basis of consonant recognition –Not all articulatory-acoustic features are created equal - place-of-articulation cues appear to be most important for consonant recognition COMPUTATIONAL METHODS –Automatic methods for phonetic transcription based on articulatory-acoustic features –Is the most likely means through which it will be possible to generate sufficient empirical data with which to rigorously test hypotheses germane to spoken language

33 Phonetic Transcription of Spontaneous (American) English

34 Phonetic Transcription of Spontaneous English TELEPHONE DIALOGUES OF 5-10 MINUTES DURATION - SWITCHBOARD AMOUNT OF MATERIAL MANUALLY TRANSCRIBED –3 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods) –1 hour labeled and segmented at the phonetic-segment level DIVERSITY OF MATERIAL TRANSCRIBED –Spans speech of both genders (ca. 50/50%) reflecting a wide range of American dialectal variation (6 regions + “army brat”), speaking rate and voice quality TRANSCRIBED BY WHOM? –7 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by three individuals out of the original eight –Supervised by Steven Greenberg and John Ohala TRANSCRIPTION SYSTEM –A variant of Arpabet, with phonetic diacritics such as: _gl, _cr, _fr, _n, _vl, _vd HOW LONG DOES TRANSCRIPTION TAKE? (Don’t Ask!) –388 times real time for labeling and segmentation at the phonetic-segment level –150 times real time for labeling phonetic segments and segmenting syllables HOW WAS LABELING AND SEGMENTATION PERFORMED? –Using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations DATA AVAILABLE AT - http://www.icsi.berkeley.edu/real/stp
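
The real-time factors quoted above imply a substantial labor cost. A back-of-the-envelope check (illustrative only, using the hours and factors stated on this slide):

    # Manual transcription effort implied by the figures above.
    hours_phone_segmented = 1.0     # labeled and segmented at the phonetic-segment level
    hours_syllable_segmented = 3.0  # phone labels, syllable-level segmentation only
    rt_full = 388                   # times real time, full phonetic segmentation
    rt_syllable = 150               # times real time, syllable segmentation

    effort = hours_phone_segmented * rt_full + hours_syllable_segmented * rt_syllable
    print(f"about {effort:.0f} person-hours for 4 hours of speech")  # about 838 person-hours

This is the scale of effort that motivates the automatic transcription work described later in the talk.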

35 A Brief Tour of Pronunciation Variation in Spontaneous American English

36 Cumulative Word Frequency in English The 10 most common words account for 27% of the corpus; the 100 most common words account for 67% of the corpus; the 1000 most common words account for 92% of the corpus. Thus, most informal dialogues are composed of a relatively small number of common words. However, it is the infrequent words that typically provide the precision and detail required for complex information transfer. Computed from the Switchboard corpus (American English telephone dialogues). (Chart annotation: focus on the 100 most common words)
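
Coverage figures of this kind are straightforward to compute from a word-level transcript. A minimal sketch (the file name and whitespace tokenization are assumptions, not part of the Switchboard tooling):

    from collections import Counter

    # Count word tokens in a whitespace-tokenized transcript (hypothetical file).
    with open("switchboard_words.txt") as f:
        tokens = f.read().lower().split()

    counts = Counter(tokens)
    total = sum(counts.values())

    # Cumulative share of the corpus covered by the N most frequent word types.
    ranked = [n for _, n in counts.most_common()]
    for top_n in (10, 100, 1000):
        coverage = sum(ranked[:top_n]) / total
        print(f"top {top_n:>4} words cover {coverage:.0%} of the tokens")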

37 How Many Pronunciations of “And”? (Table columns: N, Pronunciation; table data not transcribed)

38 How Many Pronunciations of “And”? (Table columns: N, Pronunciation; table data not transcribed)

39 How Many Different Pronunciations? (Table columns: Rank, Word, N, # Pron, Most Common Pronunciation, MCP % Total; table data not transcribed)

40 How Many Different Pronunciations? (Table columns: Rank, Word, N, # Pron, Most Common Pronunciation, MCP % Total; table data not transcribed)

41 How Many Different Pronunciations? (Table columns: Rank, Word, N, # Pron, Most Common Pronunciation, MCP % Total; table data not transcribed)

42 How Many Different Pronunciations? (Table columns: Rank, Word, N, # Pron, Most Common Pronunciation, MCP % Total; table data not transcribed)

43 How Many Different Pronunciations? (Table columns: Rank, Word, N, # Pron, Most Common Pronunciation, MCP % Total; table data not transcribed)

44 English is (sort of) like Chinese …. 81% of the word tokens are monosyllabic. Of the 100 most common words, 90 are one syllable in length. Only 22% of the words in the lexicon are one syllable long. Hence, there is a decided preference for monosyllabic words in informal discourse. 95% of the words contain just ONE or TWO syllables ….
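
The contrast between token counts (81% monosyllabic) and lexicon counts (22% monosyllabic) is easy to reproduce once each word has a syllable count. A minimal sketch, assuming a hypothetical dictionary syllable_count mapping each word to its number of syllables and a token list tokens:

    # tokens: list of word tokens from the corpus
    # syllable_count: dict mapping each word to its number of syllables
    def monosyllabic_rates(tokens, syllable_count):
        token_rate = sum(syllable_count[w] == 1 for w in tokens) / len(tokens)
        lexicon = set(tokens)
        type_rate = sum(syllable_count[w] == 1 for w in lexicon) / len(lexicon)
        return token_rate, type_rate  # e.g., roughly 0.81 vs. 0.22 for the figures above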

45 Syllable and Word Frequencies are Similar Words and syllables exhibit similar distributions over the 300 most common elements, accounting for 80% of the corpus. The similarity of their distributions is a consequence of most words consisting of just a single syllable

46 Word Frequency in Spontaneous English Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10. Word frequency is logarithmically related to rank order in the corpus (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.). Computed from the Switchboard corpus (American English telephone dialogues)
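
Stated as a rough formula (an illustrative approximation, not a claim taken from the slide): if f(r) denotes the frequency of the word of rank r, a 1/f (Zipf-like) distribution means f(r) is approximately C / r for some constant C, so f(10) / f(100) is approximately 100 / 10 = 10, i.e., the 10th most common word occurs about ten times as often as the 100th.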

47 Information Affects Pronunciation The faster the speaking rate, the more likely that the pronunciation deviates from the canonical form. However, the effect is much more pronounced for the 100 most common words than for more infrequent words. From Fosler, Greenberg and Morgan (1999); Greenberg and Fosler (2000)

48 English Syllable Structure is (sort of) Like Japanese Most syllables are simple in form (no consonant clusters): 87% of the pronunciations are simple syllabic forms, and 84% of the canonical corpus is composed of simple syllabic forms (n = 103,054)

49 Complex Syllables are Important, Though There are many “complex” syllable forms (consonant clusters), but all occur relatively infrequently. Thus, despite English’s reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex (n = 17,760). Complex codas are not as frequently realized in actual pronunciation as their canonical representation; complex onsets tend to preserve their canonical representation in actual pronunciation

50 Syllable-Centric Pronunciation Example: “cat” [k ae t], where [k] = onset, [ae] = nucleus, [t] = coda. Onsets are pronounced canonically far more often than nuclei or codas. Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues. (Chart: percent canonically pronounced by syllable position, spontaneous speech vs. read sentences; n = 120,814)
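
A minimal sketch (not the analysis pipeline used in the study) of how per-position canonicity can be scored once realized syllables have been paired with their canonical, dictionary forms; the (onset, nucleus, coda) tuple representation is a simplifying assumption and the alignment step itself is omitted:

    from collections import defaultdict

    # Each item pairs a canonical syllable with its realized transcription,
    # both given as (onset, nucleus, coda) tuples of phone strings.
    def canonicity_by_position(pairs):
        hits = defaultdict(int)
        totals = defaultdict(int)
        for canonical, realized in pairs:
            for position, c, r in zip(("onset", "nucleus", "coda"), canonical, realized):
                totals[position] += 1
                hits[position] += (c == r)
        return {p: hits[p] / totals[p] for p in totals}

    # Toy example: "cat" realized canonically except for a deleted coda.
    pairs = [((("k",), ("ae",), ("t",)), (("k",), ("ae",), ()))]
    print(canonicity_by_position(pairs))  # {'onset': 1.0, 'nucleus': 1.0, 'coda': 0.0}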

51 Complex Onsets are Highly Canonical Complex onsets are pronounced more canonically than simple onsets, despite the greater potential for deviation from the standard pronunciation. (Chart: percent canonically pronounced by syllable onset type, spontaneous speech vs. read sentences)

52 Speaking Style Affects Codas Codas are much more likely to be realized canonically in formal than in spontaneous speech. (Chart: percent canonically pronounced by syllable coda type)

53 Onsets (but not Codas) Affect Nuclei The presence of a syllable onset has a substantial impact on the realization of the nucleus. (Chart: percent canonically pronounced)

54 Syllable-Centric Feature Analysis Phonetic deviation along a SINGLE feature: place of articulation deviates most in nucleus position; manner of articulation deviates most in onset and coda position; voicing deviates most in coda position. Place deviates very little from the canonical form in the onset and coda (it is a STABLE AF in these positions) but is VERY unstable in nucleus position
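
One way to quantify this kind of per-feature deviation is to map each phone onto its articulatory features and compare canonical and realized values. A minimal sketch with a tiny, hypothetical feature table (the study's full Arpabet-based feature inventory is not reproduced here):

    # Hypothetical articulatory-feature table: phone -> {feature: value}
    FEATURES = {
        "t": {"place": "alveolar", "manner": "stop",      "voicing": "voiceless"},
        "d": {"place": "alveolar", "manner": "stop",      "voicing": "voiced"},
        "s": {"place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
        "n": {"place": "alveolar", "manner": "nasal",     "voicing": "voiced"},
    }

    def feature_deviation(canonical_phone, realized_phone, feature):
        """Return 1 if the feature value changed between canonical and realized phone."""
        return int(FEATURES[canonical_phone][feature] != FEATURES[realized_phone][feature])

    # e.g., canonical /t/ realized as [d]: place and manner are stable, voicing deviates.
    print([feature_deviation("t", "d", f) for f in ("place", "manner", "voicing")])  # [0, 0, 1]

Aggregating such per-feature comparisons separately for onsets, nuclei and codas yields deviation profiles like those summarized on the surrounding slides.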

55 Articulatory PLACE Feature Analysis Phonetic deviation across SEVERAL features: place of articulation is a “dominant” feature in nucleus position only, driving the feature deviation in the nucleus for manner and rounding. Place “carries” manner and rounding in the nucleus

56 Articulatory MANNER Feature Analysis Phonetic deviation across SEVERAL features: manner of articulation is a “dominant” feature in onset and coda position, driving the feature deviation in onsets and codas for place and voicing. Manner is less stable in the coda than in the onset

57 Articulatory VOICING Feature Analysis Phonetic deviation across SEVERAL features: voicing is a subordinate feature in all syllable positions; its deviation pattern is controlled by manner in onset and coda positions. Voicing is unstable in coda position and is dominated by manner

58 LIP-ROUNDING Feature Analysis Phonetic deviation across SEVERAL features: lip-rounding is a subordinate feature; it is stable everywhere except in the nucleus, where its deviation pattern is driven by the place feature

59 Perceptual Evidence for the Importance of Place (and Manner) of Articulation Features

60 Spectral Slit Paradigm

61 Consonant Recognition - Single Slits

62 Consonant Recognition - 1 Slit

63 Consonant Recognition - 2 Slits

64 Consonant Recognition - 3 Slits

65 Consonant Recognition - 4 Slits

66 Consonant Recognition - 5 Slits

67 Consonant Recognition - 2 Slits

68

69

70

71

72

73 Consonant Recognition - 3 Slits

74

75

76 Consonant Recognition - 4 Slits

77 Consonant Recognition - 5 Slits

78 Correlation - AFs/Consonant Recognition Consonant recognition is almost perfectly correlated with place-of-articulation performance. This correlation suggests that the place feature is based on cues distributed across the entire speech bandwidth, in contrast to other features. Manner is also highly correlated with consonant recognition; voicing and rounding less so
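
Such a correlation can be checked with a plain Pearson coefficient between per-condition feature-classification accuracy and consonant-recognition accuracy. A minimal sketch; the accuracy values below are made-up stand-ins for the slit-condition data, not results from the experiment:

    import statistics

    def pearson(x, y):
        mx, my = statistics.mean(x), statistics.mean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
        return num / den

    # Hypothetical per-condition accuracies (one value per slit condition).
    consonant_recognition = [0.35, 0.55, 0.70, 0.82, 0.90]
    place_accuracy        = [0.38, 0.57, 0.71, 0.83, 0.91]
    print(pearson(place_accuracy, consonant_recognition))  # close to 1.0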

79 Automatic Phonetic Transcription of Spontaneous Speech

80 Automatic Phonetic Transcription MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS

81 Automatic Phonetic Transcription MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS –These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries)

82 Automatic Phonetic Transcription MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS –These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries) IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA

83 Automatic Phonetic Transcription MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS –These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries) IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA –Manual labeling and segmentation typically requires 150-400 times real time to perform

84 Automatic Phonetic Transcription MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS –These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries) IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA –Manual labeling and segmentation typically requires 150-400 times real time to perform WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL

85 Automatic Phonetic Transcription MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS –These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries) IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA –Manual labeling and segmentation typically requires 150-400 times real time to perform WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL –Such material will be extremely useful for developing pronunciation models and new algorithms for ASR

86 Automatic Phonetic Transcription MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS –These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries) IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA –Manual labeling and segmentation typically requires 150-400 times real time to perform WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL –Such material will be extremely useful for developing pronunciation models and new algorithms for ASR THE ALPS SYSTEM CURRENTLY LABELS SPONTANEOUS MATERIALS (OGI Numbers Corpus) WITH ca. 83% ACCURACY

87 Automatic Phonetic Transcription MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS –These materials are highly inaccurate (35-50% incorrect labeling of phonetic segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries) IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA –Manual labeling and segmentation typically requires 150-400 times real time to perform WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL –Such material will be extremely useful for developing pronunciation models and new algorithms for ASR THE ALPS SYSTEM CURRENTLY LABELS SPONTANEOUS MATERIALS (OGI Numbers Corpus) WITH ca. 83% ACCURACY –The algorithms used are capable of achieving ca. 93% accuracy with only minor changes to the models

88 Phonetic Feature Classification System

89 Spectro-Temporal Profile (STeP) STePs provide a simple, accurate means of delineating the acoustic properties associated with phonetic features and segments. (Example shown: a vocalic segment)

90 Spectro-Temporal Profile (STeP) STePs incorporate information about the instantaneous modulation spectrum distributed across the (tonotopic) frequency axis and can be used for training neural networks. (Example shown: a fricative segment)
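
A rough sketch of the kind of per-channel modulation analysis a STeP builds on: take the temporal energy envelope of each (tonotopic) frequency channel and compute its low-frequency spectrum. This is illustrative only; the actual STeP computation, filterbank and parameters are not specified on this slide, so the function below is an assumption about the general approach:

    import numpy as np

    def modulation_spectrum(channel_envelopes, frame_rate, max_mod_hz=32):
        """channel_envelopes: array (n_channels, n_frames) of band-energy envelopes,
        sampled at frame_rate Hz. Returns modulation frequencies and magnitudes."""
        env = channel_envelopes - channel_envelopes.mean(axis=1, keepdims=True)
        spectrum = np.abs(np.fft.rfft(env, axis=1))
        freqs = np.fft.rfftfreq(env.shape[1], d=1.0 / frame_rate)
        keep = freqs <= max_mod_hz  # modulation rates in the syllable-relevant range
        return freqs[keep], spectrum[:, keep]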

91 Label Accuracy per Frame Frames away from the boundary are labeled very accurately

92 Sample Transcription Output The automatic system performs very similarly to manual transcription in terms of both labels and segmentation –11 ms average concordance in segmentation –83% concordance with respect to phonetic labels
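
Concordance figures of this kind can be computed by comparing the automatic and manual transcripts segment by segment. A minimal sketch, assuming both transcripts are lists of (label, start_sec, end_sec) triples that have already been paired one-to-one (the pairing/alignment step itself is omitted, and the triple format is a simplifying assumption):

    def concordance(auto_segments, manual_segments):
        """auto_segments, manual_segments: paired lists of (label, start, end) in seconds."""
        n = len(auto_segments)
        label_agreement = sum(a[0] == m[0] for a, m in zip(auto_segments, manual_segments)) / n
        boundary_error = sum(abs(a[1] - m[1]) for a, m in zip(auto_segments, manual_segments)) / n
        return label_agreement, boundary_error * 1000.0  # fraction agreed, mean offset in ms

    # For the system described above this would return roughly (0.83, 11.0).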

93 In Conclusion ….

94 Grand Summary SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES –Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus) –Automatic methods will eventually supply badly needed data for more complete analyses and evaluation THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL –Onsets are pronounced in canonical (i.e., dictionary) fashion 85-90% of the time –Nuclei and codas are expressed canonically only 60% of the time –Nuclei tend to be realized as vowels different from the canonical form –Codas are often deleted entirely –Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position –Therefore, it is important to model spoken language at the syllabic level THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY –It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material

95 That’s All, Folks Many Thanks for Your Time and Attention

96 Temporal View of Language

97 Linguistic Automatic Speech Recognition CHARACTERIZE SPOKEN LANGUAGE WITH GREAT PRECISION –Currently, manual transcription is the only means by which to collect detailed data pertaining to spoken language. Computational methods are currently being developed to perform transcription automatically in order to provide an abundance of data for statistical characterization of spontaneous discourse. USE THIS KNOWLEDGE TO DEVELOP COMPUTATIONAL TECHNIQUES TAILORED TO THE PROPERTIES OF THE SPEECH DOMAIN – A detailed knowledge of spoken language is essential for deriving a computational framework for ASR. The phonetic properties of speech are structured in different ways depending on the location within the syllable, word and phrase. Such knowledge is currently under-utilized by mainstream ASR. FOCUS ON LOWER TIERS OF SPOKEN LANGUAGE FOR THE PRESENT – It is fashionable to emphasize the importance of “language” models (i.e., word co-occurrence properties) in ASR. However, most of the problems lie in the acoustic-phonetic front end and therefore this domain should be attacked first. USE KNOWLEDGE OF HOW HUMAN LISTENERS UNDERSTAND SPOKEN LANGUAGE TO GUIDE DEVELOPMENT OF ASR ALGORITHMS – Current ASR acoustic models are not based on perceptual capabilities of human listeners, but on a distorted representation of what is important in hearing. It is important to perform intelligibility experiments to ascertain the identity of the truly important components of the speech signal and use this knowledge to develop robust, acoustic-front-end models for ASR.

98 Linguistic ASR Research @ ICSI PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY – Human listening experiments identifying the specific properties crucial for understanding spoken language MODULATION-SPECTRUM-BASED AUTOMATIC SPEECH RECOGNITION – Using auditory-based algorithms (linked to the syllable) for reliable ASR in background noise and reverberation SYLLABLE-BASED AUTOMATIC SPEECH RECOGNITION – Development of a syllable-based decoder for ASR STATISTICAL PROPERTIES OF SPONTANEOUS SPEECH – Detailed and comprehensive statistical analyses of the Switchboard corpus pertaining to phonetic, prosodic and lexical properties, used for developing pronunciation models (among other things) AUTOMATIC PHONETIC LABELING AND SEGMENTATION – Development of (the first) automatic phonetic transcription system using articulatory-acoustic features (e.g., voicing, manner, place, etc.) AUTOMATIC LABELING OF PROSODIC STRESS – Development of (the first) automatic system for labeling prosodic stress in English AUTOMATIC SPEECH RECOGNITION DIAGNOSTIC EVALUATION – Detailed and comprehensive analyses of Switchboard-corpus ASR systems in order to identify factors associated with word error

99 Linguistic ASR at ICSI SENIOR PERSONNEL –Steven Greenberg - Linguistic ASR, Spoken Language Statistics, Speech Perception –Lokendra Shastri - Neural Network Design, Higher-level Language & Neural Processing GRADUATE STUDENTS –Shawn Chang - ANN-based ASR, Automatic Phonetic Transcription & Segmentation –Michael Shire - Temporal & Multi-Stream Approaches to Automatic Speech Recognition –Mirjam Wester - Pronunciation Modeling in Automatic Speech Recognition UNDERGRADUATE STUDENTS –Micah Farrer - Database Development for ASR Analysis –Leah Hitchcock - Statistics of Pronunciation and Prosody of Spoken Language TECHNICAL STAFF –Joy Hollenback - Statistical Analyses, Data Collection and Maintenance ASSOCIATES AT ICSI – Hynek Hermansky, Nelson Morgan, Liz Shriberg and Andreas Stolcke ASSOCIATES AT LOCATIONS OTHER THAN ICSI –Takayuki Arai (Sophia University, Tokyo) - Speech Perception, Signal Processing –Les Atlas (University of Washington, Seattle) - Acoustic Signal Processing –Ken Grant (Walter Reed Army Medical Center) - Audio-visual Speech Processing –David Poeppel (University of Maryland) - Brain Mechanisms of Language –Tim Roberts (UC-San Francisco Medical Center) - Brain Imaging of Language Processes –Christoph Schreiner (UCSF) - Auditory Cortex and Its Relation to Speech Processing –Lloyd Watts (Applied Neurosystems) - Auditory Modeling from Cochlea to Cortex CURRENT FUNDING –National Security Agency - Automatic Transcription of Phonetic and Prosodic Elements –National Science Foundation - Syllable-based ASR, Speech Perception, Statistics of Speech

100 Linguistic ASR at ICSI (continued) FORMER ICSI POST-DOCTORAL FELLOWS –Takayuki Arai - Sophia University, Tokyo –Dan Ellis - Columbia University (as of September 1, 2000) –Eric Fosler - Bell Laboratories, Lucent Technologies –Rosaria Silipo - Nuance Communications FORMER ICSI GRADUATE STUDENTS –Jeff Bilmes - University of Washington, Seattle –Eric Fosler - Bell Laboratories, Lucent Technologies –Brian Kingsbury - IBM, Yorktown Heights –Katrin Kirchhoff - University of Washington, Seattle –Nikki Mirghafori - Nuance Communications –Su-Lin Wu - Nuance Communications FORMER ICSI UNDERGRADUATE STUDENTS –Candace Cardinal - Nuance Communications –Rachel Coulston - University of California, San Diego –Colleen Richey - Stanford University

101 Publications - Linguistic ASR AUTOMATIC SPEECH RECOGNITION DIAGNOSTIC EVALUATION Greenberg, S., Chang, S. and Hollenback, J. (2000) An introduction to the diagnostic evaluation of the Switchboard-corpus automatic speech recognition systems. Proceedings of the NIST Speech Transcription Workshop, College Park. Greenberg, S. and Chang, S. (2000) Linguistic dissection of Switchboard-corpus automatic recognition systems. Proceedings of the ICSI Workshop on Automatic Speech Recognition: Challenges for the New Millennium, Paris. AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing. Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724. STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco. Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany. Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation, Speech Communication, 29, 159-176. Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32. Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus, in Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27. AUTOMATIC LABELING OF PROSODIC STRESS IN SPONTANEOUS SPEECH Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the International Congress of Phonetic Sciences, San Francisco. Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park.

102 Publications - Linguistic ASR (continued) MODULATION-SPECTRUM-BASED AUTOMATIC SPEECH RECOGNITION Greenberg, S. and Kingsbury, B. (1997) The modulation spectrogram: In pursuit of an invariant representation of speech, in ICASSP-97, IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, pp. 1647-1650. Kingsbury, B., Morgan, N. and Greenberg, S. (1999) The modulation-filtered spectrogram: A noise-robust speech representation, in Proceedings of the Workshop on Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland. Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Robust speech recognition using the modulation spectrogram, Speech Communication, 25, 117-132. SYLLABLE-BASED AUTOMATIC SPEECH RECOGNITION Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Incorporating information from syllable-length time scales into automatic speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 721-724. Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Performance improvements through combining phone- and syllable-length information in automatic speech recognition, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 854-857. PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY GERMANE TO ASR Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936. Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678. Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77. Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8. Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations, Proceedings of Eurospeech, Budapest.

103 Syllable Frequency - Spontaneous English The distribution of syllable frequency in spontaneous speech differs markedly from that in dictionaries

104 Word Frequency in Spontaneous English Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10. Word frequency is logarithmically related to rank order in the corpus (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.). Computed from the Switchboard corpus (American English telephone dialogues)

105 The Intricate Web of Research

