Phonetic features in ASR. Intensive course, Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, 22 – 26 March 1999. Jacques Koreman, Institute of Phonetics, University of the Saarland.

Phonetic features in ASR. Intensive course, Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, 22 – 26 March 1999. Jacques Koreman, Institute of Phonetics, University of the Saarland, P.O. Box, D Saarbrücken.

Organisation of the course Tuesday – Friday: - First half of each session: theory - Second half of each session: practice Interruptions invited!!!

Overview of the course 1.Variability in the signal 2.Phonetic features in ASR 3.Deriving phonetic features from the acoustic signal by a Kohonen network 4.ICSLP’98: “Exploiting transitions and focussing on linguistic properties for ASR” 5.ICSLP’98: “Do phonetic features help to improve consonant identification in ASR?”

The goal of ASR systems Input: spectral description of microphone signal, typically - energy in band-pass filters - LPC coefficients - cepstral coefficients Output: linguistic units, usually phones or phonemes (on the basis of which words can be recognised)
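A spectral front end of this kind is easy to sketch. The toy numpy version below computes log energies in equal-width frequency bands rather than a true mel filterbank or LPC/cepstral analysis; the band layout is an assumption for illustration, while the 15 ms Hamming window, 5 ms step and 0.97 pre-emphasis follow the settings stated in the course data description.

```python
import numpy as np

def log_band_energies(signal, sr=16000, win_ms=15, step_ms=5,
                      pre_emphasis=0.97, n_bands=12):
    """Toy spectral front end: pre-emphasis, Hamming-windowed frames,
    log energy in n_bands equal-width frequency bands per frame."""
    # Pre-emphasis boosts high frequencies: y[t] = x[t] - 0.97 * x[t-1]
    x = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    win = int(sr * win_ms / 1000)    # 240 samples at 16 kHz
    step = int(sr * step_ms / 1000)  # 80 samples at 16 kHz
    frames = [x[i:i + win] * np.hamming(win)
              for i in range(0, len(x) - win + 1, step)]
    feats = []
    for f in frames:
        power = np.abs(np.fft.rfft(f)) ** 2
        bands = np.array_split(power, n_bands)  # equal-width bands
        feats.append(np.log([b.sum() + 1e-10 for b in bands]))
    return np.array(feats)  # shape: (n_frames, n_bands)

# One second of noise at 16 kHz -> one 12-dim vector every 5 ms
rng = np.random.default_rng(0)
feats = log_band_energies(rng.standard_normal(16000))
print(feats.shape)
```

A real system would replace the equal-width bands with mel-spaced triangular filters followed by a cosine transform to obtain cepstral coefficients.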

Variability in the signal (1) Main problem in ASR: variability in the input signal. Example: /k/ has very different realisations in different contexts. Its place of articulation varies from velar before back vowels to pre-velar before front vowels (own articulation of “keep”, “cool”).

Variability in the signal (2) Main problem in ASR: variability in the input signal Example: /g/ in canonical form is sometimes realised as a fricative or approximant, e.g. intervocalically (OE. regen > E. rain). In Danish, this happens to all intervocalic voiced plosives; also, voiceless plosives become voiced.

Variability in the signal (3) Main problem in ASR: variability in the input signal Example: /h/ has very different realisations in different contexts. It can be considered as a voiceless realisation of the surrounding vowels. (spectrograms “ihi”, “aha”, “uhu”)

Variability in the signal (3a) [Figure: spectrograms of “ihi”, “aha” and “uhu” with segment labels i:, a:, u:, h]

Variability in the signal (4) Main problem in ASR: variability in the input signal. Example: deletion of segments due to articulatory overlap. Friction is superimposed on the vowel signal. (spectrogram of German “System”)

Variability in the signal (4a) [Figure: spectrogram of German “System” with segment labels]

Variability in the signal (5) Main problem in ASR: variability in the input signal. Example: the same vowel /a:/ is realised differently depending on its context. (spectrograms “aba”, “ada”, “aga”)

Variability in the signal (5a) [Figure: spectrograms of “aba”, “ada” and “aga” with segment labels]

Modelling variability Hidden Markov models can represent the variable signal characteristics of phones. [Diagram: left-to-right HMM from start state S to end state E with self-loop probabilities p1, p2, p3 and exit probabilities 1-p1, 1-p2, 1-p3]
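A left-to-right phone HMM like the one in the diagram can be written down directly. The sketch below uses illustrative self-loop probabilities and a toy discrete emission alphabet (not a trained model) and scores observation sequences with the forward algorithm:

```python
import numpy as np

# Left-to-right HMM for one phone: each state i loops with prob p_i
# and moves to the next state with prob 1 - p_i, as in the diagram.
p = np.array([0.6, 0.5, 0.7])  # illustrative self-loop probabilities
n = len(p)
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = p[i]
    if i + 1 < n:
        A[i, i + 1] = 1 - p[i]

# Discrete emission probabilities over a toy 4-symbol alphabet:
# state 0 prefers symbol 0, state 1 symbol 1, state 2 symbol 2.
B = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1]])

def forward_likelihood(obs):
    """P(obs | model) via the forward algorithm; enter in state 0,
    leave the model from the last state with prob 1 - p[-1]."""
    alpha = np.zeros(n)
    alpha[0] = B[0, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha[-1] * (1 - p[-1])

# A sequence that moves left-to-right through the states scores
# higher than the same symbols in reversed order.
print(forward_likelihood([0, 0, 1, 2, 2]) >
      forward_likelihood([2, 1, 0, 0, 2]))
```

The self-loops are what absorb duration variability: a slow realisation of the phone simply spends more frames looping in each state.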

Lexicon and language model (1) Linguistic knowledge about phone sequences (lexicon, language model) improves word recognition. Without linguistic knowledge, phone accuracy is low.
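How a language model reranks competing word hypotheses can be sketched with a toy word-bigram model; all probabilities below are invented for illustration:

```python
import math

# Toy bigram language model: P(word | previous word), invented numbers.
bigram = {
    ("<s>", "recognise"): 0.10, ("recognise", "speech"): 0.50,
    ("<s>", "wreck"): 0.01, ("wreck", "a"): 0.30,
    ("a", "nice"): 0.20, ("nice", "beach"): 0.10,
}

def lm_logprob(words):
    """Log probability of a word sequence under the bigram model."""
    lp = 0.0
    for prev, w in zip(["<s>"] + words, words):
        lp += math.log(bigram.get((prev, w), 1e-6))  # floor for unseen pairs
    return lp

h1 = lm_logprob(["recognise", "speech"])
h2 = lm_logprob(["wreck", "a", "nice", "beach"])
print(h1 > h2)  # prints True: the LM prefers the more probable sequence
```

In a recogniser this LM score is combined with the acoustic likelihood of each hypothesis; as the next slides note, both parses here are valid English, so the LM alone cannot always decide.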

Lexicon and language model (2) Using a lexicon and/or language model is not a top-down solution to all problems: sometimes pragmatic knowledge is needed. Example: the same phone sequence can be heard as “Recognise speech” or “Wreck a nice beach”.

Lexicon and language model (3) Using a lexicon and/or language model is not a top-down solution to all problems: sometimes pragmatic knowledge is needed. Example: the same phone sequence can be heard as “Get up at eight o’clock” or “Get a potato clock”.

CONCLUSIONS The acoustic parameters (e.g. MFCCs) are very variable. We must try to improve phone accuracy by extracting linguistic information. Rationale: word recognition rates will increase if phone accuracy improves. BUT: not all our problems can be solved. Practical:

Phonetic features in ASR Assumption: phone accuracy can be improved by deriving phonetic features from the spectral representation of the speech signal What are phonetic features?

A phonetic description of sounds The articulatory organs

A phonetic description of sounds The articulation of consonants velum (= soft palate) tongue

A phonetic description of sounds The articulation of vowels

Phonetic features: IPA IPA (International Phonetic Alphabet) chart - consonants and vowels - only phonemic distinctions

The IPA chart (consonants)

The IPA chart (other consonants)

The IPA chart (non-pulm. cons.)

The IPA chart (vowels)

The IPA chart (diacritics)

IPA features (obstruents) [Feature matrix of the obstruents p, b, t, d, k, g, f, vfri, vapr, T, Dfri, s, z, S, Z, C, x against IPA place, manner and voicing features]

IPA features (sonorants) [Feature matrix of the sonorants m, n, J, N, l, L, rret, ralv, Ruvu, j, w, h against IPA place, manner and voicing features] A zero value is assigned to all vowel features (not listed here)

IPA features (vowels) [Feature matrix of the vowels i, I, y, Y, u, U, e, o, O, V, Q, Uschwa, {, a, A, E against IPA height, backness and rounding features] A zero value is assigned to all consonant features (not listed here)

Phonetic features Phonetic features - different systems (JFH, SPE, art. feat.) - distinction between “natural classes” which undergo the same phonological processes

SPE features (obstruents) [Feature matrix of the obstruents p, b, tden, t, d, k, g, f, vfri, T, Dfri, s, z, S, Z, C, x against SPE features]

SPE features (sonorants) [Feature matrix of the sonorants m, n, J, N, l, L, ralv, Ruvu, rret, j, vapr, w, h against SPE features]

SPE features (vowels) [Feature matrix of the vowels i, I, e, E, {, a, y, Y, A, Q, V, O, o, U, u, Uschwa against SPE features]

CONCLUSION Different feature matrices have different implications for relations between phones Practical:

Kohonen networks Kohonen networks are unsupervised neural networks Our Kohonen networks take vectors of acoustic parameters (MFCC_E_D) as input and output phonetic feature vectors Network size: 50 x 50 neurons

Training the Kohonen network 1. Self-organisation results in a phonotopic map 2. Phone calibration attaches an array of phones to each winning neuron 3. Feature calibration replaces the array of phones by an array of phonetic feature vectors 4. Averaging of the phonetic feature vectors for each neuron

Mapping with the Kohonen network Acoustic parameter vector belonging to one frame activates neuron Weighted average of phonetic feature vector attached to winning neuron and K-nearest neurons is output
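The training and mapping procedure on the last two slides can be sketched end to end. Everything below is a toy stand-in: 2-D points instead of MFCC_E_D vectors, a 10-unit one-dimensional map instead of 50 x 50 neurons, invented learning-rate and neighbourhood schedules, and K = 3 for the nearest-neuron averaging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "acoustic" data: two clusters standing in for two phone classes,
# each frame paired with a phonetic feature vector (e.g. [voiced, plosive]).
data = np.vstack([rng.normal(-2, 0.3, (100, 2)),
                  rng.normal(+2, 0.3, (100, 2))])
feats = np.vstack([np.tile([1.0, 0.0], (100, 1)),
                   np.tile([0.0, 1.0], (100, 1))])

# 1. Self-organisation: train a small 1-D map.
n_units = 10
w = rng.normal(0, 1, (n_units, 2))
for t in range(2000):
    x = data[rng.integers(len(data))]
    bmu = np.argmin(((w - x) ** 2).sum(axis=1))  # winning neuron
    lr = 0.5 * (1 - t / 2000)                    # decaying learning rate
    sigma = max(1e-3, 3 * (1 - t / 2000))        # shrinking neighbourhood
    h = np.exp(-((np.arange(n_units) - bmu) ** 2) / (2 * sigma ** 2))
    w += lr * h[:, None] * (x - w)

# 2.-4. Calibration: attach to each neuron the average phonetic feature
# vector of the training frames it wins.
wins = np.argmin(((data[:, None, :] - w[None]) ** 2).sum(-1), axis=1)
calib = np.zeros((n_units, 2))
for u in range(n_units):
    if (wins == u).any():
        calib[u] = feats[wins == u].mean(axis=0)

# Mapping: weighted average over the winner and K nearest neurons.
def map_frame(x, K=3):
    d = ((w - x) ** 2).sum(axis=1)
    nearest = np.argsort(d)[:K]
    wts = 1.0 / (d[nearest] + 1e-10)
    return (wts[:, None] * calib[nearest]).sum(0) / wts.sum()

# Feature vector output for a frame from the first cluster.
print(map_frame(np.array([-2.0, -2.0])))
```

The many-to-one character of the mapping is visible here: any frame near the first cluster, however it varies acoustically, comes out close to the same phonetic feature vector.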

Advantages of Kohonen networks Reduction of feature dimensions possible Mapping onto linguistically meaningful dimensions (phonetically less severe confusions) Many-to-one mapping allows different allophones (acoustic variability) to be mapped onto the same phonetic feature values Automatic and fast mapping

Disadvantages of Kohonen networks They need to be trained on manually segmented and labelled material BUT: cross-language training has been shown to be successful

Hybrid ASR system [Diagram: MFCCs + energy + delta parameters → Kohonen network → phonetic features → hidden Markov modelling (with lexicon and language model) → phone; in the BASELINE system the acoustic parameters feed into hidden Markov modelling directly]

CONCLUSION Practical: Acoustic-phonetic mapping extracts linguistically relevant information from the variable input signal.

ICSLP’98 Exploiting transitions and focussing on linguistic properties for ASR Jacques Koreman William J. Barry Bistra Andreeva Institute of Phonetics, University of the Saarland Saarbrücken, Germany

INTRODUCTION Variation in the speech signal caused by coarticulation between sounds is one of the main challenges in ASR. Exploit variation if you cannot reduce it: coarticulatory variation causes vowel transitions to be acoustically less homogeneous, but at the same time provides information about neighbouring sounds which can be exploited (experiment 1). Reduce variation if you cannot exploit it: some of the variation is not relevant for the phonemic identity of the sounds. Mapping of acoustic parameters onto IPA-based phonetic features like [± plosive] and [± alveolar] extracts only linguistically relevant properties before hidden Markov modelling is applied (experiment 2).

The controlled experiments presented here reflect our general aim of using phonetic knowledge to improve the ASR system architecture. In order to evaluate the effect of the changes in bottom-up processing, no lexicon or language model is used. Both improve phone identification in a top-down manner by preventing the identification of inadmissible words (lexical gaps or phonotactic restrictions) or word sequences. No lexicon or language model

DATA Texts: English, German, Italian and Dutch texts from the EUROM 0 database, read by 2 male + 2 female speakers per language

DATA Signals: 16 kHz microphone signals; 12 mel-frequency cepstral coefficients (MFCCs), energy and the corresponding delta parameters; Hamming window: 15 ms, step size: 5 ms, pre-emphasis: 0.97

DATA Labels: intervocalic consonants labelled with SAMPA symbols, except plosives and affricates, which are divided into closure and frication subphone units; 35-ms vowel transitions labelled as i_lab, alv_O (experiment 1) or V_lab, alv_V (experiment 2), where lab, alv = consonant generalised across place and V = generalised vowel

EXPERIMENT 1: SYSTEM [Diagram: MFCCs + energy + delta parameters → hidden Markov modelling (no lexicon or language model) → consonant; the BASELINE system models C alone, the experimental system models Voffset - C - Vonset]

EXPERIMENT 1: RESULTS

EXPERIMENT 1: CONCLUSIONS When vowel transitions are used: consonant identification rate improves place better identified manner identified worse, because hidden Markov models for vowel transitions generalize across all consonants sharing the same place of articulation (solution: do not pool consonants sharing the same place of articulation) vowel transitions can be exploited for identification of the consonant, particularly its place of articulation

EXPERIMENT 2: SYSTEM [Diagram: MFCCs + energy + delta parameters → Kohonen network → phonetic features → hidden Markov modelling (no lexicon or language model) → consonant; the BASELINE system feeds the acoustic parameters into hidden Markov modelling directly]

EXPERIMENT 2: RESULTS

EXPERIMENT 2: CONCLUSIONS When acoustic-phonetic mapping is applied: consonant identification rate improves strongly place better identified manner better identified Phonetic features better address linguistically relevant information than acoustic parameters

EXPERIMENT 3: SYSTEM [Diagram: MFCCs + energy + delta parameters for Voffset - C - Vonset → Kohonen network → phonetic features → hidden Markov modelling (no lexicon or language model) → consonant; the BASELINE system is the mapping system for C alone]

EXPERIMENT 3: RESULTS

EXPERIMENT 3: CONCLUSIONS When transitions are used for acoustic-phonetic mapping: consonant identification rate does not improve place identification improves slightly manner identification rate decreases slightly Vowel transitions do not increase the identification rate because: the baseline identification rate is already high vowel transitions are undertrained in the Kohonen networks

INTERPRETATION (1) The greatest improvement in consonant identification is achieved in experiment 2. By mapping acoustically different realisations of consonants onto more similar phonetic features, the input to hidden Markov modelling becomes more homogeneous, leading to a higher consonant identification rate. Using vowel transitions also leads to a higher consonant identification rate in experiment 1. It was shown that particularly the consonants’ place is identified better. Findings confirm the importance of transitions as known from perceptual experiments.

INTERPRETATION (2) The additional use of vowel transitions when acoustic-phonetic mapping is applied does not improve the identification results. Two possible explanations for this have been suggested: - the identification rates are high anyway when mapping is applied, so that it is less likely that large improvements are found - the generalized vowel transitions are undertrained in the Kohonen networks, because the intrinsically variable frames are spread over a larger area in the phonotopic map. The latter interpretation is currently being verified by Sibylle Kötzer by applying the methodology to a larger database (TIMIT).

REFERENCES (1) Bitar, N. & Espy-Wilson, C. (1995a). Speech parameterization based on phonetic features: application to speech recognition. Proc. 4th Eurospeech. Cassidy, S. & Harrington, J. (1995). The place of articulation distinction in voiced oral stops: evidence from burst spectra and formant transitions. Phonetica 52. Delattre, P., Liberman, A. & Cooper, F. (1955). Acoustic loci and transitional cues for consonants. JASA 27(4). Furui, S. (1986). On the role of spectral transitions for speech perception. JASA 80(4). Koreman, J., Andreeva, B. & Barry, W.J. (1998). Do phonetic features help to improve consonant identification in ASR? Proc. ICSLP.

REFERENCES (2) Koreman, J., Barry, W.J. & Andreeva, B. (1997). Relational phonetic features for consonant identification in a hybrid ASR system. PHONUS 3. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland. Koreman, J., Erriquez, A. & Barry, W.J. (to appear). On the selective use of acoustic parameters for consonant identification. PHONUS 4. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland. Stevens, K. & Blumstein, S. (1978). Invariant cues for place of articulation in stop consonants. JASA 64(5).

SUMMARY Practical: Acoustic-phonetic mapping by a Kohonen network improves consonant identification rates.

ICSLP’98 Do phonetic features help to improve consonant identification in ASR? Jacques Koreman Bistra Andreeva William J. Barry Institute of Phonetics, University of the Saarland Saarbrücken, Germany

INTRODUCTION Variation in the acoustic signal is not a problem for human perception, but causes inhomogeneity in the phone models for ASR, leading to poor consonant identification. We should “directly target the linguistic information in the signal and... minimize other extra-linguistic information that may yield large speech variability” (Bitar & Espy-Wilson 1995a, p. 1411). Bitar & Espy-Wilson do this by using a knowledge-based event-seeking approach for extracting phonetic features from the microphone signal on the basis of acoustic cues. We propose an acoustic-phonetic mapping procedure on the basis of a Kohonen network.

DATA English, German, Italian and Dutch texts from the EUROM 0 database, read by 2 male + 2 female speakers per language Texts

DATA Signals: 16 kHz microphone signals; 12 mel-frequency cepstral coefficients (MFCCs), energy and the corresponding delta parameters; Hamming window: 15 ms, step size: 5 ms, pre-emphasis: 0.97

DATA (1) Labels The consonants were transcribed with SAMPA symbols, except: plosives and affricates are subdivided into a closure (“p0” = voiceless closure; “b0” = voiced closure) and a burst-plus-aspiration (“p”, “t”, “k”) or frication part (“f”, “s”, “S”, “z”, “Z”) Italian geminates were pooled with non-geminates to prevent undertraining of geminate consonants the Dutch voiced velar fricative [ɣ], which only occurs in some dialects, was pooled with its voiceless counterpart [x] to prevent undertraining

DATA (2) Labels SAMPA symbols are phonemic within a language, but can represent different allophones cross-linguistically. These were relabelled as shown in the table below:
SAMPA | allophone label | description | language
r | rapr | alveolar approximant | English
r | ralv | alveolar trill | It., Dutch
r | Ruvu | uvular trill | G., Dutch
v | vapr | labiodental approximant | German
v | vfri | voiced labiodental fricative | E., It., NL
w | vapr | labiodental approximant | Dutch
w | w | bilabial approximant | Engl., It.

SYSTEM ARCHITECTURE [Diagram: MFCCs + energy + delta parameters → Kohonen network → phonetic features → hidden Markov modelling (no lexicon or language model) → consonant; the BASELINE system feeds the acoustic parameters into hidden Markov modelling directly]

CONFUSIONS BASELINE [Confusion matrix; shading marks confusions with 1, 2 or 3 of the phonetic categories manner, place and voicing wrong] (by Attilio Erriquez)

CONFUSIONS MAPPING [Confusion matrix; shading marks confusions with 1, 2 or 3 of the phonetic categories manner, place and voicing wrong] (by Attilio Erriquez)

ACIS = total of all correct identification percentages / number of consonants to be identified. Baseline system: 31.22 %. Mapping system: 68.47 %. The Average Correct Identification Score compensates for the number of occurrences in the database, giving each consonant equal weight. It is the total of all percentage numbers along the diagonal of the confusion matrix divided by the number of consonants.
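The ACIS definition takes only a few lines of code. The confusion matrix below is invented for illustration (rows = produced consonant, columns = identified consonant, row-wise percentages); the real matrices are per-consonant over the full EUROM 0 consonant set.

```python
import numpy as np

def acis(confusion_pct):
    """Average Correct Identification Score: mean of the diagonal of a
    row-wise percentage confusion matrix, so every consonant gets equal
    weight regardless of how often it occurs in the database."""
    return np.trace(confusion_pct) / confusion_pct.shape[0]

# Hypothetical 3-consonant confusion matrix (row percentages).
C = np.array([[80.0, 15.0,  5.0],
              [10.0, 60.0, 30.0],
              [ 5.0,  5.0, 90.0]])
print(acis(C))  # (80 + 60 + 90) / 3
```

Because each row is already normalised to percentages, a rare consonant contributes exactly as much to the score as a frequent one.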

BASELINE SYSTEM good identification of language-specific phones (reason: acoustic homogeneity) poor identification of other phones [Table: % correct per consonant, baseline vs. mapping, for language-specific phones of German, Italian, English and x (G., NL)]

MAPPING SYSTEM good identification, also of acoustically variable phones (reason: variable acoustic parameters are mapped onto homogeneous, distinctive phonetic features) [Table: % correct per consonant, baseline vs. mapping: h (E., G., NL), k (all), b (all), d (all), t (all), p (all), etc.]

AFFRICATES (1) [Table: % correct per consonant, baseline vs. mapping: pf (German), f (all), ts (German, It.), s (all), tS (E., G., It.), S (all), dz (Italian), z (all), dZ (English, It.; no intervocalic realisations)]

AFFRICATES (2) affricates, although restricted to fewer languages, are recognised poorly in the baseline system reason: they are broken up into closure and frication segments, which are trained separately in the Kohonen networks; these segments occur in all languages and are acoustically variable, leading to poor identification this is corroborated by the poor identification rates for fricatives in the baseline system (exception: /  /, which only occurs rarely) after mapping, both fricatives and affricates are identified well

APMS = phonetic misidentification coefficient / sum of the misidentification percentages. Baseline system: 1.79. Mapping system: 1.57. The Average Phonetic Misidentification Score gives a measure of the severity of the consonant confusions in terms of phonetic features. The numerator is the sum of all products of the misidentification percentages (in the non-diagonal cells) times the number of misidentified phonetic categories (manner, place and voicing). It is divided by the total of all the percentage numbers in the non-diagonal cells.
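The APMS computation can be sketched the same way. Both matrices below are invented for illustration: a 3-consonant confusion matrix (row percentages) and a severity matrix counting how many of the categories manner, place and voicing differ between each consonant pair.

```python
import numpy as np

def apms(confusion_pct, n_categories_wrong):
    """Average Phonetic Misidentification Score: off-diagonal
    misidentification percentages weighted by how many phonetic
    categories (manner, place, voicing) each confusion gets wrong,
    divided by the total off-diagonal percentage."""
    off = ~np.eye(confusion_pct.shape[0], dtype=bool)
    weighted = (confusion_pct * n_categories_wrong)[off].sum()
    return weighted / confusion_pct[off].sum()

# Hypothetical confusion matrix (rows = produced, cols = identified, %).
C = np.array([[80.0, 15.0,  5.0],
              [10.0, 60.0, 30.0],
              [ 5.0,  5.0, 90.0]])
# Invented severity matrix: e.g. consonants 0 and 1 differ only in
# voicing (1 category), 0 and 2 in two categories, 1 and 2 in all three.
D = np.array([[0, 1, 2],
              [1, 0, 3],
              [2, 3, 0]])
print(apms(C, D))
```

An APMS of 1 would mean every confusion gets exactly one phonetic category wrong; lower values mean the errors that remain are phonetically closer to the produced consonant.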

After mapping, an incorrectly identified consonant is on average closer to the phonetic identity of the consonant which was produced. Reason: the Kohonen network is able to extract linguistically distinctive phonetic features which allow for a better separation of the consonants in hidden Markov modelling.

CONSONANT CONFUSIONS BASELINE — cons. identified as: r: r (84%), [?] (5%), l (4%); j: j (94%), z (6%); m: m (63%), n (11%), [?] (10%), r (6%); n: n (26%), m (21%), [?] (20%), r (6%); [?]: [?] (46%), n (23%), m (15%), [?] (8%). MAPPING — cons. identified as: r: g (61%), [?] (16%), [?] (13%); j: [?] (53%), j (18%), [?] (12%), [?] (6%), r (6%), [?] (6%); m: [?] (23%), [?] (18%), m (16%), [?] (13%), [?] (10%); n: [?] (28%), [?] (18%), [?] (16%), [?] (12%), m (8%), [?] (8%); [?]: [?] (42%), [?] (15%), [?] (15%), m (8%), [?] (8%), [?] (8%). ([?] marks IPA symbols lost in the transcript.)

CONCLUSIONS Acoustic-phonetic mapping helps to address linguistically relevant information in the speech signal, ignoring extra- linguistic sources of variation. The advantages of mapping are reflected in the two measures which we have presented: ACIS shows that mapping leads to better consonant identification rates for all except a few of the language- specific consonants. The improvement can be put down to the system’s ability to map acoustically variable consonant realisations to more homogeneous phonetic feature vectors.

CONCLUSIONS Acoustic-phonetic mapping helps to address linguistically relevant information in the speech signal, ignoring extra- linguistic sources of variation. The advantages of mapping are reflected in the two measures which we have presented: APMS shows that the confusions which occur in the mapping experiment are less severe than in the baseline experiment from a phonetic point of view. There are fewer confusions on the phonetic dimensions manner, place and voicing when mapping is applied, because the system focuses on distinctive information in the acoustic signals.

REFERENCES (1) Bitar, N. & Espy-Wilson, C. (1995a). Speech parameterization based on phonetic features: application to speech recognition. Proc. 4th European Conference on Speech Communication and Technology. Bitar, N. & Espy-Wilson, C. (1995b). A signal representation of speech based on phonetic features. Proc. 5th Annual Dual-Use Techn. and Applications Conf. Kirchhoff, K. (1996). Syllable-level desynchronisation of phonetic features for speech recognition. Proc. ICSLP. Dalsgaard, P. (1992). Phoneme label alignment using acoustic-phonetic features and Gaussian probability density functions. Computer Speech and Language 6.

REFERENCES (2) Koreman, J., Barry, W.J. & Andreeva, B. (1997). Relational phonetic features for consonant identification in a hybrid ASR system. PHONUS 3. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland. Koreman, J., Barry, W.J. & Andreeva, B. (1998). Exploiting transitions and focussing on linguistic properties for ASR. Proc. ICSLP (these proceedings).

SUMMARY Acoustic-phonetic mapping leads to fewer and phonetically less severe consonant confusions. Practical:

THE END