Presentation is loading. Please wait.

Presentation is loading. Please wait.

 Speech is bimodal essentially. Acoustic and Visual cues. H. McGurk and J. MacDonald, ''Hearing lips and seeing voices'', Nature, pp. 746-748, December.

Similar presentations


Presentation on theme: " Speech is bimodal essentially. Acoustic and Visual cues. H. McGurk and J. MacDonald, ''Hearing lips and seeing voices'', Nature, pp. 746-748, December."— Presentation transcript:

1  Speech is bimodal essentially. Acoustic and Visual cues. H. McGurk and J. MacDonald, ''Hearing lips and seeing voices'', Nature, pp. 746-748, December 1976. T. Chen and R. Rao, ''Audio-visual integration in multimodal communication'', Proceedings of the IEEE, Special issue in Multimedia Signal Processing, vol. 86, pp. 837-852, May 1998. D.B. Stork and M.E. Hennecke editors, ''Speechreading by Hummans and Machines. Springer, Berlin Germany, 1996.  Through their integration (fusion), the aim is: To increase the robustness and performance. There are too many papers, so lets see a few... but from the point of view of integration most of them. AudioVisual-SpeechRecognition (AVSR)

2 ● Tutorials. G. Potamianos, C. Neti, G. Gravier, A. Garg and A.W. Senior. Recent advances in the authomatic recognition of audio-visual speech. ''Proceedings of the IEEE, vil. 91(9), pp. 1306-1326, September 2003. G. Potamianos, C. Neti, J. Luettin and I. Mattews, ''Audio-visual automatic speech recognition: An overview'' In G. Bailly, E. Vatikiotis-Bateson and P. Perrier, edts. Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press, 2004. ● Real Conditions. G. Potamianos and C. Neti, ''Audio-visual speech recognition in challenging environments'', In proc. European conference on Speech Technology, pp. 1293- 1296, 2003. G. Potamianos, C. Neti, J. Huang, J.H. Connell, S. Chu, V. Libal, E. Marcheret, N. Haas and J. Jiang, ''Towards practical deployement of audio-visual speech recognition'', ICASSP'04, vol. 3, pp. 777-780, Montreal Canada, 2004. AudioVisual-SpeechRecognition (AVSR)

3  Increase Robustness and Performance Based on the fact: Visual modality is independent to most of the lost of acoustic quality. Visual and Acoustic modalities work in a complementary manner. B. Dodd and R. Campbell, eds, ''Hearing by Eye: The psychology of Lipreading''. London, England. Laurence Erlbaum Associates Ltd., 1987. But, if the integrations is not well done: Catastrofic fusion. J.R. Movellan and P. Mineiro, ''Modularity and catastrophic fusion: A bayesian approach with applications to audio-visual speech recognition'', Tech. Rep. 97.01, Departement of Cognitive Science, UCSD, San Diego, CA, 1997. AudioVisual-SpeechRecognition (AVSR)

4  Early Integration (EI): In the feature level, concatenate the features. But features are not synchronous (VOT)! S. Dupont and J. Luettin, ''Audio-visual speech modeling for continuos speech recognition'', IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000. C. C. Chibelushi, J.S. Mason and F. Deravi, ''Integration of acoustic and visual speech for speaker recognition'', Eurospeech'93, Berlin,pp.157-160, September 1993. AVSR Integration (Fusion) voice onset time (VOT)

5  Late Integration (LI): In the decision level, combine the scores. Lost of all temporal information! A. Adjoudani and C. Benoit, ''Audio-visual speech recognition compared acroos two architectures'', Eurospeech'95, Madrid Spain, pp. 1563-1566, September 1995. S. Dupont and J. Luettin, ''Audio-visual speech modeling for continuos speech recognition'', IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000. M. Heckmann, F. Berthommier and K. Kroschel, ''Noise adaptive stream weighting in audio-visual speech recognition'', EUROASIP Journal of Applied Signal Processing, vol. 1, pp. 1260-1273, November 2002. AVSR Integration (Fusion)

6  Middle Integration (MI) allows: Specific word or sub-word models. Synchronous continuous speech recognition. J. Luettin, G. Potamianos and C. Neti, ''Asynchronous stream modeling for large vocabulary audio-visual speech recognition'', ICASSP'01, vol. 1, pp. 169-172, Salt Lake City USA, May 2001. G. Potamianos, J. Luettin and C. Neti, '' Hierarchical discriminant features for audio-visual LVCSR'', ICASSP'01, vol. 1, pp. 165-168, Salt Lake City USA, May 2001. AVSR Integration (Fusion)

7  Multistream HMM State synchrony Weighting the observations A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, ''Dynamic Bayesian Networks for audio-visual speech recognition'',EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002. G. Potamianos, C. Neti, J. Luettin and I. Mattews, ''Audio-visual automatic speech recognition: An overview'' In G. Bailly, E. Vatikiotis-Bateson and P. Perrier, edts. Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press, 2004. AVSR Integration, Dynamic Bayesian Networks t=1t=0 t=2 t=T

8  Product HMM Asynchrony between the streams Too many parameters I am not sure about this graphical representation G. Gravier, G. Potamianos and C. Neti, ''Asynchrony modeling for audio-visual speech recognition'', In Human Language Technology Conference, 2002. AVSR Integration, Dynamic Bayesian Networks t=1 t=0 t=2 t=T

9  Factorial HMM Transition probabilities are independents for each stream. Z. Ghahramani and M.I. Jordan, ''Factorial hidden markov models'', In Proc. Advances in Neural Information Processing Systems, vol. 8 pp. 472-478, 1985. AVSR Integration, Dynamic Bayesian Networks t=1 t=0 t=2 t=T

10  Coupled HMM (1/2) The backbones have a dependence. M. Brand, N. Oliver and A. Pentland, ''Coupled hidden markov models for complex action recognition'', In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 994-999, 1997. S. Chu and T. Huang, ''Audio-visual speech modeling using coupled hidden markov models'', ICASSP'02, pp. 2009-2012, 2002. AVSR Integration, Dynamic Bayesian Networks t=1t=0 t=2 t=T

11  Coupled HMM (2/2) A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, ''Dynamic Bayesian Networks for audio-visual speech recognition'',EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002. A. Subramanya, S. Gurbuz, E. Patterson, and J.N. Gowdy, ''Audiovisual speech integration using coupled hidden markov models for continous speech recognition'', ICASSP'03, 2003. AVSR Integration, Dynamic Bayesian Networks

12  Implicite Modeling J.N. Gowdy, A. Subramanaya, C. Bartels and Jeff Bilmes, ''DBN based Multi-stream models for audio-visula speech recognition'', ICASSP'04, Montreal Canada, 2004. X. Lei, G. Ji, T. Ng, J. Bilmes and M. Ostendorf, ''DBN based Multi-stream for Mandarin Toneme Recognition'', ICASSP'05, Filadelphie USA, 2005. AVSR Integration, Dynamic Bayesian Networks


Download ppt " Speech is bimodal essentially. Acoustic and Visual cues. H. McGurk and J. MacDonald, ''Hearing lips and seeing voices'', Nature, pp. 746-748, December."

Similar presentations


Ads by Google