Robust Methods for Automatic Transcription and Alignment of Speech Signals
Course presentation: Speech Recognition
Leif Grönqvist
Stockholm, 6 Feb 2004


1. Robust Methods for Automatic Transcription and Alignment of Speech Signals
Course presentation: Speech Recognition
Leif Grönqvist
Växjö University (Mathematics and Systems Engineering)
GSLT (Graduate School of Language Technology)
Göteborg University (Department of Linguistics)
Stockholm, 6 Feb 2004

2. Introduction: GSLC
 GSLC (Göteborg Spoken Language Corpus): a multimodal corpus
  - Video and/or audio recordings
 GTS (Göteborg Transcription Standard)
  - Overlaps on the word level, background information, and comments relevant for interaction
 MSO (Modified Standard Orthography)
  - Closer to speech than written language
  - NOT phonetic
  - Keeps the possibility to compare with written language
 Designed for studies of natural speech in various activities
 25 social activity types, 200 hours, 360 recordings, 1.3 million running words
 Recordings and transcriptions are aligned only for a few recordings, and not word by word

3. Transcription example (Swedish, in MSO orthography)
$L: he:ej
$G: heej
$L: heej hur haru haft d{et} i
$G: jättebra ja{g} tycker inte att du är ett svart hål
$L: tycker [4 du inte d{et} va bra ]4
$V: [4 jo:e kolla lungan ]4
$G: ja{g} tycker att du e0 [5 blå å0 (...) ]5 jo d{et} kan du väl ändå [6 tycka tycker ja ]6
$L: [5 ja tycker inte att du e0 en röd stjärna ]5
$L: [6 nä ]6
$V: va{d} tycker [7 ni själva rå1 ]7
$L: [7 röd ja ]7 stjärna ja
$G: kan du ö{h} sluta avbryta [8 oss vi håller ]8
$V: [8 va{d} e0 ni va{d} tycke{r} ni själva att ni ]8 e0
$L: (...)
$G: va
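The GTS/MSO markup in the excerpt above is mechanical enough to parse: `$X:` marks the speaker, `{..}` completes MSO forms to standard orthography (`d{et}` becomes `det`), and numbered brackets `[n ... ]n` mark overlapping stretches. A minimal sketch of such a parser; the function name and the returned structure are mine, and several GTS devices (comments, pause notation) are ignored:

```python
import re

def parse_gts_line(line):
    """Split one GTS transcription line into speaker, plain text, and overlap ids.

    Simplified: handles only the $X: speaker marker, {..} MSO completions,
    and [n .. ]n numbered overlap brackets shown in the example slide.
    """
    m = re.match(r"\$(\w+):\s*(.*)", line)
    if not m:
        return None
    speaker, rest = m.group(1), m.group(2)
    overlaps = sorted(set(re.findall(r"\[(\d+)", rest)))
    # Expand MSO completions (d{et} -> det) and drop overlap brackets
    text = re.sub(r"\{(\w+)\}", r"\1", rest)
    text = re.sub(r"\[\d+\s*|\s*\]\d+", " ", text)
    return {"speaker": speaker, "text": " ".join(text.split()), "overlaps": overlaps}

print(parse_gts_line("$L: heej hur haru haft d{et} i"))
print(parse_gts_line("$V: [4 jo:e kolla lungan ]4"))
```

A tool like MultiTool needs exactly this kind of view to map transcription events onto the media timeline.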

4. MultiTool
 Prerelease 0.7
 Browsing, searching, coding, counting
  - Easy navigation through recordings
  - Search in transcription, partiture, media file, or time scale
 Only manual alignment so far
 Partial alignment of specific events would help a lot!

5. (no text on this slide)

6. What can speech technology do for MultiTool?
 A lot of research I didn't know about…
 Question: should we use the transcription or not?
  - Yes: automatic forced alignment on the word level
  - No: speech recognition + alignment
 If yes, find the times for:
  - Utterance start and end points
  - Non-speech annotations (coughing, whispering, clicks, loud speech, high pitch, glottalization, etc.) and silent sections
  - Easy-to-recognize speech sounds or words
 Find out whether two utterances are uttered by the same person
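Finding utterance start and end points and silent sections, as listed above, is classically done with a short-time energy threshold. A toy sketch (not the presentation's method; the threshold, frame length, and minimum duration are arbitrary example values):

```python
def find_utterances(frames, threshold, min_frames=3):
    """Return (start, end) frame-index pairs where the per-frame energy stays
    at or above `threshold` for at least `min_frames` consecutive frames.

    `frames` is a list of per-frame energies; a real system would compute
    these from the audio signal, e.g. over 10 ms windows.
    """
    spans, start = [], None
    for i, e in enumerate(frames):
        if e >= threshold and start is None:
            start = i                       # speech onset
        elif e < threshold and start is not None:
            if i - start >= min_frames:     # long enough to count as speech
                spans.append((start, i))
            start = None                    # short bursts (clicks) are dropped
    if start is not None and len(frames) - start >= min_frames:
        spans.append((start, len(frames)))
    return spans

# Toy energy contour: silence, speech, silence, a short click, speech
energies = [0.1, 0.2, 2.0, 2.5, 2.2, 0.1, 0.1, 3.0, 0.1, 1.9, 2.1, 2.4]
print(find_utterances(energies, threshold=1.0))  # [(2, 5), (9, 12)]
```

Note how the `min_frames` guard rejects the isolated high-energy frame at index 7, which is the kind of impulse noise the later slides worry about.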

7. Challenging task…
 Speech recognition and alignment work best with high-quality sound signals
 Recordings of spontaneous speech in natural situations have some unwanted properties:
  - Long distance between microphone and speaker
  - Many speakers in the same signal
  - Overlapping speech
  - Unlimited vocabulary
  - Whatever you call them: disfluencies, repairs, repetitions, deletions, fragmentary speech
  - Various background noise
 Will any of the existing methods work here?

8. Existing research
 The Production of Speech Corpora (Schiel et al.): fully automatic methods with usable results:
  - Segmentation into words, given a known vocabulary and not very spontaneous speech
  - Markup of prosodic features
  - Time alignment of phonemes (probabilistic pronunciation rules then give word alignment)

9. Research, cont.
 Sentence boundary tagging (Stolcke & Shriberg 1996)
  - Probabilities for boundaries between words
  - HMM + Viterbi
  - POS tags improve results
  - Requires good sound quality
  - Interesting, but sentences are not utterances
 Inter-word event tagging (Stolcke et al. 1998)
  - Events are disfluencies in general
  - Input is forced alignment + acoustic features
 Not directly usable, but a similar model and acoustic features may be useful for other events as well
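The HMM + Viterbi idea behind such boundary taggers can be sketched generically: each inter-word gap is a hidden state (boundary or not), and an observed feature such as pause duration is emitted by the state. The states, probabilities, and observation symbols below are invented toy values, not those of Stolcke & Shriberg's model:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state sequence for an observation sequence
    (plain-probability toy version; real systems work in log space)."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

# Toy model: tag each inter-word gap as boundary ("B") or no boundary ("N")
# from a quantized pause-duration feature ("short"/"long").
states = ["N", "B"]
start_p = {"N": 0.8, "B": 0.2}
trans_p = {"N": {"N": 0.7, "B": 0.3}, "B": {"N": 0.9, "B": 0.1}}
emit_p = {"N": {"short": 0.9, "long": 0.1}, "B": {"short": 0.2, "long": 0.8}}
print(viterbi(["short", "short", "long"], states, start_p, trans_p, emit_p))
```

The same machinery carries over from sentence boundaries to other inter-word events once suitable acoustic observations are defined, which is exactly the reuse the slide suggests.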

10. HMM-based segmentation and alignment
 Find the most probable alignment for a sequence of words
 Sjölander (2003) describes a very interesting system
  - Reports correct alignment for 85.5% of boundaries within 20 ms
  - Will it work on noisy signals? Even a result of, say, 5% would be very useful
  - I have tried to get hold of the system…
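The 85.5%-within-20-ms figure is a standard way to score an aligner: compare each hypothesized boundary time against the reference and count how many fall within the tolerance. A sketch of that metric (function name and example times are mine; boundaries are paired by index, which assumes both sequences contain the same boundaries):

```python
def boundary_accuracy(ref, hyp, tol=0.020):
    """Fraction of hypothesis boundaries within `tol` seconds of the
    corresponding reference boundary. Times are in seconds."""
    assert len(ref) == len(hyp), "sequences must contain the same boundaries"
    hits = sum(1 for r, h in zip(ref, hyp) if abs(r - h) <= tol)
    return hits / len(ref)

ref = [0.10, 0.42, 0.77, 1.05]   # hand-labelled boundary times
hyp = [0.11, 0.435, 0.83, 1.06]  # aligner output
print(boundary_accuracy(ref, hyp))  # 3 of 4 within 20 ms -> 0.75
```

Running the same metric on noisy GSLC-style recordings would answer the question the slide raises about how far the reported accuracy degrades.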

11. Related tasks
 Intensity discrimination
  - Easy to measure
  - Useful as an indicator of phoneme changes, etc.
 Voicing determination and fundamental frequency
  - Many methods: cepstrum, probabilities based on weighted features
  - Voicing patterns could give good hints about when specific words occur
 Glottalization and impulse detection
  - Intensity and a sudden f0 decrease could be used
  - Glottalization is marked in the transcription!
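Both measures named above are cheap to compute per frame. A minimal sketch using short-time energy for intensity and autocorrelation for f0 (autocorrelation is one of the "many methods"; the cepstral variant mentioned on the slide is analogous but works on the log spectrum; frame length, sample rate, and f0 search range are arbitrary example values):

```python
import math

def frame_intensity_db(frame):
    """Short-time intensity of one frame in dB relative to full scale."""
    energy = sum(x * x for x in frame) / len(frame)
    return 10 * math.log10(energy + 1e-12)  # epsilon guards against log(0)

def autocorr_f0(frame, sample_rate, fmin=60, fmax=400):
    """Crude f0 estimate: lag of the normalized autocorrelation peak in
    [fmin, fmax] Hz. Real trackers add an explicit voicing decision."""
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    def score(lag):
        s = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        return s / (len(frame) - lag)  # normalize so long lags aren't penalized
    best_lag = max(range(lo, hi + 1), key=score)
    return sample_rate / best_lag

sr = 8000
frame = [math.sin(2 * math.pi * 100 * n / sr) for n in range(400)]  # 100 Hz tone
print(frame_intensity_db(frame))
print(autocorr_f0(frame, sr))
```

A sudden drop in the f0 track combined with an intensity dip is the cue the slide proposes for glottalization, and the corpus annotation gives labelled examples to test it against.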

12. Robust alignment
 How could the algorithm used by Sjölander be revised for more robustness?
 f0 (voicing) and glottalization detection + ordinary phoneme probabilities could help
 Problem: the speech models will not give probabilities for phonemes in simultaneous speech
 Problem #2: GSLC does not contain phonetic transcriptions
  - Would training on letters work?
  - My guess: this will not work well enough
 A better approach: identify things that can be recognized, since word-by-word alignment is not necessary

13. Conclusion
 First thing to try: Sjölander's aligner
 Second: a spoken event tagger
  - Identify events that could be recognized
  - Identify useful acoustic features
  - Could, for example, a decision tree help to recognize the events?
 Lots of tests and experiments will be needed if forced alignment doesn't give usable results

14. The End! Thank you for listening. Questions?