Juicer: A weighted finite-state transducer speech decoder

Juicer: A weighted finite-state transducer speech decoder
D. Moore¹, J. Dines¹, M. Magimai Doss¹, J. Vepa¹, O. Cheng¹ and T. Hain²
¹ IDIAP Research Institute
² Department of Computer Science, University of Sheffield

Overview
- The speech decoding problem
- Why develop another decoder?
- WFST theory and practice
- What is Juicer?
- Benchmarking experiments
- The future of Juicer

The speech decoding problem
Given a recording and models of speech & language, generate a text transcription of what was said.
[Block diagram: the recording and the models feed a Decoder, which outputs the transcription "She had your dark suit…".]
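Formally (the standard textbook formulation of the task, added here for context rather than taken from the slides), the decoder searches for the most probable word sequence given the acoustics:

    \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)

where X is the sequence of acoustic feature vectors, P(X | W) is supplied by the acoustic, lexical and phonetic models, and P(W) by the language model.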

The speech decoding problem
[Two image-only slides, each captioned "Or…"; the illustrations are not reproduced in this transcript.]

The speech decoding problem
ASR system building blocks:
- Grammar: N-gram language model
- Lexical knowledge: pronunciation dictionary
- Phonetic knowledge: context dependency, phonological rules
- Acoustic knowledge: state distributions
Naive combination of these knowledge sources leads to a large, inefficient representation of the search space.

The speech decoding problem
The main issue in decoding is carrying out an efficient search of the space defined by the knowledge sources. Two ways we can do this:
- Avoid performing redundant search (see the recombination sketch below)
- Don't pursue unpromising hypotheses
An additional issue: flexibility of the decoder.
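The first strategy, avoiding redundant search, is what Viterbi recombination gives you: when two partial hypotheses reach the same network state, only the better-scoring one can lie on the best complete path, so the other is discarded on the spot. A minimal illustrative sketch in Python (not Juicer code; scores are negative log probabilities, so lower is better):

    def recombine(paths):
        # Viterbi recombination: of all partial paths ending in the same
        # state, keep only the best-scoring one. `paths` is a list of
        # (state, score, history) tuples with scores as negative log probs.
        best = {}  # state -> (score, history)
        for state, score, history in paths:
            if state not in best or score < best[state][0]:
                best[state] = (score, history)
        return best

    paths = [("s3", 12.0, ["a", "b"]), ("s3", 10.5, ["a", "c"]), ("s7", 11.0, ["d"])]
    print(recombine(paths))  # {'s3': (10.5, ['a', 'c']), 's7': (11.0, ['d'])}

The second strategy, discarding unpromising hypotheses, is pruning; a sketch appears with the decoder description further on.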

Why develop another decoder?
- Need for a state-of-the-art speech decoder that is also suitable for on-going research
- At present, such software is not freely available to the research community
- Open-source development and distribution framework

WFST theory and practice
- A WFST maps sequences of input symbols to sequences of output symbols
- Each input:output transition pair has an associated weight
- In the example (a four-arc transducer; the figure is not reproduced here): input sequence I = {a b c d} maps to output sequence O = {X Y Z W}, with the path weight a function of all transition weights associated with that path, f(0.1, 0.2, 0.5, 0.1)
A small worked sketch follows.
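To make the path-weight idea concrete, here is a minimal sketch in Python (illustrative only, not Juicer's data structures; the tropical semiring is assumed, where weights are negative log probabilities and f is summation):

    # Arcs: state -> list of (input label, output label, weight, next state).
    arcs = {
        0: [("a", "X", 0.1, 1)],
        1: [("b", "Y", 0.2, 2)],
        2: [("c", "Z", 0.5, 3)],
        3: [("d", "W", 0.1, 4)],
    }
    final_states = {4}

    def transduce(start, inputs):
        # Follow the (here deterministic) arcs for `inputs`;
        # return the output sequence and accumulated path weight.
        state, outputs, weight = start, [], 0.0
        for symbol in inputs:
            for ilabel, olabel, w, nxt in arcs.get(state, []):
                if ilabel == symbol:
                    outputs.append(olabel)
                    weight += w  # tropical semiring: f(...) = sum of arc weights
                    state = nxt
                    break
            else:
                return None  # no matching arc: input sequence not accepted
        return (outputs, weight) if state in final_states else None

    print(transduce(0, ["a", "b", "c", "d"]))  # (['X', 'Y', 'Z', 'W'], 0.9)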

WFST theory and practice
WFST operations:
- Composition: combination of transducers
- Determinisation: at most one transition per input label leaving any state
- Minimisation: the least number of states and transitions, with weight pushing applied to aid minimisation
A sketch of composition follows.
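As a sketch of the first of these operations, here is epsilon-free composition in Python (an assumed simplification: real FSM toolkits also handle epsilon transitions and final weights, and typically compose lazily):

    from collections import deque

    def compose(A, B):
        # Epsilon-free WFST composition: states of A∘B are pairs (a, b).
        # A and B map state -> list of (ilabel, olabel, weight, next state).
        # An arc x:y/w1 in A matches an arc y:z/w2 in B and yields
        # x:z/(w1 + w2) in the result (tropical semiring assumed).
        start = (0, 0)
        result, queue, seen = {}, deque([start]), {start}
        while queue:
            a, b = queue.popleft()
            result[(a, b)] = []
            for ia, oa, wa, na in A.get(a, []):
                for ib, ob, wb, nb in B.get(b, []):
                    if oa == ib:  # output of A feeds input of B
                        nxt = (na, nb)
                        result[(a, b)].append((ia, ob, wa + wb, nxt))
                        if nxt not in seen:
                            seen.add(nxt)
                            queue.append(nxt)
        return result

    A = {0: [("a", "x", 0.5, 1)]}
    B = {0: [("x", "A", 1.0, 1)]}
    print(compose(A, B))  # {(0, 0): [('a', 'A', 1.5, (1, 1))], (1, 1): []}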

WFST theory and practice Composition

WFST theory and practice Determinisation

WFST theory and practice Weight pushing & minimisation

WFST theory and practice
WFST and speech decoding: the ASR system building blocks
- Grammar
- Lexical knowledge
- Phonetic knowledge
- Acoustic knowledge
Each of these knowledge sources has a WFST representation.

WFST theory and practice
WFST and speech decoding requires some special considerations:
- The lexicon and grammar composition cannot be determinised as-is, and nor can the context dependency transducer; in the standard construction, auxiliary disambiguation symbols are added to make determinisation possible
- The recognition network is then built as a cascade, e.g. N = C ∘ det(L ∘ G), where G, L and C are the WFSTs for the grammar, lexicon and context dependency

WFST theory and practice
WFST and speech decoding
Pros:
- Flexibility
- Simple decoder architecture
- Optimised search space
Cons:
- Transducer size
- Knowledge sources are fixed at composition time
- Only knowledge sources expressible as WFSTs can be used

What is Juicer?
- A time-synchronous Viterbi decoder
- Tools for WFST construction
- An interface to 3rd-party FSM tools

What is Juicer?
Decoder:
- Pruning: beam search, histogram pruning
- 1-best output: word and model timing information
- Lattice generation: phone-level lattice output
- The state-to-phone transducer is not optimised; it is incorporated at run time
A sketch of the two pruning schemes follows.
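A minimal sketch of beam and histogram pruning (illustrative Python, not Juicer's implementation; `hyps` maps active states to path scores, stored as negative log probabilities so lower is better):

    import heapq

    def prune(hyps, beam, max_hyps):
        # Beam pruning: drop hypotheses scoring worse than best + beam.
        best = min(hyps.values())
        survivors = {s: v for s, v in hyps.items() if v <= best + beam}
        # Histogram pruning: cap the number of surviving hypotheses.
        if len(survivors) > max_hyps:
            kept = heapq.nsmallest(max_hyps, survivors.items(), key=lambda kv: kv[1])
            survivors = dict(kept)
        return survivors

    active = {"s1": 10.0, "s2": 11.5, "s3": 25.0, "s4": 10.2}
    print(prune(active, beam=5.0, max_hyps=2))  # {'s1': 10.0, 's4': 10.2}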

What is Juicer?
WFST tools:
- gramgen: word-loop, word-pair and N-gram language models
- lexgen: multiple pronunciations
- cdgen: monophone, word-internal n-phone and cross-word triphone models, with HTK CDHMM and hybrid HMM/ANN model support
- build-wfst: composition, determinisation and minimisation using 3rd-party tools (AT&T, MIT)
A sketch of what a lexicon-building step produces follows.
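To illustrate the kind of structure a lexicon-building tool like lexgen produces (a hypothetical sketch of the general technique, not lexgen's actual code, options or file formats):

    def build_lexicon_wfst(lexicon):
        # Build lexicon transducer arcs from {word: [pronunciations]}.
        # Each pronunciation becomes a path from state 0 back to state 0:
        # the first phone arc emits the word, later arcs emit <eps>, so
        # composing with a grammar maps phone sequences to word sequences.
        arcs = []  # (source state, dest state, input phone, output word)
        next_state = 1
        for word, prons in lexicon.items():
            for phones in prons:  # multiple pronunciations: parallel paths
                src = 0
                for i, phone in enumerate(phones):
                    last = i == len(phones) - 1
                    dst = 0 if last else next_state
                    arcs.append((src, dst, phone, word if i == 0 else "<eps>"))
                    if not last:
                        next_state += 1
                        src = dst
        return arcs

    lex = {"data": [["d", "ey", "t", "ax"], ["d", "ae", "t", "ax"]]}
    for arc in build_lexicon_wfst(lex):
        print(arc)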

Benchmarking experiments
Experiments were conducted in order to:
- Compare with existing state-of-the-art decoders
- Assess the current capabilities and limitations of the decoder
- Guide future development and research directions

Benchmarking experiments
20k Wall Street Journal task:
- Equivalent performance at wide beam settings
- HDecode wins out at narrow beam widths
- Only part of the story…

Benchmarking experiments
…but what's the catch? Composition of large static networks:
- is practically infeasible due to memory limitations
- is slow
- and may not always be necessary

Word error rates (%):

    System       TOT   Sub   Del   Ins
    P1.HDecode   41.1  21.1  14.7  5.3
    P1.Juicer    43.5  23.0  13.7  7.8
    P2.HDecode   33.1  15.9  13.4  3.9
    P2.Juicer    34.5  16.9  13.6  4.0

Network construction (arc counts and build times; blank cells were blank on the slide):

    Language    # of arcs                            FSM          # of arcs                   Time
    model       G           L        C               tool         L∘G          C∘L∘G          required
    Pruned-07   4,145,199   127,048  1,065,766       AT&T + MIT   7,008,333    14,945,731     30 mins
    Pruned-08   13,692,081                           MIT          23,160,795   50,654,758     1:44
    Pruned-09   35,895,383                                        59,626,339   120,060,629    5:38
    Unpruned    98,288,579                                        DNF                         10:33+

Benchmarking experiments
AMI Meeting Room recogniser:
- Decoding for the NIST Rich Transcription evaluations
- Juicer uses pruned LMs
- Good trade-off between RTF and WER (the plot marking the chosen operating point is not reproduced here)

The future of Juicer
Further benchmarking:
- Testing against HDecode
- Trade-off between pruned LMs and performance
Added capabilities:
- 'On the fly' network expansion
- Word lattice generation
- Support for MLLR transforms, feature transforms
Distribution and support:
- Currently only available to AMI and IM2 partners

Summary
I have presented today…
- WFST theory and practice
- The Juicer tools and decoder
- Preliminary experiments
*** but more importantly ***
We hope to have generated interest in Juicer
Questions?