A Low-Power Low-Memory Real-Time ASR System

Outline
Overview of Automatic Speech Recognition (ASR) systems
Sub-vector clustering and parameter quantization
Custom arithmetic back-end
Power simulation

ASR System Organization
Goal: given a speech signal, find the most likely corresponding sequence of words
Front-end
–Transforms the signal into a set of feature vectors
Back-end
–Given the feature vectors, finds the most likely word sequence
–Accounts for 90% of the computation
Model parameters
–Learned offline
Dictionary
–Customizable
–Requires embedded HMMs
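In the standard formulation of this search (a textbook statement, not specific to this system), the back-end computes

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\,P(W),$$

where $X$ is the sequence of feature vectors produced by the front-end, $P(X \mid W)$ is the acoustic likelihood given by the embedded HMMs, and $P(W)$ is the prior over word sequences supplied by the dictionary/language model.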

Embedded HMM Decoding Mechanism
(Diagram: word-level HMMs for "open the window" decoded over speech frames 1 … T.)

ASR on Portable Devices
Problem
–Energy consumption is a major problem for mobile speech recognition applications
–Memory usage is a main component of energy consumption
Goal
–Minimize power consumption and memory requirement while maintaining a high recognition rate
Approach
–Sub-vector clustering and parameter quantization
–Customized architecture

Outline
Overview of speech recognition
Sub-vector clustering and parameter quantization
Custom arithmetic back-end
Power simulation

Sub-vector Clustering
Given a set of input vectors, sub-vector clustering involves two steps:
1) Sub-vector selection: find the best disjoint partition of each vector into M sub-vectors
2) Quantization: find the best representative sub-vectors (stored in codebooks)
Special cases
–Vector quantization: no partition of the vectors (M=1)
–Scalar quantization: size of each sub-vector is 1
Two methods of quantization
–Disjoint: a separate codebook for each partition
–Joint: shared codebooks for same-size sub-vectors
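To make the two steps concrete, here is a minimal sketch (my own illustration, not the authors' code) of disjoint sub-vector quantization: the partition is assumed fixed (in the system it comes from the selection heuristics on the following slides), and k-means builds one codebook per sub-vector.

```python
import numpy as np
from sklearn.cluster import KMeans

def subvector_quantize(X, partition, codebook_bits=4):
    """Quantize rows of X using one codebook per sub-vector.

    X:         (N, D) array of parameter vectors (e.g. Gaussian means).
    partition: list of index lists, a disjoint partition of the D dimensions
               (the sub-vector selection step; fixed here for illustration).
    Returns per-sub-vector codebooks and the codeword index stored per vector.
    """
    codebooks, codes = [], []
    for dims in partition:
        sub = X[:, dims]                                   # extract one sub-vector
        km = KMeans(n_clusters=2 ** codebook_bits, n_init=10).fit(sub)
        codebooks.append(km.cluster_centers_)              # the stored codebook
        codes.append(km.predict(sub))                      # index kept per vector
    return codebooks, codes

# Hypothetical example: 39-dim vectors split into 13 sub-vectors of size 3.
X = np.random.randn(1000, 39)
partition = [list(range(i, i + 3)) for i in range(0, 39, 3)]
codebooks, codes = subvector_quantize(X, partition, codebook_bits=4)
```

With this layout, each original vector is stored as 13 small indices plus the shared codebooks, which is where the memory savings come from.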

Why Sub-vector Clustering?
Vector quantization
–Theoretically best
–In practice requires a large amount of data
Scalar quantization
–Requires less data
–Ignores correlation between vector elements
Sub-vector quantization
–Exploits dependencies and avoids data scarcity problems

Algorithms for Sub-vector Selection
An exhaustive search is exponential, so we use several heuristics.
Common feature of these algorithms: the use of entropy or mutual information as a measure of correlation
Key idea: choose clusters that maximize intra-cluster dependencies while minimizing inter-cluster dependencies

Algorithms
Pairwise MI-based greedy clustering
–Rank vector-component pairs by MI and choose the combination of pairs that maximizes overall MI
Linear entropy minimization
–Choose clusters whose linear entropy, normalized by the size of the cluster, is the lowest
Maximum clique quantization
–Based on MI graph connectivity
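A rough sketch of the pairwise MI-based idea (my own rendering, not the exact algorithm): estimate mutual information between every pair of vector components after discretization, then greedily pair the most dependent components.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def pairwise_mi_pairs(X, n_bins=16):
    """Greedily pair vector components by estimated mutual information."""
    N, D = X.shape
    # Discretize each component so MI can be estimated from histograms.
    binned = np.stack(
        [np.digitize(X[:, d], np.histogram_bin_edges(X[:, d], n_bins)) for d in range(D)],
        axis=1)
    mi = {(i, j): mutual_info_score(binned[:, i], binned[:, j])
          for i, j in combinations(range(D), 2)}
    pairs, used = [], set()
    # Take the highest-MI pair among still-unused components, repeatedly.
    for (i, j), _ in sorted(mi.items(), key=lambda kv: -kv[1]):
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs   # any leftover component (if D is odd) stays a scalar
```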

Experiments and Results
Quantized parameters: means and variances of Gaussian distributions
Database: PHONEBOOK, a collection of words spoken over the telephone
Baseline word error rate (WER): 2.42%
Memory savings: ~85% reduction (from 400 KB to 50 KB)
Best schemes:
–Normalized joint scalar quantization and disjoint scalar quantization
–Schemes such as entropy minimization and the greedy algorithm did well in terms of error rate, but at the cost of higher memory usage

Outline
Overview of speech recognition
Sub-vector clustering and parameter quantization
Custom arithmetic back-end
Power simulation

Custom Arithmetic
IEEE floating point
–Pros: precise data representation and arithmetic operations
–Cons: expensive computation and high bandwidth
Fixed-point DSP
–Pros: relatively efficient computation and low bandwidth
–Cons: loss of information, potential overflows; still not efficient in operation and bandwidth use
Custom arithmetic via table look-ups
–Pros: compact representation with varied bit-widths; fast computation
–Cons: loss of information due to quantization; overhead storage for tables; complex design procedure

General Structure
Idea: replace all two-operand floating-point operations with customized arithmetic via ROM look-ups (example below)
Procedure:
–Codebook design: each codebook corresponds to a variable in the system; the bit-width depends on how precisely the variable has to be represented
–Table design: each table corresponds to a two-operand function; the table size depends on the bit-widths of the indices and the entries
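The slide's example is a figure, so here is a hedged stand-in: a minimal Python sketch (the codebook sizes and the choice of a multiply are my own assumptions) of turning one two-operand floating-point operation into a single ROM read indexed by quantized operands.

```python
import numpy as np

# Hypothetical per-variable codebooks; each has 2**bit_width entries.
cb_a = np.linspace(-4.0, 4.0, 2 ** 6)   # 6-bit codebook for operand a
cb_b = np.linspace(0.01, 1.0, 2 ** 5)   # 5-bit codebook for operand b
cb_y = np.linspace(-4.0, 4.0, 2 ** 6)   # 6-bit codebook for the result y

def encode(x, codebook):
    """Index of the nearest codeword; only this index is stored and moved."""
    return int(np.argmin(np.abs(codebook - x)))

# Offline: precompute the ROM for y = a * b over every index pair.
ROM_MUL = np.array([[encode(a * b, cb_y) for b in cb_b] for a in cb_a],
                   dtype=np.uint8)

# Online: the floating-point multiply becomes a single table read.
ia, ib = encode(1.7, cb_a), encode(0.25, cb_b)
iy = ROM_MUL[ia, ib]          # look-up replaces the multiply
approx = cb_y[iy]             # decode only if a real-valued result is needed
```

The table size grows with the product of the operand bit-widths, which is why the bit-width allocation discussed later matters so much.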

Custom Arithmetic Design for the Likelihood Evaluation
Issue: bounded accumulative variables
–Accumulating iteratively over a fixed number of iterations: $Y_{t+1} = Y_t + X_{t+1}$, $t = 0, 1, \ldots, D$
–Large dynamic range, possibly too large for a single codebook
Solution: binary tree with one codebook per level
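A rough sketch of that idea under my own simplifying assumptions (uniform rounding stands in for each level's codebook, whereas the real design would use trained codebooks): summing the D+1 terms pairwise in a balanced tree and re-quantizing each partial sum with its level's codebook keeps every level's dynamic range bounded.

```python
import numpy as np

def quantize(value, step):
    """Stand-in for a codebook look-up: snap the value to a grid of width `step`."""
    return step * round(value / step)

def tree_accumulate(terms, base_step=0.05):
    """Sum `terms` pairwise, re-quantizing each level with its own 'codebook'."""
    level, step = list(terms), base_step
    while len(level) > 1:
        nxt = [quantize(level[i] + level[i + 1] if i + 1 < len(level) else level[i], step)
               for i in range(0, len(level), 2)]
        level, step = nxt, step * 2   # each level's codebook covers roughly twice the range
    return level[0]

scores = np.random.randn(40)          # e.g. per-dimension log-likelihood terms
print(tree_accumulate(scores), scores.sum())
```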

Custom Arithmetic Design for the Viterbi Search
Issue: unbounded accumulative variables
–Arbitrarily long utterances; unbounded number of recursions
–Unpredictable dynamic range, bad for codebook design
Solution: normalized forward probability
–Dynamic programming still applies
–No degradation in performance
–A bounded dynamic range makes quantization possible
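In standard HMM notation (a textbook form, not copied from the slides), the normalization referred to here keeps

$$\tilde{\alpha}_t(j) \;=\; \frac{\sum_i \tilde{\alpha}_{t-1}(i)\,a_{ij}\,b_j(x_t)}{\sum_k \sum_i \tilde{\alpha}_{t-1}(i)\,a_{ik}\,b_k(x_t)},$$

so that $\sum_j \tilde{\alpha}_t(j) = 1$ at every frame. The quantity stays in $[0,1]$ regardless of utterance length, which is what makes a fixed codebook for it feasible.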

Optimization on Bit-Width Allocation
Goal:
–Find the bit-width allocation scheme (bw_1, bw_2, bw_3, …, bw_L) that minimizes the cost of resources while maintaining the baseline performance
Approach: greedy algorithms (a sketch follows)
–Optimal search: intractable
–Heuristic: initialize (bw_1, bw_2, bw_3, …, bw_L) according to the single-variable quantization results, then increase the bit-width of the variable that gives the best improvement with respect to both performance and cost, until the performance is as good as the baseline
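A minimal sketch of that greedy loop (my own rendering; `evaluate` and `cost` are hypothetical callbacks standing for a recognition-accuracy run and a table-resource estimate):

```python
def greedy_bitwidths(init_bw, evaluate, cost, baseline, max_bits=16):
    """Increase one bit-width at a time until baseline accuracy is restored.

    init_bw:  dict variable -> starting bit-width (from single-variable results)
    evaluate: bw_dict -> recognition accuracy (hypothetical dev-set evaluation)
    cost:     bw_dict -> resource cost (e.g. total table bits)
    baseline: accuracy of the unquantized system
    """
    bw = dict(init_bw)
    while evaluate(bw) < baseline:
        best_var, best_gain = None, float("-inf")
        for var in bw:
            if bw[var] >= max_bits:
                continue
            trial = dict(bw, **{var: bw[var] + 1})
            # Gradient-style score: accuracy gained per unit of added cost.
            gain = (evaluate(trial) - evaluate(bw)) / max(cost(trial) - cost(bw), 1e-9)
            if gain > best_gain:
                best_var, best_gain = var, gain
        if best_var is None:      # nothing left to widen
            break
        bw[best_var] += 1
    return bw
```

The three variants on the next slide differ mainly in whether the gradient is recomputed each step and whether one or two bit-widths are incremented at a time.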

Three Greedy Algorithms
Evaluation method: gradient
Algorithms
–Single-dimensional increment based on static gradient
–Single-dimensional increment based on dynamic gradient
–Pair-wise increment based on dynamic gradient

Results on Single-Variable Quantization

Results
Likelihood evaluation:
–Replaced the floating-point processor with only 30 KB of table storage, while maintaining the baseline recognition rate
–Reduced the offline storage for model parameters from 400 KB to 90 KB
–Reduced the memory requirement for online recognition by 80%
Viterbi search:
–Currently we can quantize the forward probability with 9 bits; can we hit 8 bits?

Outline
Overview of speech recognition
Sub-vector clustering and parameter quantization
Custom arithmetic back-end
Power simulation

Simulation Environment
SimpleScalar: cycle-level performance simulator
–5-stage pipeline
–PISA instruction set (superset of MIPS-IV)
–Execution-driven
–Detailed statistics
Wattch: parameterizable power model
–Scalable across processes
–Accurate to within 10% versus lower-level models
(Diagram: the binary program and hardware configuration are fed to the simulators, which produce hardware access statistics and, from them, performance and power estimates.)

Our New Simulator
ISA extended to support table look-ups
Instructions have three operands, but a quantization look-up needs four values:
–Two inputs
–Output
–Table to use
Two options proposed (sketched below):
–One-step look-up: a different instruction for each table
–Two-step look-up: first set the active table, which is used by all quantizations until it is reset; then perform the look-up
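To make the two options concrete, here is a behavioral sketch in Python (the names `settab` and `qlut` are hypothetical; the real extension would be new opcodes added to the PISA simulator, not Python methods):

```python
class LookupUnit:
    """Behavioral model of the proposed table look-up ISA extension."""

    def __init__(self, tables):
        self.tables = tables     # table_id -> 2-D array of result codes (the ROMs)
        self.active = None       # architectural state written by settab

    def settab(self, table_id):
        """Two-step option, step 1: select the active table (reused until reset)."""
        self.active = table_id

    def qlut(self, ia, ib):
        """Two-step option, step 2: three-operand look-up (two inputs -> output)."""
        return self.tables[self.active][ia][ib]

    def qlut_one_step(self, table_id, ia, ib):
        """One-step option: the table id is baked into the opcode, so each table
        needs its own instruction (shown here as an extra argument)."""
        return self.tables[table_id][ia][ib]

# Hypothetical usage: a quantized 'add' table is selected once, then reused.
unit = LookupUnit({"add_q": [[0, 1], [1, 2]]})
unit.settab("add_q")
print(unit.qlut(1, 0))   # -> 1
```

The trade-off mirrors the slide: the one-step form consumes opcode space per table, while the two-step form adds a piece of architectural state but keeps the instruction count small.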

Future Work
Immediate future
–Meet with architecture groups to discuss relevant implementation details
–Determine power parameters for look-up tables
Next steps
–Generate power consumption data
–Work with other groups for the final implementation