An Ultra Low-Power Hardware Accelerator for Automatic Speech Recognition
Reza Yazdani, Albert Segura, José-María Arnau, Antonio González
Automatic Speech Recognition (ASR)
ASR Requirements
- Voice-based user interfaces for mobile devices
- Large vocabulary
- Speaker-independent
- High accuracy
- Real-time performance
- Energy efficiency
ASR Solutions
- General-purpose platforms
Outline
- Motivation
- Automatic Speech Recognition
- Accelerated ASR System
- Memory Subsystem Optimizations
  - Prefetcher
  - Bandwidth Reduction
- Experimental Results
- Conclusions
Automatic Speech Recognition
- State-of-the-art ASR system
- Hybrid model: DNN + HMM
[Figure: Sound Signal -> Feature Extraction -> Likelihood Computation -> Graph Search -> Speech (words); additional label: GPU]
Graph Search
[Figure: the Acoustic model, Language model, and Dictionary go through Training and the Graph Generator to produce a Weighted Finite-State Transducer (WFST), which is used by the Viterbi Search]
Viterbi Search
- A simple example of a WFST for detecting two words: "three" and "two"
Viterbi Search
[Figure: Viterbi search over the example WFST from Frame 0 to Frame 3; path scores shown include 1.0, 0.54, 0.46, 0.3, 0.21, 0.0018, 0.0015, 0.0012, and 0.0009; low-scoring paths are marked "Pruning!"; output word: THREE]
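The frame-by-frame search with pruning shown on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy graph `ARCS`, the acoustic scores in `FRAMES`, the `beam` threshold, and the function name `viterbi_beam` are all assumptions made up for the example, and plain probabilities are used instead of log scores for readability.

```python
# Hypothetical toy WFST for the words "three" and "two":
# state -> list of (next_state, phoneme, arc_probability, output_word)
ARCS = {
    0: [(1, "th", 0.6, None), (4, "t", 0.4, None)],
    1: [(2, "r", 1.0, None)],
    2: [(3, "iy", 1.0, "THREE")],
    4: [(5, "uw", 1.0, "TWO")],
    3: [],
    5: [],
}

# Hypothetical acoustic scores: one dict (phoneme -> likelihood) per frame.
FRAMES = [
    {"th": 0.9, "t": 0.1},
    {"r": 0.8, "uw": 0.2},
    {"iy": 0.9},
]


def viterbi_beam(arcs, frames, start=0, beam=0.1):
    """Return (score, words) of the best surviving path, or None if all paths die."""
    tokens = {start: (1.0, ())}  # active state -> (path score, words so far)
    for acoustic in frames:
        new_tokens = {}
        for state, (score, words) in tokens.items():
            for dst, phoneme, arc_prob, word in arcs.get(state, []):
                s = score * arc_prob * acoustic.get(phoneme, 0.0)
                if s <= 0.0:
                    continue
                # Viterbi approximation: keep only the best path into each state.
                if dst not in new_tokens or s > new_tokens[dst][0]:
                    new_tokens[dst] = (s, words + ((word,) if word else ()))
        if not new_tokens:
            return None
        # Beam pruning: drop tokens far below the best score of this frame.
        best = max(s for s, _ in new_tokens.values())
        tokens = {st: tok for st, tok in new_tokens.items() if tok[0] >= best * beam}
    return max(tokens.values(), key=lambda tok: tok[0])


score, words = viterbi_beam(ARCS, FRAMES)
print(words, round(score, 4))
```

With these made-up scores, the path through "t" falls below the beam after the first frame and is discarded, which is the kind of pruning the slide highlights.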
Accelerated ASR System
Accelerator's Architecture
- Average active states on each frame evaluation: less than 1%
- Solution: hash table mapping State ID to Token Info
[Figure labels: Viterbi Accelerator; Main Memory; WFST; Dynamic Search Graph; Acoustic Scores; tokens for Frame i and Frame i+1 with words w1 to w4; State Index / Token table; example acoustic scores at frame t for the phonemes th, uw, r, iy]
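Because fewer than 1% of the states are active in a frame, a table indexed directly by state would be almost empty, which is why the slide proposes a hash table keyed by State ID. The sketch below is only an illustration of that idea: the class name `TokenTable`, the slot count, the modulo hash, and the linear-probing collision policy are assumptions, not details taken from the slides.

```python
class TokenTable:
    """Fixed-size hash table mapping an active state ID to its token info."""

    def __init__(self, num_slots=8):
        # Each slot is None or a tuple (state_id, score, backpointer).
        self.slots = [None] * num_slots
        self.count = 0

    def _probe(self, state_id):
        # Assumed policy: modulo hash with linear probing on collisions.
        n = len(self.slots)
        idx = state_id % n
        for _ in range(n):
            entry = self.slots[idx]
            if entry is None or entry[0] == state_id:
                return idx
            idx = (idx + 1) % n
        raise RuntimeError("token table full")

    def update(self, state_id, score, backpointer):
        """Insert a token, or replace the stored one if the new score is higher."""
        idx = self._probe(state_id)
        entry = self.slots[idx]
        if entry is None:
            self.slots[idx] = (state_id, score, backpointer)
            self.count += 1
            return True
        if score > entry[1]:
            self.slots[idx] = (state_id, score, backpointer)
            return True
        return False

    def get(self, state_id):
        """Return (score, backpointer) for an active state, or None."""
        entry = self.slots[self._probe(state_id)]
        return None if entry is None else entry[1:]


table = TokenTable(num_slots=4)
table.update(6, 0.25, 2)   # new token for state 6
table.update(6, 0.12, 1)   # worse path into state 6, ignored
table.update(10, 0.7, 1)   # collides with state 6, stored in the next slot
print(table.get(6), table.get(10), table.count)
```

The table only needs to be sized for the number of active tokens per frame rather than for every state in the WFST.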
Potential Improvement
- Perfect caches and hash tables
- Speedups with respect to the baseline architecture
- 94.6% improvement
- Large memory footprint (34 million arcs)
Hardware Prefetching
- Dynamic access to a small, sparsely distributed subset of arcs
  - On average: 25K out of 34M arcs
- Conventional prefetchers are inefficient
  - Graph search exhibits an unpredictable access pattern
  - Pruning unlikely paths causes more unpredictability
- Our proposed scheme is based on decoupled access-execute
  - All memory addresses are deterministic after pruning
  - Memory requests are issued well in advance
  - High accuracy: addresses are computed rather than predicted
  - Timeliness: a reorder buffer avoids early evictions
- 94% speedup with a negligible area overhead of 0.05%
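The benefit of issuing computed addresses ahead of their use can be illustrated with a toy timing model. Everything here is an assumption made for the sketch: the 16-byte arc size, the `first_arc` and `num_arcs` tables, the one-request-per-cycle issue rate, the fixed memory latency, and the in-order buffer. It is not the accelerator's actual prefetcher design.

```python
from collections import deque

ARC_BYTES = 16  # assumed size of one arc record in memory


def arc_addresses(active_states, first_arc, num_arcs, base=0):
    """Compute (not predict) the address of every arc of the surviving states."""
    for s in active_states:
        for k in range(num_arcs[s]):
            yield base + (first_arc[s] + k) * ARC_BYTES


def simulate(addresses, mem_latency=100, buffer_entries=1):
    """Cycles needed to fetch and consume all arcs in order.

    buffer_entries is the number of requests that may be in flight at once;
    1 models a stage that waits for each arc before requesting the next.
    """
    addresses = list(addresses)
    ready_at = deque()  # completion cycle of each in-flight request, oldest first
    issued = consumed = cycle = 0
    while consumed < len(addresses):
        # Access stage: issue one request per cycle while a buffer entry is free.
        if issued < len(addresses) and len(ready_at) < buffer_entries:
            ready_at.append(cycle + mem_latency)
            issued += 1
        # Execute stage: consume the oldest arc once its data has arrived.
        if ready_at and ready_at[0] <= cycle:
            ready_at.popleft()
            consumed += 1
        cycle += 1
    return cycle


addrs = list(arc_addresses([2, 5], first_arc={2: 10, 5: 40}, num_arcs={2: 2, 5: 1}))
print(addrs)
print(simulate(addrs, mem_latency=10, buffer_entries=1),
      simulate(addrs, mem_latency=10, buffer_entries=16))
```

In this model, a larger buffer lets the memory latency of consecutive arcs overlap, so the total time approaches one latency plus one cycle per arc instead of one full latency per arc.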
Bandwidth Reduction
- 97% of dynamically expanded states have fewer than 16 arcs
- A novel technique for directly computing arc addresses
  - Changes the memory layout of the WFST dataset
  - Avoids the memory access for fetching a state's data
- 20% memory bandwidth saving at a negligible cost of 0.02% area increase
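One way a changed memory layout can make arc addresses directly computable is to renumber the states so that states with the same number of arcs are contiguous; a small table with one row per arc count then replaces the per-state lookup. The sketch below illustrates that general idea under this assumption. The slide does not spell out the layout, so the functions `build_layout` and `arc_range` and the table format are hypothetical, and for brevity the sketch covers every arc count rather than only the small ones.

```python
import bisect


def build_layout(arc_counts):
    """Renumber states by arc count and build a per-count lookup table.

    arc_counts[s] is the number of arcs of original state s.
    Returns (new_id, groups), where each group row is
    (first_state_id, arc_count, first_arc_index).
    """
    order = sorted(range(len(arc_counts)), key=lambda s: arc_counts[s])
    new_id = {old: new for new, old in enumerate(order)}
    groups = []
    next_arc = 0
    for new, old in enumerate(order):
        k = arc_counts[old]
        if not groups or groups[-1][1] != k:
            groups.append((new, k, next_arc))
        next_arc += k
    return new_id, groups


def arc_range(state_id, groups):
    """Compute [begin, end) arc indices of a renumbered state without reading it."""
    # Find the last group whose first state ID is <= state_id
    # (in hardware this could be a handful of parallel comparators).
    starts = [g[0] for g in groups]
    first_state, k, first_arc = groups[bisect.bisect_right(starts, state_id) - 1]
    begin = first_arc + (state_id - first_state) * k
    return begin, begin + k


new_id, groups = build_layout([3, 1, 2, 1])
print(groups)
print([arc_range(s, groups) for s in range(4)])
```

Since the slide reports that 97% of expanded states have fewer than 16 arcs, a table of this kind would only need a few rows to cover most lookups; states with more arcs could fall back to the regular state fetch.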
Evaluation Methodology
- Viterbi accelerator's timing estimation
  - A cycle-accurate simulator
    - Execution and activity factors
  - RTL Verilog model for logic components
    - Design frequency
  - Modeling memory parts with CACTI
    - Cache and memory latency
- Power model
  - Memory and caches: CACTI
  - Logic: Synopsys Design Compiler
- Technology node: 28 nm
Experimental Results
[Figure: result charts; labels shown: 111.47x Speedup, 16.7x Speedup, 1185x Reduction]
Conclusion
- Viterbi search is the main bottleneck in ASR systems
- General-purpose solutions
  - Not real-time for large speech models
  - High energy consumption
- Design of an accelerator tailored for the Viterbi search
  - More energy-efficient (by orders of magnitude)
- The memory subsystem is the main challenge of ASR
  - Arc prefetcher
  - Memory bandwidth reduction
- 1.7x faster than an NVIDIA GTX 980, with 287x less energy