High-Performance FPGA Implementation of Equivariant Adaptive Separation via Independence Algorithm for Independent Component Analysis
Mahdi Nazemi, Shahin Nazarian, and Massoud Pedram
July 10, 2017
Outline
- Independent Component Analysis (ICA)
- Motivations for Using ICA
- Equivariant Adaptive Separation via Independence (EASI) Algorithm
- Proposed Algorithm and Hardware Implementation
- Results
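Independent Component Analysis (ICA)
- ICA can be defined as the estimation of a generative model: $\mathbf{x}_{m \times 1} = \mathbf{A}_{m \times n}\, \mathbf{s}_{n \times 1}$, with $m \ge n$
  - $\mathbf{x}$: observed random variables
  - $\mathbf{A}$: mixing matrix
  - $\mathbf{s}$: independent components (ICs)
- Objective: estimate both the mixing coefficients and the ICs
- Another variation: $\mathbf{y}_{n \times 1} = \mathbf{B}_{n \times m}\, \mathbf{x}_{m \times 1}$
  - $\mathbf{B}$: separation matrix
  - $\mathbf{y}$: estimates of the ICs
- ICA allows feature extraction, i.e., it keeps the features that explain the essential structure of the data
- Classic example: the cocktail party problem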
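
To make the dimensions concrete, here is a minimal Python sketch of the mixing model x = A s and the separation model y = B x for a toy case with m = n = 2. The matrix `A`, the source vector `s`, and the helper `matvec` are illustrative assumptions. In particular, `B` is taken to be the exact inverse of `A`, whereas ICA has to estimate it from the observations alone.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


# Hypothetical mixing matrix (m = n = 2); its determinant is 1.
A = [[2.0, 1.0],
     [1.0, 1.0]]

# Exact inverse of A, standing in for a perfectly estimated separation matrix.
B = [[1.0, -1.0],
     [-1.0, 2.0]]

s = [3.0, 5.0]      # independent components (unknown in practice)
x = matvec(A, s)    # observations: x = A s
y = matvec(B, x)    # estimates of the ICs: y = B x

print(x)  # [11.0, 8.0]
print(y)  # [3.0, 5.0]
```

In practice ICA recovers the components only up to permutation and scaling, so an estimated `B` need not equal the inverse of `A`.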
Bayesian Neural Networks (BNNs)
- Inputs, weights, or outputs follow a probability distribution
- Sampling dependent features is complicated and computationally expensive
- ICA finds independent components ⇒ sampling becomes simpler
- RNGs for sampling: Wallace, Ziggurat
ICA for Dimensionality Reduction
- Preprocessing step that transforms the original problem into a smaller one suitable for hardware implementation
- MNIST dataset
  - Input features: 784 → 32 (~25x reduction)
  - Accuracy: 0.23% decrease
- ICA not only reduces redundancy but also yields a problem better suited to hardware
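EASI Algorithm
- Equivariant, adaptive
- Combines whitening and separation
- Stochastic Gradient Descent (SGD) optimization
- Only requires addition and multiplication

Notation
- $\mathbf{x}_k$: observations ($m$)
- $\mathbf{B}_k$: separation matrix
- $\mathbf{y}_k$: output features ($n$)

Flowchart (input $\mathbf{x}_k$, stages S1 to S4, repeat until convergence)
- $\mathbf{y}_k = \mathbf{B}_k \mathbf{x}_k$
- Nonlinearity: $\mathbf{g}(\mathbf{y}_k) = \mathbf{y}_k^3$ (element-wise)
- $\mathbf{H} = \mathbf{I} - \mathbf{y}_k \mathbf{y}_k^T + \mathbf{g}(\mathbf{y}_k)\, \mathbf{y}_k^T - \mathbf{y}_k\, \mathbf{g}(\mathbf{y}_k)^T$
- $\mathbf{B}_{k+1} = \mathbf{B}_k - \mu \mathbf{H} \mathbf{B}_k$
- Loop-carried dependency: $\mathbf{B}_{k+1}$ feeds back into the computation for the next sample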
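
A single EASI update for one sample can be sketched in plain Python as follows. This is a minimal illustration, not the authors' hardware design: the helper names (`matvec`, `matmul`, `easi_h`, `easi_sgd_step`), the step size, and the toy inputs are all assumptions. One further assumption is flagged in the code: the symmetric part of H is written as y yᵀ − I, the sign used in Cardoso and Laheld's original EASI rule, because with a −μ H B step that is the sign that shrinks rows of B whose outputs carry too much energy.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def easi_h(y):
    """Update direction for one sample, with g(y) = y^3 element-wise.

    ASSUMPTION: the symmetric part is y y^T - I (Cardoso-Laheld sign
    convention), so that B <- B - mu * H * B pushes E[y y^T] toward I.
    """
    g = [v ** 3 for v in y]
    n = len(y)
    return [[y[i] * y[j] - (1.0 if i == j else 0.0)   # whitening term
             + g[i] * y[j] - y[i] * g[j]              # skew-symmetric separation term
             for j in range(n)] for i in range(n)]


def easi_sgd_step(B, x, mu):
    """One per-sample update: B <- B - mu * H * B."""
    y = matvec(B, x)       # current estimate of the components
    H = easi_h(y)          # n x n update direction
    HB = matmul(H, B)      # n x m
    return [[b - mu * hb for b, hb in zip(row_b, row_hb)]
            for row_b, row_hb in zip(B, HB)]


# Toy run with hypothetical values (m = n = 2).
B0 = [[1.0, 0.0],
      [0.0, 1.0]]
B1 = easi_sgd_step(B0, [2.0, 0.0], mu=0.1)
print(B1)  # approximately [[0.7, 0.0], [0.0, 1.1]]
```

Here the first output has energy above 1, so its row of `B` shrinks to about 0.7, while the second output has energy below 1, so its row grows to about 1.1. Because each call needs the `B` returned by the previous call, consecutive samples cannot overlap, which is the loop-carried dependency noted on the slide.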
Hardware Implementation
Shortcomings of Existing Implementations
- Clock frequency and throughput are low: each training sample has to wait for the immediately preceding sample to update the model parameters
- Clock frequency or throughput decreases as $m$ and $n$ increase [Meyer-Baese, SPIE ’15]
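Proposed Algorithm

Notation
- $\mathbf{x}_k^p$: observations ($m$)
- $\mathbf{B}_k^p$: separation matrix
- $\mathbf{y}_k^p$: output features ($n$)
- $k$: index of mini-batch
- $p$: index of training sample within a mini-batch
- $P$: mini-batch size
- $\mathbf{H}_k^p$: per-sample term; $\hat{\mathbf{H}}_k^p$: accumulated update

Initialization
- Initialize $\mathbf{B}_0^0$ randomly
- At the beginning of each mini-batch: initialize $p$ to 0 and initialize $\hat{\mathbf{H}}_k^p$ to a zero matrix

Flowchart (input $\mathbf{x}_k^p$, stages S1 to S5, increment $p$ after each sample)
- $\mathbf{y}_k^p = \mathbf{B}_k^p \mathbf{x}_k^p$
- Nonlinearity: $\mathbf{g}(\mathbf{y}_k^p) = (\mathbf{y}_k^p)^3$ (element-wise)
- $\mathbf{H}_k^p = \mathbf{I} - \mathbf{y}_k^p (\mathbf{y}_k^p)^T + \mathbf{g}(\mathbf{y}_k^p)\, (\mathbf{y}_k^p)^T - \mathbf{y}_k^p\, \mathbf{g}(\mathbf{y}_k^p)^T$
- $\hat{\mathbf{H}}_k^p = \begin{cases} \gamma\, \hat{\mathbf{H}}_{k-1}^{P} + \mu\, \mathbf{H}_k^p, & p = 0 \\ \beta\, \hat{\mathbf{H}}_k^{p-1} + \mu\, \mathbf{H}_k^p, & 0 < p < P \end{cases}$
- $\mathbf{B}_k^{p+1} = \mathbf{B}_k^p - \hat{\mathbf{H}}_k^p \mathbf{B}_k^p$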
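
The two-case recurrence for the accumulated update can be sketched in Python as below. This covers only the accumulation step: it does not model how B is updated or how the stages are pipelined in hardware. The function names (`accumulate`, `run_mini_batch`), the 1x1 toy matrices, and the values chosen for μ, β, and γ are all hypothetical.

```python
def accumulate(h_hat_prev, h, p, mu, beta, gamma):
    """Fold the per-sample matrix h into the running matrix h_hat_prev.

    p == 0: gamma * (matrix carried over from the previous mini-batch) + mu * h
    p  > 0: beta  * (matrix after sample p - 1)                        + mu * h
    """
    decay = gamma if p == 0 else beta
    return [[decay * a + mu * b for a, b in zip(row_prev, row_h)]
            for row_prev, row_h in zip(h_hat_prev, h)]


def run_mini_batch(h_hat_carry, hs, mu, beta, gamma):
    """Apply the recurrence to every per-sample matrix in one mini-batch."""
    h_hat = h_hat_carry
    for p, h in enumerate(hs):
        h_hat = accumulate(h_hat, h, p, mu, beta, gamma)
    return h_hat


# Toy 1x1 example (hypothetical values): P = 3 identical per-sample terms,
# nothing carried over from a previous mini-batch.
hs = [[[1.0]], [[1.0]], [[1.0]]]
h_hat = run_mini_batch([[0.0]], hs, mu=0.1, beta=0.5, gamma=0.9)
print(h_hat)  # roughly [[0.175]]: the running value goes 0.1, 0.15, 0.175
```

Unrolling the loop shows that, at the end of a mini-batch, the accumulated matrix equals γβ^(P−1) times the carried-over matrix plus μ times the sum of β^(P−1−p) H^p over the samples, so earlier samples in the mini-batch are weighted down geometrically by β.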
Hardware Implementation
- Equations for critical path delay and throughput
FPGA Implementation
- 32-bit floating-point variables and operations
- $m = 4$ and $n = 2$
- ~11x increase in clock frequency
- ~149x increase in throughput
- ~23x increase in number of registers
- Clock frequency is independent of $m$ and $n$

Results (EASI with SGD / EASI with SMBGD)
- Clock frequency (MHz): 4.81 / 55.17
- Throughput (MIPS): 717.21 (SMBGD)
- Adaptive Logic Modules (ALMs): 12731 / 10350
- DSPs (multipliers): 42
- Registers (bits): 160 / 3648
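
The headline factors can be checked against the reported figures with a few lines of Python. The dictionary keys are hypothetical names, and the sketch assumes that the single throughput figure (717.21 MIPS) belongs to the SMBGD design. Rows with only one reported value are otherwise left out.

```python
# Reported figures; key names are illustrative.
sgd = {"clock_mhz": 4.81, "alms": 12731, "registers_bits": 160}
smbgd = {
    "clock_mhz": 55.17,
    "throughput_mips": 717.21,  # ASSUMPTION: this value is the SMBGD throughput
    "alms": 10350,
    "registers_bits": 3648,
}

clock_ratio = smbgd["clock_mhz"] / sgd["clock_mhz"]
register_ratio = smbgd["registers_bits"] / sgd["registers_bits"]
alm_ratio = smbgd["alms"] / sgd["alms"]
throughput_per_mhz = smbgd["throughput_mips"] / smbgd["clock_mhz"]

print(f"clock frequency: {clock_ratio:.2f}x")            # 11.47x
print(f"registers:       {register_ratio:.2f}x")         # 22.80x
print(f"ALMs:            {alm_ratio:.2f}x")              # 0.81x
print(f"throughput/MHz:  {throughput_per_mhz:.2f}")      # 13.00
```

Under that assumption, the clock and register ratios match the rounded ~11x and ~23x figures, the ALM count drops by about 19%, and the SMBGD throughput is 13 times its clock frequency in MHz.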
Q&A
During Poster Session Later Today