A 100 µW, 16-Channel, Spike-Sorting ASIC with On-the-Fly Clustering
Progress Update, Summer 2010
Vaibhav Karkare (vaibhav@ee.ucla.edu)
Spike Sorting
Spike sorting: the process of classifying action potentials according to their source neurons. The processing chain consists of detection (D) and alignment (A), feature extraction (FE), and clustering (C).
Spike-Sorting DSP Chip
64-Channel Spike-Sorting DSP summary:
  Technology:       1P8M 90-nm CMOS
  Core VDD:         0.55 V
  Gate count:       650 k
  Clock domains:    0.4 MHz, 1.6 MHz
  Power:            2 µW/channel
  Data reduction:   91.25 %
  No. of channels:  16, 32, 48, 64
  SNR:              −2.2 dB
  Median PD:        86 % (at 1 % PFA), 87 % (at 5 % PFA)
  Class. accuracy:  92 %, 77 % (respectively)
Previous Work
None of the previous DSPs support online clustering.

  Reference:               JNE '07   JSSC '05   ISSCC '08   ISCAS '09   ASSCC '09
  No. of channels:         96        32         1           128         64
  Power (µW/channel):      104       75         100         14.6        2.03
  Area (mm²/channel):      -         0.11       1.58        0.01        0.06
  Power density (µW/mm²):  -         680        60          1460        30
  Process (nm):            FPGA      500        350         90          90
  Core voltage (V):        -         3          3.3         1.08        0.55

Prior designs implement detection, alignment, and feature extraction to varying degrees, but none implement online clustering.
Importance of Online Clustering
Several applications require on-the-fly spike sorting:
- Spike sorting is not complete until clustering is implemented.
- The latencies of offline clustering are unacceptable for real-time, multi-channel recordings (example: a brain-computer interface).
- Clustering provides a 240x reduction in data rate compared to raw-data transmission, which will reduce transmit power by roughly 240x. Transmit power is dominant in a multi-channel system that transmits wideband neural data.

Data-rate arithmetic:
- Spike transmission: 48 samples/spike × 8 bits/sample = 384 bits/spike. With clustering, only a cluster ID of 4 bits (supporting 16 neurons) needs to be transmitted: 384/4 = 96x reduction with respect to spike transmission.
- Detection vs. raw data: raw data rate = 24,000 samples/s × 8 bits = 192,000 bps; with spike-ID transmission at 100 spikes/s, 100 × 4 = 400 bps, a 480x reduction.
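The data-rate arithmetic above can be sketched as a short back-of-the-envelope script. All constants are the ones quoted on the slide (24 kS/s sampling, 8-bit samples, 48-sample spikes, 4-bit cluster IDs, and a nominal 100 spikes/s firing rate); none are measured values beyond what the slide states.

```python
# Back-of-the-envelope data-rate reduction from on-chip clustering,
# using the numbers quoted on the slide.
SAMPLE_RATE_HZ = 24_000       # raw sampling rate per channel
BITS_PER_SAMPLE = 8
SAMPLES_PER_SPIKE = 48
BITS_PER_CLUSTER_ID = 4       # supports up to 2**4 = 16 neurons
SPIKES_PER_SECOND = 100       # nominal firing rate

raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE                          # 192,000 bps
spike_bits = SAMPLES_PER_SPIKE * BITS_PER_SAMPLE                    # 384 bits/spike
cluster_bps = SPIKES_PER_SECOND * BITS_PER_CLUSTER_ID               # 400 bps

print(spike_bits // BITS_PER_CLUSTER_ID)   # 96x reduction vs. spike transmission
print(raw_bps // cluster_bps)              # 480x reduction vs. raw data
```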
Challenges in Online Clustering
Conventional clustering algorithms are developed for offline use (examples: k-means, fuzzy c-means, superparamagnetic clustering, valley seeking). They require data storage of a few TB, which is infeasible for an on-chip implementation.
The online sorting algorithm developed at Caltech, available as part of the OSort software package and used by our collaborators, is the only algorithm amenable to hardware implementation.
Online Clustering Algorithm
- 1st data point: create cluster #1 with the point as its centroid.
- 2nd data point: compute the distance d to the centroid. If d < threshold, assign the point to cluster #1; if d > threshold, create cluster #2.
- Nth data point: compute the minimum distance dmin to the existing centroids. If dmin < threshold, assign the point to the nearest cluster (updating its centroid); if dmin > threshold, create a new cluster (e.g., #3).
- If the distance dmin between two cluster means falls below the threshold, merge the clusters.
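The assign/create/merge flow above can be sketched in a few lines. This is a minimal illustration, not the chip's fixed-point implementation: the function name, the squared-L2 distance, and the single-pass merge are all assumptions for readability.

```python
import numpy as np

def osort_like(spikes, threshold):
    """Minimal sketch of the on-the-fly clustering flow.

    spikes: iterable of aligned spike waveforms (1-D arrays).
    threshold: assign/merge distance threshold (squared-L2 here).
    Returns (cluster means, per-spike labels).
    """
    means, counts, labels = [], [], []
    for x in spikes:
        x = np.asarray(x, dtype=float)
        if not means:
            means.append(x.copy()); counts.append(1); labels.append(0)
            continue
        d = [float(np.sum((x - m) ** 2)) for m in means]
        k = int(np.argmin(d))
        if d[k] < threshold:
            # assign: running update of the nearest cluster's mean
            counts[k] += 1
            means[k] += (x - means[k]) / counts[k]
            labels.append(k)
        else:
            # create a new cluster centered on this point
            means.append(x.copy()); counts.append(1)
            labels.append(len(means) - 1)
        # merge any pair of clusters whose means are within threshold
        i = 0
        while i < len(means):
            j = i + 1
            while j < len(means):
                if np.sum((means[i] - means[j]) ** 2) < threshold:
                    n = counts[i] + counts[j]
                    means[i] = (counts[i] * means[i] + counts[j] * means[j]) / n
                    counts[i] = n
                    del means[j]; del counts[j]
                    labels = [i if l == j else (l - 1 if l > j else l) for l in labels]
                else:
                    j += 1
            i += 1
    return means, labels
```

With two well-separated groups of 1-D points, the sketch recovers two clusters and consistent labels.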
Direct-Mapped Implementation
A direct-mapped implementation has a large memory requirement for a low-power, multi-channel DSP:
- 14 kb/channel is needed to store the cluster means; a 224 kb SRAM for 16 channels consumes 1.12 mA of leakage current.
- Each distance calculation entails 95 addition operations and 48 squaring operations, and up to 1936 distance computations may be needed for an incoming spike.
We need to revisit the algorithm to identify simplifications for an implantable ASIC solution.
Template Matching for Clustering
Template-matching-based classification: OSort is implemented sequentially during training to identify templates; template matching then provides multi-channel, real-time operation.
Advantages:
- Memory: 14 kb (training) + 1.9 kb × N, vs. 44 kb × N for the direct-mapped design
- At most 6 distance computations per spike for template matching, vs. up to 1936 distance computations per spike for the direct-mapped design
- Scalable design
Computational Simplifications
- Use the L1 norm instead of the L2 norm
- Approximate cluster-mean calculation
- Approximate merged-mean calculation
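The three simplifications can be illustrated as follows. The shift-based mean update and the size-agnostic midpoint merge are assumed hardware-friendly forms for illustration; the chip's exact fixed-point arithmetic may differ.

```python
import numpy as np

def l1_distance(x, m):
    """L1 norm |x - m| summed: avoids the 48 squaring operations per
    distance that the L2 norm would require."""
    return float(np.sum(np.abs(x - m)))

def approx_mean_update(mean, x, count):
    """Approximate running-mean update. Dividing by the nearest
    power of two below `count` (an assumed simplification) turns the
    divide into a right shift in hardware."""
    shift = int(np.floor(np.log2(count)))   # count rounded down to 2**shift
    return mean + (x - mean) / (1 << shift)

def approx_merged_mean(mean_a, mean_b):
    """Approximate merged mean: midpoint of the two cluster means,
    ignoring cluster sizes (an assumed simplification)."""
    return (mean_a + mean_b) / 2
```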
Error Tolerance in Clustering
A condition on the error in the cluster-mean computation was derived; it is valid for any source of error. The simplifications were evaluated on 600+ data sets of simulated neural data.

Accuracy by simplification (median / mean), for template matching:
  None:           0.72 / 0.71
  L1 norm:        0.87 / 0.77
  Cluster mean:   0.88 / -
  Cluster merge:  0.85 / 0.76
OSort Chip Architecture
- Fully synchronous design
- "Training required" indicator
- Parallel training and template matching
- External / internal threshold for clustering
Architecture Analysis
The assumptions behind regular energy-delay (E-D) analysis are not valid here: the operating frequency is fixed and the design is register-dominated. We therefore:
- Separate the logic and flip-flop memory modules
- Use HVT devices for flip-flops and SVT devices for logic
- Reduce the supply voltage for memory, with level conversion between the memory and logic modules
Flip-Flop-Based Memories
DFF-based memory is used instead of SRAM: it operates at reduced voltages, giving up to 5x lower leakage. A delay-line-based clock is used, so data is not shifted on each cycle; the clock is valid for only one register in the entire memory at a time.
Serial Processing of Parallel Data
Serial processing of the parallel channel data is implemented at a faster clock, which reduces logic leakage. This would not be possible for a direct-mapped, multi-channel implementation.
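The serialization can be sketched as time-multiplexing the 16 channels through one shared datapath: one sample per channel arrives at 24 kS/s, so the shared logic runs at 16 × 24 kHz = 384 kHz, matching the clock rate quoted in the chip summary. The round-robin schedule below is illustrative.

```python
# Time-multiplexing 16 parallel channels through one shared datapath.
NUM_CHANNELS = 16
SAMPLE_RATE_HZ = 24_000                            # per-channel sample rate

serial_clock_hz = NUM_CHANNELS * SAMPLE_RATE_HZ    # 384,000 Hz shared clock

def serviced_channel(cycle_index):
    """Each serial clock cycle services exactly one channel, round-robin."""
    return cycle_index % NUM_CHANNELS

# One full round of 16 cycles touches every channel exactly once.
order = [serviced_channel(i) for i in range(NUM_CHANNELS)]
print(serial_clock_hz, order)
```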
16-Channel Spike-Sorting DSP with On-the-Fly Clustering
Chip summary:
  Technology:                     65-nm
  Core VDD:                       0.5 V / 0.3 V
  Clock rate:                     384 kHz
  Classification accuracy (CA):   82 %
  Power:                          100 µW
  Data reduction:                 240x
  No. of channels:                16
  Area:                           2.45 mm²
  Power density:                  40.8 µW/mm²
Conclusions
- Demonstrated the first spike-sorting DSP with multi-channel, on-the-fly clustering.
- The DSP consumes 100 µW of power and occupies 2.45 mm² in a 65-nm 1P8M CMOS process.
- A 240x reduction in output data rate is obtained compared to raw-data transmission.
- Template-matching-based clustering is implemented, with simplified online sorting used for template identification.
- A fully synchronous, serialized architecture reduces the dominant static power consumption.

Acknowledgments: Sarah Gibson, Chia-Hsiang Yang, and Victoria Wang
Questions / Comments?