A 100 µW, 16-Channel, Spike-Sorting ASIC with On-the-Fly Clustering

Presentation transcript:

A 100 µW, 16-Channel, Spike-Sorting ASIC with On-the-Fly Clustering
Progress Update, Summer 2010
Vaibhav Karkare (vaibhav@ee.ucla.edu)

Spike Sorting
Spike sorting: the process of classifying action potentials according to their source neurons. The processing chain consists of Detection (D) & Alignment (A), Feature Extraction (FE), and Clustering (C).
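As a rough illustration of how these stages fit together, here is a minimal Python sketch; detect_and_align, extract_features, and classify are hypothetical stand-ins for the hardware blocks, not part of the design itself.

    # Hypothetical sketch of the spike-sorting chain: the three stage functions are
    # placeholders for the Detection/Alignment, Feature Extraction, and Clustering blocks.
    def spike_sort(raw_samples, detect_and_align, extract_features, classify):
        spikes = detect_and_align(raw_samples)             # D & A: find and align spike waveforms
        features = [extract_features(s) for s in spikes]   # FE: reduce each spike to a feature vector
        return [classify(f) for f in features]             # C: assign each spike to a source neuron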

Spike-Sorting DSP Chip (64-Channel Spike-Sorting DSP)
Technology: 1P8M 90-nm CMOS
Core VDD: 0.55 V
Gate count: 650 k
Clock domains: 0.4 MHz, 1.6 MHz
Power: 2 µW/channel
Data reduction: 91.25 %
No. of channels: 16, 32, 48, 64
SNR: −2.2 dB
Median PD: 86 % / 87 %
PFA: 1 % / 5 %
Classification accuracy: 92 % / 77 %

Previous Work
None of the previous DSPs supports online clustering.
Reference: JNE '07 | JSSC '05 | ISSCC '08 | ISCAS '09 | ASSCC '09
No. of channels: 96 | 32 | 1 | 128 | 64
Power (µW/channel): 104 | 75 | 100 | 14.6 | 2.03
Area (mm²/channel): – | 0.11 | 1.58 | 0.01 | 0.06
Power density (µW/mm²): 680 | 60 | 1460 | 30
Process: FPGA | 500 nm | 350 nm | 90 nm
Core voltage (V): 3 | 3.3 | 1.08 | 0.55
Detection, alignment, and feature-extraction support varies across these designs.

Importance of Online Clustering
Several applications require on-the-fly spike sorting, and spike sorting is not complete until clustering is implemented. The latencies of offline clustering are unacceptable for real-time, multi-channel recordings; a brain-computer interface is one example.
Clustering provides a ~240x reduction in data rate compared to raw-data transmission, and therefore reduces transmit power by roughly the same factor. Transmit power is dominant in a multi-channel system that transmits wideband neural data.
Spike transmission: 48 samples/spike × 8 bits/sample = 384 bits/spike. With clustering, only a 4-bit cluster ID (supporting up to 16 neurons) is transmitted per spike, a 384/4 = 96x reduction with respect to spike transmission.
Raw data: 24,000 samples/s × 8 bits = 192,000 bps; with spike-ID transmission, 100 spikes/s × 4 bits = 400 bps, a 480x reduction.
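The arithmetic behind these reduction factors, restated as a short calculation; the per-channel sample rate and spike rate are the values quoted on the slide.

    # Reduction factors quoted above, recomputed from the slide's numbers.
    SAMPLES_PER_SPIKE, BITS_PER_SAMPLE = 48, 8
    bits_per_spike = SAMPLES_PER_SPIKE * BITS_PER_SAMPLE       # 384 bits per spike waveform
    bits_per_cluster_id = 4                                    # 16 neurons -> 4-bit cluster ID
    print(bits_per_spike // bits_per_cluster_id)               # 96x vs. spike transmission

    SAMPLE_RATE_HZ, SPIKE_RATE_HZ = 24_000, 100                # per-channel rates from the slide
    raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE                 # 192,000 bps of raw data
    id_bps = SPIKE_RATE_HZ * bits_per_cluster_id               # 400 bps of cluster IDs
    print(raw_bps // id_bps)                                   # 480x vs. raw-data transmission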

Challenges in Online Clustering
Conventional clustering algorithms, such as k-means, fuzzy c-means, superparamagnetic clustering, and valley seeking, were developed for offline clustering. They require data storage of a few TB, which is infeasible for on-chip implementation. An online sorting algorithm developed at Caltech, available as part of the Osort software package and used by our collaborators, is the only algorithm amenable to hardware implementation.

Online Clustering Algorithm
(Figure: clustering of incoming data points, step by step.) The first data point seeds cluster #1 and becomes its centroid. For each subsequent data point, the distance d to each existing centroid is computed. If the minimum distance dmin is below the threshold, the point is assigned to the nearest cluster and the centroid is updated; if dmin exceeds the threshold, a new cluster is created. Whenever two centroids come within the threshold of each other, their clusters are merged.
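A simplified Python sketch of this online clustering flow, assuming a running-mean centroid update and a caller-supplied distance function and threshold; it is a conceptual model of the steps above, not the chip's implementation.

    # Online clustering: assign a point to the nearest cluster if it is close enough,
    # otherwise create a new cluster; merge clusters whose centroids fall within the threshold.
    def online_cluster(point, clusters, threshold, dist):
        if clusters:
            j = min(range(len(clusters)), key=lambda i: dist(point, clusters[i]["mean"]))
            if dist(point, clusters[j]["mean"]) < threshold:
                c = clusters[j]
                c["n"] += 1                                     # assign: update the running mean
                c["mean"] = [m + (x - m) / c["n"] for m, x in zip(c["mean"], point)]
            else:
                clusters.append({"mean": list(point), "n": 1})  # create a new cluster
        else:
            clusters.append({"mean": list(point), "n": 1})      # first data point seeds cluster #1
        # Merge one pair of clusters whose centroids are within the threshold of each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if dist(a["mean"], b["mean"]) < threshold:
                    total = a["n"] + b["n"]
                    a["mean"] = [(ma * a["n"] + mb * b["n"]) / total
                                 for ma, mb in zip(a["mean"], b["mean"])]
                    a["n"] = total
                    del clusters[j]
                    return clusters
        return clusters

A natural choice for dist here would be the L1 or L2 distance between a spike (or its feature vector) and a stored centroid.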

Direct-Mapped Implementation
A direct mapping of the algorithm carries a large memory requirement for a low-power, multi-channel DSP: 14 kb/channel is needed to store the cluster means, and a 224 kb SRAM for 16 channels consumes 1.12 mA of leakage current. Each distance calculation entails 95 addition operations and 48 squaring operations, and up to 1936 distance computations may be needed for a single incoming spike. The algorithm therefore needs to be revisited to identify simplifications for an implantable ASIC solution.
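A back-of-the-envelope restatement of these costs; the 48-sample spike length is taken from the earlier data-rate slide, and the 95-addition count follows from an L2 distance over 48 samples (48 differences plus 47 accumulations).

    # Memory and computation bookkeeping for the direct-mapped design (slide's numbers).
    CHANNELS, MEAN_STORAGE_KB_PER_CH = 16, 14
    print(CHANNELS * MEAN_STORAGE_KB_PER_CH)                  # 224 kb of cluster-mean storage

    SPIKE_LEN = 48                                            # samples per aligned spike
    adds_per_distance = SPIKE_LEN + (SPIKE_LEN - 1)           # 48 differences + 47 accumulations = 95
    squares_per_distance = SPIKE_LEN                          # one squaring per sample (L2 norm)
    MAX_DISTANCES_PER_SPIKE = 1936                            # worst case quoted on the slide
    print(MAX_DISTANCES_PER_SPIKE * (adds_per_distance + squares_per_distance))  # ~277k ops per spike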

Template Matching for Clustering
Approach: template-matching based classification, with Osort implemented sequentially to identify the templates, followed by template matching for multi-channel, real-time operation.
Advantages: 14 kb (training) + 1.9 kb × N of memory, versus 44 kb × N for the direct-mapped design; a maximum of 6 distance computations per spike for template matching, versus a maximum of 1936 per spike for the direct-mapped design; a scalable design.
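A minimal sketch of the template-matching classification step, assuming the templates (cluster means) have already been identified by the sequential online-sorting training phase; the names and structure are illustrative.

    # With at most a handful of templates per channel, classification reduces to a
    # nearest-template search, i.e. only a few distance computations per spike.
    def classify_spike(spike, templates, threshold, dist):
        if not templates:
            return None                                       # training has not produced templates yet
        best = min(range(len(templates)), key=lambda i: dist(spike, templates[i]))
        return best if dist(spike, templates[best]) < threshold else None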

Computational Simplifications
Use the L1 norm instead of the L2 norm; approximate the cluster-mean calculation; approximate the merged-mean calculation.
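One way to picture these simplifications in Python; the power-of-two mean update shown here is an illustrative assumption (a shift instead of a true division), not necessarily the exact approximation used on the chip, and integer fixed-point samples are assumed.

    # L1 distance replaces the L2 distance, eliminating the squaring operations.
    def l1_dist(u, v):
        return sum(abs(x - y) for x, y in zip(u, v))

    def l2_dist_sq(u, v):
        return sum((x - y) * (x - y) for x, y in zip(u, v))

    # Illustrative approximate mean update: divide by the nearest power of two (a shift
    # in hardware) instead of dividing exactly by the cluster size n.
    def approx_mean_update(mean, x, n):
        shift = n.bit_length() - 1                            # floor(log2(n)), n >= 1
        return [m + ((xi - m) >> shift) for m, xi in zip(mean, x)]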

Error Tolerance in Clustering
A condition on the error in the cluster-mean computation was derived; it is valid for any source of error. The simplifications were evaluated on 600+ data sets of simulated neural data (classification accuracy, median / mean):
None: 0.72 / 0.71
L1 norm: 0.87 / 0.77
Cluster mean: 0.88 / –
Cluster merge: 0.85 / 0.76
Template matching: – / –

Osort Chip Architecture
A fully synchronous design with a "training required" indicator, parallel training and template matching, and a choice of external or internal threshold for clustering.

Architecture Analysis
The assumptions behind a regular energy-delay (E-D) analysis are not valid here: the operating frequency is fixed and the design is register-dominated. The architecture therefore separates the logic and flip-flop memory modules, uses HVT devices for the flip-flops and SVT devices for the logic, runs the memory at a reduced supply voltage, and performs level conversion between the memory and logic modules.

Flip-Flop-Based Memories
DFF-based memory is used instead of SRAM: it operates at reduced voltages and offers up to 5x lower leakage. A delay-line-based clock is used, so data is not shifted every cycle and the clock is valid for only one register in the entire memory at a time.

Serial Processing of Parallel Data
Serial processing of the parallel channel data is implemented at a faster clock, which reduces logic leakage. This would not be possible for a direct-mapped, multi-channel implementation.
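A software analogue of this serialization, assuming a single shared processing unit that services the 16 channels in round-robin order at a proportionally faster clock; process_one is a hypothetical per-channel processing step.

    # One time-shared processing block replaces 16 parallel (and leaky) per-channel blocks:
    # each fast clock cycle services one channel, so all channels are serviced once per
    # slow sample period.
    def process_sample_period(channel_samples, channel_states, process_one):
        for ch, sample in enumerate(channel_samples):         # 16 fast cycles per sample period
            channel_states[ch] = process_one(channel_states[ch], sample)
        return channel_states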

16-Channel Spike-Sorting DSP with On-the-Fly Clustering: Chip Summary
Technology: 65-nm 1P8M CMOS
Core VDD: 0.5 V / 0.3 V
Clock rate: 384 kHz
Classification accuracy (CA): 82 %
Power: 100 µW
Data reduction: 240x
No. of channels: 16
Area: 2.45 mm²
Power density: 40.8 µW/mm²

Conclusions
We demonstrated the first spike-sorting DSP with multi-channel, on-the-fly clustering. The DSP consumes 100 µW of power and occupies 2.45 mm² in a 65-nm 1P8M CMOS process, and achieves a 240x reduction in output data rate compared to raw-data transmission. Template-matching based clustering is implemented, with simplified online sorting used for template identification. A fully synchronous, serialized architecture is used to reduce the dominant static power consumption.
Acknowledgments: Sarah Gibson, Chia-Hsiang Yang, and Victoria Wang.

Questions / Comments?