Hardware Descriptions of Multi-Layer Perceptrons with Different Abstraction Levels. Paper by E.M. Ortigosa, A. Cañas, E. Ros, P.M. Ortigosa, S. Mota, J. Díaz

Presentation transcript:

Hardware Descriptions of Multi-Layer Perceptrons with Different Abstraction Levels. Paper by E.M. Ortigosa, A. Cañas, E. Ros, P.M. Ortigosa, S. Mota, J. Díaz. Paper review by Ryan MacGowan. This paper was written in 2006.

Topics What does this even mean?? Speech Recognition Artificial Neural Networks Solutions Results Likes and Dislikes Conclusion - When I first read the title, I was a little confused, so the first thing I am going to talk about is the actual meaning of the title of this paper. Next, I am going to talk about speech recognition and why this solution is being developed. Then I am going to talk about artificial neural networks and the neurons that make them what they are. After that: solutions and results, likes and dislikes, and the conclusion.

Multi-Layer Perceptron A Multi-Layer Perceptron (referred to as MLP) is a type of standard feed-forward neural network which uses at least 3 layers (input, hidden, and output). In other words, an MLP is simply a standard feed-forward neural network containing an input, a hidden, and an output layer.
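In symbols (my notation, a standard textbook formulation rather than anything quoted from the paper; bias terms are omitted to match the hardware described later): for an input vector $\mathbf{x}$,

$\mathbf{h} = f(W_1 \mathbf{x}), \qquad \mathbf{y} = f(W_2 \mathbf{h}),$

where $W_1$ and $W_2$ are the hidden- and output-layer weight matrices and $f$ is the activation function (a sigmoid here).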

Abstraction Levels Different abstraction levels just means that the solution will be realized using 2 different methods: one using low-level VHDL, and another using higher-level Handel-C. In the case of this network, the different abstraction levels simply mean that the design is created using 2 methods: LOW-level VHDL and HIGH-level Handel-C.

Translation "Hardware Descriptions of Multi-Layer Perceptrons with Different Abstraction Levels" really means: FPGA implementation of two neural networks using VHDL and Handel-C to solve a problem. In this case the problem that is solved is speech recognition. So, to translate this title into something that is easier to understand: this paper is about the FPGA implementation of a neural network using VHDL and Handel-C to solve a problem, and the problem this paper addresses is speech recognition.

Speech Recognition Due to the increasing power of FPGAs, solutions for speech recognition can be designed using an artificial neural network built right into the FPGA. This is useful for applications in cars, GPS units, toys, and other embedded systems where speech control would be useful. A computer samples the audio, and the waveform is converted into feature vectors using filter-bank and prediction analysis; these vectors are what is sent to the neural network. The neural network then computes which word was spoken as its output. - FPGAs and neural networks are both performance-oriented techniques, so they lend themselves well to speech recognition.

Artificial Neural Network All solutions were realized using the artificial neural network presented here. 10 vectors with 22 features each make 220 input data values, which are sent to the 24 hidden neurons for computation; their outputs are sent to the output neurons, which classify the input and provide an output. If a spoken command falls in a class, we expect that output node to give a high value. - You can see that 220 inputs are sent to the hidden neurons, and 24 outputs from these hidden neurons are passed to the output neurons. - At the top you can see the make-up of each of these neurons: the functional unit, and the activation function which provides the output.
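In equation form (my notation, with the dimensions given above):

$h_j = f\Big(\sum_{i=1}^{220} w_{ji}\, x_i\Big),\ j = 1,\dots,24, \qquad y_k = f\Big(\sum_{j=1}^{24} v_{kj}\, h_j\Big),\ k = 1,\dots,10,$

with one output $y_k$ per word in the 10-word vocabulary; the recognized word is the class whose output node gives the highest value.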

MLP Hardware Implementation In order to ensure maximum accuracy, a number of neural network structures were tested. The best result, 96.3% accuracy, was obtained when 24 hidden neurons were used.

The Artificial Neuron (Functional Unit) This is the functional unit used in the implementations. The first 8-bit input is the input value, and the second is the connection weight for that input. The output of the multiplier is sign-extended because of the maximum size of the summation of weighted values in Eq. 1. So the first part of the neuron is the functional unit. It takes two 8-bit inputs: the input value and the input's connection weight. These values are multiplied to give a 16-bit output. Since up to 220 inputs and weights are multiplied and all of these products are added together, the products are sign-extended to 23 bits to make sure the sum can be computed without overflow.
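A quick width check (my arithmetic; the slides state the result but not the derivation): with 8-bit signed operands the largest product magnitude is $128 \times 128 = 2^{14}$, so the worst-case sum over 220 terms is bounded by

$|S| \le \sum_{i=1}^{220} |w_i x_i| \le 220 \cdot 2^{14} = 3{,}604{,}480 < 2^{22},$

which fits in 22 magnitude bits plus a sign bit, i.e. a 23-bit signed accumulator.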

The Artificial Neuron (Activation Function) The output of the functional unit presented on the previous slide is then passed into the sigmoid activation function, which gives an 8-bit output based on the 23-bit input. This 8-bit output can either be passed to another layer of hidden neurons or passed to the output neurons. The second part of the neuron computation involves the activation function. This is a sigmoid activation function, and its output is shown in the following graph. A sigmoid activation function is useful because it works in many applications and provides an output that is between 0 and 1.
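For reference, the standard sigmoid is

$f(x) = \dfrac{1}{1 + e^{-x}} \in (0, 1).$

In hardware the result is quantized to the 8-bit output described above; the paper's specific fixed-point approximation of this curve is not reproduced in these slides.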

Serial Design - So here we see the serial design. This design uses a single weight RAM, a single functional unit, and a single activation function. - Two counters are used: a 5-bit one for the hidden-neuron addresses and an 8-bit one (220 values need 8 bits) for the input-value addresses. For the weight RAM, both counters are combined into one address in order to reference all 24 × 220 = 5,280 weight values. - This serial design has a serious bottleneck at the RAM and functional unit, as each term of the sum must be calculated sequentially. For the hidden layer, the functional unit must loop 220 times for EACH neuron, and each neuron must be summed sequentially. - Once each neuron's sum is calculated, it is sent through the activation function and then stored in the RAM that holds the hidden-layer outputs for the next stage.

Parallel Design We can enhance the speed of the design by placing the RAMs containing the weights, and the functional units, in parallel. The output sums from these functional units are stored in registers and selected by a multiplexer to be sent to the activation function. This is only a partially parallel design, as the outputs of each layer are still computed sequentially. - For the parallel design, we can see that some of the bottlenecks have been eliminated by placing the weight RAM for each neuron in parallel, along with 24 functional units which are also in parallel. - This allows the sums for all 24 neurons to be calculated in parallel. If we consider the hidden-neuron layer from the last slide, the calculation of the sums is done 24X quicker, as it takes only 220 iterations (instead of 24 × 220 = 5,280) to get the sum for every neuron. - For this design, 24 individual registers are used instead of a RAM block because this allows quicker access times. - Results presented later will show a massive speedup.

Handel-C Design The Handel-C design is done in both serial and parallel versions, shown as code on the slide. NumHidden is the number of hidden neurons (24), NumInput is the number of input values (220), W is the array containing the weights, In is the input array, and Sum is the sum of the weights multiplied by the inputs. As we can see, Handel-C makes parallelism extraordinarily easier thanks to the ability to use PAR statements; a sketch of both versions follows below.
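The slide's actual code did not survive the transcript, so here is a minimal hedged sketch of what serial and parallel Handel-C loops of this shape could look like. The variable names come from the slide; the widths, the adjs() sign-extension macro from the Handel-C standard library, and the replicated-par syntax are my assumptions based on the DK documentation as I recall it, not the paper's exact code:

#include <stdlib.lib>            // for the adjs() width-adjustment macro

set clock = external "clk";      // placeholder clock pin

macro expr NumHidden = 24;       // hidden neurons
macro expr NumInput  = 220;      // inputs per neuron

int 8  W[NumHidden][NumInput];   // connection weights (RAM placement is the
                                 // subject of the next slide; registers here)
int 8  In[NumInput];             // input vector
int 23 Sum[NumHidden];           // per-neuron accumulators (Eq. 1)

void main(void)
{
    unsigned 5 i;
    unsigned 8 j;

    // Serial version: one multiply-accumulate at a time,
    // 24 * 220 = 5,280 iterations for the hidden layer.
    for (i = 0; i < NumHidden; i++)
    {
        Sum[i] = 0;
        for (j = 0; j < NumInput; j++)
            Sum[i] += adjs(adjs(W[i][j], 16) * adjs(In[j], 16), 23);
    }

    // Parallel version: a replicated 'par' instantiates one copy of the
    // loop body per hidden neuron (k is unrolled at compile time), so all
    // 24 sums finish in roughly 220 cycles instead of 5,280.
    par (k = 0; k < NumHidden; k++)
    {
        unsigned 8 jj;
        Sum[k] = 0;
        for (jj = 0; jj < NumInput; jj++)
            Sum[k] += adjs(adjs(W[k][jj], 16) * adjs(In[jj], 16), 23);
    }
}

(Design note: in the paper's parallel hardware each neuron has its own weight memory, which is what lets all 24 branches read a weight in the same cycle.)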

Handel-C RAM Location In order to test more solutions and come up with an optimal one, different RAM types were used in the Handel-C design: 1) only distributed RAM blocks; 2) a combination of embedded and distributed RAM blocks; 3) only embedded RAM blocks. - Distributed RAM blocks make use of the LUTs contained in the CLBs. - In the combination, the embedded RAM blocks hold the large weight array (the 220 × 24 input weights) and distributed RAM is used for the rest. - The third option uses only embedded RAM blocks. A declaration sketch follows below.
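For flavor, this is roughly how the same weight array could be steered between the two memory types in DK. The with { block = 1 } spec is how I recall DK's RAM-mapping directive; treat both declarations as an assumption rather than the paper's exact code:

ram int 8 Wdist[NumHidden][NumInput];                    // distributed RAM: built from
                                                         // the CLBs' 16x1 LUT memories
ram int 8 Wemb[NumHidden][NumInput] with { block = 1 };  // forced into embedded
                                                         // block RAM (EMBs)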

Results - Here we can see the results from both implementations. - The top table shows the VHDL implementation results; the bottom table shows the Handel-C results. - As you can see, the parallel implementations universally provide a speedup of approximately 17X. - The VHDL implementation has about 1.2X higher throughput than the Handel-C implementation. - The VHDL implementation also uses at least 30% fewer resources. - For the Handel-C designs, the one using distributed RAM is the best, followed by the combination RAM, and finally the embedded-RAM design. This is because the embedded RAM blocks' capacity cannot be used efficiently: one neuron's weight storage takes slightly over 50% of a block, so nearly half of each block is wasted. - As the resources and system gates decrease, a higher frequency can be obtained due to lower propagation delay; frequency is the major factor in the throughput differences between designs.

Throughput of the Designs When looking at throughput, the VHDL design provides the best results in both the serial and parallel cases. Among the Handel-C versions, HC(a), the distributed-RAM design, provides the highest throughput due to the higher frequency it can achieve.

Performance Cost This trend continues as we examine the performance cost. The VHDL design is the best, as it provides the highest throughput while also using the fewest gates. The Handel-C designs have at least a 1.6X higher performance cost.

Number of Neurons These graphs show the linear relationship between the number of neurons in the hidden layer and the amount of resources used. Since our design uses only 24 hidden neurons, the resource usage is manageable; however, our design is also very small (a vocabulary of only 10 words). - Here we can see how the number of hidden neurons directly relates to the amount of resources required. On the FPGA used in this study, at most a 64-neuron design could be implemented due to memory constraints.

Downsides of the VHDL Design While the tables and graphs above show that the VHDL design is superior in terms of throughput and performance cost, there are also drawbacks to consider. Design time is 10X longer for the VHDL design, and exploring different solutions with a VHDL design takes considerably longer because an entirely new control unit must be designed each time.

Hardware Used The FPGA used is a Virtex-E 2000. Its CLBs contain 4 logic cells (LCs), each with a 4-input function generator, carry logic, and a storage element. The 4 LCs are placed in 2 slices, and the function generators within a CLB can be combined to provide 5- or 6-input functions. Each LC's 4-input LUT can also provide a 16x1 memory block (distributed RAM). The device also contains embedded memory blocks (EMBs).

Software Used The VHDL was coded with the FPGA Advantage tool, version 5.3, from Mentor Graphics. The DK Design Suite was used for the Handel-C implementation. Both designs were placed and routed using the ISE Foundation tool, version 3.5i.

Likes The relevance to labs performed in the course. The comparison between parallel and serial versions for all types of implementation. A great description of the neural network, covering all aspects. The application is very practical in today's electronic culture.

Dislikes More detail is needed on the actual voice-recognition system, specifically the computer used to preprocess the audio. The paper is not entirely modern (2006), and some sources used are even older (1990-1993). The system is unrealistically small (10 different words), with no discussion of viability in a more complex environment. Not much information is given on the VHDL: no pseudocode and no simulations are provided.

Conclusion Using neural networks for speech recognition allows compact embedded systems to be developed. Parallel processing allows a speedup of up to 17X over a serial implementation. VHDL results in an implementation that is 1.21-1.24X faster than the Handel-C implementation. When using Handel-C, it is important to know the most optimized type of RAM for your application, in this case distributed RAM. Handel-C designs have a 1.6X higher performance cost. Computation of the output takes only 13-16 ms.

Questions?