1
Technion – Israel Institute of Technology
Department of Electrical Engineering
High Speed Digital Systems Lab
Project performed by: Naor Huri, Idan Shmuel
Project supervised by: Ina Rivkin
Winter 2008
2
This project is part of a much larger project dealing with signal processing acceleration using hardware. In our project we built a hardware accelerator for a given algorithm and analyzed the advantages and disadvantages of such a system compared to a pure software implementation.
3
Running the signal processing algorithm in software on a standard PC takes too much time. The goal: a system designed especially for this task, using multiple processors and a management unit.
4
Simulator – a program running on the host PC, responsible for generating data packets, sending them for processing, and retrieving the results.
Processing units – each runs the same signal processing program on the incoming packets.
Switch – responsible for the correct transfer of data between the host PC and the processing units, and vice versa.
5
[System block diagram: a data packages generator on the host PC connects over the PCI bus to the Gidel ProcStar II board; FIFO_IN and FIFO_OUT reside in board memory, and inside the Stratix II FPGA a switch routes packets to Processors I…N, each with its own on-chip memory.]
6
Building the above system and understanding multiprocessor issues.
Learning the tools and techniques for building such complex systems.
Optimizing the system configuration, in search of the ideal Nios II type and number.
Finding the optimal configuration, in which throughput is brought to its maximum.
Comparing the performance of the PC against that of the system.
7
The project is an integration of two levels, software and hardware:
The software level is composed of Vitaly's packet processing algorithm and the HOST program, which generates the data packets and retrieves the results.
The hardware level, implemented on the PROC board, includes the switch, the processing units, and Gidel IPs.
8
This program generates vectors of Time Of Arrivals (TOA), each made up of a basic chain with a specific period plus noise elements. Every vector is wrapped with a header and a tail used for identification, control signals, and synchronization. The packet structure: header, TOA vector, tail.
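A minimal sketch of what such a generator could look like, assuming a fixed period, a drop probability standing in for the "% absent" knob, and uniformly distributed spurious arrivals standing in for "% noise". All names and constants here are illustrative, not the project's actual host code:

```c
/* Hedged sketch of a TOA vector generator: a periodic basic chain with
 * dropped elements ("% absent") plus uniform random noise ("% noise"). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_PULSES   100   /* length of the basic chain                  */
#define PERIOD     500   /* time units between chain elements          */
#define PCT_ABSENT 4     /* chance an element is dropped, in percent   */
#define PCT_NOISE  25    /* spurious TOAs, as a percent of N_PULSES    */

static int cmp_u32(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    unsigned toa[N_PULSES + N_PULSES * PCT_NOISE / 100];
    int n = 0;

    srand((unsigned)time(NULL));

    /* basic chain: one TOA per period, each dropped with PCT_ABSENT chance */
    for (int i = 0; i < N_PULSES; i++)
        if (rand() % 100 >= PCT_ABSENT)
            toa[n++] = (unsigned)(i * PERIOD);

    /* noise: spurious arrivals spread uniformly over the whole vector span */
    for (int i = 0; i < N_PULSES * PCT_NOISE / 100; i++)
        toa[n++] = (unsigned)(rand() % (N_PULSES * PERIOD));

    qsort(toa, n, sizeof toa[0], cmp_u32);   /* TOAs must be in time order */

    for (int i = 0; i < n; i++)
        printf("%u\n", toa[i]);
    return 0;
}
```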
9
The algorithm's job is to recognize such basic chains in the incoming vectors and to associate each TOA element with its chain. The results are sent back to the simulator in a corresponding packet structure.
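A minimal sketch of one plausible way such chains can be detected: histogramming all pairwise TOA differences, so that the chain's period shows up as the strongest bin. This reading is suggested by the 4 KB histogram buffers listed later in the cluster description, but it is an illustration, not necessarily Vitaly's actual algorithm:

```c
/* Hedged sketch: estimate the basic chain's period from a histogram of
 * pairwise TOA differences. The period (and its multiples) accumulate
 * many hits; noise differences spread thinly across the bins. */
#include <stdio.h>
#include <string.h>

#define MAX_DIFF 2048                 /* one bin per time unit of difference */

static unsigned short hist[MAX_DIFF];

static unsigned period_estimate(const unsigned *toa, int n)
{
    memset(hist, 0, sizeof hist);

    /* accumulate all n(n-1)/2 pairwise differences -> O(n^2) work */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            unsigned d = toa[j] - toa[i];
            if (d < MAX_DIFF)
                hist[d]++;
        }

    /* take the strongest bin as the period estimate */
    unsigned best = 1;
    for (unsigned d = 2; d < MAX_DIFF; d++)
        if (hist[d] > hist[best])
            best = d;
    return best;
}

int main(void)
{
    /* toy vector: period 500 with one absent element and one noise TOA */
    unsigned toa[] = { 0, 137, 500, 1000, 2000, 2500 };
    printf("estimated period: %u\n",
           period_estimate(toa, (int)(sizeof toa / sizeof toa[0])));
    return 0;
}
```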
10
The hardware level is implemented on a Gidel PROCStar II board: 4 powerful Stratix II FPGAs, each annexed to 2 external DDR II memories. The packets are sent to the processing units via the PCI bus and are stored in the 2 external memories, which are configured to act as FIFOs.
11
The switch, designed by Oleg & Maxim, manages the data transfer between the host PC and the multiple processing units. The switch is composed of the following main modules:
Input reader – reads packets from FIFO_IN to the processing units.
Output writer – writes the answers from the processing units to FIFO_OUT.
Main controller – as its name implies, issues all the control signals required for the correct transfer of data.
Clusters – a wrapper around the processing units, used to give another abstraction layer to the system.
13
Management policy: FCFS for input packets, RR for output packets (see the sketch after this list)
Statistics reporter
Error reporter
Up to 16 clusters
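A minimal software model of these two policies, assuming one packet in flight per cluster; all names and the single-buffer lifecycle are illustrative, not the switch's actual RTL:

```c
/* Hedged model of the switch's management policy: first-come-first-served
 * dispatch of input packets, round-robin draining of output packets. */
#include <stdbool.h>
#include <stdio.h>

#define N_CLUSTERS 16

static bool busy[N_CLUSTERS];   /* cluster is processing a packet        */
static bool done[N_CLUSTERS];   /* cluster has a result waiting          */
static int  rr = 0;             /* round-robin pointer for the output    */

/* FCFS input: the packet at the head of FIFO_IN goes to the first free
 * cluster (one packet buffer per cluster in this simplified model). */
static int dispatch_packet(void)
{
    for (int i = 0; i < N_CLUSTERS; i++)
        if (!busy[i] && !done[i]) {
            busy[i] = true;
            return i;           /* accepted by cluster i                 */
        }
    return -1;                  /* all clusters occupied: input stalls   */
}

/* Models a cluster finishing its packet. */
static void finish_packet(int i) { busy[i] = false; done[i] = true; }

/* RR output: resume scanning just after the last-served cluster. */
static int collect_result(void)
{
    for (int k = 0; k < N_CLUSTERS; k++) {
        int i = (rr + 1 + k) % N_CLUSTERS;
        if (done[i]) {
            done[i] = false;
            rr = i;
            return i;           /* result drained from cluster i         */
        }
    }
    return -1;                  /* no result ready yet                   */
}

int main(void)
{
    int a = dispatch_packet(), b = dispatch_packet();   /* clusters 0, 1 */
    finish_packet(b);
    finish_packet(a);
    /* RR starts scanning after cluster 0, so cluster 1 drains first */
    printf("drained: %d then %d\n", collect_result(), collect_result());
    return 0;
}
```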
14
The switch has up to 16 clusters inside. The same cluster is duplicated many times to create a multi-Nios system.
Switch ports: (port table not reproduced)
Every cluster has one processing unit, as seen in the next slide.
Cluster ports: (port table not reproduced)
16
1 Nios II CPU
12 KB of on-chip memory for code, stack, and heap
2 × 4 KB buffers used by the algorithm to build the histograms
4 KB dual-port buffer for input packets, also mastered by the switch
1 KB dual-port buffer for output packets, also mastered by the switch
Timer
18
The input vector and output vector are the connection to the switch; without their ports, no ack/req protocols could be implemented. The modules' "export" signals are connected to the cluster, as seen in the "cluster structure" slide.
19
Duplicating the clusters inside the switch creates a multi-Nios system. The switch supports up to 16 clusters; this example includes 14 Nios cores. Logic utilization is only 20%. While almost all RAM blocks are used, memory utilization is only 33%, mainly because the M-RAM cells are used ineffectively.
20
Gidel IP – the MegaFIFO provides a simple and convenient way to transfer data to/from the Gidel PROC board. In this system there are two FIFOs: FIFO_IN for the incoming packets and FIFO_OUT for the processed packets. To access these memories the host uses Gidel's predefined HAL functions, while the hardware uses an ack/req protocol.
Gidel IP – registers are used to transfer data from hardware to software and vice versa. In this system they are used for error and statistics reporting.
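A hedged sketch of the host-side sequence this implies: write the packets to FIFO_IN, poll the info register until all packets have returned, then read the answers from FIFO_OUT. Every function name here is a HYPOTHETICAL stand-in for Gidel's predefined HAL calls, whose real names are not given in the slides, and is stubbed so the sketch compiles on its own:

```c
/* Hedged host-side flow sketch. The hal_* functions are hypothetical
 * placeholders for Gidel's HAL, stubbed out for self-containment. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

enum { FIFO_IN, FIFO_OUT };

/* --- hypothetical HAL stand-ins (stubs, not Gidel's real API) ---------- */
static void     hal_write_fifo(int f, const void *b, size_t n) { (void)f; (void)b; (void)n; }
static void     hal_read_fifo(int f, void *b, size_t n)        { (void)f; (void)b; (void)n; }
static uint32_t hal_read_info_reg(void)                        { return 0; }
/* ------------------------------------------------------------------------ */

static void run_batch(const void *packets, size_t in_bytes,
                      void *results, size_t out_bytes)
{
    hal_write_fifo(FIFO_IN, packets, in_bytes);   /* send all the packets */

    /* per the timing slide, the info register stays non-zero while the
     * entering/exiting packet counts differ, and reads zero when done */
    while (hal_read_info_reg() != 0)
        ;                                         /* busy-wait (sketch)   */

    hal_read_fifo(FIFO_OUT, results, out_bytes);  /* retrieve the answers */
}

int main(void)
{
    uint8_t in[64] = {0}, out[64];
    run_batch(in, sizeof in, out, sizeof out);
    puts("batch complete");
    return 0;
}
```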
21
In the PROCWizard tool we define the top-level entity of the design. It generates the HDL code for the design and an H file for the host. Here we can see the definition of one IC (FPGA), two FIFOs, some registers, and the LBS module.
22
Basic system:
1 Nios II/s ("s" for standard) system
1 simulator we built
1 algorithm
23
3 methods:
Timer module – inside the SOPC, used as a timestamp. Resolution: 10 µs.
Statistics reporter module – counts the packets entering and exiting the system. As long as the two counts are not identical it counts clock cycles, and the info register holds the value 128. Resolution: 0.01 µs.
Software timer – started by the host from the moment it writes the data until the info register reads zero, indicating that all packets have returned. Resolution: 1 µs.
Later we will demonstrate how the 3 methods converge.
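For the first method, a minimal sketch of timestamping on the Nios II side, assuming the SOPC timer is registered as the HAL timestamp driver (sys/alt_timestamp.h is part of the Nios II HAL and the timer clock is above 1 MHz); process_packet() is a hypothetical placeholder for the real routine:

```c
/* Hedged Nios II timing sketch using the HAL timestamp driver. */
#include <stdio.h>
#include "sys/alt_timestamp.h"
#include "alt_types.h"

static void process_packet(void)
{
    /* placeholder for the packet-processing algorithm */
    volatile int sink = 0;
    for (int i = 0; i < 100000; i++)
        sink += i;
}

int main(void)
{
    if (alt_timestamp_start() < 0) {
        printf("no timestamp device configured\n");
        return 1;
    }

    alt_u32 t0 = alt_timestamp();
    process_packet();
    alt_u32 t1 = alt_timestamp();

    /* ticks-per-microsecond conversion; valid for a 100 MHz timer clock */
    alt_u32 us = (t1 - t0) / (alt_u32)(alt_timestamp_freq() / 1000000);
    printf("processing took %lu ticks (~%lu us)\n",
           (unsigned long)(t1 - t0), (unsigned long)us);
    return 0;
}
```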
24
Computing time as a function of the number of TOAs: computing time = O(n²). (%absent = 0, %noise = 0.) The quadratic growth is consistent with building a histogram over all pairwise TOA differences, of which there are n(n−1)/2.
25
Computing time as a function of %absent. Around 6% absent, the algorithm finds more than one sequence, with double frequencies… (%noise = 0.)
26
Computing time as a function of %noise
27
According to the above results, we chose an average vector with which to test the different systems:
Vector length: 495
%absent = 4%
%noise = 25%
28
To decide which Nios configuration to use, we tested each of them with the same vector.
The economic CPU needs little space on the FPGA but has poor performance.
The fast version has some advantage over the standard CPU but needs far more FPGA resources, so we chose the standard one.
29
The CPUs are independent, so doubling their number doubles the performance. There is no major overhead for adding CPUs.
30
To reach final conclusions, we sent 10k random vectors to both the PC and the most powerful FPGA system. The PC does the job 7.64 times faster.
31
No! The Nios CPUs we used are no match for the PC's Pentium CPU, but there are a few ways to get better performance:
1. Increasing the system's 100 MHz clock rate
2. Adding an accelerator unit to each CPU
3. Shrinking the code from 8.5 KB to 8 KB, thereby freeing RAM cells for more CPUs
4. Optimizing the utilization of M-RAM cells
32
Ina Rivkin
Lab staff – Eli Shoshan and Moni Orbach
Oleg and Maxim
Vitaly
Michael and Liran