Technion – Israel Institute of Technology
Department of Electrical Engineering
High Speed Digital Systems Lab
Written by: Haim Natan, Benny Pano
Supervisor: Gregory Mironov
Spring 2004
Project no. D0623
Nowadays complex computations are done on a standard processor or a DSP, neither of which is optimized for matrix inversion. To reduce the time spent on matrix inversion, we use dedicated hardware for the inversion. This leaves the CPU free for other tasks while the faster hardware handles the heavy computation.
Design and implement FPGA circuitry that inverts a 625x625 matrix.
– A standalone system
– Matrix size: 625x625
– Matrix elements: 64-bit double-precision floating point
– Calculation time < 20 ms
Suggested Solutions
Two algorithms were considered:
– Linear algorithm of order O(N^3)
– Monte-Carlo algorithm of order O(N^2)
The selected hardware was the Virtex-II Pro.
The selected algorithm was Monte-Carlo.
The Monte-Carlo Algorithm (simplified version)
b_{i,j} := 0;
For c := 1 to N do {
    k_0 := i;
    w_0 := 1;
    For t := 1 to T do {
        k_t := MP(k_{t-1});
        w_t := sign(d_{k_{t-1},k_t}) * w_{t-1} * E_{k_t};
        if k_t = j then b_{i,j} += w_t;
    }
}
b_{i,j} /= N;
N – number of Markov chains
T – length of each chain
b_{i,j} – an element of the inverse matrix
MP() – the chain generator (returns the next state of the chain)
The MC Algorithm (continued)
D = I - A
E_i = Σ_j |d_{i,j}| (the weights vector)
P is a transition probability matrix such that p_{i,j} = |d_{i,j}| / E_i. It is used for generating the Markov chains.
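The pseudocode and the definitions of D, E and P can be tried out in software with a sketch like the one below. It is not the project's C simulation, and the names `mc_inverse_element`, `n_chains` and `chain_len` are hypothetical. Two choices are assumptions rather than a transcription of the slide: each weight update multiplies by E of the state being left, so that every factor equals d/p for the sampled transition, and the t = 0 term is counted so that the identity term of the series I + D + D^2 + ... is included. The sketch also assumes that this series converges for the given matrix.

```python
import random


def mc_inverse_element(A, i, j, n_chains=5000, chain_len=20, seed=0):
    """Estimate element (i, j) of the inverse of A with random walks.

    Indices are 0-based. Assumes the series I + D + D^2 + ... converges.
    """
    rng = random.Random(seed)
    n = len(A)
    # D = I - A
    D = [[(1.0 if r == c else 0.0) - A[r][c] for c in range(n)]
         for r in range(n)]
    # E_i = sum_j |d_ij| (the weights vector)
    E = [sum(abs(d) for d in row) for row in D]

    total = 0.0
    for _ in range(n_chains):
        k, w = i, 1.0
        if k == j:
            total += w  # t = 0 term (assumption, see lead-in)
        for _ in range(chain_len):
            if E[k] == 0.0:
                break  # no outgoing transitions from this state
            # MP(): draw the next state with probability |d_kl| / E_k
            nxt = rng.choices(range(n), weights=[abs(d) for d in D[k]])[0]
            sign = 1.0 if D[k][nxt] >= 0.0 else -1.0
            w *= sign * E[k]  # d/p for the sampled transition
            k = nxt
            if k == j:
                total += w
    return total / n_chains


if __name__ == "__main__":
    A = [[0.8, 0.1], [0.2, 0.9]]
    # Exact inverse is (1/0.7) * [[0.9, -0.1], [-0.2, 0.8]]
    print(mc_inverse_element(A, 0, 0), 0.9 / 0.7)
    print(mc_inverse_element(A, 0, 1), -0.1 / 0.7)
```

Each call estimates a single element, which mirrors how the hardware chain targets one b_{i,j} at a time. The accuracy depends on both the number of chains and the chain length.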
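A Small Demonstration
[Example matrices A, D and P are not preserved; E = (8, 6).]
[Trace table with columns t, rand#, k_t, w_t and b for row 1. Only the initial row survives: rand# = none, k = 1, w = 1, b = 0.]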
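Algorithm's Architecture
[Diagram: a chain of T stages, each consisting of an MP block, a switch (SW) and an accumulator (A). Inputs: k = i, the weights E_1 ... E_n, and an initial accumulator value of 0. Output: b_{i,j}.]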
Switch & Accumulator
Switch (SW)
Inputs: K_in, T_in, E_in, R_in
Outputs: K_out, T_out, E_out, R_out
– E_out = E_in
– R_out = R_in
– K_out = K_in
– If R_in = K_in then T_out = E_in, else T_out = T_in
Accumulator (A)
Inputs: K_in, W_in, T_in, C_in, V_in
Outputs: W_out, C_out, V_out (W_int is an internal signal fed by the multiplier)
– C_out = C_in
– W_out = W_in * T_in
– W_int = W_out
– If C_in = K_in then V_out = V_in + W_int, else V_out = V_in
Architecture Demonstration
Inputs: k = 1, E_1 = 8, E_2 = 6, initial accumulator value 0. Target element: b_{1,2}.
– Stage 1: K_out = 1, T_out = 8, W_out = -8, V_out = 0
– Stage 2: K_out = 2, T_out = 6, W_out = -48, V_out = -48
– Stage 3: W_out = -384
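The switch and accumulator rules can be modelled in a few lines of behavioural Python and chained to check the demonstration numbers. This is only a software sketch, not the VHDL from the project. The function names, the dictionary used to stream (R, E) pairs through the switch, the fixed list of states standing in for the MP blocks, and the starting weight of -1 (chosen only so that the signs match the demonstration) are all assumptions.

```python
def switch(k_in, t_in, r_in, e_in):
    """SW block: latch E onto T when the streamed row index R matches K."""
    t_out = e_in if r_in == k_in else t_in
    # K, R and E pass through unchanged
    return k_in, t_out, r_in, e_in


def accumulator(k_in, w_in, t_in, c_in, v_in):
    """A block: update the chain weight; add it to V if K is the target column C."""
    w_out = w_in * t_in
    w_int = w_out
    v_out = v_in + w_int if c_in == k_in else v_in
    return w_out, c_in, v_out


def run_chain(states, weights, target, w0=1.0):
    """Push a fixed sequence of states through SW + A stages.

    `states` stands in for the MP outputs, `weights` maps row index -> E.
    Returns one (K, T, W, V) tuple per stage.
    """
    w, v = w0, 0.0
    trace = []
    for k in states:
        t = 0.0
        for r, e in weights.items():  # stream every (R, E) pair past the switch
            _, t, _, _ = switch(k, t, r, e)
        w, _, v = accumulator(k, w, t, target, v)
        trace.append((k, t, w, v))
    return trace


if __name__ == "__main__":
    for stage in run_chain([1, 2, 1], {1: 8.0, 2: 6.0}, target=2, w0=-1.0):
        print(stage)
```

With states 1, 2, 1, E_1 = 8, E_2 = 6 and target column 2, the stages produce W = -8, -48, -384 and V = 0, -48, -48, which agrees with the values listed in the demonstration.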
Basic Block Diagram
[Diagram: a RAM holding matrices A and B, connected to the FPGA, which contains the Memory Controller and the Algorithm block. Signals: elements request, elements transfer, read/write.]
Some scales
– 64 bit * 625 * 625 ≈ 3 MB per matrix
– Two matrices are needed: ≈ 6 MB
– 20 ms / 625^2 = 51.2 ns per matrix element, i.e. a rate of about 20 MHz
– For an O(N^3) algorithm the same time budget requires a rate of about 12.2 GHz
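The figures on this slide can be reproduced with a short calculation. The variable names are hypothetical, and the sketch assumes one elementary operation per N^2 (or N^3) step, which is how the slide appears to derive its rates.

```python
N = 625             # matrix dimension
BITS = 64           # bits per double-precision element
DEADLINE_S = 20e-3  # required calculation time: 20 ms

# Storage for one matrix and for the two matrices that are needed
matrix_bytes = BITS // 8 * N * N
two_matrices_bytes = 2 * matrix_bytes

# Time available per matrix element, and the matching rate
time_per_element_s = DEADLINE_S / N**2
rate_n2_hz = 1.0 / time_per_element_s

# Rate needed if the work grows as N^3 instead of N^2
rate_n3_hz = N**3 / DEADLINE_S

print(f"one matrix:       {matrix_bytes / 1e6:.3f} MB")
print(f"two matrices:     {two_matrices_bytes / 1e6:.3f} MB")
print(f"time per element: {time_per_element_s * 1e9:.1f} ns")
print(f"O(N^2) rate:      {rate_n2_hz / 1e6:.1f} MHz")
print(f"O(N^3) rate:      {rate_n3_hz / 1e9:.1f} GHz")
```

The exact values are 3.125 MB per matrix and roughly 19.5 MHz, which the slide rounds to 3 MB and 20 MHz.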
Encountered obstacles
– Studying the Monte-Carlo algorithm and some of its mathematical basics.
– The architecture requires a lot of FPGA cells.
– Finding a floating-point library and adapting it to our needs.
– Getting to know all the software used in FPGA development.
Encountered obstacles (cont.)
– The floating-point units have a large delay (130 ns for the division unit alone).
– The Monte-Carlo algorithm needs delicate tuning and a lot of iterations to achieve reasonable accuracy.
– A very wide bus is needed to transfer the matrix elements.
Project achievements
– Studied the Monte-Carlo algorithm and its architecture.
– Wrote a C simulation to check the Monte-Carlo method.
– Studied the VHDL language.
– Found a floating-point library and adapted it to the project's needs.
– Ran a simulation of the floating-point unit.
Project achievements (cont.)
– Implemented the switch and accumulator blocks in VHDL.
– Implemented a basic chain using the switch and accumulator blocks.
– Implemented a circuit that uses the floating-point library and loaded it onto the Virtex-II Pro (V2P).
Things to do
– Implement the MP block, the memory controller and the computation control circuit.
– Reduce the floating-point delays.
– Design a communication interface to load and send the matrix.