Matrix Multiplication on SOPC

Presentation transcript:

1 Matrix Multiplication on SOPC. Project instructor: Ina Rivkin. Students: Shai Amara, Shuki Gulzari. Project duration: one semester.

2 Project Goals:
- Implementing a matrix multiplication IP. The IP multiplies an N x M matrix A by an M x L matrix B and produces an N x L result matrix.
- Integrating the IP on a system on programmable chip (SOPC).

3 Specification of the Matrix IP
- The matrix dimensions N, M, L can vary from 1 to 127.
- The entries of the two multiplied matrices are 16-bit signed values, ranging from -2^15 up to +2^15 - 1.
- The entries of the result matrix are 32-bit integers, ranging from -2^31 up to +2^31 - 1.

4 Specification of the Matrix IP - cont'
- Each matrix is stored in a separate address range of the IP.
- An address range is limited to a maximum of 64 KB, which is enough for 16K (2^14) 32-bit integers, so the largest square matrix that fits is 128x128; our maximum is 127x127.
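
As a quick sanity check of the capacity arithmetic above, here is a minimal C sketch (not part of the original design files):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const unsigned range_bytes = 64 * 1024;                /* one IP address range: 64 KB */
        const unsigned words = range_bytes / sizeof(int32_t);  /* 32-bit integers per range   */
        printf("words per range: %u\n", words);                /* prints 16384 = 2^14         */
        /* A square NxN matrix needs N*N words, so N <= 128. */
        assert(128 * 128 == words);
        /* The N, M, L size fields are 7 bits wide, so the design caps them at 2^7 - 1 = 127. */
        assert((1 << 7) - 1 == 127);
        return 0;
    }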

5 Implementation - general hardware scheme (block diagram): the processor, the Matrix Multiplication IP, a PLB/OPB bridge and a UART, connected over the PLB and OPB buses.

6 IP's inner address ranges
- The IP has 3 address ranges: AR0 (matrix A), AR1 (matrix B) and AR2 (the result matrix).
- The CPU is only allowed to write to AR0 and AR1, and only to read from AR2.
- The IP's FSM is only allowed to read from AR0 and AR1, and only to write to AR2.
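
A minimal C sketch of how the CPU side might access these ranges. The base addresses (AR0_BASE, AR1_BASE, AR2_BASE) are hypothetical placeholders for the real SOPC address map, and row-major element ordering is assumed; it matches the FSM's address arithmetic shown on slide 12:

    #include <stdint.h>

    /* Hypothetical base addresses -- the real values come from the SOPC address map. */
    #define AR0_BASE 0x40000000u  /* matrix A: write-only from the CPU's point of view */
    #define AR1_BASE 0x40010000u  /* matrix B: write-only from the CPU's point of view */
    #define AR2_BASE 0x40020000u  /* result matrix: read-only for the CPU              */

    static volatile int32_t *const ar0 = (volatile int32_t *)AR0_BASE;
    static volatile int32_t *const ar1 = (volatile int32_t *)AR1_BASE;
    static volatile const int32_t *const ar2 = (volatile const int32_t *)AR2_BASE;

    /* Write A (N x M) and B (M x L) into the IP, assuming row-major ordering:
       element (r, c) of an R x C matrix goes to offset r*C + c. */
    void load_operands(const int32_t *A, const int32_t *B,
                       unsigned n, unsigned m, unsigned l) {
        for (unsigned r = 0; r < n; r++)
            for (unsigned c = 0; c < m; c++)
                ar0[r * m + c] = A[r * m + c];
        for (unsigned r = 0; r < m; r++)
            for (unsigned c = 0; c < l; c++)
                ar1[r * l + c] = B[r * l + c];
    }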

7 General implementation idea - block diagram of the Matrix Multiplication unit: an FSM controls a memory block holding matrix A, matrix B and the result matrix, and drives a multiply-accumulate datapath (Mult, Accum, registers R0 and R1). The interface carries clock, address, data and write-enable signals, a start signal with the sizes of the matrices, a data output and a finish bit.

8 Actual Implementation Block diagram

9 Implementation - a simple example. First, the processor writes the two matrices into the IP's 1st and 2nd address ranges (the slide shows the example element values placed at consecutive addresses, starting at 0x0, in ADDRESS RANGE 0 and ADDRESS RANGE 1).

10 A simple example - continued
- Secondly, the processor writes the matrix sizes (N, M, L) and a start bit to the IP's inner register. The register format shown on the slide contains the Start bit and the N, M and L fields, with the remaining high bits marked don't care.
- * The IP's FSM reads N, M, L as unsigned numbers, so the maximum size for each of them is 2^7 - 1 = 127.

11 A simple example - continued
- In our example the sizes could be 2x2 and 2x1, or 4x1 and 1x2.
- Let's take the case of 2x2 and 2x1.
- The inner register is then written with N = 2, M = 2, L = 1 and the start bit set (the slide shows the resulting bit pattern).
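
A minimal C sketch of how such a control word might be packed. The exact bit positions are an assumption reconstructed from the slide (Start in bit 0, then the 7-bit N, M and L fields, upper bits don't care); the real offsets are defined by the IP's register map:

    #include <stdint.h>

    /* Assumed layout: Start at bit 0, N at bits 1..7, M at bits 8..14, L at bits 15..21. */
    uint32_t pack_control(uint32_t n, uint32_t m, uint32_t l) {
        return (l << 15) | (m << 8) | (n << 1) | 1u;  /* bit 0 = start */
    }

    /* Example from the slide: A is 2x2, B is 2x1 -> N = 2, M = 2, L = 1.
       pack_control(2, 2, 1) == 0x00008205 under the assumed layout.     */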

12 IP's FSM (state diagram). The FSM waits in Idle while start = '0'. When start = '1' it latches the sizes into n, m, l and initializes row, col and i to 0. For each (row, col) pair it asserts Sel_A and Sel_B, reads matrix A at address i + row*m and matrix B at address i*l + col, and accumulates their product while i < m-1, incrementing i each cycle. When i = m-1 it asserts WE, writes the accumulated value (Data_out <= data_in) to result address row*l + col, and advances col; when col = l-1 it resets col and advances row. When row = n-1 and col = l-1 the FSM sets the finish bit and returns to Idle.
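
For reference, here is a short C model of the same computation and address arithmetic the FSM performs (a behavioural sketch, not the hardware implementation):

    #include <stdint.h>

    /* Behavioural model of the FSM: A is n x m, B is m x l, the result is n x l.
       Addresses match the FSM: A[row][i] at i + row*m, B[i][col] at i*l + col,
       result[row][col] at row*l + col. */
    void fsm_model(const int32_t *ar0, const int32_t *ar1, int32_t *ar2,
                   unsigned n, unsigned m, unsigned l) {
        for (unsigned row = 0; row < n; row++) {
            for (unsigned col = 0; col < l; col++) {
                int32_t accum = 0;
                for (unsigned i = 0; i < m; i++)              /* multiply-accumulate loop */
                    accum += ar0[i + row * m] * ar1[i * l + col];
                ar2[row * l + col] = accum;                   /* WE asserted at i = m-1   */
            }
        }
    }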

13 Example - continued. In our example the FSM steps through the addresses of AR0 and AR1, feeds each pair of elements into the Xilinx multiplier-accumulator, and writes the accumulated results into AR2 (the slide shows the sequence of addresses involved).

14 Example - continued. And indeed, the result written by the IP matches the expected product of the example matrices (shown on the slide).

15 Implementation - continued
- The result matrix is saved in the IP's third address range. The IP informs the processor that the task is complete by asserting a finish bit, which the CPU polls.
- After the CPU reads that the finish bit = 1, it can read the result matrix from the IP.
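
A minimal C sketch of the polling sequence on the CPU side. The status register address and the finish-bit position are hypothetical placeholders, not the project's actual register map:

    #include <stdint.h>

    #define STATUS_REG ((volatile uint32_t *)0x40030000u)  /* hypothetical status register */
    #define FINISH_BIT 0x1u                                /* hypothetical bit position    */

    /* The start bit has already been written; busy-wait until the IP reports
       completion, then copy the n x l result out of AR2. */
    void wait_and_read_result(volatile const int32_t *ar2, int32_t *C,
                              unsigned n, unsigned l) {
        while ((*STATUS_REG & FINISH_BIT) == 0)
            ;  /* polling loop -- each iteration is one "visit" counted on slide 19 */
        for (unsigned k = 0; k < n * l; k++)
            C[k] = ar2[k];
    }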

16 The verification process
- For sizes of up to 16x16, validation was done by allocating memory and filling matrices A and B with random values.
- The validation was simply a comparison between matrix C (the IP's result) and matrix D (the expected result).
- For larger sizes we ran into a problem allocating large memory buffers in software.
- So instead of allocating memory we used A[i][j] = i + j and B[i][j] = i - j, and compared the IP's output against the known result.
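
A sketch in C of how that second check can be done without storing A and B; the original test code is not shown in the slides, so the element-by-element comparison below is an assumption:

    #include <stdint.h>
    #include <stdio.h>

    /* Check the IP's output C (n x l) against the pattern A[i][j] = i + j,
       B[i][j] = i - j, computing each expected entry on the fly. */
    int check_pattern_result(const int32_t *C, unsigned n, unsigned m, unsigned l) {
        for (unsigned i = 0; i < n; i++) {
            for (unsigned j = 0; j < l; j++) {
                int32_t expected = 0;
                for (unsigned k = 0; k < m; k++)
                    expected += (int32_t)(i + k) * ((int32_t)k - (int32_t)j);  /* A[i][k]*B[k][j] */
                if (C[i * l + j] != expected) {
                    printf("mismatch at (%u,%u)\n", i, j);
                    return 0;
                }
            }
        }
        return 1;  /* all entries match */
    }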

17 Performance analysis
- The number of clock cycles the state machine takes is { [ (3*M + 2) * L ] + 2 } * N + 3 = 3*M*L*N + 2*L*N + 2*N + 3 = O(N*M*L).
- Total: O(N*M*L) clock cycles.
- Since it was difficult to determine the exact number of clock cycles the software version takes, we conducted a comparison in software that gives a good indication of our hardware's performance.
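
To get a concrete feel for the formula, here is a tiny C helper (mine, not from the slides) that evaluates it; for the largest supported case N = M = L = 127 it gives 6,177,664 cycles:

    #include <stdint.h>
    #include <stdio.h>

    /* Clock cycles taken by the IP's FSM, per the formula on slide 17. */
    uint64_t fsm_cycles(uint64_t n, uint64_t m, uint64_t l) {
        return (((3 * m + 2) * l) + 2) * n + 3;   /* = 3*M*L*N + 2*L*N + 2*N + 3 */
    }

    int main(void) {
        printf("%llu\n", (unsigned long long)fsm_cycles(127, 127, 127));  /* 6177664 */
        return 0;
    }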

18 Performance analysis - continued
- In software the calculation is:

    for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
            for (k = 0; k < M; k++)
                C[i][j] += A[i][k] * B[k][j];

- In this implementation the CPU enters the innermost loop body N*M*L times (iterations, not clock cycles)!

19 Performance analysis - continued
- In order to compare this to our IP's performance, we counted the number of times we "visit" the body of the while() loop in which we wait for the finish signal.
- The following graph shows a comparison of the number of CPU operations for square matrices of sizes 2x2 up to 15x15.

20 Performance analysis - comparison results (graph). Conclusion: our IP provides an excellent solution for applications that require many multiplications of large matrices!

21 Improvement suggestions
- For better performance, additional multipliers can be added to the design, so that more numbers are multiplied in each cycle and the calculation time is shortened.
- Using an interrupt instead of polling would also save valuable CPU time (see the sketch below).
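
A minimal sketch of the interrupt-driven alternative. The ISR name and the registration step are entirely hypothetical; an actual port would hook this into the system's interrupt controller driver and the IP's own interrupt/acknowledge registers:

    #include <stdint.h>

    static volatile int mult_done = 0;

    /* Hypothetical ISR: invoked when the IP raises its finish interrupt. */
    void matrix_mult_isr(void *context) {
        (void)context;
        mult_done = 1;  /* flag checked (or slept on) by the main program */
        /* acknowledge/clear the interrupt in the IP here, per its register map */
    }

    /* Instead of spinning on the finish bit, the CPU registers matrix_mult_isr
       with the interrupt controller, starts the IP, and is free to do other
       work until mult_done becomes 1. */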

22 Thank you !