Lab 2: Parallel Processing Using NIOS II Processors


Lab 2: Parallel Processing Using NIOS II Processors
CEG 4131 Computer Architecture III
Miodrag Bolic

Overview
You will learn how to:
- Design multiprocessing systems that use shared memory
- Partition a sequential program so that it can be implemented on multiple processors
- Synchronize a multiprocessing system
Time: 3 weeks
Points: 115 (there is an optional task)

Overview
Part 1
- Design a multiprocessing system by following the steps from the tutorial.
- Run and debug the program that comes with the tutorial.
Part 2
- Use the same hardware designed in Part 1.
- Develop a program for parallel matrix multiplication and run it on the multiprocessing system.
- Compute the speedup of the program when it runs on a single processor and on the multiprocessing system.

Part 1
1. Copy the project C:\altera\kits\nios2\examples\vhdl\niosII_stratix_1s10\standard to your home directory.
2. Go through the steps of the "Creating Multiprocessor Nios II Systems" tutorial. You can download the tutorial from http://www.altera.com/literature/tt/tt_nios2_multiprocessor_tutorial.pdf and the accompanying program from http://www.altera.com/literature/tt/hello_world_multi.c.
3. Modification: on page 30 of the tutorial, choose the NIOS II/s core for CPU3 instead of NIOS II/e. All three cores have to be NIOS II/s. Change the instruction cache size for all three of them to 4 KB.
4. Before generating and compiling on page 36 of the tutorial, do the following:
   - Add a performance counter in the same way as in Lab 1. Connect the performance_counter only to the data master of CPU1.
   - Add an on-chip memory block and configure it as shown on the next page. Connect the s1 port to cpu1/data_master and cpu2/data_master. Connect the s2 port to cpu3/data_master.
5. Continue with the tutorial.

On-chip memory configuration

Task 1 – Demonstration and Questions
Show the TA that the program is working (20 points).
Questions:
1. Describe the program in detail.
2. Why do we need a mutex?
3. If processor 1 holds the mutex for the memory message_buffer_ram, can processor 2 write to this memory before processor 1 releases the mutex?
4. Can processor 1 store two messages in the buffer?
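For reference, the Altera HAL drives the hardware mutex through the altera_avalon_mutex.h API (altera_avalon_mutex_open, altera_avalon_mutex_lock, altera_avalon_mutex_unlock). The sketch below shows the basic open/lock/unlock pattern; the device name "/dev/message_buffer_mutex" and the lock value are assumptions, so use the mutex name your system.h actually generates.

    #include <stdio.h>
    #include <altera_avalon_mutex.h>

    int main(void)
    {
        /* Open the hardware mutex by its HAL device name (assumed here;
           check system.h for the name SOPC Builder generated). */
        alt_mutex_dev* mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex");
        if (mutex == NULL) {
            printf("Could not open the mutex\n");
            return 1;
        }

        /* Lock with a nonzero owner value (commonly the CPU's ID);
           this call spins until the mutex is acquired. */
        altera_avalon_mutex_lock(mutex, 1);

        /* ... critical section: read or write message_buffer_ram here ... */

        /* Release the mutex so the other processors can take it. */
        altera_avalon_mutex_unlock(mutex);

        return 0;
    }

Note that the lock is cooperative: it serializes access only among processors that all agree to take the mutex before touching the buffer.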

Part 2
In this part, the same hardware configuration will be used. You will design a program for parallel matrix multiplication.
Problem: There is an input/output module which receives data and stores it in matrices M1 and M2. We will simulate this module using the shared_memory module that we added in the first part of the lab. Our program multiplies these two matrices and stores the result C in the same module (memory).

Sequential solution
1. Program the Altera chip using the same configuration from Part 1.
2. Modify the matrix_performance.c file so that matrices M1, M2 and C are transferred to the shared_memory. Do this step before activating the performance counter.
3. Change the number of iterations in the matrix multiplication from 100 to 1000.
4. Change the C/C++ build configuration of your project and the syslib project from Debug to Release.
5. Run the code and present the performance counter result and the matrix C obtained in iteration 1000.
Demonstration: show the result to the TA.
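For orientation, here is a minimal sketch of the timing harness around the sequential multiplication, assuming the macros from altera_avalon_performance_counter.h and a performance counter named PERFORMANCE_COUNTER in system.h; the matrix initialization and the transfer to shared_memory are elided.

    #include <stdio.h>
    #include "system.h"
    #include "altera_avalon_performance_counter.h"

    #define N 10
    #define ITERATIONS 1000

    int M1[N][N], M2[N][N], C[N][N];  /* transfer these to shared_memory first */

    int main(void)
    {
        int i, j, k, it;

        /* ... copy M1, M2 (and space for C) into the shared_memory here ... */

        PERF_RESET(PERFORMANCE_COUNTER_BASE);
        PERF_START_MEASURING(PERFORMANCE_COUNTER_BASE);

        for (it = 0; it < ITERATIONS; it++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++) {
                    C[i][j] = 0;
                    for (k = 0; k < N; k++)
                        C[i][j] += M1[i][k] * M2[k][j];
                }

        PERF_STOP_MEASURING(PERFORMANCE_COUNTER_BASE);
        printf("clock cycles: %llu\n",
               (unsigned long long) perf_get_total_time((void*) PERFORMANCE_COUNTER_BASE));
        return 0;
    }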

Parallel solution
CPU1 will be used for synchronization and for I/O operations, while CPU2 and CPU3 are used for multiplication. CPU2 and CPU3 operate in single program, multiple data (SPMD) mode. This means that they start the iterations at the same time and execute the same code, but on different data. After they finish the multiplication, they signal CPU1. The program will repeat the matrix multiplication 1000 times.

Parallel matrix multiplication
CPU1 transfers M1 and M2 to the shared_memory.
Algorithm: the sequential program is shown below. In the parallel implementation, CPU2 will execute the i loop from 0 to 4, and CPU3 will execute the i loop from 5 to 9. CPU2 and CPU3 will perform their operations at the same time.

    for (i = 0; i <= 9; i++) {
        for (j = 0; j <= 9; j++) {
            C[i][j] = 0;
            for (k = 0; k <= 9; k++) {
                C[i][j] += M1[i][k] * M2[k][j];
            }
        }
    }
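One way to express this split is a single kernel shared by both workers and parameterized by the row range, as sketched below; multiply_rows and SHARED_MEMORY_BASE are illustrative names (use the base-address macro your system.h generates for the on-chip memory from Part 1).

    #include "system.h"

    #define N 10

    /* Assumed layout: M1, M2 and C packed back to back in the shared memory. */
    #define M1 ((volatile int (*)[N]) (SHARED_MEMORY_BASE))
    #define M2 ((volatile int (*)[N]) (SHARED_MEMORY_BASE + N * N * sizeof(int)))
    #define C  ((volatile int (*)[N]) (SHARED_MEMORY_BASE + 2 * N * N * sizeof(int)))

    /* SPMD kernel: CPU2 calls multiply_rows(0, 4), CPU3 calls multiply_rows(5, 9). */
    void multiply_rows(int row_start, int row_end)
    {
        int i, j, k;
        for (i = row_start; i <= row_end; i++)
            for (j = 0; j < N; j++) {
                int sum = 0;  /* accumulate locally, write the result once */
                for (k = 0; k < N; k++)
                    sum += M1[i][k] * M2[k][j];
                C[i][j] = sum;
            }
    }

Because the two workers write disjoint rows of C, the computation itself needs no locking.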

Synchronization
Variables status_start and status_done will be shared variables used for synchronization. All three processors will access these variables using the mutex. They will be stored in the message_buffer_ram memory.
It is extremely important that CPU2 and CPU3 start the matrix multiplication at the same time. This will not happen automatically, since they are booted from the same memory, so CPU1 has to ensure that both CPU2 and CPU3 start at the same time. The shared variable status_start will be used for that: CPU1 sets this variable to 1, and CPU2 and CPU3 each increment it before they start the matrix multiplication. When status_start reaches 3, CPU2 and CPU3 start the matrix multiplication and CPU1 starts measuring time using the performance_counter.
At the beginning, CPU1 sets status_done to 1. After CPU2 and CPU3 finish the 1000 iterations of the 10x10 matrix multiplication, they each increment status_done. CPU1 periodically reads status_done, and when it reaches 3, the program is over. CPU1 stops the performance counter and prints the performance counter result and the matrix C from the 1000th iteration on the terminal.
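A worker-side sketch of this counting barrier is shown below, assuming the shared variables sit at fixed word offsets at the start of message_buffer_ram; MESSAGE_BUFFER_RAM_BASE is the assumed system.h name for that memory, and the offsets are likewise assumptions.

    #include "system.h"
    #include <altera_avalon_mutex.h>

    /* Assumed layout of the shared synchronization variables. */
    volatile int* const status_start = (volatile int*) (MESSAGE_BUFFER_RAM_BASE + 0);
    volatile int* const status_done  = (volatile int*) (MESSAGE_BUFFER_RAM_BASE + 4);

    /* Worker side: check in under the mutex, then spin until CPU1's initial 1
       plus both workers' increments add up to 3. */
    void worker_wait_for_start(alt_mutex_dev* mutex, int cpu_id)
    {
        altera_avalon_mutex_lock(mutex, cpu_id);
        (*status_start)++;
        altera_avalon_mutex_unlock(mutex);

        while (*status_start < 3)
            ;  /* busy-wait; the NIOS II/s core has no data cache, so these
                  reads observe the other processors' writes */
    }

    /* Worker side: report completion of the 1000 iterations. */
    void worker_signal_done(alt_mutex_dev* mutex, int cpu_id)
    {
        altera_avalon_mutex_lock(mutex, cpu_id);
        (*status_done)++;
        altera_avalon_mutex_unlock(mutex);
    }

CPU1 mirrors this: it writes the initial 1 into each variable, polls status_start before starting the performance counter, and polls status_done to know when to stop it.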

Task 2 – Questions
1. What is the speedup if we compare the sequential and parallel implementations? Comment on the speedup result.
2. Why can we design the program for matrix multiplication without using mutexes (except for synchronization)?

Demonstration (40 points)
Send the matrix C from the 1000th iteration of the matrix multiplication algorithm to the terminal through the JTAG UART. Also send the number of clock cycles from the performance counter. Show this result to the TA. Explain to the TA how your parallel matrix multiplication program works and how you achieved synchronization. You will lose 10 points if the speedup is less than 1.

Optional part – Synchronization
If our program emulated a real system, CPU1 would synchronize CPU2 and CPU3 after each iteration of the 10x10 matrix multiplication, not after all 1000 of them. So, in a real program, after each 10x10 matrix multiplication CPU1 would perform some operations on the computed matrix C and initialize a new iteration of the 10x10 matrix multiplication once matrices M1 and M2 are ready.
In this part of the lab, you will use the iteration_done variable to notify CPU1 that one iteration of the 10x10 matrix multiplication is done. An additional shared variable is needed to start the next iteration; let's call it start_next_iteration. The program works as follows. At the beginning, CPU1 sets start_next_iteration. After a 10x10 multiplication iteration starts, CPU2 and CPU3 reset this variable. After CPU2 and CPU3 are done with the execution of their part of the 10x10 matrix multiplication, they increment iteration_done and wait for start_next_iteration to be set. CPU1 checks whether iteration_done is equal to 3, and if it is, CPU1 sets start_next_iteration. The new iteration of the 10x10 matrix multiplication can then start.
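A sketch of CPU1's side of this per-iteration handshake follows, under the same assumed message_buffer_ram layout as before (iteration_done and start_next_iteration at the next two word offsets; all names and offsets are illustrative, and the worker side is elided).

    #include "system.h"
    #include <altera_avalon_mutex.h>

    volatile int* const iteration_done       = (volatile int*) (MESSAGE_BUFFER_RAM_BASE + 8);
    volatile int* const start_next_iteration = (volatile int*) (MESSAGE_BUFFER_RAM_BASE + 12);

    /* CPU1's control loop: one handshake per 10x10 multiplication. */
    void cpu1_control_loop(alt_mutex_dev* mutex)
    {
        int it;
        for (it = 0; it < 1000; it++) {
            altera_avalon_mutex_lock(mutex, 1);
            *iteration_done = 1;        /* CPU1's own contribution to the count */
            *start_next_iteration = 1;  /* release CPU2 and CPU3 */
            altera_avalon_mutex_unlock(mutex);

            /* Wait until both workers have incremented iteration_done. */
            while (*iteration_done < 3)
                ;

            /* ... operate on the computed matrix C here ... */
        }
    }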

Optional part – Demonstration and Questions
Question: What is the speedup of this program?
Demonstration (10 optional points): Send the sum of the elements of matrix C for each iteration of the 10x10 matrix multiplication algorithm to the terminal through the JTAG UART. Also send the number of clock cycles from the performance counter. Show this result to the TA. Explain to them how you achieved synchronization.

What to submit
The report contains the following (30 points):
- Title page
- Description of your system with a picture of the SOPC Builder system components
- Detailed description of your solution: the algorithm for parallel matrix multiplication and the synchronization
- Answers to the questions from Tasks 1 and 2
- Conclusions
- Page 17 of this document (the signature page) signed by the TA
Soft copies of the report and of the source code of the programs for sequential and parallel multiplication with basic comments (*.c files), plus the Quartus II files *.sof and *.ptf (10 points).
Optional: description of the synchronization method and speedup for the optional part as a part of the report, and a soft copy of the algorithm for matrix multiplication (5 points).

Lab 2 – Signature page
Student name: ________________

                     | Demonstrated (TA's signature) | Performance counter result (time) | Points
Part 1               |                               |                                   | ____/20
Part 2 (sequential)  |                               |                                   |
Part 2 (parallel)    |                               |                                   | ____/40
Part 2 (optional)    |                               |                                   | ____/10
Total                |                               |                                   | ____