Fast Communication for Multi – Core SOPC Technion – Israel Institute of Technology Department of Electrical Engineering High Speed Digital Systems Lab.


Fast Communication for Multi-Core SOPC. Technion - Israel Institute of Technology, Department of Electrical Engineering, High Speed Digital Systems Lab. Supervisor: Evgeny Fiksman. Performed by: Moshe Bino, Alex Tikh. Spring ’st Semester Presentation

Table of Contents: Introduction, Hardware Design, Software Design, Debug Process, Time Table.

Introduction

Problem statement: The single CPU is reaching its technological limits, e.g. heat dissipation and cost/power ratio. Parallel computing has therefore evolved around the multi-core processor paradigm. The three major inter-processor communication techniques are message passing, shared memory, and remote procedure calls.

Project description: A multi-core system of four MicroBlaze processors is to be built on a Xilinx FPGA. The message-passing model is chosen for inter-processor communication and is implemented according to the MPI library specification. The Network-on-Chip (NoC) methodology is employed to interconnect the cores, and a dedicated NoC router is implemented.

[Figure: Project description]

Project description: The project is a basic SoPC platform for programmable chips. The system can be configured either as a multi-core processor that efficiently handles designated tasks, or as a group of hardware accelerators that support the main processor unit. The system can be expanded into a larger network, depending on the device resources. The system provides relatively high and flexible computation power on a small device, board, etc.

Project goals: The following components are to be implemented: a quad-core system; a 4-port NoC router and the infrastructure for fast communication in the multi-core system; selected MPI functions, written in C; and a software application, also written in C, that demonstrates the advantages of a parallel system.

System specifications

Constraints: The FPGA (V2P) maximum clock frequency is 400 MHz. The MicroBlaze core maximum frequency is 100 MHz. The processor memory size is 64 kbyte (code + data). The processor-to-FSL access time is 3 clock cycles. The maximum FSL buffer depth is 0.5 kbyte. The interrupt handling time is 20 clock cycles (no interrupt nesting).

Preferences: The router works at maximum frequency. The router is designed for relatively small messages, at most 1 kbyte, because of the processor memory size.
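The figures above can be turned into a quick sizing check. The sketch below is only arithmetic on the quoted numbers; the function names are hypothetical, and it assumes one FSL access per 32-bit word with no stalls, which the slides do not state.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted on the specification slide. */
enum {
    WORD_BYTES        = 4,    /* each message word is 32 bits */
    FSL_ACCESS_CYCLES = 3,    /* processor-to-FSL access time */
    MB_CLOCK_MHZ      = 100   /* MicroBlaze maximum clock frequency */
};

/* Number of 32-bit words needed to carry `bytes` bytes (rounded up). */
uint32_t words_for_bytes(uint32_t bytes)
{
    assert(bytes <= UINT32_MAX - (WORD_BYTES - 1));
    return (bytes + WORD_BYTES - 1) / WORD_BYTES;
}

/* Lower bound on the cycles a core spends moving `words` words through its
 * FSL, ASSUMING one access per word and no stalls. */
uint32_t fsl_transfer_cycles(uint32_t words)
{
    assert(words <= UINT32_MAX / FSL_ACCESS_CYCLES);
    return words * FSL_ACCESS_CYCLES;
}

/* Convert MicroBlaze clock cycles to nanoseconds (10 ns per cycle). */
uint32_t cycles_to_ns(uint32_t cycles)
{
    assert(cycles <= UINT32_MAX / (1000 / MB_CLOCK_MHZ));
    return cycles * (1000 / MB_CLOCK_MHZ);
}
```

Under these assumptions a 1 kbyte message is 256 words, twice the 128 words that fit in a 0.5 kbyte FSL buffer, and takes at least 768 cycles (7.68 microseconds at 100 MHz) to push into the FSL.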

MPI - Message Passing Interface: MPI is a language-independent library specification for message passing, proposed as a standard by a broadly based committee of vendors, implementers, and users. It is designed for high performance on both massively parallel machines and workstation clusters. MPI is widely available, with both freely available and vendor-supplied implementations.

Message structure: Each word is 32 bits. The upper word is the Header, the lower word is the Tail, and the data is located in between.

Message payload

The Header consists of the following fields (name, size in bits, bit order, description):
H, 1 bit, bit 0: represents the Header.
DST, 4 bits, bits 1:4: the message destination within the COMM.
COMM, 4 bits, bits 5:8: the group of cores of the message destination.
CMD, 4 bits, bits 9:12: the command name for this message (Send, Bcast).
TYPE, 4 bits, bits 13:16: the data type in this message.
DATA CNT, 10 bits, bits 17:26: the number of words in this message.

The Tail consists of the following fields (name, size in bits, bit order, description):
T, 1 bit, bit 0: represents the Tail.
SRC, 4 bits, bits 1:4: the message source port within its SCOMM.
SCOMM, 4 bits, bits 5:8: the group of cores of the message source port.
TAG, 11 bits, bits 9:19: the message code, which groups messages on the same topic/issue.

* Empty fields were left to allow network and functionality extensions.
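The Header layout can be expressed as a pair of pack/unpack helpers. This is a minimal sketch, not the project's code: it assumes that bit 0 is the least significant bit of the 32-bit word and that the H flag is set to 1, neither of which the slides confirm, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Header fields from the table (H flag excluded; it is set by header_pack). */
typedef struct {
    unsigned dst;       /* 4 bits, bits 1:4   */
    unsigned comm;      /* 4 bits, bits 5:8   */
    unsigned cmd;       /* 4 bits, bits 9:12  */
    unsigned type;      /* 4 bits, bits 13:16 */
    unsigned data_cnt;  /* 10 bits, bits 17:26 */
} msg_header;

/* Place `value` in a field of `width` bits starting at bit `shift`. */
uint32_t field_put(uint32_t value, unsigned shift, unsigned width)
{
    uint32_t mask = (1u << width) - 1u;
    assert(value <= mask);          /* value must fit in the field */
    return (value & mask) << shift;
}

/* Extract a field of `width` bits starting at bit `shift`. */
uint32_t field_get(uint32_t word, unsigned shift, unsigned width)
{
    return (word >> shift) & ((1u << width) - 1u);
}

/* Build the Header word; ASSUMES bit 0 is the LSB and H = 1. */
uint32_t header_pack(msg_header h)
{
    return field_put(1u, 0, 1)
         | field_put(h.dst, 1, 4)
         | field_put(h.comm, 5, 4)
         | field_put(h.cmd, 9, 4)
         | field_put(h.type, 13, 4)
         | field_put(h.data_cnt, 17, 10);
}

/* Recover the fields from a Header word. */
msg_header header_unpack(uint32_t word)
{
    msg_header h;
    h.dst      = field_get(word, 1, 4);
    h.comm     = field_get(word, 5, 4);
    h.cmd      = field_get(word, 9, 4);
    h.type     = field_get(word, 13, 4);
    h.data_cnt = field_get(word, 17, 10);
    return h;
}
```

The 10-bit DATA CNT field can count up to 1023 words, which comfortably covers the 1 kbyte (256-word) message limit. The Tail word could be handled the same way with its own field positions.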

[Figure: Block diagram]

Hardware Design

Router Implementation

Router specification: The router consists of one major block, the Cross Bar. The Cross Bar is a network switch that switches data across multiple ports. It uses an efficient arbiter based on a Round Robin mechanism. The Cross Bar supports port-to-port message passing and broadcasting (not simultaneously). The Cross Bar comprises two main units: 1. Permission unit. 2. Port FSM (one for each port).

[Figure: Cross Bar]

Permission process: The Round Robin arbiter services the ports in a cyclic order. It checks that the destination is not busy and grants permission for one 'time slot'. If the current port is not requesting, the next requesting port is serviced. The BUSY and LAST writing ports are saved.

Timer Unit: A timing generator that enables each port for a constant 'time slot'. When the 'Permit' input is de-asserted, the present time slot is switched to the next requesting port. If all ports request permission, priority is given by port order. The unit selects the relevant Req signal for the Controller.

Controller: Checks whether the enabled port requests permission. Checks busy ports against the last writing port. Keeps permitting the last source port until message delivery ends. Updates the busy and last-writing-port signals.
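The permission logic described above (round-robin order, busy destinations, and the last writing port keeping its destination) can be modelled in software. The following C sketch is a behavioural illustration only, not the VHDL design; the structure and function names are hypothetical, and time-slot timing is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_PORTS 4

typedef struct {
    bool request[NUM_PORTS];      /* port i is asking for permission        */
    int  dest[NUM_PORTS];         /* destination requested by port i        */
    bool busy[NUM_PORTS];         /* destination j is in the middle of a message */
    int  last_writer[NUM_PORTS];  /* port writing to j, valid when busy[j]  */
} arbiter_state;

/* Pick the port to serve in the next time slot, scanning in cyclic order
 * starting after `current`. A port is eligible when it requests permission
 * and its destination is either free or already owned by that same port.
 * Returns the port index, or -1 when no port is eligible. */
int arbiter_next(const arbiter_state *s, int current)
{
    assert(s != NULL);
    assert(current >= 0 && current < NUM_PORTS);
    for (int step = 1; step <= NUM_PORTS; ++step) {
        int port = (current + step) % NUM_PORTS;
        if (!s->request[port]) {
            continue;                     /* nothing to send: skip, no delay */
        }
        int d = s->dest[port];
        assert(d >= 0 && d < NUM_PORTS);
        if (!s->busy[d] || s->last_writer[d] == port) {
            return port;
        }
    }
    return -1;
}
```

A caller would mark `busy[d]` and `last_writer[d]` when a port starts writing to destination `d`, and clear `busy[d]` once the Tail has been delivered.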

Port FSM: The destination is extracted from the Header and the request is asserted high. Permission is checked before any state transition. When permission is granted, the message is delivered to the destination until the Tail is found. In BCAST, each word read is sent to every destination port in a loop, and the ports already written are saved. The request is de-asserted at the end.
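The port behaviour can be sketched as a small state machine in C. This is an illustrative software model, not the router's VHDL: the state names are hypothetical, the destination is read from bits 1:4 of the Header assuming bit 0 is the LSB, Header and Tail are recognised by the FSL control bit, and the BCAST loop is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { PORT_IDLE, PORT_REQUEST, PORT_DELIVER } port_state;

typedef struct {
    port_state state;
    unsigned   dest;     /* destination taken from the Header (bits 1:4) */
    bool       request;  /* request line towards the permission unit     */
} port_fsm;

typedef struct {
    uint32_t data;
    bool     control;    /* FSL control bit: set on Header and Tail words */
} fsl_word;

/* Advance the FSM by one step. `head` is the word at the front of the
 * port's input FIFO. Returns true when that word was forwarded to the
 * destination (and may be popped), false when it must stay in the FIFO. */
bool port_fsm_step(port_fsm *p, fsl_word head, bool permit)
{
    assert(p != NULL);
    switch (p->state) {
    case PORT_IDLE:
        if (head.control) {               /* a Header arrived */
            p->dest    = (head.data >> 1) & 0xFu;
            p->request = true;
            p->state   = PORT_REQUEST;
        }
        return false;                     /* nothing forwarded yet */
    case PORT_REQUEST:
        if (!permit) {
            return false;                 /* wait for permission */
        }
        p->state = PORT_DELIVER;
        return true;                      /* Header forwarded */
    case PORT_DELIVER:
        if (!permit) {
            return false;                 /* time slot given to another port */
        }
        if (head.control) {               /* Tail found: message complete */
            p->request = false;
            p->state   = PORT_IDLE;
        }
        return true;
    }
    return false;
}
```

Every forwarding step is gated by `permit`, which mirrors the rule that permission is checked before any state transition.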

Control Path Arbiter: Connects the Dest and Permit signals to/from the control bus according to the PORT address. Tri-state buffers: unused Dest signals are driven to high-Z. Unused Permit signals (in the Port FSM direction) are driven to '0'.

Data Path Arbiter: Connects the appropriate control and data signals to the buses according to the PORT address, and connects the buses to the appropriate FSL according to the DEST address. In general, the buses allow the number of ports to be increased by adding Bus Interfaces with sequential port addresses.

Example 1: In each time slot, part of the message is sent to its destination as long as the destination port is not busy. When a port is busy, the next requesting port is serviced (no delay).

Example 2: If one port has no data (port 2), the other ports are serviced in order.

Example 3: Handling a BCAST command and port arbitration while two ports have the same destination.

Interrupt Handler: The FIFO control bit is "bubbled" through the FIFO and marks the message Header and Tail. In the MicroBlaze (MB) direction, this bit notifies the MB that a message is pending in the FSL pipe (interrupt). In the router direction, this bit notifies the router of the start/end of a message.

FSL - data & control: Message data and the FSL control bit are bubbled along the FSL channel.

Software Design

Software Layers
Application layer: the MPI functions interface.
Network layer: hardware-independent implementation of these functions.
Data layer: relies on command bit fields.
Physical layer: designed for the FSL bus.

MPI Functions set: Every MPI function returns an error value. Some of the implemented functions are trivial and are present only because the MPI standard requires them:
MPI_Init( int *argc, char ***argv );
MPI_Comm_rank( MPI_Comm comm, int *rank );
MPI_Comm_size( MPI_Comm comm, int *size );
MPI_Finalize();

MPI Functions set: The non-trivial functions, used for inter-processor communication, are Send, Interrupt Vector, and Recv. Bcast is a combination of Send and Recv and differs only at the low design level.
MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm );
MPI_Bcast( void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm );
MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status );

MPI Functions set: Three additional complementary functions supply additional information about the received message:
MPI_Get_source( MPI_Status* status, MPI_Datatype datatype, int *source );
MPI_Get_count( MPI_Status* status, MPI_Datatype datatype, int *count );
MPI_Get_tag( MPI_Status* status, MPI_Datatype datatype, int *tag );

Sending the message: MPI_Send composes the header and tail and sends them together with the message body.
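The framing step can be sketched as follows. This is not the project's MPI_Send; it only illustrates wrapping a body between a Header and a Tail word, with the FSL control bit set on both. The names `fsl_word`, `tail_pack`, and `frame_compose` are hypothetical, and the Tail layout assumes bit 0 is the LSB with the T flag set to 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t data;
    bool     control;   /* FSL control bit: set on Header and Tail only */
} fsl_word;

/* Build the Tail word: T at bit 0, SRC at 1:4, SCOMM at 5:8, TAG at 9:19. */
uint32_t tail_pack(unsigned src, unsigned scomm, unsigned tag)
{
    assert(src <= 0xFu && scomm <= 0xFu && tag <= 0x7FFu);
    return 1u | ((uint32_t)src << 1) | ((uint32_t)scomm << 5)
              | ((uint32_t)tag << 9);
}

/* Write Header, `count` body words, and Tail into `out`.
 * Returns the number of words written, or 0 if `out` is too small. */
size_t frame_compose(uint32_t header, const uint32_t *body, size_t count,
                     uint32_t tail, fsl_word *out, size_t out_cap)
{
    assert(out != NULL && (body != NULL || count == 0));
    if (out_cap < 2 || count > out_cap - 2) {
        return 0;
    }
    out[0] = (fsl_word){ header, true };
    for (size_t i = 0; i < count; ++i) {
        out[i + 1] = (fsl_word){ body[i], false };
    }
    out[count + 1] = (fsl_word){ tail, true };
    return count + 2;
}
```

In a real system each composed word would then be written to the FSL, using the control variant of the write for the first and last word.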

Receiving the message: The Interrupt Vector receives incoming messages and stores them in the suitable linked list.

Returning the received message: MPI_Recv receives the message details from the user and looks for the matching message in the linked list of already received messages.
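The store-then-look-up pattern used by the receive path can be illustrated with a small linked list. This is a sketch under stated assumptions, not the project's code: the names are hypothetical, matching is done on source and tag only, and `malloc` is used for simplicity even though a 64 kbyte embedded target might prefer a static pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct rx_msg {
    int            source;
    int            tag;
    size_t         count;   /* number of 32-bit words in data */
    uint32_t      *data;
    struct rx_msg *next;
} rx_msg;

/* Receive side (e.g. the interrupt handler): append a copy of the message
 * to the end of the list. Returns 0 on success, -1 when out of memory. */
int rx_store(rx_msg **head, int source, int tag,
             const uint32_t *data, size_t count)
{
    assert(head != NULL && (data != NULL || count == 0));
    rx_msg *m = malloc(sizeof *m);
    if (m == NULL) {
        return -1;
    }
    m->data = NULL;
    if (count > 0) {
        m->data = malloc(count * sizeof *m->data);
        if (m->data == NULL) {
            free(m);
            return -1;
        }
        memcpy(m->data, data, count * sizeof *m->data);
    }
    m->source = source;
    m->tag    = tag;
    m->count  = count;
    m->next   = NULL;
    while (*head != NULL) {
        head = &(*head)->next;        /* walk to the end: keep arrival order */
    }
    *head = m;
    return 0;
}

/* Recv side: unlink and return the oldest message matching source and tag,
 * or NULL when no such message has arrived yet. */
rx_msg *rx_take(rx_msg **head, int source, int tag)
{
    assert(head != NULL);
    for (; *head != NULL; head = &(*head)->next) {
        rx_msg *m = *head;
        if (m->source == source && m->tag == tag) {
            *head = m->next;
            m->next = NULL;
            return m;
        }
    }
    return NULL;
}

/* Release a message returned by rx_take. */
void rx_free(rx_msg *m)
{
    if (m != NULL) {
        free(m->data);
        free(m);
    }
}
```

Because the list is shared between the interrupt handler and the receive call, a real implementation would also need to mask interrupts (or use an equivalent guard) while the list is being modified.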

Example application, matrix-vector multiplication: A typical example of a highly parallel application. The root processor broadcasts the vector. A selected matrix row is sent by the root to each processor. Each processor computes and returns its result. The computed results are combined into a vector by the root processor.
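The decomposition can be checked with a sequential stand-in. In the sketch below each loop iteration plays the role of one processor: it receives one matrix row and the broadcast vector and returns a single dot product. The function names and the 4x4 size are assumptions made for illustration; no message passing is performed.

```c
#include <assert.h>
#include <stddef.h>

#define MATRIX_DIM 4   /* one row per core; the project uses four cores */

/* Work done by one processor: dot product of its row with the vector. */
int row_times_vector(const int *row, const int *vec)
{
    assert(row != NULL && vec != NULL);
    int sum = 0;
    for (size_t j = 0; j < MATRIX_DIM; ++j) {
        sum += row[j] * vec[j];
    }
    return sum;
}

/* Sequential stand-in for the parallel run. `matrix` is stored row-major.
 * In the real system the root would broadcast `vec`, send row `rank` to
 * processor `rank`, and collect one result word from each processor. */
void matvec_by_rows(const int *matrix, const int *vec, int *result)
{
    assert(matrix != NULL && vec != NULL && result != NULL);
    for (size_t rank = 0; rank < MATRIX_DIM; ++rank) {
        result[rank] = row_times_vector(matrix + rank * MATRIX_DIM, vec);
    }
}
```

Each row's dot product depends only on that row and the shared vector, which is why the rows can be handed to separate processors without further communication until the results are gathered.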

[Figure: Example application, matrix-vector multiplication (root processor)]

Debug Process

[Figure: Debug structure]

Debug - Test Bench: The test bench reads messages from a file and writes them into the FSL pipe (MB output side). It also reads messages from the pipe (MB input side). Signals can also be viewed in ModelSim.

Time Table

Semester 2 - Tasks
Build a quad-core system (done).
Implement a router for the system: build a modular router in VHDL.
Test and debug: router (hardware), MPI API (software).
Run a test application: measure speed-up as a function of average message size and number of messages.

QUESTIONS ?