
1 Technion – Israel Institute of Technology, Department of Electrical Engineering, High Speed Digital Systems Lab. Instructor: Evgeny Fiksman. Students: Meir Cohen, Daniel Marcovitch. Winter 2009.

2 Table of Contents
- General – Page 2
- Network topology – Page 5
- CPU architecture and interface – Page 10
- Software design – Page 14
- Routing algorithm – Page 16
- Broadcast algorithm – Page 19
- Time tables – Page 20

3 Project goals
- Implement a parallel processing system that contains several NoCs, each chip containing several sub-networks of processors.
- Convert the existing router to support the Altera platform.
- Expand the router to enable communication between similar sub-networks.
- Implement a processor network that supports communication with the PC, enabling:
  - use of the PC's CPU as part of the processing network;
  - simple I/O between the PC and the rest of the processing network.

4 Top-level design elements
- Network topology – basic network architecture:
  - intra-chip
  - inter-chip
  - chip-PC
- CPU architecture – various models of Nios II:
  - processor performance
  - data/code cache
  - internal multiplier/divider
- CPU/router interface:
  - synchronous transfer (interrupt + CPU controls the transfer of data)
  - asynchronous transfer (DMA read/write)
- Routing algorithms:
  - routing
  - broadcast

5 Fabric topology
Alternatives: SPIN, CLICHÉ, Torus.
Considerations:
- router-to-processor ratio
- number of ports necessary (area, simplicity, inter-router scheduling)
- scalability
- congestion
- latency

6 Our topology
- The fabric has a CLICHÉ structure: easily scalable, needs at most 5 ports.
- Each FR is connected to a cluster of CPUs; each cluster uses a local router.
- Higher CPU-to-router ratio (R ~ P in CLICHÉ, R ~ P/2 in our topology).
- Increased latency, masked by the CPU-to-router clock-speed ratio.
- Increased congestion, tolerable with the number of CPUs we plan to use.

7 Top-level structure of the expanded network
- Each white square represents a single FPGA on the Gidel board.
- FPGA-FPGA and FPGA-PC routes go via designated gateway routers (GWs).
- The GWs' design and protocols are the same as the internal routers'.

8 Fabric topology
- Each fabric router is connected to a cluster of processors through a local router.
- GWs are used to connect to other chips and to the PC.
- Primary/secondary ICGW is determined by the number of chips to the left/right (interconnects/GWs are explained with the routing algorithm).

9 Structure of a single sub-network (from the previous project); a single fabric element for our project.

10 CPU architecture – various models of Nios II:
- processor speed
- data/code cache
- internal multiplier/divider
Pros and cons: performance, area, I/O and CPU contention (more in the next slide).

11 CPU/router I/F
Synchronous interface, using custom instructions:
- Connect the interrupt using a PIO.
- Connect the FIFO directly to the CPU to avoid Avalon-bus access cycles.
[Diagram: CPU, router FIFO, PIO and custom instruction on the Avalon bus]
Asynchronous interface, using DMA:
- Connect the interrupt using a PIO.
- Connect the FIFO to the Avalon bus through a router-FIFO interface.
- The same interface also raises the interrupt that starts the DMA read.
[Diagram: DMA, CPU, memory and router FIFO via a FIFO I/F on the Avalon bus]
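
As a rough illustration of the synchronous path, the sketch below drains the router FIFO with a Nios II custom-instruction built-in after a PIO interrupt. It is a minimal sketch under stated assumptions: the instruction index, IRQ number, and edge-capture handling are hypothetical placeholders, not the names the project's generated system.h would provide.

```c
/* Synchronous CPU/router interface sketch: a PIO interrupt signals
 * "packet waiting" and the CPU reads the FIFO via a custom instruction,
 * bypassing Avalon-bus access cycles. All IDs below are hypothetical. */
#include "alt_types.h"
#include "sys/alt_irq.h"        /* Altera HAL legacy interrupt API */

#define ROUTER_FIFO_RD_N 0      /* custom-instruction index (hypothetical) */
#define ROUTER_PIO_IRQ   3      /* PIO interrupt number (hypothetical)     */

static volatile int packet_pending;

static void router_pio_isr(void *context, alt_u32 id)
{
    packet_pending = 1;
    /* clearing the PIO edge-capture register is omitted in this sketch */
}

int main(void)
{
    alt_irq_register(ROUTER_PIO_IRQ, 0, router_pio_isr);

    for (;;) {
        if (packet_pending) {
            packet_pending = 0;
            /* one FIFO word per custom-instruction issue; the CPU is
             * tied up for the whole transfer (the drawback slide 12 fixes) */
            int word = __builtin_custom_in(ROUTER_FIFO_RD_N);
            (void)word;          /* hand the word to the network layer */
        }
    }
}
```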

12 Synchronous vs. asynchronous transfer
- Problem: synchronous transfer requires the CPU to direct the I/O, so the CPU cannot perform calculations during I/O.
- Solution: introduce asynchronous transfer using DMA; data is copied directly from the input FIFO to the Nios II memory.
- Requirement: the CPU needs a data cache to prevent congestion on the Avalon bus (CPU memory reads compete with DMA FIFO-to-memory transfers).
- Trade-off: a larger CPU (area) against simplicity of implementation.
[Diagram: DMA, CPU, memory and router FIFO via a FIFO I/F on the Avalon bus; CPU/router I/F]
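
A minimal sketch of the asynchronous path using the Altera HAL DMA channel API. The device name "/dev/dma_0", the packet size, and the completion flag are assumptions for illustration, not the project's actual configuration.

```c
/* Asynchronous transfer sketch: a DMA read copies the packet from the
 * router FIFO into Nios II memory while the CPU keeps computing. */
#include <stddef.h>
#include "sys/alt_dma.h"        /* Altera HAL DMA channels */

#define PKT_WORDS 16            /* hypothetical packet size */

static volatile int rx_done;
static int rx_buf[PKT_WORDS];

static void dma_done(void *handle, void *data)
{
    rx_done = 1;                /* MPI_test-style polling reads this flag */
}

int main(void)
{
    /* device name is a placeholder for the generated system's DMA core */
    alt_dma_rxchan rx = alt_dma_rxchan_open("/dev/dma_0");
    if (rx == NULL)
        return 1;

    /* in the real design this would be issued from the FIFO "data
     * waiting" interrupt rather than from main() */
    alt_dma_rxchan_prepare(rx, rx_buf, sizeof rx_buf, dma_done, NULL);

    while (!rx_done) {
        /* CPU is free for computation during the transfer; the data
         * cache keeps these accesses off the shared Avalon bus */
    }
    return 0;
}
```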

13 Summary (comparing the total number of CPUs)
- Router/processor area is estimated from the existing router/Nios II and the total LEs available on the Stratix II FPGA.
- The slow CPU allows almost double the number of CPUs, but its performance is less than half.
- We chose the fast CPU with DMA, using our topology.
[Chart: relative area of router vs. processor]

14 Software design – software layers
- Application layer: MPI functions interface.
- Network layer: hardware-independent implementation of these functions.
- Data layer: relies on command bit fields.
- Physical layer: designed for the FSL bus; adjusted to conform to the Altera interface, with asynchronous functions added.
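
To make the layering concrete, here is a hedged sketch of how a send call could descend through the four layers. Every name and the packet layout are hypothetical illustrations, not the project's actual code.

```c
/* Layer boundaries as a call chain (all names hypothetical). */
#include <string.h>

typedef struct {                /* data layer: command bit fields */
    unsigned cmd;               /* command field (layout assumed)  */
    unsigned dest;              /* destination node                */
    unsigned len;               /* payload length in bytes         */
    unsigned char payload[32];
} packet_t;

static int phys_write(const packet_t *p)   /* physical layer: Avalon i/f (was FSL) */
{
    (void)p;                    /* would push words into the router FIFO */
    return 0;
}

static int net_send(const void *buf, unsigned n, unsigned dest)  /* network layer */
{
    packet_t p = { .cmd = 1, .dest = dest, .len = n };
    memcpy(p.payload, buf, n > sizeof p.payload ? sizeof p.payload : n);
    return phys_write(&p);      /* hardware-independent down to here */
}

int MPI_send(const void *buf, unsigned n, unsigned dest)  /* application layer */
{
    return net_send(buf, n, dest);
}
```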

15 Software design
Asynchronous transfer requires additional MPI functions:
- MPI_isend – non-blocking send
- MPI_irecv – non-blocking receive
- MPI_test – test whether data has arrived in the receive buffer
- MPI_wait – blocking wait for data to arrive
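
A sketch of how these four calls could look, modeled on standard MPI non-blocking semantics. The request structure and completion flag are assumptions; the real implementation would tie them to the DMA done-callback from slide 12.

```c
/* Non-blocking MPI-style API sketch (request bookkeeping hypothetical). */
#include <stddef.h>

typedef struct {
    volatile int done;          /* set by the DMA completion callback */
    void        *buf;
    int          len;
} mpi_request;

/* bodies omitted: they would start a DMA transfer and return at once */
int MPI_isend(const void *buf, int len, int dest, mpi_request *req);
int MPI_irecv(void *buf, int len, int src, mpi_request *req);

int MPI_test(mpi_request *req, int *flag)   /* has the transfer completed? */
{
    *flag = req->done;
    return 0;
}

int MPI_wait(mpi_request *req)              /* block until completion */
{
    while (!req->done)
        ;                                   /* spin; could also sleep on the IRQ */
    return 0;
}
```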

16 Routing algorithms
Routing categories:
- static/dynamic
- source/hop-by-hop
- centralized/distributed
- splitting/non-splitting
Intra-chip algorithm to be used: static, centralized, hop-by-hop, non-splitting.
- Static routing tables are loaded into tables on each router.
- Run the algorithm on the node map of each chip and find the shortest paths; each router holds a table with the next hop toward every other node (#nodes ~ 60).
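
Since the tables are computed offline on the chip's node map, an all-pairs shortest-path pass with next-hop reconstruction fits. The sketch below uses Floyd-Warshall, an assumption on our part (the project may use a different algorithm), with the ~60-node count taken from the slide.

```c
/* Offline computation of per-router next-hop tables: run all-pairs
 * shortest paths on the node map, then load row u into router u. */
#define N   60                  /* ~60 nodes per chip, per the slide */
#define INF 0x3f3f3f3f          /* "no path yet"; safe against overflow */

static int dist[N][N];
static int next_hop[N][N];      /* next_hop[u][v] = first hop from u toward v */

void build_tables(const int adj[N][N])
{
    /* initialize with direct links only */
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++) {
            dist[u][v]     = (u == v) ? 0 : (adj[u][v] ? 1 : INF);
            next_hop[u][v] = adj[u][v] ? v : -1;
        }

    /* Floyd-Warshall with path reconstruction */
    for (int k = 0; k < N; k++)
        for (int u = 0; u < N; u++)
            for (int v = 0; v < N; v++)
                if (dist[u][k] + dist[k][v] < dist[u][v]) {
                    dist[u][v]     = dist[u][k] + dist[k][v];
                    next_hop[u][v] = next_hop[u][k];   /* route via k */
                }
    /* row u of next_hop[][] becomes router u's static routing table */
}
```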

17 Routing algorithms – inter-chip routing
- Adjacent FPGAs connect using a "neighbour bus"; FPGAs 1 and 4 connect using the slower "main bus".
- Distribute routes evenly: each arc a-d carries exactly 3 "traffic units", e.g. arc (a) is used to carry data from 2,3 to 1 and from 1 to 2,3.
- Result: assuming evenly distributed communication between processing units, each FPGA uses the interconnect on one side twice as much as the interconnect on the other side, hence the "primary" and "secondary" interconnect gateways.
[Diagram: FPGAs 1-4 connected by arcs a-c (neighbour buses) and d (main bus)]

18 Routing algorithms
Primary/secondary GW – intra-chip implications:
- Because of the above assumption, more internal fabric routers are connected to one gateway than to the other.
- Assume that I/O to the PC is not dominant enough to justify connecting all fabric routers to the PCGW.
- Connect the S-ICGW and the PCGW so as to minimize internal congestion in the fabric.

19 Broadcast algorithms
Build a static broadcast tree:
- Run an algorithm to build a spanning tree over the network nodes.
- Each node stores the set of ports (= arcs) that belong to the tree; these read-only tables are stored in hardware.
- When initiating a broadcast message: send on all BC ports.
- When receiving a broadcast message on a BC port: send on all other BC ports.
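
A sketch of the per-node forwarding rule described above. The port count matches the 5-port fabric router from slide 6; the mask value and function names are hypothetical.

```c
/* Spanning-tree broadcast: each node holds a read-only bitmask of its
 * tree ("BC") ports; forwarding excludes the arrival port, so the
 * message floods the tree exactly once with no loops. */
#define NPORTS 5                          /* max 5 ports per fabric router */

static const unsigned bc_ports = 0x0B;    /* example mask: ports 0,1,3 in tree */

static void send_on_port(int port, const void *pkt)   /* HW send stub */
{
    (void)port; (void)pkt;
}

/* initiating node: send on every tree port */
void bc_init(const void *pkt)
{
    for (int p = 0; p < NPORTS; p++)
        if (bc_ports & (1u << p))
            send_on_port(p, pkt);
}

/* forwarding node: send on every tree port except the arrival port */
void bc_forward(int in_port, const void *pkt)
{
    for (int p = 0; p < NPORTS; p++)
        if (p != in_port && (bc_ports & (1u << p)))
            send_on_port(p, pkt);
}
```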

20 Time tables [slide content not transcribed]

21 Questions

