
Technion – Israel Institute of Technology
Department of Electrical Engineering
High Speed Digital Systems Lab
Instructor: Evgeny Fiksman
Students: Meir Cohen, Daniel Marcovitch
Spring 2009

Table of Contents
Introduction/definition
New HW modules
Testing and debug
Application
Performance
Summary/conclusions

In the previous semester…
Problem definition:
1. Implement a parallel processing system containing several NoCs, with each chip containing several sub-networks of processors. The PC forms part of the network via PCI.
2. Write an application that utilizes parallel processing.
3. Measure system performance.
In the previous semester we took the previous “router” and converted it to work on the Altera platform. In addition, we prepared the system architecture and microarchitecture.

This semester…
Implemented the HW modules needed for larger-scale routing: a 5th port on all routers/switches, the fabric router, the inter-chip GW and the PC GW.
Implemented asynchronous MPI commands (for both Nios and PC).
Wrote an example application that uses the 64 processors to solve a heat-transfer problem.
Measured system performance.

Putting it all together – a general view of the topology
1. Each local cluster has 4 processors.
2. Each chip has 4 clusters (comms).
3. The Gidel board has 4 chips, for 4 × 4 × 4 = 64 processors altogether.
4. The PC is also part of the network: switching between the 4 FPGAs is done in software, i.e. it forms a “virtual switch”.

New HW modules (1) – Fabric router
In the “local router”, forwarding is done by rank, i.e. rank = port.
In the “fabric router”, a forwarding table is implemented.

Routing tables
Address format: CC FF LL, where CC = chip, FF = fabric (comm) and LL = local (rank).
Local router: same comm is routed by rank; other comms go to the 5th port.
Other routers: routing by comm/chip only. The myComm/myChip entry is used for PC routing.
Implemented using VHDL’s “generate” statement to reuse existing modules. A hex file is created for each router and loaded into ROM using a parameter.
Grouping (i.e. sub-network prefixes) allows us to use a small routing table (only 8 entries).
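The two forwarding modes can be sketched in Python as follows. This is only an illustration, not the VHDL from the project: the 2-bit width of each address field (chosen because there are 4 chips, 4 comms and 4 ranks), the port numbering, and the split of the 8 table entries into a chip table and a comm table are all assumptions. PC routing through the myComm/myChip entry is not modeled.

```python
from typing import NamedTuple, Sequence

UPLINK_PORT = 4  # assumed index of the 5th port; ports 0-3 serve the local ranks


class Address(NamedTuple):
    chip: int  # CC field
    comm: int  # FF field
    rank: int  # LL field


def decode(addr: int) -> Address:
    """Split a CCFFLL address into its fields (2 bits per field is an assumption)."""
    return Address((addr >> 4) & 0b11, (addr >> 2) & 0b11, addr & 0b11)


def local_route(dest: Address, my_chip: int, my_comm: int) -> int:
    """Local router: same comm -> port equals rank, anything else -> 5th port."""
    if dest.chip == my_chip and dest.comm == my_comm:
        return dest.rank
    return UPLINK_PORT


def fabric_route(dest: Address, my_chip: int,
                 chip_table: Sequence[int], comm_table: Sequence[int]) -> int:
    """Fabric router: table lookup by chip first, then by comm; rank is ignored."""
    if dest.chip != my_chip:
        return chip_table[dest.chip]
    return comm_table[dest.comm]
```

Because the fabric router never looks at the rank field, 4 chip entries plus 4 comm entries are enough in this sketch, which matches the 8-entry table size quoted above.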

New HW modules (2) – IC GW
Primary/Secondary indicates connectivity rather than implementation.
The inter-chip interface has increased latency, so we use buffers and credits to ensure there is no FIFO overrun.
The credit counter is initialized with the FIFO size (i.e. 32) as the initial number of credits.
Since FIFO size > end-to-end latency, the block gives 100% throughput.
[Figure: local buffer, credit counter and remote buffer (FIFO); remote credit release (inc), local credit release (dec)]
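A minimal cycle-level sketch of this credit scheme, in Python rather than HDL. The class name, the one-way latency of 4 cycles and the one-word-per-cycle link are assumptions made for illustration; only the FIFO size of 32 and the inc/dec behaviour of the credit counter come from the slide.

```python
from collections import deque


class CreditLink:
    """Sender-side credit counter plus a remote FIFO, joined by a delayed link."""

    def __init__(self, fifo_size=32, latency=4):
        self.credits = fifo_size                   # initial credits = remote FIFO size
        self.remote_fifo = deque()
        self.data_wire = deque([None] * latency)   # words in flight to the remote side
        self.credit_wire = deque([0] * latency)    # credit releases in flight back
        self.max_occupancy = 0
        self.delivered = 0

    def step(self, want_send=True, remote_pops=True):
        """Advance one clock cycle; return True if a word was sent this cycle."""
        word = self.data_wire.popleft()
        if word is not None:
            self.remote_fifo.append(word)
        self.max_occupancy = max(self.max_occupancy, len(self.remote_fifo))
        self.credits += self.credit_wire.popleft()  # returned credit: inc

        released = 0
        if remote_pops and self.remote_fifo:
            self.remote_fifo.popleft()              # remote side drains one word
            self.delivered += 1
            released = 1
        self.credit_wire.append(released)

        sent = want_send and self.credits > 0
        if sent:
            self.credits -= 1                       # word sent: dec
        self.data_wire.append(1 if sent else None)
        return sent


def run(link, cycles, remote_pops=True):
    """Offer a word every cycle and count how many were actually sent."""
    return sum(link.step(True, remote_pops) for _ in range(cycles))
```

With 32 credits and an assumed round trip of 8 cycles the sender never stalls, while a stalled receiver stops the sender after exactly 32 words, so the remote FIFO cannot overrun. Shrinking the FIFO below the round-trip latency (for example `CreditLink(4, 4)`) makes the sender wait for credits and throughput drops below 100%.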

New HW modules (2) – IC routing
IC connectivity itself uses Gidel’s fastest buses:
1. Neighbour buses between 1-2, 2-3 and 3-4
2. Main bus between 1-4
Both buses are wide enough to support bi-directional traffic.
Interface: 32-bit data, ctrl, credit_release, push/pop [total: 35 bits × 2]

New HW modules (3) – PC GW
Two blocks, ToPC GW and FromPC GW, are needed for three reasons:
1. FromPC GW adds the start/finish “ctrl” signal (it parses the MPI header for the “size” field).
2. They handle PCI idiosyncrasies (minimum message length).
3. They use Gidel’s simple FIFO protocol (req/ack) rather than Altera’s FIFO protocol (push/pop).

Testing and debug
Since the project is multi-layered, debug can be split into several types: HW (component) issues, connectivity, and SW (NIOS/PC).
Component testing: small testbenches, each encompassing a single block.
Connectivity: before running the main application, we ran a connectivity application to check that all NIOS processors can communicate with each other. We also made a Specman-E simulation emulating the router’s operation while loading and parsing the real hex files.

Testing and debug
SW/NIOS: ModelSim was used for logic simulation.
Since the system is large and debugging is difficult and multi-layered (debugging an application running on NIOS), we added special debug registers. Each NIOS writes to these registers (PIO, parallel I/O) during the application run, publishing its “state”.
In addition, debug registers were attached to the main FIFOs to indicate traffic flow (performance counters).
When running on the chip itself, these registers are sampled and displayed during the application run to give an indication of the system state.
[Figure: PIO and FIFO counters]

Application
Parallel Jacobi algorithm for approximating the solution of the heat-transfer equation.
The matrix is distributed among the CPUs.
CPUs communicate with their neighbors.
Uses computation-communication overlapping.
Managed by the host PC.
Each iteration: compute interior, send/receive boundary, compute boundary.
[Figure: matrix distribution]
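The iteration structure can be illustrated with a small Python emulation. This is a sketch under assumptions, not the project's Nios/PC code: the workers run sequentially in one process, the grid is split by rows, the boundary values are fixed, and the "send/receive" is a plain copy standing in for the asynchronous MPI calls. All function names are hypothetical.

```python
def update_row(old, i):
    """Jacobi update of row i from the old grid; first and last columns stay fixed."""
    row = old[i][:]
    for j in range(1, len(row) - 1):
        row[j] = 0.25 * (old[i - 1][j] + old[i + 1][j] + old[i][j - 1] + old[i][j + 1])
    return row


def jacobi_serial(grid, iters):
    """Reference: update every non-boundary row of the whole grid."""
    for _ in range(iters):
        inner = [update_row(grid, i) for i in range(1, len(grid) - 1)]
        grid = [grid[0][:]] + inner + [grid[-1][:]]
    return grid


def jacobi_distributed(grid, iters, workers):
    """Row-wise split; assumes (rows - 2) is divisible by the number of workers."""
    rows = (len(grid) - 2) // workers
    # Each block holds its owned rows plus one halo row above and one below.
    blocks = [[r[:] for r in grid[w * rows: w * rows + rows + 2]] for w in range(workers)]
    for _ in range(iters):
        # "Send": snapshot every worker's first and last owned row.
        top = [b[1][:] for b in blocks]
        bottom = [b[-2][:] for b in blocks]
        new_blocks = []
        for w, b in enumerate(blocks):
            new = [r[:] for r in b]
            # Interior rows need no halo data, so they can overlap the transfer.
            for i in range(2, len(b) - 2):
                new[i] = update_row(b, i)
            # "Receive": neighbours' edge rows land in the halos.
            if w > 0:
                b[0] = bottom[w - 1]
            if w < workers - 1:
                b[-1] = top[w + 1]
            # Boundary rows are computed last, once the halos are up to date.
            for i in {1, len(b) - 2}:
                new[i] = update_row(b, i)
            new_blocks.append(new)
        blocks = new_blocks
    owned = [row for b in blocks for row in b[1:-1]]
    return [grid[0][:]] + owned + [grid[-1][:]]
```

Computing the interior rows between the send and the receive is what allows communication to hide behind computation; the distributed result matches the serial one for any worker count that divides the rows evenly.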

Performance – application time vs. number of iterations
Measurements were done on a dual-core Pentium processor running at 2.4 GHz.
The constant offset indicates PCI latency.
Running time is linear, as expected: #Iterations × (communication + calculation) + PCI offset.
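Given a set of (iterations, time) measurements, the slope and offset of this linear model can be estimated with an ordinary least-squares fit. The sketch below is one hedged way to do it; the function name is hypothetical and no measured values from the project are included.

```python
def fit_line(iterations, times):
    """Least-squares fit of times = slope * iterations + offset.

    slope  estimates the per-iteration cost (communication + calculation),
    offset estimates the constant PCI overhead.
    """
    n = len(iterations)
    mean_x = sum(iterations) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in iterations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(iterations, times))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset
```

For example, `fit_line([100, 200, 400], [350.0, 650.0, 1250.0])` on these made-up points recovers a slope of about 3 and an offset of about 50.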

Performance – throughput vs. injection rate
For low injection rates, routing isn’t a bottleneck, so the output rate is almost identical to the input rate.
As the injection rate increases, the router becomes the bottleneck.
Once the maximum throughput of the router is reached, throughput stays constant.

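Performance – simplified model – delay(congestion)
D(p): delay as a function of the number of packets p in the system
R: average router delay
L: system latency
λ: injection rate
D(p) = R∙p + L
p = λ∙D(p) [Little’s law]
Substituting p into the first relation and solving for D: D = L/(1 − λ∙R)
Parameters used: R = 50, L = 80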
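Solving the two relations D = R∙p + L and p = λ∙D for D gives D = L/(1 − λ∙R), which diverges as λ approaches 1/R. A short Python sketch evaluates it with the slide's parameters; the function names are hypothetical, and the model ignores the finite FIFO sizes that cap the delay in the real system.

```python
def delay(lam, R=50.0, L=80.0):
    """Delay predicted by the simplified model, D = L / (1 - lam * R)."""
    if lam * R >= 1.0:
        raise ValueError("model is only valid for lam < 1/R")
    return L / (1.0 - lam * R)


def packets_in_system(lam, R=50.0, L=80.0):
    """Little's law: p = lam * D."""
    return lam * delay(lam, R, L)
```

With R = 50 and L = 80 the delay is 80 at zero load and doubles to 160 at λ = 0.01, half of the model's limit of 1/R = 0.02.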

Performance – packet delay vs. number of injection stubs
With few injecting stubs there is almost no congestion, so delay is constant.
As we approach maximum throughput, congestion increases and delay increases.
For very high injection rates we approach system saturation: since FIFO sizes are finite (32 entries), there is a maximum number of packets in the system at any given moment.

Performance – packet delay vs. injection rate
For low injection rates there is almost no congestion, so delay is constant.
We again see an exponential increase in delay that levels off due to system saturation.

Summary/conclusions:
1. The original router was robust and was easily expanded to support a 5th port and routing tables.
2. Debugging software written on this system posed a serious challenge and required a certain measure of innovation.
3. Despite being on chip, communication between processors still constitutes a serious factor. Therefore, overall system performance will improve as the communication/calculation ratio decreases.
4. For similar reasons, the network can be better used if locality between nodes is exploited.
Next steps:
1. Compare topologies (mesh / fat tree)
2. Develop software to automatically create topologies out of building blocks
3. Simplify the router and increase throughput

Questions
