
1 Experimental Performance Evaluation for Reconfigurable Computer Systems: The GRAM Benchmarks
Chitalwala, E., El-Ghazawi, T., Gaj, K.
The George Washington University / George Mason University
MAPLD 2004, Washington, DC

2 Abbreviations
BRAM – Block RAM
GRAM – Generalized Reconfigurable Architecture Model
LM – Local Memory
Max – Maximum
MAP – Multi Adaptive Processor
MPM – Microprocessor Memory
OCM – On-Chip Memory
PE – Processing Element
Trans Perms – Transfer of Permissions

3 Outline
Problem Statement
GRAM Description
Assumptions and Methodology
Testbed Description: SRC-6E
Results
Conclusion and Future Direction

4 Problem Statement
Develop a standardized model of reconfigurable architectures.
Define a set of synthetic benchmarks based on this model to analyze performance and discover bottlenecks.
Evaluate the system against the peak performance specifications given by the manufacturer.
Prove the concept by using these benchmarks to assess and dynamically characterize the performance of a reconfigurable system, using the SRC-6E as a test case.

5 Generalized Reconfigurable Architecture Model (GRAM)

6 GRAM Benchmarks: Objective
Measure maximum sustainable data transfer rates and latency between the various elements of the GRAM.
Dynamically characterize the performance of the system against system peak performance.

7 Generalized Reconfigurable Architecture Model (GRAM) [diagram]

8 GRAM Elements
PE – Processing Element
OCM – On-Chip Memory
LM – Local Memory
Interconnect Network / Shared Memory
Bus Interface
Microprocessor Memory

9 GRAM Benchmarks
OCM – OCM: measure max. sustainable bandwidth and latency between two OCMs residing on different PEs.
OCM – LM: measure max. sustainable bandwidth and latency between OCM and LM in either direction.
OCM – Shared Memory: measure max. sustainable bandwidth and latency between OCM and Shared Memory in either direction.
Shared Memory – MPM: measure max. sustainable bandwidth and latency between Shared Memory and MPM in either direction.

10 GRAM Benchmarks
OCM – MPM: measure max. sustainable bandwidth and latency between OCM and MPM in either direction.
LM – MPM: measure max. sustainable bandwidth and latency between LM and MPM in either direction.
LM – LM: measure max. sustainable bandwidth and latency between two LMs in either direction.
LM – Shared Memory: measure max. sustainable bandwidth and latency between LM and Shared Memory in either direction.
All of these measurements share one pattern, sketched below.
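A minimal sketch of the common kernel follows. The helper names (read_hw_timer, transfer_words) and the 100 MHz user-logic clock are assumptions for illustration; the platform's actual timer and transfer primitives differ.

```c
/* Minimal sketch of the common GRAM measurement kernel.
 * read_hw_timer() and transfer_words() are hypothetical stand-ins for
 * the platform's hardware cycle counter and element-to-element
 * transfer call. */
#include <stddef.h>
#include <stdint.h>

extern uint64_t read_hw_timer(void);   /* free-running cycle counter */
extern void transfer_words(const uint64_t *src, uint64_t *dst, size_t words);

#define CLOCK_HZ 100000000.0   /* assumed 100 MHz user-logic clock */

/* Returns sustained bandwidth in Mbytes/s for one transfer of `words`
 * 64-bit words between two GRAM elements. */
double measure_bandwidth(const uint64_t *src, uint64_t *dst, size_t words)
{
    uint64_t t0 = read_hw_timer();
    transfer_words(src, dst, words);
    uint64_t t1 = read_hw_timer();

    double seconds = (double)(t1 - t0) / CLOCK_HZ;
    return (words * 8.0) / (seconds * 1.0e6);   /* bytes -> Mbytes/s */
}
```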

11 GRAM Assumptions

12 Assumptions
All devices on the board are fed by a single clock.
There is no direct path between the Local Memories of individual elements.
Connections for add-on cards may exist but are not shown.
The generalized architecture is based on precedents set by past and current manufacturers of reconfigurable systems.

13 Methodology
Data paths are parallelized to the maximum extent possible.
Inputs and outputs are kept symmetrical.
Hardware timers are used to measure data transfer times.
Measurements are taken for transfers of increasingly large amounts of data.
Data is verified for correctness after each transfer.
Multiple paths may exist between the specified elements; our aim is to measure the fastest path available.
All experiments are conducted using the programming model and library functions of the system.
A host-side sketch of this harness appears below.
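In the sketch below, run_benchmark is a hypothetical stand-in for one timed round trip through the elements under test that leaves the transferred data back in the host buffer; the size sweep and verification steps mirror the methodology above.

```c
/* Host-side harness sketch: sweep increasingly large transfers, verify
 * the data after each round trip, and report the best sustained rate. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical: performs one timed round trip and returns the
 * measured bandwidth in Mbytes/s. */
extern double run_benchmark(uint64_t *buf, size_t words);

int main(void)
{
    double best = 0.0;

    for (size_t words = 1024; words <= (size_t)1 << 21; words *= 2) {
        uint64_t *buf = malloc(words * sizeof *buf);
        uint64_t *ref = malloc(words * sizeof *ref);
        if (!buf || !ref) return 1;

        for (size_t i = 0; i < words; i++)          /* known test pattern */
            buf[i] = ref[i] = i * 0x9E3779B97F4A7C15ULL;

        double rate = run_benchmark(buf, words);

        if (memcmp(buf, ref, words * sizeof *buf))  /* verify correctness */
            fprintf(stderr, "data corrupted at %zu words\n", words);
        else if (rate > best)
            best = rate;

        free(buf);
        free(ref);
    }

    printf("max sustainable bandwidth: %.1f Mbytes/s\n", best);
    return 0;
}
```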

14 Testbed Description: SRC-6E

15 Hardware Architecture of the SRC-6E
[Block diagram; visible labels: 800/1600 Mbytes/s host links, 64-bit × 6 on-board paths at 800 Mbytes/s, and a 64-bit port.]

16 Programming Model of the SRC-6E
µP path: .c or .f files → µP Compiler → .o files → Linker
MAP path: .mc or .mf files → MAP Compiler → .v files → Logic Synthesis (together with user .vhd or .v macro files) → .ngo files → Place & Route → .bin files → Linker
The Linker combines both paths into a single application executable. A hedged sketch of this split follows.
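As an illustration only: the host's .c file calls the MAP routine like an ordinary C function, while the routine's body lives in a .mc file and is compiled to the FPGA. The function name and the trailing MAP-number argument are assumptions patterned on this model, not the exact Carte syntax.

```c
/* main.c -- compiled by the µP compiler into a .o file.
 * bench_map() is a hypothetical MAP routine; the trailing argument
 * selecting which MAP it runs on is an assumption. */
extern void bench_map(long data[], int nwords, long *cycles, int which_map);

int main(void)
{
    static long data[1024];
    long cycles;

    bench_map(data, 1024, &cycles, 0);   /* executes on the MAP hardware */
    return 0;
}

/* bench.mc -- compiled by the MAP compiler, synthesized, placed and
 * routed into a .bin bitstream that the linker folds into the same
 * application executable. */
void bench_map(long data[], int nwords, long *cycles, int which_map)
{
    /* body runs on the user FPGA; timed transfers go here */
}
```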

17 GRAM Benchmarks for the SRC-6E
[Block diagram mapping the benchmarks onto the SRC-6E: two µP boards (P3/P4 µP at 1/3 GHz, L2 cache, MIOC, PCI slot, 1.5 GB microprocessor memory, SNAP interface) connected by Ethernet and 800/1600 Mbytes/s channels to the MAP III board, whose Control Chip and two User Chips share 24 MB of on-board memory over 4800 Mbytes/s (6 × 800) paths, with a 2400 (4800*) Mbytes/s path between the User Chips. Highlighted benchmark paths: OCM – OCM, OCM – Shared Memory, OCM – MPM, Shared Memory – MPM.]

18 GRAM Benchmarks for the SRC-6E

Benchmark           | SRC-6E equivalent
OCM – OCM           | BRAM – BRAM
OCM – LM            | NA
OCM – Shared Memory | BRAM – On-Board Memory
Shared Memory – MPM | On-Board Memory – Common Memory
OCM – MPM           | BRAM – Common Memory
LM – MPM            | NA
LM – LM             | NA
LM – Shared Memory  | NA

19 Results

20 Block Diagram for a Single-Bank Transfer Between OCM and Shared Memory
Start_timer; Read_timer(ht0)
µProcessor Memory → Shared Memory (DMA_in); Read_timer(ht1)
Shared Memory → OCM; Read_timer(ht2)
OCM → Shared Memory; Read_timer(ht3)
Shared Memory → µProcessor Memory (DMA_out); Read_timer(ht4)
A MAP-side sketch of this instrumentation follows.
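The timer names below come from the diagram above; the exact signatures and the elided transfer calls are assumptions, shown only to make the timing sequence concrete.

```c
/* Sketch of the single-bank timing sequence (timer names from the
 * diagram; exact signatures and transfer calls are assumptions). */
extern void Start_timer(void);
extern void Read_timer(long *t);

void single_bank_benchmark(void)
{
    long ht0, ht1, ht2, ht3, ht4;

    Start_timer();
    Read_timer(&ht0);
    /* µProcessor Memory -> Shared Memory (DMA_in) */
    Read_timer(&ht1);
    /* Shared Memory -> OCM */
    Read_timer(&ht2);
    /* OCM -> Shared Memory */
    Read_timer(&ht3);
    /* Shared Memory -> µProcessor Memory (DMA_out) */
    Read_timer(&ht4);

    /* per-stage cycle counts from successive timer readings */
    long dma_in_cycles  = ht1 - ht0;
    long read_cycles    = ht2 - ht1;   /* Shared Memory -> OCM */
    long write_cycles   = ht3 - ht2;   /* OCM -> Shared Memory */
    long dma_out_cycles = ht4 - ht3;
}
```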

21 Latency

Transfer                 | Min. data transferred | Latency (clock cycles) P III / P IV | Latency (µs) P III / P IV
Shared Memory to OCM     | 1 word*  | 20          | 0.20
OCM to Shared Memory     | 1 word*  | 15          | 0.15
OCM to OCM (Bridge Port) | 1 word*  | 11          | 0.11
Shared Memory to MPM     | 4 words* | 4200 / 2100 | 42 / 21
MPM to Shared Memory     | 4 words* | 1000        | 10

*1 word = 64 bits
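Note: the µs column is consistent with a 100 MHz MAP clock (10 ns per cycle): 20 cycles × 10 ns = 0.20 µs, and the Pentium III figures of 4200 and 1000 cycles work out to 42 µs and 10 µs.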

22 Latency
The difference between read and write times for the OCM and Shared Memory is due to the read latency of OBM (6 clocks) vs. BRAM (1 clock).
When transferring data from the MPM to Shared Memory, writes are issued on each clock cycle and there is no startup latency.
When reading data from Shared Memory to the MPM, an additional five clock cycles are required to transfer data after the read has been issued.
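The table above bears this out: the Shared Memory → OCM transfer (an OBM read) takes 20 cycles while OCM → Shared Memory (a BRAM read) takes 15, and the 5-cycle gap matches the 6-clock vs. 1-clock read-latency difference.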

23 Data Path from OCM to OCM Using Transfer of Permissions
[Diagram: two Processing Elements (FPGAs), each with OCM 1 and OCM 2, exchange data through the six 4 MB Shared Memory banks (A–F) over 64-bit paths; a 192-bit path is also labeled.]

24 Data Path from OCM to OCM Using the Bridge Port and the Streaming Protocol
[Diagram: OCM 1 on FPGA 1 connects to OCM 1 on FPGA 2 over a 64-bit bridge-port path; the six 4 MB Shared Memory banks (A–F) are shown but not on this path.]

25 P III & IV: Bandwidth: OCM and OCM (BM#1) [chart]

26 P III: Bandwidth: OCM and OCM (BM#1) [chart]

27 P IV: Bandwidth: OCM and OCM (BM#1) [chart]

28 P IV: Bandwidth: OCM and OCM (BM#1) (Streaming Protocol in Bridge Port) [chart]

29 Data Path from OCM to MPM and from Shared Memory to MPM
[Diagram: a Processing Element (FPGA) with OCM 1–3 connects over 64-bit paths to the six 4 MB Shared Memory banks (A–F), which link through the Control FPGA and SNAP to the Microprocessor Memory.]

30 P III: Bandwidth: OCM and Shared Memory for a single bank [chart]

31 P III: Bandwidth: OCM and Shared Memory [chart]

32 P IV: Bandwidth: OCM and Shared Memory [chart]

33 P III: Bandwidth: OCM and µP Memory [chart]

34 P IV: Bandwidth: OCM and µP Memory [chart]

35 P III: Bandwidth: Shared Memory and µP Memory (BM#5) [chart]

36 P IV: Bandwidth: Shared Memory and µP Memory [chart]

37 P III: Bandwidth: Shared Memory and µP Memory [chart]

38 P IV: Bandwidth: Shared Memory and µP Memory [chart]

39 Data Path from FPGA Register to Shared Memory
[Diagram: a register on the Processing Element (FPGA 1) connects over a 64-bit path to the six 4 MB Shared Memory banks (A–F).]

40 P III: Bandwidth: Shared Memory and Register [chart]

41 Conclusion & Future Direction

42 GRAM Summation for Pentium III

Benchmark | Peak performance (Mbytes/s) | Max. sustainable bandwidth measured (Mbytes/s) | Efficiency (%) | Normalized transfer rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM – OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM – OCM b (Trans Perms) | 800 | 793 | 99.13 | 1.5
OCM – OCM c (Streaming) | 800 | NA | NA | NA
OCM – LM | NA | | |
OCM → Shared Memory / Shared Memory → OCM* | 2400 | 2373 / 2373 | 98.8 / 98.8 | 4.46
OCM → MPM / MPM → OCM | 800 / 800 | 182.8 / 227.3 | 22.85 / 28.41 | 0.34 / 0.43
Shared Memory → MPM / MPM → Shared Memory | 800 / 800 | 203 / 314 | 25.3 / 39.3 | 0.38 / 0.59
Shared Memory → Reg / Reg → Shared Memory | 800 / 800 | 798 / 798 | 99.75 / 99.75 | 1.5 / 1.5
LM – MPM | NA | | |
LM – LM | NA | | |
LM – Shared Memory | NA | | |

* For three banks
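The normalized-rate column in both summation tables appears to divide the measured bandwidth by the unidirectional PCI-X rate of 133 MHz × 32 bits / 8 = 532 Mbytes/s: for example, 793 / 532 ≈ 1.5 for Transfer of Permissions and 2373 / 532 ≈ 4.46 for the three-bank OCM – Shared Memory path. Efficiency is simply measured / peak × 100 (e.g., 793 / 800 = 99.13%).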

43 GRAM Summation for Pentium IV

Benchmark | Peak performance (Mbytes/s) | Max. sustainable bandwidth measured (Mbytes/s) | Efficiency (%) | Normalized transfer rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM – OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM – OCM b (Trans Perms) | 800 | 797.39 | 99.67 | 1.5
OCM – OCM c (Streaming) | 800 | 799.49 | 100 | 1.5
OCM – LM | NA | | |
OCM → Shared Memory / Shared Memory → OCM* | 2400 | 2392 / 2390 | 99.6 / 99.6 | 4.5 / 4.5
OCM → MPM / MPM → OCM | 800 / 800 | 578 / 562 | 72.25 / 70.25 | 1.08 / 1.05
Shared Memory → MPM / MPM → Shared Memory | 800 / 800 | 796 / 799 | 99.5 / 99.8 | 1.5 / 1.5
Shared Memory → Reg / Reg → Shared Memory | 800 / 800 | 798 / 798 | 99.75 / 99.75 | 1.5 / 1.5
LM – MPM | NA | | |
LM – LM | NA | | |
LM – Shared Memory | NA | | |

* For three banks

44 Conclusions
The type of components used plays a major role in determining system performance, as seen in the Pentium III and Pentium IV versions of the SRC-6E.
The software environment and its state of development determine how effectively a program can utilize the hardware. This is clear from the difference in bandwidth achieved across the bridge ports between the Carte 1.6.2 and Carte 1.7 releases.

45 Conclusions …
The GRAM summation tables serve machine architects in the following ways:
- The efficiency column indicates how well a particular communication channel is utilized within the hardware context. If efficiency is low, architects may be able to improve performance with a firmware improvement; if efficiency is high but the normalized bandwidth is low, they should consider a hardware upgrade.
- The normalized bandwidths obtained from the GRAM benchmarks show whether data transfer rates are balanced across the architectural modules, which helps identify bottlenecks.
- Designers can find the channels with the highest efficiency and fine-tune their applications to exploit those channels for the maximum data transfer rate.

46 Conclusions …
In addition, the GRAM summation tables provide the following information to application developers:
- The tables show what bottlenecks to expect and where those bottlenecks lie.
- By comparing the efficiency and normalized transfer rate figures, designers can tell whether bottlenecks are created by the hardware or by the software.
- From the summation tables, designers can predict the performance of a pre-designed application on a particular reconfigurable system.

47 Future Direction
The benchmarks can be expanded to include end-to-end performance from asymmetrical and synthetic workloads.
They can also include tables characterizing the performance of reconfigurable computers relative to modern parallel architectures.
A performance-to-cost analysis can also be considered.

