Performance Potential of an Easy-to- Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor George C. Caragea – University of Maryland A. Beliz Saybasili – LCB Branch, NHLBL, NIH Xingzhi Wen – NVIDIA Corporation Uzi Vishkin – University of Maryland
Hardware prototypes of PRAM-On-Chip Objective of current paper Meaningful comparison of 1.Our FPGA design, with 2.State-of-the-Art (Intel) Processor 64-core, 75MHz FPGA prototype [SPAA’07, Computing Frontiers’08] Original explicit multi-threaded (XMT) architecture [SPAA98] (Cray started to use “XMT” ~7 years later) Interconnection Network for 128-core. 9mmX5mm, IBM90nm process. 400 MHz prototype [HotInterconnects’07] Same design as 64-core FPGA. 10mmX10mm, IBM90nm process. 150 MHz prototype The design scales to cores on-chip
XMT: A PRAM-On-Chip Vision Manycores are coming. But 40yrs of parallel computing: Never a successful general-purpose parallel computer (easy to program, good speedups, up&down scalable). IF you could program it great speedups. XMT: Fix the IF XMT: Designed from the ground up to address that for on-chip parallelism Unlike matching current HW (Some other SPAA papers) Tested HW & SW prototypes Builds on PRAM algorithmics. Only really successful parallel algorithmic theory. Latent, though not widespread, knowledgebase This paper: ~10X relative to Intel Core 2 Duo If there is time: Really serious about ease of programming
Objective for programmer’s model Emerging: not sure, but the analysis should be work-depth. But, why not design for your analysis? (like serial) XMT: Design for work-depth. Unique among manycores. - 1 operation now. Any #ops next time unit. - Competitive on nesting. (To be published.) - No need to program for locality. What could I do in parallel at each step assuming unlimited hardware # ops.. time # ops time Time = WorkWork = total #opsTime << Work Serial ParadigmNatural (Parallel) Paradigm
Programmer’s Model: Engineering Workflow Arbitrary CRCW Work-depth algorithm. Reason about correctness & complexity in synchronous model SPMD reduced synchrony – Threads advance at own speed, not lockstep – Main construct: spawn-join block. Note: can start any number of processes at once – Prefix-sum (ps). Independence of order semantics (IOS). – Establish correctness & complexity by relating to WD analyses. – Circumvents “The problem with threads”, e.g., [Lee]. Tune (compiler or expert programmer): (i) Length of sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL07] Trial&error contrast: similar start while insufficient inter-thread bandwidth do{rethink algorithm to take better advantage of cache} spawnjoinspawnjoin
XMT Architecture Overview One serial core – master thread control unit (MTCU) Parallel cores (TCUS) grouped in clusters Global memory space evenly partitioned in cache banks using hashing No local caches at TCU – Avoids expensive cache coherence hardware Cluster 1 Cluster 2 Cluster C DRAM Channel 1 DRAM Channel D MTCU Hardware Scheduler/Prefix-Sum Unit Parallel Interconnection Network Memory Bank 1 Memory Bank 2 Memory Bank M Shared Memory (L1 Cache) … … …
Paraleap: XMT PRAM-on-chip silicon Built FPGA prototype Announced in SPAA’07 Built using 3 FPGA chips – 2 Virtex-4 LX200 – 1 Virtex-4 FX100 Clock rate75 MHzNo. TCUs64 DRAM size1GBClusters8 DRAM channels1Cache modules8 Mem. data rate0.6GB/sShared cache256KB With no prior design experience, X. Wen completed synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years. X. Wen is one person.. basic simplicity of the XMT architecture simple faster time to market, lower implementation cost.
Benchmarks Sparse Matrix – Vector Multiplication (SpMV) – Matrix stored in Compact Sparse Row (CSR) format – Serial version: iterate through rows – Parallel version: one thread per row 1-D FFT – Fixed-point arithmetic implementation – Serial version: Radix-2 Cooley-Tukey Algorithm – Parallel version: Parallelized each stage of serial algorithm Quicksort – Serial version: standard textbook implementation – Parallel version: two phases Phase 1: For large sub-arrays, parallelize partitioning operation using atomic prefix-sum Phase 2: Process all partitions in parallel using serial partitioning algorithm
Experimental Platforms XMT Paraleap FPGAIntel Core 2 Duo Processor1 MTCU, 64 TCUsDual Core, E6300 Clock75 MHz1.86GHz Cache256KB shared L1 cache2x64KB L1, 2x2MB L2 DRAM1GB DDR22GB DDR2 Data rate0.6GB/s6.4GB/s CompilerXMTCC (GCC based)GCC Intel C++ Professional Compiler (ICC) v11 Compiler Optim. -O3, data prefetch, read-only buffers -O3, SSE3 SIMD, data prefetching, auto- parallelization For meaningful comparison: compare cycle count
Input Datasets smalllarge ProgramNFootprintN SpMV22K200KB4M33MB FFT8K192KB4M96MB Quicksort100K781KB20M153MB Large dataset represents realistic input sizes – Recommended by Intel engineer for comparison – Gives Intel Core 2 advantage because of larger cache Small dataset – Fits in both Paraleap and Intel Core 2 cache – Provides most fair comparison for current XMT generation
Clock-Cycle Speedup Core 2 – ICCCore 2 - GCC Program smalllargesmalllarge SpMV FFT Quicksort(*) Paraleap outperforms Intel Core 2 on all benchmarks Lower speed-ups for Large dataset because of smaller cache size – Will not be an issue for future implementations of XMT Silicon area of 64-TCU XMT roughly the same as one core of Intel Core 2 Duo No reason for clock frequency of XMT to fall behind Computed as: speedup = #ClockCycles for Core 2 / #ClockCylces for Paraleap
Conclusion XMT provides viable answer to biggest challenges for the field – Ease of programming – Scalability (up&down) Preliminary evaluation shows good result of XMT architecture versus state-of-the art Intel Core 2 platform ICPP’08 paper compares with GPUs.
Software release Allows to use your own computer for programming on an XMT environment and experimenting with it, including: (i)Cycle-accurate simulator of the XMT machineCycle-accurate simulator of the XMT machine (ii)Compiler from XMTC to that machineCompiler from XMTC to that machine Also provided, extensive material for teaching or self-studying parallelism, including (i)Tutorial + manual for XMTC (150 pages)Tutorial + manual for XMTC (150 pages) (ii)Classnotes on parallel algorithms (100 pages)Classnotes on parallel algorithms (100 pages) (iii)Video recording of 9/15/07 HS tutorial (300 minutes)Video recording of 9/15/07 HS tutorial (300 minutes) (iv) Video recording of grad Parallel Algorithms lectures (30+hours) Video recording of grad Parallel Algorithms lectures (30+hours) Next Major Objective Industry-grade chip. Requires 10X in funding.
Ease of Programming Benchmark: can any CS major program your manycore? - cannot really avoid it. Teachability demonstrated so far: - To freshman class with 11 non-CS students. Some prog. assignments: merge-sort, integer-sort & samples-sort. Other teachers: - Magnet HS teacher. Downloaded simulator, assignments, class notes, from XMT page. Self-taught. Recommends: Teach XMT first. Easiest to set up (simulator), program, analyze: ability to anticipate performance (as in serial). Can do not just for embarrassingly parallel. Teaches also OpenMP, MPI, CUDA. Lookup keynote at + interview with teacher. - High school & Middle School (some 10 year olds) students from underrepresented groups by HS Math teacher.