Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor George C. Caragea – University of Maryland A. Beliz.

Presentation transcript:

Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor
George C. Caragea – University of Maryland
A. Beliz Saybasili – LCB Branch, NHLBI, NIH
Xingzhi Wen – NVIDIA Corporation
Uzi Vishkin – University of Maryland

Hardware prototypes of PRAM-On-Chip
– 64-core, 75 MHz FPGA prototype [SPAA’07, Computing Frontiers’08]. Original explicit multi-threaded (XMT) architecture [SPAA’98] (Cray started to use “XMT” about 7 years later).
– Interconnection network for 128 cores: 9 mm x 5 mm, IBM 90 nm process, 400 MHz prototype [HotInterconnects’07].
– Same design as the 64-core FPGA: 10 mm x 10 mm, IBM 90 nm process, 150 MHz prototype.
– The design scales to cores on-chip
Objective of current paper: meaningful comparison of
1. our FPGA design, with
2. a state-of-the-art (Intel) processor.

XMT: A PRAM-On-Chip Vision
– Manycores are coming. But in 40 years of parallel computing there has never been a successful general-purpose parallel computer (easy to program, good speedups, up- and down-scalable).
– IF you could program it → great speedups. XMT: fix the IF.
– XMT: designed from the ground up to address that for on-chip parallelism, unlike matching current HW (some other SPAA papers).
– Tested HW & SW prototypes.
– Builds on PRAM algorithmics, the only really successful parallel algorithmic theory. Latent, though not widespread, knowledge base.
– This paper: ~10X relative to Intel Core 2 Duo.
– If there is time: really serious about ease of programming.

Objective for programmer’s model
– Emerging: not sure, but the analysis should be work-depth. But why not design for your analysis (as in serial computing)?
– XMT: design for work-depth. Unique among manycores.
  - 1 operation now. Any #ops next time unit.
  - Competitive on nesting. (To be published.)
  - No need to program for locality.
– What could I do in parallel at each step, assuming unlimited hardware?
Figure (#ops vs. time, two panels): Serial Paradigm, Time = Work; Natural (Parallel) Paradigm, Work = total #ops, Time << Work.
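To make the work versus depth distinction concrete, here is a minimal Python sketch (not from the slides) that sums a list in pairwise rounds. It counts work as the total number of additions and depth as the number of rounds, since every addition inside a round is independent and could run in one time unit given unlimited hardware. The function name `tree_sum` is an assumption made for illustration.

```python
def tree_sum(values):
    """Sum a list by pairwise rounds; return (total, work, depth)."""
    level = list(values)
    work = 0   # total number of additions performed
    depth = 0  # number of rounds, i.e. time with unlimited hardware
    while len(level) > 1:
        pairs = len(level) // 2
        # All additions in this round are independent of each other.
        nxt = [level[2 * i] + level[2 * i + 1] for i in range(pairs)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        work += pairs
        depth += 1
        level = nxt
    return (level[0] if level else 0), work, depth


total, work, depth = tree_sum(range(1, 9))
print(total, work, depth)  # 36 7 3
```

For 8 inputs the serial paradigm needs 7 time steps (time = work), while the round-based view needs only 3 (time << work).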

Programmer’s Model: Engineering Workflow
– Arbitrary CRCW work-depth algorithm. Reason about correctness & complexity in the synchronous model.
– SPMD reduced synchrony
  – Threads advance at their own speed, not in lockstep
  – Main construct: spawn-join block. Note: can start any number of processes at once
  – Prefix-sum (ps). Independence of order semantics (IOS).
  – Establish correctness & complexity by relating to WD analyses.
  – Circumvents “The problem with threads”, e.g., [Lee].
– Tune (compiler or expert programmer): (i) length of sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL07]
– Trial & error contrast: similar start → while insufficient inter-thread bandwidth do {rethink algorithm to take better advantage of cache}
Figure: spawn → join → spawn → join
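The spawn-join block and the prefix-sum primitive can be illustrated with array compaction. The sketch below is a serial Python emulation, not XMTC: `PrefixSumBase` and `compact` are hypothetical names, and the shuffle stands in for threads advancing at their own speed. It assumes that a prefix-sum returns the old base value and adds the increment as one indivisible step, so every thread obtains a distinct output slot regardless of execution order.

```python
import random


class PrefixSumBase:
    """Hypothetical stand-in for a prefix-sum base variable."""

    def __init__(self, value=0):
        self.value = value

    def ps(self, increment):
        # Return the old base value and add the increment in one step.
        old = self.value
        self.value += increment
        return old


def compact(a, seed=None):
    """Copy the nonzero elements of a into a dense list, in arbitrary order."""
    base = PrefixSumBase(0)
    out = [None] * len(a)
    thread_ids = list(range(len(a)))         # "spawn": one virtual thread per element
    random.Random(seed).shuffle(thread_ids)  # emulate threads finishing in any order
    for tid in thread_ids:
        if a[tid] != 0:
            slot = base.ps(1)                # each thread receives a distinct slot
            out[slot] = a[tid]
    # "join": all virtual threads are done; base.value counts the kept elements
    return out[:base.value]


a = [0, 5, 0, 7, 3, 0, 9]
print(sorted(compact(a, seed=1)))  # [3, 5, 7, 9]
```

Different seeds may place the elements in different slots, but the set of kept elements is always the same, which is the sense in which the result is independent of thread order.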

XMT Architecture Overview
– One serial core: master thread control unit (MTCU)
– Parallel cores (TCUs) grouped in clusters
– Global memory space evenly partitioned in cache banks using hashing
– No local caches at TCU: avoids expensive cache coherence hardware
Figure (block diagram): MTCU; Hardware Scheduler/Prefix-Sum Unit; Clusters 1 to C; Parallel Interconnection Network; Shared Memory (L1 Cache) made of Memory Banks 1 to M; DRAM Channels 1 to D.
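The slides do not say which hash function maps addresses to cache banks. As a rough illustration of why hashing helps, the Python sketch below compares plain modulo interleaving with a made-up XOR-folding hash (`bank_xor_fold` is purely hypothetical, not the XMT hardware function) on a strided access pattern. The value of 8 banks is chosen to match the 8 cache modules of the FPGA prototype.

```python
NUM_BANKS = 8  # matches the 8 cache modules reported for the FPGA prototype


def bank_modulo(line_addr):
    """Plain interleaving: consecutive lines go to consecutive banks."""
    return line_addr % NUM_BANKS


def bank_xor_fold(line_addr):
    """Hypothetical hash: XOR the low 3 address bits with the next 3 bits."""
    return (line_addr ^ (line_addr >> 3)) % NUM_BANKS


# A stride-8 access pattern, e.g. walking down a column of an 8-wide array.
strided = list(range(0, 64, 8))
print(sorted({bank_modulo(a) for a in strided}))    # [0]
print(sorted({bank_xor_fold(a) for a in strided}))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With modulo interleaving all eight requests queue at one bank, while the hashed mapping spreads them over all banks.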

Paraleap: XMT PRAM-on-chip silicon
Built FPGA prototype, announced in SPAA’07
Built using 3 FPGA chips
– 2 Virtex-4 LX200
– 1 Virtex-4 FX100

| Parameter | Value | Parameter | Value |
| Clock rate | 75 MHz | No. TCUs | 64 |
| DRAM size | 1 GB | Clusters | 8 |
| DRAM channels | 1 | Cache modules | 8 |
| Mem. data rate | 0.6 GB/s | Shared cache | 256 KB |

With no prior design experience, X. Wen completed a synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years. X. Wen is one person → basic simplicity of the XMT architecture. Simple → faster time to market, lower implementation cost.

Benchmarks
Sparse Matrix – Vector Multiplication (SpMV)
– Matrix stored in Compressed Sparse Row (CSR) format
– Serial version: iterate through rows
– Parallel version: one thread per row
1-D FFT
– Fixed-point arithmetic implementation
– Serial version: radix-2 Cooley-Tukey algorithm
– Parallel version: each stage of the serial algorithm is parallelized
Quicksort
– Serial version: standard textbook implementation
– Parallel version: two phases
  Phase 1: for large sub-arrays, parallelize the partitioning operation using atomic prefix-sum
  Phase 2: process all partitions in parallel using the serial partitioning algorithm
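As a reference point for the SpMV benchmark, here is a minimal Python sketch of the CSR kernel (illustrative only; the benchmark code itself is not shown in the slides). The array names `row_ptr`, `col_idx` and `values` are the conventional CSR names and are assumptions here. Each iteration of the outer loop touches only its own row and output element, which is what makes the one-thread-per-row parallel version straightforward.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0] * n_rows
    for row in range(n_rows):  # independent per row: one thread per row in parallel
        acc = 0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# 3x3 example matrix:
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2, 1, 3, 4, 5]
print(spmv_csr(row_ptr, col_idx, values, [1, 2, 3]))  # [5, 6, 19]
```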

Experimental Platforms

| | XMT Paraleap FPGA | Intel Core 2 Duo |
| Processor | 1 MTCU, 64 TCUs | Dual Core, E6300 |
| Clock | 75 MHz | 1.86 GHz |
| Cache | 256 KB shared L1 cache | 2x64 KB L1, 2x2 MB L2 |
| DRAM | 1 GB DDR2 | 2 GB DDR2 |
| Data rate | 0.6 GB/s | 6.4 GB/s |
| Compiler | XMTCC (GCC based) | GCC; Intel C++ Professional Compiler (ICC) v11 |
| Compiler optim. | -O3, data prefetch, read-only buffers | -O3, SSE3 SIMD, data prefetching, auto-parallelization |

For meaningful comparison: compare cycle count

Input Datasets

| Program | N (small) | Footprint (small) | N (large) | Footprint (large) |
| SpMV | 22K | 200 KB | 4M | 33 MB |
| FFT | 8K | 192 KB | 4M | 96 MB |
| Quicksort | 100K | 781 KB | 20M | 153 MB |

Large dataset represents realistic input sizes
– Recommended by Intel engineer for comparison
– Gives Intel Core 2 an advantage because of its larger cache
Small dataset
– Fits in both Paraleap and Intel Core 2 cache
– Provides the fairest comparison for the current XMT generation

Clock-Cycle Speedup
Table (numeric values not preserved in this transcript): rows SpMV, FFT, Quicksort(*); columns Core 2 – ICC (small, large) and Core 2 – GCC (small, large).
Computed as: speedup = #ClockCycles for Core 2 / #ClockCycles for Paraleap
– Paraleap outperforms Intel Core 2 on all benchmarks
– Lower speedups for the large dataset because of Paraleap's smaller cache size
  – Will not be an issue for future implementations of XMT
– Silicon area of 64-TCU XMT is roughly the same as one core of Intel Core 2 Duo
– No reason for clock frequency of XMT to fall behind
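The speedup definition can be written out as a short calculation. The Python sketch below uses only the two clock rates from the Experimental Platforms slide; the cycle counts passed to `cycle_speedup` are placeholders chosen for illustration, not measured results, and the helper names are assumptions.

```python
XMT_CLOCK_HZ = 75e6      # Paraleap FPGA clock, from the platform table
CORE2_CLOCK_HZ = 1.86e9  # Intel Core 2 Duo E6300 clock, from the platform table


def cycle_speedup(core2_cycles, paraleap_cycles):
    """speedup = #ClockCycles for Core 2 / #ClockCycles for Paraleap."""
    return core2_cycles / paraleap_cycles


def wall_clock_ratio(speedup):
    """Core 2 time divided by Paraleap time, at the two actual clock rates."""
    return speedup * XMT_CLOCK_HZ / CORE2_CLOCK_HZ


clock_ratio = CORE2_CLOCK_HZ / XMT_CLOCK_HZ
print(round(clock_ratio, 1))    # 24.8
print(cycle_speedup(200, 50))   # 4.0 (placeholder cycle counts)
```

Because the FPGA prototype runs at a clock about 24.8 times slower than the Core 2, the slides compare cycle counts rather than wall-clock time; a cycle-count speedup would carry over to wall-clock time only if XMT ran at a comparable clock frequency.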

Conclusion
XMT provides a viable answer to the biggest challenges for the field
– Ease of programming
– Scalability (up & down)
Preliminary evaluation shows good results for the XMT architecture versus the state-of-the-art Intel Core 2 platform
ICPP’08 paper compares with GPUs.

Software release
Lets you use your own computer for programming in an XMT environment and experimenting with it, including:
(i) Cycle-accurate simulator of the XMT machine
(ii) Compiler from XMTC to that machine
Also provided: extensive material for teaching or self-studying parallelism, including
(i) Tutorial + manual for XMTC (150 pages)
(ii) Class notes on parallel algorithms (100 pages)
(iii) Video recording of 9/15/07 HS tutorial (300 minutes)
(iv) Video recording of grad Parallel Algorithms lectures (30+ hours)
Next Major Objective
Industry-grade chip. Requires 10X in funding.

Ease of Programming
Benchmark: can any CS major program your manycore?
- Cannot really avoid it.
Teachability demonstrated so far:
- To a freshman class with 11 non-CS students. Some programming assignments: merge-sort, integer-sort & sample-sort.
Other teachers:
- Magnet HS teacher. Downloaded simulator, assignments, class notes from the XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, analyze: ability to anticipate performance (as in serial). Can do this not just for embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up keynote at + interview with teacher.
- High school & middle school students (some 10-year-olds) from underrepresented groups, taught by HS math teacher.