CISC 879 : Software Support for Multicore Architectures
John Cavazos
Dept of Computer & Information Sciences, University of Delaware
Lecture 3: Laws, Equality, and Inside a Cell

Lecture 2: Overview
- Know the Laws
- All are NOT Created Equal
- Inside a Cell

Two Important Laws
- Amdahl’s Law
  - Gene Amdahl’s observation in 1967
  - Speedup is limited by serial portions
  - Assumes a fixed workload and fixed problem size
- Gustafson’s Law
  - John Gustafson’s observation in 1988
  - Rescues parallel processing from Amdahl’s Law
  - Proposes fixed time and increasing work
  - Sequential portions have diminishing effect
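The two laws are usually written as the formulas below. This is a sketch in the standard textbook notation, not taken from the slides: the symbols p (parallel fraction) and n (number of processors) are assumptions introduced here. In Amdahl's formula p is the parallelizable fraction of the original single-processor run time; in Gustafson's formula p is the fraction of the fixed parallel run time spent in the parallel portion.

```latex
\[
  S_{\text{Amdahl}}(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
  \qquad
  S_{\text{Gustafson}}(n) = (1 - p) + p\,n
\]
```

As n grows, Amdahl's speedup is bounded by 1/(1 - p), while Gustafson's scaled speedup keeps growing linearly in n.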

Amdahl’s Law
- Example program: five parts of 100 time units each (500 total); parts 1, 3, and 5 are sequential
- Parallelize parts 2 and 4 with 2 processors: each now takes 50 units
- Speedup: 25% (500 / 400 = 1.25)

Amdahl’s Law (cont’d)
- Same program: sequential parts still take 100 units each
- Parallelize parts 2 and 4 with 4 processors: each now takes 25 units
- Speedup: about 43% (500 / 350 ≈ 1.43)

Amdahl’s Law (cont’d)
- Same program: sequential parts still take 100 units each
- Parallelize parts 2 and 4 with infinite processors: their time shrinks toward zero
- Speedup: only about 67% (500 / 300 ≈ 1.67)
- Multicore doesn’t look very appealing!
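The arithmetic behind the Amdahl example can be checked with a short script. This is a minimal sketch that assumes the pictured program has five parts of 100 time units each, with parts 2 and 4 parallelizable; the function and variable names are hypothetical and not from the slides.

```python
def amdahl_time(part_times, parallel_parts, n_procs):
    """Run time when the (1-based) parts in parallel_parts are split across n_procs."""
    total = 0.0
    for index, time_units in enumerate(part_times, start=1):
        if index in parallel_parts:
            total += time_units / n_procs  # perfectly divided among processors
        else:
            total += time_units            # sequential part is unaffected
    return total


def amdahl_speedup(part_times, parallel_parts, n_procs):
    """Speedup relative to running every part on one processor."""
    return sum(part_times) / amdahl_time(part_times, parallel_parts, n_procs)


# Assumed layout of the slide's example: five parts, parts 2 and 4 parallel.
PARTS = [100, 100, 100, 100, 100]
PARALLEL = {2, 4}

for n in (2, 4, float("inf")):
    s = amdahl_speedup(PARTS, PARALLEL, n)
    print(f"{n} processors: {s:.2f}x ({(s - 1) * 100:.0f}% faster)")
```

The sequential 300 units never shrink, so the speedup cannot exceed 500 / 300 no matter how many processors are added.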

Gustafson’s Law (cont’d)
- Boxes contain units of work now!
- Sequential parts: 100 units of work each
- Parts 2 and 4: 200 units of work each
- 500 units of time, but 700 units of work!
- Speedup: 40%

Gustafson’s Law (cont’d)
- Boxes contain units of work now!
- Sequential parts: 100 units of work each
- Parts 2 and 4: 400 units of work each
- 500 units of time, but 1100 units of work!
- Speedup: 2.2x, a 120% gain (1100 / 500 = 2.2)
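The Gustafson example can be reproduced the same way. This is a minimal sketch under stated assumptions: the program spends 300 time units in sequential parts and 200 in parallel parts, and the two slides correspond to 2 and 4 processors (the slides do not state the processor counts; they are inferred from the work totals). The function name is hypothetical.

```python
def gustafson_speedup(serial_time, parallel_time, n_procs):
    """Scaled speedup: work completed in a fixed time window versus one processor."""
    fixed_time = serial_time + parallel_time             # wall-clock time stays constant
    scaled_work = serial_time + parallel_time * n_procs  # parallel parts do n times the work
    return scaled_work / fixed_time


# Assumed from the slides: 3 sequential parts and 2 parallel parts, 100 time units each.
SERIAL_TIME = 300
PARALLEL_TIME = 200

for n in (2, 4):
    work = SERIAL_TIME + PARALLEL_TIME * n
    s = gustafson_speedup(SERIAL_TIME, PARALLEL_TIME, n)
    print(f"{n} processors: {work} units of work in 500 units of time, {s:.1f}x")
```

Unlike the fixed-size case, the result keeps growing with the processor count because the sequential 300 units become a smaller share of the total work.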

Gustafson’s Law (cont’d)
- Gustafson made an important observation
  - As processors grow, people scale problem size
  - Serial bottlenecks do not grow with problem size
- Increasing processors gives linear speedup
  - 20 processors roughly twice as fast as 10
- This is why supercomputers are successful
  - More processors allow increased dataset size
- Reference:

All Multicores Not Equal
- Multicore CPUs and GPUs are very different!
  - CPUs run general-purpose programs well
  - GPUs run graphics (or similar programs) well
- General-purpose programs have
  - Less parallelism
  - More complex control requirements
- GPU programs are
  - Highly parallel
  - Arithmetic intensive
  - Simple in their control requirements

Floating-Point Operations
- GPUs have more computational units and take better advantage of them.
- Figure: 32-bit FP operations per second
- Slide Source: NVIDIA CUDA Programming Guide 1.1

CPUs versus GPUs
- CPUs devote lots of area to control and storage.
- GPUs devote most area to computational units.
- Slide Source: NVIDIA CUDA Programming Guide 1.1

CPU Programming Model
- Scalar programming model
  - No native data parallelism
- Few arithmetic units
  - Very small area
- Optimized for complex control
- Optimized for low latency, not high bandwidth
- Slide Source: John Owens, EEC 227 Graphics Arch course

AMD K7 “Deerhound”
- Slide Source: John Owens, EEC 227 Graphics Arch course

GPU Programming Model
- Streams
  - Collections of data records
  - Amenable to data parallelism
- Kernels
  - Inputs/outputs are streams
  - Perform computation on each element of a stream
  - No dependencies between stream elements
- Stream storage
  - Not cache (input read once/output written once)
  - Producer-consumer locality
- Slide Source: John Owens (EEC 227 Graphics Arch) and Pat Hanrahan (Stream Prog. Env., GP^2 Workshop)
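The stream and kernel ideas can be illustrated on a CPU with plain Python generators. This is only a conceptual sketch, not GPU code: `run_kernel`, `scale`, and `offset` are hypothetical names chosen for illustration. Each kernel sees one record at a time and keeps no state between records, and chaining two kernels hands each intermediate value straight from producer to consumer.

```python
from typing import Callable, Iterable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def run_kernel(kernel: Callable[[A], B], stream: Iterable[A]) -> Iterator[B]:
    """Apply a kernel to every record of a stream, one record at a time.

    The kernel only sees the current record, so there are no dependencies
    between stream elements and the records could be processed in parallel.
    """
    for record in stream:
        yield kernel(record)


def scale(record: float) -> float:
    """First kernel: double each record."""
    return 2.0 * record


def offset(record: float) -> float:
    """Second kernel: add one to each record."""
    return record + 1.0


# Each input record is read once and each output record is written once.
input_stream = [1.0, 2.0, 3.0]
output_stream = list(run_kernel(offset, run_kernel(scale, input_stream)))
print(output_stream)  # [3.0, 5.0, 7.0]
```

Because the generators are lazy, the output of `scale` is consumed by `offset` immediately and never stored as a whole intermediate stream, which mirrors the producer-consumer locality bullet.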

Cell B.E. Design Goals
- An accelerator extension to Power
  - Exploits parallelism and achieves high frequency
  - Sustains high memory bandwidth through DMA
- Designed for flexibility
  - Heterogeneous architecture
    - PPU for control, general-purpose
    - SPU for computation-intensive, little control
  - Applicable to a wide variety of applications
- The Cell architecture has characteristics of both a CPU and a GPU.

Cell Chip Highlights
- 241M transistors
- 9 cores, 10 threads
- >200 GFlops (SP)
- >20 GFlops (DP)
- >300 GB/s EIB
- 3.2 GHz shipping
- Top freq. 4.0 GHz (in lab)
- Slide Source: Michael Perrone, MIT Fall 2007 course

Cell Details
- Heterogeneous multicore architecture
  - Power Processor Element (PPE) for control tasks
  - Synergistic Processor Element (SPE) for data-intensive processing
- SPE Features
  - No cache
  - Large unified register file
  - Synergistic Memory Flow Control (MFC)
    - Interface to high-perf. EIB
- Slide Source: Michael Perrone, MIT Fall 2007 course

Cell PPE Details
- Power Processor Element (PPE)
  - General-purpose 64-bit PowerPC RISC processor
  - 2-way hardware threaded
  - L1: 32KB I; 32KB D
  - L2: 512KB
  - For operating systems and program control
- Slide Source: Michael Perrone, MIT Fall 2007 course

Cell SPE Details
- Synergistic Processor Element (SPE)
  - 128-bit SIMD architecture
  - Dual issue
  - Register file: 128 x 128-bit
  - Local store (256KB)
  - Simplified branch architecture
    - No hardware branch predictor
    - Compiler-managed hint
  - Memory Flow Controller
    - Dedicated DMA engine, up to 16 outstanding requests
- Slide Source: Michael Perrone, MIT Fall 2007 course
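The ">200 GFlops (SP)" figure on the chip highlights slide can be sanity-checked from the SPE numbers. This is a back-of-the-envelope sketch with two assumptions that the slides do not state: that 8 of the 9 cores are SPEs (the ninth being the PPE), and that each SIMD lane completes one fused multiply-add, counted as 2 floating-point operations, per cycle. The PPE's own contribution is ignored.

```python
# From the slides: 128-bit SIMD SPEs at a shipping clock of 3.2 GHz.
SIMD_WIDTH_BITS = 128
SINGLE_PRECISION_BITS = 32
CLOCK_GHZ = 3.2

# Assumptions (not stated on the slides):
SPE_COUNT = 8                 # 9 cores = 1 PPE + 8 SPEs
FLOPS_PER_LANE_PER_CYCLE = 2  # one fused multiply-add per lane per cycle

lanes_per_spe = SIMD_WIDTH_BITS // SINGLE_PRECISION_BITS    # 4 single-precision lanes
flops_per_cycle_per_spe = lanes_per_spe * FLOPS_PER_LANE_PER_CYCLE
peak_sp_gflops = SPE_COUNT * flops_per_cycle_per_spe * CLOCK_GHZ

print(f"Estimated peak single-precision rate: {peak_sp_gflops:.1f} GFlops")
```

Under these assumptions the SPEs alone account for roughly 205 GFlops, which is consistent with the quoted figure.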

Compiler Tools
- GNU-based C/C++ compiler (Sony)
  - ppu-gcc/ppu-g++ - generates ppu code
  - spu-gcc/spu-g++ - generates spu code
- GDB debugger
  - Supports both PPU and SPU debugging
  - Different modes of execution
- Slide Source: Michael Perrone, MIT Fall 2007 course

Compiler Tools
- The XL C/C++ compiler
  - ppuxlc/ppuxlc++ - generates ppu code
  - spuxlc/spuxlc++ - generates spu code
- Includes the following optimization levels
  - -O0: almost no optimization
  - -O2: strong, low-level optimization
  - -O3: intense, low-level opts with basic loop opts
  - -O4: all of -O3 plus detailed loop analysis and good whole-program analysis
  - -O5: all of -O4 plus detailed whole-program analysis
- Slide Source: Michael Perrone, MIT Fall 2007 course

Performance Tools
- GNU-based tools
  - OProfile - system-level profiler (only PPU)
  - Gprof - generates call graphs
- IBM tools
  - Static analysis tool (spu_timing)
    - Annotates assembly file with scheduling and instruction issue estimates
  - Dynamic analysis tool (CellBE system simulator)
    - Can run your code on an x86 machine
    - Can collect a variety of statistics
- Slide Source: Michael Perrone, MIT Fall 2007 course

Compiling with the SDK
- README_build_env.txt (IMPORTANT! You should read it)
  - Provides details on the build environment features, including files, structure and variables.
- make.footer
  - Specifies all of the build rules needed to properly build binaries
  - Must be included in all SDK Makefiles (referenced relatively if $CELL_TOP is not defined)
  - Includes make.header
- make.header
  - Specifies definitions needed to process the Makefiles
  - Includes make.env
- make.env
  - Specifies the default compilers and tools to be used by make
- make.footer and make.header should not be modified
- Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0

Compiling with the SDK
- Defaults to gcc
- Set in make.env with three variables set to gcc or xlc
  - PPU32_COMPILER
  - PPU64_COMPILER
  - PPU_COMPILER [overrides PPU32_COMPILER and PPU64_COMPILER]
  - SPU_COMPILER
- Can change from the command line

      PPU_COMPILER=xlc SPU_COMPILER=xlc make
      make -e PPU64_COMPILER:=gcc -e PPU32_COMPILER:=gcc -e SPU_COMPILER:=gcc
      export PPU_COMPILER=xlc SPU_COMPILER=xlc ; make

- Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0

Compiling with the SDK
- Use CELL_TOP or maintain relative directory structure

      ifdef CELL_TOP
          include $(CELL_TOP)/make.footer
      else
          include ../../../make.footer
      endif

- Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0

Makefile variables
- DIRS
  - List of subdirectories to build first
- PROGRAM_ppu, PROGRAMS_ppu
  - 32-bit PPU program (or list of programs) to build.
- PROGRAM_ppu64, PROGRAMS_ppu64
  - 64-bit PPU program (or list of programs) to build.
- PROGRAM_spu, PROGRAMS_spu
  - SPU program (or list of programs) to build.
  - If written as a standalone binary, can run without being embedded in a PPU program.
- Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0

Makefile variables (cont’d)
- LIBRARY_embed, LIBRARY_embed64
  - Creates a linked library from an SPU program to be embedded into a 32-bit or 64-bit PPU program.
- CC_OPT_LEVEL
  - Optimization level for compiler to use
- CFLAGS, CFLAGS_gcc, CFLAGS_xlc
  - Additional flags for compiler to use (general or specific to gcc/xlc)
- TARGET_INSTALL_DIR
  - Specifies where built targets are installed
- Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0

Sample Project
- Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0

Next Time
- Chapters 1-3 of the NVIDIA CUDA Programming Guide, version 1.1
- And all of Chapter 29 from GPU Gems 2
- Links on website