PetaScale Execution Time Analysis
Computational Sciences Division, USC VITERBI SCHOOL OF ENGINEERING
Bob Lucas, Director
Poster Participants: Jeff Draper, Mary Hall, Jacqueline Chame, Pedro Diniz, Jeff Sondeen, Spundun Bhatt, Tim Barrett

PIMS FOR KNOWLEDGE DISCOVERY
In collaboration with Hans Chalupsky & Jafar Adibi, USC ISI

App/Sys Prototype: link discovery (LD) on PIMs in an IA64 host
System: four boards with eight PIM chips

Objective
- Evaluate link discovery (LD) algorithms on Godiva hardware.

Hypothesis
- LD algorithms are data-intensive and highly parallel
- Data is largely read-only
- Irregular memory accesses cause poor cache performance
- PIM technology would therefore yield a performance improvement

Expected Results
- Parallel PIM implementations of LD computations
- Performance comparisons with the Itanium-2 host
- Analysis of software/hardware scalability requirements
- Analysis of programming complexity
(Raw performance measurements and the results of the scalability analysis appear below.)

AUTOMATIC PERFORMANCE TUNING
Model-Guided Empirical Optimization
ECO: combining models and guided empirical search for memory hierarchy optimization

Tools Organization and Rationale
- Code Isolator
- Model-Guided Empirical Optimization
- Results

MONARCH PROJECT
MOrphable Networked ARCHitecture (MONARCH)
- DARPA-funded collaboration between USC, Raytheon, Mercury, IBM, and Georgia Tech
- Combines two radically different computing paradigms:
  - Conventional thread-level parallel programming model
    - RISC processor with extensions
    - WideWord (MMX-like) unit formed through morphing
    - Useful for complex code sets containing data-dependent control-flow decisions
  - Stream programming model (dataflow stream operation)
    - Field Programmable Compute Array (FPCA)
    - Useful for predictable operations on large data streams, e.g., pre-filtering of sensor data
    - Achieves the highest data throughput

Monarch Chip Overview
- IBM Cu-08 90nm CMOS; clock 333 MHz
- 64 GOPS/GFLOPS; power 3-6 GFLOPS/W
- 12 arithmetic clusters: 96 ALUs (32-bit integer/float)
- 31 memory clusters: 256W x 32 bits each (128KB)
- 6 RISC processors
- 12 MBytes eDRAM
- 2 memory interfaces (8 GB/s bandwidth)
- 2 RapidIO (x4 serial) interfaces
- 17 DIFL ports (2.6 GB/s each; DIFL = Differential Inter-FPCA Link)
- On-chip quad ring (40 GB/s)

Architecture/VLSI Chip Floorplan
[Figure: Monarch chip floorplan, showing AC, RISC, NWW, MC, IC, PBUF, eDRAM, HSS, and PLL blocks.]

Status
- Currently in fab
- First silicon expected 4Q06
- Prototype boards/modules expected 1Q06

ASIC Area Breakdown (Full MONARCH Chip)
- Based on IBM's maximum die size of 352 sq mm (18.76 mm on a side)
- Total active Cu-08 cells = 280,054,413 (~100M gate equivalents)

PERFORMANCE EXPECTATION
Authors: Pedro Diniz, Jeremy Abramson, Tejus Krishna
Contact: diniz@isi.edu

Low-level binary instrumentation is too expensive:
- It takes too much time, precluding observation of real runs
- It generates large volumes of data, forcing the use of sampling techniques

Approach: a synergistic combination of compiler static analysis and dynamic run-time data extraction
- Static analysis uncovers some program-behavior information and identifies the data to be extracted at run-time
- The tool then instruments the source code to extract the missing data at run-time

Advantages:
- Much faster than the binary instrumentation approach
- Observed metrics can be related to the source-level program

Goal: derive performance expectations from source code for different architectures
- What should the performance be, and why?
- What is limiting the performance: data dependences, or architecture limitations?
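The hybrid instrumentation idea can be made concrete with a small sketch. The C fragment below is hypothetical (the kernel, counter names, and access pattern are invented for illustration, not taken from the poster's tool): static analysis is assumed to have resolved everything about the loop except its data-dependent trip count and its indirect accesses, so only those are counted at run time.

  #include <stdio.h>

  /* Counters inserted by the (hypothetical) source instrumenter for the
   * quantities static analysis could not derive: the data-dependent trip
   * count and the indirect-access stream through idx[]. */
  static long trip_count = 0;
  static long indirect_accesses = 0;

  void kernel(double *a, const double *b, const int *idx, int n)
  {
      for (int i = 0; i < n; i++) {
          trip_count++;          /* inserted: loop trip count */
          indirect_accesses++;   /* inserted: one irregular access per iteration */
          a[i] += b[idx[i]];     /* original statement, left untouched */
      }
  }

  int main(void)
  {
      double a[8] = {0}, b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
      int idx[8] = {7, 0, 3, 1, 6, 2, 5, 4};   /* irregular access pattern */
      kernel(a, b, idx, 8);
      /* The metrics stay relatable to source-level constructs. */
      printf("kernel loop: %ld iterations, %ld indirect accesses\n",
             trip_count, indirect_accesses);
      return 0;
  }

Because the counters are attached to named source constructs rather than machine addresses, the observed metrics map directly back to the program text, which is the advantage claimed over binary instrumentation.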
Compiler Approach to Performance Expectation
Approach: use data-flow analysis and scheduling techniques
- Extract the data-flow graph (DFG) from the high-level source code
- Make assumptions about the memory hierarchy
- Compute an as-soon-as-possible (ASAP) schedule
- Vary the number and implementation features of the units (load/store units, functional units)

Architectural Exploration Results for UMT2K
[Charts: schedule length vs. available units, with no unrolling of the inner loop and with the inner loop unrolled 4x.]
Code:
- Inner loop of the angular loop in the snswp3D procedure
- 272 operations: 4 FP divides (non-pipelined), 41 FP multiplies, 95 integer ops, 84 loads/stores, 22 integer multiplies
Analysis:
- Compute-bound: adding more load/store units won't help
- Not cost-effective to have more than 2 ALUs (non-unrolled) or 4 ALUs (4x unrolled)

Section authors:
- Chun Chen, YoonJu L. Nelson, Jacqueline Chame, Mary Hall (contact: jchame@isi.edu)
- Jacqueline Chame, Mary Hall, Spundun Bhatt, Tim Barrett (contact: jchame@isi.edu)
- Jeff Draper, Jeff Sondeen, Sumit Mediratta, Rashed Bhatti, TJ Kwon, Tim Barrett, et al. (contact: draper@isi.edu)

MODEL-GUIDED EMPIRICAL OPTIMIZATION
- Model-guided compiler optimization: static models of the architecture and of profitability
- Empirical optimization: empirical data guide optimization decisions, as in self-tuning libraries such as ATLAS, PhiPAC, FFTW, and SPIRAL
- Exploit the complementary strengths of both approaches:
  - compiler models prune unprofitable solutions
  - empirical data provide an accurate measure of optimization impact

Two-phase organization:
- Phase 1: analysis/models and transformation modules take the application code and an architecture specification; code variant generation produces a set of parameterized code variants plus constraints on unbound parameters.
- Phase 2: a search engine, using performance monitoring support in the execution environment and a representative input data set, selects the optimized code variant and emits the optimized code.

[Chart: ECO vs. ATLAS BLAS, vendor BLAS, and the native compiler on matrix multiply, SGI R10K.]

Targeting multimedia extension architectures (Superword-Level Parallelism, SLP):
- Phase 1 (analysis/models and transformation modules):
  - select loop order
  - cache and TLB optimizations
  - unroll&jam loops with SLP and spatial reuse
  - output: parameterized code variants optimized for caches/TLB, with unroll&jam to expose SLP, plus constraints on unbound parameters
- Phase 2 (code variant generation on the unrolled code):
  - pack isomorphic operations
  - align operands
  - register optimizations: superword replacement, register packing
  - low-level optimizations
  - the empirical search engine, with performance monitoring in the execution environment and a representative input data set, produces the optimized code

[Chart: results for Intel SSE.] In process: PPC AltiVec.

CODE ISOLATOR

                             UMT2K      Energy Loop   Angle Loop
  Size (LOC)                 232K       150           1.3K
  Execution Time (hh:mm:ss)  41:02:05   00:00:12      00:10:00
  #Args.                     1          6             50
  Input Data (Bytes)         0.57M      61.69M        442.84M

Summary
- Develop a "benchmark" of a computation kernel from a large application
  - performance behavior equivalent to the full application
  - programmer and/or compiler tool
- Support model-guided empirical optimization (the ECO project)
  - increase machine and programmer efficiencies
- Develop tool support for automatic performance tuning
  - locality optimizations
  - shared-memory parallel optimizations

RAW PERFORMANCE MEASUREMENTS

Mutual Information
                                                  Clock     Exec. Time  Cycles  IPC
  Itanium-2                                       900 MHz   5.5 ms      4.9M    1.588
  Single PIM (superword, compiler + hand-tuned)   140 MHz   32.1 ms     4M      n/a
  (18% fewer cycles on the PIM)

Graph Clustering
                                                  Clock     Exec. Time  Cycles  IPC
  Itanium-2                                       900 MHz   0.26 ms     233K    0.806
  Single PIM (scalar, compiler)                   140 MHz   1.11 ms     155K    n/a
  (33% fewer cycles on the PIM)

RESULTS OF SCALABILITY ANALYSIS
Assume the same clock on the PIM and the Itanium-2:

  Speedup using 1 PIM = IT2 Cycles / PIM Cycles
    = 1.225 for MI
    = 1.503 for GC (1.008 for 2 PIMs)

Now normalize by the IPC of the scaled data set, since PIM behavior is consistent across data sets:

  Speedup = IT2 Cycles * (IPC_test / IPC_scaled) / PIM Cycles
    = 1.316 for MI
    = 2.611 for GC (1.75 for 2 PIMs)
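As an arithmetic check, the short C program below reproduces the reported speedups from the cycle counts in the tables above. The poster does not give the IPC_test/IPC_scaled ratios themselves, so here they are back-derived from the reported results purely to show the shape of the formula.

  #include <stdio.h>

  int main(void)
  {
      /* Cycle counts from the tables above */
      double it2_cycles_mi = 4.9e6, pim_cycles_mi = 4.0e6;
      double it2_cycles_gc = 233e3, pim_cycles_gc = 155e3;

      /* Equal-clock speedup: IT2 cycles / PIM cycles */
      printf("MI speedup: %.3f\n", it2_cycles_mi / pim_cycles_mi);  /* 1.225 */
      printf("GC speedup: %.3f\n", it2_cycles_gc / pim_cycles_gc);  /* 1.503 */

      /* IPC-normalized speedup: IT2 Cycles * (IPC_test / IPC_scaled) / PIM Cycles.
       * The IPC ratios are back-derived from the poster's reported 1.316 (MI)
       * and 2.611 (GC), since the ratios themselves are not given. */
      double ipc_ratio_mi = 1.316 / 1.225;  /* implied ~1.074 */
      double ipc_ratio_gc = 2.611 / 1.503;  /* implied ~1.737 */
      printf("MI normalized: %.3f\n",
             it2_cycles_mi * ipc_ratio_mi / pim_cycles_mi);
      printf("GC normalized: %.3f\n",
             it2_cycles_gc * ipc_ratio_gc / pim_cycles_gc);
      return 0;
  }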
Code Isolator: Original vs. Isolated Program

In the original program, the code fragment to be isolated is outlined into its own function:

  int main() {
      ...
      OutlineFunc(...);    /* code fragment to be executed */
      ...
  }

  void OutlineFunc(...) {
      /* the isolated code */
  }

The isolated program, which runs just the outlined code, must be:
1. Compilable
2. Executable: the data values captured by StoreInitialDataValues during the original run are exactly those restored by SetInitialDataValues in the isolated program
3. Faithful to machine state: state captured by CaptureMachineState is restored by SetMachineState
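To make requirement 2 concrete, here is a minimal, hypothetical sketch of the StoreInitialDataValues / SetInitialDataValues pairing. The signatures, the file-based capture, and the OutlineFunc body are invented for illustration and are not the Code Isolator's actual implementation (which must also capture and set machine state, e.g., cache contents).

  #include <stdio.h>
  #include <stdlib.h>

  /* Placeholder for the outlined code fragment. */
  void OutlineFunc(double *a, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          a[i] *= 2.0;
  }

  /* Original run: capture the live data the isolated fragment will read. */
  void StoreInitialDataValues(const double *data, size_t n, const char *path)
  {
      FILE *f = fopen(path, "wb");
      if (!f) { perror(path); exit(1); }
      fwrite(data, sizeof(double), n, f);
      fclose(f);
  }

  /* Isolated run: restore that data before executing the fragment. */
  double *SetInitialDataValues(size_t n, const char *path)
  {
      FILE *f = fopen(path, "rb");
      if (!f) { perror(path); exit(1); }
      double *data = malloc(n * sizeof(double));
      if (!data || fread(data, sizeof(double), n, f) != n) exit(1);
      fclose(f);
      return data;
  }

  int main(void)
  {
      size_t n = 1024;                      /* hypothetical fragment input size */
      double *orig = calloc(n, sizeof(double));
      StoreInitialDataValues(orig, n, "fragment.dat");      /* original run */
      double *a = SetInitialDataValues(n, "fragment.dat");  /* isolated run */
      OutlineFunc(a, n);                    /* execute the isolated fragment */
      free(orig);
      free(a);
      return 0;
  }

The input-data sizes in the Code Isolator table above (61.69 MB for the energy loop, 442.84 MB for the angle loop, versus a 0.57 MB input for the full program) reflect exactly this capture step: the isolated kernel must carry all the live data it would otherwise have inherited from the surrounding application.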