
DUSD(Labs) Breaking the Memory Wall for Scalable Microprocessor Platforms Wen-mei Hwu with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steve S. Lumetta University of Illinois at Urbana-Champaign

PACT Keynote, October 1, 2004

Semiconductor computing platform challenges
[Slide diagram: microprocessors at the center, surrounded by challenge labels — performance, billion transistors, power, cost, DSP/ASIP, security, Intelligent RAM, feature set, reliability, memory latency/bandwidth, power constraints, reconfigurability, accelerators, O/S limitations, S/W inertia, wire load, process variation, leakage, fab cost]

ASIC/ASIP economics
- Optimistically, ASIC/ASSP revenues are growing 10-20% per year
  - The engineering portion of the budget is supposed to be trimmed every year (but never is)
  - Chip development costs are rising faster than increased revenues and decreased engineering costs can make up the difference
  - This implies roughly 40% fewer IC designs (each covering more applications) every process generation
[Slide diagram: number of IC designs bounded by total ASIC/ASSP revenues (growing 10-20%/yr), engineering costs (5-20% of budget), and per-chip development cost — net effect, about 40% fewer designs per generation]

ASIPs: non-traditional programmable platforms
- Level of concurrency must be comparable to ASICs
- ASIPs will be on-chip, high-performance multiprocessors
[Slide diagram: a network-processor-style ASIP — XScale core, 16 microengines, hash engine, scratchpad SRAM, RFIFO/TFIFO, four QDR SRAM channels, RDRAM, PCI, CSRs, SPI4/CSIX interfaces]

Example embedded ASSP implementations
- Intel IXP1200 network processor
- Philips Nexperia (Viper): MIPS core plus VLIW media processor

What about the general-purpose world?
- Clock frequency increase of computing engines is slowing down
  - Power budget hinders higher clock frequency
  - Device variation limits deeper pipelining
  - Most future performance improvement will come from concurrency and specialization
- Size increase of single-thread computing engines is slowing down
  - Power budget limits the number of transistors activated by each instruction
  - Finer-grained units are needed for defect containment
  - Wire delay is becoming a primary limiter in large, monolithic designs
- The approach of covering all applications with a single primary execution model is showing its limitations

Impact of transistor variations
[Slide chart: normalized frequency vs. normalized leakage (Isb) across 130 nm dies — about a 30% spread in frequency and about a 5x spread in leakage power. Source: Shekhar Borkar, Intel]

Metal interconnects
[Slide charts: interconnect RC delay — relative line resistance and capacitance trends with copper and low-K ILD, and the RC delay (ps) of a 1 mm interconnect vs. the clock period across generations, with and without 0.7x scaling. Source: Shekhar Borkar, Intel]

Measured SPECint2000 performance
[Slide chart: SPECint2000 measured on real hardware in the same fabrication technology, as of October 2003]

Convergence of future computing platforms
- General processor cores
  - Very low power compute and memory structures
  - O/S provides lightweight access to custom features
- Acceleration logic
  - Application-specific logic
  - High-bandwidth, distributed storage (RAM, registers)
  - To the developer, behaves like software components
- Memory system
  - Data delivery to the processor
  - O/S and virtual memory issues
  - Intelligent memory controllers
- Application processors
  - Lightweight compute engines
  - High-bandwidth, distributed storage (RAM, registers)
  - High-bandwidth, scalable interconnect

Breaking the memory wall with distributed memory and data movement

Parallelization with deep analysis: deconstructing von Neumann [IWLS2004]
- Memory dataflow that enables
  - Extraction of independent memory access streams
  - Conversion of implicit flows through memory into explicit communication
- Applicability to the mass software base requires pointer analysis, control flow analysis, and array dependence analysis
[Slide diagram: a vocoder post-filter (calls to Weight_Ai, Residu, Copy, Set_zero, Syn_filt, correlation loops, preemphasis, agc, sharing arrays such as res2, F_g3, F_g4, Az_4, syn, Ap3, Ap4, h) transformed from a single CPU with monolithic DRAM into processing elements with distributed DRAM and explicit dataflow among the kernels]

Memory bottleneck example (G.724 decoder post-filter, C code)
- Problem: production and consumption occur with different patterns across three kernels
  - An anti-dependence in the preemphasis function (loop reversal is not applicable)
  - The consumer must wait until the producer finishes
- Goal: convert memory accesses into inter-cluster communication
[Slide diagram: Residu writes res[0:39] to memory, preemphasis updates it in place in [39:0] order, and Syn_filt reads res[0:39], serializing the three kernels over time]

Breaking the memory bottleneck
- Remove the anti-dependence by array renaming
- Apply loop reversal to match producer/consumer I/O
- Convert array accesses into inter-component communication
- Enabled by interprocedural pointer analysis + array dependence test + array access pattern summaries + interprocedural memory dataflow
[Slide diagram: with preemphasis writing a renamed array res2, the Residu, preemphasis, and Syn_filt kernels stream data element by element instead of serializing through memory]

A prototyping experience with the Xilinx ML300
- Full-system environment
  - Linux running on PowerPC
  - Lean system with custom Linux (Nacho Navarro, UIUC/UPC)
  - Virtex-II Pro FPGA logic treated as software components
- Removing the memory bottleneck
  - Random memory accesses converted to dataflow
  - Memory objects assigned to distributed Block RAM
- SW/HW communication
  - PLB vs. OCM interface

Initial results from our ML300 testbed
- Case study: GSM vocoder
  - Main filter in the FPGA; the rest in software running under Linux with customized support
  - Straightforward software/accelerator communication pattern
  - Fits in the available resources of the Xilinx ML300's V2P7
  - Performance compared to all-software execution, including communication overhead
[Slide chart: projected filter latency in cycles — roughly 8x speedup for a naive hardware implementation and roughly 32x for an optimized one, relative to software]

Grand challenge
- Moving the mass-market software base to heterogeneous computing architectures
  - Embedded computing platforms in the near term
  - General-purpose computing platforms in the long run
[Slide diagram: applications and systems software layered over programming models, restructuring compilers, communications and storage management, accelerator architectures, OS support, and platforms]

Slicing through software layers

Taking the first step: pointer analysis
- To what can this variable point? (points-to)
  - Can these two variables point to the same thing? (alias)
  - Fundamental to unraveling communication through memory: programmers like modularity and pointers!
- Pointer analysis is abstract execution
  - Models all possible executions of the program
  - Must include the important facets, or the result won't be useful
  - Must ignore irrelevant details, or the result won't be timely
  - Unrealizable dataflow = artifacts of "corners cut" in the model
- Typically, the emphasis has been on timeliness rather than resolution, because expensive algorithms cause unstable analysis times; for typical alias uses this may be OK...
- ...but we have new applications that can benefit from higher accuracy
  - Dataflow unraveling for logic synthesis and heterogeneous systems

How to be fast, safe, and accurate?
- An efficient, accurate, and safe pointer analysis based on two key ideas:
  - Efficient analysis of a large program requires that only relevant details be forwarded to higher-level components
  - The algorithm can locally cut its losses (like a bulkhead) to avoid a global explosion in problem size

One facet: context sensitivity
- Context sensitivity avoids unrealizable dataflow by distinguishing the proper calling context
  - Example: what assignments do a and g receive?
  - Context-insensitive (CI): a and g each receive 1 and 3
  - Context-sensitive (CS), the desired result: g receives only 1 and a receives only 3
- Typical reactions to CS costs
  - Forget it; live with lots of unrealizable dataflow
  - Combine it with a "cheapener" like the lossy compression of a Steensgaard analysis
- We want to do better, but we may sometimes need to mix CS and CI to keep the analysis fast

Context insensitive (CI)
- Collecting all the assignments in the program and solving them simultaneously yields a context-insensitive solution
- Unfortunately, this leads to three spurious solutions

Context sensitive (CS): naive process
- Retention of side effects still leads to spurious results
- Excess statements are unnecessary and costly

CS: "accurate and efficient" approach
- Now only the correct result is derived
- A compact summary of jade is used; the summary accounts for all side effects
- The processed assignment is deleted to prevent contamination

Analyzing large, complex programs [SAS2004]
- Benchmarks: espresso, li, ijpeg, perl, gcc, perlbmk, gap, vortex, twolf
- For gcc: 52 seconds for the (inaccurate) context-insensitive analysis, hours for the previous context-sensitive algorithm, 124 seconds for the new context-sensitive algorithm
- For perlbmk: 155 seconds context-insensitive, months previously, 198 seconds with the new algorithm
- Originally, problem size exploded as more contexts were encountered; the new algorithm contains the problem size with each additional context
- The result is an efficient analysis process without loss of accuracy

Example application and current challenges [PASTE2004]
- Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
- Example: improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools

From benchmarks to a broad application code base
- The long-term trend is for all code to go through a compiler and be managed by a runtime system
  - Microsoft's code base to go through Phoenix (OpenIMPACT participation)
  - The open-source code base to go through GCC/OpenIMPACT under Gelato
- The compiler and runtime will perform deep analysis to give tools visibility into software
  - Parallelizers, debuggers, verifiers, models, validation, instrumentation, configuration, memory managers, runtime systems, etc.
[Slide diagram: applications, operating systems, and libraries layered over the compiler, runtime and tools, and hardware]

Global memory dataflow analysis
- Integrates analyses to deconstruct the memory "black box"
  - Interprocedural pointer analysis: lets the programmer use the language and modularity without losing transformability
  - Array access pattern analysis: figures out communication among loops that communicate through arrays
  - Control and data flow analyses: enhance resolution by understanding program structure
  - Heap analysis extends the analysis to a much wider software base
- SSA-based induction-variable detection and a dependence test have been integrated into the IMPACT environment

PACT Keynote, October 1, 2004 foo (int *s, int L) { int *p=s, i; for (i=0; i<L; i++) *p =...; p++; } foo writes A[0:63] stride 1 bar reads A[1:64] stride 1 procedure call parameter mapping Read from *(t) to *(t+M) to *(t+M) with stride 1 with stride 1 Procedure body summary for the whole loop Write *p loop body main(...) { int A[100]; foo(A, 64); foo(A, 64); bar(A+1, 64) bar(A+1, 64) } bar (int *t, int M) { int *q=t, i; for (i=0; i<M; i++) … = *q; q++; } Write from *(s) to *(s+L) to *(s+L) with stride 1 with stride 1 Read *q Data flow analysis determines that A[64] is not from foo Pointer relation analysis restates p/q in terms of s/t Example on deriving memory data flow

Conclusions and outlook
- Heterogeneous multiprocessor systems will be the model for both general-purpose and embedded computing platforms in the future
  - Both are motivated by powerful trends
  - Shorter-term adoption for embedded systems; longer term for general-purpose systems
- Programming models and parallelization of traditional programs will channel software to these new platforms
  - The feasibility of deep pointer analysis has been demonstrated
  - Many need to participate in solving this grand-challenge problem