DUSD(Labs)
Breaking Down the Memory Wall for Future Scalable Computing Platforms
Wen-mei Hwu, Sanders-AMD Endowed Chair Professor
with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Ronald D. Barnes, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steven S. Lumetta
University of Illinois at Urbana-Champaign

SIGMICRO Online Seminar—January 18, 2005
Wen-mei W. Hwu — University of Illinois at Urbana-Champaign

Slide 2: Trends in hardware
- High variability
  - Increasing speed and power variability of transistors
  - Limited frequency increase
  - Reliability / verification challenges
- Large interconnect delay
  - Increasing interconnect delay and shrinking clock domains
  - Limited size of individual computing engines
[Figure: RC delay (ps) of a 1mm copper interconnect vs. clock period. Data: Shekhar Borkar, Intel]

Slide 3: Trends in architecture
- Transistors are free... until connected or used
- Continued scaling of the traditional processor core is no longer economically viable
  - 2-3X effective area yields ~1.6X performance [PollackMICRO32]
  - Verification, power, transistor variability
- Only obvious scaling route: "multi-everything"
  - Multi-thread, multi-core, multi-memory, multi-?
  - Conventional wisdom: distributed parallelism is easy to design
- But what about software?
  - If you build a better mousetrap...
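Pollack's observation cited above amounts to single-core performance growing roughly with the square root of effective core area; a quick arithmetic check (a sketch, with the square-root exponent taken from the usual statement of the rule of thumb, not from the slide itself):

```python
import math

# Pollack's rule of thumb: single-core performance grows roughly with
# the square root of effective core area. A 2-3X area core then buys
# about 1.4-1.7X performance, bracketing the ~1.6X figure on the slide.
def pollack_speedup(area_ratio):
    return math.sqrt(area_ratio)

print(f"{pollack_speedup(2.0):.2f}x - {pollack_speedup(3.0):.2f}x")  # 1.41x - 1.73x
```

The diminishing return is exactly why the slide calls further monolithic scaling uneconomical.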

Slide 4: A "multi-everything" processor of the future
- Distributed, less complex components
  - Variability, power density, and verification are easier to address
- Who bears the SW mapping burden?
  - General-purpose software changes are prohibitively expensive (cf. SIMD, IA-64)
  - Advanced compiler features ("deep analysis")
  - New programming models / frameworks
  - Interactive compilers

Slide 5: General-purpose processor component(s)
- The system director
- Performs traditionally-programmed tasks
  - Software migration starts here
- Likely multiple GPPs
- Less complex processor cores

Slide 6: Computational efficiency through customization
- Goal: offload most processing to more specialized, more efficient units
- Application Processors (APP)
  - Specialized instruction sets, memory organizations, and access facilities
- Programmable Accelerators (ACC)
  - Think ASIC with knobs
  - Highly specialized pipelines
  - Approximate ASIC design points
- Higher performance/watt than general purpose for target applications

Slide 7: Memory efficiency through diversity
- Traditional monolithic memory model is a major power / performance sink
- Need a partnership of the general-purpose memory hierarchy and software-managed memories
- Local memories will reduce unnecessary memory traffic and power consumption
- Bulk data transfers scheduled by a Memory Transfer Module
- Software will gradually adopt a decentralized model for power and bandwidth
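The scheduled bulk-transfer idea can be pictured as double buffering: while the accelerator works on one block in local memory, the transfer engine fills the other, so transfer latency overlaps with computation. A minimal sketch; the names (`fetch_block`, `BLOCK`) and the trivial compute kernel are illustrative, not from the talk:

```python
# Double-buffering sketch: overlap bulk transfers with computation.
# A software-managed local memory holds two blocks; while one block
# is being processed, the next is (conceptually) being transferred in
# by the memory transfer engine.
BLOCK = 4

def fetch_block(src, i):
    """Stand-in for a bulk DMA transfer into local memory."""
    return src[i * BLOCK:(i + 1) * BLOCK]

def process(block):
    return [x * 2 for x in block]        # placeholder compute kernel

def run(src):
    n_blocks = len(src) // BLOCK
    out = []
    nxt = fetch_block(src, 0)            # prefetch the first block
    for i in range(n_blocks):
        cur = nxt
        # In hardware, this next transfer proceeds while process(cur) runs.
        nxt = fetch_block(src, i + 1) if i + 1 < n_blocks else None
        out.extend(process(cur))
    return out

print(run(list(range(8))))  # [0, 2, 4, 6, 8, 10, 12, 14]
```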

Slide 8: Tolerating communication and adding macropipelining
- Bulk communication overhead is often substantial for traditional accelerators
- Shared-memory / snooping communication approaches limit available bandwidth
- Compilation tools will have to seamlessly connect processors and accelerators
- Accelerators will be able to operate on bulk-transferred, buffered data... or on streamed data

Slide 9: Embedded systems are already trying out this paradigm
[Block diagrams: Intel IXP1200 network processor (XScale core, microengines, hash engine, scratchpad SRAM, QDR SRAM / RDRAM interfaces, PCI, CSRs, TFIFO/RFIFO, SPI4/CSIX); Intel IXP2400 network processor; Philips Nexperia (Viper) (ARM and MIPS cores, access control, MPEG, VLIW video, MSP).]

Slide 10: Decentralizing parallelism in a JPEG decoder
- Convert a typical media-processing application to the decentralized model
  - Arrays used to implement streams
  - Multiple loci of computation with various models of parallelism
  - Memory access bandwidth is a bottleneck without private data
[Figure: conceptual dataflow view of two JPEG decoding steps]

Slide 11: Data privatization and local memory
[Figure: conceptual dataflow view of two JPEG decoding steps]
- Accelerate color conversion first (execute in ACC or APP)
  - Main processor sends inputs, receives outputs
- Large tables: inefficient to send the data from the main processor
  - Tables need to reside in the accelerator for efficiency of access
  - Tables are initialized once during program execution and never modified again
  - Accurate pointer analysis is necessary to determine this
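The pattern the slide describes looks roughly like this: a lookup table written once at startup and only read inside the hot conversion loop. A sketch; the table name and the fixed-point-free YCbCr red-channel formula are illustrative, not taken from the decoder's actual source:

```python
# A lookup table initialized once, then only read inside the hot
# color-conversion loop. A pointer analysis that proves the table is
# never modified after initialization lets the compiler privatize it
# in accelerator-local memory instead of streaming it from the host.
cr_to_r = [0] * 256

def init_tables():
    # Runs exactly once at program start; the only writes to cr_to_r.
    for cr in range(256):
        cr_to_r[cr] = round(1.402 * (cr - 128))

def convert_red(y_samples, cr_samples):
    # Hot loop: reads cr_to_r, never writes it.
    return [min(255, max(0, y + cr_to_r[cr]))
            for y, cr in zip(y_samples, cr_samples)]

init_tables()
print(convert_red([100, 200], [128, 200]))  # [100, 255]
```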

Slide 12: Increasing parallelism
- Heavyweight loop nests communicate through an intermediate array
- Direct streaming of the data is possible and supports higher parallelism (macropipelining)
  - The Convert() and Upsample() loops can be chained
- Accurate interprocedural dataflow analysis is necessary
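The transformation can be pictured with generators standing in for the chained loops: instead of the upsampling stage filling a whole intermediate array that the conversion stage then reads, the two stages are chained so each element flows through as soon as it is produced. A sketch; the stage bodies are placeholders, not the JPEG code:

```python
# Before: Upsample writes an intermediate array, then Convert reads it.
def upsample_array(samples):
    out = []
    for s in samples:
        out.extend([s, s])                # 2x nearest-neighbor upsampling
    return out

def convert_array(samples):
    return [s + 1 for s in samples]       # placeholder per-sample conversion

# After: the loops are chained (macropipelined); each element streams
# through both stages with no intermediate array.
def upsample_stream(samples):
    for s in samples:
        yield s
        yield s

def convert_stream(samples):
    for s in samples:
        yield s + 1

batch = convert_array(upsample_array([1, 2]))
streamed = list(convert_stream(upsample_stream([1, 2])))
assert batch == streamed == [2, 2, 3, 3]
```

Proving this legal is exactly the interprocedural dataflow problem the slide names: the compiler must show the consumer reads the intermediate array in the same order the producer writes it.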

Slide 13: How the next-generation compiler will do it (1)
To-do list:
[ ] Identify acceleration opportunities
[ ] Localize memory
[ ] Stream data and overlap computation

Acceleration opportunities:
- Heavyweight loops identified for acceleration
- However, they are isolated in separate functions called through pointers

Slide 14: How the next-generation compiler will do it (2)
To-do list:
[x] Identify acceleration opportunities
[ ] Localize memory
[ ] Stream data and overlap computation

Localize memory:
- Pointer analysis identifies localizable memory objects
- Private tables inside the accelerator are initialized once, saving most of the traffic
[Figure annotations: large constant lookup tables identified; initialization code identified]

Slide 15: How the next-generation compiler will do it (3)
To-do list:
[x] Identify acceleration opportunities
[x] Localize memory
[ ] Stream data and overlap computation

Streaming and computation overlap:
- Memory dataflow summarizes array/pointer access patterns
- Opportunities for streaming are identified automatically
- Unnecessary memory operations are replaced with streaming
[Figure annotations: input access pattern summarized; output access pattern summarized; constant table privatized]

Slide 16: How the next-generation compiler will do it (4)
To-do list:
[x] Identify acceleration opportunities
[x] Localize memory
[x] Stream data and overlap computation

- Achieves macropipelining of parallelizable accelerators
- Upsampling and color conversion can stream to each other
- The optimizations can have a substantial effect on both efficiency and performance

Slide 17: Memory dataflow in the pointer world
- The arrays are not true 3D arrays (unlike in Fortran)
- Actual implementation: an array of pointers to arrays of samples
- A new type of dataflow problem: understanding the semantics of memory structures instead of true arrays
  - Array of constant pointers
  - Row arrays never overlap
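In Python terms, the layout in question is a list of separately allocated row lists rather than one flat 2D block. The two facts the slide says the analysis must establish, that the pointer array itself is constant and that the rows are disjoint, can be made concrete in a sketch (the variable names are illustrative):

```python
# A "2D array" built the way the slide describes: an outer array of
# pointers (here, a list) to separately allocated row arrays. A
# Fortran-style analysis sees one monolithic array; here the compiler
# must prove two facts about the memory structure:
#   1. the outer pointer array is constant after allocation, and
#   2. no two row arrays alias (overlap).
rows, cols = 2, 4
image = [[0] * cols for _ in range(rows)]   # each row is its own object

# Fact 2, checked dynamically: all row objects are distinct.
assert len({id(row) for row in image}) == rows

# Contrast: the aliasing case the analysis must rule out.
shared = [0] * cols
aliased = [shared, shared]                  # both "rows" are one object
aliased[0][1] = 7
assert aliased[1][1] == 7                   # write to row 0 visible in row 1
```

Without fact 2, a write through one row pointer could change any row, and neither privatization nor streaming would be safe.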

Slide 18: Compiler vs. hardware memory walls
- Hardware memory wall
  - Prohibitive implementation cost of a memory system that keeps up with processor speed within the power budget
- Compiler memory wall
  - The use of memory as a generic pool obstructs the compiler's view of the true program and data structures
- The decentralized and diversified memory approach is key to breaking the hardware memory wall
- Breaking the compiler memory wall will be increasingly important in breaking the hardware memory wall

Slide 19: Pointer analysis: sensitivity, stability, and safety
- Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
- Improved analysis algorithms provide more accurate call graphs, instead of a blurred view, for use by program transformation tools [PASTE2004]

Slide 20: Pointer analysis: sensitivity, stability, and safety (continued)
- Analysis is abstract execution
  - Simplifying abstractions → analysis stability
  - But also "unrealizable dataflow" results
- Accuracy has many components
  - Typical to cut some corners to enable a "key" component for particular applications
- Making the components usefully compatible is a major contribution
  - No a priori corner-cutting needed → better results across a broad code base
- Safety in "unsafe" languages
  - C poses major challenges
  - The efficiency challenge increases in safe algorithms

Slide 21: How do sensitivity, stability, and safety coexist?
- Our two-pronged approach to sensitive, stable, safe pointer analysis:
  - Summarization: only relevant details are forwarded to a higher level...
  - Containment: the algorithm can cut its losses locally (like a bulkhead)...
  - ...to avoid a global explosion in problem size
- Example: summarization-based context sensitivity

Slide 22: Context sensitivity: naive inlining
- Retention of side effects still leads to spurious results
- Excess statements are unnecessary and costly

Slide 23: Context sensitivity: summarization-based
- Now only the correct result is derived
- A compact summary of jade is used
- The summary accounts for all side effects
- BLOCK assignment prevents contamination
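The situation these two slides contrast can be sketched with a tiny points-to example (illustrative; the site and object names are not from the talk). A context-insensitive analysis pools all callers of an identity-like helper, so every call site appears to receive every object ever passed; applying a per-call-site summary of the helper keeps the contexts separate:

```python
# Tiny points-to sketch: two call sites pass different objects through
# the same helper, whose summary is "returns exactly its argument".
call_sites = {"site1": {"objA"}, "site2": {"objB"}}

# Context-INSENSITIVE: one merged points-to set for the helper's
# return value, reported at every call site (unrealizable dataflow).
merged = set().union(*call_sites.values())
insensitive = {site: merged for site in call_sites}

# Summarization-based context sensitivity: the compact summary is
# instantiated separately in each calling context.
sensitive = {site: set(args) for site, args in call_sites.items()}

assert insensitive["site1"] == {"objA", "objB"}   # spurious objB
assert sensitive["site1"] == {"objA"}             # precise result
```

Naive inlining would also recover the precise result, but at the cost of re-analyzing the helper's full body at every call site; the summary carries only the relevant details.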

Slide 24: Analyzing large, complex programs
- Originally, the problem size exploded as more contexts were encountered
- The new algorithm contains the problem size with each additional context [SAS2004]
- The result is an efficient analysis process without loss of accuracy
[Table: analysis times on espresso, li, ijpeg, perl, gcc, perlbmk, gap, vortex, and twolf. The inaccurate context-insensitive analysis and the NEW context-sensitive analysis finish in seconds on every benchmark; the PREV context-sensitive analysis took HOURS on gcc (NEW: 124 seconds) and MONTHS on perlbmk (NEW: 198 seconds).]

Slide 25: The outlook in software
- Software is changing too, more gradually
- Applications driving development are rich in parallelism
  - Physical world: medicine, weather
  - Video, games: signal & media processing
- Source code availability
  - Open Source continues to grow
  - Microsoft's Phoenix compiler project
- New programming models
  - Enhanced developer productivity & enhanced parallelism

Slide 26: Beyond the traditional language environment
- Domain-specific, higher-level modeling languages
  - More intuitive than C for inherently parallel problems
  - Implementation details abstracted away from developers
    - Increased productivity, increased portability
- Still an important role for the compiler in this domain
  - Little visibility "through" the model for low-level optimization by developers
    - Communication and memory optimization will be critical in next-gen systems
  - The model can provide structured semantics for the compiler, beyond what can be derived from analysis of low-level code
- As new system models are developed, compilers, modeling languages, and developers will take on new, interactive roles

Slide 27: Domain-specific modeling and optimization
- The programming model provides the compiler with information that cannot be extracted by analysis alone
- The compiler breaks the limitations imposed by the model, allowing for efficient, high-performance binaries

Slide 28: Concluding thoughts
- Reaching the true potential of multi-everything hardware
  - Scalability requires distributed parallelism and memory models
  - Requires new compilation tools to break the compiler memory wall
- A broad suite of analyses is necessary
  - Advanced pointer analysis
  - Memory dataflow analysis
  - New interactions of classical analyses
- This is not just reinventing HPF
  - New distributed-parallelism paradigms
  - New applications → new challenges!
- As the field develops, new domain-specific programming models will also benefit from advanced compilation technology