DUSD(Labs)
Breaking Down the Memory Wall for Future Scalable Computing Platforms
Wen-mei Hwu, Sanders-AMD Endowed Chair Professor
with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Ronald D. Barnes, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steven S. Lumetta
University of Illinois at Urbana-Champaign

SIGMICRO Online Seminar, January 18, 2005. Wen-mei W. Hwu, University of Illinois at Urbana-Champaign

Trends in hardware
- High variability
  - Increasing speed and power variability of transistors
  - Limited frequency increase
  - Reliability / verification challenges
- Large interconnect delay
  - Increasing interconnect delay and shrinking clock domains
  - Limited size of individual computing engines
[Figure: Interconnect RC delay. Delay (ps): clock period vs. RC delay of 1 mm copper interconnect. Data: Shekhar Borkar, Intel]

Trends in architecture
- Transistors are free… until connected or used
- Continued scaling of traditional processor core no longer economically viable
  - 2-3X effective area yields ~1.6X performance [PollackMICRO32]
  - Verification, power, transistor variability
- Only obvious scaling route: “Multi-Everything”
  - Multi-thread, multi-core, multi-memory, multi-?
  - CW: Distributed parallelism is easy to design
- But what about software?
  - If you build a better mousetrap…

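As a rough check on the area/performance figure above: the relationship is often summarized as single-core performance growing with about the square root of core area. Assuming that square-root form (an assumption here, not something the slide states), the quoted numbers are consistent:

```latex
\text{performance} \propto \sqrt{\text{area}}
\quad\Longrightarrow\quad
\sqrt{2} \approx 1.41, \qquad \sqrt{2.5} \approx 1.58, \qquad \sqrt{3} \approx 1.73
```

So a 2-3X area increase lands in the 1.4X to 1.7X range, with ~1.6X near the middle.
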
A “multi-everything” processor of the future
- Distributed, less complex components
  - Variability, power density, and verification are easier to address
- Who bears the SW mapping burden?
  - General purpose software changes prohibitively expensive (cf. SIMD, IA-64)
  - Advanced compiler features: “Deep Analysis”
  - New programming models / frameworks
  - Interactive compilers

General purpose processor component(s)
- The system director
- Performs traditionally-programmed tasks
  - Software migration starts here
- Likely multiple GPPs
- Less complex processor cores

Computational efficiency through customization
- Goal: Offload most processing to more specialized, more efficient units
- Application Processors (APP)
  - Specialized instruction sets, memory organizations and access facilities
- Programmable Accelerators (ACC)
  - Think ASIC with knobs
  - Highly-specialized pipelines
  - Approximate ASIC design points
- Higher performance/watt than general purpose for target applications

Memory efficiency through diversity
- The traditional monolithic memory model is a major power / performance sink
- Need a partnership of the general-purpose memory hierarchy and software-managed memories
- Local memories will reduce unnecessary memory traffic and power consumption
- Bulk data transfer is scheduled by a Memory Transfer Module
- Software will gradually adopt a decentralized model for power and bandwidth

Tolerating communication & adding macropipelining
- Bulk communication overhead often substantial for traditional accelerators
- Shared memory / snooping communication approach limits available bandwidth
- Compilation tools will have to seamlessly connect processors and accelerators
- Accelerators will be able to operate on bulk transferred, buffered data, or on streamed data

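The bulk-transferred, buffered mode of operation can be sketched in C as below. This is a minimal illustration, not the platform's actual interface: `LOCAL_WORDS`, `acc_scale`, and `scale_in_tiles` are hypothetical names, and `memcpy` merely stands in for a transfer scheduled by a memory transfer module.

```c
#include <stddef.h>
#include <string.h>

#define LOCAL_WORDS 64  /* hypothetical local-memory capacity, in elements */

/* Hypothetical accelerator kernel: touches only its local buffer. */
static void acc_scale(int *local, size_t n, int factor) {
    for (size_t i = 0; i < n; i++)
        local[i] *= factor;
}

/* Process `data` tile by tile: bulk-copy in, compute locally, bulk-copy out.
 * memcpy stands in for a scheduled bulk transfer (an assumption). */
void scale_in_tiles(int *data, size_t n, int factor) {
    int local[LOCAL_WORDS];  /* models a software-managed local memory */
    for (size_t base = 0; base < n; base += LOCAL_WORDS) {
        size_t len = (n - base < LOCAL_WORDS) ? n - base : LOCAL_WORDS;
        memcpy(local, data + base, len * sizeof *local);  /* transfer in  */
        acc_scale(local, len, factor);                    /* local work   */
        memcpy(data + base, local, len * sizeof *local);  /* transfer out */
    }
}
```

Calling `scale_in_tiles(v, 150, 3)` on an array of 150 elements processes two full tiles and one partial tile; the main memory is touched only by the bulk copies.
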
Embedded systems already trying out this paradigm
[Figure: block diagrams of three embedded processors: Intel IXP1200 Network Processor, Intel IXP2400 Network Processor, and Philips Nexperia (Viper). Labeled components include XScale core, hash engine, scratchpad SRAM, microengines, QDR SRAM, RDRAM, PCI, CSRs, RFIFO, TFIFO, SPI4 / CSIX, ARM, MIPS, MPEG, VLIW, video, MSP, and access control.]

Decentralizing parallelism in a JPEG decoder
- Convert a typical media-processing application to the decentralized model
  - Arrays used to implement streams
  - Multiple loci of computation with various models of parallelism
  - Memory access bandwidth is a bottleneck without private data
[Figure: Conceptual dataflow view of two JPEG decoding steps]

Data privatization and local memory
[Figure: Conceptual dataflow view of two JPEG decoding steps]
- Accelerate color conversion first (execute in ACC or APP)
  - Main processor sends inputs, receives outputs
- Large tables: inefficient to send data from main processor
  - Need tables to reside in the accelerator for efficiency of access
  - Tables are initialized once during program execution, and never modified again
  - Accurate pointer analysis necessary to determine this

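The table property described above (written once, read-only afterwards) can be illustrated with a small C sketch. It is not taken from the decoder discussed in the talk: `cr_to_red`, `init_color_tables`, and `convert_row_red` are hypothetical names, and only the red channel is shown, using the common 1.402 Cr-to-red coefficient.

```c
#define NUM_LEVELS 256

/* Private lookup table: written only by init_color_tables(), then read-only.
 * A compiler that can prove this may keep a copy inside the accelerator. */
static int cr_to_red[NUM_LEVELS];

/* One-time initialization: the only code that ever writes the table. */
void init_color_tables(void) {
    for (int cr = 0; cr < NUM_LEVELS; cr++)
        cr_to_red[cr] = (int)(1.402 * (cr - 128));
}

/* Clamp an intermediate value to the 8-bit sample range. */
static int clamp_sample(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* Convert one row. Only the y/cr inputs and the red outputs need to cross
 * the processor/accelerator boundary; the table stays local. */
void convert_row_red(const unsigned char *y, const unsigned char *cr,
                     unsigned char *red, int width) {
    for (int i = 0; i < width; i++)
        red[i] = (unsigned char)clamp_sample(y[i] + cr_to_red[cr[i]]);
}
```

In this sketch the "initialized once" fact is obvious because the table is a file-scope static. In real code the table is typically reached through pointers, which is why accurate pointer analysis is needed to establish the same fact.
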
Increasing parallelism
- Heavyweight loop nests communicate through an intermediate array
- Direct streaming of data is possible and supports higher parallelism (macropipelining)
  - Convert() and Upsample() loops can be chained
- Accurate interprocedural dataflow analysis is necessary

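The difference between communicating through an intermediate array and chaining the two loops can be sketched as follows. This is a simplified stand-in, not the decoder's actual Convert() and Upsample() code: the upsampling is plain 2x replication and `convert_sample` is a hypothetical per-sample stage.

```c
#include <stddef.h>

/* Hypothetical per-sample stage standing in for color conversion. */
static unsigned char convert_sample(unsigned char s) {
    return (unsigned char)(255 - s);
}

/* Baseline: two loop nests communicate through the intermediate array tmp
 * (tmp and out must each hold 2*n samples). */
void upsample_then_convert(const unsigned char *in, size_t n,
                           unsigned char *tmp, unsigned char *out) {
    for (size_t i = 0; i < n; i++) {      /* upsample: 2x replication */
        tmp[2 * i] = in[i];
        tmp[2 * i + 1] = in[i];
    }
    for (size_t j = 0; j < 2 * n; j++)    /* convert */
        out[j] = convert_sample(tmp[j]);
}

/* Chained: each upsampled value is consumed as soon as it is produced,
 * so the intermediate array and its memory traffic disappear. */
void upsample_convert_chained(const unsigned char *in, size_t n,
                              unsigned char *out) {
    for (size_t i = 0; i < n; i++) {
        unsigned char up = in[i];         /* value streamed between stages */
        out[2 * i] = convert_sample(up);
        out[2 * i + 1] = convert_sample(up);
    }
}
```

The rewrite is only legal if the compiler can show that nothing else reads or writes `tmp` between the two loops, which is the kind of fact interprocedural dataflow analysis has to supply.
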
How the next-generation compiler will do it (1)
To-do list:
  [ ] Identify acceleration opportunities
  [ ] Localize memory
  [ ] Stream data and overlap computation
Acceleration opportunities:
  - Heavyweight loops identified for acceleration
  - However, they are isolated in separate functions called through pointers
[Figure callout: Heavyweight loops]

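The obstacle named above, heavyweight loops hidden behind calls through pointers, looks roughly like this in C. The sketch is illustrative only; `struct decoder`, `row_fn`, and the two row routines are hypothetical names.

```c
#include <stddef.h>

/* Signature shared by all row-processing routines. */
typedef void (*row_fn)(const unsigned char *in, unsigned char *out, size_t n);

/* Decoder object whose processing step is selected at setup time. */
struct decoder {
    row_fn color_convert;
};

/* Heavyweight loop #1 (hypothetical): invert every sample. */
static void invert_row(const unsigned char *in, unsigned char *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(255 - in[i]);
}

/* Heavyweight loop #2 (hypothetical): pass samples through unchanged. */
static void copy_row(const unsigned char *in, unsigned char *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
}

/* Bind the routine once, far away from where it is called. */
void decoder_setup(struct decoder *d, int want_invert) {
    d->color_convert = want_invert ? invert_row : copy_row;
}

/* Indirect call: without pointer analysis the compiler cannot tell which
 * loop runs here, so it cannot map that loop onto an accelerator. */
void decode_row(const struct decoder *d, const unsigned char *in,
                unsigned char *out, size_t n) {
    d->color_convert(in, out, n);
}
```

Resolving the set of functions that `d->color_convert` may refer to is what turns the indirect call into a concrete acceleration candidate.
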
How the next-generation compiler will do it (2)
To-do list:
  [x] Identify acceleration opportunities
  [ ] Localize memory
  [ ] Stream data and overlap computation
Localize memory:
  - Pointer analysis identifies localizable memory objects
  - Private tables inside accelerator initialized once, saving most traffic
[Figure callouts: Large constant lookup tables identified; Initialization code identified]

How the next-generation compiler will do it (3)
To-do list:
  [x] Identify acceleration opportunities
  [x] Localize memory
  [ ] Stream data and overlap computation
Streaming and computation overlap:
  - Memory dataflow summarizes array/pointer access patterns
  - Opportunities for streaming are automatically identified
  - Unnecessary memory operations replaced with streaming
[Figure callouts: Summarize input access pattern; Summarize output access pattern; Constant table privatized]

How the next-generation compiler will do it (4)
To-do list:
  [x] Identify acceleration opportunities
  [x] Localize memory
  [x] Stream data and overlap computation
Achieve macropipelining of parallelizable accelerators:
  - Upsampling and color conversion can stream to each other
  - Optimizations can have substantial effect on both efficiency and performance

Memory dataflow in the pointer world
- Arrays are not true 3D arrays (unlike in Fortran)
- Actual implementation: array of pointers to array of samples
- New type of dataflow problem: understanding the semantics of memory structures instead of true arrays
[Figure callouts: Array of constant pointers; Row arrays never overlap]

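The "array of pointers to array of samples" layout can be sketched in C as follows. The names `sample_t`, `alloc_rows`, `fill_row`, and `free_rows` are hypothetical and chosen for illustration.

```c
#include <stdlib.h>

typedef unsigned char sample_t;

/* Allocate a rows x cols image as an array of row pointers, each pointing
 * to its own zero-filled row. Returns NULL on allocation failure. */
sample_t **alloc_rows(size_t rows, size_t cols) {
    sample_t **image = malloc(rows * sizeof *image);
    if (!image)
        return NULL;
    for (size_t r = 0; r < rows; r++) {
        image[r] = calloc(cols, sizeof **image);
        if (!image[r]) {              /* undo the rows allocated so far */
            while (r--)
                free(image[r]);
            free(image);
            return NULL;
        }
    }
    return image;
}

/* Write one row. To the compiler this is a store through a loaded pointer;
 * nothing in the types says that image[r] and image[r+1] are disjoint. */
void fill_row(sample_t **image, size_t r, size_t cols, sample_t value) {
    for (size_t c = 0; c < cols; c++)
        image[r][c] = value;
}

/* Release every row, then the pointer array itself. */
void free_rows(sample_t **image, size_t rows) {
    for (size_t r = 0; r < rows; r++)
        free(image[r]);
    free(image);
}
```

The two facts in the figure callouts hold here by construction: the row pointers are never reassigned after allocation, and each row comes from a separate allocation so rows never overlap. A compiler has to recover both facts by analysis before it can treat `image[r][c]` like a true 2D array access.
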
Compiler vs. hardware memory walls
- Hardware memory wall
  - Prohibitive implementation cost of memory system while trying to keep up with the processor speed under power budget
- Compiler memory wall
  - The use of memory as a generic pool obstructs compiler’s view of true program and data structures
- The decentralized and diversified memory approach is key to breaking the hardware memory wall
- Breaking the compiler memory wall will be increasingly important in breaking the hardware memory wall

Pointer analysis: sensitivity, stability and safety
- Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
- Improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools [PASTE2004]

Pointer analysis: sensitivity, stability and safety
- Analysis is abstract execution
  - Simplifying abstractions → analysis stability
  - “Unrealizable dataflow” results
- Many components of accuracy
  - Typical to cut some corners to enable “key” component for particular applications
- Making the components usefully compatible is a major contribution
  - No need for a priori corner-cutting → better results across broad code base
- Safety in “unsafe” languages
  - C poses major challenges
  - Efficiency challenge increased in safe algorithms

How do sensitivity, stability and safety coexist?
- Our two-pronged approach to sensitive, stable, safe pointer analysis
  - Summarization: Only relevant details are forwarded to a higher level
  - Containment: The algorithm can cut its losses locally (like a bulkhead) to avoid a global explosion in problem size
- Example: summarization-based context sensitivity…

Context sensitivity: naïve inlining
[Figure callouts: Retention of side effect still leads to spurious results; Excess statements unnecessary and costly]

Context sensitivity: summarization-based
[Figure callouts: Now, only correct result derived; Compact summary of jade used; Summary accounts for all side-effects; BLOCK assignment to prevent contamination]

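Why context sensitivity matters can be seen in a very small C program. The example below is hypothetical and is not the code shown on the slide; `store`, `two_contexts`, and the globals are invented for illustration.

```c
#include <stddef.h>

/* Two distinct objects and two pointers that will refer to them. */
int x_obj, y_obj;
int *a_ptr = NULL, *b_ptr = NULL;

/* Helper with a side effect on its caller's data: *slot = target. */
void store(int **slot, int *target) {
    *slot = target;
}

/* The helper is called from two different contexts. */
void two_contexts(void) {
    store(&a_ptr, &x_obj);  /* context 1: a_ptr -> x_obj */
    store(&b_ptr, &y_obj);  /* context 2: b_ptr -> y_obj */
}

/* What the analyses conclude:
 *
 * Context-insensitive: the parameters of store() are merged over all calls,
 *   slot   -> { a_ptr, b_ptr }
 *   target -> { x_obj, y_obj }
 * so the analysis reports a_ptr -> { x_obj, y_obj } and
 * b_ptr -> { x_obj, y_obj }. The pairs (a_ptr, y_obj) and (b_ptr, x_obj)
 * are spurious: no execution produces them.
 *
 * Context-sensitive: the effect of store() is summarized as
 * "*slot = target" and applied separately at each call site, giving
 * a_ptr -> { x_obj } and b_ptr -> { y_obj }.
 */
```

At run time the precise result is the only one that occurs: after `two_contexts()`, `a_ptr` holds the address of `x_obj` and `b_ptr` the address of `y_obj`.
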
Analyzing large, complex programs
[Table, flattened in extraction. Header text: Benchmark; INACCURATE Context Insensitive (seconds); (seconds); PREV Context-Sensitive (seconds); NEW Context-Sensitive (seconds). Row text: espresso291 li ijpeg2851 perl gcc52HOURS124 perlbmk155MONTHS198 gap vortex51363 twolf121]
- Originally, problem size exploded as more contexts were encountered
- New algorithm contains problem size with each additional context [SAS2004]
- This results in an efficient analysis process without loss of accuracy

The outlook in software
- Software is changing too, more gradually
- Applications driving development: rich in parallelism
  - Physical world: medicine, weather
  - Video, games: signal & media processing
- Source code availability
  - Open Source continues to grow
  - Microsoft’s Phoenix Compiler Project
- New programming models
  - Enhanced developer productivity & enhanced parallelism

Beyond the traditional language environment
- Domain-specific, higher-level modeling languages
  - More intuitive than C for inherently parallel problems
  - Implementation details abstracted away from developers
    - Increased productivity, increased portability
- Still an important role for the compiler in this domain
  - Little visibility “through” the model for low-level optimization by developers
    - Communication, memory optimization will be critical in next-gen systems
  - Model can provide structured semantics for the compiler, beyond what can be derived from analysis of low-level code
- As new system models are developed, compilers, modeling languages, and developers will take on new, interactive roles

Domain-specific modeling and optimization
- Programming Model provides the compiler with information that one cannot extract with analysis alone
- Compiler breaks the limitations that are imposed by the model, allowing for efficient, high-performance binaries

Concluding thoughts
- Reaching the true potential of multi-everything hardware
  - Scalability requires distributed parallelism and memory models
  - Requires new compilation tools to break compiler memory wall
- Broad suite of analyses necessary
  - Advanced pointer analysis
  - Memory dataflow analysis
  - New interactions of classical analyses
- This is not just reinventing HPF
  - New distributed parallelism paradigms
  - New applications, new challenges!
- As the field develops, new domain-specific programming models will also benefit from advanced compilation technology