DUSD(Labs)

Breaking the Memory Wall for Scalable Microprocessor Platforms

Wen-mei Hwu
with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steve S. Lumetta
University of Illinois at Urbana-Champaign

PACT Keynote, October 1, 2004

Semiconductor computing platform challenges

[Figure: diagram of platform challenges. Labels: performance, billion transistors, power, cost, DSP/ASIP, security, intelligent RAM, feature set, reliability, memory latency/bandwidth, power constraints, microprocessors, reconfigurability, accelerators, O/S limitations, S/W inertia, wire load, process variation, leakage, fab cost]

ASIC/ASIP economics

- Optimistically, ASIC/ASSP revenues are growing 10-20% per year
  - Engineering portion of budget is supposed to be trimmed every year (but never is)
  - Chip development costs are rising faster than increased revenues and decreased engineering costs can make up the difference
  - Implies 40% fewer IC designs (doing more applications) every process generation!

Number of IC designs ≤ (Total ASIC/ASSP revenues × Engineering costs %) / Per-chip development cost

[Figure annotations on the terms of the inequality: 10-20%, 5-20%, 40%]

ASIPs: non-traditional programmable platforms

- Level of concurrency must be comparable to ASICs
- ASIPs will be on-chip, high-performance multi-processors

[Figure: network processor block diagram. Labels: XScale core, hash engine, scratchpad SRAM, RFIFO, TFIFO, 16 micro engines, 4 QDR SRAM, RDRAM, PCI, CSRs, SPI4 / CSIX]

Example embedded ASSP implementations

[Figure: two chip diagrams. Labels: Intel IXP1200 Network Processor; Philips Nexperia (Viper), with MIPS and VLIW cores]

What about the general purpose world?

- Clock frequency increase of computing engines is slowing down
  - Power budget hinders higher clock frequency
  - Device variation limits deeper pipelining
  - Most future performance improvement will come from concurrency and specialization
- Size increase of single-thread computing engines is slowing down
  - Power budget limits number of transistors activated by each instruction
  - Need finer-grained units for defect containment
  - Wire delay is becoming a primary limiter in large, monolithic designs
- The approach of covering all applications with a primarily single execution model is showing limitations

Impact of Transistor Variations

[Figure: normalized leakage (Isb) versus normalized frequency at 130nm. Annotations: frequency ~30%, leakage power ~5X]
Source: Shekhar Borkar, Intel

Metal Interconnects

[Figure: interconnect RC delay charts. Labels: line capacitance (relative), line resistance (relative), low-K ILD, copper interconnect, RC delay of 1mm interconnect, delay (ps), clock period, RC delay (relative), 0.7x scaled RC delay]
Source: Shekhar Borkar, Intel

Measured SPECint2000 performance on real hardware with same fabrication technology

Date: October 2003

Convergence of future computing platforms

General processor cores
- Very low power compute and memory structures
- O/S provides lightweight access to custom features

Acceleration logic
- Application specific logic
- High-bandwidth, distributed storage (RAM, registers)
- To developer, behave like software components

Memory system
- Data delivery to processor
- O/S and virtual memory issues
- Intelligent memory controllers

Application processors
- Lightweight compute engines
- High-bandwidth, distributed storage (RAM, registers)
- High-bandwidth, scalable interconnect

Breaking the memory wall with distributed memory and data movement

Parallelization with deep analysis: Deconstructing von Neumann [IWLS2004]

- Memory dataflow that enables
  - Extraction of independent memory access streams
  - Conversion of implicit flows through memory into explicit communication
- Applicability to mass software base requires pointer analysis, control flow analysis, array dependence analysis

Code shown on the slide (as extracted):

    Weight_Ai (Az, F_ga3, Ap3)
    Weight_Ai (Az, F_g4, Ap4)
    Residu (Ap3, &syn_subfr[i],)
    Copy (Ap3, h, 11)
    Set_zero (&h[11], 11)
    Syn_filt (Ap4, h, h, 22, &h)

    tmp = h[0] * h[0];
    for (i = 1 ; i < 22 ; i++)
        tmp = tmp + h[i] * h[i];
    tmp1 = tmp >> 8;

    tmp = h[0] * h[1];
    for (i = 1 ; i < 21 ; i++)
        tmp = tmp + h[i] * h[i+1];
    tmp2 = tmp >> 8;

    if (tmp2 <= 0)
        tmp2 = 0;
    else
        tmp2 = tmp2 * MU;
    tmp2 = tmp2/tmp1;

    preemphasis (res2, temp2, 40)
    Syn_filt (Ap4, res2, &syn_p), 40, mem_syn_pst, 1);
    agc (&syn[i_subfr], &syn) 29491, 40)

[Figure: two system diagrams. In one, a CPU runs the code above with the data objects (res2, m_syn, F_g3, F_g4, Az_4, synth, syn, Ap3, Ap4, h, tmp, tmp1, tmp2) held in DRAM. In the other, the kernels (Weight_Ai, Copy + Set_zero, Residu, Syn_filt, Corr0/Corr1, preemph, agc, Syn_filt) run on PEs alongside the CPU and DRAM, with the same data objects shown among the PEs]

Memory bottleneck example (G.724 Decoder Post-filter, C code)

- Problem: production/consumption occur with different patterns across 3 kernels
  - Anti-dependence in preemphasis function (loop reversal not applicable)
  - Consumer must wait until producer finishes
- Goal: convert memory access to inter-cluster communication

[Figure: timeline of Residu, preemphasis, and Syn_filt communicating through array res in memory (MEM). Access orders shown: [0:39], [39:0], [0:39]. Horizontal axis: time]

Breaking the memory bottleneck

- Remove anti-dependence by array renaming
- Apply loop reversal to match producer/consumer I/O
- Convert array access to inter-component communication

Enabling analyses: interprocedural pointer analysis + array dependence test + array access pattern summary + interprocedural memory data flow

[Figure: timeline of Residu, preemphasis, and Syn_filt overlapped in time, communicating through res and res2. Horizontal axis: time]

A prototyping experience with the Xilinx ML300

- Full system environment
  - Linux running on PowerPC
  - Lean system with custom Linux (Nacho Navarro, UIUC/UPC)
  - Virtex 2 Pro FPGA logic treated as software components
- Removing memory bottleneck
  - Random memory access converted to dataflow
  - Memory objects assigned to distributed Block RAM
- SW / HW communication
  - PLB vs. OCM interface

Initial results from our ML300 testbed

- Case study: GSM vocoder
  - Main filter in FPGA
  - Rest in software running under Linux with customized support
  - Straightforward software/accelerator communications pattern
  - Fits in available resources on Xilinx ML300 V2P7
  - Performance compared to all-software execution, with communication overhead

[Figure: bar chart of projected filter latency in cycles for software, naïve hardware implementation, and optimized hardware implementation. Annotations: ~8x, ~32x]

Grand challenge

- Moving the mass-market software base to heterogeneous computing architectures
  - Embedded computing platforms in the near term
  - General purpose computing platforms in the long run

[Figure labels: Applications and Systems Software; Platforms; Programming models; Restructuring compilers; Communications and storage management; Accelerator architectures; OS support]

Slicing through software layers

Taking the first step: pointer analysis

- To what can this variable point? (points-to)
  - Can these two variables point to the same thing? (alias)
  - Fundamental to unraveling communications through memory: programmers like modularity and pointers!
- Pointer analysis is abstract execution
  - Model all possible executions of the program
  - Has to include important facets, or result won’t be useful
  - Has to ignore irrelevant details, or result won’t be timely
  - Unrealizable dataflow = artifacts of “corners cut” in the model
- Typically, emphasis has been on timeliness, not resolution, because expensive algorithms cause unstable analysis time; for typical alias uses, this may be OK…
- …but we have new applications that can benefit from higher accuracy
  - Data flow unraveling for logic synthesis and heterogeneous systems

How to be fast, safe and accurate?

- An efficient, accurate, and safe pointer analysis based on the following two key ideas
  - Efficient analysis of a large program necessitates that only relevant details are forwarded to a higher level component
  - The algorithm can locally cut its losses (like a bulkhead) to avoid a global explosion in problem size

One facet: context sensitivity

- Context sensitivity avoids unrealizable data flow by distinguishing proper calling context
  - What assignments do a and g receive?
  - CI: a and g each receive 1 and 3
  - CS: g receives only 1 and a receives only 3
- Typical reactions to CS costs
  - Forget it, live with lots of unrealizable dataflow
  - Combine it with a “cheapener” like the lossy compression of a Steensgaard analysis
- We want to do better, but we may sometimes need to mix CS and CI to keep analysis fast

[Figure: example code and desired results]

Context Insensitive (CI)

- Collecting all the assignments in the program and solving them simultaneously yields a context insensitive solution
- Unfortunately, this leads to three spurious solutions.

Context Sensitive (CS): Naïve process

[Figure annotations:]
- Retention of side effect still leads to spurious results
- Excess statements unnecessary and costly

CS: “Accurate and Efficient” approach

[Figure annotations:]
- Now, only correct result derived
- Compact summary of jade used
- Summary accounts for all side-effects
- DELETE assignment to prevent contamination

Analyzing large, complex programs [SAS2004]

Table columns: Benchmark | INACCURATE Context Insensitive (seconds) | PREV Context-Sensitive (seconds) | NEW Context-Sensitive (seconds)

Rows with separable values:
    gcc     | 52  | HOURS  | 124
    perlbmk | 155 | MONTHS | 198

Remaining rows (values fused or missing in the extracted text): espresso291, li, ijpeg2851, perl, gap, vortex51363, twolf121

- Originally, problem size exploded as more contexts were encountered
- New algorithm contains problem size with each additional context
- This results in an efficient analysis process without loss of accuracy

Example application and current challenges [PASTE2004]

- Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
- Example: improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools

From benchmarks to broad application code base

- The long term trend is for all code to go through a compiler and be managed by a runtime system
  - Microsoft code base to go through Phoenix, with OpenIMPACT participation
  - Open source code base to go through GCC/OpenIMPACT under Gelato
- The compiler and runtime will perform deep analysis to give tools visibility into software
  - Parallelizers, debuggers, verifiers, models, validation, instrumentation, configuration, memory managers, runtime, etc.

[Figure: system layers. Labels: Applications, Operating systems, Libraries, Compiler, Runtime and Tools, Hardware]

Global memory dataflow analysis

- Integrates analyses to deconstruct the memory “black box”
  - Interprocedural pointer analysis: lets programmers use language features and modularity without losing transformability
  - Array access pattern analysis: figures out communication among loops that communicate through arrays
  - Control and data flow analyses: enhance resolution by understanding program structure
  - Heap analysis extends analysis to a much wider software base
- SSA-based inductor detection and dependence test have been integrated into the IMPACT environment

Example on deriving memory data flow

    main(...) {
        int A[100];
        foo(A, 64);
        bar(A+1, 64);
    }

    foo (int *s, int L) {
        int *p=s, i;
        for (i=0; i<L; i++) {
            *p = ...;
            p++;
        }
    }

    bar (int *t, int M) {
        int *q=t, i;
        for (i=0; i<M; i++) {
            … = *q;
            q++;
        }
    }

Analysis steps annotated on the slide:
- Loop body: foo performs Write *p; bar performs Read *q
- Pointer relation analysis restates p/q in terms of s/t
- Procedure body, summary for the whole loop:
  - foo: write from *(s) to *(s+L) with stride 1
  - bar: read from *(t) to *(t+M) with stride 1
- Procedure call, parameter mapping:
  - foo writes A[0:63] stride 1
  - bar reads A[1:64] stride 1
- Data flow analysis determines that A[64] is not from foo

Conclusions and outlook

- Heterogeneous multiprocessor systems will be the model for both general purpose and embedded computing platforms in the future
  - Both are motivated by powerful trends
  - Shorter term adoption for embedded systems
  - Longer term for general purpose systems
- Programming models and parallelization of traditional programs are needed to channel software to these new platforms
  - Feasibility of deep pointer analysis demonstrated
  - Many need to participate in solving this grand challenge problem