1 Hardware/software co-design of global cloud system resolving models. Michael Wehner, John Shalf, Lenny Oliker, David Donofrio, Tony Drummond, Shoaib Kamil, Norman Miller, Marghoob Mohiyuddin, Woo-Sun Yang (Lawrence Berkeley National Laboratory); Dave Randall, Celal Konor, Ross Heikes (Colorado State University); Hiroaki Miura (University of Tokyo)

2 Hardware/Software Co-Design philosophy: “Design machines to answer specific scientific questions rather than limit our questions by available machines.”

3 Hardware/Software Co-Design and climate modeling. Global cloud system resolving models (GCSRMs): ~1 km horizontal scale with ~100 vertical levels. What are the computational demands to use a GCSRM for climate change studies? –How general are these requirements? What are the implications for the required hardware? –Cost, power requirements, time frame?

4 Global Cloud System Resolving Climate Modeling. Direct simulation of cloud systems replacing statistical parameterization; this approach was recently called for by the 1st WMO Modeling Summit and is championed by Prof. Dave Randall, Colorado State University. Individual cloud physics is fairly well understood, but parameterization of mesoscale cloud statistics performs poorly. Direct simulation of cloud systems in global models requires exascale!

5 Global Cloud System Resolving Models are a Transformational Change. [Figure: surface altitude (feet) rendered at three grid resolutions: 200 km, typical of IPCC AR4 models; 25 km, the upper limit of climate models with cloud parameterizations; and 1 km, cloud system resolving models.]

6 GCSRM: Global Cloud System Resolving Model. –Replace cumulus parameterizations with direct simulation. –Cloud microphysics is still parameterized. Our task is to build a model to quantify code requirements by measuring and extrapolating the parts. Four parts for the atmospheric model: –Dynamics –Fast physics (cloud processes and turbulence) –Slow physics (radiation transport) –Multigrid solver (elliptic equation solution). The code requirements model will predict the flops, memory, communication, and memory I/O necessary to achieve the throughput goal. –Target is to simulate 1000 times faster than real time.
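
A minimal sketch of the kind of extrapolation such a requirements model performs, assuming made-up placeholder costs: the flops-per-cell-per-step, bytes-per-cell, and time-step values below are purely illustrative, not measured CSU numbers (the measured totals are the 28 Pflops and 1.8 PB quoted on the following slides).

program gcsrm_requirements_sketch
  implicit none
  ! Grid size at the level-12 target resolution (from the CSU geodesic grid).
  real(8), parameter :: columns    = 167772162.0d0  ! horizontal vertices
  real(8), parameter :: levels     = 128.0d0        ! vertical levels
  ! Purely illustrative placeholders -- NOT measured values from the CSU code.
  real(8), parameter :: flops_cell = 2.0d3          ! flops per cell per time step (assumed)
  real(8), parameter :: bytes_cell = 1.0d3          ! bytes of state per cell (assumed)
  real(8), parameter :: dt_model   = 5.0d0          ! model time step in seconds (assumed)
  real(8), parameter :: speedup    = 1000.0d0       ! target: 1000x faster than real time
  real(8) :: cells, steps_per_wall_second, sustained_flops, total_memory
  cells = columns * levels
  ! Running 1000x faster than real time means completing speedup/dt_model
  ! model steps every wall-clock second.
  steps_per_wall_second = speedup / dt_model
  sustained_flops = cells * flops_cell * steps_per_wall_second
  total_memory    = cells * bytes_cell
  print '(a,es10.3)', 'required sustained flop/s: ', sustained_flops
  print '(a,es10.3)', 'required memory (bytes):   ', total_memory
end program gcsrm_requirements_sketch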

7 HW/SW co-design strategy Diagnose computational aspects of each portion of the model –Operations count –Total memory –Instruction mix Propose a strawman computer design and calculate machine specific requirements –# of processors –processor sustained flop rate, local memory and cache size –Communications topology bandwidth and latency. Iterate on this process…

8 How is co-design different than other approaches? It is not a GPU, and the code is not contained in the hardware. The architecture is targeted towards a specific class of scientific problems. –Fully programmable. –The class could be narrow or broad. –Design is optimized under a series of constraints chosen by application needs and power issues. –Applications outside the targeted class, such as LINPACK, will run poorly.

9 CSU atmospheric model. Target resolution is obtained after 12 bisections of the geodesic grid: 167,772,162 vertices, ~128 vertical levels, ~1.75 km grid spacing. (Grid figure: Ross Heikes, CSU)
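
As a check, the vertex count follows directly from the geodesic (bisected icosahedral) grid construction, and the ~1.75 km figure follows from spreading those vertices over the Earth's surface; the only number introduced below is the Earth's surface area.

program geodesic_grid_check
  implicit none
  integer, parameter :: bisections = 12
  real(8), parameter :: earth_surface_m2 = 5.101d14  ! ~5.1e8 km^2
  integer(8) :: vertices
  real(8) :: cell_area_m2, spacing_km
  ! A bisected icosahedral (geodesic) grid has 10*4**n + 2 vertices after n bisections.
  vertices = 10_8 * 4_8**bisections + 2_8
  cell_area_m2 = earth_surface_m2 / real(vertices, 8)
  spacing_km   = sqrt(cell_area_m2) / 1000.0d0
  print '(a,i12)',  'vertices after 12 bisections: ', vertices    ! 167,772,162
  print '(a,f6.2)', 'nominal grid spacing (km):    ', spacing_km  ! ~1.74
end program geodesic_grid_check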

10 Computational rate: 28 Pflops sustained to integrate the CSU GCSRM at 1000 times faster than actual time.

11 Total memory: 1.8 PB at the target resolution.

12 Traditional Sources of Performance Improvement are Flat-Lining. New constraints: 15 years of exponential clock rate growth has ended. Moore’s Law reinterpreted: how to leverage more transistors to increase performance at historical rates? Power is the new design constraint; it grows nonlinearly with CPU speed and size. Multicore: the number of cores doubles every 18-24 months instead. With supercomputing demand accelerating and straightforward serial improvements at an end, much higher parallelism will be required to exploit this technology! (Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith)

13 Path to Power Efficiency Reducing Waste in Computing The portable consumer electronics market: Optimized for low power, low cost, and high computational efficiency “Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste.”  Mark Horowitz, Stanford University & Rambus Inc. Sources of Waste: Wasted transistors (surface area) Wasted computation (useless work/speculation/stalls) Wasted bandwidth (data movement)

14 Consumer Electronics has Replaced PCs as the Dominant Market Force in CPU Design. Apple introduces the iPod; iPod+iTunes comes to exceed 50% of Apple’s net profit; Apple introduces a cell phone (iPhone); netbooks based on the Intel Atom embedded processor are the fastest-growing portion of the “laptop” market. (From Tsugio Makimoto, ISC2006)

15 Design for Low Power: More Concurrency (figures circa 2008). IBM Power5 (server): 120W @ 1900MHz. Cubic power improvement with lower clock rate, since dynamic power scales as V²F. Intel Core2 sc (laptop): 15W @ 1000MHz. Slower clock rates enable use of simpler cores: shorter pipelines, less area, lower leakage. IBM PPC450 (BG/P, low power): 3W @ 800MHz. Tailor the design to the application to reduce waste. Tensilica Xtensa DP (Moto Razr): 0.09W @ 600MHz.

16 Low Power Design Principles. Even if each core operates at 1/3 of the frequency of the fastest available processors, you can pack 100s of simple cores onto a chip and consume 1/10 the power. One IBM Power5: 120W at 1900MHz. 128 Tensilica Xtensa DP cores: 11.5W, aggregate clock equivalent to 76,800MHz. If the application has enough parallelism, the many-core chip can be much faster and consume less power.
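
A quick check of the arithmetic behind slides 15 and 16, using only the per-core figures quoted there; the cubic-scaling line additionally assumes the textbook approximation that supply voltage tracks frequency, so dynamic power P ~ C*V**2*f behaves roughly like f**3.

program low_power_arithmetic
  implicit none
  real(8), parameter :: power5_w  = 120.0d0, power5_mhz = 1900.0d0  ! IBM Power5 (server)
  real(8), parameter :: xtensa_w  = 0.09d0,  xtensa_mhz = 600.0d0   ! Tensilica Xtensa DP
  integer, parameter :: ncores    = 128
  real(8) :: chip_w, aggregate_mhz
  chip_w        = ncores * xtensa_w      ! 128 * 0.09 W  = 11.5 W
  aggregate_mhz = ncores * xtensa_mhz    ! 128 * 600 MHz = 76,800 MHz
  print '(a,f5.2)',   'per-core clock relative to Power5:   ', xtensa_mhz / power5_mhz
  print '(a,f6.1,a)', 'power of 128 Xtensa DP cores:        ', chip_w, ' W'
  print '(a,f9.0,a)', 'aggregate clock of 128 cores:        ', aggregate_mhz, ' MHz'
  print '(a,f5.1,a)', 'power ratio, one Power5 / 128 cores: ', power5_w / chip_w, 'x'
  ! Rule of thumb: running a core at 1/3 the clock costs roughly (1/3)**3 of the power,
  ! assuming voltage scales with frequency (P ~ C*V**2*f).
  print '(a,f6.3)',   'relative power at 1/3 clock (P ~ f**3): ', (1.0d0/3.0d0)**3
end program low_power_arithmetic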

17 Hardware/software co-design procedures. Embedded processors: –The chip design process is the commodity, not the chip itself. –Tailor the instruction set to GCSRMs. Auto-tuning, a machine-driven approach to performance optimization. Rapid emulation of the entire code on candidate chip designs: –allows an iterative approach. –Co-tuning of loop structure and hardware specifications. –Attain better computational efficiency than a general-purpose processor, at lower power.

18 CS101: How to design a power-efficient computer. Spec out the requirements of your code. –Aim for a class of codes, not just one. Learn how to design processors and interconnects. –We obtained chip design tools from Tensilica, a leading designer of chips for cell phones and other consumer electronics. Each chip design comes with its own C compiler and debugger. Emulate your chip design on your code. –RAMP emulates chips with FPGAs. Hardware emulation is more accurate and thousands of times faster than software emulation. Iterate your chip design and your software. –Auto-tuning takes advantage of the design-specific C compiler. –Profile the code to determine chip parameters.

19 Processor Generator (Tensilica). Embedded design automation (example from the existing Tensilica design flow). Processor configuration: 1. select from a menu; 2. automatic instruction discovery; 3. explicit instruction description. [Diagram: base CPU plus configurable blocks: application-specific datapaths, OCD, timer, FPU, extended registers, cache.] Outputs: an application-optimized processor implementation (RTL/Verilog) that can be built with any process in any fab; tailored SW tools (compiler, debugger, simulators, Linux, other OS ports); derived HW characteristics (size, power, etc.). Leverages mainstream tools, design processes, and commodity IP; allows the potential for HW/SW co-design, or emulating performance prior to fab.

20 Advanced Hardware Simulation (RAMP). Research Accelerator for Multi-Processors (RAMP): utilize FPGA boards to emulate multicore systems. 1000x speedup versus software emulation allows fast performance validation. Emulates the entire application (not just a kernel), breaking the slow feedback loop for system designs and enabling tightly coupled hardware/software/science co-design. Technology partners: –UC Berkeley: John Wawrzynek, Jim Demmel, Krste Asanovic, Kurt Keutzer –Stanford / Rambus Inc.: Mark Horowitz –Tensilica Inc.: Chris Rowen. [Chart: simulation approaches (spreadsheet, C model, ISS, RAMP) arranged by execution speed and accuracy.]

21 Demonstration of Green Flash Approach. SC ’09 demo of approach feasibility: CSU limited-area atmospheric model ported to the Tensilica architecture; dual-core Tensilica processor running the atmospheric model; multi-core configurations now running; MPI routines ported to the custom interconnect. Emulation performance advantage: processor running at 25 MHz vs. functional model at 100 kHz (250x speedup). Actual code running, not a representative benchmark.

22 Auto-Tuning for Green Flash. Challenge: how to optimize the climate code for the differing chip designs under consideration. Solution: auto-tuning. Different chip designs may require vastly different optimizations, and tuning by hand is labor-intensive, so automate the search across a complex optimization space. “Never send a human to do a machine's job.” (Agent Smith, The Matrix)
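
A minimal sketch of the auto-tuning idea, not the actual Green Flash tuner: generate loop-blocking variants of a kernel, time each one on the target (or emulated) processor, and keep the fastest. The stencil kernel, array size, and candidate block sizes below are invented for illustration.

program autotune_sketch
  implicit none
  integer, parameter :: n = 512
  integer, dimension(4), parameter :: candidates = (/ 8, 16, 32, 64 /)
  integer :: ci, cj, best_bi, best_bj
  integer(8) :: t0, t1, rate
  real(8) :: t, best_time
  real(8), allocatable :: dst(:,:), src(:,:)
  allocate(dst(n,n), src(n,n))
  call random_number(src)
  best_time = huge(1.0d0)
  ! Exhaustive search over the (small) space of blocking factors.
  do ci = 1, size(candidates)
    do cj = 1, size(candidates)
      call system_clock(t0, rate)
      call blocked_stencil(dst, src, n, candidates(ci), candidates(cj))
      call system_clock(t1)
      t = real(t1 - t0, 8) / real(rate, 8)
      if (t < best_time) then
        best_time = t
        best_bi = candidates(ci)
        best_bj = candidates(cj)
      end if
    end do
  end do
  print '(a,2i4,a,es10.3,a)', 'best blocking (i,j): ', best_bi, best_bj, '  time: ', best_time, ' s'
contains
  ! A simple cache-blocked 2D averaging stencil standing in for one of the climate operators.
  subroutine blocked_stencil(dst, src, n, bi, bj)
    integer, intent(in)  :: n, bi, bj
    real(8), intent(out) :: dst(n,n)
    real(8), intent(in)  :: src(n,n)
    integer :: i0, j0, i, j
    do j0 = 1, n, bj
      do i0 = 1, n, bi
        do j = j0, min(j0 + bj - 1, n)
          do i = i0, min(i0 + bi - 1, n)
            dst(i,j) = 0.25d0 * (src(i,j) + src(min(i+1,n),j) &
                               + src(max(i-1,1),j) + src(i,min(j+1,n)))
          end do
        end do
      end do
    end do
  end subroutine blocked_stencil
end program autotune_sketch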

23 Autotuning Example. The buoyancy operator extracted from the climate code, before auto-tuning:

do k=0,km,1
  do iprime=1,nside,1
    do i=2,im2nghost-1,1
      ia = i + ii(iprime)
      do j=2,jm2nghost-1,1
        ja = j + jj(iprime)
        buoyancy_gen(i,j,iprime,k) = -1.0*g*(theta(ia,ja,k) - theta(i,j,k)) &
                                     /(theta00(k)*el(iprime))
      enddo
    enddo
  enddo
enddo

and after auto-tuning (loops blocked in i, j, and iprime; the k loop processed in blocks of 4, split into two passes of 2):

do G14906=0,km,4
  do G14907=1,nside,6
    do G14908=2,im2nghost - 1,25
      do G14909=2,jm2nghost - 1,25
        do k=G14906,G14906 + 1,1
          do iprime=G14907,G14907 + 5,1
            do i=G14908,G14908 + 24,1
              ia = i + ii(iprime)
              do j=G14909,G14909 + 24,1
                ja = j + jj(iprime)
                buoyancy_gen(i,j,iprime,k) = -1.0*g*(theta(ia,ja,k) - theta(i,j,k)) &
                                             /(theta00(k)*el(iprime))
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
  do G14907=1,nside,6
    do G14908=2,im2nghost - 1,25
      do G14909=2,jm2nghost - 1,25
        do k=G14906 + 2,G14906 + 3,1
          do iprime=G14907,G14907 + 5,1
            do i=G14908,G14908 + 24,1
              ia = i + ii(iprime)
              do j=G14909,G14909 + 24,1
                ja = j + jj(iprime)
                buoyancy_gen(i,j,iprime,k) = -1.0*g*(theta(ia,ja,k) - theta(i,j,k)) &
                                             /(theta00(k)*el(iprime))
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
enddo

24 Auto-tuning Results Shoaib Kamil UC Berkeley

25 Co-tuning: feedback on chip design. We are fully profiling each loop in the code. Auto-tuning can reduce the number of instructions, but it also changes the mix of instructions. –Iterate this information back into the chip design. Example: for the loop with the largest footprint, auto-tuning decreased the cache footprint from 160 KB to 1 KB and halved the instruction count. [Profiling table: per-loop unique addresses, unique cache lines, footprint, cache footprint, total instructions, bytes/instruction, bandwidth (MB/s), and instruction mix: FP load/store/move/arithmetic, integer load/store/move/arithmetic, jump/call, branch, loop entry, bitwise, shift.]
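
A hedged illustration of what feeding this profile back into the chip design could look like: take per-loop working-set footprints before and after auto-tuning and size the per-core cache to cover the largest remaining working set. The loop names and most footprint values are invented; only the 160 KB to 1 KB reduction echoes the example on the slide.

program cotuning_sketch
  implicit none
  integer, parameter :: nloops = 4
  character(len=12), dimension(nloops), parameter :: loop_name = &
       (/ 'buoyancy    ', 'advection   ', 'radiation   ', 'multigrid   ' /)
  ! Hypothetical per-loop cache footprints in KB; the first entry mirrors the
  ! 160 KB -> 1 KB example quoted on the slide, the rest are invented.
  real(8), dimension(nloops), parameter :: footprint_before = (/ 160.0d0, 96.0d0, 48.0d0, 24.0d0 /)
  real(8), dimension(nloops), parameter :: footprint_after  = (/   1.0d0, 12.0d0, 20.0d0,  8.0d0 /)
  real(8), dimension(5), parameter :: cache_kb = (/ 16.0d0, 32.0d0, 64.0d0, 128.0d0, 256.0d0 /)
  real(8) :: need
  integer :: i, idx
  ! The per-core cache only has to cover the largest post-tuning working set.
  need = maxval(footprint_after)
  idx  = maxloc(footprint_after, dim=1)
  do i = 1, size(cache_kb)
    if (cache_kb(i) >= need) exit
  end do
  print '(a,f6.1,a)',  'largest pre-tuning working set:  ', maxval(footprint_before), ' KB'
  print '(3a,f6.1,a)', 'largest tuned working set (', trim(loop_name(idx)), '): ', need, ' KB'
  print '(a,f6.1,a)',  'smallest sufficient cache size:  ', cache_kb(i), ' KB'
end program cotuning_sketch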

26 Hardware/Software Co-tuning The information from auto-tuning is relevant to the optimal chip design Iterate on the design process and autotuning Marghoob Mohiyuddin, UC Berkeley

27 Holistic Hardware/Software Co-Design. [Diagram: a 1-2 day design cycle: build the application, auto-tune the software (hours to days), synthesize the SoC (hours), and emulate the hardware on RAMP (hours).] How long does it take for a full-scale application to influence architectures?

28 Strawman computer. Level 12 grid (~2 km): –167,772,162 vertices, ~128 vertical levels. A strawman design concept: –2,621,440 horizontal subdomains (logically rectangular, 8x8 cells each) –8 vertical subdomains of 16 levels each. One core per subdomain gives 20,971,520 processing cores.

29 Many-core chips enable nested parallelism. [Figure: a 4x4 “superdomain” of subdomains mapped onto one chip; red indicates off-chip data transfer, blue indicates on-chip data transfer, which is ~100 times faster than off-chip.] Hardware requirements: –650 MHz and 1.3 Gflops per core, 256 KB cache per core, 200,000 msg/sec message rate. –At 45 nm: 128 processors per chip, 163,840 chips, 4x4x8 subdomains/chip, 9.2 GB/sec bandwidth. –At 22 nm: 512 processors per chip, 40,960 chips, 8x8x8 subdomains/chip, 37 GB/sec bandwidth.
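
The core and chip counts on slides 28 and 29 are internally consistent; a quick check using only numbers quoted in this deck (the vertex count, 8x8 subdomains, 8 vertical blocks, and the 28 Pflops / 1.8 PB totals):

program strawman_counts
  implicit none
  integer(8), parameter :: vertices            = 167772162_8  ! level-12 geodesic grid
  integer(8), parameter :: cells_per_subdomain = 64_8         ! logically rectangular 8x8 patches
  integer(8), parameter :: vertical_subdomains = 8_8          ! 8 blocks of 16 levels (128 total)
  real(8),    parameter :: sustained_flops     = 28.0d15      ! slide 10
  real(8),    parameter :: total_memory        = 1.8d15       ! slide 11
  integer(8) :: horiz_subdomains, cores
  horiz_subdomains = vertices / cells_per_subdomain          ! ~2,621,440
  cores            = horiz_subdomains * vertical_subdomains  ! 20,971,520
  print '(a,i12)',    'horizontal subdomains:   ', horiz_subdomains
  print '(a,i12)',    'processing cores:        ', cores
  print '(a,f6.2,a)', 'sustained rate per core: ', sustained_flops / real(cores,8) / 1.0d9, ' Gflops'
  print '(a,f6.1,a)', 'memory per core:         ', total_memory / real(cores,8) / 1.0d6, ' MB'
  print '(a,i8)',     'chips at 128 cores/chip: ', cores / 128_8   ! 163,840 (45 nm)
  print '(a,i8)',     'chips at 512 cores/chip: ', cores / 512_8   ! 40,960 (22 nm)
end program strawman_counts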

30 Computational rate: 28 Pflops sustained to integrate the CSU GCSRM at 1000 times faster than actual time.

31 Green Flash 100PF (peak) Strawman System Design. We examined three different approaches (in 2009 technology):
AMD Opteron: Commodity approach; lower efficiency for scientific codes offset by advantages of the mass market. Constrained by legacy/binary compatibility.
IBM BlueGene: Generic embedded processor core and customized system-on-chip (SoC) to improve power efficiency for scientific applications.
Tensilica Xtensa: Customized embedded CPU with SoC provides further power efficiency benefits but maintains programmability. Mainstream design process, tool chain, commodity IP.

Processor                        Clock    Peak/Core (Gflops)  Cores/Socket  Sockets  Cores  Power
AMD Opteron                      2.2 GHz  8.8                 6             1.8M     11M    142 MW
IBM BG/P                         0.8 GHz  3.4                 4             7M       29M    198 MW
Green Flash / Tensilica Xtensa   1 GHz    4                   64            0.4M     25M    5 MW
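
Each row of the table targets roughly the same 100 Pflops peak; a quick cross-check from the quoted per-core peaks, core counts, and total power:

program peak_and_power_check
  implicit none
  ! Peak per core (Gflops), total cores, and total power (MW), as quoted in the table above.
  print '(a,f6.1,a,f5.1,a)', 'AMD Opteron:        ', 8.8d0 * 11.0d6 / 1.0d6, ' Pflops peak, ', &
                              142.0d6 / 11.0d6, ' W/core'
  print '(a,f6.1,a,f5.1,a)', 'IBM BG/P:           ', 3.4d0 * 29.0d6 / 1.0d6, ' Pflops peak, ', &
                              198.0d6 / 29.0d6, ' W/core'
  print '(a,f6.1,a,f5.1,a)', 'Green Flash Xtensa: ', 4.0d0 * 25.0d6 / 1.0d6, ' Pflops peak, ', &
                              5.0d6 / 25.0d6, ' W/core'
end program peak_and_power_check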

32 Scalability. 20,971,520 processor cores sustaining 1.3 Gflops apiece: is this speed realistic? –Perhaps, but not faster than that. Our strawman design defined subdomains to contain 8x8x16 cells. –Smaller subdomains could lead to communication bottlenecks. Moving to the level 13 grid (~1 km) would not meet the 1000X target integration rate: halving the grid spacing quadruples the number of columns and halves the time step. –It would take 83,886,080 processors at 2.6 Gflops apiece, probably an unrealistic single-core sustained rate.

33 Toward exaflops. Courant condition scaling provides a hard limit to the achievable resolution using only domain decomposition strategies!
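
The scaling behind that hard limit, sketched under the usual Courant-condition argument: each horizontal bisection quadruples the number of columns and halves the explicit time step, so the sustained rate needed for a fixed 1000x time compression grows by 8x per grid level. The only input is the 28 Pflops level-12 baseline quoted earlier.

program courant_scaling_sketch
  implicit none
  real(8) :: rate, dx_km
  integer :: level
  rate  = 28.0d15   ! sustained flop/s at level 12 (~2 km), from slide 10
  dx_km = 2.0d0     ! nominal level-12 grid spacing
  do level = 12, 15
    print '(a,i3,a,f6.3,a,es9.2,a)', 'level', level, '  dx ~', dx_km, ' km   sustained ~', rate, ' flop/s'
    ! Each bisection: 4x more columns (2D refinement) and 2x more time steps (Courant) => 8x the work.
    rate  = rate * 8.0d0
    dx_km = dx_km * 0.5d0
  end do
end program courant_scaling_sketch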

34 Another strategy. We have already exploited fast on-chip communication: –Communication between vertical subdomains is a whole-volume exchange rather than a perimeter exchange. –MG prolongation and restriction steps. To surpass our 200 Pflop barrier, we can exploit the fast on-chip communication in a process decomposition strategy.

35 Process Decomposition. Already in use as a strategy to increase processor count: –CCSM4 uses separate processors for the atmosphere, ocean, sea ice, land, and glacier component models. –Only surface data is exchanged between components. Exaflops could be achieved only by using process parallelism within component models. –Volumetric data must be exchanged between processes. –Many-core chips would allow this because of the fast on-chip communications.

36 GCSRM Process Parallelism Potential candidates for process parallelism Physics vs. dynamics –Factor of 2 if time steps are equal –Not that interesting if time steps are unequal. Advection vs. rest of dynamics. –Could be useful in a code like CESM1.0 with 21 advected scalars.

37 One last thought Up to now, we have considered chip designs where all the processors are identical. This is not required by any hardware constraints. Imagine a highly specialized architecture where different processors are co-designed to specific parts of the GCSRM. –Some to advection, some to radiation, some to momentum equation, some to continuity equation, etc. –Mix single and double precision processors? –Fractions determined by load balancing.

38 Heterogeneous many-core chips? Such a heterogeneous co-design is clearly far more restrictive than what we have proposed before. Is this inevitable in a post-exaflop world?

39 Future work. Demonstrate a co-design for more than one GCSRM. –Can a set of compromises be made so that a co-design reaches the target performance goals for multiple models? –Cubed sphere (NASA, GFDL, or HOMME). Proof-of-concept machine? –Target the 10-20 km resolution range. Find a sponsor who is not risk averse…

40 Summary Ultra-high resolution climate modeling will require exascale computing –And does not have to be far into the future! Efficient exascale computing requires new approaches to the relationship between hardware and software. Hardware/software co-design can reduce cost and improve efficiency by eliminating waste. The consumer electronics industry provides a design technology to permit affordable but specialized scientific computers. –How different are computational requirements across classes of climate models?

41 A concluding thought At the exascale, numerical experimentalists must take a lesson from actual experimentalists. Design machines to answer specific scientific questions rather than limit our questions by available machines.

42 Thank You! mfwehner@lbl.gov

43 Fault Tolerance/Resilience Our Design does not expose unique risks –Faults proportional to # sockets (not # cores) and silicon surface area –We expose less surface area and fewer sockets with our approach Hard Errors –Spare cores in design (Cisco Metro) –SoC design (fewer components and fewer sockets) –Use solder (not sockets) Soft Errors –ECC for memory and caches –On-board NVRAM controller for localized checkpoint –Checkpoint to neighbor for rollback

44 Example Design Study. Green Wave: Seismic Imaging. Developed an RTL design for an SoC in 45 nm technology using off-the-shelf embedded technology, and simulated it with the RAMP FPGA platform. Power breakdown: 70 W total for SoC + memory. Area breakdown: 240 mm² for the SoC. –Tensilica LX2 processor core, off-the-shelf –4-slot SP SIMD, ECC –4-slot VLIW (FLIX) –cache hierarchy with local store + conventional cache –TIE queues for interprocessor messaging –NoC fabric for SoC services –128 cores –4 DDR3-1600 memory controllers –4x 10Gig Ethernet.

45 Example Design Study. Green Wave: Seismic Imaging (continued). Compare to Nehalem and Fermi: –All 40-45 nm technology. –Nehalem and Green Wave are 240 mm² die area with conventional DDR3 memory (same memory performance). –Fermi C2050 is a 540 mm² die with GDDR5 memory (3x higher memory bandwidth). 8th-order RTM kernel performance: –Nehalem auto-tuned to within 15% of its theoretical performance limit by Sam Williams. –Fermi hand-tuned by Paulius Micikevicius of Nvidia. The off-the-shelf embedded ASIC is unable to beat Fermi (but comes within ~30%); however, it offers a huge gain in energy efficiency. –Fermi is burdened by its host (the green circle shows Fermi without the host). –Green Wave also has the advantage of not including anything you don’t need for RTM; Fermi and Nehalem have to include a lot of extra silicon for other markets such as graphics, and Nehalem must also maintain legacy binary compatibility.

