Fast Thermal Analysis on GPU for 3D-ICs with Integrated Microchannel Cooling Zhuo Feng and Peng Li Department of Electrical and Computer Engineering, {Michigan.

Presentation transcript:

Fast Thermal Analysis on GPU for 3D-ICs with Integrated Microchannel Cooling. Zhuo Feng and Peng Li, Department of Electrical and Computer Engineering, {Michigan Technological, Texas A&M} University. ICCAD 2010.

Outline: Introduction; Background; GPU-based full-chip thermal analysis with microchannels; Preconditioned iterative method on GPU; Experimental results and conclusions.

Introduction: Effective thermal management for 3D-ICs is becoming increasingly challenging because of increasing power density and chip design complexity. Traditional heat sinks are expected to quickly reach their limits in meeting the cooling needs of 3D-ICs.

Introduction (cont.): Integrated on-chip microchannel cooling, i.e. liquid cooling, is considered a very promising solution. In an experiment on a liquid-cooled 2D-IC, the peak on-chip temperature dropped from 85℃ to 57℃ and the maximum temperature variation dropped from 25℃ to 6℃.

Introduction (cont.): Existing design and optimization procedures for integrated microchannels are performed without considering full-chip thermal profiles, so they may not provide the most “economic” solution. Drawbacks: design complexity, packaging cost, etc. Hence, a comprehensive design and optimization flow should be closely coupled with full-chip thermal analysis.

Why GPUs? The finite difference (FD) method is well suited to general 3D full-chip thermal simulation. However, accurate full-chip 3D thermal analysis with the FD method can be very expensive, because it requires solving a huge linear system of equations with millions of unknowns.

Why GPUs? (cont.): GPU-based parallel computing has been employed in various electronic design automation areas. Advantages: high computing power for large-scale homogeneous computations, e.g. matrix multiplications, and significantly higher memory bandwidth.

Contributions: Proposes novel GPU-based full-chip thermal simulation methods for 3D-ICs with integrated microchannel cooling, using GPU-friendly data structures and algorithm flows. Proposes a GPU-friendly two-step block relaxation scheme that integrates block-based vertical-line relaxations and liquid-flow-direction relaxations. Achieves good speedup: more than 35x faster than the CPU-based solver and more than 360x faster than the direct solver.

Background – liquid cooling in 3D ICs: The liquid-cooled microchannels are typically integrated inside a wafer-level package, where they are connected to the liquid inlets and outlets using fluidic through-silicon vias (TSVs). Heat can be removed more effectively than before because the thermal resistance of such integrated liquid-cooled heat sinks can be much lower than that of traditional fan-cooled heat sinks.

Background – finite difference (FD) method: The FD method approximates the solutions to differential equations by replacing derivative expressions with approximately equivalent difference quotients, e.g. $f'(a) \approx \frac{f(a+h) - f(a)}{h}$ for some small h.

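Background – full-chip thermal simulation: Discretize the PDE of the original thermal analysis problem with the FD method, then solve GT = b, where G is the matrix built from the thermal resistances of the discretized grid (i.e. a thermal conductance matrix), T is the vector of unknown node temperatures, and b is the right-hand side carrying the information about the environment.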
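As a rough illustration of how a system of the form GT = b comes out of an FD discretization, here is a minimal sketch for a 1D chain of thermal nodes. It is not the model used in the talk: the function names, the conductance values, and the choice of tying only the two end nodes to ambient are all assumptions made for the example.

```python
def assemble_chain(n, g, g_amb, t_amb, power):
    """Assemble G and b for a hypothetical 1D chain of n thermal nodes.

    Neighbouring nodes are joined by conductance g; the two end nodes are
    also tied to the ambient temperature t_amb through conductance g_amb.
    """
    G = [[0.0] * n for _ in range(n)]
    b = [float(p) for p in power]        # injected power per node
    for i in range(n - 1):               # stamp each node-to-node conductance
        G[i][i] += g
        G[i + 1][i + 1] += g
        G[i][i + 1] -= g
        G[i + 1][i] -= g
    for end in (0, n - 1):               # boundary: conductance to ambient
        G[end][end] += g_amb
        b[end] += g_amb * t_amb
    return G, b


def solve_dense(G, b):
    """Gaussian elimination without pivoting (adequate for this small SPD system)."""
    n = len(b)
    A = [row[:] + [rhs] for row, rhs in zip(G, b)]
    for k in range(n):
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= f * A[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x


# 4 W injected at the middle node, ambient at 50 degrees.
G, b = assemble_chain(n=5, g=2.0, g_amb=0.5, t_amb=50.0, power=[0, 0, 4.0, 0, 0])
T = solve_dense(G, b)
```

A full-chip 3D grid has the same structure (a sparse, symmetric matrix of conductance stamps), only with millions of nodes, which is why a dense direct solve like the one above stops being practical.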

Background – GPU programming

Architecture of the NVIDIA GTX 280: A collection of 30 multiprocessors with 8 streaming processors each. The 30 multiprocessors share one off-chip global memory (access time: about 300 clock cycles). Each multiprocessor has an on-chip memory shared by its 8 streaming processors (access time: 2 clock cycles).

Some differences between GPU and CPU (GPU: NVIDIA GeForce 8800 GTX; CPU: Intel Pentium 4)
FLOPS: GPU 345.6G; CPU ~12G
Memory bandwidth: GPU 86.4GB/s (900MHz memory clock, 384 bit interface, 2 issues); CPU 6.4GB/s (800MHz memory clock, 32 bit interface, 2 issues)
Access time of global memory: GPU slow (about 500 memory clock cycles); CPU fast (about 5 memory clock cycles)
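The two bandwidth figures follow directly from the clock, interface width, and issue count quoted on the slide. The short check below reproduces that arithmetic; the function name is made up for the example.

```python
def peak_bandwidth_gb_s(clock_mhz, bus_bits, issues_per_clock):
    """Peak bandwidth = clock rate x bytes per transfer x transfers per clock."""
    bytes_per_transfer = bus_bits / 8
    return clock_mhz * 1e6 * bytes_per_transfer * issues_per_clock / 1e9


gpu_bw = peak_bandwidth_gb_s(900, 384, 2)   # GeForce 8800 GTX figures from the slide
cpu_bw = peak_bandwidth_gb_s(800, 32, 2)    # Pentium 4 figures from the slide
print(f"GPU {gpu_bw:.1f} GB/s, CPU {cpu_bw:.1f} GB/s, ratio {gpu_bw / cpu_bw:.1f}x")
```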

GPU-based full-chip thermal analysis with microchannels: Many factors must be considered to obtain the most “economic” microchannel design: pumping power, placement, sizing, etc. Fine-grained thermal modeling and analysis that includes microchannel cooling is non-trivial because of the high modeling complexity and simulation cost (model extraction cost and thermal simulation cost). This kind of workload is a good match for GPUs.

The proposed two-step block relaxation scheme: Considers two directions of heat dissipation (Z and Y).

Details: In the first step, the nodes in a block of vertical lines are selected for relaxation (lines L1 to L3 shown in Fig. 4). These relaxations allow fast solution updates along the vertical heat dissipation paths within the block. In the second step, a few relaxations in the microchannel routing direction (liquid-flow direction) are performed to allow solution updates along the liquid flow.
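The two steps can be sketched on a small 2D grid as alternating line relaxation: first solve every vertical (Z) line exactly while its neighbours stay fixed, then do the same along the flow (Y) direction. This is a simplified stand-in for the scheme in the talk, not a reproduction of it. The grid model (uniform conductances gy and gz, every node tied to ambient by g0, T measured as temperature rise above ambient), the serial CPU execution, and all function names are assumptions.

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system (lower[0] and upper[-1] are ignored)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m
    x = d[:]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def diag_entry(y, z, ny, nz, gy, gz, g0):
    """Diagonal of G at node (y, z): ambient tie plus links to existing neighbours."""
    return g0 + gy * ((y > 0) + (y < ny - 1)) + gz * ((z > 0) + (z < nz - 1))


def coupling(T, y, z, gy, gz, use_y, use_z):
    """Sum of g * T over the neighbours of (y, z) in the requested directions."""
    ny, nz = len(T), len(T[0])
    s = 0.0
    if use_y and y > 0:
        s += gy * T[y - 1][z]
    if use_y and y < ny - 1:
        s += gy * T[y + 1][z]
    if use_z and z > 0:
        s += gz * T[y][z - 1]
    if use_z and z < nz - 1:
        s += gz * T[y][z + 1]
    return s


def relax_vertical_lines(T, P, gy, gz, g0):
    """Step 1: solve each vertical (Z) line exactly, Y neighbours held fixed."""
    ny, nz = len(T), len(T[0])
    for y in range(ny):
        rhs = [P[y][z] + coupling(T, y, z, gy, gz, True, False) for z in range(nz)]
        diag = [diag_entry(y, z, ny, nz, gy, gz, g0) for z in range(nz)]
        T[y] = thomas([-gz] * nz, diag, [-gz] * nz, rhs)


def relax_flow_lines(T, P, gy, gz, g0):
    """Step 2: solve each flow-direction (Y) line exactly, Z neighbours held fixed."""
    ny, nz = len(T), len(T[0])
    for z in range(nz):
        rhs = [P[y][z] + coupling(T, y, z, gy, gz, False, True) for y in range(ny)]
        diag = [diag_entry(y, z, ny, nz, gy, gz, g0) for y in range(ny)]
        line = thomas([-gy] * ny, diag, [-gy] * ny, rhs)
        for y in range(ny):
            T[y][z] = line[y]


def residual(T, P, gy, gz, g0):
    """Largest absolute entry of P - G T."""
    ny, nz = len(T), len(T[0])
    return max(abs(P[y][z] - diag_entry(y, z, ny, nz, gy, gz, g0) * T[y][z]
                   + coupling(T, y, z, gy, gz, True, True))
               for y in range(ny) for z in range(nz))


def two_step_solve(P, gy, gz, g0, tol=1e-8, max_sweeps=1000):
    T = [[0.0] * len(P[0]) for _ in P]
    for sweep in range(1, max_sweeps + 1):
        relax_vertical_lines(T, P, gy, gz, g0)
        relax_flow_lines(T, P, gy, gz, g0)
        if residual(T, P, gy, gz, g0) < tol:
            return T, sweep
    return T, max_sweeps


P = [[0.0] * 4 for _ in range(6)]   # 6 nodes along Y, 4 along Z
P[2][3] = 5.0                       # one heat source
T, sweeps = two_step_solve(P, gy=1.0, gz=4.0, g0=0.5)
```

On a GPU the independent lines inside a block would be handled in parallel; the serial loops here only show the order of the updates.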

But why? The efficiency of a typical iterative method usually depends on the efficiency of the sparse matrix-vector operations and on the effectiveness of the relaxation (iteration) scheme. Existing iterative algorithms focus only on vertical heat dissipation. Horizontal (in-plane) dissipation is negligible in traditional 2D ICs because of the relatively small thermal conductance, but not in 3D ICs.

Preconditioned iterative method on GPU: Two issues are critical for run time: the matrix representation format and the convergence rate of the iterative method. Use an ELL-like format and a preconditioning technique.

Matrix representation format: GPU-based computations should guarantee that most global memory accesses are coalesced, so the data structure and its memory access pattern must be designed carefully. Use three 1D vectors to fully represent the sparse matrix in a way that fits memory coalescing: the diagonal elements, the off-diagonal elements, and the indices of the off-diagonal elements. About 2x to 3x faster compared with the CSR format.
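One plausible reading of the three-vector layout is sketched below: a diagonal vector, plus padded off-diagonal values and column indices stored slot by slot, so that entry `slot` of row `i` sits at position `slot * n + i`. With one GPU thread per row, neighbouring threads would then read neighbouring addresses. The exact layout used in the talk is not shown in the transcript, so the names and the padding convention here are assumptions.

```python
def to_ell_like(dense, max_off):
    """Pack a square matrix into diag / off_val / off_idx vectors (ELL-like)."""
    n = len(dense)
    diag = [dense[i][i] for i in range(n)]
    off_val = [0.0] * (n * max_off)                      # padding value: 0.0
    off_idx = [k % n for k in range(n * max_off)]        # padding index: own row
    for i in range(n):
        slot = 0
        for j in range(n):
            if j != i and dense[i][j] != 0.0:
                if slot == max_off:
                    raise ValueError("row %d has more than max_off off-diagonals" % i)
                off_val[slot * n + i] = dense[i][j]
                off_idx[slot * n + i] = j
                slot += 1
    return diag, off_val, off_idx


def spmv(diag, off_val, off_idx, x, max_off):
    """y = A x. The loop over i is what a GPU would run as one thread per row."""
    n = len(diag)
    y = [0.0] * n
    for i in range(n):
        acc = diag[i] * x[i]
        for slot in range(max_off):
            k = slot * n + i          # consecutive i -> consecutive addresses
            acc += off_val[k] * x[off_idx[k]]
        y[i] = acc
    return y


A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
diag, off_val, off_idx = to_ell_like(A, max_off=2)
y = spmv(diag, off_val, off_idx, [1.0, 2.0, 3.0, 4.0], max_off=2)
```

The padding costs some memory but gives every row the same length, which avoids the per-row pointer lookups of CSR and keeps the inner loop uniform across threads.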

Example

Conjugate gradient (CG) method: The CG method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. CG is an iterative method, so it can be applied to sparse systems that are too large to be handled by direct methods such as the Cholesky decomposition. Such systems often arise when numerically solving partial differential equations. Solving Ax = b is equivalent to minimizing $f(x) = \frac{1}{2} x^T A x - b^T x$. Assuming exact arithmetic, CG converges in at most n steps, where n is the size of the matrix of the system (n = 2 in the slide's example).
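A minimal textbook CG loop, written with dense lists for readability, shows the "at most n steps" property on a 2x2 system. The specific matrix and right-hand side are chosen here for illustration and are not taken from the slide.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for x = 0
    p = list(r)                      # first search direction
    rs = dot(r, r)
    if rs == 0.0:
        return x, 0
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)      # exact line minimisation along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:
            return x, k
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]   # A-conjugate update
        rs = rs_new
    return x, max_iter


x, iters = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(x, iters)
```

The only operations are matrix-vector products, dot products, and vector updates, which is why the storage format of the matrix has such a direct effect on run time.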

Preconditioning: The plain conjugate gradient (CG) method takes too many iterations because the matrix is usually ill-conditioned (large condition number). However, the total runtime of a preconditioned solver can be even greater than that of plain CG if the preconditioner is ineffective or expensive to apply, even though the number of iterations is smaller. Three methods are compared: CG, diagonal-preconditioned CG (DPCG), and multigrid-preconditioned CG (MGPCG).

Preconditioning (cont.): Preconditioning applies a transformation, called the preconditioner, that conditions a given problem into a form more suitable for numerical solution. Topics: the preconditioned system, the preconditioned iterative method, and practical preconditioners.
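The effect of a preconditioner can be seen with the simplest choice, the diagonal (Jacobi) preconditioner M = diag(A), which replaces the system Ax = b by the better-conditioned M^{-1}Ax = M^{-1}b. The sketch below is a standard preconditioned CG loop; the badly scaled test matrix is constructed for the example and does not come from the talk.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, inv_diag=None, tol=1e-10, max_iter=500):
    """Preconditioned CG; inv_diag=None means no preconditioning (plain CG)."""
    if inv_diag is None:
        apply_m = lambda r: list(r)
    else:
        apply_m = lambda r: [w * ri for w, ri in zip(inv_diag, r)]
    x = [0.0] * len(b)
    r = list(b)                       # residual for x = 0
    z = apply_m(r)                    # preconditioned residual M^{-1} r
    p = list(z)
    rz = dot(r, z)
    stop = tol * dot(b, b) ** 0.5     # relative residual criterion
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < stop:
            return x, k
        z = apply_m(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Hypothetical test matrix: tridiagonal, SPD, diagonal spanning about five decades.
n = 50
d = [10 ** (i / 10) for i in range(n)]
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = d[i]
    if i + 1 < n:
        A[i][i + 1] = A[i + 1][i] = -0.1 * (d[i] * d[i + 1]) ** 0.5
b = [1.0] * n

x_cg, it_cg = pcg(A, b)                                   # plain CG
x_dp, it_dp = pcg(A, b, inv_diag=[1.0 / di for di in d])  # diagonal-preconditioned CG
print("CG iterations:", it_cg, "DPCG iterations:", it_dp)
```

Applying this preconditioner costs one multiplication per unknown, so it is cheap; stronger preconditioners such as multigrid cut the iteration count further but cost more per iteration, which is the trade-off behind comparing CG, DPCG, and MGPCG.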

Multi-grid preconditioner: The details are not spelled out here, but the idea is to coarsen the grid to reduce complexity.

Experimental results: Environment: Intel Core 2 Quad 2.66GHz with one NVIDIA GeForce GTX 285. DRAM: 6GB for the CPU, 2GB for the GPU. C++ and CUDA on Linux. Inlet water temperature: 50℃. Test cases: a set of 3D designs, each stacking 6 2D dies. Convergence criterion of the iterative solver: residual norm < 10^-6. The error is negligible.

Experimental results (cont.): The traditional smoother is vertical-line smoothing. Significant speedup of at least 35x.

Conclusions: Proposes GPU-based thermal simulation methods for 3D ICs with integrated liquid-cooled microchannels, built on a GPU-friendly two-step block-based relaxation scheme. Delivers highly accurate results with significant speedup.