High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley www.cs.berkeley.edu/~demmel/Itanium_121001.ppt.

Slides:

Advertisements

Similar presentations

Statistical Modeling of Feedback Data in an Automatic Tuning System Richard Vuduc, James Demmel (U.C. Berkeley, EECS) Jeff.

Advertisements

The view from space Last weekend in Los Angeles, a few miles from my apartment…

Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow

Carnegie Mellon Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension Automatic Generation.

Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

Modified Nodal Analysis for MEMS Design Using SUGAR Ningning Zhou, Jason Clark, Kristofer Pister, Sunil Bhave, BSAC David Bindel, James Demmel, Depart.

Advanced microprocessor optimization Kampala August, 2007 Agner Fog

High Performance Computing The GotoBLAS Library. HPC: numerical libraries  Many numerically intensive applications make use of specialty libraries to.

1cs542g-term Notes  Assignment 1 will be out later today (look on the web)

Benchmarking Sparse Matrix-Vector Multiply In 5 Minutes Hormozd Gahvari, Mark Hoemmen, James Demmel, and Kathy Yelick January 21, 2007 Hormozd Gahvari,

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

Languages and Compilers for High Performance Computing Kathy Yelick EECS Department U.C. Berkeley.

Automatic Performance Tuning of Sparse Matrix Kernels Observations and Experience Performance tuning is tedious and time- consuming work. Richard Vuduc.

NSF Foundations of Hybrid and Embedded Software Systems UC Berkeley: Chess Vanderbilt University: ISIS University of Memphis: MSI A New System Science.

Remote MEMS Simulation, Testing and Design James Demmel, EECS & Math Kristofer Pister, Richard Muller, BSAC, EECS, UC Berkeley Sanjay Govindjee, CEE Alice.

High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley

Automatic Performance Tuning of Numerical Kernels BeBOP: Berkeley Benchmarking and OPtimization Automatic Performance Tuning of Numerical Kernels BeBOP:

Optimization of Sparse Matrix Kernels for Data Mining Eun-Jin Im and Katherine Yelick U.C.Berkeley.

Modeling MEMS Sensors [SUGAR: A Computer Aided Design Tool for MEMS ]

When Cache Blocking of Sparse Matrix Vector Multiply Works and Why By Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick BeBOP.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

High Performance Computing 1 Parallelization Strategies and Load Balancing Some material borrowed from lectures of J. Demmel, UC Berkeley.

Programming Systems for a Digital Human Kathy Yelick EECS Department U.C. Berkeley.

Programming Tools and Environments: Linear Algebra James Demmel Mathematics and EECS UC Berkeley.

NSF Foundations of Hybrid and Embedded Software Systems UC Berkeley: Chess Vanderbilt University: ISIS University of Memphis: MSI A New System Science.

Automatic Performance Tuning of Sparse Matrix Kernels Berkeley Benchmarking and OPtimization (BeBOP) Project James.

BeBOP: Berkeley Benchmarking and Optimization Automatic Performance Tuning of Numerical Kernels James Demmel EECS and Math UC Berkeley Katherine Yelick.

SUGAR: A MEMS Simulator  Our goal: Be SPICE to the MEMS world  Provide quick simulations for tight design loop  Use.

Simulating MEMS David Bindel April 11, Overview What are MEMS? Modeling and simulation The SUGAR simulator Ongoing work Conclusion.

Digital Signal Processors for Real-Time Embedded Systems By Jeremy Kohel.

An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.

L11: Sparse Linear Algebra on GPUs CS Sparse Linear Algebra 1 L11: Sparse Linear Algebra CS6235

® Backward Error Analysis and Numerical Software Sven Hammarling NAG Ltd, Oxford

SPL: A Language and Compiler for DSP Algorithms Jianxin Xiong 1, Jeremy Johnson 2 Robert Johnson 3, David Padua 1 1 Computer Science, University of Illinois.

1 Intel Mathematics Kernel Library (MKL) Quickstart COLA Lab, Department of Mathematics, Nat’l Taiwan University 2010/05/11.

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick Lawrence Berkeley National Laboratory ACM International Conference.

Telecommunications and Signal Processing Seminar Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at.

CS6963 L15: Design Review and CUBLAS Paper Discussion.

1 “How Can We Address the Needs and Solve the Problems in HPC Benchmarking?” Jack Dongarra Innovative Computing Laboratory University of Tennesseehttp://

Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University.

Memory Intensive Benchmarks: IRAM vs. Cache Based Machines Parry Husbands (LBNL) Brain Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL), Rupak.

Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences

Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD September 7, 2007 PaCT-2007, Pereslavl-Zalessky.

Carnegie Mellon High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gačić Markus Püschel José M. F. Moura.

Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures Ankit Jain, Vasily Volkov CS252 Final Presentation 5/9/2007

BeBOP: Berkeley Benchmarking and Optimization Automatic Performance Tuning of Numerical Kernels James Demmel EECS and Math UC Berkeley Katherine Yelick.

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA

© 2012 Autodesk A Fast Modal (Eigenvalue) Solver Based on Subspace and AMG Sam MurgieJames Herzing Research ManagerSimulation Evangelist.

Compilers and Applications Kathy Yelick Dave Judd, Ronny Krashinsky, Randi Thomas, Samson Kwok, Simon Yau, Kar Ming Tang, Adam Janin, Thinh Nguyen Computer.

Compilers as Collaborators and Competitors of High-Level Specification Systems David Padua University of Illinois at Urbana-Champaign.

Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science.

Potential Projects Jim Demmel CS294 Fall, 2011 Communication-Avoiding Algorithms

FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine David A. Bader, Virat Agarwal.

Parallel Programming & Cluster Computing Linear Algebra Henry Neeman, University of Oklahoma Paul Gray, University of Northern Iowa SC08 Education Program’s.

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com November 12, 2012.

Intro to Scientific Libraries Intro to Scientific Libraries Blue Waters Undergraduate Petascale Education Program May 29 – June

Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,

University of Tennessee Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley University of Tennessee

Optimizing the Performance of Sparse Matrix-Vector Multiplication

William Stallings Computer Organization and Architecture 6th Edition

University of California, Berkeley

Programming Models for SimMillennium

Sparse Matrix-Vector Multiplication (Sparsity, Bebop)

for more information ... Performance Tuning

Centar ( Global Signal Processing Expo

Performance Optimization for Embedded Software

STUDY AND IMPLEMENTATION

Presentation transcript:

High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley Joint work with David Culler, Michael Jordan, William Kahan, Katherine Yelick, Zhaojun Bai (UC Davis)

Outline  What’s up with Millennium  Automatic performance tuning  Applications to  SUGAR – a MEMS CAD system  Information Retrieval  Future Work

Millennium  Cluster of clusters at UC Berkeley  309 CPU cluster in Soda Hall  Smaller clusters across campus  Made possible by Intel equipment grant  Significant other support  NSF, Sun, Microsoft, Nortel, campus 

Millennium Topology

Millennium Usage Oct 1 – 11, % utilization for last few days About half the jobs are parallel

Usage highlights  AMANDA  Antarctic Muon And Neutrino Detector Array  amanda.berkeley.edu amanda.berkeley.edu  128 scientists from 15 universities and institutes in the U.S. and Europe.  TEMPEST  EUV lithography simulations via 3D electromagnetic scattering  cuervo.eecs.berkeley.edu/Volcano/ cuervo.eecs.berkeley.edu/Volcano/  study the defect printability on multilayer masks  Titanium  High performance Java dialect for scientific computing   Implementation of shared address space, and use of SSE2  Digital Library Project  Large database of images  elib.cs.berkeley.edu/ elib.cs.berkeley.edu/  Used to run spectral image segmentation algorithm for clustering, search on images

Usage highlights (continued)  CS 267  Graduate class in parallel computing, 33 enrolled   Homework  Disaster Response  Help find people after Sept 11, set up immediately afterwards  safe.millennium.berkeley.edu safe.millennium.berkeley.edu  48K reports in database, linked to other survivor databases  MEMS CAD (MicroElectroMechanical Systems Computer Aided Design)  Tool to help design MEMS systems  Used this semester in EE 245, 93 enrolled  sugar.millennium.berkeley.edu sugar.millennium.berkeley.edu  More later in talk  Information Retrieval  Development of faster information retrieval algorithms   More later in talk  Many applications are part of CITRIS

Performance Tuning Performance Tuning  Motivation: performance of many applications dominated by a few kernels  MEMS CAD  Nonlinear ODEs  Nonlinear equations  Linear equations  Matrix multiply  Matrix-by-matrix or matrix-by-vector  Dense or Sparse  Information retrieval by LSI  Compress term- document matrix  …  Sparse mat-vec multiply  Information retrieval by LDA  Maximum likelihood estimation  …  Solve linear systems  Many other examples (not all linear algebra)

Conventional Performance Tuning  Vendor or user hand tunes kernels  Drawbacks:  Very time consuming and difficult work  Even with intimate knowledge of architecture and compiler, performance hard to predict  Must be redone for every architecture, compiler  Compiler technology often lags architecture  Not just a compiler problem:  Best algorithm may depend on input, so some tuning must occur at run-time.  Not all algorithms semantically or mathematically equivalent

Automatic Performance Tuning  Approach: for each kernel 1.Identify and generate a space of algorithms 2.Search for the fastest one, by running them  What is a space of algorithms?  Depends on kernel and input  May vary  instruction mix and order  memory access patterns  data structures  mathematical formulation  When do we search?  Once per kernel and architecture  At compile time  At run time

Tuning pays off – PHIPAC (Bilmes, Asanovic, Vuduc, Demmel)

Tuning pays off – ATLAS (Dongarra, Whaley) Extends applicability of PHIPAC Incorporated in Matlab (with rest of LAPACK)

Other Automatic Tuning Projects  FFTs and Signal Processing  FFTW (  Given dimension n of FFT, choose best implementation at runtime by assembling prebuilt kernels for small factors of n  Widely used, won 1999 Wilkinson Prize for Numerical Software  SPIRAL (  Extensions to other transforms, DSPs  UHFFT  Extensions to higher dimension, parallelism  Special session at ICCS 2001  Organized by Yelick and Demmel   Proceedings available  Pointers to automatic tuning projects at

Search for optimal register tile sizes on Sun Ultra registers, but 2-by-3 tile size fastest

Search for Optimal L0 block size in dense matmul 60% of peak on Pentium II-300 4% of versions exceed

High precision dense mat-vec multiply

Tuning Sparse matrix operations  Sparsity  Optimizes y = A*x for a particular sparse A  Im and Yelick  Algorithm space  Different code organization, instruction mixes  Different register blockings (change data structure and fill of A)  Different cache blocking  Different number of columns of x  Different matrix orderings  Software and papers available 

Speedups on SPMV from Sparsity on Sun Ultra 1/170 – 1 RHS

Speedups on SPMV from Sparsity on Sun Ultra 1/170 – 9 RHS

Sparsity reg blocking results on P4 for FEM/fluids matrix 1

Sparsity reg blocking results on P4 for FEM/fluids matrix 2

Sparsity cache blocking results on P4 for LSI

Tuning other sparse operations  Symmetric matrix-vector multiply A*x  Solve a triangular system of equations T -1 *x  A T *A*x  Kernel of Information Retrieval via LSI  Same number of memory references as A*x  A 2 *x, A k *x  Kernel of Information Retrieval used by Google  Changes calling algorithm  A T *M*A  Matrix triple product  Used in multigrid solver  …

Symmetric Sparse Matrix-Vector Multiply on P4

Sparse Triangular Solve on P4

Applications to SUGAR – a tool for MEMS CAD  Demmel, Bai, Pister, Govindjee, Agogino, Gu, … DemmelBaiPisterGovindjeeAgoginoGu  Input: description of MicroElectroMechanical System (as netlist)  Output:  DC, steady state, modal, transient analyses to assess behavior  CIF for fabrication  Simulation capabilities  Beams and plates (linear, nonlinear, prestressed,…)  Electrostatic forces, circuits  Thermal expansion, Couette damping  Availability  Matlab  Publicly available  www-bsac.eecs.berkeley.edu/~cfm www-bsac.eecs.berkeley.edu/~cfm  249 registered users, many unregistered  Web service – M & MEMS  Runs on Millennium  sugar.millennium.berkeley.edu sugar.millennium.berkeley.edu  Now in use in EE 245 at UCB…96 users  Lots of new features being added, including interface to measurements

Micromirror (Last, Pister)  Laterally actuated torsionally suspended micromirror  Over 10K dof, 100 line netlist (using subnets)  DC and frequency analysis  All algorithms reduce to previous kernels

Information Retrieval  Jordan Jordan  Collaboration with Intel team building probabilistic graphical models  Better alternatives to LSI for document modeling and search  Latent Dirichlet Allocation (LDA)  Model documents as union of themes, each with own word distribution  Maximum likelihood fit to find themes in set of documents, classify them  Computational bottleneck is solution of enormous linear systems  One of largest Millennium users  Kernel ICA  Estimate set of sources s and mixing matrix A from samples x = A*s  New way to sample such that sources are as independent as possible  Again reduces to linear algebra kernels  Identifying influential documents  Given hyperlink patterns of documents, which are most influential?  Basis of Google (eigenvector of link matrix  sparse matrix vector multiply)  Applying Markov chain and perturbation theory to assess reliability

Future Work  Exploit Itanium Architecture  128 (82-bit) floating pointer registers  fused multiply-add instruction  predicated instructions  rotating registers for software pipelining  prefetching instructions  three levels of cache  Tune current and wider set of kernels  Incorporate into  SUGAR  Information Retrieval  Further automate performance tuning  Generation of algorithm space generators