How Efficient Can We Be?: Bounds on Algorithm Energy Consumption

How Efficient Can We Be?: Bounds on Algorithm Energy Consumption
Andrew Gearhart

Relation to ASPIRE
- ASPIRE ("Algorithms and Specializers for Provably-optimal Implementations with Resiliency and Efficiency"); recall Krste's talk
- "Provably-optimal" is the focus of this work
- Software and hardware design use feedback to "co-tune" compute kernel energy efficiency

Previous Work: Communication Lower Bounds
- Bounds on bandwidth and number of messages for most of dense and sparse linear algebra
- Bounds for homogeneous and heterogeneous machine models (e.g., mixed CPU/GPU)
- Led to the development of many "communication-optimal" algorithms
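As a concrete instance, the classic bandwidth and latency lower bounds for dense O(n^3) matrix multiplication on a machine with a fast memory of size M (a standard result from this line of work, stated here for context):

```latex
% Any schedule for classical (non-Strassen) matrix multiplication must move
% at least this many words between fast and slow memory:
W = \Omega\!\left( \frac{n^3}{\sqrt{M}} \right)
% With at most M words per message, a latency (message-count) bound follows:
S = \Omega\!\left( \frac{n^3}{M^{3/2}} \right)
```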

Communication is energy inefficient!
- The on-chip/off-chip energy gap isn't going to improve much
- Data from John Shalf, LBNL

Communication is energy inefficient!
- Communication lower bounds + machine models = lower bounds on energy
- Machine models can be simple, or more complicated… (a minimal sketch of the simple case follows)
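As one reading of the "simple" end of the spectrum, here is a minimal sketch of a two-level distributed machine model; the field names are illustrative assumptions, not notation from the slides:

```python
from dataclasses import dataclass

# A minimal sketch of a "simple" machine model (illustrative, not the
# slides' exact model): P processors, each with a fast memory of M words,
# with per-flop, per-word, and per-message costs.
@dataclass
class Machine:
    P: int        # number of processors
    M: int        # words of fast memory per processor
    gamma: float  # cost per floating-point operation
    beta: float   # cost per word moved (inverse bandwidth)
    alpha: float  # cost per message (latency)
```

Plugging a communication lower bound on the words (W) and messages (S) into such a model immediately yields a lower bound on runtime or energy.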

Runtime and Energy Models
- Currently simple linear expressions that include bandwidth (W) and per-transfer (S) costs, which link directly to the communication bounds
- Models are based on a set of hardware and algorithmic parameters (see the sketch below)
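A minimal sketch of what such linear models might look like, assuming the flop/word/message accounting above; the exact functional form and the memory/idle terms are illustrative assumptions:

```python
# F = flops, W = words moved (bandwidth term), S = messages (latency term).
def runtime(F, W, S, gamma_t, beta_t, alpha_t):
    # Linear runtime model: time per flop, per word, and per message.
    return gamma_t * F + beta_t * W + alpha_t * S

def energy(F, W, S, T, P, M, gamma_e, beta_e, alpha_e, delta_e, epsilon_e):
    # Linear energy model: dynamic energy per flop/word/message on each of
    # P processors, plus a term for holding M words of memory for time T,
    # plus a fixed idle power drawn for the whole run.
    return P * (gamma_e * F + beta_e * W + alpha_e * S
                + delta_e * M * T) + epsilon_e * T
```

Because a communication lower bound gives a lower bound on W and S, it translates directly into a lower bound on both expressions.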

Example: 2.5D Matrix-Matrix Multiplication (GEMM)
- The 2.5D algorithm computes GEMM by replicating input data to reduce communication (the resulting bounds are sketched below)
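The replication-versus-communication trade can be stated precisely (a standard result for 2.5D GEMM, included here for context): with c ∈ [1, P^{1/3}] copies of the input spread over P processors, each holding M ≈ cn²/P words,

```latex
W = \Omega\!\left( \frac{n^2}{\sqrt{cP}} \right), \qquad
S = \Omega\!\left( \sqrt{\frac{P}{c^{3}}} \right)
% c = 1 recovers the 2D (Cannon/SUMMA-style) case; c = P^{1/3} is the 3D case.
```

so each extra factor of replication buys roughly a √c reduction in words moved, at the price of c times the memory footprint.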

Energy Bounds
- Have energy bounds for GEMM, matrix-vector multiplication, Strassen's matrix multiplication, FFT, n-body, LU
- Open question: what are the energy-optimal machine parameters for a given problem? (one possible reading is sketched below)
ASPIRE Open House on the 5th floor of Soda Hall!
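One way to read the energy-optimal-parameters question is as a small optimization over the model above. A hypothetical sketch, with all constants invented purely for illustration:

```python
import math

# Hypothetical example: scan the 2.5D replication factor c for GEMM and
# report the energy-minimizing setting under the linear model above.
# Every constant here is made up for illustration.
def gemm_energy(n, P, c,
                gamma_t=1e-10, beta_t=1e-9, alpha_t=1e-7,
                gamma_e=1e-9, beta_e=1e-8, alpha_e=1e-6,
                delta_e=1e-12, epsilon_e=0.1):
    F = n**3 / P                   # flops per processor
    W = n**2 / math.sqrt(c * P)    # words moved per processor (2.5D bound)
    S = math.sqrt(P / c**3)        # messages per processor (2.5D bound)
    M = c * n**2 / P               # memory footprint per processor
    T = gamma_t * F + beta_t * W + alpha_t * S
    return P * (gamma_e * F + beta_e * W + alpha_e * S
                + delta_e * M * T) + epsilon_e * T

n, P = 1 << 14, 64
best_c = min(range(1, int(round(P ** (1 / 3))) + 1),
             key=lambda c: gemm_energy(n, P, c))
print("energy-minimizing replication factor c =", best_c)
```

The same scan could range over memory size, processor count, or link bandwidth, which is the sense in which hardware and software can be co-tuned.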