How Efficient Can We Be?: Bounds on Algorithm Energy Consumption

How Efficient Can We Be?: Bounds on Algorithm Energy Consumption
Andrew Gearhart

Relation to ASPIRE
- ASPIRE ("Algorithms and Specializers for Provably-optimal Implementations with Resiliency and Efficiency"); recall Krste's talk
- "Provably-optimal" is the focus of this work
- Software and hardware design use feedback to "co-tune" compute kernel energy efficiency

Previous Work: Communication Lower Bounds
- Bounds on bandwidth and number of messages for most of dense and sparse linear algebra
- Bounds for homogeneous and heterogeneous machine models (e.g., mixed CPU/GPU)
- Led to the development of many "communication-optimal" algorithms
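As a concrete instance, the classic bandwidth and latency lower bounds for dense O(n^3) matrix multiplication on a machine with a fast memory of size M (a standard result from this line of work, stated here for context):

```latex
% Any schedule for classical (non-Strassen) matrix multiplication must move
% at least this many words between fast and slow memory:
W = \Omega\!\left( \frac{n^3}{\sqrt{M}} \right)
% With at most M words per message, a latency (message-count) bound follows:
S = \Omega\!\left( \frac{n^3}{M^{3/2}} \right)
```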

Communication is energy inefficient!
- The on-chip/off-chip energy gap isn't going to improve much
- Data from John Shalf, LBNL

Communication is energy inefficient!
- Communication lower bounds + machine models = lower bounds on energy
- Machine models can be simple, or more complicated… (a minimal sketch of the simple case follows)
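As one reading of the "simple" end of the spectrum, here is a minimal sketch of a two-level distributed machine model; the field names are illustrative assumptions, not notation from the slides:

```python
from dataclasses import dataclass

# A minimal sketch of a "simple" machine model (illustrative, not the
# slides' exact model): P processors, each with a fast memory of M words,
# with per-flop, per-word, and per-message costs.
@dataclass
class Machine:
    P: int        # number of processors
    M: int        # words of fast memory per processor
    gamma: float  # cost per floating-point operation
    beta: float   # cost per word moved (inverse bandwidth)
    alpha: float  # cost per message (latency)
```

Plugging a communication lower bound on the words (W) and messages (S) into such a model immediately yields a lower bound on runtime or energy.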

Runtime and Energy Models
- Currently simple linear expressions that include bandwidth (W) and per-transfer (S) costs, which link directly to the communication bounds
- Models are based on a set of hardware and algorithmic parameters (see the sketch below)
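A minimal sketch of what such linear models might look like, assuming the flop/word/message accounting above; the exact functional form and the memory/idle terms are illustrative assumptions:

```python
# F = flops, W = words moved (bandwidth term), S = messages (latency term).
def runtime(F, W, S, gamma_t, beta_t, alpha_t):
    # Linear runtime model: time per flop, per word, and per message.
    return gamma_t * F + beta_t * W + alpha_t * S

def energy(F, W, S, T, P, M, gamma_e, beta_e, alpha_e, delta_e, epsilon_e):
    # Linear energy model: dynamic energy per flop/word/message on each of
    # P processors, plus a term for holding M words of memory for time T,
    # plus a fixed idle power drawn for the whole run.
    return P * (gamma_e * F + beta_e * W + alpha_e * S
                + delta_e * M * T) + epsilon_e * T
```

Because a communication lower bound gives a lower bound on W and S, it translates directly into a lower bound on both expressions.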

Example: 2.5D Matrix-Matrix Multiplication (GEMM)
- The 2.5D algorithm computes GEMM by replicating input data to reduce communication (the resulting bounds are sketched below)
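The replication-versus-communication trade can be stated precisely (a standard result for 2.5D GEMM, included here for context): with c ∈ [1, P^{1/3}] copies of the input spread over P processors, each holding M ≈ cn²/P words,

```latex
W = \Omega\!\left( \frac{n^2}{\sqrt{cP}} \right), \qquad
S = \Omega\!\left( \sqrt{\frac{P}{c^{3}}} \right)
% c = 1 recovers the 2D (Cannon/SUMMA-style) case; c = P^{1/3} is the 3D case.
```

so each extra factor of replication buys roughly a √c reduction in words moved, at the price of c times the memory footprint.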

Energy Bounds
- Have energy bounds for GEMM, matrix-vector multiplication, Strassen's matrix multiplication, FFT, n-body, LU
- Open question: what are the energy-optimal machine parameters for a given problem? (one possible reading is sketched below)
ASPIRE Open House on the 5th floor of Soda Hall!
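One way to read the energy-optimal-parameters question is as a small optimization over the model above. A hypothetical sketch, with all constants invented purely for illustration:

```python
import math

# Hypothetical example: scan the 2.5D replication factor c for GEMM and
# report the energy-minimizing setting under the linear model above.
# Every constant here is made up for illustration.
def gemm_energy(n, P, c,
                gamma_t=1e-10, beta_t=1e-9, alpha_t=1e-7,
                gamma_e=1e-9, beta_e=1e-8, alpha_e=1e-6,
                delta_e=1e-12, epsilon_e=0.1):
    F = n**3 / P                   # flops per processor
    W = n**2 / math.sqrt(c * P)    # words moved per processor (2.5D bound)
    S = math.sqrt(P / c**3)        # messages per processor (2.5D bound)
    M = c * n**2 / P               # memory footprint per processor
    T = gamma_t * F + beta_t * W + alpha_t * S
    return P * (gamma_e * F + beta_e * W + alpha_e * S
                + delta_e * M * T) + epsilon_e * T

n, P = 1 << 14, 64
best_c = min(range(1, int(round(P ** (1 / 3))) + 1),
             key=lambda c: gemm_energy(n, P, c))
print("energy-minimizing replication factor c =", best_c)
```

The same scan could range over memory size, processor count, or link bandwidth, which is the sense in which hardware and software can be co-tuned.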