Optimizing Matrix Multiplication with a Classifier Learning System. Xiaoming Li (presenter) and María Jesús Garzarán, University of Illinois at Urbana-Champaign.


Tuning library for recursive matrix multiplication
Use cache-aware algorithms that take into account architectural features
– Memory hierarchy
– Register file, …
Take into account input characteristics
– Matrix sizes
The process of tuning is automatic.

Recursive Matrix Partitioning
Previous approaches:
– Multiple recursive steps
– Only divide by half
[Figure: matrices A and B divided in half at each recursive step (Step 1, Step 2)]

Recursive Matrix Partitioning
Our approach is more general:
– No need to divide by half
– May use a single step to reach the same partition
– Faster and more general
[Figure: matrices A and B reaching the same partition in a single step (Step 1)]

Our approach
A general framework to describe a family of recursive matrix multiplication algorithms, where, given the input dimensions of the matrices, we determine:
– Number of partition levels
– How to partition at each level
An intelligent search method based on a classifier learning system
– Searches for the best partitioning strategy in a huge search space
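
To make the framework concrete, here is a minimal sketch (not the paper's implementation) of a multi-level recursive multiply driven by per-level partition factors. It assumes row-major storage, that each factor divides its dimension exactly (padding handles the remainder), and that C is zero-initialized:

    #include <stddef.h>

    /* Sketch: multi-level recursive matrix multiply, C += A * B.
     * pm/pn/pk give the partition factors for each recursion level;
     * factors are assumed to divide the dimensions exactly. */
    void rmm(double *C, const double *A, const double *B,
             int M, int N, int K, int ldc, int lda, int ldb,
             const int *pm, const int *pn, const int *pk,
             int level, int levels) {
        if (level == levels) {              /* base case: plain triple loop */
            for (int i = 0; i < M; i++)
                for (int k = 0; k < K; k++)
                    for (int j = 0; j < N; j++)
                        C[(size_t)i*ldc + j] +=
                            A[(size_t)i*lda + k] * B[(size_t)k*ldb + j];
            return;
        }
        int bm = M / pm[level], bn = N / pn[level], bk = K / pk[level];
        for (int i = 0; i < pm[level]; i++)          /* tile row of C      */
            for (int j = 0; j < pn[level]; j++)      /* tile column of C   */
                for (int k = 0; k < pk[level]; k++)  /* reduction tiles    */
                    rmm(C + (size_t)i*bm*ldc + (size_t)j*bn,
                        A + (size_t)i*bm*lda + (size_t)k*bk,
                        B + (size_t)k*bk*ldb + (size_t)j*bn,
                        bm, bn, bk, ldc, lda, ldb,
                        pm, pn, pk, level + 1, levels);
    }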

Outline
Background
Partition Methods
Classifier Learning System
Experimental Results

Recursive layout framework
Multiple levels of recursion
– Takes into account the cache hierarchy
[Figure: the matrix laid out in recursively blocked form, one level of blocking per level of the cache hierarchy]
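
As an illustration of the layout idea (a one-level sketch; the framework applies this recursively, one level per cache level), copying a row-major matrix into a blocked layout in which each tile is contiguous might look like:

    /* Sketch: copy a row-major M x N matrix into a one-level blocked
     * layout in which each bm x bn tile is stored contiguously.
     * Assumes bm divides M and bn divides N (padding handles the rest). */
    void to_blocked(double *dst, const double *src,
                    int M, int N, int bm, int bn) {
        int idx = 0;
        for (int bi = 0; bi < M; bi += bm)          /* tile row    */
            for (int bj = 0; bj < N; bj += bn)      /* tile column */
                for (int i = bi; i < bi + bm; i++)
                    for (int j = bj; j < bj + bn; j++)
                        dst[idx++] = src[i * N + j];
    }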

Padding
Necessary when the partition factor is not a divisor of the matrix dimension
[Figure: a dimension of 667 divided by 3 and by 4; dividing by 4 pads 667 up to 668]
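
The padded size itself is simple to compute; a minimal sketch, assuming the padding rounds each dimension up to the next multiple of the partition factor:

    /* Pad a dimension up to the next multiple of the partition factor p. */
    int padded_dim(int n, int p) {
        return ((n + p - 1) / p) * p;   /* ceil(n / p) * p */
    }
    /* Example: padded_dim(667, 4) == 668, i.e., 4 tiles of 167 each. */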

Recursive layout in our framework
Multiple levels of recursion
– Supports the cache hierarchy
Square tiles or rectangular tiles
– Fit non-square matrices
[Figure: a 9 × 8 matrix; square tiles force padding to 10 × 8, while rectangular 3 × 4 tiles fit the non-square matrix exactly]

Outline
Background
Partition Methods
Classifier Learning System
Experimental Results

Two methods to partition matrices
Partition by Block (PB)
– Specify the size of each tile
– Example: dimensions (M,N,K) = (100, 100, 40), tile size (bm, bn, bk) = (50, 50, 20), partition factors (pm, pn, pk) = (2, 2, 2)
– Tiles need not be square
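
A sketch of how PB's partition factors follow from the requested tile sizes (assuming ceiling division when the tile does not divide the dimension):

    /* Partition by Block: number of tiles along one dimension. */
    int pb_factor(int dim, int block) {
        return (dim + block - 1) / block;   /* ceil(dim / block) */
    }
    /* Example from the slide: pb_factor(100, 50) == 2 and
     * pb_factor(40, 20) == 2, so (M,N,K) = (100,100,40) with tiles
     * (50,50,20) gives partition factors (2,2,2). */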

Partition by Size (PS)
– Specify the maximum size of the three tiles
– Keep the ratios between the dimensions constant
– Example: (M,N,K) = (100, 100, 50), maximum tile size for M,N = 1250, (pm, pn, pk) = (2, 2, 1)
– Generalization of the divide-by-half approach: tile size = 1/4 × matrix size
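
A minimal sketch of the PS idea, under the assumption that the size budget bounds the number of elements in the M × N tile and that factors grow by doubling so the dimension ratios stay constant (the paper's exact size accounting for the three tiles may differ):

    /* Partition by Size (sketch): double the factor until the tile fits. */
    void ps_factors(int m, int n, int max_tile, int *pm, int *pn) {
        int p = 1;
        while ((m / p) * (n / p) > max_tile)   /* elements in the M x N tile */
            p *= 2;                            /* same factor per dimension  */
        *pm = p;
        *pn = p;
    }
    /* Example: m = n = 100 and max_tile = 2500 give p = 2, i.e., the
     * divide-by-half case where tile size = 1/4 of the matrix size. */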

Outline
Background
Partition Methods
Classifier Learning System
Experimental Results

Classifier Learning System
Use the two partition primitives to determine how the input matrices are partitioned
– Determine the partition factors at each level: f: (M, N, K) → (pm_i, pn_i, pk_i), i = 0, 1, 2 (only three levels are considered)
The partition factors depend on the matrix size
– E.g., the partition factors of a 1000 × 1000 matrix should be different from those of a 50 × 1000 matrix
The partition factors also depend on architectural characteristics, such as cache size

Determine the best partition factors
The search space is huge, so exhaustive search is impossible
Our proposal: use a multi-step classifier learning system
– It creates a table that, given the matrix dimensions, determines the partition factors

Classifier Learning System
The result of the classifier learning system is a table with two columns
– Column 1 (Pattern): a string of 0, 1, and * that encodes the dimensions of the matrices
– Column 2 (Action): the partition method for one step, built from the partition-by-block and partition-by-size primitives with different parameters

Learn with Classifier System

    Pattern             Action
    (10***, 11***)      PS 100
    …                   …
    (010**, 011**)      PB (4,4)

[Figure walk-through: each dimension is encoded with 5 bits. An input of dimensions 16 × 24 (10000, 11000) matches the first rule, and PS 100 partitions it into 8 × 12 tiles; 8 × 12 (01000, 01100) then matches the second rule, and PB (4,4) yields 4 × 4 tiles.]
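
A sketch of how a ternary pattern such as (010**, 011**) can be matched against binary-encoded dimensions (assuming the 5-bit, most-significant-bit-first encoding shown above):

    #include <stdbool.h>

    /* Match one dimension against a ternary pattern such as "010**",
     * where '*' matches either bit value. */
    bool match_dim(const char *pattern, unsigned dim, int bits) {
        for (int i = 0; i < bits; i++) {
            unsigned bit = (dim >> (bits - 1 - i)) & 1u;   /* MSB first */
            if (pattern[i] != '*' && (unsigned)(pattern[i] - '0') != bit)
                return false;
        }
        return true;
    }
    /* Example: match_dim("010**", 8, 5) is true, since 8 == 01000. */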

How does the classifier learning algorithm work?
Change the table based on feedback about performance and accuracy from previous runs
Mutate the condition part of the table to adjust the range of matching matrix dimensions
Mutate the action part to find the best partition method for the matching matrices
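
For instance, mutating the condition part could be as simple as flipping one position of the ternary pattern among '0', '1', and '*' (a schematic sketch; the paper's actual mutation operators and fitness feedback are not detailed here):

    #include <stdlib.h>

    /* Mutate one randomly chosen position of a ternary pattern, widening
     * or narrowing the range of matrix dimensions the rule matches. */
    void mutate_pattern(char *pattern, int len) {
        static const char alphabet[] = "01*";
        pattern[rand() % len] = alphabet[rand() % 3];
    }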

Outline
Background
Partition Methods
Classifier Learning System
Experimental Results

Experiments on three platforms:
– Sun UltraSparc III
– Intel P4 Xeon
– Intel Itanium 2
Matrices of sizes from 1000 × 1000 to 5000 × 5000

Algorithms
Classifier MMM: our approach
– Includes the overhead of copying in and out of the recursive layout
ATLAS: library generated by ATLAS using the search procedure, without hand-written codes
– Has some type of blocking for L2
L1: one level of tiling
– Tile size: the same as ATLAS uses for L1
L2: two levels of tiling
– L1 tile and L2 tile: the same as ATLAS for L1

Conclusion and Future Work
Preliminary results show the effectiveness of our approach
– Sun UltraSparc III and Xeon: 18% and 5% improvement, respectively
– Itanium: -14%
Need to improve the padding mechanism
– Reduce the amount of padding
– Avoid unnecessary computation on the padded regions

Thank you!