OPTIMIZING LU FACTORIZATION IN CILK++
Nathan Beckmann, Silas Boyd-Wickizer
THE PROBLEM
- LU is a common matrix operation with a broad range of applications
- Writes a matrix as the product of a lower triangular L and an upper triangular U
- Example: PA = LU

      [a11 a12 a13]   [l11  0   0 ] [u11 u12 u13]
      [a21 a22 a23] = [l21 l22  0 ] [ 0  u22 u23]
      [a31 a32 a33]   [l31 l32 l33] [ 0   0  u33]
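To make PA = LU concrete, here is a minimal, unblocked LU with partial pivoting in plain C. This is an illustration only, nothing like the tuned GotoBLAS2 kernels the implementations in this deck actually call:

```c
#include <assert.h>
#include <math.h>

/* Minimal in-place LU with partial pivoting: computes P*A = L*U for an
 * n x n row-major matrix. On return, a[] holds U in its upper triangle
 * and the multipliers of L (unit diagonal implied) below it; piv[k]
 * records which row was swapped into row k at step k. */
void lu_factor(double *a, int n, int *piv) {
    for (int k = 0; k < n; k++) {
        int p = k;                               /* find the pivot row */
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i*n + k]) > fabs(a[p*n + k])) p = i;
        piv[k] = p;
        if (p != k)                              /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = a[k*n + j]; a[k*n + j] = a[p*n + j]; a[p*n + j] = t;
            }
        for (int i = k + 1; i < n; i++) {        /* eliminate below pivot */
            a[i*n + k] /= a[k*n + k];
            for (int j = k + 1; j < n; j++)
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
        }
    }
}
```

Multiplying the stored L and U back together reproduces the row-permuted input, which is the property the parallel implementations below must preserve.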
THE PROBLEM
[Figure: small parallelism vs. big parallelism in the factorization]
OUTLINE
- Overview
- Results
- Conclusion
OVERVIEW
- Four implementations of LU:
  - PLASMA (highly optimized third-party library)
  - Sivan Toledo's algorithm in Cilk++ (courtesy of Bradley)
  - Parallel standard "right-looking" in Cilk++
  - Right-looking in pthreads
- All implementations use the same base case: GotoBLAS2 matrix routines
- Analyze performance:
  - Machine architecture
  - Cache behavior
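The "right-looking" structure can be sketched in plain C. The helper names below are ours, standing in for the GotoBLAS2 base cases (dgetrf / dtrsm / dgemm), and pivoting is omitted for brevity; in the Cilk++ version, the solve loop and the trailing-update loop are the independent, spawnable tasks:

```c
#include <assert.h>
#include <math.h>

/* Naive stand-in for the panel base case: unblocked LU on the kb-wide
 * panel, columns k..k+kb-1, of an n x n row-major matrix. */
static void factor_panel(double *A, int n, int k, int kb) {
    for (int c = k; c < k + kb; c++)
        for (int r = c + 1; r < n; r++) {
            A[r*n + c] /= A[c*n + c];
            for (int j = c + 1; j < k + kb; j++)
                A[r*n + j] -= A[r*n + c] * A[c*n + j];
        }
}

/* U(k,j) = L(k,k)^-1 * A(k,j): unit lower-triangular solve (dtrsm). */
static void solve_block(double *A, int n, int k, int kb, int j, int jb) {
    for (int c = k; c < k + kb; c++)
        for (int r = c + 1; r < k + kb; r++)
            for (int t = j; t < j + jb; t++)
                A[r*n + t] -= A[r*n + c] * A[c*n + t];
}

/* Trailing update A(i,j) -= L(i,k) * U(k,j) (dgemm). */
static void update_block(double *A, int n, int i, int ib, int j, int jb,
                         int k, int kb) {
    for (int r = i; r < i + ib; r++)
        for (int t = j; t < j + jb; t++)
            for (int c = k; c < k + kb; c++)
                A[r*n + t] -= A[r*n + c] * A[c*n + t];
}

void lu_right_looking(double *A, int n, int nb) {
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;
        factor_panel(A, n, k, kb);
        for (int j = k + kb; j < n; j += nb) {           /* independent: */
            int jb = (n - j < nb) ? n - j : nb;          /* cilk_spawn in */
            solve_block(A, n, k, kb, j, jb);             /* the real code */
        }
        for (int i = k + kb; i < n; i += nb)             /* independent */
            for (int j = k + kb; j < n; j += nb) {       /* block updates */
                int ib = (n - i < nb) ? n - i : nb;
                int jb = (n - j < nb) ? n - j : nb;
                update_block(A, n, i, ib, j, jb, k, kb);
            }
    }
}
```

Each step factors one panel, then updates everything to its right, which is why parallelism is small early (one panel) and large later (many independent block updates).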
OUTLINE
- Overview
- Results
  - Summary
  - Architectural heterogeneity
  - Cache effects
  - Parallelism
  - Scheduling
  - Code size
- Conclusion
METHODOLOGY
- Machine configurations:
  - AMD16: quad-socket, quad-core AMD Opteron, 2.0 GHz
  - Intel16: quad-socket, quad-core Intel Xeon, 2.0 GHz
  - Intel8: dual-socket, quad-core Intel Xeon, 2.4 GHz
- "Xen" indicates running a Xen-enabled kernel
  - All tests still ran in dom0 (outside the virtual machine)
PERFORMANCE SUMMARY
- Quite significant performance heterogeneity by machine architecture
- Large impact from caches
[Table: LU performance (GFLOPS on 4k x 4k, 8 cores) for PLASMA, Toledo, Right, and Pthread on AMD16, Intel16, Intel16Xen, and Intel8Xen; values not recovered]
LU SCALING
[Figure: LU scaling]
ARCHITECTURAL VARIATION (BY ARCHITECTURE)
[Figures: per-machine results for AMD16, Intel16, and Intel8Xen]
ARCHITECTURAL VARIATION (BY ALGORITHM)
XEN INTERFERENCE
- Strange behavior with increasing core count on Intel16
[Figures: Intel16Xen vs. Intel16]
CACHE INTERFERENCE
- Noticed a scaling problem with the Toledo algorithm
- Tested with matrices of size 2^n
- Cause: conflict misses in the processor cache
CACHE INTERFERENCE: EXAMPLE
- The AMD Opteron has 64-byte cache lines and a 64 KB 2-way set-associative cache: 512 sets, 2 cache lines each
- Every 32 KB (or 4096 doubles) maps to the same set
[Diagram: address decomposed into tag | set | offset]
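A quick way to see the conflict, using the cache parameters from the slide (the helper is ours, for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Cache geometry from the slide: 64 KB, 2-way, 64-byte lines => 512 sets.
 * The low 6 address bits are the line offset; the next 9 bits select the
 * set. Addresses 512 * 64 B = 32 KB apart therefore share a set. */
enum { LINE = 64, WAYS = 2, CACHE_BYTES = 64 * 1024,
       SETS = CACHE_BYTES / (LINE * WAYS) };   /* = 512 */

unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE) % SETS);
}
```

With 4096-double rows, one row is exactly 32 KB, so elements (i, j) and (i+1, j) map to the same set; with only two ways, walking down a column evicts on nearly every access.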
[Animated diagram: successive matrix rows of 4096 elements map to the same cache sets, evicting one another]
SOLUTION: PAD MATRIX ROWS
[Animated diagram: rows of 4096 elements plus an 8-element pad, shifting successive rows into different cache sets]
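The fix, sketched (the 8-element pad is from the slide; the allocator helper name is ours):

```c
#include <assert.h>
#include <stdlib.h>

#define PAD 8   /* pad from the slide: 8 doubles = one 64-byte cache line */

/* Allocate a rows x cols matrix whose leading dimension (row stride) is
 * padded. 4096 doubles = 32 KB = exactly one cache way, so unpadded rows
 * all land in the same sets; a 4104-double stride shifts each row by one
 * cache line's worth of sets. */
double *alloc_padded(int rows, int cols, int *stride) {
    *stride = cols + PAD;
    return malloc((size_t)rows * (size_t)*stride * sizeof(double));
}
/* element (i, j) lives at m[i * stride + j] */
```

The pad wastes 8/4096 of the storage (about 0.2%) in exchange for eliminating the conflict misses.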
CACHE INTERFERENCE (GRAPHS)
[Figures: scaling before and after padding]
PARALLELISM
- Toledo shows higher parallelism, particularly in burdened parallelism and on large matrices
- Still doesn't explain the poor scaling of right-looking at low core counts
[Table: parallelism and burdened parallelism for Toledo vs. right-looking across matrix sizes; values not recovered]
SYSTEM FACTORS (LOAD LATENCY)
[Figure: performance of Right relative to Toledo]
SYSTEM FACTORS (LOAD LATENCY)
[Figure: performance of Tile relative to Toledo]
SCHEDULING
- Cilk++ provides a dynamic scheduler
- PLASMA and the pthread version use a static schedule
- Compare performance under a multiprogrammed workload
SCHEDULING GRAPH
- Cilk++ implementations degrade more gracefully
- PLASMA does OK; the pthread right-looking version ("tile") doesn't
CODE SIZE
- Comparing different languages
- Expected a large difference, but the implementations are similar in size
- Complexity is in the base case, and base cases are shared
[Table: lines of code ("just LU" and "everything*") for Toledo, Right-looking, PLASMA, and Pthread Right; values not recovered]
* Includes base-case wrappers
CONCLUSION
- Cilk++ can perform competitively with optimized math libraries
- Cache behavior is the most important performance factor
- Cilk++ degrades more gracefully when other programs are running
  - Especially compared to hand-coded pthread versions
- Code size is not a major factor