
Claude Tadonki
Mines ParisTech – CRI – Mathématiques et Systèmes
Laboratoire de l'Accélérateur Linéaire / IN2P3 / CNRS, France
2nd Workshop on Architecture and Multi-Core Applications
23rd International Symposium on Computer Architecture and High Performance Computing (SBAC PAD 2011)
October 26, 2011, Vitória, Espírito Santo, Brazil

Large Scale Kronecker Product on Supercomputers – C. TADONKI

The Kronecker product (definition and applications)
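As a minimal illustration of the definition (a pure-Python sketch written for this transcript, not the talk's code): the Kronecker product of an m×n matrix A and a p×q matrix B is the mp×nq block matrix whose (i, j) block is A[i][j]·B.

```python
def kron(A, B):
    """Kronecker product of two dense matrices given as lists of rows."""
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    C = [[0] * (n * q) for _ in range(m * p)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for l in range(q):
                    # Block (i, j) of C is the scalar A[i][j] times B.
                    C[i * p + k][j * q + l] = A[i][j] * B[k][l]
    return C

# Example: a (2x2) ⊗ (2x2) product yields a 4x4 block matrix.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(kron(A, B))
```

Note how the size of the result is the product of the factor sizes, which is why forming the full matrix quickly becomes infeasible at scale.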

The Kronecker product (properties and problem formulation)

The Kronecker product (complexity and recurrence equation)

Forming the full matrix first would require a huge amount of memory and yield a lot of redundant multiplications: L² in total, where L = n₁n₂⋯n_N is the order of the full product of the N factor matrices. Using the so-called normal factorization, we can derive an optimal scheme which reduces the number of floating-point multiplications to L(n₁ + n₂ + ⋯ + n_N).
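The factorized matrix-vector product can be sketched as follows. This is a generic textbook formulation in pure Python, assumed here for illustration rather than taken from the talk: each pass applies one factor and cyclically permutes the tensor indices, so after N passes the vector is back in its original layout, at a cost of L·(n₁ + … + n_N) multiplications instead of L².

```python
def kron_matvec(As, x):
    """Compute y = (A_1 ⊗ A_2 ⊗ ... ⊗ A_N) x without ever forming
    the full L x L matrix, where L = n_1 * n_2 * ... * n_N."""
    L = len(x)
    for A in reversed(As):      # one pass per factor, innermost first
        n = len(A)
        m = L // n
        y = [0.0] * L
        # View x as an (m, n) matrix, multiply by A on the current
        # (fastest) index, and store transposed so the next factor's
        # index becomes the fastest one for the following pass.
        for r in range(m):
            for j in range(n):
                s = 0.0
                for k in range(n):
                    s += A[j][k] * x[r * n + k]
                y[j * m + r] = s
        x = y
    return x

As = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
print(kron_matvec(As, [1, 2, 3, 4]))   # [10.0, 7.0, 22.0, 15.0]
```

Each pass costs L·n_i multiplications ((L/n_i) rows, each multiplied by an n_i×n_i factor), which gives the L·Σn_i total stated above.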

The Kronecker product and its applications

Performance issues and a heuristic for finding a good topology

The total (parallel) execution time depends on:
- the sizes of the matrices
- the gap between the virtual topology and the physical topology
- the way the task is split among the processors (decomposition)
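The influence of the virtual topology can be illustrated with a toy model. The sketch below is a hypothetical stand-in for the talk's heuristic, not a reproduction of it: it picks a 2D process grid p₁ × p₂ for p cores whose "perimeter" p₁ + p₂, a crude proxy for per-step communication volume, is smallest.

```python
def best_2d_topology(p):
    """Return (p1, p2) with p1 * p2 == p minimizing p1 + p2
    (a simple communication proxy: squarer grids exchange less)."""
    best = None
    for p1 in range(1, p + 1):
        if p % p1 == 0:
            p2 = p // p1
            cost = p1 + p2
            if best is None or cost < best[0]:
                best = (cost, p1, p2)
    return best[1], best[2]

print(best_2d_topology(64))   # (8, 8): the squarest grid wins
```

A real heuristic must also weigh the matrix sizes and the physical network, which is exactly the gap the slide points to.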

Performance results

We consider N = 6 matrices of orders 30, 36, 32, 18, 24 and 16, thus L = 30 × 36 × 32 × 18 × 24 × 16 = 238,878,720. We see that:
- our heuristic yields a significant improvement compared to trivial decompositions
- we start losing scalability as the number of cores increases (communication overhead)
We then turn to a hybrid implementation.
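The problem size quoted above can be reproduced in a few lines; the operation counts assume the standard factorized-scheme cost of L times the sum of the orders, versus L² for the naive full-matrix product.

```python
from math import prod

orders = [30, 36, 32, 18, 24, 16]   # matrix orders from the experiment
L = prod(orders)                    # global vector length
naive = L * L                       # multiplications if the full matrix were formed
factored = L * sum(orders)          # multiplications with the factorized scheme
print(L, factored / naive)          # the factorized scheme is ~sum/L of the naive cost
```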

Performance of the hybrid implementation

We see that:
- the hybrid implementation is better for larger numbers of cores
- for smaller numbers of cores, the shared-memory implementation suffers from cache misses
We need to investigate this trade-off and a better memory layout.

END & QUESTIONS