Center for Research on Multicore Computing (CRMC): Overview
Ken Kennedy, Rice University

Center for High Performance Software Research

CRMC Overview

Initial Participation from Three Institutions
—Rice: Ken Kennedy, Keith Cooper, John Mellor-Crummey, Scott Rixner
—Indiana: Geoffrey Fox, Dennis Gannon
—Tennessee: Jack Dongarra

Activities
—Research and prototype development
—Community building
 –Workshops and meetings
—Other outreach components (separately funded)

Planning and Management
—Coordinated management and vision-building
 –Model: CRPC

Management Strategy

Pioneered in CRPC and honed in GrADS/VGrADS/LACSI

Leadership forms a broad vision-building team
—Problem identification
—Willingness to redirect research to address new challenges
—Complementary research areas
 –Willingness to look at problems from multiple dimensions
—Joint projects between sites

Community-building activities
—Focused workshops on key topics
 –CRPC: TSPLib, BLACS/ScaLAPACK
 –LACSI: autotuning
—Informal standardization
 –CRPC: MPI and HPF

Annual planning cycle
—Plan, research, report, evaluate, …

Research Areas I

Compilers and programming tools
—Tools: performance analysis and prediction (HPCToolkit)
—Transformations: memory hierarchy and parallelism
—Automatic tuning strategies

Programming models and languages
—High-level languages: Matlab, Python, R, etc.
—HPCS languages
—Programming models based on component integration

Run-time systems
—Core run-time data movement library
—Integration with MPI

Libraries
—Adaptive, reconfigurable libraries optimized for multicore systems

Research Areas II

Applications for multicore systems
—Classical parallel/scientific applications
—Commercial applications, with advice from industrial partners

Interface between software and architecture
—Facilities for managing bandwidth (controllable caches, scratch memory)
—Sample-based profiling facilities
—Heterogeneous cores

Fault tolerance
—Redundant components
—Diskless checkpointing

Multicore emulator
—Research platform for future systems

Performance Analysis and Prediction

HPCToolkit (Mellor-Crummey)
—Uses sample-based profiling combined with binary analysis to report performance issues (recompilation not required)
 –How can this be extended to the multicore environment?

Performance prediction (Mellor-Crummey)
—Currently uses a performance prediction methodology that accurately accounts for the memory hierarchy
 –Reuse-distance histograms built from training data, parameterized by input data size
 –Accurately determines the miss frequency at each reference
—Extension to shared-cache multicore systems (underway)
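The reuse-distance idea above can be sketched in a few lines. This is a minimal illustration, not the methodology's actual implementation: it treats the trace's LRU stack as a Python list, so the depth at which an address is found is its reuse distance.

```python
def reuse_distance_histogram(trace):
    """Reuse-distance histogram of a memory-reference trace.

    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address; a first-touch
    reference has infinite distance.  A list ordered by recency serves as
    the LRU stack, so the distance is simply the depth of the hit.
    """
    stack = []                      # most recently used address is last
    hist = {}
    for addr in trace:
        if addr in stack:
            dist = len(stack) - 1 - stack.index(addr)
            stack.remove(addr)      # re-pushed at the top below
        else:
            dist = float("inf")     # cold reference
        stack.append(addr)
        hist[dist] = hist.get(dist, 0) + 1
    return hist
```

The connection to miss prediction: an access hits in a fully associative LRU cache of C blocks exactly when its reuse distance is below C, which is what lets histograms gathered on training inputs, parameterized by data size, predict miss frequencies on larger runs.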

Bandwidth Management

Multicore raises computational power rapidly
—Bandwidth onto the chip is unlikely to keep up

Multicore systems will feature shared caches
—Replaces false sharing with an enhanced probability of conflict misses

Challenges for effective use of bandwidth
—Enhancing reuse when multiple processors are using the cache
—Reorganizing data to increase the density of cache-block use
—Reorganizing computation to ensure reuse of data by multiple cores
 –Inter-core pipelining
—Managing conflict misses
 –With and without architectural help

Without architectural help
—Data reorganization within pages, and synchronization, to minimize conflict misses
 –May require special memory-allocation run-time primitives
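The "density of cache-block use" point can be made concrete with a little arithmetic. The sketch below (with assumed, illustrative sizes: 32-byte blocks, 8 useful bytes per record) compares an interleaved record layout against splitting the swept field into its own packed array:

```python
def block_density(bytes_used_per_record, record_stride, block_size):
    """Fraction of each fetched cache block that a sweep actually uses,
    for records of size record_stride where only bytes_used_per_record
    bytes of each record are touched."""
    if record_stride >= block_size:
        # at most one record per block; only its used bytes count
        return min(bytes_used_per_record, block_size) / block_size
    # several records share each block
    records_per_block = block_size // record_stride
    return min(records_per_block * bytes_used_per_record, block_size) / block_size

# Interleaved records: 8 useful bytes per 32-byte block fetched (25%)
print(block_density(8, 32, 32))
# Field split into a packed array: every fetched byte is used (100%)
print(block_density(8, 8, 32))
```

Quadrupling the density quarters the off-chip traffic for the same sweep, which is exactly the lever the slide points at when bandwidth onto the chip is the binding constraint.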

Conflict Misses

An unfortunate fact:
—If a scientific calculation sweeps across strips of more than k arrays on a machine with k-way associativity, and
—all of those strips overlap in one associativity group, then
 –every access to the overlapping group is a miss

Example: with three strips in a 2-way associative cache, on each outer-loop iteration strip 1 evicts strip 2, which evicts strip 3, which evicts strip 1, so every access is a miss.

This limits loop fusion, an otherwise profitable reuse strategy.

Controlling Conflicts: An Example

Cache and Page Parameters
—256K cache, 8-way set associative, 32-byte blocks
 –1024 associativity groups
—64K page
 –2048 cache blocks
—Each block in a page maps to a single associativity group, no matter where the page is loaded
 –2 different lines in a page map to the same associativity group

In General
—Let A = number of associativity groups in the cache
—Let P = number of cache blocks in a page
—If P ≥ A, then each block in a page maps to a single associativity group
 –No matter where the page is loaded
—If P < A, then a block can map to A/P different associativity groups
 –Depending on where the page is loaded
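The arithmetic behind these figures is worth writing down once; note that 1024 associativity groups corresponds to an 8-way 256K cache with 32-byte blocks. A small sketch of the computation:

```python
def cache_page_mapping(cache_bytes, ways, block_bytes, page_bytes):
    """Compute A (associativity groups in the cache), P (cache blocks in
    a page), and how the page's blocks map onto the groups."""
    A = cache_bytes // (ways * block_bytes)   # sets in the cache
    P = page_bytes // block_bytes             # cache blocks per page
    if P >= A:
        # every page block lands in one fixed group, wherever the page loads
        groups_per_block, lines_per_group = 1, P // A
    else:
        # a block can land in A // P groups, depending on page placement
        groups_per_block, lines_per_group = A // P, 1
    return A, P, groups_per_block, lines_per_group

# The slide's example: 256K 8-way cache, 32-byte blocks, 64K pages
print(cache_page_mapping(256 * 1024, 8, 32, 64 * 1024))
```

Because P ≥ A here, the mapping from page offset to associativity group is fixed at allocation time, which is what makes precise within-page data placement a plausible software lever against conflict misses.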

Questions

Can we do data allocation precisely within a page so that conflict misses are minimized for a given computation?
—Extensive work on minimizing self-conflict misses
—Little work on inter-array conflict minimization
—No work, to my knowledge, on interprocessor conflict minimization

Can we synchronize computations so that multiple cores do not interfere with one another?
—And even reuse blocks across processors

Might it be possible to convince vendors to provide additional features to help control the cache, particularly conflict misses?
—Allocation of part of the cache as a scratchpad
—Dynamic modification of the cache mapping

Parallelism

On a shared-cache multicore chip, running the same program on multiple processors has a major advantage
—Possibility of reusing cache blocks across processors
—Some chance of controlling conflict misses

How can parallelism be found and exploited?
—Automatic methods for scientific languages
 –Much progress was made in the 90s
—Explicit parallel programming and thread-management paradigms
 –Data parallel (HPF, Chapel)
 –Partitioned global address space (Co-Array Fortran, UPC)
 –Lightweight threading (OpenMP, CCR)
—Software synchronization primitives
—Integration of parallel component libraries
 –Telescoping languages
 –Parallel Matlab

Automatic Tuning

Following ATLAS: tune generalized component libraries in advance
—For different platforms
—For different contexts on the same platform
 –May wish to choose a variant that uses a subset of the cache that does not conflict with the calling program

Extensive work at Rice and Tennessee
—Heuristic search combined with compiler models cuts tuning time
—Many transformations: unroll-and-jam, tiling, fusion, etc.
 –They interact with one another

New challenges for multicore
—Tuning on-chip multiprocessors to use the shared (and non-shared) memory hierarchy effectively
—Management of on-chip parallelism and threading
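The ATLAS-style empirical search reduces to: generate variants, time each, keep the fastest. The kernel and tile candidates below are purely illustrative stand-ins (not from ATLAS or the Rice/Tennessee systems); a real tuner would prune the search with compiler models rather than timing everything:

```python
import time

def tiled_transpose_sum(n, tile):
    """Column-order sweep over an n x n matrix stored as a flat list,
    tiled so the working set stays cache-resident.  Stands in for one
    variant of a tunable library routine."""
    m = list(range(n * n))
    total = 0
    for jj in range(0, n, tile):
        for i in range(n):
            for j in range(jj, min(jj + tile, n)):
                total += m[j * n + i]
    return total

def autotune(n, candidates):
    """Empirical search: time each variant, return the best tile size."""
    best, best_t = None, float("inf")
    for tile in candidates:
        t0 = time.perf_counter()
        tiled_transpose_sum(n, tile)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = tile, dt
    return best
```

Search cost is why this is done ahead of time, per platform: each variant must actually run, and with interacting transformations the variant space grows multiplicatively, which is where the heuristic search plus compiler models mentioned above pay off.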

Other Compiler Challenges

Multicore chips used in scalable parallel machines
—Multiple kinds of parallelism: on-chip, within an SMP group, distributed memory

Heterogeneous multicore chips (grid on a chip)
—In the Intel roadmap
—Challenge: decomposing computations to match the strengths of different cores
 –Static and dynamic strategies may be required
 –Performance models for subcomputations on different cores
 –Interaction of heterogeneity and the memory hierarchy
 –Staging computations through shared cache, with workflow steps running on different cores

Component-composition programming environments
—Graphical, or constructed from scripts

Telescoping Languages (diagram)

A component library feeds an optimizer generator, which may run for hours to produce an application optimizer that understands library calls as primitives. An application, written in a scripting language or a standard language (Fortran or C++), passes through the application translator and the application optimizer, and then through the vendor compiler, yielding the optimized application.

Compiler Infrastructure

D System infrastructure
—Includes full dependence analysis
—Support for high-level transformations
 –Register, cache, fusion
—Support for parallelism and communication management
 –Originally used for HPF

Telescoping Languages infrastructure
—Constructed for Matlab compilation and component integration
—Constraint-based type analysis
 –Produces type-jump functions for libraries
—Variant specialization and selection
—Applied to the parallel Matlab project

Both are currently distributed under a BSD-style license (no GPL)

Open64 compiler infrastructure
—GPL license

Proposal

An NSF Center for Research on Multicore Computing
—Modeled after CRPC
 –Core research program
 –Multiple participating institutions
—Research
 –Compilers and tools
 –Architectural modifications, supported by simulation
 –Run-time systems and communication/synchronization
 –Driven by real applications from the NSF community
—Big community outreach program
 –Specific topical workshops
—Major investment from Intel, coupled with a Multicore Computing Research Program
 –Designed to foster a vibrant community of researchers

Leverage

DOE SciDAC projects
—Currently proposed: a major Enabling Technology Center
 –Kennedy: CScADS (includes infrastructure development)
—Participants in several other relevant SciDAC efforts
 –PERC2, PModels

LACSI projects
—Subject to the ASC budget

Chip vendors
—Intel, AMD, IBM (we have relationships with all)

Microsoft

HPCS collaborations
—New languages and tools must run on systems using multicore chips

New NSF center?
—Community development, as with CRPC
—Major contribution from Intel