Center for Research on Multicore Computing (CRMC) Overview
Ken Kennedy, Rice University
http://www.cs.rice.edu/~ken/Presentations/CRMC06.pdf

CRMC Overview

Initial participation from three institutions
— Rice: Ken Kennedy, Keith Cooper, John Mellor-Crummey, Scott Rixner
— Indiana: Geoffrey Fox, Dennis Gannon
— Tennessee: Jack Dongarra

Activities
— Research and prototype development
— Community building
  – Workshops and meetings
— Other outreach components (separately funded)

Planning and management
— Coordinated management and vision building
  – Model: CRPC

Management Strategy

Pioneered in CRPC and honed in GrADS/VGrADS/LACSI

Leadership forms a broad vision-building team
— Problem identification
— Willingness to redirect research to address new challenges
— Complementary research areas
  – Willingness to look at problems from multiple dimensions
— Joint projects between sites

Community-building activities
— Focused workshops on key topics
  – CRPC: TSPLib, BLACS/ScaLAPACK
  – LACSI: autotuning
— Informal standardization
  – CRPC: MPI and HPF

Annual planning cycle
— Plan, research, report, evaluate, …

Research Areas I

Compilers and programming tools
— Tools: performance analysis and prediction (HPCToolkit)
— Transformations: memory hierarchy and parallelism
— Automatic tuning strategies

Programming models and languages
— High-level languages: Matlab, Python, R, etc.
— HPCS languages
— Programming models based on component integration

Run-time systems
— Core run-time data-movement library
— Integration with MPI

Libraries
— Adaptive, reconfigurable libraries optimized for multicore systems

Research Areas II

Applications for multicore systems
— Classical parallel/scientific applications
— Commercial applications, with advice from industrial partners

Interface between software and architecture
— Facilities for managing bandwidth (controllable caches, scratch memory)
— Sample-based profiling facilities
— Heterogeneous cores

Fault tolerance
— Redundant components
— Diskless checkpointing

Multicore emulator
— Research platform for future systems

Performance Analysis and Prediction

HPCToolkit (Mellor-Crummey)
— Uses sample-based profiling combined with binary analysis to report performance issues (recompilation not required)
  – How can this be extended to the multicore environment?

Performance prediction (Mellor-Crummey)
— Currently uses a prediction methodology that accurately accounts for the memory hierarchy
  – Reuse-distance histograms built from training data, parameterized by input data size (see the sketch below)
  – Accurately determines the miss frequency at each reference
— Extension to shared-cache multicore systems (underway)
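
A reuse-distance histogram records, for each reference, how many distinct cache blocks were touched since the previous access to the same block. The minimal C sketch below computes LRU stack distances for a toy trace; it is illustrative only (production tools use far more efficient data structures than this linear scan), and the 32-byte block size and the trace are assumptions.

    /* Naive reuse-distance (LRU stack distance) measurement. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_DISTINCT 1024

    static uintptr_t lru[MAX_DISTINCT];        /* most recent at index 0 */
    static int       nlru = 0;
    static long      hist[MAX_DISTINCT + 1];   /* last bucket = cold miss */

    /* Record one reference to a cache block and update the histogram. */
    static void touch(uintptr_t block)
    {
        int depth = -1;
        for (int i = 0; i < nlru; i++)
            if (lru[i] == block) { depth = i; break; }

        if (depth < 0) {                       /* first touch: cold miss */
            hist[MAX_DISTINCT]++;
            depth = (nlru < MAX_DISTINCT) ? nlru++ : nlru - 1;
        } else {
            hist[depth]++;                     /* reuse distance = depth */
        }
        for (int i = depth; i > 0; i--)        /* move block to the top  */
            lru[i] = lru[i - 1];
        lru[0] = block;
    }

    int main(void)
    {
        int a[64];                             /* 8 blocks of 32 bytes     */
        for (int s = 0; s < 2; s++)            /* two sweeps: the second   */
            for (int i = 0; i < 64; i++)       /* reuses each block at     */
                touch((uintptr_t)&a[i] / 32);  /* distance 7               */

        printf("cold misses: %ld\n", hist[MAX_DISTINCT]);
        for (int d = 0; d < MAX_DISTINCT; d++)
            if (hist[d])
                printf("distance %d: %ld references\n", d, hist[d]);
        return 0;
    }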

Bandwidth Management

Multicore raises computational power rapidly
— Bandwidth onto the chip is unlikely to keep up

Multicore systems will feature shared caches
— This replaces false sharing with an increased probability of conflict misses

Challenges for effective use of bandwidth
— Enhancing reuse when multiple processors are using the cache
— Reorganizing data to increase the density of cache-block use (a sketch follows this list)
— Reorganizing computation to ensure reuse of data by multiple cores
  – Inter-core pipelining
— Managing conflict misses
  – With and without architectural help

Without architectural help
— Data reorganization within pages, plus synchronization, to minimize conflict misses
  – May require special memory-allocation run-time primitives
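
One standard way to raise the density of cache-block use is to convert an array of structures into a structure of arrays, so a sweep that reads one field touches only blocks full of that field. This is a minimal sketch; the particle layout and field names are invented for illustration.

    #include <stdio.h>
    #include <stddef.h>

    #define N 4096

    /* Array of structures: each 32-byte block holds one useful x. */
    struct particle { double x, y, z, mass; };   /* 32 bytes */
    static struct particle aos[N];

    /* Structure of arrays: each 32-byte block holds four useful x's. */
    static struct { double x[N], y[N], z[N], mass[N]; } soa;

    static double sum_x_aos(void) {              /* touches ~N blocks   */
        double s = 0.0;
        for (size_t i = 0; i < N; i++) s += aos[i].x;
        return s;
    }

    static double sum_x_soa(void) {              /* touches ~N/4 blocks */
        double s = 0.0;
        for (size_t i = 0; i < N; i++) s += soa.x[i];
        return s;
    }

    int main(void) {
        for (size_t i = 0; i < N; i++) { aos[i].x = 1.0; soa.x[i] = 1.0; }
        printf("aos: %.0f  soa: %.0f\n", sum_x_aos(), sum_x_soa());
        return 0;
    }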

Conflict Misses

Unfortunate fact:
— If a scientific calculation sweeps across strips of more than k arrays on a machine with k-way associativity, and
— All of the strips overlap in one associativity group, then
  – Every access to that group's location is a miss

[Figure: three array strips (1, 2, 3) mapping to the same associativity group. On each outer-loop iteration, array 1 evicts array 2, which evicts array 3, which evicts array 1; in a 2-way associative cache, all of these accesses are misses.]

This limits loop fusion, an otherwise profitable reuse strategy. The short program below makes the set mapping concrete.
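
A hedged sketch, assuming a 64 KB, 2-way set-associative cache with 32-byte blocks (illustrative parameters, not a specific machine): it places three arrays exactly one way-span apart and prints the associativity group of their corresponding blocks, which all collide.

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK     32                     /* bytes per cache block  */
    #define NSETS     1024                   /* associativity groups   */
    #define SET(addr) (((addr) / BLOCK) % NSETS)

    int main(void)
    {
        size_t way = (size_t)BLOCK * NSETS;  /* one way = 32 KB;
                                                2 ways = 64 KB total   */

        /* Arrays placed exactly one way-span apart: corresponding
           blocks of a, b, and c all map to the same group.           */
        uintptr_t a = 0x100000;
        uintptr_t b = a + way;
        uintptr_t c = b + way;

        for (int i = 0; i < 4; i++) {
            uintptr_t off = (uintptr_t)i * BLOCK;
            printf("block %d: set(a)=%u set(b)=%u set(c)=%u\n", i,
                   (unsigned)SET(a + off), (unsigned)SET(b + off),
                   (unsigned)SET(c + off));
            /* With only 2 ways per group, a loop touching a[i], b[i],
               and c[i] evicts one of the three on every iteration.   */
        }
        return 0;
    }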

Controlling Conflicts: An Example

Cache and page parameters
— 128K cache, 4-way set associative, 32-byte blocks
  – 1024 associativity groups
— 64K page
  – 2048 cache blocks
— Each block in a page maps to a single associativity group
  – 2 different lines in a page map to the same associativity group

In general (worked arithmetic follows below)
— Let A = number of associativity groups in the cache
— Let P = number of cache blocks in a page
— If P ≥ A, then each block in a page maps to a single associativity group
  – No matter where the page is loaded
— If P < A, then a block can map to A/P different associativity groups
  – Depending on where the page is loaded
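
The arithmetic above is easy to check mechanically. This small program recomputes A and P from the example's parameters and reports which case of the general rule applies:

    #include <stdio.h>

    int main(void)
    {
        long cache = 128 * 1024;  /* bytes */
        long assoc = 4;           /* ways  */
        long block = 32;          /* bytes */
        long page  = 64 * 1024;   /* bytes */

        long A = cache / block / assoc;  /* associativity groups: 1024 */
        long P = page / block;           /* blocks per page:      2048 */

        printf("A = %ld groups, P = %ld blocks per page\n", A, P);
        if (P >= A)
            printf("each page block maps to one group; "
                   "%ld page blocks share each group\n", P / A);
        else
            printf("a block can map to any of %ld groups, "
                   "depending on page placement\n", A / P);
        return 0;
    }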

Questions

Can we do data allocation precisely within a page so that conflict misses are minimized in a given computation?
— There is extensive work on minimizing self-conflict misses
— Little work on inter-array conflict minimization
— No work, to my knowledge, on interprocessor conflict minimization

Can we synchronize computations so that multiple cores do not interfere with one another?
— Even reuse blocks across processors

Might it be possible to convince vendors to provide additional features to help control the cache, particularly conflict misses?
— Allocation of part of the cache as a scratchpad
— Dynamic modification of the cache mapping

Parallelism

On a shared-cache multicore chip, running the same program on multiple processors has a major advantage
— The possibility of reusing cache blocks across processors
— Some chance of controlling conflict misses

How can parallelism be found and exploited?
— Automatic methods for scientific languages
  – Much progress was made in the 90s
— Explicit parallel programming and thread-management paradigms (a minimal threading sketch follows this list)
  – Data parallel (HPF, Chapel)
  – Partitioned global address space (Co-Array Fortran, UPC)
  – Lightweight threading (OpenMP, CCR)
— Software synchronization primitives
— Integration of parallel component libraries
  – Telescoping languages
  – Parallel Matlab
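
As a point of reference for the lightweight-threading style, here is a minimal OpenMP sketch (compile with, e.g., cc -fopenmp); the loop and data are invented for illustration:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000
    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* Static chunking gives each core one contiguous strip, so the
           cache blocks it loads are fully used; the reduction clause
           avoids false sharing on sum. */
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f using up to %d threads\n",
               sum, omp_get_max_threads());
        return 0;
    }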

Automatic Tuning

Following ATLAS: tune generalized component libraries in advance
— For different platforms
— For different contexts on the same platform
  – One may wish to choose a variant that uses a subset of the cache that does not conflict with the calling program

Extensive work at Rice and Tennessee
— Heuristic search combined with compiler models cuts tuning time
— Many transformations: unroll-and-jam, tiling, fusion, etc.
  – These interact with one another

New challenges for multicore (a search-loop sketch follows this list)
— Tuning on-chip multiprocessors to use the shared (and non-shared) memory hierarchy effectively
— Management of on-chip parallelism and threading
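
At its simplest, ATLAS-style tuning times a kernel under each candidate parameter setting offline and installs the winner. The sketch below searches tile sizes for a toy tiled transpose; real autotuners search far larger spaces (unrolling, fusion, and more) and prune them with compiler models.

    #include <stdio.h>
    #include <time.h>

    #define N 512
    static double A[N][N], B[N][N];

    /* Toy kernel: tiled transpose, parameterized by tile size. */
    static void transpose_tiled(int tile)
    {
        for (int ii = 0; ii < N; ii += tile)
            for (int jj = 0; jj < N; jj += tile)
                for (int i = ii; i < ii + tile; i++)
                    for (int j = jj; j < jj + tile; j++)
                        B[j][i] = A[i][j];
    }

    int main(void)
    {
        int candidates[] = { 8, 16, 32, 64, 128 };
        int best_tile = 0;
        double best = 1e30;

        for (int c = 0; c < 5; c++) {
            clock_t t0 = clock();
            for (int rep = 0; rep < 20; rep++)  /* repeat for a stable time */
                transpose_tiled(candidates[c]);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("tile %3d: %.3fs\n", candidates[c], secs);
            if (secs < best) { best = secs; best_tile = candidates[c]; }
        }
        printf("selected tile size: %d\n", best_tile);
        return 0;
    }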

Other Compiler Challenges

Multicore chips used in scalable parallel machines
— Multiple kinds of parallelism: on-chip, within an SMP group, distributed memory

Heterogeneous multicore chips (grid on a chip)
— In the Intel roadmap
— Challenge: decomposing computations to match the strengths of different cores
  – Static and dynamic strategies may be required
  – Performance models for subcomputations on different cores
  – Interaction of heterogeneity and memory hierarchy
  – Staging computations through shared cache, with workflow steps running on different cores

Component-composition programming environments
— Graphical, or constructed from scripts

Telescoping Languages

[Figure: the telescoping-languages tool flow. Offline, a component library feeds an optimizer generator, which could run for hours and emits an application optimizer that understands library calls as primitives. An application, written in a scripting language or a standard language (Fortran or C++), passes through an application translator and the application optimizer; a vendor compiler then produces the optimized application.]

Compiler Infrastructure

D System infrastructure
— Includes full dependence analysis
— Support for high-level transformations
  – Register, cache, fusion
— Support for parallelism and communication management
  – Originally used for HPF

Telescoping Languages infrastructure
— Constructed for Matlab compilation and component integration
— Constraint-based type analysis
  – Produces type-jump functions for libraries
— Variant specialization and selection
— Applied to the parallel Matlab project

Both are currently distributed under a BSD-style license (no GPL)

Open64 compiler infrastructure
— GPL license

Proposal

An NSF Center for Research on Multicore Computing
— Modeled after CRPC
  – Core research program
  – Multiple participating institutions
— Research
  – Compilers and tools
  – Architectural modifications, supported by simulation
  – Run-time systems and communication/synchronization
  – Driven by real applications from the NSF community
— Big community-outreach program
  – Specific topical workshops
— Major investment from Intel

Coupled with a Multicore Computing Research Program
— Designed to foster a vibrant community of researchers

Leverage

DOE SciDAC projects
— Currently proposed: a major Enabling Technology Center
  – Kennedy: CScADS (includes infrastructure development)
— Participants in several other relevant SciDAC efforts
  – PERC2, PModels

LACSI projects
— Subject to the ASC budget

Chip vendors
— Intel, AMD, IBM (we have relationships with all)

Microsoft

HPCS collaborations
— New languages and tools must run on systems using multicore chips

New NSF center?
— Community development, as with CRPC
— Major contribution from Intel