
Impact of Network Sharing in Multi-core Architectures
G. Narayanaswamy, P. Balaji and W. Feng
Dept. of Comp. Science, Virginia Tech
Mathematics and Comp. Science, Argonne National Laboratory

Multi-core Systems: Revolutionizing HEC
Significant driving force in the growing scale of High-End Computing (HEC) systems
  – Low-cost, low-power usage
  – Quad-core systems are commodity today (Intel, AMD)
  – Future processors have many more cores (Intel Xscale)
General purpose computing processing elements
  – X86, PPC, MIPS and other general purpose instruction sets
  – OS exposes each core as a different processor
      Can schedule a process on each core
  – Applications just run!

Communication in Multi-core Systems
Immediate adoption is simple, performance tuning is not
  – E.g., communication tuning (memory tuning is another)
Moore's law is driving the number of cores per die up!
  – Processes sharing network link doubling every months
Intra-node traffic is increasing as well
  – It also grows with the number of cores
Is more network capacity required, or less?
  – More network sharing, but more intra-node traffic as well
The application's communication pattern is critical to whether multi-cores help or hurt communication performance

Network Sharing in Multi-core Systems
More processes per node means more processes sharing the same network link
More processes per node also means more intra-node communication, and potentially less network traffic
What kind of application patterns generate more traffic?
What kind of application patterns generate less traffic?
Does process reordering between cores help?
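
The questions on this slide can be made concrete with a small model. The following is a minimal sketch, not the authors' methodology: it assumes a communication pattern given as a table of bytes sent between ranks, and it compares two hypothetical rank-to-node placements (`block_mapping` and `cyclic_mapping` are illustrative names, not part of any MPI library). The fraction of traffic that stays inside a node is the part that does not compete for the shared network link.

```python
def intra_node_fraction(traffic, node_of):
    """Fraction of total bytes exchanged between ranks on the same node.

    traffic: dict mapping (src_rank, dst_rank) -> bytes sent
    node_of: list where node_of[rank] is the node hosting that rank
    """
    total = sum(traffic.values())
    if total == 0:
        return 0.0
    local = sum(nbytes for (src, dst), nbytes in traffic.items()
                if node_of[src] == node_of[dst])
    return local / total


def block_mapping(nranks, per_node):
    """Contiguous ranks share a node: ranks 0..per_node-1 on node 0, and so on."""
    return [rank // per_node for rank in range(nranks)]


def cyclic_mapping(nranks, per_node):
    """Ranks are dealt out round-robin across the nodes."""
    nnodes = nranks // per_node
    return [rank % nnodes for rank in range(nranks)]


# Illustrative pattern (an assumption): each rank sends one unit to its
# right-hand neighbour in a ring of 8 ranks, with 4 ranks per node.
NRANKS, PER_NODE = 8, 4
ring = {(r, (r + 1) % NRANKS): 1 for r in range(NRANKS)}

block_fraction = intra_node_fraction(ring, block_mapping(NRANKS, PER_NODE))
cyclic_fraction = intra_node_fraction(ring, cyclic_mapping(NRANKS, PER_NODE))
print(block_fraction, cyclic_fraction)
```

For this nearest-neighbour pattern the block placement keeps most messages on-node, while the cyclic placement pushes every message onto the network. Other patterns can favour a different placement, which is why reordering helps only in some cases.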

Presentation Outline
Introduction and Motivation
Experimental Evaluation of the NAS Benchmarks
Behavioral Analysis of the NAS Benchmarks
Concluding Remarks and Future Work

Experimental Setup
16-node dual-processor dual-core cluster
  – AMD Opteron 2.55GHz with DDR2 667MHz RAM
Definitions:
  – Co-processor Mode: Use one core per processor
  – Virtual Processor Mode: Use both cores per processor
Figure labels: Myri-10G, Co-Processor Mode, Virtual Processor Mode
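
The two modes differ only in how many cores per processor run a process, so the degree of sharing follows from simple arithmetic. The sketch below uses the cluster shape stated on the slide; it assumes one process per active core, and the helper name `describe_mode` is hypothetical.

```python
# Cluster shape taken from the slide.
NODES = 16
PROCESSORS_PER_NODE = 2
CORES_PER_PROCESSOR = 2


def describe_mode(name, cores_used_per_processor):
    """Process counts for a mode, assuming one process per active core."""
    per_node = PROCESSORS_PER_NODE * cores_used_per_processor
    return {
        "mode": name,
        "processes_per_node": per_node,
        "max_processes": NODES * per_node,
    }


co_processor = describe_mode("co-processor", 1)
virtual_processor = describe_mode("virtual processor", CORES_PER_PROCESSOR)

for mode in (co_processor, virtual_processor):
    print(mode)
```

Under these assumptions, virtual processor mode places twice as many processes on each node as co-processor mode, so twice as many processes contend for whatever network resources the node has.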

Impact of Network Sharing

Impact of Processor Sharing

Resource Usage in Processor Sharing

Behavioral Analysis: CG
Forms sub-groups of processes which communicate mainly with each other
Clustering these groups together increases intra-node communication
Contiguous ranks cluster together; single dimension of clustering!
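
A rough illustration of why contiguous-rank sub-groups benefit from clustering: the sketch below assumes fixed-size groups of contiguous ranks (the group size of 4 is an assumption for illustration, not a figure from the slides) and a block placement of ranks onto nodes, then counts how many in-group partner pairs share a node.

```python
def group_pair_locality(nranks, group_size, per_node):
    """Count ordered in-group rank pairs, and how many of them share a node.

    Assumes groups are runs of contiguous ranks and that ranks are placed
    on nodes in contiguous blocks of `per_node`.
    """
    local = total = 0
    for a in range(nranks):
        group_start = (a // group_size) * group_size
        for b in range(group_start, group_start + group_size):
            if a == b:
                continue
            total += 1
            if a // per_node == b // per_node:
                local += 1
    return local, total


# 16 ranks in groups of 4, with 2 or 4 processes placed on each node.
two_per_node = group_pair_locality(16, 4, 2)
four_per_node = group_pair_locality(16, 4, 4)
print(two_per_node, four_per_node)
```

With 2 processes per node, each rank finds only one of its three group partners on-node. With 4 per node, the whole group fits on one node and all in-group communication becomes intra-node.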

Behavioral Analysis: FT
After each step of communication, the data grid is transposed along one dimension (example: P3DFFT)
Communication is an Alltoallv for a sub-communicator (contains processes in one dimension)
Grouping processes in one dimension will cause the other dimension to suffer
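
The trade-off between the two dimensions can be sketched on a small process grid. This is an illustrative model under stated assumptions: a 4x4 grid with row-major rank numbering, 4 processes per node, and two hypothetical placements (`by_row` and `by_tile`). It counts the ordered pairs inside each row and column sub-communicator that land on the same node.

```python
ROWS, COLS, PER_NODE = 4, 4, 4  # assumed grid shape and node size


def rank_of(row, col):
    """Row-major rank numbering (an assumption about the layout)."""
    return row * COLS + col


def pair_locality(groups, node_of):
    """Ordered pairs within each group: (pairs on the same node, all pairs)."""
    local = total = 0
    for members in groups:
        for a in members:
            for b in members:
                if a != b:
                    total += 1
                    if node_of[a] == node_of[b]:
                        local += 1
    return local, total


# Sub-communicators: one per grid row and one per grid column.
row_groups = [[rank_of(r, c) for c in range(COLS)] for r in range(ROWS)]
col_groups = [[rank_of(r, c) for r in range(ROWS)] for c in range(COLS)]

# Placement 1: each grid row is packed onto its own node.
by_row = [rank // PER_NODE for rank in range(ROWS * COLS)]

# Placement 2: each node holds a 2x2 tile of the grid.
by_tile = [((rank // COLS) // 2) * (COLS // 2) + (rank % COLS) // 2
           for rank in range(ROWS * COLS)]

results = {
    "by_row": (pair_locality(row_groups, by_row),
               pair_locality(col_groups, by_row)),
    "by_tile": (pair_locality(row_groups, by_tile),
                pair_locality(col_groups, by_tile)),
}
print(results)
```

Packing rows onto nodes makes the row exchange entirely intra-node but forces every column exchange over the network. The tiled placement splits locality evenly between the two dimensions, so neither is fully local.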

Impact of Process-Core Reordering

Multi-core systems are revolutionizing HEC
  – Low cost, low power
  – Applications just run!
  – Immediate adoption is simple, performance tuning is not
      E.g., communication patterns on multi-core systems are complex
Analyzed communication behavior
  – Case study with the NAS benchmarks
  – Increased network and resource sharing hurts performance
  – Using application patterns to reorder process-core mappings improves performance in some cases
Future Work: Incorporating application pattern information as hints to MPICH2 (through the process manager)

Thank You
Contacts:
Ganesh Narayanaswamy:
Pavan Balaji:
Wu-chun Feng:
For More Information: