TEXAS ADVANCED COMPUTING CENTER
User Experiences on the Heterogeneous TACC IBM Power4 System
Avi Purkayastha, Kent Milfeld, Chona Guiang
Texas Advanced Computing Center, University of Texas at Austin
ScicomP 8, Minneapolis, MN, August 4-8, 2003

Outline
Architectural Overview
– TACC heterogeneous Power4 system
– Fundamental differences/similarities of the TACC Power4 nodes
Resource Allocation Policies
Scheduling
– Simple and advanced job scripts
– Pre-processing LoadLeveler jobs with a job filter
Performance Analysis
– STREAM and NPB benchmarks
– Finite-difference and molecular dynamics applications
– MPI bandwidth performance
Conclusions

TACC Cache/Micro-architecture Features
L1: 32 KB data, 2-way associative (write-through); 64 KB instruction, direct mapped
L2: 1.44 MB (unified), 8-way associative
L3: 32 MB, 8-way associative
Memory: 32 GB/node (p690H), 128 GB/node (p690T), 8 GB/node (p655H)
Line sizes: 128/128/4x128 bytes for L1/L2/L3

Comparison of TACC Power4 Nodes
All nodes have the same processor speed but different memory configurations: the p690H and p655H have 2 GB/processor, while the p690T has 4 GB/processor.
Only the p690T has dual-core processors, so its cores share the L2 cache; the other nodes have a dedicated L2 cache per processor.
p655 nodes have PCI-X adapters while the other nodes have PCI adapters, so the former achieve higher message-passing throughput.
Global address snooping is absent on the p655s, which provides roughly a 10% performance improvement over the p690s.
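The slides quantify the caches but not their practical impact; the short sketch below is a hypothetical illustration (not from the talk) that sizes a square double-precision tile so a three-array working set fits in the 1.44 MB L2, halving the budget when the dual-core p690T shares its L2 between cores.

/* Hypothetical illustration: pick a square tile for a double-precision
 * stencil so its working set fits in L2.  The cache size is taken from the
 * slide above; halving the budget for the dual-core p690T, whose two cores
 * share one L2, is an assumption, not a measured tuning rule. */
#include <stdio.h>

int main(void)
{
    const long l2_bytes  = 1440L * 1024;   /* ~1.44 MB unified L2           */
    const int  shares_l2 = 1;              /* 1 on p690T, 0 on p690H/p655H  */
    const long budget    = shares_l2 ? l2_bytes / 2 : l2_bytes;
    const int  arrays    = 3;              /* e.g. old, new, coefficients   */

    /* tile*tile*arrays*sizeof(double) <= budget */
    long tile = 1;
    while ((tile + 1) * (tile + 1) * arrays * (long)sizeof(double) <= budget)
        tile++;

    printf("square tile edge that fits in L2: %ld elements\n", tile);
    return 0;
}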

TACC Power4 System (longhorn.tacc.utexas.edu): Node Summary
p655 HPC: 32 nodes, 4-way SMP, 8 GB/node (128 processors, 256 GB total)
p690 HPC: 3 nodes, 16-way SMP, 32 GB/node (48 processors, 96 GB total)
p690 Turbo: 1 node, 32-way SMP, 128 GB (32 processors, 128 GB total)
Login node: 13-way, 16 GB; GPFS nodes: 3 x 1-way, 6 GB (16 processors, 22 GB total)

TACC Power4 System (longhorn.tacc.utexas.edu): Interconnect
[Diagram: the login, GPFS, p690 HPC, p690 Turbo, and p655 HPC nodes are all connected through an IBM dual-plane SP Switch2, 32 ports per plane.]

TACC Power4 System (longhorn.tacc.utexas.edu): Storage
[Diagram: file systems attached across the SP Switch2.]
/home: 1/4 TB; /work: 4.5 TB (GPFS); /archive: local archival storage
Local /scratch: 36 GB on the p690 nodes, 18 GB on the p655 nodes

LoadLeveler Batch Facility
Used to execute batch parallel jobs.
POE options: use environment variables for LoadLeveler scheduling
– Adapter specification
– MPI parameters
– Number of nodes
– Class (priority)
– Consumable resources
Simple, PBS-like job scripts are also available to accommodate users migrating from clusters.

Job Filter
The TACC Power4 system is heterogeneous:
– some nodes have large memories
– some have faster communication throughput
– some have dual cores, and the nodes have different processor counts
– cluster users also need to be accommodated
Part of scheduling is simply categorizing job requests into classes. LoadLeveler cannot move a job from one class to another once it is submitted, so a job filter has evolved. The filter also optimizes resource allocation and scheduling policies with an emphasis on application performance.
Flow: job submission -> job filter -> one of the queues {LH13, LH16, LH32, LH4} -> scheduler determines priority and releases jobs for execution

POE: Simple Job Script I (MPI example)

#!/bin/csh
…
job type = parallel
tasks = 16
memory = 1000
walltime = 00:30:00
class = normal
queue

poe a.out

POE: Simple Job Script II (OpenMP example)

#!/bin/csh
…
job type = parallel
threads = 16
memory = 1000
walltime = 00:30:00
class = normal
queue

setenv OMP_NUM_THREADS 16
poe a.out

POE: Advanced Job Script (MPI example across nodes)

#!/bin/csh
…
resources = ConsumableCpus(1) ConsumableMemory(1500mb)
network.MPI = csss,not_shared,us
node = 4
class = normal
queue

setenv MP_SHARED_MEMORY true
poe a.out

TACC Power4 System Filter Logic (MPI)
[Decision-matrix diagram: from the user input (N, CpT, TpN, MpT) the filter derives C = N * CpT * TpN and M = TpN * MpT, applies memory checks and a wall-time check (time < f(nodes)), sets shared or non-shared csss network usage, and routes jobs to the LH4, LH16, or LH32 queue according to CPU count and memory per task; requests with C > 32 are removed.]
Legend: N = nodes, CpT = CPUs/task, TpN = tasks/node, MpT = memory/task, C = CPUs, M = memory/task

TACC Power4 System Filter Logic (OMP)
[Decision-matrix diagram: for OpenMP jobs (N = 1, TpN = 1, CpT > 1) the filter derives C = CpT and M = MpT / C, applies memory checks and a wall-time check, and routes non-shared jobs to LH4, LH16, or LH32 according to CPU count (1-4, 5-16, 17-32) and memory per CPU.]
Legend: N = nodes, CpT = CPUs/task, TpN = tasks/node, MpT = memory/task, C = CPUs, M = memory/CPU

Batch Resource Limits
serial (front end)
– 8 hours, up to 2 jobs
development (front end)
– 12 GB, 2 hours, up to 8 CPUs
normal/high (some default examples)
– <= 8 GB, < 4 CPUs: LH16
– <= 8 GB, 4 CPUs: LH4
– > 32 GB, 5 <= CPUs <= 16: LH32
– for various other combinations, see the User Guide
dedicated (by special request only)
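As a concrete illustration of the classification the filter performs, the following hypothetical sketch reproduces only the three normal/high default examples listed above; the real filter covers many more combinations, which are described in the User Guide.

/* Hypothetical sketch of the job filter's classification step, limited to
 * the three "normal/high" default examples on the slide above.  Any other
 * combination falls through to the User Guide rules. */
#include <stdio.h>

const char *pick_queue(double mem_gb, int cpus)
{
    if (mem_gb <= 8.0 && cpus < 4)                 return "LH16";
    if (mem_gb <= 8.0 && cpus == 4)                return "LH4";
    if (mem_gb > 32.0 && cpus >= 5 && cpus <= 16)  return "LH32";
    return "see User Guide";   /* other combinations */
}

int main(void)
{
    printf("4 GB,  2 CPUs -> %s\n", pick_queue(4.0, 2));
    printf("8 GB,  4 CPUs -> %s\n", pick_queue(8.0, 4));
    printf("48 GB, 8 CPUs -> %s\n", pick_queue(48.0, 8));
    return 0;
}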

Application Performance
Effects of compute-intensive and memory-intensive scientific applications on the different nodes; examples are STREAM, the NAS Parallel Benchmarks (NPB), and the application codes below.
Effects of different kinds of MPI functionality on the different nodes; examples include MPI ping-pong and all-to-all Send/Recv tests.
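The benchmark sources are not included in the slides; the following is a minimal MPI ping-pong bandwidth sketch of the kind of measurement reported later, with the 2 MB message size and the repetition count chosen for illustration.

/* Minimal MPI ping-pong bandwidth sketch (not the authors' benchmark code).
 * Run with at least 2 ranks; only ranks 0 and 1 exchange messages. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 2 * 1024 * 1024;   /* 2 MB messages */
    const int reps   = 100;
    int rank;
    char *buf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double per_msg = (t1 - t0) / (2.0 * reps);   /* one-way message time */
        printf("bandwidth: %.1f MB/s\n", nbytes / per_msg / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}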

Scientific Applications
SM: Stommel model of ocean "circulation"; solves a 2-D partial differential equation.
– Uses finite-difference approximations for the derivatives on a discretized domain (timed for a constant number of Jacobi iterations).
– Parallel version uses domain decomposition on a 1K x 1K grid.
– Memory-intensive application; nearest-neighbor communication.
MD: molecular dynamics of a solid argon lattice.
– Uses the Verlet algorithm for propagation (displacements and velocities).
– Calculation run for 1 picosecond at a fixed lattice size.
– Compute-intensive application; global communication.
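The Stommel code itself is not shown in the slides; below is a minimal sketch, under assumed array layouts and a 1-D row-block decomposition, of the nearest-neighbor halo exchange and Jacobi update that this type of finite-difference solver performs.

/* Minimal sketch (not the authors' Stommel code): one Jacobi sweep with a
 * 1-D domain decomposition and nearest-neighbor halo exchange.  The row-block
 * layout and zero boundary values are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

#define NX 1024               /* global grid is 1K x 1K, per the slide */
#define NY 1024

void jacobi_sweep(double *u, double *unew, int rows, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* u holds rows+2 rows of NY points; rows 0 and rows+1 are halo rows */
    MPI_Sendrecv(&u[1 * NY],          NY, MPI_DOUBLE, up,   0,
                 &u[(rows + 1) * NY], NY, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[rows * NY],       NY, MPI_DOUBLE, down, 1,
                 &u[0 * NY],          NY, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);

    /* 5-point Jacobi update on the interior points */
    for (int i = 1; i <= rows; i++)
        for (int j = 1; j < NY - 1; j++)
            unew[i * NY + j] = 0.25 * (u[(i - 1) * NY + j] + u[(i + 1) * NY + j] +
                                       u[i * NY + j - 1]   + u[i * NY + j + 1]);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int rows = NX / size;                       /* assumes NX divisible by size */
    double *u    = calloc((rows + 2) * NY, sizeof(double));
    double *unew = calloc((rows + 2) * NY, sizeof(double));
    for (int it = 0; it < 100; it++) {          /* constant number of iterations */
        jacobi_sweep(u, unew, rows, MPI_COMM_WORLD);
        double *tmp = u; u = unew; unew = tmp;
    }
    free(u); free(unew);
    MPI_Finalize();
    return 0;
}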

Scientific Applications II
[Table: run times in seconds for the Stommel and MD codes on the lh4, lh16, and lh32 node types.]
The lh16 architecture is best suited for a memory-intensive application combined with nearest-neighbor communication.
L2 cache sharing is most ill-suited for a memory-intensive application.

NAS Parallel Benchmarks
[Charts: run times in seconds for the NPB kernels mg, lu, ft, is, ep, bt, and cg, Class B and Class C, on the different longhorn node types (lh4, lh16, lh32).]

STREAM Benchmarks *
[Table: STREAM Copy, Scale, Add, and Triad bandwidths for the p655, p690_H, and p690_T nodes.]
Results courtesy of John McCalpin and the STREAM web site.

STREAM Benchmarks * (notes)
Results for the p655 used large pages and threads; results for the p690s used small pages and threads.
The tested p655 system had 64 GB of memory, while the TACC p655 nodes have 8 GB; the tested p690 system had 128 GB of memory, while the TACC p690H nodes have 32 GB.
Systems with more memory and CPUs can prefetch streaming data more effectively, so applications with STREAM-like kernels should perform better on the p690s than on the p655s.
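For reference, the triad kernel at the heart of STREAM looks like the following minimal OpenMP sketch (not McCalpin's benchmark code; the array length is an arbitrary choice sized to exceed the 32 MB L3).

/* Minimal STREAM-style triad sketch (not the official STREAM benchmark). */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (20 * 1000 * 1000)   /* ~160 MB per array, well beyond the 32 MB L3 */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];        /* Triad: a = b + s*c */
    double t1 = omp_get_wtime();

    /* three 8-byte doubles move per iteration (two reads, one write) */
    printf("Triad bandwidth: %.1f MB/s\n",
           3.0 * N * sizeof(double) / (t1 - t0) / 1e6);

    free(a); free(b); free(c);
    return 0;
}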

MPI On-node Performance
[Tables: on-node ping-pong and bisection-bandwidth measurements for the IBM p690 Turbo, p690 HPC, and p655 HPC at KB-scale and 2 MB message sizes.]

MPI Off-node Performance
[Table: sustained off-node range measurements, for messages up to 4 MB, on the IBM p655 and p690 nodes.]
Comparison of the Cruiser adapter on the p655s vs. the Corsair adapter on the p690s.

Thoughts and Comments
The p690T nodes are best suited for large-memory, threaded (OpenMP) applications.
Applications such as finite-difference codes, which typically combine nearest-neighbor communication with large memory requirements, are best suited for p690H-type nodes.
Large distributed MPI jobs are best suited for the p655 nodes, which are the most balanced nodes.
Latency-sensitive but small MPI jobs run better within a single p690H node than across the interconnect on the p655s.
In general, the p690s are limited more by their slower interconnect than they are helped by shared memory; exceptions include finite-difference codes and Linpack.