Titanium Review: GASNet Trace. Wei Tu, U.C. Berkeley, September 9, 2004.

Presentation transcript:

1 Titanium Review: GASNet Trace
Wei Tu, U.C. Berkeley, September 9, 2004

2 What is GASNet trace?
- A performance tool for Titanium programs
- Applicable to all GAS languages
- Presents the communication information for a single run of a program
- Summarizes data from the GASNet tracing utility

3 Why do we need it?
- Global Address Space languages are easier to use than message passing
- But performance analysis can be more subtle
  - Unexpected implicit communication

4 Why is it useful?
- Understand the communication pattern of the program
- Discover accidental communication in implicit assignments (memcpy)
- Detect load imbalances
- Evaluate the language runtime library

5 GET / PUT reports
- LOCAL: PUT/GET to memory that is local to the process
  - Global pointer used for local access
- GLOBAL
  - Global pointer used for remote access
* Accesses through local pointers do not appear in the report.

6 Example - SharkFish GET

GET REPORT:
SOURCE LINE TYPE MSG:(min max avg total) CALLS
==============================================================
../../../tcbuild/../../src-arity/tlib/java/io/BufferedOutputStream.java 71 LOCAL 4 B 4 B 4 B 7.87 K 2014
…
SharksFish.ti 24 LOCAL 8 B 8 B 8 B K
SharksFish.ti 25 LOCAL 8 B 8 B 8 B K
…
SharksFish.ti 273 GLOBAL 8 B 8 B 8 B K
SharksFish.ti 273 LOCAL 4 B 8 B 6 B 1.74 M
…

SharksFish.ti
4 public class Fish {
…
23 public void timestep(double dt) {
24 pos.x += dt*vel.x;
25 pos.y += dt*vel.y;
…
268 /* Modify forces on local fish from all the fish */
269 for(j=0; j<myparticles.length; j++) {
…
273 double dx = particles[k].pos.x - myparticles[j].pos.x;
…
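
The min / max / avg / total / CALLS columns in a GET or PUT report can be thought of as a group-by over individual trace events. The following is only a minimal sketch of that idea, not the tool's implementation: the record layout (source, line, type, size in bytes) and the function name `summarize` are assumptions made for illustration, and the real GASNet trace file format differs.

```python
from collections import defaultdict

# Hypothetical trace records: (source file, line, type, message size in bytes).
# These tuples are illustrative only; they are not the GASNet trace format.
records = [
    ("SharksFish.ti", 273, "GLOBAL", 8),
    ("SharksFish.ti", 273, "LOCAL", 4),
    ("SharksFish.ti", 273, "LOCAL", 8),
    ("SharksFish.ti", 24, "LOCAL", 8),
]


def summarize(records):
    """Group records by (source, line, type) and compute the report columns."""
    groups = defaultdict(list)
    for source, line, kind, size in records:
        groups[(source, line, kind)].append(size)
    return {
        key: {
            "min": min(sizes),
            "max": max(sizes),
            "avg": sum(sizes) / len(sizes),
            "total": sum(sizes),
            "calls": len(sizes),
        }
        for key, sizes in groups.items()
    }


summary = summarize(records)
for (source, line, kind), row in sorted(summary.items()):
    print(source, line, kind, row)
```

A line that mixes 4-byte and 8-byte accesses, like the LOCAL row for line 273 in this sketch, ends up with different min, max and avg values, which is one quick way to spot a statement that touches fields of different sizes.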

7 BARRIER reports
- WAIT
  - Time spent blocking at the barrier
  - Reflects load imbalance
- NOTIFYWAIT
  - Time interval between the gasnet_barrier_notify and gasnet_barrier_wait calls
  - Currently, Titanium only has single-phase barriers
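
The two barrier intervals can be derived from three timestamps per thread. The sketch below assumes a hypothetical event layout (`notify`, `wait_enter`, `wait_exit`, all in microseconds) and made-up values; it is meant only to show how a spread in WAIT times across threads points at load imbalance, not how the tool itself computes its report.

```python
# Hypothetical per-thread timestamps (microseconds) for one barrier:
# when notify was called, when wait was entered, and when wait returned.
events = {
    0: {"notify": 100.0, "wait_enter": 112.0, "wait_exit": 540.0},
    1: {"notify": 300.0, "wait_enter": 311.0, "wait_exit": 541.0},
    2: {"notify": 500.0, "wait_enter": 512.0, "wait_exit": 542.0},
}


def barrier_times(ev):
    """Return the two intervals reported for a barrier on one thread."""
    return {
        # Interval between the notify call and the wait call.
        "NOTIFYWAIT": ev["wait_enter"] - ev["notify"],
        # Time spent blocked inside the wait call.
        "WAIT": ev["wait_exit"] - ev["wait_enter"],
    }


times = {thread: barrier_times(ev) for thread, ev in events.items()}
waits = [t["WAIT"] for t in times.values()]

# A large gap between the longest and shortest WAIT suggests that some
# threads arrive at the barrier much earlier than others.
imbalance = max(waits) - min(waits)

for thread, t in times.items():
    print("Thread", thread, t)
print("WAIT spread (us):", imbalance)
```

In this made-up run, thread 0 arrives first and blocks the longest, while thread 2 arrives last and barely waits, so thread 2 is the one carrying the extra work.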

8 Example - SharkFish Barrier

BARRIER REPORT:
SOURCE LINE TYPE TIME(min max avg total) CALLS
=======================================================
../../../../src-arity/tlib/ti/lang/Reduce-guts.cti 132 NOTIFYWAIT 14.0 us 18.0 us 15.4 us us 2
…
SharksFish.ti 209 NOTIFYWAIT 11.0 us 12.0 us 11.8 us 47.0 us 1
SharksFish.ti 209 WAIT 31.0 us 15.5 ms 11.6 ms 46.5 ms 1
…
SharksFish.ti 242 NOTIFYWAIT 10.0 us 1.6 ms 26.5 us ms 1000
SharksFish.ti 242 WAIT 28.0 us ms us 2.1 s 1000
…

SharksFish.ti
…
206 total_time.start();
comm_time.start();
209 Ti.barrier();
210 comm_time.stop();
…
240 /* Wait for everyone to catch up. Is this barrier necessary? */
241 comm_time.start();
242 Ti.barrier();
243 comm_time.stop();
…

9 What can I do with it?
- Functionalities
  - Sort by any FIELD
  - Filter by any TYPE
- Views
  - Compact - one line per communication/barrier
  - Full - filename on its own line
  - Threaded - show information for each thread
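
Sorting and filtering amount to a list filter followed by a sort on one column. The sketch below is an illustration under stated assumptions: the row dictionaries, the values in them and the function `build_report` are hypothetical, and it assumes that filtering on a TYPE removes rows of that type (in the examples in this talk, `-filter LOCAL` is followed by a report that lists only GLOBAL rows) and that sorting is in descending order.

```python
# Hypothetical report rows; sizes are in bytes and the values are made up.
rows = [
    {"source": "arrayCopyTest.ti", "line": 64, "type": "GLOBAL", "total": 1000, "calls": 18},
    {"source": "arrayCopyTest.ti", "line": 70, "type": "GLOBAL", "total": 6000000, "calls": 156},
    {"source": "arrayCopyTest.ti", "line": 70, "type": "LOCAL", "total": 500, "calls": 4},
]


def build_report(rows, exclude_type=None, sort_field="total"):
    """Drop rows of the excluded type, then sort by one field, largest first."""
    kept = [r for r in rows if r["type"] != exclude_type]
    return sorted(kept, key=lambda r: r[sort_field], reverse=True)


# Assumed equivalent of "-filter LOCAL -sort TOTAL".
report = build_report(rows, exclude_type="LOCAL", sort_field="total")
for r in report:
    print(r["source"], r["line"], r["type"], r["total"], r["calls"])
```

Sorting by total with the largest value first puts the most expensive source line at the top of the report, which is usually the first place to look.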

10 Example - ArrayCopy PUT

-t -f -filter LOCAL -sort TOTAL

PUT REPORT:
SOURCE LINE TYPE MSG:(min max avg total) CALLS
===============================================================
arrayCopyTest.ti 70 GLOBAL 56 B K K 6.22 M 156
  Thread K K K 1.55 M 18
  Thread 1 56 B K K 1.55 M 46
  Thread 2 56 B K K 1.55 M 46
  Thread 3 56 B K K 1.55 M 46
…
arrayCopyTest.ti 64 GLOBAL 56 B 56 B 56 B 1008 B 18
  Thread 1 56 B 56 B 56 B 336 B 6
  Thread 2 56 B 56 B 56 B 336 B 6
  Thread 3 56 B 56 B 56 B 336 B 6

arrayCopyTest.ti
…
60 long [1d] single [1d] allSrc = new long [0 : Ti.numProcs()-1] [1d];
61 long [1d] single [1d] allDest = new long [0 : Ti.numProcs()-1] [1d];
allSrc.exchange(sharedSrc);
64 allDest.exchange(sharedDest);
…
69 // remote -> local
70 prvDest.copy(allSrc[left]);
71 verifyArray(prvDest, left, "remote -> local");
72 Ti.barrier();
…

11 Example - ArrayCopy Barrier

-t -f -filter NOTIFYWAIT -sort TOTAL

BARRIER REPORT:
SOURCE LINE TYPE TIME:(min max avg total) CALLS
=============================================================
arrayCopyTest.ti 70 WAIT 29.0 us ms 6.3 ms 5.9 s 233
  Thread us ms 6.2 ms 1.5 s 233
  Thread us ms 5.8 ms 1.4 s 233
  Thread us ms 6.5 ms 1.5 s 233
  Thread us ms 6.7 ms 1.6 s 233
…
arrayCopyTest.ti 82 WAIT 51.0 us 1.3 ms us 3.6 ms 5
  Thread us 1.3 ms us 1.5 ms 5
  Thread us 1.3 ms us 1.5 ms 5
  Thread us 58.0 us 55.8 us us 5
  Thread us 59.0 us 56.6 us us 5
…
* Currently separated only by node rather than by thread

arrayCopyTest.ti
…
69 // remote -> local
70 prvDest.copy(allSrc[left]);
71 verifyArray(prvDest, left, "remote -> local");
72 Ti.barrier();
…
80 // remote -> remote (same owner)
81 allDest[right].copy(allSrc[right]);
82 Ti.barrier();
83 verifyArray(sharedDest, Ti.thisProc(), "remote -> remote (same owner)");
84 Ti.barrier();

12 What to do next?
- Increase speed
- Track memory allocation & usage
- Track lock/unlock operations
- Track collective operations
- Track active messages
- Separate barriers by thread (instead of node)
- Data analysis
  - Distribution of resources used in put/get reports
  - Auto-detect load imbalance in barrier reports
- Hide internals