www.vacet.org E. WES BETHEL (LBNL), CHRIS JOHNSON (UTAH), KEN JOY (UC DAVIS), SEAN AHERN (ORNL), VALERIO PASCUCCI (LLNL), JONATHAN COHEN (LLNL), MARK DUCHAINEAU.

Presentation transcript:

E. WES BETHEL (LBNL), CHRIS JOHNSON (UTAH), KEN JOY (UC DAVIS), SEAN AHERN (ORNL), VALERIO PASCUCCI (LLNL), JONATHAN COHEN (LLNL), MARK DUCHAINEAU (LLNL), BERND HAMANN (UC DAVIS), CHARLES HANSEN (UTAH), DAN LANEY (LLNL), PETER LINDSTROM (LLNL), JEREMY MEREDITH (ORNL), GEORGE OSTROUCHOV (ORNL), STEVEN PARKER (UTAH), CLAUDIO SILVA (UTAH), XAVIER TRICOCHE (UTAH), ALLEN SANDERSON (UTAH), HANK CHILDS (LLNL)

Lessons Learned From the MPI-Hybrid Parallelism for Streamlines on Large Multi-Core Clusters Project
David Camp (IDAV)

MPI-Hybrid
Other VACET projects have shown good performance gains with MPI-Hybrid parallelism.
This project explored the MPI-Hybrid style with two standard streamline algorithms: LOD and Static Domains.
This talk covers some of the problems encountered and the performance gains achieved.

Baseline test for MPI-Hybrid
Original MPI test
  – Ran in 100 seconds on 128 cores
First MPI-Hybrid test
  – Ran in ~20,000 seconds on 128 cores
Final MPI-Hybrid test, after many fixes
  – Ran in 15 seconds on 128 cores

VTK
At the heart of the VTK pipeline is the data time stamp
  – It is used to drive VTK's data flow model
  – Every action in VTK changes the data time stamp
vtkTimeStamp::Modified()
  – A small test called it ~1,000,000 times
  – It was found to take a pthread_mutex_lock to protect the time stamp
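The slides do not say how the time stamp lock was dealt with. As an illustration only, the sketch below shows one common way to hand out increasing stamps without a mutex: a process-wide `std::atomic` counter. The `TimeStamp` class and `stamp_from_threads` helper are hypothetical names invented for this example and are not VTK code.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pipeline time stamp; this is not VTK code.
class TimeStamp {
public:
    // Take the next value of a process-wide counter without locking a mutex.
    void Modified() {
        mtime_ = global_counter().fetch_add(1, std::memory_order_relaxed) + 1;
    }
    std::uint64_t GetMTime() const { return mtime_; }

private:
    static std::atomic<std::uint64_t>& global_counter() {
        static std::atomic<std::uint64_t> counter{0};
        return counter;
    }
    std::uint64_t mtime_ = 0;
};

// Each thread stamps its own object `calls_per_thread` times.
// Returns the largest stamp handed out so far.
std::uint64_t stamp_from_threads(int num_threads, int calls_per_thread) {
    std::vector<TimeStamp> stamps(num_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&stamps, t, calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) stamps[t].Modified();
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t max_seen = 0;
    for (const auto& s : stamps) max_seen = std::max(max_seen, s.GetMTime());
    return max_seen;
}
```

Every call to `Modified()` still touches shared state, so heavy use can cost cache traffic, but no thread ever blocks waiting for another to release a lock.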

Crashing in VTK
VTK – Thread Safe?
  – The documentation said it was thread safe
  – The crashes looked like memory corruption
VTK – Documents Say Thread Safe
  – But many functions were documented as "Not Thread Safe"
  – Some carried the note "This Method is Thread Safe if first called from a Single Thread and the dataset is not Modified"
The real answer is that VTK is not thread safe
  – vtkObjectBase did not protect its reference count variable, so the count was not reliable under concurrent access
  – Memory was being deleted before its lifetime had truly ended
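To make the reference count problem concrete, here is a minimal sketch of an intrusive reference count that stays correct when several threads register and release the same object. `RefCounted` and `hammer_reference_count` are hypothetical names for this example; the sketch does not reproduce vtkObjectBase or the fix used in the project.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical intrusive reference-counted base class; not vtkObjectBase.
class RefCounted {
public:
    virtual ~RefCounted() = default;

    // Atomic increment: concurrent callers cannot lose an update.
    void Register() { count_.fetch_add(1, std::memory_order_relaxed); }

    // Atomic decrement. Returns true when this call dropped the last
    // reference and the object deleted itself.
    bool UnRegister() {
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
            return true;
        }
        return false;
    }

    int GetReferenceCount() const { return count_.load(); }

private:
    std::atomic<int> count_{1};  // the creator holds the first reference
};

// Several threads each take and release `rounds` references on one object.
// Returns the count left afterwards, which should be the creator's reference.
int hammer_reference_count(int num_threads, int rounds) {
    RefCounted* obj = new RefCounted();
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([obj, rounds] {
            for (int i = 0; i < rounds; ++i) {
                obj->Register();
                obj->UnRegister();
            }
        });
    }
    for (auto& w : workers) w.join();

    int remaining = obj->GetReferenceCount();
    obj->UnRegister();  // drop the creator's reference; the object is deleted here
    return remaining;
}
```

With a plain `int` counter, two threads can read the same value and both write back the same result, so the count drifts and the object can be deleted while still in use, which matches the early-delete symptom described on the slide.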

C++ Exceptions Across Shared Libraries
The streamline code used an exception to handle the data boundary condition
  – Linux used a pthread_mutex_lock to handle this exception
The code was changed to remove the exception
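The slide only says the exception was removed. A minimal sketch of one way to do that is to report "left the domain" through a return value, so the common boundary case never throws. All types and functions below (`Point`, `Box`, `StepStatus`, `advance`, `count_steps_inside`) are hypothetical and use a constant velocity with simple Euler steps purely for illustration.

```cpp
#include <cassert>

struct Point { double x, y, z; };

struct Box {
    Point min, max;
    bool contains(const Point& p) const {
        return p.x >= min.x && p.x <= max.x &&
               p.y >= min.y && p.y <= max.y &&
               p.z >= min.z && p.z <= max.z;
    }
};

enum class StepStatus { Ok, LeftDomain };

// Take one Euler step along a constant velocity. Leaving the domain is
// reported through the return value instead of a thrown exception, and
// `p` is left unchanged in that case.
StepStatus advance(const Box& domain, const Point& velocity, double dt, Point& p) {
    Point next{p.x + dt * velocity.x, p.y + dt * velocity.y, p.z + dt * velocity.z};
    if (!domain.contains(next)) return StepStatus::LeftDomain;
    p = next;
    return StepStatus::Ok;
}

// Integrate until the point leaves the domain or `max_steps` is reached.
// Returns the number of steps that stayed inside the domain.
int count_steps_inside(const Box& domain, Point seed, const Point& velocity,
                       double dt, int max_steps) {
    int steps = 0;
    while (steps < max_steps &&
           advance(domain, velocity, dt, seed) == StepStatus::Ok) {
        ++steps;
    }
    return steps;
}
```

Because hitting a block boundary is routine for streamlines, treating it as ordinary control flow keeps the exception machinery out of the inner loop of every thread.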

VTK – Object Creation
VTK forces you to use its New function
  – VTK uses a factory method pattern
vtkObjectFactory
  – Used to override VTK classes with custom versions
  – It used strcmp to match objects
  – strcmp was the most-called function in the streamlines test
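The cost pattern on this slide can be sketched with a toy registry that matches class names with `strcmp`. The example compares looking up the creator for every object against looking it up once and reusing the result. `Registry`, `Creator`, and the helper functions are hypothetical; they are not vtkObjectFactory's API, and the slides do not state that caching was the fix applied.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <vector>

struct Object { virtual ~Object() = default; };
struct Mesh : Object {};
struct Field : Object {};
struct Streamline : Object {};

using Creator = std::unique_ptr<Object> (*)();

template <class T>
std::unique_ptr<Object> make() { return std::unique_ptr<Object>(new T()); }

// Toy registry that finds a creator by comparing class names.
struct Registry {
    struct Entry { const char* class_name; Creator create; };
    std::vector<Entry> entries;
    long comparisons = 0;  // number of strcmp calls made so far

    Creator find(const char* class_name) {
        for (const Entry& e : entries) {
            ++comparisons;
            if (std::strcmp(e.class_name, class_name) == 0) return e.create;
        }
        return nullptr;
    }
};

Registry make_demo_registry() {
    Registry r;
    r.entries.push_back(Registry::Entry{"Mesh", &make<Mesh>});
    r.entries.push_back(Registry::Entry{"Field", &make<Field>});
    r.entries.push_back(Registry::Entry{"Streamline", &make<Streamline>});
    return r;
}

// String lookup for every object created: comparisons grow with n.
long comparisons_with_lookup_per_object(int n) {
    Registry r = make_demo_registry();
    for (int i = 0; i < n; ++i) {
        Creator create = r.find("Streamline");
        std::unique_ptr<Object> obj = create();
    }
    return r.comparisons;
}

// String lookup once, then reuse the creator: comparisons stay constant.
long comparisons_with_cached_creator(int n) {
    Registry r = make_demo_registry();
    Creator create = r.find("Streamline");
    for (int i = 0; i < n; ++i) {
        std::unique_ptr<Object> obj = create();
    }
    return r.comparisons;
}
```

In this toy setup, creating 1,000 objects costs 3,000 string comparisons with per-object lookup and 3 with the cached creator, which is the kind of gap that can push `strcmp` to the top of a profile.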

I/O
Found that I/O was better in the MPI test
  – It was doing multiple I/O operations at once by default, because it ran four processes per node
Changed the streamline code to thread its I/O

Conclusion – Hard Work Pays Off
Original MPI test
  – Run on Jaguar
  – 100 seconds (10,000 streamlines, 128 cores)
Original MPI test with code improvements
  – Run on Franklin
  – 45 seconds (20,000 streamlines, 128 cores)
MPI-Hybrid test
  – Run on Franklin
  – 15 seconds (20,000 streamlines, 128 cores)

Common Problems with Threads
Concurrency
  – Two or more threads access the same memory location without locking.
Lock contention
  – One thread blocks the progress of one or more other threads by holding a lock too long.
  – Examples: VisIt's data cache, VisIt's timer
Deadlock
  – Two or more threads wait for each other's locks indefinitely.
Valgrind: DRD and Helgrind
  – Tools to help find these types of problems.
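As a small illustration of the deadlock case, the sketch below has two threads move values between two lock-protected counters in opposite directions. Taking the two mutexes one after the other in argument order could leave each thread holding one lock and waiting for the other; `std::lock` acquires both with a deadlock-avoidance algorithm instead. `Counter`, `move_amount`, and `run_opposing_transfers` are hypothetical names, not VisIt code.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Counter {
    std::mutex m;
    long value = 0;
};

// Move `amount` from one counter to the other while holding both locks.
// std::lock acquires the two mutexes without risking deadlock, even when
// another thread calls this function with the arguments swapped.
// `from` and `to` must be different objects.
void move_amount(Counter& from, Counter& to, long amount) {
    std::unique_lock<std::mutex> a(from.m, std::defer_lock);
    std::unique_lock<std::mutex> b(to.m, std::defer_lock);
    std::lock(a, b);
    from.value -= amount;
    to.value += amount;
}

// Two threads transfer in opposite directions `rounds` times each.
// Returns the final value of the left counter, which started at 1000.
long run_opposing_transfers(int rounds) {
    Counter left, right;
    left.value = 1000;
    right.value = 1000;

    std::thread t1([&] {
        for (int i = 0; i < rounds; ++i) move_amount(left, right, 1);
    });
    std::thread t2([&] {
        for (int i = 0; i < rounds; ++i) move_amount(right, left, 1);
    });
    t1.join();
    t2.join();
    return left.value;
}
```

Running a threaded binary under `valgrind --tool=helgrind` or `valgrind --tool=drd` reports data races, and Helgrind also reports inconsistent lock ordering, which is how problems like the ones listed above are typically tracked down.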