04_Parallel Processing: Parallel Processing - Majid AlMeshari, John W. Conklin. Science Advisory Committee Meeting 20, September 3, 2010, Stanford University.

Presentation transcript:

Parallel Processing
Majid AlMeshari, John W. Conklin
Science Advisory Committee Meeting 20, September 3, 2010, Stanford University

Outline
– Challenge
– Requirements
– Resources
– Approach
– Status
– Tools for Processing

Challenge
A computationally intensive algorithm is applied to a very large data set. Verifying the filter's correctness therefore takes a long time and hinders progress.

Requirements
– Achieve a speed-up of 5-10x over the serial version
– Minimize parallelization overhead
– Minimize the time needed to parallelize new releases of the code
– Achieve reasonable scalability
[Chart: run time (sec) vs. number of processors, from SAC 19]

Resources
Hardware
– Computer cluster (Regelation) with 64-bit CPUs
» 64-bit addressing lets us use a memory space beyond 4 GB
» We use this cluster because (a) we will likely not need more than 44 processors, and (b) it is the same platform as our Linux boxes
Software
– MatlabMPI
» A parallel computing library for MATLAB created by MIT Lincoln Laboratory
» Eliminates the need to port our code to another language
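For orientation, here is a minimal MatlabMPI point-to-point sketch. It is illustrative, not code from this project; MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv, MPI_Finalize, and MPI_Run are the library's core calls, and the matrix A is a stand-in:

    % hello_mpi.m - minimal MatlabMPI point-to-point example
    MPI_Init;
    comm  = MPI_COMM_WORLD;            % communicator handle
    rank  = MPI_Comm_rank(comm);       % this process's rank (0-based)
    nproc = MPI_Comm_size(comm);       % total number of processes
    tag   = 1;                         % message tag
    if rank == 0
        A = rand(100, 5);              % stand-in for a block of J(x)
        MPI_Send(1, tag, comm, A);     % file-based send to rank 1
    elseif rank == 1
        A = MPI_Recv(0, tag, comm);    % blocking receive from rank 0
    end
    MPI_Finalize;

It would be launched with something like eval(MPI_Run('hello_mpi', 2, {})), MatlabMPI's mechanism for starting copies of a script. Under the hood, sends and receives go through files on disk, which is why the communication patterns discussed later matter so much.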

Approach - Serial Code Structure
Initialization
Loop over iterations
  Loop over gyros
    Loop over orbits
      Loop over segments
        Build h(x) (the right-hand side)
        Build Jacobian, J(x)
        Load data (Z), perform fit
Bayesian estimation, display results
The build and fit steps are the computationally intensive components - parallelize these!
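In MATLAB terms, the serial skeleton is roughly as follows (a sketch; build_h, build_J, load_data, and fit_segment are illustrative names, not the project's actual functions):

    for iter = 1:nIterations
        for g = 1:nGyros
            for orb = 1:nOrbits
                for seg = 1:nSegments
                    h = build_h(x, g, orb, seg);    % right-hand side (expensive)
                    J = build_J(x, g, orb, seg);    % Jacobian (expensive)
                    Z = load_data(g, orb, seg);     % measurement data
                    x = fit_segment(x, h, J, Z);    % perform fit
                end
            end
        end
    end
    % Bayesian estimation and display of results follow the loops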

Approach - Parallel Code Structure
Initialization
Loop over iterations
  Loop over gyros
    Loop over orbits
      Loop over segments
        Build h(x) and Jacobian J(x) - in parallel, split up by columns of J(x)
      Distribute J(x) to all nodes - custom data distribution code required (overhead)
      Loop over partial orbits
        Load data (Z), perform fit - in parallel, split up by orbits
      Send information matrices to master & combine
Bayesian estimation, display results
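A sketch of the column split (illustrative; it assumes the nCols columns of J are divided into contiguous blocks, one per process, and build_J_cols is a hypothetical helper that builds only the requested columns):

    % Each process builds only its contiguous block of columns of J
    perProc = ceil(nCols / nproc);                      % columns per process
    myCols  = (rank*perProc + 1) : min((rank+1)*perProc, nCols);
    J_local = build_J_cols(x, g, orb, seg, myCols);     % this block only
    % The custom distribution step then assembles the full J on every node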

Code in Action
– Initialization
– Procs 1..n each compute h and a partial J (split by columns of J)
– Sync point: assemble the complete J on every processor
– Procs 1..n each perform their share of the fit (split by rows of J)
– Sync point: combine the information matrix
– Bayesian estimation
– Go for the next gyro/segment; another iteration? If so, loop back
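The "combine the information matrix" sync point maps naturally onto a gather-and-sum at the master; a minimal sketch in MatlabMPI terms (infoLocal is an illustrative name for a process's local information matrix, not the project's variable):

    tag = 2;
    if rank == 0
        infoTotal = infoLocal;                     % master's own contribution
        for src = 1:nproc-1
            infoTotal = infoTotal + MPI_Recv(src, tag, comm);  % add each worker's matrix
        end
    else
        MPI_Send(0, tag, comm, infoLocal);         % each worker sends its local matrix
    end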

Parallelization Techniques - 1
Minimize inter-processor communication
– Use MatlabMPI for one-to-one communication only
– Use a custom approach for one-to-many communication
[Chart: communication time (sec) vs. number of processors, MPI vs. the custom approach]
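The slides do not spell out the custom one-to-many approach. Since MatlabMPI communicates through files, one plausible shape for it (an assumption, not the documented design) is a single write to the cluster's shared filesystem that every worker reads, instead of one MPI_Send per receiver:

    % Assumed custom broadcast via the shared filesystem
    bcastFile = '/shared/tmp/J_bcast.mat';   % hypothetical path
    if rank == 0
        save(bcastFile, 'J');                % master writes the matrix once
    end
    % ... sync point here so workers do not read before the write completes ...
    if rank ~= 0
        S = load(bcastFile);                 % every worker reads the same file
        J = S.J;
    end

Writing once and reading n-1 times scales better than n-1 separate sends, each of which would create its own message file.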

Parallelization Techniques - 2
Balance the processing load among processors:
load per processor = Σ_k length(reducedVector_k) × (computationWeight_k + diskIOWeight), k ∈ {RT, Tel, Cg, TFM, S0}
[Chart: time (sec) per processor, showing that CPUs have a uniform load]
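One way to turn that cost model into an assignment is a greedy longest-processing-time heuristic (my illustration, not necessarily the project's scheme; reducedVectors, computationWeights, and diskIOWeight are assumed given, as in the formula above):

    % Per-task cost from the load model
    loads = cellfun(@length, reducedVectors) .* (computationWeights + diskIOWeight);
    [~, order] = sort(loads, 'descend');       % heaviest tasks first
    procLoad   = zeros(1, nproc);
    assignTo   = zeros(size(loads));
    for i = order(:)'                          % iterate over task indices
        [~, p]      = min(procLoad);           % least-loaded processor so far
        assignTo(i) = p;
        procLoad(p) = procLoad(p) + loads(i);
    end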

Parallelization Techniques - 3
Minimize the memory footprint of the slave code
– Tightly manage memory to prevent swapping to disk and "out of memory" errors
– We saved ~40% of RAM per processor and have not seen an "out of memory" error since
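The slides do not show the specific changes; typical MATLAB patterns for this kind of footprint reduction look like the following sketch (illustrative names; reusing build_J_cols from above):

    S = load(segFile, 'Z');            % load only the variable needed, not the whole file
    Z = S.Z;  clear S                  % drop the temporary struct immediately
    J_local   = build_J_cols(x, g, orb, seg, myCols);
    infoLocal = J_local' * J_local;    % keep the small product...
    clear J_local                      % ...and release the large matrix early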

Parallelization Techniques - 4
Find contention points
– Cache what can be cached
– Optimize what can be optimized
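For the caching point, a persistent-variable memoization wrapper is the standard MATLAB idiom; a sketch (getOrbitData and expensiveLoad are hypothetical names, not the project's functions):

    function T = getOrbitData(orbitId)
    % Memoize an expensive, read-only computation across calls
    persistent cache
    if isempty(cache)
        cache = containers.Map('KeyType', 'double', 'ValueType', 'any');
    end
    if ~isKey(cache, orbitId)
        cache(orbitId) = expensiveLoad(orbitId);   % computed at most once per orbit
    end
    T = cache(orbitId);
    end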

Status - Current Speedup
[Chart: speedup vs. number of processors]
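For reference, the speedup plotted here is presumably the usual ratio against the serial version: speedup(p) = T_serial / T_parallel(p).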

Tools for Processing - 1
Run Manager/Scheduler/Version Controller
– Built by Ahmad Aljadaan
– Given a batch of options files, it schedules batches of runs, serial or parallel, on the cluster
– Facilitates version control by creating a unique directory for each run containing its options file, code, and results
– Used with a large number of test cases
– Has helped us schedule runs since January 2010
» and counting…

Tools for Processing - 2
Result Viewer
– Built by Ahmad Aljadaan
– Allows searching for runs, displays results, and plots related charts
– Facilitates comparison of results

Questions