Parallelization of the Classic Gram-Schmidt QR-Factorization

Slides:



Advertisements
Similar presentations
Dynamic Source Routing (DSR) algorithm is simple and best suited for high mobility nodes in wireless ad hoc networks. Due to high mobility in ad-hoc network,
Advertisements

MINJAE HWANG THAWAN KOOBURAT CS758 CLASS PROJECT FALL 2009 Extending Task-based Programming Model beyond Shared-memory Systems.
Parallel Processing with OpenMP
Introduction to Openmp & openACC
Cache Heng Sovannarith
A NOVEL APPROACH TO SOLVING LARGE-SCALE LINEAR SYSTEMS Ken Habgood, Itamar Arel Department of Electrical Engineering & Computer Science GABRIEL CRAMER.
Parallel Matrix Operations using MPI CPS 5401 Fall 2014 Shirley Moore, Instructor November 3,
Characteristics of a Microprocessor. The microprocessor is the defining trait of a computer, so it is important to understand the characteristics used.
HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.
Benchmarking Parallel Code. Benchmarking2 What are the performance characteristics of a parallel code? What should be measured?
Information Technology Center Introduction to High Performance Computing at KFUPM.
OpenFOAM on a GPU-based Heterogeneous Cluster
SHARCNET. Multicomputer Systems r A multicomputer system comprises of a number of independent machines linked by an interconnection network. r Each computer.
Parallelized variational EM for Latent Dirichlet Allocation: An experimental evaluation of speed and scalability Ramesh Nallapati, William Cohen and John.
Understanding Application Scaling NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000 Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and.
MPI and C-Language Seminars Seminar Plan  Week 1 – Introduction, Data Types, Control Flow, Pointers  Week 2 – Arrays, Structures, Enums, I/O,
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
DCABES 2009 China University Of Geosciences 1 The Parallel Models of Coronal Polarization Brightness Calculation Jiang Wenqian.
Dynamic Load Balancing Experiments in a Grid Vrije Universiteit Amsterdam, The Netherlands CWI Amsterdam, The
Introduction to Scientific Computing Doug Sondak Boston University Scientific Computing and Visualization.
Nor Asilah Wati Abdul Hamid, Paul Coddington, Francis Vaughan School of Computer Science, University of Adelaide IPDPS - PMEO April 2006 Comparison of.
Science Advisory Committee Meeting - 20 September 3, 2010 Stanford University 1 04_Parallel Processing Parallel Processing Majid AlMeshari John W. Conklin.
Lecture 37: Chapter 7: Multiprocessors Today’s topic –Introduction to multiprocessors –Parallelism in software –Memory organization –Cache coherence 1.
The hybird approach to programming clusters of multi-core architetures.
MULTICOMPUTER 1. MULTICOMPUTER, YANG DIPELAJARI Multiprocessors vs multicomputers Interconnection topologies Switching schemes Communication with messages.
Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes Maria Athanasaki, Evangelos Koukis, Nectarios Koziris National Technical.
Unified Parallel C at LBNL/UCB FT Benchmark in UPC Christian Bell and Rajesh Nishtala.
18.337: Image Median Filter Rafael Palacios Aeronautics and Astronautics department. Visiting professor (IIT-Institute for Research in Technology, University.
1 Titanium Review: Ti Parallel Benchmarks Kaushik Datta Titanium NAS Parallel Benchmarks Kathy Yelick U.C. Berkeley September.
TPB Models Development Status Report Presentation to the Travel Forecasting Subcommittee Ron Milone National Capital Region Transportation Planning Board.
Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.
CC02 – Parallel Programming Using OpenMP 1 of 25 PhUSE 2011 Aniruddha Deshmukh Cytel Inc.
Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado U C M.
Collective Communication
“SEMI-AUTOMATED PARALLELISM USING STAR-P " “SEMI-AUTOMATED PARALLELISM USING STAR-P " Dana Schaa 1, David Kaeli 1 and Alan Edelman 2 2 Interactive Supercomputing.
3D Graphics Module Ramesh Srigiriraju. Abstract Project Areas: 3D Graphics, Modular design Purpose: to test data schemes for 3D rotatons Four different.
Unit 2 - Hardware Microprocessors & CPUs. What is a microprocessor? ● The brain of the computer, the microprocessor is responsible for organizing and.
Integrating Trilinos Solvers to SEAM code Dagoberto A.R. Justo – UNM Tim Warburton – UNM Bill Spotz – Sandia.
1 Introduction to Parallel Computing. 2 Multiprocessor Architectures Message-Passing Architectures –Separate address space for each processor. –Processors.
Distributed Verification of Multi-threaded C++ Programs Stefan Edelkamp joint work with Damian Sulewski and Shahid Jabbar.
Introduction, background, jargon Jakub Yaghob. Literature T.G.Mattson, B.A.Sanders, B.L.Massingill: Patterns for Parallel Programming, Addison- Wesley,
Optimal Client-Server Assignment for Internet Distributed Systems.
InCoB August 30, HKUST “Speedup Bioinformatics Applications on Multicore- based Processor using Vectorizing & Multithreading Strategies” King.
Sensitivity of Cluster File System Access to I/O Server Selection A. Apon, P. Wolinski, and G. Amerson University of Arkansas.
Amy Apon, Pawel Wolinski, Dennis Reed Greg Amerson, Prathima Gorjala University of Arkansas Commercial Applications of High Performance Computing Massive.
 Copyright, HiCLAS1 George Delic, Ph.D. HiPERiSM Consulting, LLC And Arney Srackangast, AS1MET Services
Parallelization of Classification Algorithms For Medical Imaging on a Cluster Computing System 指導教授 : 梁廷宇 老師 系所 : 碩光通一甲 姓名 : 吳秉謙 學號 :
Community Grids Lab. Indiana University, Bloomington Seung-Hee Bae.
Chapter 8-2 : Multicomputers Multiprocessors vs multicomputers Multiprocessors vs multicomputers Interconnection topologies Interconnection topologies.
Parallelization of 2D Lid-Driven Cavity Flow
Nanco: a large HPC cluster for RBNI (Russell Berrie Nanotechnology Institute) Anne Weill – Zrahia Technion,Computer Center October 2008.
Lab 2 Parallel processing using NIOS II processors
Parallel Computing With High Performance Computing Clusters (HPCs) By Jeremy Cathey.
Radix Sort and Hash-Join for Vector Computers Ripal Nathuji 6.893: Advanced VLSI Computer Architecture 10/12/00.
CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods December 10, 2003 Jose L. Rodriguez
FIT5174 Parallel & Distributed Systems Dr. Ronald Pose Lecture FIT5174 Distributed & Parallel Systems Lecture 5 Message Passing and MPI.
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 8 October 23, 2002 Nayda G. Santiago.
Streaming Big Data with Self-Adjusting Computation Umut A. Acar, Yan Chen DDFP January 2014 SNU IDB Lab. Namyoon Kim.
Dynamic Control of Coding for Progressive Packet Arrivals in DTNs.
1 Parallel Applications Computer Architecture Ning Hu, Stefan Niculescu & Vahe Poladian November 22, 2002.
Energy-Efficient Protocol for Cooperative Networks.
Lab Activities 1, 2. Some of the Lab Server Specifications CPU: 2 Quad(4) Core Intel Xeon 5400 processors CPU Speed: 2.5 GHz Cache : Each 2 cores share.
Computer Performance. Hard Drive - HDD Stores your files, programs, and information. If it gets full, you can’t save any more. Measured in bytes (KB,
Analyzing Memory Access Intensity in Parallel Programs on Multicore Lixia Liu, Zhiyuan Li, Ahmed Sameh Department of Computer Science, Purdue University,
LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
School of Engineering University of Guelph
A Presentation on online voting system
STUDY AND IMPLEMENTATION
Memory System Performance Chapter 3
Presentation transcript:

Parallelization of the Classic Gram-Schmidt QR-Factorization M543 Final Project Ron Caplan May. 6th 2008

Parallelization with MPI Parallel Computation Results Conclusion Overview Introduction MATLAB vs. FORTRAN Parallelization with MPI Parallel Computation Results Conclusion

Introduction “It is well established that the classical Gram-Schmidt process is one of the unstable ones. Consequently it is rarely used, except sometimes on parallel computers in situations where advantages in communication may outweigh the disadvantage of instability.” - Trefethem-Bau pg. 66

MATLAB vs. FORTRAN Pentium 4 2.0 Ghz with 512 KB cache, 512MB RDRAM running Windows XP MATLAB Student Version 7 and g95 with cygwin Each run is done 5 times and the times averaged. In-between runs, memory is cleared.

MATLAB vs. FORTRAN

MATLAB vs. FORTRAN

Parallelization with MPI Collect Q and R 1 vector at a time: Compute Part of Q & R: Q R Send out A To all nodes 1 vector at a time: A A A A A 4-node example Four Dual Quad-core Xeons with 4GB RAM per Node configured as 4 Processors Per Node for MPI/Batch (16 total processors)

Parallelization with MPI

Parallel Computation Results Using Frobenius Norm

Parallel Computation Results

Parallel Computation Results

Parallel Computation Results

Parallel Computation Results

Parallel Computation Results

Parallel Computation Results Should dynamically change send packet size, and probably use sends and receives instead of broadcast and gather…. Why? Because: Also, since CGS inner loop is from i=1:j-1, the distribution of work to Each node is not equal. This should be fixed as well.

Conclusion Parallelization of CGS does give nice speed-up for large matrices. Adding some improvements to the code such as non-uniform block sizes, and optimizing communication method would lead to even better speed-up.