
Cyberinfrastructure for Scalable and High Performance Geospatial Computation
Xuan Shi
Graduate assistants supported by the CyberGIS grant: Fei Ye (2011) and Zhong Chen (2012)
School of Computational Science and Engineering (CSE), College of Computing, Georgia Institute of Technology

Overview
 Keeneland and Kraken: the cyberinfrastructure for our research and development
 Scalable and high performance geospatial software modules developed over the past year and seven months

Keeneland: a hybrid computer architecture and system
 A five-year Track 2D cooperative agreement awarded by the National Science Foundation (NSF) in 2009
 Developed by GA Tech, UT-Knoxville, and ORNL
 120 nodes [240 CPUs + 360 GPUs]
 Integrated into XSEDE in July 2012
 Blue Waters: hybrid computer systems at full scale

Kraken: a Cray XT5 supercomputer
 As of November 2010, Kraken is the 8th fastest computer in the world
 The world's first academic supercomputer to enter the petascale
 Peak performance of 1.17 PetaFLOPS
 112,896 computing cores (18,816 six-core 2.6 GHz AMD Opteron processors)
 147 TB of memory

Scalable and high performance geospatial computation (1)
Interpolation Using IDW Algorithm on GPU and Keeneland
[Table: time and speedup for IDW interpolation at seven data sizes (numbers of sample points), comparing a single CPU and a single GPU on a desktop against 1, 3, 6, and 9 GPUs on Keeneland; the numeric values did not survive the transcript]
 Performance comparison based on different scales of data (i.e. number of sample points) and computing resources (time is counted in seconds)
 Speedup is calculated as the time used on a single CPU divided by the time used on the GPU(s)
 Interpolation is calculated from the values of the 12 nearest neighbors
 Output grid size: 1M+ cells
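To make the IDW step concrete, here is a minimal CUDA sketch of the per-cell interpolation. It is an illustrative reconstruction, not the deck's code: the kernel name, the brute-force nearest-neighbor scan, and the distance-decay exponent p = 2 are all assumptions (the slides only state 12 nearest neighbors).

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define K 12          // number of nearest neighbors (from the slide)
#define POWER 2.0f    // assumed IDW distance-decay exponent

// One thread per output grid cell. For clarity this does a brute-force
// K-nearest-neighbor scan over all sample points; a production version
// would tile samples through shared memory or use a spatial index.
__global__ void idw_kernel(const float* sx, const float* sy, const float* sv,
                           int nSamples, float* grid,
                           int width, int height, float cellSize)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    float gx = col * cellSize, gy = row * cellSize;

    // Track the K smallest squared distances and their sample values.
    float bestD[K], bestV[K];
    for (int k = 0; k < K; ++k) bestD[k] = 1e30f;

    for (int i = 0; i < nSamples; ++i) {
        float dx = sx[i] - gx, dy = sy[i] - gy;
        float d2 = dx * dx + dy * dy;
        if (d2 < bestD[K - 1]) {          // insert into sorted best-K list
            int k = K - 1;
            while (k > 0 && bestD[k - 1] > d2) {
                bestD[k] = bestD[k - 1]; bestV[k] = bestV[k - 1]; --k;
            }
            bestD[k] = d2; bestV[k] = sv[i];
        }
    }

    // Weighted average with weights 1 / d^p.
    float wsum = 0.0f, vsum = 0.0f;
    for (int k = 0; k < K; ++k) {
        float d = sqrtf(bestD[k]);
        if (d < 1e-6f) { vsum = bestV[k]; wsum = 1.0f; break; } // exact hit
        float w = 1.0f / powf(d, POWER);
        wsum += w; vsum += w * bestV[k];
    }
    grid[row * width + col] = vsum / wsum;
}
```

Because each output cell is computed independently, a multi-GPU run as on Keeneland can simply partition the output grid rows across devices or MPI ranks, which is consistent with the near-linear scaling pattern the table reports.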

Scalable and high performance geospatial computation (2)
Interpolation Using Kriging Algorithm on GPU and Keeneland
[Table: time and speedup for Kriging interpolation at four data sizes, comparing a single CPU and a single GPU on a desktop against 1, 3, 6, and 9 GPUs on Keeneland; the numeric values did not survive the transcript]
 Performance comparison based on different scales of data (i.e. number of sample points) and computing resources (time is counted in seconds)
 Speedup is calculated as the time used on a single CPU divided by the time used on the GPU(s)
 Interpolation is calculated from the values of the 10 nearest neighbors
 Output grid size: 1M+ cells
Three Kriging approaches, a) Spherical, b) Exponential, and c) Gaussian, have been implemented on GPU/Keeneland (model functions sketched below)
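The three variants differ only in the semivariogram model used to build the kriging weights. Below is a sketch of the three standard model functions as CUDA device helpers; the function and parameter names (nugget, sill, range) are assumptions, since the slides name the models but show no code.

```cuda
#include <math.h>

// Semivariogram models used by ordinary kriging. gamma(h) gives the
// expected half-variance between two points separated by distance h;
// nugget, sill, and range are fitted model parameters.
__host__ __device__ inline
float sv_spherical(float h, float nugget, float sill, float range)
{
    if (h >= range) return sill;
    float r = h / range;
    return nugget + (sill - nugget) * (1.5f * r - 0.5f * r * r * r);
}

__host__ __device__ inline
float sv_exponential(float h, float nugget, float sill, float range)
{
    return nugget + (sill - nugget) * (1.0f - expf(-3.0f * h / range));
}

__host__ __device__ inline
float sv_gaussian(float h, float nugget, float sill, float range)
{
    return nugget + (sill - nugget)
                  * (1.0f - expf(-3.0f * (h * h) / (range * range)));
}
```

For each output cell, the kernel then solves the small (10+1)x(10+1) ordinary-kriging system built from these values over the 10 nearest samples, which is why kriging is markedly more expensive per cell than IDW.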

Scalable and High Performance Geospatial Computation (3)
Parallelizing Cellular Automata (CA) on GPU and Keeneland (1)
 Cellular Automata (CA) is the foundation for geospatial modeling and simulation, such as SLEUTH for urban growth simulation
 The Game of Life (GOL), invented by Cambridge mathematician John Conway, is a well-known generic CA consisting of a collection of cells that, based on a few mathematical rules, can live, die, or multiply.
The Rules (see the CUDA sketch below):
 For a space that is 'populated':
  Each cell with one or no neighbors dies, as if by loneliness.
  Each cell with four or more neighbors dies, as if by overpopulation.
  Each cell with two or three neighbors survives.
 For a space that is 'empty' or 'unpopulated':
  Each cell with exactly three neighbors becomes populated.
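These rules map directly onto one GPU thread per cell, each reading its eight neighbors from the previous generation. A minimal CUDA sketch of one update step follows; the kernel and buffer names are assumptions, since the slides describe the approach but show no code.

```cuda
// One thread per cell; `in` is generation t, `out` is generation t+1.
// Grids are width x height, row-major, 1 = alive, 0 = dead.
__global__ void game_of_life_step(const unsigned char* in, unsigned char* out,
                                  int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Count the eight neighbors, treating cells outside the grid as dead.
    int neighbors = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                neighbors += in[ny * width + nx];
        }

    int idx = y * width + x;
    // Conway's rules: born with exactly 3 neighbors,
    // survives with 2 or 3, dies otherwise.
    out[idx] = (neighbors == 3) || (in[idx] && neighbors == 2);
}
```

The host loop double-buffers: launch the kernel, swap `in` and `out`, and repeat for all iterations. For the multi-GPU Keeneland runs, the usual decomposition (assumed here, not taken from the slides) splits the grid into horizontal strips and exchanges one-row halos between neighboring GPUs each generation.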

Scalable and High Performance Geospatial Computation (3)
Parallelizing Cellular Automata on GPU and Keeneland (2)
A cell is "born" if it has exactly 3 neighbors, stays alive if it has 2 or 3 living neighbors, and dies otherwise.
 Size of CA: 10,000 x 10,000
 Number of iterations: 100
 CPU time: ~100 minutes
 GPU [desktop] time: ~6 minutes (roughly 17x speedup)
 Keeneland [20 GPUs]: 20 seconds (roughly 300x speedup)
CPU: Intel Xeon CPU 1.60 GHz, 3.25 GB of RAM
GPU: NVIDIA GeForce GTX 260 with 27 streaming multiprocessors (SMs)
 A simple SLEUTH model has been implemented on a single GPU
 Implementation on Kraken and Keeneland using multiple GPUs is under development

Scalable and High Performance Geospatial Computation (4)
Parallelizing ISODATA for Unsupervised Image Classification on Kraken (1)
Iterative Self-Organizing Data Analysis Technique Algorithm (ISODATA)
Performance comparison: ERDAS uses 3:44:37 (13,477 seconds) to read the image file [~2 minutes] and do the classification over one tile of 18 GB imagery data [0.5 m resolution in three bands]
[Table: read time and classification time for our solution on Kraken at 1,800, 3,600, 7,200, and 10,800 cores (36 GB, 72 GB, 144 GB, and 216 GB of data) with optimized Lustre stripe count and stripe size; the numeric values did not survive the transcript]
[Four run logs (Tue Jun 12, 2012) showing four ISODATA iterations to convergence, per-class histograms (classes 0-5), and read/classification timings for each configuration; numeric details did not survive the transcript]
 20+ hours to load data from GT into ORNL
 The more cores are requested, the longer the waiting time will be
 ~10 seconds to complete the classification process
 I/O needs to be further optimized
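The step that dominates each ISODATA iteration is assigning every pixel to the class with the nearest spectral mean. The slides' Kraken implementation runs this over MPI-distributed tiles on CPU cores; purely as a data-parallel illustration of that same step (not the deck's code), here is a CUDA rendering, with all names assumed.

```cuda
#include <float.h>

#define BANDS 3  // the slides' imagery has three bands

// One ISODATA assignment step: label every pixel with the class whose
// mean (centroid) is nearest in spectral space. One thread per pixel.
__global__ void isodata_assign(const float* pixels,   // nPixels x BANDS
                               const float* means,    // nClasses x BANDS
                               unsigned char* labels,
                               int nPixels, int nClasses)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPixels) return;

    float bestDist = FLT_MAX;
    int bestClass = 0;
    for (int c = 0; c < nClasses; ++c) {
        float d = 0.0f;
        for (int b = 0; b < BANDS; ++b) {
            float diff = pixels[i * BANDS + b] - means[c * BANDS + b];
            d += diff * diff;  // squared Euclidean distance in band space
        }
        if (d < bestDist) { bestDist = d; bestClass = c; }
    }
    labels[i] = (unsigned char)bestClass;
}
```

After assignment, the class means are recomputed from the new labels (in the distributed version, a reduction of per-class sums and counts across ranks) and the iteration repeats until the fraction of unchanged pixels exceeds the convergence threshold, which the logs show happening in four iterations here.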

Scalable and High Performance Geospatial Computation (4)
Parallelizing ISODATA for Unsupervised Image Classification on Kraken (2)
Iterative Self-Organizing Data Analysis Technique Algorithm (ISODATA)
[Table: for 5, 10, 15, and 20 classes and a varying number of tiles, columns I/O, CLS (classification), Total, and IR; the numeric values did not survive the transcript]
Performance comparison: to classify one tile of 18 GB imagery into 10, 15, and 20 classes, ERDAS uses about 5.5, 6.5, and 7.5 hours respectively to complete 20 iterations, while the convergence number is less than 0.95

Scalable and High Performance Geospatial Computation (5)
Near-repeat calculation for spatial-temporal analysis of crime events over GPU and Keeneland
 Through a re-engineering process, the near-repeat calculation was first parallelized on an NVIDIA GeForce GTX 260 GPU, which takes about 48.5 minutes to complete one calculation and 999 simulations on two event chains over 30,000 events.
 Through a combination of MPI and GPU programs, we can dispatch the simulation work onto multiple nodes of Keeneland to accelerate the simulation process (see the sketch below).
 Using 100 GPUs on Keeneland, 1,000 simulations complete in about 264 seconds.
 If more GPUs were used, the simulation time could be reduced further.
One run of a 4+ event-chain calculation can easily approach or exceed petascale (10^15) and exascale (10^18) operation counts
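The MPI-over-GPUs dispatch pattern the slides describe looks roughly like the host sketch below. Only the structure comes from the slides (one MPI rank per GPU, the 1,000 simulations split evenly across ranks); the empty kernel and all names are placeholders, since the near-repeat statistic itself is not shown.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Placeholder for the per-simulation GPU work: one Monte Carlo
// permutation of the event data (details not given in the slides).
__global__ void run_one_simulation(/* event data, RNG state, results */) {}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind each MPI rank to one of the GPUs on its node.
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    cudaSetDevice(rank % nDevices);

    // Split the simulations evenly across ranks: with 100 GPUs and
    // 1,000 simulations, each rank runs 10, matching the slide's setup.
    const int totalSims = 1000;
    int simsPerRank = (totalSims + size - 1) / size;
    for (int s = 0; s < simsPerRank; ++s) {
        run_one_simulation<<<256, 256>>>();
        cudaDeviceSynchronize();
    }

    // A reduction of the simulated statistics to rank 0 would go here.
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("all %d simulations finished\n", totalSims);

    MPI_Finalize();
    return 0;
}
```

Because the simulations are independent, this scheme scales almost linearly with GPU count, which is why the slides note that adding GPUs would reduce the 264-second time further.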

Thank you. Questions?