DCABES 2009 China University Of Geosciences 1 The Parallel Models of Coronal Polarization Brightness Calculation Jiang Wenqian.

Outline
- Introduction
- pB Calculation Formula
- Serial pB Calculation Process
- Parallel pB Calculation Models
- Conclusion

Part Ⅰ. Introduction
Space weather forecasting needs an accurate solar wind model of the solar atmosphere and interplanetary space. A global model of the corona and heliosphere is the basis of numerical space weather forecasting, and the observational basis for explaining the various relations involved. Three-dimensional numerical magnetohydrodynamics (MHD) simulation is one of the most common numerical methods for studying the corona and the solar wind.

Converting the coronal electron density generated by the simulation into coronal polarization brightness (pB) is the key method for comparing a model with observations, and is therefore important for validating MHD models. Because of the large data volume and the complexity of the pB model, the computation on a single CPU (or core) takes too long to visualize the pB data in near real time.

Based on the characteristics of the CPU/GPU computing environment, we analyze the pB conversion algorithm, implement two parallel models of pB calculation with MPI and CUDA, and compare the efficiency of the two models.

Part Ⅱ. pB Calculation Formula
pB arises from photospheric radiation scattered by coronal electrons. It can be used to invert the coronal electron density and to validate numerical models. Taking limb darkening into account, the pB of a small coronal volume element is given by formulas (1), (2) and (3).
[Equations (1)-(3) are not reproduced in this transcript.]

The polarization brightness image to be compared with coronagraph observations is generated by integrating the electron density along the line of sight.
[Figure: Density integral. Process of pB calculation.]
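The line-of-sight integration can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the `density` and `weight` callables, the integration limits and the use of solar radii as the length unit are all assumptions, and `weight` only stands in for the scattering factor defined by the pB formulas, which are not reproduced here. The default of 480 steps merely echoes the 481 grid points along z mentioned in the experimental data.

```python
import math
from typing import Callable


def integrate_line_of_sight(
    x: float,
    y: float,
    density: Callable[[float], float],
    weight: Callable[[float, float], float],
    z_max: float = 10.0,
    n_steps: int = 480,
) -> float:
    """Trapezoidal integral of weight * density along z for one plane-of-sky point (x, y).

    All names and defaults are hypothetical; lengths are assumed to be in solar radii.
    """
    rho = math.hypot(x, y)      # projected distance from the disk center
    dz = 2.0 * z_max / n_steps  # spacing of the samples along the line of sight
    total = 0.0
    for k in range(n_steps + 1):
        z = -z_max + k * dz
        r = math.hypot(rho, z)  # heliocentric distance of this sample
        end_factor = 0.5 if k in (0, n_steps) else 1.0  # trapezoid rule end points
        total += end_factor * weight(r, rho) * density(r)
    return total * dz


# Placeholder inputs: an inverse-square density and a purely geometric weight.
value = integrate_line_of_sight(
    2.0, 0.0,
    density=lambda r: r ** -2,
    weight=lambda r, rho: (rho / r) ** 2,
)
print(value)
```

Each call depends only on its own (x, y) point, which is the property the parallel models rely on.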

Part Ⅲ. Serial pB Calculation Process
The steps of the serial pB calculation on the CPU with the experimental data are shown in the figure.
[Figure: The serial process of pB calculation]

We implement the serial process of pB calculation with G95 on Linux and with Visual Studio 2005 on Windows XP. Timing each step shows that the most time-consuming part of the whole program is the calculation of the pB values, which accounts for 98.05% and 99.05% of the total time, respectively.

Therefore, to meet the demand for coronal polarization brightness in near real time, we should optimize the calculation of the pB values. Because the density integration along the line of sight is independent for each point over the solar limb, pB calculation is well suited to parallel computation.
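The measured share of run time spent in the pB calculation (98.05% and 99.05% for the two serial builds) puts an upper bound on the overall speedup through Amdahl's law. The sketch below assumes the pB part parallelizes perfectly and everything else stays serial; the function name and the choice of 8 workers are illustrative.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on overall speedup when only `parallel_fraction` of the run time is spread over `workers`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Fractions of total time spent computing pB values, as reported for the two serial builds.
builds = (
    ("G95 on Linux", 0.9805),
    ("Visual Studio 2005 on Windows XP", 0.9905),
)
for label, fraction in builds:
    print(f"{label}: at most {amdahl_speedup(fraction, 8):.2f}x on 8 workers")
```

Under these assumptions the bound on 8 workers is roughly 7.0x and 7.5x respectively; the serial remainder, not the worker count, sets the limit as more workers are added.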

Part Ⅳ. Parallel pB Calculation Models
At present, parallel MHD numerical calculation is mainly based on MPI. With the development of high-performance computing, GPU architectures show clear advantages for compute-intensive problems. Using the GPU is therefore a promising way to parallelize MHD numerical calculation. We implement two parallel models, based on MPI and CUDA respectively.

Experiment Environment
Experimental data
- 42×42×82 (r, θ, φ) density data (den)
- 321×321×481 (x, y, z) Cartesian coordinate grid
- 321×321 pB values are generated
Hardware
- Intel(R) Xeon(R) CPU, 2.00 GHz (8 CPUs)
- 1 GB memory
- NVIDIA Quadro FX 4600 graphics card with 760 MB of GDDR3 SDRAM global memory (G80 architecture, 12 MPs and 128 SPs)

Compiling environment
- CUDA-based parallel model: Visual Studio 2005 on Windows XP, CUDA 1.1 SDK
- MPI-based parallel model: G95 on Linux, MPICH2

MPI-based Parallelized Implementation
In the MPI environment, the experiment decomposes the computing domain into sub-domains, as shown in the figure.
[Figure: decomposition of the computing domain into sub-domains]
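The decomposition figure is not available here, so the following is only one plausible scheme, not necessarily the one used in the experiment: a contiguous block split of the 321 image rows over 8 processes. The function name, the row-wise split and the one-process-per-core mapping are assumptions.

```python
def block_ranges(n_items: int, n_ranks: int) -> list[tuple[int, int]]:
    """Split range(n_items) into n_ranks contiguous half-open ranges whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_ranks)
    ranges = []
    start = 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)  # the first `extra` ranks take one more item
        ranges.append((start, start + size))
        start += size
    return ranges


ROWS = 321  # rows of the 321 x 321 pB image
RANKS = 8   # assumed: one MPI process per CPU core
for rank, (lo, hi) in enumerate(block_ranges(ROWS, RANKS)):
    print(f"rank {rank}: rows {lo}..{hi - 1} ({hi - lo} rows)")
```

Because every pB value is computed independently, such a split needs communication only to gather the finished rows, which is why a simple static partition is a reasonable starting point.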

The final result shows that the MPI-based parallel model reaches a speedup of 5.8. Since the experiment runs on a platform with 8 CPU cores, this speed-up ratio is close to its theoretical value. It also indicates that the MPI-based parallel solution balances processor utilization against communication between processors.
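One way to quantify "close to its theoretical value" is parallel efficiency, the speedup divided by the number of workers. This is a standard definition rather than something taken from the slides, and the function name is illustrative.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / workers


# Reported MPI speedup of 5.8 on a platform with 8 CPU cores.
print(f"efficiency = {parallel_efficiency(5.8, 8):.1%}")
```

With the reported figures this gives an efficiency of 72.5%, so about a quarter of the ideal 8x is lost to serial work, communication or load imbalance.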

CUDA-based Parallelized Implementation
Given the serial pB calculation process and the CUDA architecture, the calculation part should be placed in a kernel function. Because the density interpolation and the cumulative sum behind each pB value are independent of those for every other pB value, the pB calculation can be spread over many CUDA threads, with each thread calculating one pB value.

However, the number of pB values to be calculated is much larger than the number of threads available on the GPU, so each thread must calculate multiple pB values. For the experimental conditions, the thread count is set to 256 per block to make the best use of the computing resources. The number of blocks depends on the ratio of the number of pB values to the number of threads. In addition, because global memory access is slow, some independent data can be placed in shared memory to reduce access time.
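The launch arithmetic can be checked on the host in plain Python. The sketch below uses the 321×321 output size and the 256 threads per block from the slides; the strided assignment of several values to one thread and the example grid of 96 blocks are assumptions for illustration, not the authors' actual kernel configuration.

```python
N_PB = 321 * 321         # pB values in the experiment's output image
THREADS_PER_BLOCK = 256  # block size stated in the slides


def blocks_for_one_value_per_thread(n_values: int, threads_per_block: int) -> int:
    """Ceiling division: the smallest block count whose threads cover every value once."""
    return (n_values + threads_per_block - 1) // threads_per_block


def strided_indices(thread_id: int, total_threads: int, n_values: int) -> range:
    """Values handled by one thread when fewer threads than values are launched."""
    return range(thread_id, n_values, total_threads)


print(blocks_for_one_value_per_thread(N_PB, THREADS_PER_BLOCK))

# Hypothetical smaller grid: 96 blocks of 256 threads, each thread looping over its stride.
total_threads = 96 * THREADS_PER_BLOCK
per_thread = [len(strided_indices(t, total_threads, N_PB)) for t in range(total_threads)]
print(min(per_thread), max(per_thread))
```

With one value per thread, 403 blocks of 256 threads cover the 103,041 pB values; with the hypothetical 96-block grid, each thread handles 4 or 5 values and every value is still computed exactly once.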

The data placed in shared memory amounts to about 7 KB, less than the 16 KB the GPU provides, so the parallel solution is feasible. Moreover, the data-length array is read-only and accessed very frequently, so it is moved from shared memory to constant memory to further improve access efficiency. The CUDA-based parallel calculation process is shown in the figure.
[Figure: The CUDA-based parallel calculation process]

Experiment results
The pB calculation times of the two models are listed in Table 1.
Table 1. pB calculation time of the serial and parallel models, and their speed-up ratios
Columns: MPI (G95); CUDA (Visual Studio 2005)
Rows: pB calculation time of serial model (s); pB calculation time of parallel model (s); speed-up ratio

The total performance of the two models is listed in Table 2.
Table 2. Total running time of the two parallel models, and their speed-up ratios relative to the serial models
Table labels, in their original order: MPI (G95) (s); CUDA (Visual Studio 2005) (s); speed-up ratio of running time; serial models; parallel models

Finally, we draw the coronal polarization brightness image from the calculated data.
[Figure: coronal polarization brightness image]

Conclusion
Under the same environment, the pB calculation of the MPI-based parallel model takes [...] seconds, against [...] seconds for the serial model, a speedup of [...]. The pB calculation of the CUDA-based parallel model takes [...] seconds, against [...] seconds for the serial model, a speedup of [...]. In total running time, the CUDA-based model outperforms the MPI-based model by a factor of 2.84.

These results indicate that the CUDA-based parallel model is more suitable for pB calculation, and that it provides a better solution for post-processing and visualizing the results of MHD numerical calculation.

Thank you!!!