PARALLEL COMPUTING
Petr Štětka, Jakub Vlášek
Department of Applied Electronics and Telecommunications, Faculty of Electrical Engineering, University of West Bohemia, Czech Republic
About the project
- Laboratory of Information Technology of JINR
- Project supervisors: Sergey Mitsyn, Alexander Ayriyan
- Topics: Grids - gLite, MPI, NVIDIA CUDA
Grids – introduction
Grids II
- Loose federation of shared resources
- More efficient usage
- Security
A grid provides:
- Computational resources (Computing Elements)
- Storage resources (Storage Elements)
- A resource broker (Workload Management System)
gLite framework
- Middleware developed within EGEE (Enabling Grids for E-sciencE)
- User management (security): users, groups, sites; certificate based
- Data management: replication
- Workload management: matching job requirements against available resources
gLite – User management
Users:
- Each user needs a certificate
- Accepts the AUP
- Membership in a Virtual Organization
Proxy certificates:
- Applications use them on the user's behalf
- Initialization: voms-proxy-init --voms edu
gLite - jobs
- Write the job description in the Job Description Language (JDL)
- Submit the job: glite-wms-job-submit -a myjob.jdl
- Check its status: glite-wms-job-status
- Retrieve the output: glite-wms-job-output

Example JDL file:
Executable = "myapp";
StdOutput = "output.txt";
StdError = "stderr.txt";
InputSandbox = {"myapp", "input.txt"};
OutputSandbox = {"output.txt", "stderr.txt"};
Requirements = …
Algorithmic parallelization
- Embarrassingly parallel: a set of independent data
- Hard to parallelize: interdependent data; performance depends on the interconnect
Amdahl's law - example:
- A program takes 100 hours
- A particular portion of 5 hours cannot be parallelized
- The remaining portion of 95 hours (95 %) can be parallelized
=> Execution cannot be shorter than 5 hours, no matter how many resources we allocate; the speedup is limited to 20× (see the formula below).
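The 20× bound follows from Amdahl's law; written out here for clarity (the formula is an addition, not on the original slide), with p the parallelizable fraction and N the number of processors:

S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} = \frac{1}{1 - 0.95} = 20

so even with unlimited processors the 100-hour run cannot drop below its 5 serial hours.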
Message Passing Interface
- API (Application Programming Interface)
- De facto standard for parallel programming on:
  - Multiprocessor systems
  - Clusters
  - Supercomputers
- Abstracts away the complexity of writing parallel programs
- Available bindings: C, C++, Fortran, Python, Java
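Not part of the original slides: a minimal C sketch of the init/rank/size/finalize pattern that every MPI program follows (the same calls are listed later on the "MPI - functions" slide). It is compiled with mpicc and launched with mpirun -np N:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut the runtime down */
    return 0;
}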
Message Passing Interface II
Process communication:
- Master-slave model
- Broadcast
- Point to point
- Blocking or non-blocking
- Process communication topologies: Cartesian, graph
- Requires specification of a data type
Provides an interface to a shared file system:
- Every process has a "view" of a file
- Locking primitives
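A short sketch, not from the presentation, illustrating two of the communication styles named above: a broadcast from a root process and a blocking point-to-point send/receive (run with at least two processes, e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, token = 0;
    double params[2] = {0.0, 0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Broadcast: rank 0 fills the buffer, every other rank receives a copy. */
    if (rank == 0) { params[0] = 1.0; params[1] = 100.0; }
    MPI_Bcast(params, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Blocking point-to-point: rank 0 sends one int to rank 1. */
    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", token);
    }

    MPI_Finalize();
    return 0;
}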
MPI – Test program
- Numerical integration - rectangle method (top-left)
- Input parameters: beginning, end, number of steps; the integrated function is compiled into the program
- gLite script - runs on the grid

Example run:
Someone@vps101:~/mpi# mpirun -np 4 ./mex 1 200 10000000
Partial integration ( 2 of 4) (from 1.000000000000e+00 to 1.000000000000e+02 in 2500000 steps) = 1.061737467015e+01
Partial integration ( 3 of 4) (from 2.575000000000e+01 to 1.000000000000e+02 in 2500000 steps) = 2.439332078942e-15
Partial integration ( 1 of 4) (from 7.525000000000e+01 to 1.000000000000e+02 in 2500000 steps) = 0.000000000000e+00
Partial integration ( 4 of 4) (from 5.050000000000e+01 to 1.000000000000e+02 in 2500000 steps) = 0.000000000000e+00
Numerical Integration result: 1.061737467015e+01 in 0.79086 seconds
Numerical integration by one process: 1.061737467015e+01
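The program's source is not reproduced in the presentation; the following is a hypothetical sketch of how such an MPI rectangle-method integrator is commonly structured (the placeholder integrand f, the interval split and the use of MPI_Reduce are assumptions, not the authors' code):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* The integrated function is compiled into the program, as on the slide. */
static double f(double x) { return exp(-x) * x; }   /* placeholder integrand */

/* usage: mpirun -np N ./integrate <begin> <end> <steps> */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = atof(argv[1]), b = atof(argv[2]);
    long   n = atol(argv[3]) / size;          /* steps per process */
    double width = (b - a) / size;            /* sub-interval per process */
    double lo = a + rank * width;
    double h  = width / n;

    /* Top-left rectangle rule on this process's sub-interval. */
    double partial = 0.0;
    for (long i = 0; i < n; i++)
        partial += f(lo + i * h) * h;

    /* Sum the partial integrals on rank 0. */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Numerical Integration result: %.12e\n", total);

    MPI_Finalize();
    return 0;
}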
Test program evaluation – 4-core CPU
CUDA
- Programmed in the C++ language
- Gridable
- GPGPU - parallel architecture
- Proprietary NVIDIA technology
- Supported on GeForce 8000 series and newer
- FP (floating-point) precision
- PFLOPS range (Tesla)
CUDA II
- Unlike a CPU, an enormous part of the GPU is dedicated to execution units
- blocks * threads-per-block gives the total number of threads processed by a kernel launch (see the sketch below)
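A minimal CUDA sketch (not from the slides) of this launch geometry: each thread derives a global index from its block and thread ids, and the launch starts blocks * threadsPerBlock threads in total:

#include <cstdio>

__global__ void fill(float *out, int n)
{
    /* One global index per thread: blockIdx selects the block,
       threadIdx the position inside it. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                     /* guard: the grid may overshoot n */
        out[i] = 2.0f * i;
}

int main()
{
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    fill<<<blocks, threadsPerBlock>>>(d_out, n);   /* blocks * threadsPerBlock threads */

    float check = 0.0f;                            /* copy one element back to verify */
    cudaMemcpy(&check, d_out + 12345, sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[12345] = %f\n", check);            /* expect 24690.0 */

    cudaFree(d_out);
    return 0;
}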
CUDA Test program
- Numerical integration - rectangle method (top-left)
- Ported version of the MPI test program
- 23 times faster on a notebook NVIDIA NVS 4200M than on one core of a Sandy Bridge i5 CPU @ 2.5 GHz
- 160 times faster on a desktop GeForce GTX 480 than on one core of an AMD 1055T CPU @ 2.7 GHz

CUDA CLI output:
Integration (CUDA) = 10.621515274048 in 1297.801025 ms (SINGLE)
Integration (CUDA) = 10.617374518106 in 1679.833374 ms (DOUBLE)
Integration (CUDA) = 10.617374518106 in 1501.769043 ms (DOUBLE, GLOBAL)
Integration (CPU) = 10.564660072327 in 30408.316406 ms (SINGLE)
Integration (CPU) = 10.617374670093 in 30827.710938 ms (DOUBLE)
Press any key to continue...
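The ported kernel itself is not shown in the presentation; the sketch below shows one common way to structure such a port (a grid-stride loop with per-thread partial sums reduced on the host); the integrand and launch parameters are placeholders, not the authors' values:

#include <cstdio>

__device__ double f(double x) { return x * x; }   /* placeholder integrand */

__global__ void integrate(double *partials, double a, double h, long n)
{
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int threads = gridDim.x * blockDim.x;

    /* Grid-stride loop: thread tid handles rectangles tid, tid+threads, ... */
    double sum = 0.0;
    for (long i = tid; i < n; i += threads)
        sum += f(a + i * h) * h;                  /* top-left rectangle rule */
    partials[tid] = sum;
}

int main()
{
    const double a = 1.0, b = 100.0;
    const long   n = 10000000;
    const double h = (b - a) / n;

    const int blocks = 64, threadsPerBlock = 256;
    const int total  = blocks * threadsPerBlock;

    double *d_partials;
    cudaMalloc(&d_partials, total * sizeof(double));
    integrate<<<blocks, threadsPerBlock>>>(d_partials, a, h, n);

    double host[total];                           /* cudaMemcpy waits for the kernel */
    cudaMemcpy(host, d_partials, total * sizeof(double), cudaMemcpyDeviceToHost);

    double result = 0.0;
    for (int i = 0; i < total; i++) result += host[i];
    printf("Integration (CUDA) = %.12f\n", result);

    cudaFree(d_partials);
    return 0;
}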
Conclusion
- Familiarized ourselves with parallel computing technologies:
  - Grid with gLite middleware
  - The MPI API
  - CUDA technology
- Wrote a program for numerical integration:
  - Runs on the grid
  - With MPI support
  - Also ported to the graphics card using CUDA
- It works!
THANK YOU FOR YOUR ATTENTION
Distributed Computing
- CPU scavenging:
  - 1997: distributed.net – RC5 cipher cracking, proof of concept
  - 1999: SETI@home, later BOINC
- Clusters
- Cloud computing
- Grids (e.g. the LHC)
MPI - functions
Initialization and shutdown:
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &procnum);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Finalize();

Data type creation:
MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_Type_commit(MPI_Datatype *datatype)

Data exchange – collective and point-to-point:
MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
…
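A hypothetical usage sketch (not from the slides) combining the calls listed above: a derived data type built with MPI_Type_contiguous and one record per process gathered to the root with MPI_Gather:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int procnum, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &procnum);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    /* Derived data type: 3 contiguous doubles treated as one element. */
    MPI_Datatype record;
    MPI_Type_contiguous(3, MPI_DOUBLE, &record);
    MPI_Type_commit(&record);

    double mine[3] = { procnum, procnum * 10.0, procnum * 100.0 };
    double *all = NULL;
    if (procnum == 0)
        all = malloc(3 * numprocs * sizeof(double));   /* receive buffer on the root only */

    /* Root (rank 0) receives one record from every process. */
    MPI_Gather(mine, 1, record, all, 1, record, 0, MPI_COMM_WORLD);

    if (procnum == 0) {
        for (int i = 0; i < numprocs; i++)
            printf("rank %d sent %.0f %.0f %.0f\n", i, all[3*i], all[3*i+1], all[3*i+2]);
        free(all);
    }

    MPI_Type_free(&record);
    MPI_Finalize();
    return 0;
}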