A Parallel Version of Tree-Seed Algorithm (TSA) within CUDA Platform


1 A Parallel Version of Tree-Seed Algorithm (TSA) within CUDA Platform
Ahmet Cevahir ÇINAR, Turkish State Meteorological Service, Konya Division. Mustafa Servet KIRAN, Selçuk University, Faculty of Engineering, Department of Computer Engineering. Welcome to my presentation, and thank you for attending. My name is Ahmet Cevahir ÇINAR. I am a PhD student at Selçuk University and I work at the Konya Meteorological Office. My advisor is Mustafa Servet KIRAN.

2 Overview Introduction CUDA Tree-Seed Algorithm (TSA)
Parallelization of TSA – {Data and Task Parallelism} Experimental Study Results This is the overview of my presentation. First I give an introduction, then I explain what CUDA and TSA are. After that I explain how we parallelized the algorithm, then describe the experimental study, and finally give the results.

3 Introduction What is the result?
Why a parallel programming approach? What is the Tree-Seed Algorithm (TSA)? Where is it implemented? How did we test it? How many times was it run? What is the result? 1 - Because of the processing of big data, the parallel programming approach is an alternative way to overcome time complexity. 2 - The Tree-Seed Algorithm is a population-based algorithm inspired by the relation between trees and seeds. 3 - We implemented TSA within CUDA by using the GPU from MATLAB. 4 - To measure the processing time, the serial and parallel versions of TSA were tested on well-known numerical benchmark functions. 5 - In the experiments, different population sizes are considered, the dimensionality of the problems is fixed at 10, and the methods are run 30 times under these conditions. 6 - The experimental results show that parallel TSA is accelerated up to about 150 times with respect to the serial version of TSA. Accelerated up to about 150 times!

4 CUDA (Compute Unified Device Architecture)
A parallel computing platform and application programming interface (API). CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach termed GPGPU. The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. The main difference between a CPU and a GPU is the number of Arithmetic Logic Units (ALUs). For example, in this figure the CPU has 4 ALUs and the GPU has 128 ALUs, so the calculation is done in parallel.

5 Thread-Block-Grid Diagram (our GPU: TITAN X, 24 SMs)
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid consists of blocks, and each block consists of threads.] The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs), so the number of SMs is very important. In the left figure you see two different GPUs: one has 2 SMs and the other has 4 SMs, which means that in one period 4 blocks are executed on the GPU. In our study we use a GPU with 24 SMs, so in every period 24 different blocks are executed simultaneously. The right figure shows the Thread-Block-Grid structure.

6 The algorithmic framework of TSA
I want to explain TSA using its algorithmic framework. There are 3 main phases in this algorithm: Initialization, Seed Production and Termination. In the initialization part we set the onset parameters: we determine the number of trees, the ST parameter, and how many seeds are created for one tree. After that we generate solutions randomly and calculate the fitness of the trees using the objective function. In the seed production phase, for each tree, we generate the seeds using the ST parameter and the update equations, then determine the fitness of the seeds using the objective function. If the seed with the best fitness is better than the current tree, we remove this tree from the stand and place the best seed in the stand instead. If the termination condition is met, we report the best solution.

7 ST (Search Tendency) is a pre-determined value in the range [0,1]
Eq.1 Eq.2 ST (Search Tendency) is a pre-determined value between 0 and 1. How is it used? First we create a random number in the range [0,1]. If this random number is less than the ST parameter, Eq.1 is used; otherwise Eq.2 is used for seed production for the tree. The search capabilities of TSA are controlled by seed production and the ST parameter. Notation: the jth dimension of the kth seed for tree T_{i,j}; the jth dimension of the ith tree; a scaling factor in the range [-1,1]; the jth dimension of the best solution obtained so far; the jth dimension of a randomly selected tree from the stand.

8 PARALLELIZATION OF TSA – {Data Parallelism Step}
Number of seeds for each tree = 20% of the trees. For the parallelization of TSA, two different phases, data parallelism and task parallelism, are considered. The inventor of TSA (Kıran, 2015) proposed a range (between 10% and 25% of the stand) for the number of seeds produced. In the parallel version of the algorithm, we used 20% of the stand as the number of seeds for each tree in the experiments. For data parallelism, the random numbers required by TSA are sent to GPU memory in every iteration at the same time; the CURAND function library is not called from every thread, so we gain extra time, because calling the library is more expensive in terms of time.

9 {Task Parallelism Step}
For task parallelism, the benchmark functions are recoded separately for trees and seeds. During the iterations, the computations are performed simultaneously.

Sphere Function Code for Serial Tree and Seed (MATLAB, runs on the CPU):

    xx = [x1, x2, ..., xd];
    d = length(xx);
    sum = 0;
    for ii = 1:d
        xi = xx(ii);
        sum = sum + xi^2;
    end
    y = sum;

Sphere Function Code for Parallel Seed (C, runs on the GPU):

    __global__ void Objs(double *objs, double *seeds, int N, int D)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        double total = 0.00;
        for (int j = 0; j < D; j++) {
            total = seeds[(i/N)+i+(i/N)*((N*(D-1))-1)+(j*N)]
                  * seeds[(i/N)+i+(i/N)*((N*(D-1))-1)+(j*N)];
            objs[i] = objs[i] + total;
        }
    }

Sphere Function Code for Parallel Tree (C, runs on the GPU):

    __global__ void Obj(double *objt, double *nt, int N, int D)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        double total = 0.00;
        for (int j = 0; j < D; j++) {
            total = nt[i+(j*N)] * nt[i+(j*N)];
            objt[i] = objt[i] + total;
        }
    }

For example, the green part is the Sphere function for the serial tree and seed, coded in MATLAB and run on the CPU. The blue part is the Sphere function code for the parallel tree, coded in C and run on the GPU. The pink part is the Sphere function code for the parallel seed, coded in C and run on the GPU.

10 Trees and seeds are stored in row-major order for the serial version of TSA
Trees and seeds are laid out in row-major order in memory for the parallel version of TSA. In the serial version of TSA, the trees and the seeds are placed one by one. In the parallel version, we place each tree and its seeds one after another; in this way the calculations can be made simultaneous. The storage of the global best position of the stand and its fitness is the same for the serial and parallel versions of TSA.

11 EXPERIMENTAL STUDY To analyze the performance of the serial and parallel versions of TSA, numerical benchmark functions are used. The dimensionality of all functions is fixed at 10. To analyze the performance we use 5 numerical benchmark functions: Sphere, Rosenbrock, Rastrigin, Griewank and Ackley. In the table you see the ranges and formulas of these functions.

12 Serial and parallel versions of TSA are run 30 times.
ST parameter = 0.1 (Kıran, 2015). Population size = 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100. The termination condition is the maximum number of function evaluations (Max_FEs). Max_FEs = 1.5E+5. The number of seeds is selected as 20% of the stand size. For instance, if the size of the stand is 20, 4 seeds are produced for each tree in the seed production phase. The serial and parallel versions of TSA are run 30 times. Our computer configuration is: The control parameter of the serial and parallel versions of TSA (the ST parameter) is set to 0.1 according to the first study of TSA (Kıran, 2015), whose author did extensive experimentation and suggested this value. The population size is taken as 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100. While the population size changes, the termination condition does not: it is the maximum number of function evaluations (Max_FEs), taken as 1.5E+5. The number of seeds is selected as 20% of the stand size and is adjusted according to the number of trees in the stand. For instance, if the size of the stand is 20, 4 seeds are produced for each tree in the seed production phase.

13 In terms of running time, the methods are compared and the comparative results are given in the Table. The values in the Table show how many times the parallel version of TSA is faster than the serial version. In the comparison table, when the stand size is 10, the parallel version of TSA is about 2 times faster than the serial version on the Sphere function. When the stand size is 100, the parallel version of TSA is about 150 times faster than the serial version on the Rosenbrock function.

14 RESULTS AND DISCUSSION
Basically, when there are no complex calculations in the function, as with Sphere, and few trees in the stand, the acceleration of TSA implemented within CUDA looks worse because the number of calculations that can be performed in parallel is limited. If there are a huge number of complex calculations and the number of trees in the stand is high, the performance of parallel TSA is clearly demonstrated. If the serial and parallel versions of TSA are run on different machines and GPUs, different results can be obtained, but it is expected that the results are proportionally similar.

15 Questions?

16 ACKNOWLEDGEMENTS The authors wish to thank the Coordinatorship of Scientific Research Projects (CSRP) at Selcuk University for institutional support. In addition, the parallel TSA is run on hardware purchased with a research project supported by CSRP. The research project number is BAP

17 REFERENCES
Bukharov, O. E., & Bogolyubov, D. P. (2015). Development of a decision support system based on neural networks and a genetic algorithm. Expert Systems with Applications, 42(15).
Çınar, A. C. (2016). A CUDA-based Parallel Programming Approach to Tree-Seed Algorithm. MSc Thesis, Graduate School of Natural Sciences, Selcuk University.
Janoušek, J., Gajdoš, P., Radecký, M., & Snášel, V. (2014). Classification via Nearest Prototype Classifier Utilizing Artificial Bee Colony on CUDA. Paper presented at the International Joint Conference SOCO'14-CISIS'14-ICEUTE'14.
Kai, Z., Ming, Q., Lin, L., & Xiaoming, L. (2015). Solving Graph Coloring Problem by Parallel Genetic Algorithm Using Compute Unified Device Architecture. Journal of Computational and Theoretical Nanoscience, 12(7).
Kıran, M. S. (2015). TSA: Tree-seed algorithm for continuous optimization. Expert Systems with Applications, 42(19).
Kıran, M. S. (2016). An Implementation of Tree-Seed Algorithm (TSA) for Constrained Optimization. Intelligent and Evolutionary Systems. Springer.
Kneusel, R. (2014). Curve-Fitting on Graphics Processors Using Particle Swarm Optimization. International Journal of Computational Intelligence Systems, 7(2).
Platos, J., Snasel, V., Jezowicz, T., Kromer, P., & Abraham, A. (2012). A PSO-based document classification algorithm accelerated by the CUDA platform. Paper presented at Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference.
Solomon, S., Thulasiraman, P., & Thulasiram, R. (2011). Collaborative multi-swarm PSO for task matching using graphics processing units. Paper presented at the Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation.
Wang, P., Li, H., & Zhang, B. (2015). A GPU-based Parallel Ant Colony Algorithm for Scientific Workflow Scheduling. International Journal of Grid and Distributed Computing, 8(4).
Zhang, Z., & Seah, H. S. (2012). CUDA acceleration of 3D dynamic scene reconstruction and 3D motion estimation for motion capture. Paper presented at Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference.

