
Parallelization of Iterative Heuristics for Performance Driven Low-Power VLSI Standard Cell Placement
Research Committee Project Seminar
Sadiq M. Sait
June 15, 2003
King Fahd University of Petroleum & Minerals, Dhahran

Outline
- Introduction
- Parallel CAD & Environments
- Problem Formulation & Cost Functions
- Fuzzification of Cost
- Parallel Iterative Heuristics
- Benchmarks & Experimental Setup
- Tools
- Objectives, Tasks
- Conclusion

Introduction & Motivation
- Problem: VLSI cell placement, which is NP-hard
- Motivation for multi-objective VLSI optimization:
  - New issues can be included: power, delay, wirelength
- Motivation for using non-deterministic iterative heuristics (Simulated Annealing, Genetic Algorithm, Simulated Evolution, Tabu Search, Stochastic Evolution):
  - Stable
  - Hill climbing
  - Convergence
  - If engineered properly, the longer the CPU time (or the more CPU power) available, the better the quality of the solution, owing to wider exploration of the search space

Motivation for Parallel CAD
- Faster runtimes to achieve the same results
- Larger problem sizes, with greater CPU and memory requirements, can be handled
- Better quality results
- Cost-effective technology compared to a large multiprocessor supercomputer
- Utilization of the abundant compute power of PCs that remain idle for a large fraction of the time
- Availability and affordability of high-speed transmission links, on the order of gigabits per second

Environments
- Cluster of PCs (distributed computing)
- Three possibilities:
  - Parallel programming languages, e.g.:
    - High Performance Fortran (HPF, for multiprocessors)
    - OpenMP (a directive-based API for parallel programming)
  - Communication libraries (in C, C++, Fortran), e.g.:
    - Parallel Virtual Machine (PVM)
    - Message Passing Interface (MPI)
  - Lightweight threads (using OS threads for parallelization):
    - POSIX (Portable Operating System Interface) threads
    - Java threads
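For concreteness, the following is a minimal MPI skeleton in C of the kind any of the parallel heuristics discussed later would build on; it only initializes the runtime, queries the process rank, and shuts down. The file name and commands are our own illustration, not part of the project.

```c
/* hello_mpi.c -- minimal SPMD skeleton.
   Build: mpicc -o hello_mpi hello_mpi.c
   Run:   mpirun -np 4 ./hello_mpi            */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total? */
    printf("Process %d of %d is alive\n", rank, size);
    MPI_Finalize();
    return 0;
}
```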

Problem Formulation & Cost Functions
- Cell placement problem:
  - Placement is the process of finding a suitable location for each cell on the entire layout
  - By suitable locations we mean those that minimize a given objective, subject to certain constraints (here, the layout width)
- In this work we target the minimization of:
  - Power
  - Delay
  - Wirelength
- subject to a constraint on the width of the layout

Cost Function for Power
In standard CMOS technology, power dissipation is a function of the clock frequency, the supply voltage, and the capacitances in the circuit (recall that static power is essentially zero, apart from leakage). The total power is estimated as

$$P_{total} = \beta \, f_{clk} \, V_{dd}^{2} \sum_{i=1}^{n} p_i \, C_i$$

where
- P_total is the total power dissipated,
- p_i is the switching probability of gate i,
- C_i is the capacitive load of gate i,
- f_clk is the clock frequency,
- V_dd is the supply voltage, and
- β is a technology-dependent constant.
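A small sketch of how this cost might be computed, directly from the formula above; the Gate structure and all identifiers are our illustrative assumptions, not the project's code.

```c
/* Power cost: P_total = beta * f_clk * Vdd^2 * sum_i (p_i * C_i).
   Struct and parameter names are illustrative. */
typedef struct {
    double p;   /* switching probability of the gate  */
    double c;   /* capacitive load driven by the gate */
} Gate;

double power_cost(const Gate *gates, int n,
                  double beta, double f_clk, double vdd)
{
    double activity = 0.0;
    for (int i = 0; i < n; i++)
        activity += gates[i].p * gates[i].c;   /* switching x load */
    return beta * f_clk * vdd * vdd * activity;
}
```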

Cost Function for Delay
The delay of a given long path is computed as the sum of the delays of the nets v_1, v_2, ..., v_k belonging to that path and the switching delays of the cells driving these nets:

$$T_{path} = \sum_{i=1}^{k} \left( CD_{v_i} + ID_{v_i} \right)$$

where
- CD_{v_i} is the switching delay of the driving cell, and
- ID_{v_i} is the interconnection delay, given by the product of the load factor of the driving cell and the capacitance of the interconnection net: ID_{v_i} = LF_{v_i} · C_{v_i}

Only a selected set of long paths is targeted.
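The corresponding computation, again as a sketch with assumed names:

```c
/* Path delay: T = sum over driving cells of (CD + LF * C).
   The PathCell struct is an illustrative assumption. */
typedef struct {
    double cd;   /* switching delay of the driving cell        */
    double lf;   /* load factor of the driving cell            */
    double c;    /* capacitance of the driven interconnect net */
} PathCell;

double path_delay(const PathCell *v, int k)
{
    double t = 0.0;
    for (int i = 0; i < k; i++)
        t += v[i].cd + v[i].lf * v[i].c;   /* cell + interconnect delay */
    return t;
}
```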

Cost Function for Wirelength
A Steiner tree approximation is used for each multi-pin net, and the sum of all Steiner tree lengths is taken as the estimated wirelength of the proposed solution. For a net with k contributing cells, the tree length is estimated as

$$L_{net} = B + \sum_{j=1}^{k} D_j$$

where B is the length of the bisecting line and D_j is the perpendicular distance from cell j to the bisecting line.
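A sketch of this estimate for one net. We assume a horizontal bisecting line at a given y-coordinate, so B spans the horizontal extent of the pins; the function and variable names are illustrative.

```c
/* Bisecting-line Steiner estimate: B + sum_j D_j for one net.
   Assumes a horizontal bisecting line at y = y_bisect. */
#include <math.h>

double steiner_estimate(const double *x, const double *y,
                        int k, double y_bisect)
{
    double xmin = x[0], xmax = x[0], dsum = 0.0;
    for (int j = 0; j < k; j++) {
        if (x[j] < xmin) xmin = x[j];
        if (x[j] > xmax) xmax = x[j];
        dsum += fabs(y[j] - y_bisect);  /* perpendicular distance D_j */
    }
    return (xmax - xmin) + dsum;        /* B + sum of distances       */
}
```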

Width Cost
- The width cost is the maximum of all the row widths in the layout
- The layout width is constrained not to exceed the average row width w_avg by more than a certain positive ratio α
- This can be expressed as

$$Width(x) - w_{avg} \le \alpha \cdot w_{avg}$$

Fuzzification of Cost
- The three objectives are optimized simultaneously, subject to the width constraint; fuzzy logic is used to integrate the multiple objectives into a scalar cost function
- This is realized with the ordered weighted average (OWA) fuzzy operator: the membership µ(x) of a solution x in the fuzzy set of acceptable solutions can be written as

$$\mu(x) = \beta \cdot \min_{j} \mu_j(x) + (1 - \beta) \cdot \frac{1}{3} \sum_{j=1}^{3} \mu_j(x)$$

where the µ_j(x), j ∈ {power, delay, wirelength}, are the individual objective memberships and β ∈ [0, 1] controls the trade-off between the minimum and the average.
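In code, the aggregation is a short loop over the per-objective memberships; this sketch assumes the β·min + (1−β)·mean form given above, and all names are ours.

```c
/* OWA aggregation of per-objective memberships mu[0..n-1]:
   mu(x) = beta * min_j mu_j  +  (1 - beta) * mean_j mu_j.   */
double owa_membership(const double *mu, int n, double beta)
{
    double lo = mu[0], sum = 0.0;
    for (int j = 0; j < n; j++) {
        if (mu[j] < lo) lo = mu[j];
        sum += mu[j];
    }
    return beta * lo + (1.0 - beta) * (sum / n);
}
```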

Review of Parallel Iterative Heuristics
For combinatorial optimization, three types of parallelization strategies are reported in the literature:
- The operations within an iteration of the solution method are parallelized (for example, the cost computation in Tabu Search)
- The search space (problem domain) is decomposed
- Multi-search threads (for example, the built-in facility in PVM that generates tasks and distributes them to different processors)

Parallelization of Tabu Search
Crainic et al. classify parallel Tabu Search heuristics using a taxonomy along three dimensions:
- Control cardinality: 1-control or p-control (in 1-control there is one master, as in master/slave; in p-control every processor has some control over the execution of the algorithm, such as maintaining its own tabu list)
- Control and communication type:
  - Rigid Synchronization (RS): the slave just executes the master's instructions and synchronizes
  - Knowledge Synchronization (KS): more information is sent from the master to the slave
  - Collegial (C): asynchronous; the power to execute is given to all processors
  - Knowledge Collegial (KC): similar to Collegial, but the master gives more information to the slaves

Parallelization of Tabu Search (continued)
- Search differentiation:
  - SPSS (single point, single strategy)
  - SPDS (single point, different strategies)
  - MPSS (multiple points, single strategy)
  - MPDS (multiple points, different strategies)
- Parallel Tabu Search algorithms are also classified as:
  - Synchronous: 1-control:RS or 1-control:KS
  - Asynchronous: p-control:C or p-control:KC

Related Work (Parallel TS)
- Yamani et al. present a heterogeneous parallel Tabu Search for VLSI cell placement using PVM
- They parallelized the algorithm on two levels simultaneously:
  - On the higher level, the master starts a number of Tabu Search Workers (TSWs) and provides them with the same initial solution
  - On the lower level, the candidate list is constructed in parallel: each TSW starts its own Candidate List Workers (CLWs)
- In the taxonomy above, the algorithm is classified as:
  - Higher level: p-control, Rigid Synchronization, MPSS
  - Lower level: 1-control, Rigid Synchronization, MPSS

Parallel Genetic Algorithm
- Most GA parallelization techniques exploit the fact that GAs work with a population of chromosomes (solutions)
- The population is partitioned into subpopulations which evolve independently using a sequential GA
- Interaction among the smaller communities is occasionally allowed
- This parallel execution model is a more realistic simulation of natural evolution
- The reported parallelization strategies are:
  - Island model
  - Stepping-stone model
  - Neighborhood model

Parallelization Strategies for GA
- Island model (see the MPI sketch after this list):
  - Each processor runs a sequential GA
  - Periodically, subsets of elements are migrated
  - Migration is allowed between all subpopulations
- Stepping-stone model:
  - Similar to the island model, but communication is restricted to neighboring processors only
  - This model defines the fitness of individuals relative to other individuals in the local subpopulation
- Neighborhood model:
  - Every individual has its own neighborhood, defined by some diameter
  - Suitable for massively parallel machines, where each processor is assigned one element
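To make the island model concrete, here is a toy MPI sketch: each rank evolves its own subpopulation and periodically publishes its best individual to all other islands, each of which adopts the fittest migrant in place of its worst member. The bit-string genome scored by counting ones, and every identifier, are our illustrative assumptions, not the project's GA.

```c
/* Island-model GA skeleton: local evolution + periodic all-to-all migration. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define GENES 64   /* toy chromosome length           */
#define POP   32   /* individuals per island          */
#define GENS  50   /* generations between migrations  */

static int fitness(const char *g)           /* toy fitness: count ones */
{
    int f = 0;
    for (int i = 0; i < GENES; i++) f += g[i];
    return f;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    srand((unsigned)rank + 1);               /* distinct seed per island */

    char pop[POP][GENES];
    for (int i = 0; i < POP; i++)
        for (int j = 0; j < GENES; j++) pop[i][j] = (char)(rand() & 1);

    char *migrants = malloc((size_t)size * GENES);
    for (int epoch = 0; epoch < 10; epoch++) {
        /* Local evolution: mutate one gene per individual, revert if worse. */
        for (int g = 0; g < GENS; g++)
            for (int i = 0; i < POP; i++) {
                int before = fitness(pop[i]), j = rand() % GENES;
                pop[i][j] ^= 1;
                if (fitness(pop[i]) < before) pop[i][j] ^= 1;
            }
        /* Migration: every island shares its best with all others. */
        int best = 0, worst = 0;
        for (int i = 1; i < POP; i++) {
            if (fitness(pop[i]) > fitness(pop[best]))  best  = i;
            if (fitness(pop[i]) < fitness(pop[worst])) worst = i;
        }
        MPI_Allgather(pop[best], GENES, MPI_CHAR,
                      migrants, GENES, MPI_CHAR, MPI_COMM_WORLD);
        int src = 0;                          /* fittest migrant overall */
        for (int r = 1; r < size; r++)
            if (fitness(migrants + r * GENES) > fitness(migrants + src * GENES))
                src = r;
        memcpy(pop[worst], migrants + src * GENES, GENES);
    }
    int best = 0;
    for (int i = 1; i < POP; i++)
        if (fitness(pop[i]) > fitness(pop[best])) best = i;
    printf("island %d: best fitness %d/%d\n", rank, fitness(pop[best]), GENES);
    free(migrants);
    MPI_Finalize();
    return 0;
}
```

Restricting the exchange to rank±1 (e.g., with MPI_Sendrecv) would turn this into the stepping-stone model.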

Related Work (Parallel GA)
- Mohan & Mazumder report VLSI standard cell placement using a network of workstations
- Programs were written in C, using UNIX's rexec (remote execution) and socket functions
- They summarized their findings as follows:
  - The result quality obtained by the serial and parallel algorithms was the same
  - Statistical observations showed that, for any desired final result, the parallel version could provide close to linear speedup
  - Parameter settings for migration and crossover were identified for an optimal communication pattern
  - Static and dynamic load-balancing schemes were implemented to cope with network heterogeneity

Benchmarks
- ISCAS-89 benchmark circuits are used
- (Table: the benchmark circuits, s-series plus one c-series, with their gate and path counts; the numeric values are not legible in the transcript.)

Experimental Setup
- Hardware:
  - Cluster of 8 machines: x86 architecture, Pentium 4, 2 GHz clock, 256 MB memory
  - Cluster of 8 machines: x86 architecture, Pentium III, 600 MHz clock, 64 MB memory
- Connectivity:
  - Gigabit Ethernet switch
  - 100 Mbit/s Ethernet switch
- OS: Linux
- Available parallel environments: PVM, MPI

Tools
- Profiling tools:
  - Intel VTune Performance Analyzer
  - gprof (a standard Unix tool)
  - VAMPIR (an MPI profiling tool)
  - upshot (bundled with the MPICH implementation)
- Debugging tools:
  - TotalView (from Etnus)
  - GNU debugger (gdb)

Objectives
- The proposed research builds upon our previous work on the design and engineering of sequential iterative algorithms for VLSI standard cell placement
- The primary objective is to accelerate the exploration of the search space by employing the available computational resources, on the same problem and with the same test cases

Tasks Outline
1. Setting up the cluster environment
2. Porting of code, followed by review and analysis of the sequential implementation to identify performance bottlenecks
3. Literature review of work related to the parallelization of iterative heuristics
4. Investigation of different acceleration strategies
5. Collection of data and tools for implementation, analysis, and performance evaluation
6. Design and implementation of parallel heuristics
7. Compilation and comparison of results
8. Documentation of the developed software

Current Status
- Two clusters are ready and the switches procured; initial experiments indicate that the existing infrastructure is adequate for starting the project
- All serial code, initially developed on PCs under Windows, is being ported to the cluster environment running Linux
- Insight into the code is being gained, and bottlenecks are being identified using profiling tools
- Initial experiments with Tabu Search show promising results; much more remains to be done
- Some experiments are being carried out using Genetic Algorithms

A Simple Initial Implementation
- The first synchronous parallel Tabu Search procedure is built according to the 1-control, Rigid Synchronization (RS), SPSS strategy
- In this implementation, the master process executes the Tabu Search algorithm, while the candidate list (of neighborhood size N) is divided among p slaves
- Each slave evaluates the cost of its N/p moves (cell swaps) and reports the best of them to the master
- The master collects the solutions from all slaves and accepts the best among them as the initial solution for the next iteration
- Communication and exchange of information between the master and the slaves take place after every iteration (a sketch of the scheme follows)
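A minimal MPI sketch of this scheme, under our own simplifying assumptions: the cost function is a dummy stand-in for the placement cost, the "move" application is symbolic, and for brevity the master evaluates a share of the candidate list alongside the slaves, with a MINLOC reduction standing in for the explicit collection of slave results.

```c
/* 1-control, RS, SPSS sketch: broadcast solution, split candidate list,
   reduce to the cheapest move each iteration. All names are illustrative. */
#include <stdio.h>
#include <mpi.h>

#define NCELLS 100
#define NMOVES 64                 /* candidate-list size N */

static double move_cost(const int *sol, int move)
{
    /* Dummy placeholder for the placement cost after applying `move`. */
    return (double)((sol[move % NCELLS] * 31 + move) % 1000);
}

int main(int argc, char **argv)
{
    int rank, size, sol[NCELLS];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int i = 0; i < NCELLS; i++) sol[i] = i;   /* initial solution */

    for (int iter = 0; iter < 20; iter++) {
        /* Master holds the authoritative solution; everyone re-syncs. */
        MPI_Bcast(sol, NCELLS, MPI_INT, 0, MPI_COMM_WORLD);

        /* Evaluate this process's N/p share of the candidate moves. */
        struct { double cost; int move; } local = { 1e30, -1 }, best;
        for (int m = rank; m < NMOVES; m += size) {
            double c = move_cost(sol, m);
            if (c < local.cost) { local.cost = c; local.move = m; }
        }
        /* Cheapest move across all processes (RS-style synchronization). */
        MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC,
                      MPI_COMM_WORLD);

        if (rank == 0) {
            sol[best.move % NCELLS] ^= 1;   /* symbolic "apply the swap" */
            printf("iter %2d: move %2d, cost %.0f\n",
                   iter, best.move, best.cost);
        }
    }
    MPI_Finalize();
    return 0;
}
```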

Two Test Circuits (c3540 & s9234)
(The accompanying results plots for the two circuits are not reproduced in the transcript.)

Observations
Observations on the 1-control, RS, SPSS strategy:
- The total runtime decreased as the number of processors increased, while achieving the same quality of solution (except for the c3540 circuit with 2 processors, where communication outweighed computation)
- The speedup in runtime for the larger circuit (s9234) was:
  - 1.42 for 1 slave processor
  - 1.82 for 2 slave processors
  - 2.72 for 3 slave processors
  - 3.50 for 4 slave processors
  - 4.46 for 5 slave processors

Utility Value of the Project
- It is expected that our findings will help in addressing other NP-hard engineering problems
- The results can be used by other investigators in academia and industry to enhance existing methods
- The lab environment set up for the project will support future research and enable utilization of existing idle CPU resources; it will also help train graduate students in VLSI physical design, iterative algorithms, and parallelization

Conclusion
- The project aims at accelerating the runtime of iterative non-deterministic heuristics by exploiting available unused CPU resources
- It is expected that insight into how these heuristics behave in a parallel environment will lead to new parallelization strategies
- Initial experiments indicate that the available environment can provide sufficient support for this type of work