Parallel Algorithms Lecture Notes
Motivation
Programs face two perennial problems:
– Time: run faster when solving a problem. Example: reduce the time needed to sort 10 million records
– Size: solve a “bigger” problem. Example: multiply matrices of large dimensions (a PC with 512 MB of RAM can store at most 8192*8192 elements of type double, 8 bytes each)
Possible solution: parallelism
– Split a problem into several tasks and perform them in parallel
– A parallel computer, broadly defined: a set of processors that are able to work cooperatively to solve a computational problem. Includes: parallel supercomputers, clusters of workstations, multiple-processor workstations
Concepts …
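Logical vs physical parallelism
A concurrent program with 3 processes: P0, P1, P2
– Program executed on a system with 1 processor: logical parallelism (multi-programming). The processes are interleaved on the single processor between time 0 and T
– Program executed on a system with 3 processors: physical parallelism (multi-processing). Each process runs on its own processor between time 0 and T
[Figure: timelines of P0, P1 and P2 from 0 to T, on 1 processor and on 3 processors]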
concurrent-distributed-parallel
Parallel computing vs distributed computing
Parallel computing:
– Most often all processors are of the same type
– Most often the processors are located in the same location
– Overall goal = Speed (doing a job faster)
Distributed computing:
– Processors are heterogeneous
– Processors are distributed over a wide area
– Overall goal = Convenience (using resources, increasing reliability)
Parallelizing sequential code
The enabling condition for doing 2 tasks in parallel: no dependences between them!
Parallelizing compilers: compile sequential programs into parallel code
– Research goal since the 1970s
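To make the dependence condition concrete, here is a minimal sketch in C. The function names are illustrative and not part of the lecture. In the first loop each iteration reads and writes only its own element, so the iterations could run in any order or in parallel. In the second loop each iteration reads the value written by the previous one (a loop-carried dependence), so the loop as written must run in order.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: a[i] depends only on its own old value,
   so the iterations could be executed in any order or in parallel. */
void scale_all(double *a, size_t n, double factor)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * factor;
}

/* Loop-carried dependence: a[i] needs the value of a[i - 1] computed in
   the previous iteration, so the loop as written must run in order. */
void running_sum(double *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```

A compiler that wants to parallelize the second loop would have to replace the algorithm, not just reorder its iterations.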
Example: Adding n numbers
Sequential solution:
    sum = 0;
    for (i = 0; i < n; i++) {
        sum += A[i];
    }
Cost: O(n)
The sequential algorithm cannot be parallelized in a straightforward way, since every iteration depends on the result of the previous one
Parallelizing = re-thinking the algorithm!
– Summing in sequence: always O(n)
– Summing in pairs: O(n) with P=1 processor, O(log n) with P=n/2 processors
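Summing in pairs can be sketched as follows. This version runs sequentially, and the function name is an assumption made for illustration. Each pass of the outer loop is one level of the summation tree. The additions inside one level touch disjoint elements, so with enough processors they could all be done at the same time, and there are about log2(n) levels.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n values by repeatedly adding pairs, in place (a is overwritten).
   Outer loop: one level of the summation tree, about log2(n) levels.
   Inner loop: independent additions that could run in parallel. */
long pairwise_sum(long *a, size_t n)
{
    assert(a != NULL || n == 0);
    if (n == 0)
        return 0;
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];
    }
    return a[0];
}
```

With a single processor the total work is still n-1 additions, which matches the O(n) bound for P=1.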
It is unlikely that a compiler will produce good parallel code from a sequential specification any time soon…
Fact: for most computations, a “best” sequential solution (practically, not theoretically) and a “best” parallel solution are usually fundamentally different …
– Different solution paradigms imply that the computations are not “simply” related
– Compiler transformations generally preserve the solution paradigm
Therefore the programmer must discover the parallel solution!
Sequential vs parallel programming
Compared with sequential programming, parallel programming:
– Has different costs, different advantages
– Requires different, unfamiliar algorithms
– Must use different abstractions
– Makes a program’s behavior more complex to understand
– Makes the interactions of the program’s components more difficult to control
– Relies on more primitive knowledge, tools and understanding
Example: Count number of 3’s
Sequential solution:
    count = 0;
    for (i = 0; i < length; i++) {
        if (array[i] == 3)
            count++;
    }
Cost: O(n)
Example: Trial solution 1
Divide the array into t=4 chunks
Assign each chunk to a different concurrent task identified by id=0...t-1
Code of each task:
    /* assumes length is a multiple of t */
    int length_per_thread = length / t;
    int start = id * length_per_thread;
    for (i = start; i < start + length_per_thread; i++) {
        if (array[i] == 3)
            count += 1;   /* unprotected update of a shared variable */
    }
Problem: race condition! This is not a correct concurrent program: accesses to the same shared memory (the variable count) must be protected
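Separately from the race condition, the chunking itself deserves care: length/t truncates, so when length is not a multiple of t the last length % t elements are never examined. One possible way to compute the bounds is sketched below. The helper name chunk_bounds is hypothetical and not from the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the half-open range [*start, *end) that task `id` (0 <= id < t)
   scans in an array of `length` elements. The first (length % t) tasks get
   one extra element, so every element is covered exactly once. */
void chunk_bounds(size_t length, size_t t, size_t id,
                  size_t *start, size_t *end)
{
    assert(t > 0 && id < t);
    size_t base  = length / t;   /* minimum chunk size */
    size_t extra = length % t;   /* leftover elements to spread out */
    *start = id * base + (id < extra ? id : extra);
    *end   = *start + base + (id < extra ? 1 : 0);
}
```

For length=10 and t=4 this gives the ranges [0,3), [3,6), [6,8) and [8,10).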
Example: Trial solution 2
Correct the previous trial solution by adding mutex locks to prevent concurrent accesses to the shared variable count
Code of each task:
    mutex m;   /* shared by all tasks */
    for (i = start; i < start + length_per_thread; i++) {
        if (array[i] == 3) {
            mutex_lock(m);
            count++;
            mutex_unlock(m);
        }
    }
Problem: VERY slow! There is no real parallelism: the tasks wait for each other all the time
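For reference, trial solution 2 could look roughly like this as a complete program fragment. It assumes POSIX threads as the threading library, which the slides do not specify. The names count_threes_locked, count_task, task_arg and MAX_TASKS are illustrative assumptions.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_TASKS 16

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long count;                 /* shared counter, protected by m */

struct task_arg {
    const int *array;
    size_t start, end;             /* task scans array[start..end) */
};

static void *count_task(void *p)
{
    const struct task_arg *arg = p;
    for (size_t i = arg->start; i < arg->end; i++) {
        if (arg->array[i] == 3) {
            pthread_mutex_lock(&m);    /* one lock/unlock per 3 found */
            count++;
            pthread_mutex_unlock(&m);
        }
    }
    return NULL;
}

/* Count the 3's in array[0..length) using t tasks.
   Returns -1 if t is out of range or a thread cannot be created.
   Not reentrant: it uses the global count and m. */
long count_threes_locked(const int *array, size_t length, size_t t)
{
    pthread_t tid[MAX_TASKS];
    struct task_arg arg[MAX_TASKS];
    size_t started = 0;

    if (t == 0 || t > MAX_TASKS)
        return -1;
    count = 0;
    size_t chunk = length / t;
    for (size_t id = 0; id < t; id++) {
        arg[id].array = array;
        arg[id].start = id * chunk;
        /* the last task also takes the leftover elements */
        arg[id].end = (id == t - 1) ? length : (id + 1) * chunk;
        if (pthread_create(&tid[id], NULL, count_task, &arg[id]) != 0)
            break;
        started++;
    }
    for (size_t id = 0; id < started; id++)
        pthread_join(tid[id], NULL);
    return (started == t) ? count : -1;
}
```

The result is correct, but every 3 found costs a lock and an unlock on the same mutex, which is where the tasks end up waiting for each other.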
Example: Trial solution 3
Each processor adds into its own private counter; the partial counts are combined at the end
Code of each task:
    for (i = start; i < start + length_per_thread; i++) {
        if (array[i] == 3) {
            private_count[id]++;
        }
    }
    mutex_lock(m);
    count += private_count[id];
    mutex_unlock(m);
Problem: STILL no speedup measured when using more than 1 processor!
Reason: false sharing
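Example: false sharing
The elements of private_count are adjacent in memory, so several of them lie in the same cache line. Each write by one processor invalidates that line in the caches of the other processors, even though the processors never touch the same counter.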
Example: solution 4
Force each private counter onto a separate cache line by “padding” it with “unused” locations
    struct padded_int {
        int value;
        char padding[128];
    } private_count[MaxThreads];
Finally a speedup is measured when using more than 1 processor!
Conclusion: producing correct and efficient parallel programs can be considerably more difficult than writing correct and efficient serial programs!
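Putting the padded counters to work might look like the sketch below. It again assumes POSIX threads, and the names count_threes_padded, count_worker, worker_arg and MAX_THREADS are illustrative. One deliberate difference from the slides: the partial counts are added up by the calling thread after it has joined all workers, so no mutex is needed for the final combination.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

/* One counter per thread, padded so that no two counters share a cache
   line (assumes cache lines of at most 128 bytes). */
struct padded_int {
    int value;
    char padding[128];
};

static struct padded_int private_count[MAX_THREADS];

struct worker_arg {
    const int *array;
    size_t start, end;             /* worker scans array[start..end) */
    size_t id;                     /* index into private_count */
};

static void *count_worker(void *p)
{
    const struct worker_arg *arg = p;
    for (size_t i = arg->start; i < arg->end; i++) {
        if (arg->array[i] == 3)
            private_count[arg->id].value++;   /* only this thread writes here */
    }
    return NULL;
}

/* Count the 3's in array[0..length) using t threads.
   Returns -1 if t is out of range or a thread cannot be created.
   Not reentrant: it uses the global private_count array. */
long count_threes_padded(const int *array, size_t length, size_t t)
{
    pthread_t tid[MAX_THREADS];
    struct worker_arg arg[MAX_THREADS];
    size_t started = 0;
    long total = 0;

    if (t == 0 || t > MAX_THREADS)
        return -1;
    size_t chunk = length / t;
    for (size_t id = 0; id < t; id++) {
        private_count[id].value = 0;
        arg[id].array = array;
        arg[id].start = id * chunk;
        /* the last thread also takes the leftover elements */
        arg[id].end = (id == t - 1) ? length : (id + 1) * chunk;
        arg[id].id = id;
        if (pthread_create(&tid[id], NULL, count_worker, &arg[id]) != 0)
            break;
        started++;
    }
    for (size_t id = 0; id < started; id++) {
        pthread_join(tid[id], NULL);
        total += private_count[id].value;     /* safe: worker has finished */
    }
    return (started == t) ? total : -1;
}
```

Whether this actually runs faster than the sequential loop depends on the machine and the array size, so it is worth timing both.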
Goals of Parallel Programming
– Performance: the parallel program runs faster than its sequential counterpart (a speedup is measured)
– Scalability: as the size of the problem grows, more processors can be “usefully” added to solve the problem faster
– Portability: the solutions run well on different parallel platforms