STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.)
Parallelization
Four types of computing (Flynn's taxonomy), classified by:
– Instruction streams (single, multiple) per clock cycle
– Data streams used (single, multiple) per clock cycle
Single Instruction Single Data (SISD): serial computing
Single Instruction Multiple Data (SIMD): multiple processors, GPU
Multiple Instruction Single Data (MISD): shared memory
Multiple Instruction Multiple Data (MIMD): cluster computing, multi-core CPU, multi-threaded, message-passing (IBM SP-x on hypercube; Intel single-chip Xeon Phi: "future of supercomputing")
Grid Computing & Cloud
Not necessarily parallel
Primary focus is the utilization of CPU cycles across networked machines
Just networked CPUs, but middle-layer software makes node utilization transparent
A major focus: avoid data transfer; run the code where the data are
Another focus: load balancing
Message-passing parallelization is possible: MPI, PVM, etc.
Community-specific Grids: CERN, Bio-grid, Cardio-vascular grid, etc.
Cloud: data-archiving focus, but really a commercial version of Grid; CPU utilization is under-sold but coming up: expect the service-oriented software business model to pick up
RAM Memory Utilization
Two types feasible:
Shared memory:
– Fast, possibly on-chip; no message-passing time; no dependency on a ‘pipe’ and its possible failure
– But consistency needs to be controlled explicitly, which may cause deadlock; that in turn needs a deadlock checking/breaking mechanism, which adds overhead
Distributed local memory:
– Communication overhead
– ‘Pipe’ failure is a practical problem
– Good model where threads are independent of each other
– Most general model for parallelization
– Easy to code, with a well-established library (MPI)
– Scaling up is easy: from on-chip to over-the-globe
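The two memory models above can be contrasted with a minimal Python sketch. It is an illustration only, under stated assumptions: both functions use threads inside one process, a `threading.Lock` stands in for explicit consistency control, and a `queue.Queue` stands in for the ‘pipe’. The function names `sum_shared` and `sum_message_passing` are hypothetical, and a real distributed-memory program would more likely use a library such as MPI.

```python
import queue
import threading


def sum_shared(chunks):
    """Shared-memory style: every worker updates one variable, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)       # independent local work
        with lock:                 # explicit consistency control
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_message_passing(chunks):
    """Message-passing style: workers share no state and send results through a queue."""
    pipe = queue.Queue()           # stand-in for the 'pipe' between nodes

    def worker(chunk):
        pipe.put(sum(chunk))       # one message carrying the partial result

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The collector receives exactly one message per worker.
    return sum(pipe.get() for _ in chunks)
```

In the first version the lock is the only thing preventing lost updates; in the second, no lock is needed because the workers never touch common data, at the cost of one message per worker.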
Threading Types
Two types feasible:
Static threading: the OS controls the threads; typically for single-core CPUs (why would one do it? The OS), but multi-core CPUs use it if the compiler guarantees safe execution
Dynamic threading: the program controls the threads explicitly; threads are created/destroyed as needed; this is the parallel computing model
Multi-threaded Fibonacci
Recursive Fib(n)
  if n <= 1 then return n
  else
    x = Fib(n-1)
    y = Fib(n-2)
    return (x + y)
Complexity: $O(\phi^n)$, where $\phi = (1 + \sqrt{5})/2 \approx 1.618$ is the golden ratio
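A runnable Python version of the serial recursion, as a sketch. The helper `fib_calls` is a hypothetical addition, not part of the slide: it counts how many calls the recursion makes, so the growth by a factor of roughly the golden ratio per step can be observed directly.

```python
def fib(n):
    """Serial recursive Fibonacci, following the pseudocode on the slide."""
    if n <= 1:
        return n
    x = fib(n - 1)
    y = fib(n - 2)
    return x + y


def fib_calls(n):
    """Number of calls that fib(n) makes, counting the top-level call."""
    if n <= 1:
        return 1
    return fib_calls(n - 1) + fib_calls(n - 2) + 1


PHI = (1 + 5 ** 0.5) / 2  # golden ratio, about 1.618

if __name__ == "__main__":
    for n in (10, 15, 20):
        ratio = fib_calls(n + 1) / fib_calls(n)
        print(n, fib(n), fib_calls(n), round(ratio, 3))
```

The ratio of call counts for consecutive inputs approaches the golden ratio as n grows, which is where the exponential bound comes from.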
Fibonacci
Recursive Fib(n)
  if n <= 1 then return n
  else
    x = Spawn Fib(n-1)
    y = Fib(n-2)
    Sync
    return (x + y)
Parallelization of threads is optional: the scheduler decides (programmer, script translator, compiler, OS)
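The Spawn/Sync structure can be sketched in Python with `threading.Thread`, where `start()` plays the role of Spawn and `join()` the role of Sync. This is an illustration of the control structure only: the `spawn_depth` parameter is a hypothetical knob standing in for the scheduler's decision not to parallelize further, and in standard CPython the global interpreter lock means these threads will not actually speed up this CPU-bound recursion.

```python
import threading


def fib_serial(n):
    """Plain serial recursion, used once spawning is switched off."""
    return n if n <= 1 else fib_serial(n - 1) + fib_serial(n - 2)


def fib_spawn(n, spawn_depth=3):
    """Fibonacci with Spawn/Sync; spawn_depth limits how deep threads are created."""
    if n <= 1:
        return n
    if spawn_depth <= 0:
        return fib_serial(n)       # the "scheduler" chooses to run serially

    result = {}

    def child():
        result["x"] = fib_spawn(n - 1, spawn_depth - 1)

    t = threading.Thread(target=child)
    t.start()                                   # Spawn Fib(n-1)
    y = fib_spawn(n - 2, spawn_depth - 1)       # parent continues with Fib(n-2)
    t.join()                                    # Sync: wait for the spawned child
    return result["x"] + y
```

Calling `fib_spawn(15)` returns the same value as the serial version; setting `spawn_depth=0` removes all spawning, which mirrors the point that parallelization is optional.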
GPU-type parallelization's ideal time ~ critical path length
The more balanced the tree is, the shorter the critical path
Each Spawn or data-collection node is counted as 1 time unit
This is message passing
Note: GPU/SIMD uses a different model: each thread does the same work (kernel), and data goes to shared memory
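To make the critical-path idea concrete, the sketch below counts nodes in the Fib call tree. It assumes a simplified cost model in which every call is one unit-time node (a simplification of the Spawn and data-collection nodes on the slide). The names `work`, `span`, and `balanced_span` are hypothetical helpers: `work` is the total node count, `span` is the number of nodes on the longest root-to-leaf path, and `balanced_span` gives the path length of a perfectly balanced binary tree with the same number of leaves.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def work(n):
    """Total number of unit-time nodes in the Fib(n) call tree."""
    if n <= 1:
        return 1
    return work(n - 1) + work(n - 2) + 1


@lru_cache(maxsize=None)
def span(n):
    """Nodes on the longest dependency chain (critical path) of the Fib(n) tree."""
    if n <= 1:
        return 1
    return max(span(n - 1), span(n - 2)) + 1


def balanced_span(leaves):
    """Critical path, in nodes, of a balanced binary tree with this many leaves."""
    return (leaves - 1).bit_length() + 1


if __name__ == "__main__":
    n = 20
    leaves = (work(n) + 1) // 2    # a full binary tree has (nodes + 1) / 2 leaves
    print(work(n), span(n), balanced_span(leaves))
```

Under this model the Fib(20) tree has a critical path of 20 nodes, while a balanced tree with the same number of leaves would need only 15, which illustrates why a more balanced tree shortens the critical path.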
Terminologies/Concepts
Running times for P available processors: $T_\infty$ (no limit on processors), $T_P$ (P processors), $T_1$ (serial, one processor)
Ideal parallelization: $T_P = T_1 / P$
Real situation: $T_P \ge T_1 / P$
$T_\infty$ is the theoretical minimum feasible, so $T_P \ge T_\infty$
Speedup factor $= T_1 / T_P$, with $T_1 / T_P \le P$
Linear speedup: $T_1 / T_P = \Theta(P)$ [e.g. $P/3$]
Perfect linear speedup: $T_1 / T_P = P$
My preferred factor would be $T_P / T_1$ (inverse speedup: slowdown factor?)
– linear $O(P)$; quadratic $O(P^2)$; ...; exponential $O(k^P)$, $k > 1$
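The two lower bounds on the P-processor running time can be combined in a few lines of Python. This is a sketch with hypothetical names (`speedup`, `best_tp`) and made-up timing values; it only evaluates the inequalities above and does not measure any real program.

```python
def speedup(t1, tp):
    """Speedup factor T_1 / T_P."""
    return t1 / tp


def best_tp(t1, t_inf, p):
    """Smallest T_P allowed by both bounds: T_P >= T_1 / P and T_P >= T_inf."""
    return max(t1 / p, t_inf)


if __name__ == "__main__":
    t1, t_inf = 1000.0, 50.0       # hypothetical serial time and critical-path time
    for p in (1, 4, 8, 20, 40):
        tp = best_tp(t1, t_inf, p)
        print(p, tp, speedup(t1, tp))
```

With these numbers, 8 processors can at best reach a speedup of 8 (perfect linear speedup), while 40 processors are capped at a speedup of 20 because the running time cannot drop below the critical-path time.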
Terminologies/Concepts
For P available processors: $T_\infty$, $T_P$, $T_1$ (from no limit on processors down to a serial processor)
Parallelism factor: $T_1 / T_\infty$
– serial time divided by ideal parallelized time
– note: this is about your algorithm, not optimized for the actual configuration available to you
$T_1 / T_\infty < P$ implies perfect linear speedup is NOT possible (since $T_P \ge T_\infty$, the speedup $T_1 / T_P \le T_1 / T_\infty$)
$T_1 / T_\infty \ll P$ implies processors are underutilized
We want to be close to P: $T_1 / T_\infty \to P$, as in a limit
Slackness factor: $(T_1 / T_\infty) / P$, or $T_1 / (T_\infty P)$
We want slackness $\to 1$, the minimum feasible
– i.e., we want no slack
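A small Python sketch of the parallelism and slackness factors, using hypothetical function names and made-up timing values. The `describe` helper simply labels the three cases discussed above; its wording is an assumption, not a standard classification.

```python
def parallelism(t1, t_inf):
    """Parallelism factor T_1 / T_inf."""
    return t1 / t_inf


def slackness(t1, t_inf, p):
    """Slackness factor (T_1 / T_inf) / P."""
    return parallelism(t1, t_inf) / p


def describe(t1, t_inf, p):
    """Label the relation between the algorithm's parallelism and P."""
    s = slackness(t1, t_inf, p)
    if s < 1:
        return "parallelism < P: perfect linear speedup not possible"
    if s == 1:
        return "parallelism = P: no slack"
    return "parallelism > P: more parallelism than processors"


if __name__ == "__main__":
    t1, t_inf = 1000.0, 50.0       # hypothetical values; parallelism is 20
    for p in (5, 20, 40):
        print(p, slackness(t1, t_inf, p), describe(t1, t_inf, p))
```

Here the algorithm's parallelism is 20, so 20 processors give slackness 1, 40 processors give slackness 0.5 (half of them cannot be kept busy), and 5 processors give slackness 4.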