A Process Splitting Transformation for Kahn Process Networks Sjoerd Meijer
2 Research Contents
– Background
– Problem Definition and Project Goal
– Splitting
  – Producer Selection
  – Inter-process Communication
  – Consumer Selection
– Implementation
– Conclusion and Further Work
3 Research Background
Parallelization is not new
Forking a sequential application
Classic example, matrix-matrix multiplication:
– Master processor executes code up to the parallel loop
– Execute parallel iterations on other processors
– Synchronize at end of parallel loop
[Figure: four CPUs, each with its own cache, connected by a memory bus to main memory]
    :
    do i, n
      do j, n
        a(i,j) = …
        :
      end do
    :
4 Research Background Applications are specified as parallel tasks: Example JPEG decoder :
5 Research Problem Definition
[Figure: Cakesim (eCos+CCP) profile for the JPEG KPN]
6 Research Problem Definition Automatic procedure for process splitting in KPNs to take advantage of multiprocessor architectures. Original process network Split-up network:
7 Research Splitting – The Concept
Required:
– Determine the computationally expensive process: profiling, or pragmas + static support
– Partitioning of the Iteration Space (IS):
  N = number of times a process has to be split
  L = loop-nest level at which the splitting takes place
To do:
– Duplication of code and FIFOs
– Adding control for token production and consumption
8 Research Techniques used:
Data dependence analysis:
– Data flow analysis
– Array data flow analysis
Tree transformations:
– Adding/removing/duplicating tree statements
Compiler framework:
– GCC
9 Research Solution for KPNs
Four-step approach:
COMPUTATION:
1. Partitioning (computation)
COMMUNICATION:
2. Interprocess communication
3. Token production
4. Token consumption
[Figure: network P1, P2, P3 becomes P1, P21, P22, P3, with P2 split into P21 and P22]
10 Research Partitioning of the original process computation over the resulting split-up processes
11 Research Interprocess Communication
    :
    for(int i=1; i<10; i++)
        a[i] = a[i-1] + i; //s1
    :
Interprocess communication is given by the loop-carried dependency: a[i-1] at iteration i is produced at iteration i-1. If execution of stmt s1 is distributed over different processes, token a[i-1] needs to be communicated.
First split-up process:
    :
    for(int i=1; i<10; i++){
        if(i%2==0)
            a[i] = a[i-1] + i;
    }
    :
Second split-up process:
    :
    for(int i=1; i<10; i++){
        if(i%2==1)
            a[i] = a[i-1] + i;
    }
    :
12 Research Token Production & Consumption
Problems:
– PI. At the producer side: where to send the tokens to?
– PII. At the consumer side: from where to consume tokens?
Solutions PI:
1. Producer filters the tokens (static solution)
2. Producer sends all tokens to all split-up processes (run time solution)
Solutions PII:
1. The consumer knows by itself when to switch (static solution)
2. Each producer sends a signal to the consumer when to switch reading data from a different FIFO (run time solution)
[Figure: P1 feeding P2, with P2 split into P2’ and P2’’ (where to send?); P2 feeding P3, with P2 split into P2’ and P2’’ (where to read from?)]
13 Research Token Production – runtime vs. static
Original: P1 sends 100 tokens to P2.
Static solution: P1 sends 50 tokens to each of P2’ and P2’’.
Runtime solution: P1 sends 100 tokens to each of P2’ and P2’’.
14 Research Token Consumption – runtime vs. static
Original: P2 sends 100 tokens to P3.
Static solution: P2’ and P2’’ each send 50 tokens to P3; the switch is known internally by the consumer.
Runtime solution: P2’ and P2’’ each send 50 tagged tokens to P3; the switch is communicated over the channels to the consumer.
15 Research Token Production & Consumption – static solution
Establish the data dependencies over the processes. How?
Data Dependence function (DD) and its inverse DD^{-1}:
– DD^{-1}: Producer → Consumer
– DD: Consumer → Producer
However, DD cannot always be determined at compile time.
16 Research Token Production – static solution without DD^{-1}
Observation: the loop counters on the producer side equal the loop counters on the consumer side.
17 Research Token Production – static solution without DD^{-1}
DD^{-1}(w1,w2,w3) = (w4,w5,w6)
P2(DD^{-1}(w1,w2,w3)) = w5
w5 = w2 => P2(DD^{-1}(w1,w2,w3)) % 2 = w2 % 2
18 Research Token Consumption – static solution without DD Similar to production of tokens.
19 Research Runtime solution:
20 Research Multiple split-up processes
Split-up into 3 processes:
[Figure: original network P1, P2, P3, P4; after splitting, P2 becomes P2’, P2’’, P2’’’ and P3 becomes P3’, P3’’, P3’’’, between P1 and P4]
21 Research Copy-nodes
[Figure, three stages: original network P1, P2, P3, P4; the network after copy-nodes insertion; the network after the splitting transformation, with P2’, P2’’ and P3’, P3’’ between P1 and P4]
22 Research Copy-nodes
Pros:
– Simple network structure
– Apply four-step splitting approach
Cons:
– More processes => more communication (can be improved) => overhead
23 Research Implementation
Used technique:
– Runtime solution (general)
Used framework:
– GCC (GNU Compiler Collection)
Advantages of GCC:
– Availability of data dependence information
– Supported by a large community
– We are in contact with Sebastian Pop, maintainer and developer of various compiler phases, e.g. the data dependence analysis, control flow, and induction variable analysis.
24 Research Implementation
Data dependence analysis (already present):
– scalars
– arrays
Data Dependence Graph (DDG) present only on RTL level, not on tree SSA
Two new passes:
1. Create DDG
2. Splitting
25 Research Implementation
    Function foo() {
        :
        //stmt1
        :
        for () {
            //stmt2
        }
        :
        //stmt3
        :
    }
Data dependence? Check the DDG.
If there is no loop-carried data dependence, modify the Tree/CFG:
– duplicate basic blocks
– create if-condition
26 Research Implementation 1.Splitting pragma 2.Data dependence graph 3.Class definition reconstruction 4.Function cloning 5.Modulo condition insertion
27 Research Implementation To do: 1.Copying of class definition 2.Copying of class member functions 3.Reconstruction network structure –FIFO –Network definition
28 Research Implementation
Final result:
– Data dependence information tells whether splitting is legal (no IPC)
– Semi-automatic transformation/case study
29 Research Results
Improvement of 21%
[Chart comparing: original KPN; KPN with copy nodes; processes split up into two]
30 Research Future work: YAPI and CCP
Difference in active and passive connectors:
– Active connectors in YAPI are modeled as a thread
– Passive connectors do not run in a separate thread
More connectors in CCP: Mesh, Merge, Fork
[Figure: split-up network with P1, P2’, P2’’, P3’, P3’’, P4]
31 Research Future Work
Connect GCC with SCOTTY:
GCC branch
– Main branch: may not accept the patch
– GOMP branch targets parallelization
+ data dependence
+ Network topology
32 Research Conclusion
Only split up the most computationally expensive processes
The transformation is profitable
34 Research Building threads within sequential applications
Another transformation: creation of threads
Why another transformation: widely applicable, also outside the context of KPNs
35 Research Technique
Original sequential code:
    int main(){
        int i;
        const int N=200;
        float a[N], b[N];
        for(i=0; i<N; i++)
            a[i] = i;
        for(i=0; i<N; i++)
            b[i] = sqrt(a[i]) + sqrt(a[i]*2) / (N*i) + log(i+1);
        return 0;
    }
After thread creation (pfunc2 is not shown on the slide):
    pthread_t thread1, thread2;
    /* thread entry points take and return void *, so no cast is needed */
    void *pfunc1(void *arg) {
        for(i=0; i<N; i++){
            if(i%2==0){
                b[i] = sqrt(a[i]) + sqrt(a[i]*2) / (N*i) + log(i+1);
            }
        }
        pthread_exit(0);
    }
    pthread_create(&thread1, NULL, pfunc1, NULL);
    pthread_create(&thread2, NULL, pfunc2, NULL);
36 Research Technique: process splitting vs. thread creation
Process splitting:
– Extra input and output FIFOs
Threads:
– Competing for input tokens
– Unknown running time; as a result, the output order of tokens is not respected.
37 Research Example – put it all together
39 Research Data Dependencies
– True/flow dependency, S1 δ^f S2:
    S1: X = …
    S2: … = X
– Output dependency, S1 δ^o S2:
    S1: X = …
    S2: X = …
– Anti-dependency, S1 δ^a S2:
    S1: … = X
    S2: X = …