1 Institute of Computing Technology, Chinese Academy of Sciences
Working and Researching on Open64
Hongtao Yu, Feng Li, Wei Huo, Wei Mi, Li Chen, Chunhui Ma, Wenwen Xu, Ruiqi Lian, Xiaobing Feng

2 Outline
Reform Open64 as an aggressive program analysis tool
–Source code analysis and error checking
Source-to-source transformation
–WHIRL to C
Extending UPC for GPU cluster
New targeting
–Target to LOONGSON CPU

3 Part Ⅰ Aggressive program analysis

4 Whole Program Analysis (WPA)
Aims at error checking
A framework consisting of:
Pointer analysis
–The foundation of other program analyses
–Flow- and context-sensitive
Program slicing
–Interprocedural
–Reduces program size for specific problems

5 WPA Framework
(framework diagram) IPL summary phase → IPA_LINK (Whole Program Analyzer): build call graph, construct SSA form for each procedure, FSCS pointer analysis (LevPA), static slicer, static error checker.

6 LevPA: Level-by-Level Pointer Analysis
A flow- and context-sensitive pointer analysis
Analyzes millions of lines of code quickly
Published as: Hongtao Yu, Jingling Xue, Wei Huo, Zhaoqing Zhang, Xiaobing Feng. Level by Level: Making Flow- and Context-Sensitive Pointer Analysis Scalable for Millions of Lines of Code. In Proceedings of the 2010 International Symposium on Code Generation and Optimization (CGO), April 24-28, 2010, Toronto, Canada.

7 LevPA
Level-by-level analysis
–Analyze the pointers in decreasing order of their points-to levels
 Suppose int **q, *p, x; then q has level 2, p has level 1, and x has level 0.
 A variable can be referenced directly, or indirectly through dereferences of another pointer.
–Fast flow-sensitive analysis on full-sparse SSA
–Fast and accurate context-sensitive analysis using full transfer functions
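A minimal C sketch of the points-to-level ordering (the variable names mirror the slide; the comments are illustrative, not the analysis itself):

    int   x;        /* ptl(x) = 0: x holds no pointer           */
    int  *p = &x;   /* ptl(p) = 1: *p reaches level-0 storage   */
    int **q = &p;   /* ptl(q) = 2: *q reaches a level-1 pointer */

    /* LevPA analyzes level 2 (q) first, then level 1 (p), then level 0 (x),
     * so by the time the level-1 pointers are analyzed, every indirect access
     * made through a level-2 dereference such as *q is already resolved. */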

8 Framework
Figure 1. Level-by-level pointer analysis (LevPA): compute the points-to levels, then, for each points-to level from the highest to the lowest, evaluate transfer functions bottom-up and propagate points-to sets top-down, incrementally building the call graph.

9 Example

int obj, t;
main() {
 L1:  int **x, **y;
 L2:  int *a, *b, *c, *d, *e;
 L3:  x = &a; y = &b;
 L4:  foo(x, y);
 L5:  *b = 5;
 L6:  if ( … ) { x = &c; y = &e; }
 L7:  else     { x = &d; y = &d; }
 L8:  c = &t;
 L9:  foo(x, y);
 L10: *e = 10;
}

void foo(int **p, int **q) {
 L11: *p = *q;
 L12: *q = &obj;
}

Points-to levels: ptl(x, y, p, q) = 2, ptl(a, b, c, d, e) = 1, ptl(t, obj) = 0.
Analyze first { x, y, p, q }, then { a, b, c, d, e }, last { t, obj }.

10-12 Bottom-up: analyze level 2

void foo(int **p, int **q) {
 L11: *p1 = *q1;
 L12: *q1 = &obj;
}
p1's points-to set depends on formal-in p; q1's points-to set depends on formal-in q.

main() {
 L1:  int **x, **y;
 L2:  int *a, *b, *c, *d, *e;
 L3:  x1 = &a; y1 = &b;
 L4:  foo(x1, y1);
 L5:  *b = 5;
 L6:  if ( … ) { x2 = &c; y2 = &e; }
 L7:  else     { x3 = &d; y3 = &d; }
      x4 = ϕ(x2, x3); y4 = ϕ(y2, y3);
 L8:  c = &t;
 L9:  foo(x4, y4);
 L10: *e = 10;
}
Level-2 points-to results in main:
x1 → { a }   y1 → { b }
x2 → { c }   y2 → { e }
x3 → { d }   y3 → { d }
x4 → { c, d }   y4 → { e, d }

13 Full-Sparse Analysis
Achieve flow-sensitivity flow-insensitively
–Regard each SSA name as a unique variable
–Set-constraint-based pointer analysis
Full sparse
–Saving time
–Saving space
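A toy C sketch (not the Open64 implementation) of the full-sparse idea above: because every SSA name gets its own points-to set, a flow-insensitive constraint solver still produces flow-sensitive results.

    #include <stdio.h>

    enum { A, B };                       /* abstract objects            */
    enum { X1, X2, Y1, NNAME };          /* one node per SSA name       */
    static unsigned pts[NNAME];          /* points-to sets as bit masks */

    int main(void) {
        pts[X1] |= 1u << A;              /* x1 = &a                     */
        pts[X2] |= 1u << B;              /* x2 = &b                     */
        pts[Y1] |= pts[X1];              /* y1 = x1 (subset constraint) */
        /* y1 points to a but not b: x1 and x2 are distinct nodes, so the
         * later assignment to x2 never pollutes y1's set.              */
        printf("y1->a: %d  y1->b: %d\n",
               !!(pts[Y1] & (1u << A)), !!(pts[Y1] & (1u << B)));
        return 0;
    }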

14 Top-down: analyze level 2
main: propagate points-to sets to the call sites of foo
At L4: foo.p → { a }, foo.q → { b }
At L9: foo.p → { c, d }, foo.q → { d, e }
Merged at foo's entry: foo.p → { a, c, d }, foo.q → { b, d, e }

15 Top-down: analyze level 2
foo: expand the pointer dereferences (merging the calling contexts here)
void foo(int **p, int **q) {
      μ(b, d, e)
 L11: *p1 = *q1;
      χ(a, c, d)
 L12: *q1 = &obj;
      χ(b, d, e)
}

16 Context Condition
To be context-sensitive:
Points-to relation ci
–p ⇒ v (p → v): p must (may) point to v, where p is a formal parameter
Context condition ℂ(c1,…,ck)
–a Boolean function over higher-level points-to relations
Context-sensitive μ and χ
–μ(vi, ℂ(c1,…,ck))
–vi+1 = χ(vi, M, ℂ(c1,…,ck)), where M ∈ {may, must} indicates a weak/strong update

17 Context-sensitive μ and χ
void foo(int **p, int **q) {
      μ(b, q ⇒ b)  μ(d, q → d)  μ(e, q → e)
 L11: *p1 = *q1;
      a = χ(a, must, p ⇒ a)  c = χ(c, may, p → c)  d = χ(d, may, p → d)
 L12: *q1 = &obj;
      b = χ(b, must, q ⇒ b)  d = χ(d, may, q → d)  e = χ(e, may, q → e)
}

18 Bottom-up: analyze level 1
void foo(int **p, int **q) {
      μ(b1, q ⇒ b)  μ(d1, q → d)  μ(e1, q → e)
 L11: *p1 = *q1;
      a2 = χ(a1, must, p ⇒ a)  c2 = χ(c1, may, p → c)  d2 = χ(d1, may, p → d)
 L12: *q1 = &obj;
      b2 = χ(b1, must, q ⇒ b)  d3 = χ(d2, may, q → d)  e2 = χ(e1, may, q → e)
}
Transfer functions:
Trans(foo, a) = < …, p ⇒ a, must >
Trans(foo, c) = < …, p → c, may >
Trans(foo, b) = < …, q ⇒ b, must >
Trans(foo, e) = < …, q → e, may >
Trans(foo, d) = < …, p → d ∨ q → d, may >

19 Bottom-up: analyze level 1
int obj, t;
main() {
 L1:  int **x, **y;
 L2:  int *a, *b, *c, *d, *e;
 L3:  x1 = &a; y1 = &b;
      μ(b1, true)
 L4:  foo(x1, y1);
      a2 = χ(a1, must, true)  b2 = χ(b1, must, true)  c2 = χ(c1, may, true)
      d2 = χ(d1, may, true)  e2 = χ(e1, may, true)
 L5:  *b1 = 5;
 L6:  if ( … ) { x2 = &c; y2 = &e; }
 L7:  else     { x3 = &d; y3 = &d; }
      x4 = ϕ(x2, x3); y4 = ϕ(y2, y3);
 L8:  c1 = &t;
      μ(d1, true)  μ(e1, true)
 L9:  foo(x4, y4);
      a2 = χ(a1, must, true)  b2 = χ(b1, must, true)  c2 = χ(c1, may, true)
      d2 = χ(d1, may, true)  e2 = χ(e1, may, true)
 L10: *e1 = 10;
}

20 Full Context-Sensitive Analysis
Compute a complete transfer function for each procedure
The transfer function is cheap to represent and to apply
–Represent calling contexts by context conditions
 Merges similar calling contexts
 Cheaper than call strings
–Implement context conditions with BDDs
 Compactly represent context conditions
 Enable Boolean operations to be evaluated efficiently
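A toy C sketch of applying a context-conditioned transfer function at a call site; bit masks stand in for the BDDs, and the predicate names follow slides 16-19 (this is an illustration, not LevPA's actual data structures):

    #include <stdio.h>
    #include <stdbool.h>

    /* One bit per points-to predicate of foo's formals. */
    enum { P_TO_A, P_TO_C, P_TO_D, Q_TO_B, Q_TO_D, Q_TO_E };

    static bool holds(unsigned pred, unsigned ctx) { return (ctx >> pred) & 1u; }

    int main(void) {
        unsigned ctx_L4 = (1u << P_TO_A) | (1u << Q_TO_B);   /* call site L4: x1->{a}, y1->{b}     */
        unsigned ctx_L9 = (1u << P_TO_C) | (1u << P_TO_D)
                        | (1u << Q_TO_D) | (1u << Q_TO_E);   /* call site L9: x4->{c,d}, y4->{d,e} */

        /* Trans(foo, d) fires when p->d or q->d holds (a may update). */
        bool d_at_L4 = holds(P_TO_D, ctx_L4) || holds(Q_TO_D, ctx_L4);
        bool d_at_L9 = holds(P_TO_D, ctx_L9) || holds(Q_TO_D, ctx_L9);
        printf("d may be updated by foo: at L4 = %d, at L9 = %d\n", d_at_L4, d_at_L9);  /* 0 and 1 */
        return 0;
    }

Replacing the bit masks with BDD nodes keeps the same evaluation structure while sharing common subconditions across variables, which is what keeps full transfer functions cheap to store and apply.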

21 Experiment
Analyzes millions of lines of code in minutes
Faster than the state-of-the-art FSCS pointer analysis algorithms

Table 2. Performance (secs).
Benchmark        KLOC   LevPA (64-bit)  LevPA (32-bit)  Bootstrapping (PLDI'08)
Icecast-2.3.1      22      2.18             5.73            29
sendmail          115     72.63           143.68           939
httpd             128     16.32            35.42           161
445.gobmk         197     21.37            40.78             /
wine-0.9.24      1905    502.29           891.16             /
wireshark-1.2.2  2383    366.63           845.23             /

22 Future work
The points-to results are currently used only for error checking
We are working on
–Serving optimization
 Let the WPA framework generate code (connect to CG)
 Make the points-to sets usable by optimization passes
 New optimizations under the WPA framework
–Serving parallelization
 Provide precise information to programmers to guide parallelization

23 An Interprocedural Slicer
Based on the PDG (program dependence graph)
Compresses the PDG
–Merges nodes that are aliased
Accommodates multiple pointer analyses
Allows many problems to be solved on a slice, reducing time and space costs

24-28 Application of the slicer
Now aiding program error checking
–Reduce the number of states to be checked
 Use Saturn as the error checker
 Feed slices to Saturn instead of the whole program
 After slicing, Saturn detects errors in file and memory operations 11.59 and 2.06 times faster, respectively
–Improve the accuracy of error-checking tools
 Use Fastcheck as the error checker
 More true errors are detected by Fastcheck

29 Part Ⅱ Improvement on whirl2c

30 Improvement on whirl2c
Previous status
–whirl2c was designed for IPA and LNO compiler engineers to use for debugging
–The Berkeley UPC group and the Houston OpenUH group extended whirl2c somewhat, but it still could not support big applications and various optimizations
Problem
–Type information becomes incorrect because of transformations

31 Improvement on whirl2c
Our work
–Improve whirl2c so that its output can be recompiled and executed
–Pass the SPEC CPU2000 C/C++ programs under O0/O2/O3+IPA, based on pathscale-2.2
Motivation
–Some customers require us not to touch their platforms
–Support the retargetability of some platform-independent optimizations
–Support gdb debugging of the whirl2c output

32 Improvement on whirl2c
Incorrect type information due to transformation
(diagram: the frontend's WHIRL before structure folding translates correctly through whirl2c; the WHIRL after structure folding leads to wrong whirl2c output)
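A hypothetical C-level illustration of the general problem (not taken from the slides, and not necessarily how Open64's structure folding works): once a struct copy has been lowered to untyped integer moves, the regenerated C no longer carries the original types.

    struct point { int x, y; };

    /* Source program: a plain struct copy. */
    void copy_before(struct point *dst, struct point *src) {
        *dst = *src;
    }

    /* After a folding-style lowering, the copy may survive only as an 8-byte
     * integer move; regenerating C from that form loses (and here even
     * violates) the original struct type information.                      */
    void copy_after(struct point *dst, struct point *src) {
        *(long long *)dst = *(long long *)src;
    }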

33 Improvement on whirl2c
Incorrect type information is mainly related to pointer/array/structure types and their compositions
We re-infer the type information based on basic types
–Basic type information is used to generate assembly code, so it is reliable
–Array element sizes are also reliable
–A series of rules derives the correct type information from basic type information, array element sizes and operators
Information that is useful for whirl2c but made incorrect by various optimizations is corrected just before whirl2c, which needs little change to the existing IR

34 Part Ⅲ Extending UPC for GPU cluster

35 Extending UPC with Hierarchical Parallelism
UPC (Unified Parallel C), a parallel extension of ISO C99
–A dialect of the PGAS (Partitioned Global Address Space) languages
–Suitable for distributed-memory machines, shared-memory systems and hybrid memory systems
–Good performance, portability and programmability
Important UPC features (a minimal example follows below)
–SPMD parallelism
–Shared data is partitioned into segments, each with affinity to one UPC thread; shared data is referenced through shared pointers
–Global workload partitioning: upc_forall with an affinity expression
ICT extends UPC with hierarchical parallelism
–Extended data distribution for shared arrays
–Hybrid SPMD with an implicit thread hierarchy
–Important optimizations targeting GPU clusters
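A minimal standard UPC example (not taken from the slides; the block size and array length are arbitrary) showing a blocked shared array and upc_forall with an affinity expression:

    #include <upc.h>
    #include <stdio.h>

    #define BLK 16
    #define N (BLK * THREADS)          /* one block of 16 elements per UPC thread */

    shared [BLK] double a[N], b[N];    /* blocked distribution across the threads */

    int main(void) {
        int i;
        /* Each iteration is executed by the thread that has affinity to a[i]. */
        upc_forall (i = 0; i < N; i++; &a[i]) {
            b[i] = i;
            a[i] = 2.0 * b[i];
        }
        upc_barrier;
        if (MYTHREAD == 0)
            printf("a[1] = %f\n", (double)a[1]);   /* prints 2.000000 */
        return 0;
    }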

36 Source-to-source Compiler, built upon Berkeley UPC (Open64)
Frontend support
Analysis and transformation on upc_forall loops
–Shared-memory management based on reuse analysis
–Data-regrouping analysis for global memory coalescing
 Structure splitting and array transpose (see the sketch below)
–Instrumentation for memory consistency (collaborates with the DSM system)
–Affinity-aware loop tiling
 For multidimensional data blocking on shared arrays
–Create data environments for kernel loops, leveraging array section analysis
 Copy in, copy out, private (allocation), formal arguments
–CUDA kernel code generation and runtime instrumentation
 Kernel function and kernel invocation
Whirl2c translator: UPC => C + UPCR + CUDA
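A plain-C sketch of what the structure-splitting part of the data-regrouping transformation aims for (illustrative types and names, not compiler output): splitting an array of structures into per-field arrays so that consecutive indices touch consecutive words, which is what global-memory coalescing needs.

    #define N 1024

    /* Before: array of structures. Accesses to field x of consecutive
     * elements are strided by sizeof(struct particle_aos).            */
    struct particle_aos { float x, y, z, m; };
    struct particle_aos p_aos[N];

    /* After structure splitting: one array per field. Consecutive indices
     * now touch consecutive floats, so the accesses can be coalesced.    */
    struct particle_soa { float x[N], y[N], z[N], m[N]; };
    struct particle_soa p_soa;

    float moment_x_aos(int i) { return p_aos[i].m * p_aos[i].x; }  /* strided     */
    float moment_x_soa(int i) { return p_soa.m[i] * p_soa.x[i]; }  /* unit stride */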

37 Memory Optimizations for CUDA
What data should be put into the shared memory?
–First, pseudo tiling
–Extend REGION with a reuse degree and a region volume
 Inter-thread and intra-thread average reuse degree for each merged region
–A 0-1 bin-packing problem against the shared-memory capacity (simplified sketch below)
 Quantify the profit: reuse degree integrated with the coalescing attribute
 Prefer inter-thread reuse
What is the optimal data layout in global memory?
–Coalescing attributes of array references: only contiguity constraints are considered
–Legality analysis
–Cost model and amortization analysis
Code transformations (in a runtime library)
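A simplified C sketch of the shared-memory placement decision viewed as 0-1 knapsack: each candidate region has a volume and a profit (its reuse degree, assumed here to be already weighted by the coalescing attribute). The volumes, profits and capacity below are made-up numbers, not measurements.

    #include <stdio.h>
    #include <string.h>

    #define SM_CAPACITY 16384              /* e.g. 16 KB of shared memory per SM */
    #define NREGIONS 3

    int main(void) {
        int volume[NREGIONS] = { 8192, 6144, 4096 };   /* hypothetical region volumes (bytes) */
        int profit[NREGIONS] = { 40, 25, 30 };         /* hypothetical weighted reuse degrees */
        static int best[SM_CAPACITY + 1];              /* best[c] = max profit within capacity c */

        memset(best, 0, sizeof best);
        for (int i = 0; i < NREGIONS; i++)
            for (int c = SM_CAPACITY; c >= volume[i]; c--)   /* reverse scan: each region placed at most once */
                if (best[c - volume[i]] + profit[i] > best[c])
                    best[c] = best[c - volume[i]] + profit[i];

        printf("best achievable profit: %d\n", best[SM_CAPACITY]);   /* 70: regions 0 and 2 fit together */
        return 0;
    }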

38 Extend UPC's Runtime System
A DSM system on each UPC thread
–Demand-driven data transfer between GPU and CPU
–Manages all global variables
–Grain size: a UPC tile for shared arrays, and a private array as a whole
 Shuffle remote and local array regions into one contiguous physical block before transferring
Data transformation for memory coalescing
–Implemented on the GPU side as a CUDA kernel
–Leverages shared memory

39 Benchmarks

Application  Description                                               Original language  Application field       Source
Nbody        n-body simulation                                         CUDA+MPI           Scientific computing    CUDA campus programming contest 2009
LBM          Lattice Boltzmann method in computational fluid dynamics  C                  Scientific computing    SPEC CPU 2006
CP           Coulombic Potential                                       CUDA               Scientific computing    UIUC Parboil Benchmark
MRI-FHD      Magnetic Resonance Imaging FHD                            CUDA               Medical image analysis  UIUC Parboil Benchmark
MRI-Q        Magnetic Resonance Imaging Q                              CUDA               Medical image analysis  UIUC Parboil Benchmark
TPACF        Two Point Angular Correlation Function                    CUDA               Scientific computing    UIUC Parboil Benchmark

40 UPC Performance on the CUDA Cluster
CPUs on each node: 2 dual-core AMD Opteron 880
GPU: NVIDIA GeForce 9800 GX2
Compilers: nvcc (2.2) -O3; GCC (3.4.6) -O3
4-node CUDA cluster; Ethernet interconnect

41 For more details, please contact Li Chen lchen@ict.ac.cn

42 Part Ⅳ Open Source Loongcc

43 Open Source Loongcc
Targets the LOONGSON CPU
Based on Open64
–Main trunk, r2716
A MIPS-like processor
–With new instructions
New features
–LOONGSON machine model
–LOONGSON feature support in FE, LNO, WOPT, CG
–Edge profiling

44 Thanks

