1
Open64 on MIPS: Porting and enhancing Open64 for Loongson II. Loongcc Group, ICT, Beijing. Seattle, Mar. 21, 2009
2
Outline: What’s Loongson II? What’s Loongcc? How Loongcc works, using art as an example. The porting process and performance evaluation.
3
The chip: Loongson 2F, in the Loongson II family. Features: 64-bit, out-of-order, 4-issue (0.8~1 GHz); MIPS III-compatible; on-chip 64K/64K L1 cache and 512K L2 cache; on-chip MMU supporting DDR2 (533 MHz).
4
The chip
6
Loongcc: yet another Open64 branch, targeting the Loongson family. Aims: good performance, robustness, open source.
7
Loongcc
8
Loongcc’s transformation of art
9
Transformation of art Structure peeling produces temporary arrays
10
Structure peeling
11
50% cache line utilization
12
Structure peeling
13
100% cache line utilization
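The peeling shown on the preceding slides can be sketched in C. This is only an illustration: the struct layout and the field names `hot` and `cold` are hypothetical and are not taken from art's source. With two equal-sized fields and a loop that reads only one of them, half of every fetched cache line is useful before peeling and all of it after.

```c
#include <stddef.h>

#define N 1024

/* Before peeling: an array of structures. A loop that reads only `hot`
   still drags `cold` into the cache, so half of each line is wasted. */
struct node { double hot; double cold; };

double sum_hot_aos(const struct node *nodes, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += nodes[i].hot;
    return s;
}

/* After peeling: each field lives in its own dense array, so every
   byte of a fetched line belongs to the field the loop actually uses. */
struct peeled { double hot[N]; double cold[N]; };

double sum_hot_peeled(const struct peeled *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p->hot[i];
    return s;
}
```

Both functions compute the same sum; only the memory layout that the loop walks over differs.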
14
Temporary arrays: a special access pattern. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop.
18
Temporary arrays: a special access pattern. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop. Old values are always killed, so dirty cache lines of A need not be written back to memory after use.
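A minimal C sketch of this access pattern, under stated assumptions: `in` stands for the slide's `i`, the norm is taken to be the 1-norm (the slides do not say which norm is meant), and the function name is hypothetical. Because `t` is a reduction over all of A, the loop that produces B cannot start until the loop that fills A has finished.

```c
#include <stddef.h>

/* One iteration of the enclosing loop. A is completely redefined on
   entry, so whatever it held from the previous iteration is dead.
   Assumes the norm t is nonzero. */
void normalize_step(const double *in, const double *C,
                    double *A, double *B, size_t n) {
    double t = 0.0;
    for (size_t i = 0; i < n; i++) {
        A[i] = in[i] + C[i];                  /* A := i + C  */
        t += A[i] < 0.0 ? -A[i] : A[i];       /* t := ||A||  */
    }
    for (size_t i = 0; i < n; i++)
        B[i] = A[i] / t;                      /* B := A / t  */
}
```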
19
Temporary arrays
21
They need to be written to memory!
22
Temporary arrays
23
Write more to memory.
24
Temporary arrays
25
Write misses!
26
Problems with temporary arrays Unnecessary writes to memory Large cache footprint
27
Temporary arrays Solution?
28
Problems with temporary arrays Solution? Contraction?
29
Temporary arrays: a special access pattern. In a loop: A := i + C; t := ||A||; B := t^(-1) A; … next iteration of loop. This prevents array contraction: all of A must be ready before any B is computed.
30
Temporary arrays Solution?
31
Temporary arrays Solution? Overlay
32
Array overlay
34
No write misses, not even cold misses.
35
Array overlay Nothing out of cache. No memory writes.
36
Array overlay
37
A is still in cache.
38
Array overlay No writes to memory! (as long as in cache).
39
Smaller cache footprint!
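One plausible reading of the overlay, sketched in C: the temporaries A and B share a single buffer, so B is written into lines that are already resident and dirty. This is an assumption about the transformation, not Loongcc's actual output; the function name, the use of the 1-norm, and the parameter `in` (standing for the slide's `i`) are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical overlay: A and B occupy the same storage `ab`.
   Assumes the norm t is nonzero. */
void normalize_step_overlay(const double *in, const double *C,
                            double *ab, size_t n) {
    double t = 0.0;
    for (size_t i = 0; i < n; i++) {
        ab[i] = in[i] + C[i];                 /* buffer holds A      */
        t += ab[i] < 0.0 ? -ab[i] : ab[i];    /* t := ||A|| (1-norm) */
    }
    for (size_t i = 0; i < n; i++)
        ab[i] /= t;                           /* buffer now holds B  */
}
```

Compared with keeping two separate arrays, the second loop touches no new cache lines, which matches the slides' claims of no write misses and a smaller footprint as long as the buffer fits in cache.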
40
Effect of Overlay On Loongson 2F, for art
45
Other source-to-source transformations: array transposition; flattening (multi-dimensional array to one-dimensional); structure splitting; special loop patterns.
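Two of these transformations, flattening and transposition, can be sketched in C as follows. The dimensions and function names are hypothetical and serve only to illustrate the index arithmetic; this is not the code Loongcc generates.

```c
#include <stddef.h>

enum { ROWS = 3, COLS = 4 };

/* Flattening: element (r, c) of a ROWS x COLS array is stored at
   offset r * COLS + c of a one-dimensional array. */
static size_t flat_index(size_t r, size_t c) { return r * COLS + c; }

/* Transposition: copy into column-major order, so that walking down
   one column touches consecutive addresses. */
void transpose(const double *src, double *dst) {
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            dst[c * ROWS + r] = src[flat_index(r, c)];
}

/* A column traversal on the transposed layout is unit-stride. */
double column_sum_transposed(const double *t, size_t c) {
    double s = 0.0;
    for (size_t r = 0; r < ROWS; r++)
        s += t[c * ROWS + r];
    return s;
}
```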
46
Effect of source-to-source transformation of art.
47
Effect of source-to-source transformation: It works well when special patterns exist, such as a hot, large structure array. It works well for art and equake. Applying it to other SPEC2000INT benchmarks does not yield good gains (yet). It can only process C sources.
48
Source-to-source transformation. Pros: complete source-level information; human-readable intermediate results; natural representation of data structure transformations. Cons: dataflow analysis, alias analysis, and collection of frequency information must be redone; interference with all subsequent optimization passes.
49
Constructing Loongcc and its performance
50
Porting process: merge the front/middle-end from PathScale® with our team's ORC®-based back-end. Full SPEC2000 is supported; SPEC2006 support is in progress.
51
Porting process
52
Performance We measure contribution of an optimization by the performance loss when the optimization is disabled.
53
Performance comparison. Loongcc base = -O3 -ipa. Loongcc peak = follows the SPEC peak rules. GCC base = -O3 -march=loongson2f -mtune=loongson2f. GCC peak = mild tuning of flags. GFortran is used.
54
Performance: Loongcc base outperforms GCC base by 13%/35%. Loongcc peak outperforms GCC peak by 28%/78%. Our apologies: we are not real GCC experts.
55
SPEC2000INT
57
Delay slot filling is present in Loongcc base and enhanced in Loongcc peak (bug fixes and more arcs in the CG dependency graph). Forward scheduling in IGLS improves gap by 8%.
58
Prefetch: stride prefetch improves mcf by 27%, parser by 4%, and gap by 6.3%.
59
Prefetch: Loongson 2F has only a “pseudo prefetch”: lbu %0,addr, with the illegal address exception suppressed. Higher cost. No effect for SPEC2000FP cases yet.
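As a hand-written illustration of stride prefetching, the sketch below uses the GCC/Clang builtin `__builtin_prefetch`. It is not Loongcc's implementation: the compiler inserts prefetches itself, and the distance `PREFETCH_DISTANCE` is a hypothetical value that would need tuning. What the builtin lowers to on a given target is up to the compiler.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 8   /* hypothetical; iterations to run ahead */

/* Sums every `stride`-th element, requesting the element that will be
   needed PREFETCH_DISTANCE iterations later. Assumes stride >= 1. */
long sum_strided(const long *a, size_t n, size_t stride) {
    long s = 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_DISTANCE * stride;
        if (ahead < n)                         /* stay inside the array */
            __builtin_prefetch(&a[ahead], 0, 1);  /* read, low locality */
        s += a[i];
    }
    return s;
}
```

The prefetch is only a hint, so the result is the same with or without it; the bounds check keeps the requested address inside the array.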
60
Other optimizations: use of conditional move instructions; placing affine global data near each other; peephole optimizations in EBO.
61
SPEC2000FP
62
Loongcc compared to GCC: flush-to-zero mode; inlining.
63
SPEC2000FP: array contraction; source-to-source transformation; optimizing cache behavior.
64
Thank you! Questions please.
65
Answer to questions: What’s the take-home message? We have developed a working, open-source Open64 branch for MIPS with good performance. We show that source-to-source transformation is a good way to express some optimizations.
66
Answer to questions: Why not CPU2006? Support is in progress.
67
Performance comparison: The GCC peak numbers are the best results from our tests of GCC 4.4, GCC 4.3, and a special branch for Loongson 2F from STMicroelectronics®. The GFortran of the corresponding version is used.
68
Question about source-to-source transformation: The source-to-source transformation is implemented as a plugin to CIL. It can only process C sources because of a front-end restriction. The frequency information has to be collected independently.
69
Source-to-source transformation
71
Recover index variable to avoid confusing Loongcc
72
Source-to-source transformation: CIL (C Intermediate Language) is a source-to-source transformation framework. It provides dataflow analysis etc. and canonicalizes the C source.
73
Array contraction. Loop 1: def of A, B, C, D; use of A, B, C, D. Loop 2: def of A, B, C; use of A, B, C, D. The missing def of D in loop 2 prevents direct contraction.
74
Array contraction. Loop 1: def of A, B, C, D; use of A, B, C, D. Loop 2: def of A, B, C, D; use of A, B, C, D. The missing def of D in loop 2 prevents direct contraction, so rematerialize D there.
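The idea can be sketched in C with two temporaries instead of four. The function names and the arithmetic are hypothetical; the point is only that once loop 2 recomputes D itself, neither temporary is live across loop iterations and both can shrink to scalars. The sketch assumes the inputs of D (here `x`) are unchanged between the two loops, which is what makes rematerialization legal.

```c
#include <stddef.h>

/* Before: D is defined only in loop 1 but used in loop 2, so the whole
   array must stay live between the loops and cannot be contracted. */
void two_loops_with_arrays(const double *x, double *A, double *D,
                           double *out1, double *out2, size_t n) {
    for (size_t i = 0; i < n; i++) {
        A[i] = x[i] * 2.0;
        D[i] = x[i] + 1.0;
        out1[i] = A[i] + D[i];
    }
    for (size_t i = 0; i < n; i++) {
        A[i] = x[i] * 3.0;
        out2[i] = A[i] + D[i];          /* D carried over from loop 1 */
    }
}

/* After: D is rematerialized in loop 2, so A and D are each defined
   and used within one iteration and contract to scalars. */
void two_loops_contracted(const double *x,
                          double *out1, double *out2, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double a = x[i] * 2.0;
        double d = x[i] + 1.0;
        out1[i] = a + d;
    }
    for (size_t i = 0; i < n; i++) {
        double a = x[i] * 3.0;
        double d = x[i] + 1.0;          /* rematerialized def of D */
        out2[i] = a + d;
    }
}
```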