Open64 on MIPS: Porting and enhancing Open64 for Loongson II. Loongcc Group, ICT, Beijing. Seattle, Mar. 21, 2009.

Outline: What is Loongson II? What is Loongcc? How Loongcc works, using art as an example. The porting process and performance evaluation.

The chip: Loongson 2F, in the Loongson II family. Features: 64-bit, out-of-order, 4-issue (0.8~1 GHz); MIPS III-compatible; on-chip 64K/64K L1 cache and 512K L2 cache; on-chip MMU supporting DDR2 (533 MHz).

The chip

Loongcc: yet another Open64 branch, targeting the Loongson family. Aims: good performance, robustness, open source.

Loongcc

Loongcc’s transformation of art

Transformation of art: structure peeling produces temporary arrays.

Structure peeling

50% cache line utilization

Structure peeling

100% cache line utilization
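
The 50% versus 100% figures can be illustrated with a minimal sketch in C. The structure layout and the names `neuron`, `hot`, and `cold` are hypothetical and are not taken from art; the sketch only assumes a structure with one field that a hot loop reads and one that it does not.

```c
#include <stddef.h>

/* Before peeling: hot and cold fields are interleaved in one array of
   structures (hypothetical layout, not taken from art). */
struct neuron {
    double hot;   /* read in the inner loop */
    double cold;  /* rarely touched */
};

double sum_hot_aos(const struct neuron *n, size_t count) {
    double s = 0.0;
    for (size_t i = 0; i < count; i++)
        s += n[i].hot;   /* half of every fetched cache line is cold data */
    return s;
}

/* After peeling: each field lives in its own dense array. */
struct neurons_peeled {
    double *hot;
    double *cold;
};

double sum_hot_peeled(const struct neurons_peeled *n, size_t count) {
    double s = 0.0;
    for (size_t i = 0; i < count; i++)
        s += n->hot[i];  /* every fetched byte is hot data */
    return s;
}
```

Both functions compute the same sum; only the memory layout that the loop walks over differs.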

Temporary arrays: special access pattern. In a loop: A := i + C; t := ||A||; B := t^-1 A; … next iteration of loop. Old values are always killed, so there is no need to write dirty cache lines of A to memory after they are used.

Temporary arrays

Need to write them to memory!

Temporary arrays

Write more to memory.

Temporary arrays

Write misses!

Problems with temporary arrays: unnecessary writes to memory; large cache footprint.

Temporary arrays Solution?

Problems with temporary arrays Solution? Contraction?

Temporary arrays: special access pattern. In a loop: A := i + C; t := ||A||; B := t^-1 A; … next iteration of loop. This prevents array contraction! All of A must be ready before any element of B can be computed.
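
A minimal C sketch of this pattern shows why contraction fails. The function name `normalize_step` and its parameters are hypothetical, and the 1-norm is an assumption made only to keep the sketch free of library calls; the slide does not say which norm is meant. It also assumes the norm is nonzero.

```c
#include <stddef.h>

/* One step of the pattern: A := i + C, t := ||A||, B := t^-1 A.
   Assumes the norm t is nonzero. */
void normalize_step(const double *in, const double *c,
                    double *a, double *b, size_t n) {
    double t = 0.0;
    for (size_t k = 0; k < n; k++) {
        a[k] = in[k] + c[k];               /* A := i + C */
        t += a[k] < 0.0 ? -a[k] : a[k];    /* accumulate ||A|| (1-norm) */
    }
    /* t is complete only here, so no b[k] could have been computed inside
       the first loop: A has to survive as a whole array, not as a scalar. */
    for (size_t k = 0; k < n; k++)
        b[k] = a[k] / t;                   /* B := t^-1 A */
}
```

The reduction into `t` sits between the definition of `a` and its last use, which is what keeps the two loops from being fused and `a` from being contracted to a scalar.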

Temporary arrays: Solution? Overlay.

Array overlay

No write misses, not even cold misses.

Array overlay: nothing goes out of cache. No memory writes.

Array overlay

A is still in cache.

Array overlay: no writes to memory (as long as the data stays in cache)!

Smaller cache footprint!
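
One possible reading of overlay, sketched in C below, is that B reuses the storage of A. This is an assumption: the slides do not spell out the exact mapping Loongcc uses. The name `normalize_step_overlay` is hypothetical, and the sketch assumes that A is dead once B has been produced and that the norm (a 1-norm here, for simplicity) is nonzero.

```c
#include <stddef.h>

/* Same computation as A := i + C; t := ||A||; B := t^-1 A, but B is
   written into the storage of A (assumes A is dead after B is produced). */
void normalize_step_overlay(const double *in, const double *c,
                            double *ab, size_t n) {
    double t = 0.0;
    for (size_t k = 0; k < n; k++) {
        ab[k] = in[k] + c[k];                /* A := i + C */
        t += ab[k] < 0.0 ? -ab[k] : ab[k];   /* accumulate ||A|| (1-norm) */
    }
    for (size_t k = 0; k < n; k++)
        ab[k] /= t;   /* B overlays A: the written line was just read */
}
```

Under this reading, each store to B hits a line that the load of A has just brought in, and one array's worth of cache space is saved.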

Effect of overlay on Loongson 2F, for art

Other source-to-source transformations: array transposition; flattening a multi-dimensional array to one dimension; structure splitting; special loop patterns.
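
As an illustration of flattening, the following C sketch replaces a row-pointer matrix (`double **`) with one contiguous block. The type `flat_matrix` and its helper functions are hypothetical names chosen for this example; they do not describe the code Loongcc generates.

```c
#include <stddef.h>
#include <stdlib.h>

/* Flattened replacement for a row-pointer matrix (double **). */
struct flat_matrix {
    size_t rows, cols;
    double *data;   /* rows * cols contiguous elements, row-major */
};

/* Returns 0 on success, -1 if the allocation fails. */
int flat_matrix_init(struct flat_matrix *m, size_t rows, size_t cols) {
    m->rows = rows;
    m->cols = cols;
    m->data = calloc(rows * cols, sizeof *m->data);
    return m->data != NULL ? 0 : -1;
}

/* Address of element (i, j); replaces the two dependent loads of a[i][j]. */
double *flat_at(struct flat_matrix *m, size_t i, size_t j) {
    return &m->data[i * m->cols + j];
}

void flat_matrix_free(struct flat_matrix *m) {
    free(m->data);
    m->data = NULL;
}
```

The flattened form needs a single load per element and keeps consecutive rows adjacent in memory.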

Effect of source-to-source transformation of art.

Effect of source-to-source transformation: it works well when special patterns exist, such as a hot, large structure array. It works well for art and equake. Applying it to the SPEC2000INT benchmarks does not yield good gains (yet). It can only process C sources.

Source-to-source transformation. Pros: complete source-level information; human-readable intermediate results; natural representation of data structure transformations. Cons: dataflow analysis, alias analysis, and collection of frequency information must be redone; interference with all subsequent optimization passes.

Constructing Loongcc and its performance

Porting process: merge the front end and middle end from PathScale® with our team's ORC®-based back end. Full SPEC2000 is supported; SPEC2006 support is in progress.

Porting process

Performance: we measure the contribution of an optimization by the performance loss when that optimization is disabled.
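
The measurement rule can be written as a one-line helper in C. The function name and the choice to express the loss relative to the fully optimized score are assumptions; the slides do not state which baseline the percentages use.

```c
/* Percentage of performance lost when one optimization is turned off.
   Scores are "higher is better" (for example, SPEC ratios).
   Assumption: the loss is taken relative to the fully optimized score. */
double contribution_percent(double score_all_on, double score_one_off) {
    return 100.0 * (score_all_on - score_one_off) / score_all_on;
}
```

For example, with hypothetical scores of 100 with everything enabled and 92 with one optimization disabled, the helper reports a contribution of 8 percent.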

Performance comparison: Loongcc base = -O3 -ipa. Loongcc peak = follows the SPEC peak rules. GCC base = -O3 -march=loongson2f -mtune=loongson2f. GCC peak = mild tuning of flags. GFortran is used.

Performance: Loongcc base outperforms GCC base by 13%/35%. Loongcc peak outperforms GCC peak by 28%/78%. Apologies: we are not real GCC experts.

SPEC2000INT

Delay slot filling is present in Loongcc base. It is enhanced in Loongcc peak (bug fixes and more arcs in the CG dependency graph). Forward scheduling in IGLS improves gap by 8%.

Prefetch: stride prefetch improves mcf by 27%, parser by 4%, and gap by 6.3%.

Prefetch: Loongson 2F has only a “pseudo prefetch”: lbu %0,addr. The illegal-address exception is suppressed. Higher cost. No effect on the SPEC2000FP cases yet.
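
At the source level, stride prefetching can be sketched in C with the `__builtin_prefetch` hint that GCC and Clang provide. This is an illustration only: Loongcc inserts prefetches inside the compiler, and the distance `PREFETCH_AHEAD` and the function `sum_strided` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in loop iterations; a compiler would
   tune this to the memory latency of the target. */
#define PREFETCH_AHEAD 16

long sum_strided(const long *v, size_t n, size_t stride) {
    assert(stride > 0);
    long s = 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_AHEAD * stride;
#if defined(__GNUC__)
        if (ahead < n)
            __builtin_prefetch(&v[ahead]);  /* hint only, result unused */
#else
        (void)ahead;                        /* no prefetch hint available */
#endif
        s += v[i];
    }
    return s;
}
```

On a target whose only prefetch is a discarded load, such as the one described above, each hint costs a real load slot, which is consistent with the "higher cost" remark.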

Other optimizations: use of conditional move instructions; placing affine global data near each other; peephole optimizations in EBO.

SPEC2000FP

Loongcc compared to GCC: flush-to-zero mode; inlining.

SPEC2000FP: array contraction; source-to-source transformation; optimizing cache behavior.

Thank you! Questions please.

Answer to questions: What’s the take-home message? We developed a working, open-source branch for MIPS with good performance. We showcase that source-to-source transformation is a good way to express some optimizations.

Answer to questions: Why not CPU2006? Support is in progress.

Performance comparison: the GCC peak numbers are the best results from our tests of GCC 4.4, GCC 4.3, and a special branch for Loongson 2F from STMicroelectronics®. The GFortran of the corresponding version is used.

Question about source-to-source transformation: the source-to-source transformation is implemented as a plugin to CIL. It can only process C sources due to a restriction of the front end. The frequency information has to be collected independently.

Source-to-source transformation

Recover index variable to avoid confusing Loongcc

Source-to-source transformation: CIL, the C Intermediate Language. A source-to-source transformation framework. Dataflow analysis etc. Canonicalizes the C source.

Array contraction. Loop 1: def of A, B, C, D; use of A, B, C, D. Loop 2: def of A, B, C; use of A, B, C, D. The missing D prevents direct contraction.

Array contraction. Loop 1: def of A, B, C, D; use of A, B, C, D. Loop 2: def of A, B, C, D; use of A, B, C, D. The missing D prevents direct contraction. Rematerialize D.
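
The rematerialization step can be sketched in C with a single temporary. The function names and the computation `2.0 * x[k]` are hypothetical; the sketch assumes that the value of D is cheap to recompute from data that both loops can still read.

```c
#include <stddef.h>

/* Before: d[] is defined in loop 1 and read again in loop 2 without being
   redefined there, so it must stay a full array. */
double two_loops_array(const double *x, double *d, size_t n) {
    double s = 0.0;
    for (size_t k = 0; k < n; k++) {   /* loop 1: defines and uses d */
        d[k] = 2.0 * x[k];
        s += d[k];
    }
    for (size_t k = 0; k < n; k++)     /* loop 2: uses d, does not define it */
        s += d[k] * x[k];
    return s;
}

/* After: loop 2 recomputes d, so d is a scalar in both loops. */
double two_loops_contracted(const double *x, size_t n) {
    double s = 0.0;
    for (size_t k = 0; k < n; k++) {
        double d = 2.0 * x[k];
        s += d;
    }
    for (size_t k = 0; k < n; k++) {
        double d = 2.0 * x[k];         /* rematerialized definition of D */
        s += d * x[k];
    }
    return s;
}
```

The contracted version trades one extra multiplication per iteration for the removal of an entire array from the cache footprint.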
