1 MPSoC Design using Application-Specific Architecturally Visible Communication
Theo Kluter, Philip Brisk, Edoardo Charbon, Paolo Ienne
2 Motivation Streaming Applications: how to automatically customize an embedded multiprocessor system-on-chip (MPSoC) to support efficient execution of complex algorithms? (W.J. Dally, et al. 2003)
3 Motivation © Tensilica 2007
5 Motivation Automatic Parallelization Automatic Customization
10 Motivation (Parallelization) Streaming Applications: Load balancing; Avoiding intra-processor communication; Synchronization: Hardware Barrier (pipelined parallelization)
11 Motivation Automatic Parallelization Automatic Customization
13 Motivation (Customization) Streaming Applications: Instruction Set Extensions (L. Pozzi, et al. 2006; Tensilica, ARC, NIOS); Architecturally Visible Storage (P. Biswas, et al. 2007; T. Kluter, et al. 2008)
16 Motivation Automatic Parallelization + Automatic Customization? Only Load and Store instructions are allowed in the Instruction Set Extension identification. Architecturally Visible Storage memory placed between processors forms Architecturally Visible Communication buffers.
17 Contents Motivation Parallelization Communication Automation
19 Parallelization (reference) Streaming Applications (T.R. Halfhill 2000)
24 Parallelization (reference) Reduced energy consumption. Increased performance. Energy of the memory subsystem only! (D. Tarjan, et al. 2006)
26 Parallelization (homogeneous) Macro-block data-parallel computation due to algorithmic data dependence. Theoretical speed-up of 5x.
27 Parallelization (homogeneous) (figure: execution over time, showing the data dependence)
28 Parallelization (homogeneous)
29 Parallelization (homogeneous) Higher instruction cache pressure due to five distributed copies of the complete algorithm: the system prefers a four-way set-associative cache over a direct-mapped one.
30 Parallelization (heterogeneous)
31 Parallelization (heterogeneous) Quantization is the critical execution path; however, it contains easy-to-detect data parallelism (M.I. Gordon, et al. 2006).
32 Parallelization (heterogeneous) Entropy encoding is the next critical execution path, limiting the speed-up to a factor of 4x (based on execution on a single processor and linear speed-up assumptions).
33 Parallelization (heterogeneous) (figure: execution over time, showing the data dependence)
34 Parallelization (heterogeneous)
35 Parallelization (heterogeneous) Reduced instruction cache pressure due to the distribution of the complete algorithm over five caches: the system prefers a 2 KB cache over a 4 KB one.
36 Parallelization (comparison)
37 Contents Motivation Parallelization Communication Automation
38 Communication Homogeneous parallelization: intra-processor communication (10 bytes). Heterogeneous parallelization: intra-processor communication (3 x 128 bytes).
39 Communication (homogeneous)
48 Communication (homogeneous) The communication has, as expected, little influence on performance, and moving it to AVC buffers reduces energy consumption.
49 Communication (heterogeneous)
53 Communication (heterogeneous) The communication has, as expected, a large influence on performance, and moving it to AVC buffers significantly reduces energy consumption.
54 Communication (summary)
56 Contents Motivation Parallelization Communication Automation
57 Automation void quantisation(short *buffer, short *quant_table) { register int temp, qval; register int i; for (i = 0; i < DCTSIZE2; i++) ..... } Is this pointer a data structure that can be moved to an AVC buffer?
60 Automation void quantisation(short *buffer, short *quant_table) { register int temp, qval; register int i; for (i = 0; i < DCTSIZE2; i++) ..... } A designer can disambiguate all data structures (time consuming). A compiler might not be able to disambiguate all data structures (fast, but incomplete). Profiling can disambiguate all data structures it sees (fast, "complete", but not guaranteed). (Tensilica 2007; D.M. Gallagher 1995; S. Rul, et al. 2008; W. Thies, et al. 2007)
61 Automation (“safe” data structures)
62 Automation (“unsafe” data structures) [1] T. Kluter et al. 2008
67 Automation (flow)
1) Disambiguate all data structures (D.M. Gallagher 1995)
2) Select all eligible data structures (P. Biswas, et al. 2007; L. Benini, et al. 2000)
3) Annotate zero communication cost
4) Perform "standard" parallelization algorithm(s) (S. Rul, et al. 2008; W. Thies, et al. 2007)
5) Insert AVC buffers where required (T. Kluter, et al. 2008; P. Biswas, et al. 2007)
68 Conclusion
● Our results confirm previous findings in automated parallelization
● Application-specific communication buffers improve performance and reduce energy consumption
● Application-specific communication buffers find new automated parallelization solutions
● Application-specific communication can be used in the presence of "unsafe" analysis methods