1 MPSoC Design using Application-Specific Architecturally Visible Communication
Theo Kluter, Philip Brisk, Edoardo Charbon, Paolo Ienne
2 Motivation Streaming Applications: how to automatically customize an embedded multiprocessor system-on-chip (MPSoC) to support efficient execution of complex algorithms? (W.J. Dally, et al. 2003)
3 Motivation © Tensilica 2007
5 Motivation Automatic Parallelization Automatic Customization
10 Motivation (Parallelization) Streaming Applications: Load balancing; Avoiding intra-processor communication; Synchronization: Hardware Barrier (pipelined parallelization)
11 Motivation Automatic Parallelization Automatic Customization
13 Motivation (Customization) Streaming Applications: Instruction Set Extensions (L. Pozzi, et al. 2006; Tensilica, ARC, NIOS); Architecturally Visible Storage (P. Biswas, et al. 2007; T. Kluter, et al. 2008)
16 Motivation Automatic Parallelization + Automatic Customization? Only Load and Store instructions are allowed in the Instruction Set Extension identification. Architecturally Visible Storage memory placed between processors forms Architecturally Visible Communication buffers.
17 Contents Motivation Parallelization Communication Automation
19 Parallelization (reference) Streaming Applications (T.R. Halfhill 2000)
24 Parallelization (reference) Reduced energy consumption. Increased performance. Energy of the memory subsystem only! (D. Tarjan, et al. 2006)
26 Parallelization (homogeneous) Macro-block data-parallel computation due to algorithmic data dependence. Theoretical speed-up of 5x.
27 Parallelization (homogeneous) (figure: execution over time, showing the data dependence)
28 Parallelization (homogeneous)
29 Parallelization (homogeneous) Higher instruction cache pressure due to five distributed copies of the complete algorithm: the system prefers a four-way set-associative cache over a direct-mapped one.
30 Parallelization (heterogeneous)
31 Parallelization (heterogeneous) Quantization is the critical execution path; however, it contains easy-to-detect data parallelism (M.I. Gordon, et al. 2006).
32 Parallelization (heterogeneous) Entropy encoding is the next critical execution path, limiting the speed-up to a factor of 4x (based on execution on a single processor and linear speed-up assumptions).
33 Parallelization (heterogeneous) (figure: execution over time, showing the data dependence)
34 Parallelization (heterogeneous)
35 Parallelization (heterogeneous) Reduced instruction cache pressure due to the distribution of the complete algorithm over five caches: the system prefers a 2 KB cache over a 4 KB one.
36 Parallelization (comparison)
37 Contents Motivation Parallelization Communication Automation
38 Communication Homogeneous parallelization: intra-processor communication (10 bytes). Heterogeneous parallelization: intra-processor communication (3 x 128 bytes).
39 Communication (homogeneous)
48 Communication (homogeneous) The communication has, as expected, little influence on performance, and moving it to AVC buffers reduces energy consumption.
49 Communication (heterogeneous)
53 Communication (heterogeneous) The communication has, as expected, a large influence on performance, and moving it to AVC buffers significantly reduces energy consumption.
54 Communication (summary)
56 Contents Motivation Parallelization Communication Automation
57 Automation void quantisation(short *buffer, short *quant_table) { register int temp, qval; register int i; for (i = 0; i < DCTSIZE2; i++) ..... } Is this pointer a data structure that can be moved to an AVC buffer?
60 Automation void quantisation(short *buffer, short *quant_table) { register int temp, qval; register int i; for (i = 0; i < DCTSIZE2; i++) ..... } A designer can disambiguate all data structures (time consuming). A compiler might not be able to disambiguate all data structures (fast, but incomplete). Profiling can disambiguate all data structures it sees (fast, "complete", but not guaranteed). (Tensilica 2007; D.M. Gallagher 1995; S. Rul, et al. 2008; W. Thies, et al. 2007)
61 Automation (“safe” data structures)
62 Automation (“unsafe” data structures) [1] T. Kluter et al. 2008
67 Automation (flow)
1) Disambiguate all data structures (D.M. Gallagher 1995)
2) Select all eligible data structures (P. Biswas, et al. 2007; L. Benini, et al. 2000)
3) Annotate zero communication cost
4) Perform "standard" parallelization algorithm(s) (S. Rul, et al. 2008; W. Thies, et al. 2007)
5) Insert AVC buffers where required (T. Kluter, et al. 2008; P. Biswas, et al. 2007)
68 Conclusion
● Our results confirm previous findings in automated parallelization
● Application-specific communication buffers improve performance and reduce energy consumption
● Application-specific communication buffers find new automated parallelization solutions
● Application-specific communication can be used in the presence of "unsafe" analysis methods