Published by Clement McLaughlin. Modified over 9 years ago.
1
P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES. Huiwei Lv, Yuan Cheng, Lu Bai, Mingyu Chen, Dongrui Fan, Ninghui Sun. Institute of Computing Technology. lvhuiwei@ncic.ac.cn. PADS 2010, May 18, 2010
2
Motivation: Multi-core platforms are common now (courtesy: Sun® UltraSPARC T2, AMD® Opteron 6000, Intel® Nehalem). System simulators are still sequential.
3
Motivation: Multi-core platforms are common now (courtesy: Sun® UltraSPARC T2, AMD® Phenom, Intel® Nehalem). System simulators are still sequential, so the extra cores are wasted and simulation speed is limited by single-core performance.
4
Poor Scalability of Single-threaded Simulator: Slowdown grows exponentially. Too slow to simulate future many-core systems with 1000+ cores.
5
Goal: fast and accurate computer system simulation. [Figure: accuracy (functional to cycle-accurate) versus speed (slowdown)] Target: 10x speedup without loss of accuracy.
6
Outline: Motivation; Implementation (Background, From DES to PDES, Optimization); Evaluation; Conclusion
7
Godson-T Architecture Simulator: Discrete Event Simulation (DES). One global event queue. Events are assigned to sinkers. New events are inserted back into the event queue. Fine-grained events. [Figure: EVENT A, EVENT B]
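The event loop described on this slide can be sketched in a few lines. This is a minimal illustration, not the Godson-T simulator's code: the `Simulator` class, the `core` and `cache` sinker functions, and the one-cycle delays are all assumptions made for the example.

```python
import heapq
import itertools

class Simulator:
    """Sequential DES: one global, time-ordered event queue."""

    def __init__(self):
        self.queue = []                # heap of (time, seq, sinker, payload)
        self.now = 0                   # current simulation time in cycles
        self._seq = itertools.count()  # tie-breaker: same-cycle events stay FIFO

    def schedule(self, time, sinker, payload):
        """Insert a new event into the global queue."""
        heapq.heappush(self.queue, (time, next(self._seq), sinker, payload))

    def run(self):
        """Pop the earliest event and hand it to its sinker until the queue drains."""
        while self.queue:
            time, _, sinker, payload = heapq.heappop(self.queue)
            self.now = time
            sinker(self, payload)

log = []

def core(sim, payload):
    # Hypothetical sinker: forwards work to the cache one cycle later.
    log.append((sim.now, "core", payload))
    if payload < 2:
        sim.schedule(sim.now + 1, cache, payload)

def cache(sim, payload):
    # Hypothetical sinker: replies to the core one cycle later.
    log.append((sim.now, "cache", payload))
    sim.schedule(sim.now + 1, core, payload + 1)

sim = Simulator()
sim.schedule(0, core, 0)
sim.run()
print(log)
```

Every sinker reads and writes the same queue, which is why this structure runs on a single thread.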
8
SimK: PDES Framework. Open source. Conservative PDES. Highly optimized: pthreads, lock-free, user-level thread scheduling. Modularized: use the SimK API to implement an LP. [Figure: LPs sit on top of the SimK API; the SimK core (schedule, exec, commu, sync, buffer, deploy) runs on the host]
9
From DES to PDES: Separate the global queue. Group sinkers into logical processes (LPs), one queue per LP. Events that cross LPs are wrapped with PDES time. [Figure: router, core and cache inside one LP, with a PDES time wrapper at the LP boundary]
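A rough sketch of the partitioning step follows. It is not the SimK API: `LogicalProcess`, `owner`, `send` and the `inbox` list are hypothetical names, and the "PDES time wrapper" is reduced here to carrying the event's timestamp across the LP boundary.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class LogicalProcess:
    lp_id: int
    queue: list = field(default_factory=list)  # local, time-ordered event queue
    inbox: list = field(default_factory=list)  # timestamped events from other LPs

# Assumed placement: each tile's router, core and cache share one LP.
owner = {"router0": 0, "core0": 0, "cache0": 0,
         "router1": 1, "core1": 1, "cache1": 1}
lps = {0: LogicalProcess(0), 1: LogicalProcess(1)}

def send(src, dst, time, payload):
    """Deliver an event locally, or hand it to the destination LP with its timestamp."""
    event = (time, dst, payload)
    if owner[src] == owner[dst]:
        # Same LP: behaves like the sequential case, on the LP's own queue.
        heapq.heappush(lps[owner[dst]].queue, event)
    else:
        # Cross-LP: the receiver merges it later, once synchronization allows.
        lps[owner[dst]].inbox.append(event)

send("core0", "router0", 5, "req")    # stays inside LP 0
send("router0", "router1", 6, "req")  # crosses from LP 0 to LP 1
print(lps[0].queue, lps[1].inbox)
```

Only the second call touches another LP's state, so only that path needs synchronization between threads.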
10
E.g. Router Event: Router 0 sends an event to router 1. [Figure: before/after diagram showing the event queue, router 0, core 0, cache 0, router 1, core 1, cache 1, LP 0, LP 1 and the PDES time wrapper]
11
Events from DES to PDES: single thread to multiple threads. Conservative PDES. [Figure: events on a simulation-time axis for Thread 1 to Thread 4, with a 1-cycle event dependence]
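The conservative rule can be illustrated with a small calculation. This is a generic textbook-style sketch, not SimK's synchronization code: `safe_horizon`, `runnable`, and the example clock values are assumptions, and the lookahead of one cycle is taken from the 1-cycle event dependence on the slide. It assumes at least two LPs.

```python
LOOKAHEAD = 1  # cycles, matching the slide's one-cycle event dependence

def safe_horizon(clocks, me, lookahead=LOOKAHEAD):
    """Events with a timestamp strictly below this bound are safe for LP `me`.

    clocks[j] is a lower bound on the timestamp of the next event LP j will
    process, so anything LP j sends later is stamped >= clocks[j] + lookahead.
    """
    others = [c for j, c in enumerate(clocks) if j != me]
    return min(others) + lookahead

def runnable(pending, clocks, me):
    """Split LP `me`'s pending timestamps into safe-now and must-wait."""
    bound = safe_horizon(clocks, me)
    safe = sorted(t for t in pending if t < bound)
    blocked = sorted(t for t in pending if t >= bound)
    return safe, blocked

clocks = [10, 12, 11, 10]                # hypothetical local clocks of four LPs
print(safe_horizon(clocks, 0))           # 11
print(runnable([10, 11], clocks, 0))     # ([10], [11])
```

With a one-cycle lookahead, no LP can run more than a cycle ahead of the slowest one, so the LPs synchronize almost every cycle.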
12
Grouping Into Big LPs. Problem: average speedup is 1.8x with 16 threads (prototype with 16 1-core LPs). Cause: too many LPs and an extremely small lookahead lead to high synchronization cost. Solution: group adjacent LPs into one big LP.
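One way to see why grouping helps is to count the links that cross an LP boundary. The sketch below assumes the simulated cores form a 4x4 mesh and that a big LP is a square block of adjacent cores; both the topology and the helper names `big_lp_of` and `cross_lp_links` are illustrative assumptions, not details from the slides.

```python
def big_lp_of(x, y, block):
    """Map a core at mesh position (x, y) to the big LP that owns its block."""
    return (x // block, y // block)

def cross_lp_links(width, height, block):
    """Count mesh links whose two endpoints sit in different LPs."""
    crossing = 0
    for y in range(height):
        for x in range(width):
            here = big_lp_of(x, y, block)
            # Link to the right-hand neighbour.
            if x + 1 < width and big_lp_of(x + 1, y, block) != here:
                crossing += 1
            # Link to the neighbour below.
            if y + 1 < height and big_lp_of(x, y + 1, block) != here:
                crossing += 1
    return crossing

print(cross_lp_links(4, 4, block=1))  # 24: every link crosses with 1-core LPs
print(cross_lp_links(4, 4, block=2))  # 8: only links between 2x2 blocks cross
```

Under these assumptions, grouping cuts the links that need cross-LP synchronization from 24 to 8; traffic inside a block is handled on the big LP's own queue.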
13
Final Parallelized Version: Parallel Discrete Event Simulation. Sinkers grouped into big LPs. LPs bound to threads using the SimK API. Time synchronization between LPs using PDES. Scheduling and execution under the SimK framework. [Figure: SimK API over the SimK core (schedule, exec, commu, sync, buffer, deploy) on the host]
14
Outline: Motivation; Implementation; Evaluation (Accuracy, Speedup); Conclusion
15
Evaluation Setup: GAS vs. P-GAS. Host: SMP with 4 quad-core AMD Opteron 8347 processors, 16 cores total, 64 GB memory. Benchmark: SPLASH-2 kernels. Benchmark computing time is counted in wall-clock time.
16
Cycle Count Error: average cycle count error is 0.04%.
17
P-GAS Speedup (16 threads, SPLASH-2 kernels): Average speedup is 9.8x. Best speedup is 13.6x (LU, 16 threads). 5.3x super-linear speedup with 4 threads.
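The term "super-linear" follows from dividing each reported speedup by its thread count. The short calculation below uses only the three figures on this slide; the `parallel_efficiency` helper and the labels are names chosen for the example.

```python
def parallel_efficiency(speedup, threads):
    """Speedup per thread; a value above 1.0 means super-linear speedup."""
    return speedup / threads

# (speedup, threads) pairs as reported on the slide.
reported = {
    "average, 16 threads": (9.8, 16),
    "best (LU), 16 threads": (13.6, 16),
    "4 threads": (5.3, 4),
}

for label, (speedup, threads) in reported.items():
    eff = parallel_efficiency(speedup, threads)
    kind = "super-linear" if eff > 1.0 else "sub-linear"
    print(f"{label}: efficiency {eff:.3f} ({kind})")
```

The 4-thread result is the only one with efficiency above 1.0 (5.3 / 4 = 1.325); the 16-thread average and best cases come to about 0.61 and 0.85.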
18
Why super-linear speedup? (5.3x with 4 threads) More cores, so more caches to use. The insert-to-queue time is shorter.
19
Conclusion: P-GAS uses PDES to speed up a cycle-accurate many-core processor simulator: 9.8x speedup on a 16-core SMP, cycle error < 0.04%. Highly optimized conservative PDES could be used for fast and accurate system simulation: multi-core/many-core processor simulation, SMP clusters, many-core clusters...
20
P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES. Please email questions to: lvhuiwei@ncic.ac.cn. Open source release of our PDES framework: http://simk.sf.net