P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES Huiwei Lv, Yuan Cheng, Lu Bai, Mingyu Chen, Dongrui Fan, Ninghui Sun Institute of Computing Technology PADS 2010, May 18, 2010
Motivation: Multi-core platforms are now common (courtesy: Sun® UltraSPARC T2, AMD® Opteron 6000, Intel® Nehalem), but system simulators are still sequential.
Motivation: Multi-core platforms are now common (courtesy: Sun® UltraSPARC T2, AMD® Phenom, Intel® Nehalem), but system simulators are still sequential. The extra cores are wasted, and simulation speed is limited by single-core performance.
Poor Scalability of a Single-threaded Simulator: Slowdown grows exponentially with the number of simulated cores, so a single-threaded simulator is too slow to simulate future many-core systems.
Goal: fast and accurate computer system simulation. [Figure: accuracy (functional to cycle-level) versus speed (slowdown).] Target: 10x speedup without loss of accuracy.
Outline: Motivation; Implementation (Background, From DES to PDES, Optimization); Evaluation; Conclusion
Godson-T Architecture Simulator: Discrete Event Simulation (DES) with one global event queue. Each event is assigned to a sinker, and new events are inserted back into the event queue. Events are fine-grained. [Figure: EVENT A, EVENT B.]
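The single-queue DES loop described on this slide can be sketched as follows. This is a minimal illustration, not the Godson-T simulator's code: the `Simulator` class, the handler signatures, and the one-cycle core-to-cache delay are all assumptions made for the example.

```python
import heapq
from itertools import count


class Simulator:
    """Sequential discrete event simulator with one global event queue."""

    def __init__(self):
        self.queue = []      # heap of (time, seq, sinker_name, payload)
        self.sinkers = {}    # sinker name -> handler(sim, time, payload)
        self.now = 0
        self._seq = count()  # tie-breaker: equal-time events stay in FIFO order
        self.log = []

    def register(self, name, handler):
        self.sinkers[name] = handler

    def schedule(self, time, sinker, payload=None):
        heapq.heappush(self.queue, (time, next(self._seq), sinker, payload))

    def run(self):
        # Pop the earliest event, hand it to its sinker, repeat until empty.
        while self.queue:
            time, _, sinker, payload = heapq.heappop(self.queue)
            self.now = time
            self.log.append((time, sinker, payload))
            self.sinkers[sinker](self, time, payload)


def core_handler(sim, time, payload):
    # Hypothetical behaviour: the request reaches the cache one cycle later.
    sim.schedule(time + 1, "cache", payload)


def cache_handler(sim, time, payload):
    # The cache consumes the request; no follow-up event in this sketch.
    pass


sim = Simulator()
sim.register("core", core_handler)
sim.register("cache", cache_handler)
sim.schedule(0, "core", "load A")
sim.schedule(0, "core", "load B")
sim.run()
```

Every handler inserts its new events back into the same heap, which is why the whole simulation is serialized on one thread.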
SimK: PDES Framework. Open source. Conservative PDES. Highly optimized: pthreads, lock-free, user-level thread scheduling. Modularized: use the SimK API to implement an LP. [Figure: SimK on the host cores, with layers for schedule and exec; commu, sync, buffer and deploy; the API; and the LPs on top.]
From DES to PDES: Separate the global queue. Group sinkers into logical processes (LPs), with one queue per LP. Events that cross LPs are wrapped with PDES time. [Figure: each LP holds a router, a core and a cache behind a PDES time wrapper.]
E.g. Router Event: Router 0 sends an event to router 1. Before: router 0, core 0, cache 0, router 1, core 1 and cache 1 share one event queue. After: LP 0 (router 0, core 0, cache 0) and LP 1 (router 1, core 1, cache 1) each have their own queue, and the event passes through the PDES time wrapper.
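The router example can be sketched in a few lines: each LP owns a private queue, and a wrapper decides which queue an event belongs to. The `LP` and `TimeWrapper` classes and the `send` method are hypothetical names for this illustration and are not the SimK API; a real wrapper would also carry the synchronization logic.

```python
import heapq
from itertools import count

_seq = count()  # global tie-breaker so heap entries never compare payloads


class LP:
    """A logical process: a group of sinkers sharing one private event queue."""

    def __init__(self, lp_id, sinkers):
        self.lp_id = lp_id
        self.sinkers = set(sinkers)
        self.queue = []

    def push(self, time, sinker, payload):
        heapq.heappush(self.queue, (time, next(_seq), sinker, payload))


class TimeWrapper:
    """Routes each event to the queue of the LP that owns the target sinker."""

    def __init__(self, lps):
        self.owner = {s: lp for lp in lps for s in lp.sinkers}
        self.cross_lp_events = 0

    def send(self, src_lp, time, sinker, payload):
        dst_lp = self.owner[sinker]
        if dst_lp is not src_lp:
            # An event crossing LPs: a real framework would synchronize here.
            self.cross_lp_events += 1
        dst_lp.push(time, sinker, payload)
        return dst_lp


lp0 = LP(0, ["router0", "core0", "cache0"])
lp1 = LP(1, ["router1", "core1", "cache1"])
wrapper = TimeWrapper([lp0, lp1])

wrapper.send(lp0, 5, "cache0", "local fill")  # stays inside LP 0
wrapper.send(lp0, 6, "router1", "packet")     # router 0 -> router 1, crosses LPs
```

Only the second event crosses an LP boundary, so only that one would pay the synchronization cost.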
Events from DES to PDES: from a single thread to multiple threads under conservative PDES. [Figure: events of threads 1 to 4 along the simulation-time axis, with 1-cycle event dependences.]
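One common conservative scheme is a synchronous window: in each round every LP may process the events that lie within one lookahead of the globally earliest pending event. The sketch below emulates that scheme sequentially. It is an assumption chosen for illustration, not necessarily the protocol SimK uses, and `run_conservative` and `ping_pong` are hypothetical names.

```python
import heapq

LOOKAHEAD = 1  # cycles, matching the 1-cycle event dependence on the slide


def run_conservative(queues, handler, lookahead=LOOKAHEAD):
    """Process events in safe windows.

    queues:  dict lp_id -> heap of (time, payload)
    handler: handler(lp_id, time, payload) -> list of (dst_lp, new_time, payload);
             events for another LP must have new_time >= time + lookahead.
    """
    processed = []
    while any(queues.values()):
        # Lower bound on the timestamp of every unprocessed event.
        bound = min(q[0][0] for q in queues.values() if q)
        horizon = bound + lookahead
        # Each LP could run this inner loop in its own thread: no LP can
        # receive an event earlier than the horizon during this round.
        for lp_id, q in queues.items():
            while q and q[0][0] < horizon:
                time, payload = heapq.heappop(q)
                processed.append((lp_id, time, payload))
                for dst, new_time, new_payload in handler(lp_id, time, payload):
                    if dst != lp_id:
                        assert new_time >= time + lookahead, "lookahead violated"
                    heapq.heappush(queues[dst], (new_time, new_payload))
    return processed


def ping_pong(lp_id, time, payload):
    # Bounce a token to the other LP one cycle later, stopping at cycle 3.
    if time >= 3:
        return []
    return [(1 - lp_id, time + 1, payload)]


queues = {0: [(0, "token")], 1: []}
trace = run_conservative(queues, ping_pong)
```

With a lookahead of one cycle each window covers a single cycle, so the LPs must resynchronize every cycle; this is the cost that makes a small lookahead expensive.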
Grouping into Big LPs. Problem: avg. speedup is only 1.8x with 16 threads (prototype with 16 one-core LPs). Cause: too many LPs combined with an extremely small lookahead, which leads to high synchronization cost. Solution: group adjacent LPs into one big LP.
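To see why grouping adjacent LPs helps, one can count how many neighbor links cross an LP boundary before and after grouping. The sketch below assumes a hypothetical 4 x 4 mesh of cores and square tiles; the mesh shape and the `cross_lp_links` helper are illustrative assumptions, not details taken from the slides.

```python
def cross_lp_links(mesh_w, mesh_h, block_w, block_h):
    """Count mesh links whose two endpoints fall in different LPs.

    Cores sit on a mesh_w x mesh_h grid; each block_w x block_h tile of
    adjacent cores forms one LP. Returns (cross_links, total_links, num_lps).
    """

    def lp_of(x, y):
        return (x // block_w, y // block_h)

    cross = total = 0
    for y in range(mesh_h):
        for x in range(mesh_w):
            # Visit each link once: right neighbour and lower neighbour.
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < mesh_w and ny < mesh_h:
                    total += 1
                    if lp_of(x, y) != lp_of(nx, ny):
                        cross += 1
    num_lps = len({lp_of(x, y) for y in range(mesh_h) for x in range(mesh_w)})
    return cross, total, num_lps


one_core_lps = cross_lp_links(4, 4, 1, 1)  # 16 LPs with one core each
grouped_lps = cross_lp_links(4, 4, 2, 2)   # 4 LPs with a 2 x 2 tile each
```

In this toy mesh, grouping cuts the links that need cross-LP synchronization from 24 of 24 to 8 of 24, while events inside a tile are handled through the LP's own queue.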
Final Parallelized Version: Parallel Discrete Event Simulation. Sinkers are grouped into big LPs. LPs are bound to threads using the SimK API. Time is synchronized between LPs using PDES. Scheduling and execution run under the SimK framework. [Figure: SimK on the host cores, with layers for schedule and exec; commu, sync, buffer and deploy; and the API.]
Outline: Motivation; Implementation; Evaluation (Accuracy, Speedup); Conclusion
Evaluation Setup: GAS vs. P-GAS. Host: SMP with 4 quad-core AMD Opteron 8347 processors, 16 cores total, 64 GB memory. Benchmark: SPLASH-2 kernels. Only the benchmark computing time is counted, measured in wall-clock time.
Cycle Count Error: Avg. cycle count error: 0.04%
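A relative cycle count error of this kind is typically computed as the difference between the parallel and the sequential cycle counts, divided by the sequential count. The sketch below assumes that definition; the slides do not state the exact formula, and the cycle counts used here are made-up numbers, not measured results.

```python
def cycle_count_error(reference_cycles, parallel_cycles):
    """Relative error of the parallel run against the sequential reference."""
    return abs(parallel_cycles - reference_cycles) / reference_cycles


def average_error(pairs):
    """Mean relative error over (reference_cycles, parallel_cycles) pairs."""
    errors = [cycle_count_error(ref, par) for ref, par in pairs]
    return sum(errors) / len(errors)


# Hypothetical cycle counts, chosen only to illustrate the arithmetic.
example_error = cycle_count_error(1_000_000, 1_000_400)
example_average = average_error([(1_000_000, 1_000_400), (2_000_000, 1_999_200)])
```

Under this definition, an error of 0.04% corresponds to 400 cycles of deviation per million simulated cycles.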
P-GAS Speedup (16 threads, SPLASH-2 kernels): Avg. speedup is 9.8x. Best speedup is 13.6x (LU, 16 threads). Super-linear speedup of 5.3x with 4 threads.
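The reported figures can be put in terms of parallel efficiency, that is, speedup divided by thread count; an efficiency above 1 is what "super-linear" means. The helper names below are illustrative, and the inputs are the speedups quoted on this slide.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Wall-clock speedup of the parallel run over the sequential one."""
    return sequential_seconds / parallel_seconds


def efficiency(speedup_value, threads):
    """Parallel efficiency: 1.0 is linear scaling, above 1.0 is super-linear."""
    return speedup_value / threads


# (speedup, threads) pairs as reported on the slide.
reported = {
    "average, 16 threads": (9.8, 16),
    "best (LU), 16 threads": (13.6, 16),
    "4 threads": (5.3, 4),
}

for label, (s, n) in reported.items():
    e = efficiency(s, n)
    note = " (super-linear)" if e > 1 else ""
    print(f"{label}: efficiency {e:.2f}{note}")
```

Only the 4-thread result exceeds an efficiency of 1; the 16-thread results are strong but sub-linear.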
Why super-linear speedup? More cores bring more caches to use, and the insert-to-queue time is shorter. Super-linear speedup is observed with 4 threads.
Conclusion: P-GAS uses PDES to speed up a cycle-accurate many-core processor simulator, reaching a 9.8x speedup on a 16-core SMP with cycle error < 0.04%. Highly optimized conservative PDES could be used for fast and accurate system simulation: multi-core/many-core processor simulation, SMP clusters, many-core clusters...
P-GAS: Parallelizing a Many-Core Processor Simulator Using PDES. Please send me your questions: Open source release of our PDES framework: