A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,


1 A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling
Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan
Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)
ISPASS 2012, April 2, 2012

2 Background
Memory behavior is a key factor in the performance of a program. Understanding memory behavior is important for identifying bottlenecks in both the architecture and the application; this is done by memory profiling. For example:
– The TLB is an essential component of the memory system
– Applications’ working sets tend to grow larger and larger, leading to frequent TLB misses
– Study 1: TLB misses can degrade system performance by 5~14% [Bhargava’08]
– Study 2: a large number of TLB misses in multi-threaded programs are redundant and predictable, which implies optimization potential [Bhattacharjee’08]

3 Memory Profiling
Memory profiling collects memory behavior information during the execution of programs. Profiling can be performed
– for different hardware components: TLB, cache, DRAM
– at different software levels: objects (arrays, lists, etc.), function, application, whole system

4 Object Memory Profiling
An object is a group of data stored as a unit [Wu’04]
– Distinguishes regular patterns from mixed and irregular traces
Valuable for optimization
– Memory trace compression
– Data layout
– Object-level prefetching
– Cache partitioning [Soft-OLP, PACT 2009]
[Figure: whole-system traces, application traces, and object traces, ranging from irregular to regular]

5 Current Profiling Approaches
Existing approaches
– Compiler-driven: requires re-compiling/re-linking and access to source code
– Instrumentation: heavy overhead
– Simulation: accuracy problems, slow
– Performance counters: lack detailed information
None of them can observe page table walks caused by TLB misses.
We propose a hybrid hardware/software approach for object memory profiling
– Accurate: real applications on a real system
– Lightweight
– Tracks page table walks at the object level

6 Outline
– Background
– Design and Implementation
– Experimental Results
– Conclusion

7 An Overview
[Figure: a physical address trace (0x398f24a, 0x398f24b, 0x398f24c, ...) is turned into a virtual address trace (0x1f05000, 0x1f06000, 0x1f07000, ...), from which the object access pattern is derived, e.g. for the object Matrix (VA: 0x1f05000)]

8 HMTT
Hybrid Memory Trace Toolkit
– A DDR3 SDRAM-compatible memory trace monitoring system
– Adopts hardware snooping technology
[Figure: HMTT board; labels: DIMM plugged on the other side, PCIE cable connector, memory trace]
Advantages:
– Platform independent
– Negligible overhead
– Full-system real memory traces, including OS activity and page table walks

9 Challenge (1)
How to translate the physical address trace into the virtual address trace of a specific process?
– Modify the OS kernel to obtain the page table
– Look up each phy_addr in the dumped page table
– Generate a virtual trace for each process
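The lookup step can be sketched as a reverse map from physical frame to (pid, virtual page). This is only an illustration: the 4 KB page size, the layout of `dumped_page_table`, and the function names are assumptions, not the format HMTT actually uses. Frames shared between processes are ignored for simplicity.

```python
PAGE_SHIFT = 12                      # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def build_reverse_table(dumped_page_table):
    """Invert a hypothetical dump {pid: {virtual_page: physical_frame}}."""
    reverse = {}
    for pid, mappings in dumped_page_table.items():
        for vpn, pfn in mappings.items():
            # Simplification: a frame is assumed to belong to one process.
            reverse[pfn] = (pid, vpn)
    return reverse

def translate(reverse, phy_addr):
    """Return (pid, virtual address) for phy_addr, or None if unmapped."""
    entry = reverse.get(phy_addr >> PAGE_SHIFT)
    if entry is None:
        return None
    pid, vpn = entry
    # Keep the in-page offset, swap the frame number for the page number.
    return pid, (vpn << PAGE_SHIFT) | (phy_addr & PAGE_MASK)

dump = {42: {0x1f05: 0x398f}}        # made-up pid and mapping
reverse = build_reverse_table(dump)
print(translate(reverse, 0x398f24a))
```

Filtering the translated records by pid then gives the virtual trace of one process.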

10 Challenge (2)
How to synchronize hardware and software when a page table update occurs in the kernel?
– Physical page allocation/free in the kernel
– Trigger annotations in the OS VM module
– Update the dumped page table
– Send a sync_tag to the hardware

11 Challenge (3)
How to translate virtual addresses to objects without modifying source code?
– The role of malloc() is to map VAs to objects
– Use dynamic library overriding to replace malloc()
[Figure: matrix = malloc(0x1000) becomes matrix = mymalloc(0x1000); the object matrix in the virtual address space is recorded in the Object-VA Mapping Table]
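The real interception happens in a native library that replaces malloc() (on Linux this is commonly done with a preloaded shared library, though the slides do not say which mechanism is used). The sketch below only models the table side: what a replacement allocator might record, and how a virtual address could later be mapped back to an object. The class and method names are hypothetical.

```python
import bisect

class ObjectVAMap:
    """Hypothetical Object-VA mapping table keyed by start address."""

    def __init__(self):
        self._starts = []    # sorted start addresses
        self._entries = []   # parallel list of (start, size, name)

    def record_alloc(self, name, start, size):
        # Called where a replaced malloc() would log the new allocation.
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (start, size, name))

    def lookup(self, va):
        # Find the last object starting at or below va, then bounds-check.
        i = bisect.bisect_right(self._starts, va) - 1
        if i < 0:
            return None
        start, size, name = self._entries[i]
        return name if va < start + size else None

table = ObjectVAMap()
table.record_alloc("matrix", 0x1f05000, 0x1000)
print(table.lookup(0x1f05800))
```

A sorted list with binary search keeps lookups logarithmic in the number of live objects, which matters when every trace record has to be attributed.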

12 Put them all together
[Figure: physical address trace -> virtual address trace -> object access pattern, using the dumped page table (with sync_tag and page walk records) and the Object-VA Mapping Table]
Use the page table to distinguish three types of memory access
– sync_tag -> update the page table
– Access to the page table itself -> page table walk due to a TLB miss
– Other memory accesses -> translate to a virtual address
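The three-way classification can be sketched as follows. The record format `(kind, value)`, the set `page_table_frames`, and the 4 KB page size are assumptions made for illustration; the slides do not describe the actual trace encoding.

```python
from collections import Counter

PAGE_SHIFT = 12  # assumption: 4 KB pages

def classify(record, page_table_frames):
    """Classify one hypothetical trace record (kind, value)."""
    kind, value = record
    if kind == "sync_tag":
        return "update_page_table"
    # An access that lands in a frame holding page table entries is
    # taken to be a hardware page table walk after a TLB miss.
    if (value >> PAGE_SHIFT) in page_table_frames:
        return "page_table_walk"
    return "translate_to_virtual"

frames = {0x1af4}                     # made-up frame holding page table entries
trace = [("access", 0x398f24a), ("access", 0x1af4aa8), ("sync_tag", None)]
print(Counter(classify(r, frames) for r in trace))
```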

13 Evaluation Methodology
Processor: Intel Xeon E5504, 2.0GHz, 2 sockets, 4 cores per socket (8 cores in total)
Private cache:
  L1 D-Cache: 32KB, 8-way, 64Byte/line
  L1 I-Cache: 32KB, 4-way, 64Byte/line
  L2: 256KB, 8-way, 64Byte/line
Shared cache:
  L3: 4MB, 16-way, 64Byte/line
TLB (private):
  DTLB0: 64 entries for 4-KByte pages, 32 entries for huge pages (2MByte)
  TLB1: 512 entries for 4-KByte pages
Memory: DDR3-800 RDIMM, dual-rank, plugged into Socket 0, 4GB
  0.25GB reserved for HMTT configuration and buffer
  3.75GB available to the system
Operating system: CentOS 5.3, Linux kernel 2.6.32.18
Benchmarks:
  Multithreaded PARSEC 2.1
  A custom hybrid MPI/pthread implementation of BFS from Graph500-1.2

14 Validation
SpMV benchmark (CSR): y = ax * xhost
– Our system is able to distinguish regular access patterns from irregular ones
Micro-benchmark:
– The error is less than 2%

15 Overhead
Two main sources of overhead:
– Dumping page table traces: + dump_pt
– Dumping object-VA mappings: + dump_obj
Only objects >= 4KB are monitored; they account for most memory references.
[Figure annotations: <1%, <2%]

16 Case Study 1: BFS (Breadth-First Search)
– The column object accounts for about 71% of page walks -> key object
– Optimization: use huge pages for the column object
– Speedup: about 12% for 8 threads, 8% for 128 threads
[Figure annotation: 8.18%]
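One way to see why huge pages can reduce page walks is to compare TLB reach (entries times page size) using the TLB sizes listed on the Evaluation Methodology slide. This is a back-of-the-envelope sketch, not a calculation from the talk; the helper name is made up.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries, page_size):
    """Bytes of memory that can be mapped without a TLB miss."""
    return entries * page_size

# Entry counts taken from the Evaluation Methodology slide.
print(tlb_reach(64, 4 * KB) // KB, "KB  (DTLB0, 4 KB pages)")
print(tlb_reach(512, 4 * KB) // MB, "MB  (TLB1, 4 KB pages)")
print(tlb_reach(32, 2 * MB) // MB, "MB  (DTLB0, 2 MB huge pages)")
```

With 2 MB pages the first-level data TLB covers far more memory than both levels do with 4 KB pages, so a large, irregularly accessed object placed in huge pages should miss the TLB less often.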

17 Case Study 2: Canneal (PARSEC)
– Cache-aware simulated annealing (SA) to minimize the routing cost of a chip design
– Two objects contribute most of the memory accesses: _elements and _location
– The memory accesses hardly change as the number of threads increases

18 Case Study 2: Canneal
– The _elements object contributes most of the increase in page walks
– Put the _elements object into huge pages to reduce TLB misses -> speedup: about 5% for 8 threads

19 A Visual Demo of the HMTT

20 Conclusion
We have designed and implemented a hybrid hardware/software approach to object-relative memory profiling.
– Accurate: real applications on a real system
– Lightweight
– Tracks page table walks at the object level
We present two case studies showing how the approach can help users better understand memory behavior and optimize performance.
We intend to use this approach to analyze virtual machines on real machines.

21 Thanks! Questions?

22 Extra Slides

23 Memory Profiling Approaches
Approach             Accurate  Detailed  Low overhead  Page walks
Instrumentation      √         √         ×             ×
Simulator            *         √         ×             ×
Performance Counter  √         ×         √             *
Compiler             √         √         √             ×
Hybrid H/S           √         √         √             √
Note: √ = Yes, × = No, * = Maybe

24 Reverse Page Table
Physical address -> (pid, virtual address)

25 Validation
Obj  Read       Write      Rate    Per  Error
a0   4,194,370  0          4:0          0%
a1   4,194,310  1,048,576  4:1          0%
a2   4,194,369  2,096,927  4:2          0%
a3   4,194,303  3,087,379  4:2.94  4:3  2.04%
a4   4,194,436  4,149,586  4:3.96  4:4  1.01%
Objects are accessed with different patterns:
– a0: all read accesses, forward
– a1: 3/4 read and 1/4 write accesses, forward
– a2: 2/4 read and 2/4 write accesses, forward
– a3: 1/4 read and 3/4 write accesses, backward
– a4: all write accesses, backward
Each object: size 256MB, access step 64B, 4M requests
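The Rate and Error figures for a3 and a4 can be reproduced from the read and write counts. The formulas below are inferred from the numbers rather than stated in the slides: the rate is taken as writes per 4 reads, and the error as the gap between the expected rate (the Per column, assumed to be the ideal ratio) and the measured rate, relative to the measured rate.

```python
def measured_rate(reads, writes):
    """Writes per 4 reads, matching the slide's "4:x" notation."""
    return round(4 * writes / reads, 2)

def relative_error(expected, measured):
    """Percentage gap between expected and measured rate (inferred formula)."""
    return round(abs(expected - measured) / measured * 100, 2)

# (reads, writes, expected writes per 4 reads) from the validation table
rows = {"a3": (4_194_303, 3_087_379, 3), "a4": (4_194_436, 4_149_586, 4)}
for name, (reads, writes, expected) in rows.items():
    rate = measured_rate(reads, writes)
    print(name, f"4:{rate}", f"{relative_error(expected, rate)}%")
```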

26 HMTT Configuration Space
– A reserved physical memory region
– Can be accessed from both source code and binary code

