
Slide 1: Hybrid System Emulation. Taeweon Suh, Computer Science Education, Korea University, January 2010.

Slide 2: Agenda. Scope; Background; Related Work; Hybrid System Emulation; Case Studies (L3 Cache Emulation, Evaluation of Coherence Traffic Efficiency, HW/SW Co-Simulation); Conclusions.

Slide 3: Scope. A typical computer system (up through Core 2): the CPU connects to the North Bridge over the FSB (Front-Side Bus), the North Bridge connects to main memory (DDR2), and the South Bridge hangs off the North Bridge via DMI (Direct Media Interface).

Slide 4: Scope (Cont.). A Nehalem-based computer system: the CPU connects directly to main memory (DDR3) and reaches the North Bridge over QuickPath (Intel) or HyperTransport (AMD); the South Bridge is attached via DMI (Direct Media Interface).

Slide 5: Scope (Cont.). The scope of this talk is the processor side of the FSB: the CPUs (each with cores and their L1/L2 caches) and the FSB connecting them to the North Bridge, main memory (DDR2), and, via DMI, the South Bridge.

Slide 6: Agenda. Scope; Background; Related Work; Hybrid System Emulation; Case Studies (L3 Cache Emulation, Evaluation of Coherence Traffic Efficiency, HW/SW Co-Simulation); Conclusions.

Slide 7: Background. Computer architecture research has been done mostly with software simulation. Pros: relatively easy to implement; flexibility; observability; debuggability. Cons: simulation time; difficulty modeling the real world, such as I/O.

Slide 8: Background (Cont.). What is an alternative? The FPGA (Field-Programmable Gate Array): reconfigurability (programmable hardware, short turn-around time); high operating frequency; observability and debuggability; many IPs provided (CPUs, memory controllers, etc.).

Slide 9: Background (Cont.). FPGA capability example: a reconfigurable Pentium, i.e., the real Pentium reimplemented in an FPGA (photos of the real Pentium and the FPGA-based reconfigurable Pentium).

Slide 10: Agenda. Scope; Background; Related Work; Hybrid System Emulation; Case Studies (L3 Cache Emulation, Evaluation of Coherence Traffic Efficiency, HW/SW Co-Simulation); Conclusions.

Slide 11: Related Work. MemorIES (2000): the Memory Instrumentation and Emulation System from IBM T.J. Watson; L3 cache and/or coherence protocol emulation; plugged into the 6xx bus of an RS/6000 SMP machine; a passive emulator.

Slide 12: Related Work (Cont.). RAMP: Research Accelerator for Multiple Processors; targets parallel computer architecture and multi-core HW/SW research; a full emulator; a multi-disciplinary project by UC Berkeley, Stanford, CMU, UT Austin, MIT, and Intel; built on BEE2 boards populated with FPGAs.

Slide 13: Agenda. Scope; Background; Related Work; Hybrid System Emulation; Case Studies (L3 Cache Emulation, Evaluation of Coherence Traffic Efficiency, HW/SW Co-Simulation); Conclusions.

Slide 14: Hybrid System Emulation. A combination of an FPGA and a real system: the FPGA is deployed in a system of interest and interacts with it, monitoring transactions from the system and providing feedback to it; system-level active emulation; workloads run on the real system; the emulated components can be researched, measured, and evaluated in a full-system configuration. In this research, the FPGA is deployed on the FSB.

Slide 15: Hybrid System Emulation, Experiment Setup. An Intel server system equipped with two Pentium-IIIs (front-side bus, North Bridge, 2 GB SDRAM) is used, and one Pentium-III is replaced with an FPGA board; the FPGA actively participates in transactions on the FSB.

Slide 16: Hybrid System Emulation, Front-Side Bus (FSB). FSB protocol: a 7-stage pipelined bus (Pentium-III) with the stages request1, request2, error1, error2, snoop, response, and data. How does the FPGA participate in FSB transactions? Through snoop stalls (part of the cache coherence mechanism: delaying the snoop response) and cache-to-cache transfers (part of the cache coherence mechanism: providing data from a processor's cache to the requester over the FSB).

Slide 17: Hybrid System Emulation, Cache Coherence Protocol. Example: the MESI protocol (Modified, Exclusive, Shared, Invalid), a snoop-based protocol; Intel implements MESI. Walk-through with two Pentium-IIIs (P0, P1) and main memory behind the North Bridge: 1. P0 reads the line (1234) and holds it Exclusive. 2. P1 reads the same line; both copies become Shared. 3. P1 writes abcd; an invalidation makes P0's copy Invalid and P1's copy Modified. 4. P0 reads again; P1 stalls the snoop ("snoop stall") and supplies abcd via a cache-to-cache transfer, after which both copies are Shared.
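The four-step walk-through above can be reproduced with a few lines of C. The sketch below is only an illustrative model of the MESI transitions and of the memory update that accompanies a cache-to-cache transfer on the Pentium-III; the names (line_t, read_line, and so on) are hypothetical, and this is not the FPGA logic from the talk.

```c
#include <stdio.h>

typedef enum { M, E, S, I } mesi_t;
static const char *name[] = { "M", "E", "S", "I" };

typedef struct { mesi_t state; unsigned data; } line_t;

/* Bus read: a snoop hit on a Modified line is answered with a cache-to-cache
 * transfer (and, on Pentium-III MESI, a simultaneous memory update);
 * otherwise the data comes from main memory. */
static void read_line(line_t *self, line_t *other, unsigned *mem) {
    if (self->state != I) return;            /* already holds a valid copy  */
    if (other->state == M) {
        self->data = other->data;            /* cache-to-cache transfer     */
        *mem = other->data;                  /* memory updated as well      */
        other->state = S;
        self->state = S;
    } else if (other->state == E || other->state == S) {
        self->data = *mem;                   /* fill from main memory       */
        other->state = S;
        self->state = S;
    } else {
        self->data = *mem;
        self->state = E;                     /* sole copy: Exclusive        */
    }
}

/* Bus write: invalidate the other copy, become Modified. */
static void write_line(line_t *self, line_t *other, unsigned value) {
    other->state = I;                        /* invalidation on the bus     */
    self->data = value;
    self->state = M;
}

int main(void) {
    line_t p0 = { I, 0 }, p1 = { I, 0 };
    unsigned mem = 0x1234;

    read_line(&p0, &p1, &mem);               /* 1. P0: read  -> P0 = E         */
    printf("1. P0 read : P0=%s\n", name[p0.state]);
    read_line(&p1, &p0, &mem);               /* 2. P1: read  -> P0 = P1 = S    */
    printf("2. P1 read : P0=%s P1=%s\n", name[p0.state], name[p1.state]);
    write_line(&p1, &p0, 0xabcd);            /* 3. P1: write -> P0 = I, P1 = M */
    printf("3. P1 write: P0=%s P1=%s\n", name[p0.state], name[p1.state]);
    read_line(&p0, &p1, &mem);               /* 4. P0: read  -> cache-to-cache */
    printf("4. P0 read : P0=%s P1=%s data=%x mem=%x\n",
           name[p0.state], name[p1.state], p0.data, mem);
    return 0;
}
```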

Slide 18: Agenda. Scope; Background; Related Work; Hybrid System Emulation; Case Studies (L3 Cache Emulation ┼, Evaluation of Coherence Traffic Efficiency, HW/SW Co-Simulation); Conclusions. ┼ Erico Nurvitadhi, Jumnit Hong, and Shih-Lien Lu, "Active Cache Emulator," IEEE Transactions on VLSI Systems, 2008.

Slide 19: L3 Cache Emulation, Methodology. Implement the L3 tags in the FPGA, which sits on the front-side bus alongside the Pentium-III (L1, L2), the North Bridge, and 2 GB SDRAM. On a miss, inject snoop stalls and record the line in the L3 tags: the "new" memory access latency (the L3 miss latency) = snoop stalls + memory access latency. On a hit, inject no snoop stall: the L3 (hit) latency = the plain memory access latency.
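A minimal sketch of the per-transaction decision this methodology implies, assuming a direct-mapped tag array; the sizes and names below are illustrative assumptions, not the parameters of the actual FPGA design.

```c
#include <stdbool.h>
#include <stdint.h>

#define L3_SIZE    (16u * 1024 * 1024)   /* an example 16 MB emulated L3    */
#define LINE_SIZE  64u                   /* an example block size in bytes  */
#define NUM_SETS   (L3_SIZE / LINE_SIZE)

typedef struct { bool valid; uint32_t tag; } l3_tag_t;
static l3_tag_t l3_tags[NUM_SETS];

/* Returns the number of snoop-stall cycles to inject for one FSB address:
 * 0 on an emulated L3 hit (latency = plain memory access), or a penalty on
 * a miss (latency = snoop stalls + memory access). */
unsigned l3_lookup(uint64_t paddr, unsigned miss_stall_cycles) {
    uint64_t line = paddr / LINE_SIZE;
    uint32_t set  = (uint32_t)(line % NUM_SETS);
    uint32_t tag  = (uint32_t)(line / NUM_SETS);

    if (l3_tags[set].valid && l3_tags[set].tag == tag)
        return 0;                        /* hit: no snoop stall             */

    l3_tags[set].valid = true;           /* miss: allocate the tag ...      */
    l3_tags[set].tag   = tag;
    return miss_stall_cycles;            /* ... and stretch the snoop phase */
}
```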

Slide 20: L3 Cache Emulation, Experiment Environment. Operating system: Windows XP. Validation of the emulated L3 cache: RightMark Memory Analyzer ┼. ┼ RightMark Memory Analyzer, http://cpu.rightmark.org/products/rmma.shtml

Slide 21: L3 Cache Emulation, Experiment Result. RightMark Memory Analyzer result (plot of access latency, in CPU cycles and nanoseconds, versus working set size, with regions for the L1 cache, L2 cache, emulated L3 cache, and main memory).

Slide 22: Agenda. Scope; Background; Related Work; Hybrid System Emulation; Case Studies (L3 Cache Emulation, Evaluation of Coherence Traffic Efficiency ┼, HW/SW Co-Simulation); Conclusions. ┼ Taeweon Suh, Shih-Lien Lu, and Hsien-Hsin S. Lee, "An FPGA Approach to Quantifying Coherence Traffic Efficiency on Multiprocessor Systems," 17th FPL, 2007.

Slide 23: Evaluation of Coherence Traffic Efficiency, Methodology. Implement an L2-sized cache in the FPGA, which sits on the FSB next to the Pentium-III (MESI), the North Bridge, and 2 GB SDRAM; save cache lines evicted by the Pentium-III into that cache; supply the data via a cache-to-cache transfer the next time the Pentium-III requests it; measure the execution time of benchmarks and compare it with the baseline.
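A hedged sketch of the bookkeeping this methodology implies on the FPGA side: evicted (written-back) lines are captured into a direct-mapped cache, and a later read of the same line is claimed and answered with a cache-to-cache transfer. The cache size and every name here are illustrative assumptions, not the actual design.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_BYTES  (256u * 1024)     /* the largest configuration studied */
#define LINE_BYTES   32u               /* Pentium-III cache line size       */
#define NUM_LINES    (CACHE_BYTES / LINE_BYTES)

typedef struct { bool valid; uint32_t tag; uint8_t data[LINE_BYTES]; } line_t;
static line_t cache[NUM_LINES];

static unsigned idx_of(uint64_t paddr) { return (paddr / LINE_BYTES) % NUM_LINES; }
static uint32_t tag_of(uint64_t paddr) { return (uint32_t)((paddr / LINE_BYTES) / NUM_LINES); }

/* Called when a write-back (eviction) is observed on the FSB. */
void on_writeback(uint64_t paddr, const uint8_t line[LINE_BYTES]) {
    unsigned i = idx_of(paddr);
    cache[i].valid = true;
    cache[i].tag   = tag_of(paddr);
    memcpy(cache[i].data, line, LINE_BYTES);
}

/* Called when a full-line read is observed; returns true if the FPGA should
 * claim the snoop and supply the line via a cache-to-cache transfer, false
 * if main memory should respond as usual. */
bool on_read(uint64_t paddr, uint8_t out[LINE_BYTES]) {
    unsigned i = idx_of(paddr);
    if (cache[i].valid && cache[i].tag == tag_of(paddr)) {
        memcpy(out, cache[i].data, LINE_BYTES);
        return true;                   /* counted as a cache-to-cache hit   */
    }
    return false;
}
```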

Slide 24: Evaluation of Coherence Traffic Efficiency, Experiment Environment. Operating system: Red Hat Linux 2.4.20-8. SPEC2000 benchmarks run natively; the choice of benchmark does not affect the evaluation as long as a reasonable amount of bus traffic is generated. The FPGA sends statistics to a PC via UART: the number of cache-to-cache transfers per second and the amount of invalidation traffic per second.

Slide 25: Evaluation of Coherence Traffic Efficiency, Experiment Results. Average number of cache-to-cache transfers per second (bar chart over gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and the average; annotated values: 804.2K/sec and 433.3K/sec).

Slide 26: Evaluation of Coherence Traffic Efficiency, Experiment Results (Cont.). Average execution time increase. Baseline: benchmark execution on a single Pentium-III without the FPGA, so data is always supplied from main memory. Average baseline execution time: 5635 seconds (93 min); annotated increases: 191 seconds and 171 seconds.

Slide 27: Evaluation of Coherence Traffic Efficiency, Run-time Breakdown. Run-time estimation with the 256 KB cache in the FPGA:
  Invalidation traffic: latency 5~10 FSB cycles, estimated run-time 69~138 seconds.
  Cache-to-cache transfer: latency 10~20 FSB cycles, estimated run-time 381~762 seconds.
Note that the execution time increased by 171 seconds on average, out of an average total baseline execution time of 5635 seconds. Cache-to-cache transfer is therefore responsible for at least a 33-second (171 - 138) increase. Cache-to-cache transfer on the Pentium-III server system is NOT as efficient as main memory access!
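The backup slide on this breakdown (slide 61) spells out the estimate; written as a formula with symbol names of my choosing:

\[
T_{\text{est}} = r \times T_{\text{total}} \times t_{\text{FSB}} \times L
\]

where r is the average number of occurrences of the traffic type per second, T_total the average total execution time (5635 seconds), t_FSB the FSB clock period, and L the assumed per-event latency in FSB cycles (5~10 for invalidation traffic, 10~20 for cache-to-cache transfers).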

Slide 28: Agenda. Scope; Background; Related Work; Hybrid System Emulation; Case Studies (L3 Cache Emulation, Evaluation of Coherence Traffic Efficiency, HW/SW Co-Simulation ┼); Conclusions. ┼ Taeweon Suh, Hsien-Hsin S. Lee, and John Shen, "Initial Observations of Hardware/Software Co-Simulation using FPGA in Architecture Research," 2nd WARFP, 2006.

Slide 29: HW/SW Co-Simulation, Motivation. Gain the advantages of both software simulation and hardware emulation: flexibility and high speed. Idea: offload heavy software routines into the FPGA, and let the remaining simulator interact with the FPGA.

Slide 30: HW/SW Co-Simulation, Communication Method. Communication between the Pentium-III and the FPGA: use the FSB as the communication medium; allocate one page of memory for communication; send data to the FPGA with the page in write-through cache mode ("write" bus transactions); receive data from the FPGA via cache-to-cache transfer ("read" bus transactions). Setup: Pentium-III (MESI) and the FPGA on the FSB, with the North Bridge and 2 GB SDRAM.
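From the user-space side, the communication pattern described above might look like the following sketch. The device name, the mmap offset, and the word layout are all hypothetical; the real experiment used its own Linux driver and simulator hooks, so this only illustrates the write-through-store / cache-to-cache-read idea.

```c
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/fpga_cosim", O_RDWR);     /* hypothetical device     */
    if (fd < 0) { perror("open"); return 1; }

    /* The driver is assumed to map one 4 KB page whose attribute it has set
     * to write-through mode. */
    void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *page = map;

    page[0] = 0xdeadbeef;      /* store: becomes an FSB write transaction
                                  that the FPGA can observe and decode      */
    uint32_t result = page[1]; /* load: becomes an FSB read transaction that
                                  the FPGA answers with a cache-to-cache
                                  transfer                                  */
    printf("FPGA returned 0x%08" PRIx32 "\n", result);

    munmap(map, 4096);
    close(fd);
    return 0;
}
```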

Slide 31: HW/SW Co-Simulation, Co-Simulation Results. Preliminary experiment with SimpleScalar for a correctness check: a simple function (mem_access_latency) was implemented in the FPGA. Benchmarks: mcf, bzip2, crafty, eon-cook, gcc-166, parser, perl, twolf. Baseline vs. co-simulation times (h:m:s), with the difference:
  2:18:38 vs. 2:20:50 (+0:02:12)
  3:03:58 vs. 3:06:50 (+0:02:52)
  2:56:38 vs. 2:59:28 (+0:02:50)
  2:43:52 vs. 2:45:45 (+0:01:53)
  3:45:30 vs. 3:48:56 (+0:03:26)
  3:34:57 vs. 3:37:27 (+0:02:30)
  2:42:30 vs. 2:45:50 (+0:03:20)
  2:43:30 vs. 2:45:28 (+0:01:58)

Slide 32: HW/SW Co-Simulation, Analysis & Learnings. Reasons for the slowdown: FSB access is expensive; the offloaded function (mem_access_latency) is too simple; device driver overhead. Success criteria: time-consuming software routines and a reasonable FPGA access frequency.

Slide 33: HW/SW Co-Simulation, Research Opportunity. Multi-core research: implement the distributed lowest-level caches and an interconnection network such as a ring or mesh in the FPGA (diagram: eight CPUs, each with L1/L2, attached through ring interfaces to per-node L3 slices, all inside the FPGA).

Slide 34: Agenda. Scope; Background; Related Work; Hybrid System Emulation; Case Studies (L3 Cache Emulation, Evaluation of Coherence Traffic Efficiency, HW/SW Co-Simulation); Conclusions.

Slide 35: Conclusions. Hybrid system emulation: deploy an FPGA at a place of interest in a system; system-level active emulation; take advantage of an existing system. Presented three usage cases in computer architecture research: L3 cache emulation, evaluation of coherence traffic efficiency, and HW/SW co-simulation. FPGA-based emulation provides an alternative to software simulation.

Slide 36: Questions, Comments? Thanks for your attention!

Slide 37: Backup Slides.

Slide 38: Evaluation of Coherence Traffic Efficiency, Cache Coherence Protocol. Example: the MESI protocol (Modified, Exclusive, Shared, Invalid), a snoop-based protocol; Intel implements MESI. Walk-through with P0 and P1 behind the North Bridge: 1. P0 reads the line (1234) and holds it Exclusive. 2. P1 reads; both copies become Shared. 3. P1 writes abcd; P0's copy is invalidated and P1's copy becomes Modified. 4. P0 reads; P1 supplies abcd via a cache-to-cache transfer and both copies become Shared.

Slide 39: L3 Cache Emulation, Motivation. Software simulation has limitations: simulation time, and reduced datasets and workloads (results could be off by 100% or more). Passive emulation has limitations: it only monitors transactions, so the impact of the emulated components on the system cannot be modeled. Full emulation requires much more effort: it takes much longer to develop a full system and to adapt workloads to the new system.

Slide 40: L3 Cache Emulation, Motivation (Cont.). Active Cache Emulation (ACE): take advantage of an existing system and deploy the emulated component at a place of interest.

Slide 41: L3 Cache Emulation, HW Design. Modules implemented in the FPGA (Xilinx Virtex-II) on the front-side bus: state machines that keep track of up to 8 in-flight FSB transactions; the L3 tags (the emulated L3 size varies from 1 MB to 64 MB, with block sizes from 32 B to 512 B); and a statistics module (registers for statistics, read out by a PC via UART and observed with a logic analyzer).
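One way to picture the "up to 8 FSB transactions" bookkeeping is a small in-order tracker like the sketch below. The stage encoding and all names are assumptions for illustration; the actual design is a set of hardware state machines, not C.

```c
#include <stdbool.h>
#include <stdint.h>

enum stage { REQ1, REQ2, ERR1, ERR2, SNOOP, RESPONSE, DATA, DONE };

struct fsb_txn {
    bool       valid;
    uint64_t   addr;     /* address captured from A[35:3]# */
    enum stage stage;
};

#define MAX_INFLIGHT 8
static struct fsb_txn inflight[MAX_INFLIGHT];
static unsigned head, tail;              /* oldest entry and next free slot */

/* A new ADS# assertion allocates a slot; returns -1 if all 8 are in use. */
int txn_alloc(uint64_t addr) {
    if (inflight[tail].valid) return -1;
    inflight[tail] = (struct fsb_txn){ .valid = true, .addr = addr,
                                       .stage = REQ1 };
    int slot = (int)tail;
    tail = (tail + 1) % MAX_INFLIGHT;
    return slot;
}

/* Called once per bus clock (grossly simplified: real stages span several
 * cycles and snoop stalls stretch the snoop phase): every in-flight
 * transaction advances one stage and the oldest retires in order. */
void txn_advance(void) {
    for (unsigned i = 0; i < MAX_INFLIGHT; i++)
        if (inflight[i].valid && inflight[i].stage < DONE)
            inflight[i].stage++;
    while (inflight[head].valid && inflight[head].stage == DONE) {
        inflight[head].valid = false;
        head = (head + 1) % MAX_INFLIGHT;
    }
}
```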

Slide 42: Evaluation of Coherence Traffic Efficiency, HW Design. Modules implemented in the Xilinx Virtex-II FPGA on the front-side bus: state machines that track FSB transactions, capture evicted (written-back) data from the FSB, and initiate cache-to-cache transfers; direct-mapped caches (tag and data), with the FPGA cache size varying from 1 KB to 256 KB (note that the Pentium-III has a 256 KB 4-way set-associative L2); and a statistics module (registers for statistics, read out by a PC via UART and observed with a logic analyzer).

Slide 43: HW/SW Co-Simulation, Implementation. Hardware (FPGA) implementation: state machines that monitor bus transactions on the FSB, check the bus transaction type (read or write), and manage cache-to-cache transfers; the software functions moved into the FPGA; and statistics counters. Software implementation: a Linux device driver (a specific physical address is needed for communication, so one page of memory is allocated for FPGA access via the driver) and modifications to the simulator for accessing the FPGA.

Slide 44: L3 Cache Emulation, Experiment Results (Cont.). Comparison with SimpleScalar simulation (figure).

Slide 45: Evaluation of Coherence Traffic Efficiency, Motivation. Why is this important? To understand the impact of coherence traffic on system performance and to reflect it in the communication architecture. Problems with traditional methods: they evaluate the protocols themselves, rely on software simulation, and experiments on SMP machines are ambiguous. Solution: a novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency.

Slide 46: Evaluation of Coherence Traffic Efficiency, Experiment Results (Cont.). Average increase in invalidation traffic per second (bar chart over gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and the average; annotated values: 157.5K/sec and 306.8K/sec).

Slide 47: Evaluation of Coherence Traffic Efficiency, Experiment Results (Cont.). Average hit rate in the FPGA's cache (bar chart over gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and the average; annotated values: 64.89% and 16.9%). Hit rate = (# cache-to-cache transfers) / (# full-cache-line data reads).

Slide 48: Motivation. Traditionally, evaluations of coherence protocols have focused on reducing the bus traffic incurred by the protocols' state transitions, and trace-based simulations were mostly used for protocol evaluation. Software simulations are too slow to perform a broad analysis of system behavior, and it is very difficult to model the real world (such as I/O) exactly. The system-wide performance impact of coherence traffic has not been explicitly investigated using real systems. This research provides a new method to evaluate and characterize the coherence traffic efficiency of snoop-based invalidation protocols using an off-the-shelf system and an FPGA.

Slide 49: Motivation and Contribution. Evaluation of coherence traffic efficiency. Motivation: the memory wall keeps rising, so it is important to understand the impact of communication among processors; traditionally, evaluation of coherence protocols focused on the protocols themselves, using software-based simulation. FPGA technology: the original Pentium fits into one Xilinx Virtex-4 LX200; recent emulation efforts include the RAMP consortium (BEE2 board) and MemorIES (ASPLOS 2000). Contribution: a novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency using an emulation technique.

Slide 50: Cache Coherence Protocols. A well-known technique for data consistency among multiprocessors with caches. Classification: snoop-based protocols rely on broadcasting on a shared bus; they are based on shared memory with symmetric access to main memory, have limited scalability, and are used to build small-scale multiprocessor systems (very popular in servers and workstations). Directory-based protocols use message-based communication over an interconnection network; they are based on distributed shared memory (DSM), i.e., cache-coherent non-uniform memory access (ccNUMA), are scalable, are used to build large-scale systems, and were actively studied in the 1990s.

Slide 51: Cache Coherence Protocols (Cont.). Snoop-based protocols. Invalidation-based protocols invalidate shared copies on a write; 1980s examples are Write-Once, Synapse, Berkeley, and Illinois; current processors adopt different combinations of the states (M, O, E, S, and I): MEI (PowerPC 750, MIPS64 20Kc), MSI (Silicon Graphics 4D series), MESI (Pentium class, AMD K6, PowerPC 601), and MOESI (AMD64, UltraSPARC). Update-based protocols update shared copies on a write; examples are the Dragon protocol and Firefly.

Slide 52: Cache Coherence Protocols (Cont.). Directory-based protocols. Memory-based schemes keep the directory at the granularity of a cache line in the home node's memory (one dirty bit and one presence bit per node); they incur storage overhead due to the directory; examples are Stanford DASH, Stanford FLASH, MIT Alewife, and SGI Origin. Cache-based schemes keep only a head pointer for each cache line in the home node's directory, with forward and backward pointers in the caches of each node; they suffer long latency due to the serialization of messages; examples are Sequent NUMA-Q, Convex Exemplar, and Data General.

Slide 53: Emulation Initiatives for Protocol Evaluation. RPM (mid-to-late 1990s): the Rapid Prototyping engine for Multiprocessors from the University of Southern California; ccNUMA full-system emulation; a SPARC IU/FPU core is used as the CPU in each node, and the rest (L1, L2, etc.) is implemented with 8 FPGAs; nodes are connected through Futurebus+.

Slide 54: FPGA Initiatives for Evaluation. Other cache emulators: RACFCS (1997), the Reconfigurable Address Collector and Flying Cache Simulator from Yonsei University in Korea, plugged into the Intel486 bus (passive collection); HACS (2002), the Hardware Accelerated Cache Simulator from Brigham Young University, plugged into the FSB of a Pentium Pro-based system; and ACE (2006), the Active Cache Emulator from Intel Corp., plugged into the FSB of a Pentium-III-based system.

Slide 55: Background (Cont.). Example (figure only).

Slide 56: Hybrid System Emulation, Experiment Setup (Cont.). Photo of the setup: the Intel server system holding a Pentium-III and the FPGA board, a logic analyzer, and a host PC connected via UART.

Slide 57: Experimental Setup (Cont.). Photo of the FPGA board: Xilinx Virtex-II FPGA, FSB interface, logic analyzer ports, and LEDs.

Slide 58: FSB Protocol, Snoop Stall. Waveform across the FSB pipeline stages (request1, request2, error1, error2, snoop, response, data): ADS# and the address A[35:3]# start a new transaction, and the snoop phase, where HIT# and HITM# are sampled, is stretched by snoop stalls.

Slide 59: FSB Protocol, Cache-to-Cache Transfer. Waveform across the FSB pipeline stages (request1, request2, error1, error2, snoop, response, data): after ADS# and the address A[35:3]#, HITM# signals a snoop hit on a modified line; TRDY# indicates the memory controller is ready to accept data; and the owning cache drives the four data chunks (data0 to data3) on D[63:0]# with DRDY# and DBSY# asserted.

Slide 60: Evaluation Methodology. Goal: measure the intrinsic delay of coherence traffic and evaluate its efficiency. Shortcomings of a multiprocessor environment: it is nearly impossible to isolate the impact of coherence traffic on system performance, and, even worse, there are non-deterministic factors such as arbitration delay and stalls in the pipelined bus (diagram: four MESI processors on a shared bus with a memory controller and main memory, exchanging data via cache-to-cache transfers).

Slide 61: Evaluation of Coherence Traffic Efficiency, Run-time Breakdown. Run-time estimation with the 256 KB cache in the FPGA:
  Invalidation traffic: latency 5~10 FSB cycles, estimated run-time 69~138 seconds.
  Cache-to-cache transfer: latency 10~20 FSB cycles, estimated run-time 381~762 seconds.
Estimated time = (average occurrences per second) x (average total execution time) x (clock period per FSB cycle) x (latency of each traffic type in FSB cycles). Note that the execution time increased by 171 seconds on average out of the average total baseline execution time of 5635 seconds; cache-to-cache transfer is therefore responsible for at least a 33-second (171 - 138) increase! Coherence traffic on the Pentium-III server system is NOT as efficient as main memory access.

Slide 62: Conclusion. Proposed a novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency. Coherence traffic in the Pentium-III-based Intel server system is not as efficient as expected; the main reason is that, in MESI, main memory must be updated at the same time as a cache-to-cache transfer. Opportunities for performance enhancement: for faster cache-to-cache transfer, cache-line buffers in the memory controller (as long as buffer space is available, the memory controller can accept data), and MOESI would help shorten the latency because main memory need not be updated on a cache-to-cache transfer; for faster invalidation traffic, advancing the snoop phase to an earlier stage.

Slide 63: HW/SW Co-Simulation, Motivation. Software simulation. Pros: flexible, observable, easy to implement. Cons: intolerable simulation time. Hardware emulation. Pros: significant speedup; concurrent execution. Cons: much less flexible and observable; low-level design takes longer to implement and validate.

Slide 64: Communication Details. All FSB signals are mapped to FPGA pins (Pentium-III with MESI and the Xilinx Virtex-II on the FSB). Software function arguments are encoded in the FSB address for the SimpleScalar example: the 4 KB communication page has its attribute set to write-through mode, the lower 12 bits of the FSB address (the page offset) are free to use, and the high 24 bits are used for TLB translation.
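A small sketch of the address-encoding idea: because the page is write-through, the address of every store to it appears on the FSB, so the free low 12 bits can carry an opcode and a small operand to the FPGA. The 4-bit/8-bit split and the names are illustrative assumptions, not the encoding used in the actual SimpleScalar experiment.

```c
#include <stdint.h>

/* Pack a 4-bit command and an 8-bit operand into the 12-bit page offset. */
static inline uintptr_t encode_offset(unsigned cmd, unsigned operand) {
    return ((uintptr_t)(cmd & 0xFu) << 8) | (operand & 0xFFu);
}

/* Issue a command: the store lands at page_base + offset; since the page is
 * write-through, the store shows up as an FSB write whose low 12 address
 * bits the FPGA decodes to recover cmd and operand. */
static inline void fpga_command(volatile uint8_t *page_base,
                                unsigned cmd, unsigned operand) {
    page_base[encode_offset(cmd, operand)] = 0;  /* stored value is unused;
                                                    the address carries the
                                                    information             */
}
```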

Slide 65: HW/SW Co-Simulation, Co-Simulation Results Analysis. FSB access is expensive: roughly 20 FSB cycles (about 160 CPU cycles) for each transfer; a full cache line (32 bytes) must be transferred for a cache-to-cache transfer; and the Pentium-III MESI implementation requires main memory to be updated on every cache-to-cache transfer. The mem_access_latency function is too simple: even in software simulation it takes at most a few dozen CPU cycles. Device driver overhead: the driver adds system overhead and requires one TLB entry that would otherwise be used by the simulation. Time-consuming software routines and a reasonable FPGA access frequency are needed to benefit from hardware implementation.

Slide 66: Conclusions. Proposed a new co-simulation methodology. Preliminary co-simulation using SimpleScalar demonstrates the correctness of the methodology: hardware/software implementation, communication between the Pentium-III and the FPGA via the FSB, and a Linux driver. The co-simulation results indicate that FSB access is expensive, the Linux driver overhead also needs to be overcome, and time-consuming blocks are the ones that should be emulated. Multi-core co-simulation would benefit from the FPGA: implement the distributed low-level caches and the interconnection network, which would be complex enough to benefit from hardware modeling.

