CS 203 A Lecture 16: Review for Test 2

Slide 1: CS 203 A Lecture 16: Review for Test 2

Slide 2: Project and Test 2

Project: think about what you can change in the CPU or cache architecture to speed up execution of network applications. Modify that part in SimpleScalar, rerun your applications, and compare with your results from Project 1.

Test 2: 40 points in 80 minutes, i.e. about 2 minutes per point; this gives you an idea of how much time to spend on each question. The test has 4 questions with several parts, and the points are noted. Answer precisely and briefly.

Slide 3: Minimizing Stalls Technique 1: Compiler Optimization

Latencies between a producing and a using instruction:

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

Swap BNEZ and SD by changing the address of SD; the scheduled loop runs in 6 clocks:

    1  Loop: LD    F0,0(R1)
    2        SUBI  R1,R1,8
    3        ADDD  F4,F0,F2
    4        stall
    5        BNEZ  R1,Loop    ; delayed branch
    6        SD    8(R1),F4   ; address altered from 0(R1) to 8(R1) when moved past SUBI
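For orientation, here is a minimal C sketch of the kind of source loop this assembly implements, assuming R1 walks a double array x from the top down and F2 holds a scalar s (the function and names are invented for illustration):

    /* One LD, one ADDD, one SD per element; SUBI steps the pointer by 8. */
    void scale_add(double *x, long n, double s) {
        for (long i = n - 1; i >= 0; i--)
            x[i] = x[i] + s;
    }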

Slide 4: Compiler Technique 2: Loop Unrolling

    1  Loop: LD    F0,0(R1)
    2        ADDD  F4,F0,F2     ; 1 cycle delay *
    3        SD    0(R1),F4     ; drop SUBI & BNEZ; 2 cycles delay *
    4        LD    F6,-8(R1)
    5        ADDD  F8,F6,F2     ; 1 cycle delay
    6        SD    -8(R1),F8    ; drop SUBI & BNEZ; 2 cycles delay
    7        LD    F10,-16(R1)
    8        ADDD  F12,F10,F2   ; 1 cycle delay
    9        SD    -16(R1),F12  ; drop SUBI & BNEZ; 2 cycles delay
    10       LD    F14,-24(R1)
    11       ADDD  F16,F14,F2   ; 1 cycle delay
    12       SD    -24(R1),F16  ; 2 cycles delay
    13       SUBI  R1,R1,#32    ; altered to 4*8; 1 cycle delay
    14       BNEZ  R1,LOOP      ; delayed branch
    15       NOP

* 1 cycle delay for an FP operation after a load; 2 cycles delay for a store after an FP op; 1 cycle after SUBI. Total: 15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration.

Loop unrolling is essential for ILP processors. Why? But it increases code size and the number of registers needed.
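The same hypothetical loop unrolled by 4 in C, mirroring the schedule above; as on the slide (which steps by 4*8 bytes), it assumes the trip count is a multiple of 4:

    /* Four load/add/store groups per trip, one decrement, one branch. */
    void scale_add_unrolled(double *x, long n, double s) {
        for (long i = n - 1; i >= 3; i -= 4) {
            x[i]     += s;
            x[i - 1] += s;
            x[i - 2] += s;
            x[i - 3] += s;
        }
    }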

Slide 5: Minimize Stall + Loop Unrolling

What assumptions were made when the code was moved?
- OK to move the store past SUBI even though SUBI changes the register
- OK to move loads before stores: do we still get the right data?
- When is it safe for the compiler to make such changes? (see the sketch after the code)

    1  Loop: LD    F0,0(R1)
    2        LD    F6,-8(R1)
    3        LD    F10,-16(R1)
    4        LD    F14,-24(R1)
    5        ADDD  F4,F0,F2
    6        ADDD  F8,F6,F2
    7        ADDD  F12,F10,F2
    8        ADDD  F16,F14,F2
    9        SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,#32
    13       BNEZ  R1,LOOP      ; delayed branch
    14       SD    8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration.
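On the safety question: moving loads before stores is only legal if the compiler can prove the accessed locations are independent. A hedged C illustration (the function is invented); C99's restrict qualifier is one way the programmer can assert non-overlap:

    /* If dst and src overlap, hoisting later loads of src above earlier
     * stores to dst changes the result; restrict promises they don't. */
    void add_one(double *restrict dst, const double *restrict src, long n) {
        for (long i = 0; i < n; i++)
            dst[i] = src[i] + 1.0;   /* freely reorderable across iterations */
    }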

Slide 6: Very Long Instruction Word (VLIW) Architectures

A wide-issue processor that relies on the compiler to:
- Pack independent instructions together to be issued in parallel
- Schedule code to minimize hazards and stalls

Very long instruction words (3 to 8 operations):
- Can be issued in parallel without checks
- If the compiler cannot find independent operations, it inserts nops

Advantage: simpler HW for wide issue
- Faster clock cycle
- Lower design & verification cost

Disadvantages:
- Code size
- Requires aggressive compilation technology
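To make the word format concrete, a minimal C sketch of a 3-slot VLIW word; the slot layout and NOP encoding are invented for illustration and do not correspond to any real ISA:

    #include <stdint.h>

    /* One VLIW instruction word with fixed slots; the compiler fills
     * each slot with an independent operation or a NOP. The hardware
     * issues all slots in parallel with no cross-checks. */
    typedef struct {
        uint32_t int_op;   /* integer ALU slot */
        uint32_t mem_op;   /* load/store slot  */
        uint32_t fp_op;    /* FP ALU slot      */
    } vliw_word;

    #define NOP 0x00000000u  /* padding; this is where code bloat comes from */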

Slide 7: VLIW and Superscalar

- Sequential stream of long instruction words
- Instructions scheduled statically by the compiler
- The number of simultaneously issued instructions is fixed at compile time
- Instruction issue is less complicated than in a superscalar processor
- Disadvantage: VLIW processors cannot react to dynamic events, e.g. cache misses, with the same flexibility as superscalars.
- The number of operations in a VLIW instruction word is usually fixed. Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be filled, which increases code size. More recent VLIW architectures use a denser code format that allows the no-ops to be removed.
- VLIW is an architectural technique, whereas superscalar is a microarchitecture technique.
- VLIW processors take advantage of spatial parallelism.

Slide 8: Multithreading

How can we guarantee no dependencies between instructions in a pipeline?
- One way is to interleave execution of instructions from different program threads on the same pipeline: micro context switching

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

    T1: LW   r1, 0(r2)
    T2: ADD  r7, r1, r4
    T3: XORI r5, r4, #12
    T4: SW   0(r7), r5
    T1: LW   r5, 12(r1)
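A minimal C sketch of the round-robin thread selection this implies (types and names are invented): with 4 threads rotating on a 5-stage non-bypassed pipe, consecutive instructions from the same thread are 4 cycles apart, so the earlier one has written back before the later one reads its registers.

    #include <stdint.h>

    typedef struct { uint32_t pc; /* plus per-thread registers, etc. */ } thread_ctx;

    /* Fetch stage: pick one thread per cycle in round-robin order. */
    uint32_t fetch_next(thread_ctx *t, int nthreads, int *rr) {
        uint32_t pc = t[*rr].pc;      /* this cycle's thread        */
        t[*rr].pc += 4;               /* advance only that thread's PC */
        *rr = (*rr + 1) % nthreads;   /* rotate for the next cycle  */
        return pc;
    }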

Slide 9: Simple Multithreaded Pipeline

The thread select must be carried down the pipeline to ensure the correct per-thread state bits are read/written at each pipe stage.

Slide 10: Comparison of Issue Capabilities

[Figure: comparison of issue capabilities. Courtesy of Susan Eggers; used with permission.]

Slide 11: From Superscalar to SMT

Small items:
- per-thread program counters
- per-thread return stacks
- per-thread bookkeeping for instruction retirement, trap & instruction dispatch queue flush
- thread identifiers, e.g. with BTB & TLB entries

Slide 12: Typical NP Architecture

[Block diagram: multi-threaded processing elements and a co-processor on the network processor bus, SDRAM as the packet buffer, SRAM holding the routing table, plus input ports and output ports.]

Slide 13: Why Network Processors

Current situation:
- Data rates are increasing
- Protocols are becoming more dynamic and sophisticated
- Protocols are being introduced more rapidly

Processing elements:
- GP (general-purpose processor): programmable, but not optimized for networking applications
- ASIC (application-specific integrated circuit): high processing capacity, but long time to develop and lacks flexibility
- NP (network processor): achieves high processing performance, offers programming flexibility, cheaper than a GP

Slide 14: IXP1200 Block Diagram

- StrongARM processing core
- Microengines introduce a new ISA
- I/O: PCI, SDRAM, SRAM, IX (a PCI-like packet bus)
- On-chip FIFOs: 16 entries, 64B each

Slide 15: IXP1200 Microengine

- 4 hardware contexts: single-issue processor; explicit, optional context switch on SRAM access
- Registers: all single-ported; separate GPRs; 256 x 6 = 1536 registers total
- 32-bit ALU: can access GPRs or XFER registers
- Shared hash unit: 1/2/3 values, 48b/64b; for IP routing hashing
- Standard 5-stage pipeline
- 4KB SRAM instruction store (not a cache!)
- Barrel shifter

Ref: [NPT]

Slide 16: IXP2400 Block Diagram

[Block diagram: 8 MEv2 microengines (MEv2 1-8), Intel XScale core with 32K I-cache and 32K D-cache, Rbuf and Tbuf (64 entries @ 128B each), hash unit (64/48/128), 16KB scratchpad, CSRs (Fast_wr, UART, timers, GPIO, BootROM/slow port), 2 QDR SRAM channels, a DDRAM channel, 64-bit/66 MHz PCI, and an SPI-3 or CSIX media interface.]

Slide 17: IXP2800 Block Diagram

[Block diagram: 16 MEv2 microengines (MEv2 1-16), Intel XScale core with 32K I-cache and 32K D-cache, Rbuf and Tbuf (64 entries @ 128B each), hash unit (48/64/128), 16KB scratchpad, CSRs (Fast_wr, UART, timers, GPIO, BootROM/slow port), 4 QDR SRAM channels, 3 RDRAM channels, 64-bit/66 MHz PCI, and an SPI-4 or CSIX media interface with stripe unit.]

Slide 18: Memory Hierarchy

Goal: the illusion of a large, fast, cheap memory. Fact: large memories are slow, and fast memories are small. How do we create a memory that is large, cheap, and fast (most of the time)? Use a hierarchy of levels:
- Smaller, faster memory technologies close to the processor
- Fast access time in the highest level of the hierarchy
- Cheap, slow memory furthest from the processor

The aim of memory hierarchy design is an access time close to that of the highest level and a size equal to that of the lowest level.

Slide 19: Introduction to Caches

A cache:
- is a small, very fast memory (SRAM, expensive)
- contains copies of the most recently accessed memory locations (data and instructions): temporal locality
- is fully managed by hardware (unlike virtual memory)
- organizes storage in blocks of contiguous memory locations: spatial locality
- the unit of transfer to/from main memory (or L2) is the cache block

General structure:
- n blocks per cache, organized in s sets
- b bytes per block
- total cache size: n x b bytes

Slide 20: Cache Organization

(1) How do you know if something is in the cache? (2) If it is in the cache, how do you find it? The answers to (1) and (2) depend on the type, or organization, of the cache:
- Direct mapped: each memory address is associated with one possible block within the cache, so we only need to look in a single location for the data if it exists in the cache
- Fully associative: a block can be placed anywhere, but the design is complex
- N-way set associative: N cache blocks for each cache index, like having N direct-mapped caches operating in parallel
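A hedged C sketch of the direct-mapped answer to (1) and (2): split the address into offset, index, and tag. The sizes (32-byte blocks, 64 sets, so a 2 KB cache) are illustrative assumptions, not from the slides.

    #include <stdint.h>

    #define BLOCK_BITS 5   /* 32-byte blocks */
    #define INDEX_BITS 6   /* 64 sets        */

    uint32_t offset_of(uint32_t a) { return a & ((1u << BLOCK_BITS) - 1); }
    uint32_t index_of(uint32_t a)  { return (a >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); }
    uint32_t tag_of(uint32_t a)    { return a >> (BLOCK_BITS + INDEX_BITS); }

    /* A hit means: valid[index_of(a)] && tags[index_of(a)] == tag_of(a). */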

Slide 21: Review: Four Questions for Memory Hierarchy Designers

- Q1: Where can a block be placed in the upper level? (Block placement): fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the upper level? (Block identification): tag/block
- Q3: Which block should be replaced on a miss? (Block replacement): random, LRU
- Q4: What happens on a write? (Write strategy): write back or write through (with a write buffer)

Slide 22: Review: Cache Performance

    CPUtime = IC x (CPI_execution + Mem accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

    Misses per instruction = Memory accesses per instruction x Miss rate

    CPUtime = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time

To improve cache performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache
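To see the formula in use, a small C helper; the example numbers (1.3 memory accesses per instruction, 2% miss rate, 50-cycle penalty) are invented, not course data:

    /* CPUtime = IC x (CPI_execution + accesses/instr x miss rate x
     * miss penalty) x clock cycle time, exactly as on the slide. */
    double cpu_time(double ic, double cpi_exec, double acc_per_instr,
                    double miss_rate, double miss_penalty, double cycle_ns) {
        double mem_stall_cpi = acc_per_instr * miss_rate * miss_penalty;
        return ic * (cpi_exec + mem_stall_cpi) * cycle_ns;  /* nanoseconds */
    }

    /* Example: cpu_time(1e9, 1.0, 1.3, 0.02, 50, 0.5) for a 2 GHz clock. */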

Slide 23: Where do misses come from?

Classifying misses: the 3 Cs
- Compulsory: the first access to a block cannot be in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)

A 4th "C":
- Coherence: misses caused by cache coherence.

Slide 24: Add a Second-Level Cache

L2 equations:

    AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
    Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
    AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU
- The global miss rate is what matters
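The same equations as a C helper; the example numbers below are invented to show local vs. global miss rates:

    /* AMAT with an L2: the L1 miss penalty is itself an AMAT into L2. */
    double amat_l2(double hit_l1, double mr_l1,
                   double hit_l2, double mr_l2_local, double penalty_l2) {
        double miss_penalty_l1 = hit_l2 + mr_l2_local * penalty_l2;
        return hit_l1 + mr_l1 * miss_penalty_l1;
    }

    /* Global L2 miss rate = mr_l1 x mr_l2_local: e.g. a 4% L1 miss rate
     * with a 50% local L2 miss rate is a 2% global miss rate. */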

Slide 25: Cache Optimization Summary

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, - hurts)

    Technique                           MR  MP  HT  Complexity
    Larger Block Size                   +   -       0
    Higher Associativity                +       -   1
    Victim Caches                       +           2
    Pseudo-Associative Caches           +           2
    HW Prefetching of Instr/Data        +           2
    Compiler Controlled Prefetching     +           3
    Compiler Reduce Misses              +           0
    Priority to Read Misses                 +       1
    Subblock Placement                      +   +   1
    Early Restart & Critical Word 1st       +       2
    Non-Blocking Caches                     +       3
    Second Level Caches                     +       2

Slide 26: Main Memory Background

- Random access memory (vs. serial access memory)
- Different flavors at different levels: physical makeup (CMOS, DRAM); low-level architectures (FPM, EDO, BEDO, SDRAM)
- Caches use SRAM (static random access memory): no refresh (6 transistors/bit vs. 1 transistor/bit); size DRAM/SRAM is about 4-8x; cost and cycle time SRAM/DRAM is about 8-16x
- Main memory is DRAM (dynamic random access memory): dynamic since it needs to be refreshed periodically; addresses are divided into 2 halves (memory as a 2D matrix): RAS (row access strobe) and CAS (column access strobe)

Slide 27: Main Memory Organizations

[Diagram: three organizations of CPU, cache, bus, and memory: (1) one-word-wide memory; (2) wide memory, with a multiplexor between the CPU and cache; (3) interleaved memory with four banks (bank 0 through bank 3).]

DRAM access time >> bus transfer time.

Slide 28: Virtual Memory

- Idea 1: many programs share DRAM memory, so that context switches can occur
- Idea 2: allow a program to be written without memory constraints; the program can exceed the size of main memory
- Idea 3: relocation: parts of the program can be placed at different locations in memory instead of one big chunk

Virtual memory: (1) DRAM holds many programs running at the same time (processes); (2) DRAM is used as a kind of "cache" for disk.

Slide 29: Mapping Virtual to Physical Address

With a 1KB page size, the page offset is bits 9..0 and passes through unchanged; the virtual page number (virtual address bits 31..10) is translated into a physical page number (physical address bits 29..10).

    Virtual address:   | virtual page number (31..10)  | page offset (9..0) |
                                      | translation
                                      v
    Physical address:  | physical page number (29..10) | page offset (9..0) |
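The translation arithmetic as a C sketch; the flat page_table array is an illustrative stand-in for the real page-table structure:

    #include <stdint.h>

    #define PAGE_BITS 10  /* 1 KB pages, as on this slide */

    uint32_t translate(uint32_t vaddr, const uint32_t *page_table) {
        uint32_t vpn = vaddr >> PAGE_BITS;               /* bits 31..10 */
        uint32_t off = vaddr & ((1u << PAGE_BITS) - 1);  /* bits  9..0  */
        return (page_table[vpn] << PAGE_BITS) | off;     /* PPN | offset */
    }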

Slide 30: How to Translate Fast?

Problem: virtual memory requires two memory accesses!
- one to translate the virtual address into a physical address (page table lookup)
- one to transfer the actual data (cache hit)
- But the page table is in physical memory => 2 main memory accesses!

Observation: since there is locality in pages of data, there must be locality in the virtual addresses of those pages! Why not create a cache of virtual-to-physical address translations to make translation fast? (Smaller is faster.) For historical reasons, such a "page table cache" is called a translation lookaside buffer, or TLB.

Slide 31: Translation Look-Aside Buffers

The TLB is usually small, typically 32-4,096 entries. Like any other cache, the TLB can be fully associative, set associative, or direct mapped.

[Diagram: the processor sends a virtual address to the TLB; a TLB hit yields the physical address for the cache, where a hit returns data and a miss goes to main memory; a TLB miss goes to the page table in main memory; a page fault or protection violation traps to the OS fault handler and disk.]
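A minimal C sketch of this hit/miss flow with a direct-mapped TLB; the entry count, page size, and flat page table are illustrative assumptions:

    #include <stdint.h>

    #define PBITS 10         /* 1 KB pages, as on slide 29 */
    #define TLB_ENTRIES 64   /* illustrative size */

    typedef struct { uint32_t vpn, ppn; int valid; } tlb_entry;

    uint32_t tlb_translate(uint32_t vaddr, tlb_entry *tlb, const uint32_t *pt) {
        uint32_t vpn = vaddr >> PBITS, off = vaddr & ((1u << PBITS) - 1);
        tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
        if (!(e->valid && e->vpn == vpn)) {      /* TLB miss: walk page table */
            e->vpn = vpn; e->ppn = pt[vpn]; e->valid = 1;
        }
        return (e->ppn << PBITS) | off;          /* hit path: no table access */
    }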

