Cache Physical Implementation. Panayiotis Charalambous, Xi Research Group
Contents
Cache Logical View
Physical View
Case Study – Power 4 L2 Cache
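Logical Cache Structure
n-way associative cache: n elements per set, $2^{m}$ sets
Address (32 bits): Tag ($32 - m - k$ bits) | Index ($m$ bits) | Offset ($k$ bits)
[Figure: the stored tag of each way is compared (=) with the address tag; the comparator outputs are ORed to form Hit and select the Data output]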
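The address split above can be sketched in a few lines. This is only an illustration: the function name `split_address` and the example values m = 7, k = 5 are assumptions, not taken from the slides.

```python
def split_address(addr: int, m: int, k: int, addr_bits: int = 32):
    """Split an address into (tag, index, offset).

    m is the number of index bits (2**m sets) and k the number of
    offset bits; the tag takes the remaining addr_bits - m - k bits.
    """
    if not 0 <= addr < (1 << addr_bits):
        raise ValueError("address does not fit in addr_bits")
    offset = addr & ((1 << k) - 1)         # lowest k bits
    index = (addr >> k) & ((1 << m) - 1)   # next m bits
    tag = addr >> (k + m)                  # everything above
    return tag, index, offset


# Hypothetical example: 128 sets (m = 7) and 32-byte blocks (k = 5).
tag, index, offset = split_address(0x12345678, m=7, k=5)
print(hex(tag), index, offset)  # prints: 0x12345 51 24
```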
Cache Structure
Cache Access Steps
1. Decode address
2. Enable the word line
3. Raise the bit lines to high
4. Get the tag value from the tag array
5. Check for tag match
6. Select data output
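A behavioural sketch of these six steps for a single tag/data array pair follows. It is a software analogy, not a circuit model: the function name `access` and the list-of-lists layout are assumptions, and the bit-line precharge of step 3 has no software counterpart, so it appears only as a comment.

```python
def access(tag_array, data_array, tag, index):
    """Walk through the access steps for one cache array.

    tag_array[s][w] is the tag stored in way w of set s (None if invalid);
    data_array[s][w] is the corresponding block. Returns the block on a
    hit and None on a miss.
    """
    num_sets = len(tag_array)
    # 1. Decode address: the index becomes a one-hot row select.
    word_lines = [row == index for row in range(num_sets)]
    # 2. Enable the word line: exactly one row is active.
    row = word_lines.index(True)
    # 3. Raise the bit lines to high: an electrical precharge step,
    #    not modelled here.
    # 4. Get the tag values of every way in the selected row.
    stored_tags = tag_array[row]
    # 5. Check for a tag match in each way.
    matches = [t is not None and t == tag for t in stored_tags]
    # 6. Select the data output of the matching way, if any.
    for way, hit in enumerate(matches):
        if hit:
            return data_array[row][way]
    return None


tags = [[None, None], [5, 9]]
data = [["a", "b"], ["c", "d"]]
print(access(tags, data, tag=9, index=1))  # prints: d
```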
Conventional Cache Organization Memory Cell
Read: Set bit and bit´ high. If the value in the cell is 1, then bit´ is discharged. If the value is 0, then bit is discharged.
Write: Set bit´ to 0. This forces a 1 into the latch. (Setting bit to 0 instead forces a 0.)
Decoder with Driver
Various Components
Comparator: XOR logic
Multiplexer: hierarchy for the offset. First select the block (from the output driver), then the word, then the byte
Output Driver: at most one input bit is high. If an input is 0, the corresponding output is high-impedance
[Figure: output driver with inputs I0, I1, …, I7]
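The comparator and output driver can be mimicked in software as follows. This is a rough sketch: the names `tags_equal` and `output_driver` are hypothetical, and a high-impedance output is represented by `None`.

```python
def tags_equal(a: int, b: int) -> bool:
    """XOR comparator: two tags match when no bit position differs."""
    return (a ^ b) == 0


def output_driver(inputs, values):
    """One-hot output driver.

    inputs is a sequence of 0/1 select bits (I0, I1, ...), of which at
    most one may be high; values holds the data behind each input.
    Returns the selected value, or None (standing in for a
    high-impedance output) when every input is 0.
    """
    if sum(inputs) > 1:
        raise ValueError("at most one input bit may be high")
    for select, value in zip(inputs, values):
        if select:
            return value
    return None


print(tags_equal(0b1011, 0b1011))                 # prints: True
print(output_driver([0, 0, 1, 0], "abcd"))        # prints: c
print(output_driver([0, 0, 0, 0], "abcd"))        # prints: None
```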
Banking
Idea: support multiple simultaneous cache accesses
Solutions:
Use multiported bit cells (high cost)
Divide the cache into independent banks
Cache Search
Steps:
1. Find Bank (bank index)
2. Find Set in Bank (index)
3. Check if data is valid and in the cache (tag match)
4. If all ok return data (block and byte offset), else check lower level memory
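The search steps can be sketched as a toy banked cache. Everything beyond the four steps is an assumption made for illustration: the class name `BankedCache`, placing the bank index above the set index in the address, and the trivial replacement choice in `fill`.

```python
class BankedCache:
    """Toy banked, set-associative cache; each way is (tag, block) or None."""

    def __init__(self, num_banks, sets_per_bank, ways, block_bytes):
        self.num_banks = num_banks
        self.sets_per_bank = sets_per_bank
        self.block_bytes = block_bytes
        self.banks = [[[None] * ways for _ in range(sets_per_bank)]
                      for _ in range(num_banks)]

    def fields(self, addr):
        """Return (tag, bank, set_index, offset) for an address.

        Assumed layout, high to low bits: tag | bank | set index | offset.
        """
        offset = addr % self.block_bytes
        block_number = addr // self.block_bytes
        set_index = block_number % self.sets_per_bank
        bank = (block_number // self.sets_per_bank) % self.num_banks
        tag = block_number // (self.sets_per_bank * self.num_banks)
        return tag, bank, set_index, offset

    def fill(self, addr, block):
        """Install a block; uses a free way if any, else overwrites way 0."""
        tag, bank, set_index, _ = self.fields(addr)
        ways = self.banks[bank][set_index]
        victim = ways.index(None) if None in ways else 0
        ways[victim] = (tag, block)

    def lookup(self, addr):
        tag, bank, set_index, offset = self.fields(addr)  # steps 1 and 2
        for entry in self.banks[bank][set_index]:
            if entry is not None and entry[0] == tag:     # step 3
                return entry[1][offset]                   # step 4: hit
        return None  # miss: the lower memory level would be consulted


cache = BankedCache(num_banks=8, sets_per_bank=64, ways=8, block_bytes=128)
cache.fill(0x12345, bytes(range(128)))
print(cache.lookup(0x12345))                 # prints: 69
print(cache.lookup(0x12345 + 128 * 64 * 8))  # prints: None
```

Returning `None` on a miss keeps the sketch short; a fuller model would fetch the block from the next level and call `fill`.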
Case Study - Power 4
Dual-core 64-bit processors
32 KB L1 D-cache (per processor): 2-way associative, 128-byte line
64 KB L1 I-cache (per processor): direct mapped, 128-byte line (4 sectors x 32 B)
~1.5 MB L2 cache: 8-way set associative, 128-byte line
Power4 Floorplan
Power4 L2 Logical View
Cache split into 3 parts, 0.5 MB each
Controlled by 4 coherency processors
One 64 B store queue per processor
Power4 L2U
~512 KB, 8 banks, 128 B block size, 8-way associative
[Figure labels: word lines, bit lines, decoders, address bus]
Power4
L2 Cache Block Size C = 512 KB = $2^{19}$ B
Block Size = 128 B = $2^{7}$ B
8-way associative
8 Banks per Cache Block
Therefore:
Set Size is $2^{3} \times 2^{7}$ B = $2^{10}$ B
Sets in Cache are $2^{19} / 2^{10} = 2^{9}$ sets
Sets per Bank are $2^{9} / 2^{3} = 2^{6}$ sets
Address (64-bit): tag | index (bank index, set index) | offset
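The arithmetic above can be checked with a short script. The field widths it derives (7 offset bits, 6 set-index bits, 3 bank-index bits) follow from the numbers on the slide. The 48-bit tag assumes that all 64 address bits take part in the lookup, which the slide does not confirm. The function names are hypothetical.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two and reject anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def cache_geometry(capacity, block, ways, banks, addr_bits):
    """Derive set counts and address field widths for a banked cache."""
    set_bytes = ways * block               # bytes held by one set
    num_sets = capacity // set_bytes       # sets in the whole cache block
    sets_per_bank = num_sets // banks
    offset_bits = log2_exact(block)
    set_index_bits = log2_exact(sets_per_bank)
    bank_index_bits = log2_exact(banks)
    # Assumption: every remaining address bit belongs to the tag.
    tag_bits = addr_bits - offset_bits - set_index_bits - bank_index_bits
    return {
        "set_bytes": set_bytes,
        "num_sets": num_sets,
        "sets_per_bank": sets_per_bank,
        "offset_bits": offset_bits,
        "set_index_bits": set_index_bits,
        "bank_index_bits": bank_index_bits,
        "tag_bits": tag_bits,
    }


geom = cache_geometry(capacity=512 * 1024, block=128, ways=8, banks=8,
                      addr_bits=64)
for name, value in geom.items():
    print(f"{name}: {value}")
```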
Power4: CACTI Results
Cache Parameters                      Run 1      Run 2
Number of Subbanks:                   8          16
Total Cache Size:
Size in bytes of Subbank:
Number of sets:                       64         32
Associativity:                        8          8
Block Size (bytes):                   128        128
Read/Write Ports:                     1          1
Read Ports:                           0          0
Write Ports:                          0          0
Technology Size:                      0.80um     0.80um
Vdd:                                  4.5V       4.5V
Access Time (ns):
Cycle Time (wave pipelined) (ns):
Total Power all Banks (nJ):
Total Power Without Routing (nJ):
Total Routing Power (nJ):
Maximum Bank Power (nJ):
Best Ndwl (L1):                       16         16
Best Ndbl (L1):                       1          1
Best Nspd (L1):                       1          1
Best Ntwl (L1):                       1          1
Best Ntbl (L1):                       1          1
Best Ntspd (L1):                      1          1
Nor inputs (data):                    2          2
Nor inputs (tag):                     2          2
CACTI
Data Array
Ndwl: Word line split factor
Ndbl: Bit line split factor
Nspd: Number of sets mapped to a single word line (sectors)
Tag Array
Ntwl: Word line split factor
Ntbl: Bit line split factor
Ntspd: Number of sets mapped to a single word line (sectors)
Increasing Ndbl, Nspd, Ntbl, or Ntspd requires more sense amplifiers
Increasing Ndwl or Ntwl increases the number of word line drivers
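To show how the data-array parameters shape the layout, the sketch below computes the size of one data subarray. The two formulas are the subarray row and column expressions as recalled from the original CACTI report, so treat them as an assumption rather than a statement of what the tool computes. The function name and the 64 KB subbank size (64 sets x 8 ways x 128 B) are also assumptions derived from the Power4 numbers, not values read from the CACTI output.

```python
def data_subarray_shape(capacity, block, ways, ndwl, ndbl, nspd):
    """Return (rows, columns, subarray count) for the data array.

    capacity and block are in bytes. Assumed formulas:
      rows    = capacity / (block * ways * Ndbl * Nspd)
      columns = 8 * block * ways * Nspd / Ndwl
    """
    rows = capacity // (block * ways * ndbl * nspd)
    columns = (8 * block * ways * nspd) // ndwl
    subarrays = ndwl * ndbl
    return rows, columns, subarrays


# Assumed subbank: 64 sets x 8 ways x 128 B = 65536 B, with the reported
# best organisation Ndwl = 16, Ndbl = 1, Nspd = 1.
rows, columns, subarrays = data_subarray_shape(
    capacity=64 * 8 * 128, block=128, ways=8, ndwl=16, ndbl=1, nspd=1)
print(rows, columns, subarrays)  # prints: 64 512 16
```

Under these assumptions, splitting the word line 16 ways turns one 8192-bit row into 16 subarrays of 512 columns each, which is where the extra word line drivers come from.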
Thank You