IXP Lab 2012: Part 3 Programming Tips

Outline
  Memory Independent Techniques
    – Instruction Selection
    – Task Partition
  Memory Dependent Techniques
    – Reducing Overhead
        Reduce the number of memory accesses
        Reduce the average access latency
    – Hiding Overhead

Memory Independent Techniques
  Instruction Selection
    – General Coding Skill
    – Use Hardware Instructions
  Task Partition
    – Multi-Processing
    – Context-Pipelining

General Coding Skill
  Remove loops (unroll where possible)
  Shift operations
    – Avoid multiply and divide where a shift will do
  Inline functions
    – __inline & __forceinline
  Branch prediction
    – Be aware of the branch-misprediction penalty
  (A short sketch of the shift and inline idioms follows below.)
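As a rough illustration of the shift and inline points (a minimal sketch in plain C, not taken from the slides; avg_of_8 and scale_by_16 are made-up names, and __inline is the compiler keyword listed above):

  /* Shift instead of divide/multiply by a power of two, wrapped in
     __inline helpers so the call overhead disappears. Names are
     illustrative only. */
  static __inline unsigned int avg_of_8(unsigned int sum)
  {
      return sum >> 3;        /* same as sum / 8, but without a divide */
  }

  static __inline unsigned int scale_by_16(unsigned int x)
  {
      return x << 4;          /* same as x * 16, but without a multiply */
  }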

Hardware Instructions
  POP_COUNT
  FFS
  Multiply
  CRC
  Hashing
  CAM

POP_COUNT --Brief
  Population Count
  Reports the number of bits set in a 32-bit register
  3 cycles latency
  Example:
    – pop_count( 0x3121 ) = ?
    – 0x3121 = 0011 0001 0010 0001 (binary)
    – Result = 5

POP_COUNT --Naïve Implementation

unsigned int pop_count_for(unsigned int x)
{
    unsigned int y = 0;
    unsigned int i;
    for (i = 0; i < 32; i++) {
        if ((x & 1) == 1)
            y++;
        x = x >> 1;
    }
    return y;
}

POP_COUNT --Faster Implementation

unsigned int pop_count_agg(unsigned int x)
{
    x -= ((x >> 1) & 0x55555555);
    x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
    x = (((x >> 4) + x) & 0x0f0f0f0f);
    x += (x >> 8);
    x += (x >> 16);
    return (x & 0x0000003f);
}

Reference

POP_COUNT --Hardware Instruction

unsigned int pop_count_hardware(unsigned int x)
{
    return pop_count(x);
}

POP_COUNT --Additional Information
  POP_COUNT is used by the Bitmap-RFC algorithm (Liu, TECS 2008)

FFS
  Finds the first (least significant) set bit in the data and returns its position
  Example (0-based bit positions):
    – ffs( 0x3121 ) = 0
    – ffs( 0x3120 ) = 5
    – ffs( 0x3100 ) = 8
  (A software sketch of the same operation follows below.)
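For reference, a plain-C sketch of what FFS computes (assuming the 0-based convention used in the example above; on the microengine the single FFS instruction replaces this whole loop):

  /* Software equivalent of FFS: 0-based position of the least significant
     set bit; returns 32 when no bit is set. */
  unsigned int ffs_sw(unsigned int x)
  {
      unsigned int pos;
      for (pos = 0; pos < 32; pos++) {
          if (x & (1u << pos))
              return pos;
      }
      return 32;
  }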

Multiply
  Specific multiply instructions
    – multiply_24x8()
    – multiply_16x16()
    – multiply_32x32_hi()
    – multiply_32x32_lo()
  (See the 64-bit product sketch below.)
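A minimal sketch of using the 32x32 pair above to form a full 64-bit product (variable names are illustrative; this assumes the intrinsics return the low and high 32 bits of the product, as their names suggest):

  unsigned int a, b;
  unsigned int prod_lo, prod_hi;

  prod_lo = multiply_32x32_lo(a, b);   /* low  32 bits of a * b */
  prod_hi = multiply_32x32_hi(a, b);   /* high 32 bits of a * b */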

CRC
  14 cycles latency
  Example of a CRC operation:

    crc_write( 0x );
    crc_32_be( source_address, bytes_0_3 );
    crc_32_be( dest_address, bytes_0_3 );
    ...
    Cache_index = crc_read();

Hash
  hash_48()
  hash_64()
  hash_128()
  Example:

    SIGNAL sig_hash;
    hash_48( data_out, data_in, count, sig_done, &sig_hash );
    __wait_for_all( &sig_hash );

CAM --Brief
  Content Addressable Memory
  Each ME has 16 32-bit CAM entries
  The CAM is private to each ME (not visible to other MEs)
  On a lookup, all entries are searched in parallel
  On a successful lookup, the index of the matched entry is returned
  Otherwise, the index of the entry to be replaced (the least recently used entry) is returned

CAM --Structure
  (Figure: layout of the cam_lookup_t lookup-result type)

CAM --Usage

cam_lookup_t cam_result;

cam_result = cam_lookup( data );
if( cam_result.hit == 1 )
{
    /* access entry cam_result.entry_num */
    ...
}
else
{
    ...
    cam_write( cam_result.entry_num, data, 15 );
}

Task Partition
  Multi-Processing
    – More computing power
    – Easy to implement
  Context-Pipelining
    – More usable resources per stage
    – Hard to balance the load between stages

Memory Dependent Techniques --Reducing Overhead
  Reduce the number of memory accesses
    – Wide-word accesses
    – Result caches
  Reduce the average access latency
    – Multi-level memory hierarchy
    – Data cache

Wide-Word Accesses --Brief
  Fetch the needed data in batches
  Reduces the number of accesses required
  Useful when the data is stored contiguously
  (Figure: eight contiguous longwords at MEM_ADDR+0, +4, +8, ..., +28)

Wide-Word Accesses --Usage (One Node per Access)

__declspec(sram_read_reg) UINT32 A;
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*4), 1, sig_done, &sig_read );
__wait_for_all( &sig_read );
/* access A */

Result: 8 accesses are needed

Wide-Word Accesses --Usage (Two Nodes per Access)

__declspec(sram_read_reg) UINT32 A[2];
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*8), 2, sig_done, &sig_read );
__wait_for_all( &sig_read );
/* access A */

Result: 4 accesses are needed

Wide-Word Accesses --Usage (Four Nodes per Access)

__declspec(sram_read_reg) UINT32 A[4];
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*16), 4, sig_done, &sig_read );
__wait_for_all( &sig_read );
/* access A */

Result: 2 accesses are needed

Wide-Word Accesses --Experiment
  Platform: IXP2800
  Total accesses: 8 LW (8 * 4 bytes)

  Case            | Total Cycles | Average Cycles / LW
  1 LW * 8 times  |              |
  2 LW * 4 times  |              |
  4 LW * 2 times  |              |
  8 LW * 1 time   |              |

Wide-Word Accesses --Limitation
  Data must be contiguous
    – Suitable for linear search
    – Does not support random access
  The number of transfer registers is fixed
    – Each thread has 16 read / write transfer registers
    – Some of the transfer registers may already be reserved for other uses

Result Cache --Brief
  Cache the result of the application (e.g., a classification result)
  If the same fields appear again, the cached result is returned
  Memory accesses are reduced on a cache hit
  Effectiveness depends on the temporal locality of the traffic

Result Cache --IXP2400
  No hardware cache is supported in the IXP2400 MEs
  Not easy to implement a set-associative cache in software
  The replacement policy is also an overhead

Result Cache --Design Considerations
  Shared or private cache?
  Size of the cache?
  Should it work with specific hardware (e.g., the CAM)?
  How is the miss penalty handled?

Result Cache --Example
  (Figure: result-cache example; a sketch of the idea follows below.)
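The slide's figure is not reproduced here. As one possible illustration, a minimal direct-mapped result cache in plain C (the size, field names, and the full_classify() helper are all assumptions for this sketch, not from the slides):

  #define RC_ENTRIES 16                  /* assumed cache size (power of two) */

  struct rc_entry {
      unsigned int tag;                  /* flow key (or a hash of it) */
      unsigned int result;               /* cached lookup result       */
      unsigned int valid;
  };

  static struct rc_entry rcache[RC_ENTRIES];

  unsigned int classify_cached(unsigned int flow_key)
  {
      unsigned int idx = flow_key & (RC_ENTRIES - 1);
      unsigned int result;

      if (rcache[idx].valid && rcache[idx].tag == flow_key)
          return rcache[idx].result;     /* hit: no memory lookup needed */

      /* miss: do the full lookup, then fill the cache entry */
      result = full_classify(flow_key);  /* assumed helper: the real lookup */
      rcache[idx].tag    = flow_key;
      rcache[idx].result = result;
      rcache[idx].valid  = 1;
      return result;
  }

On a hit the whole classification is answered from the cache; the sharing, sizing, and miss-handling questions on the previous slide are exactly the choices this sketch hard-codes.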

Multi-Level Memory Hierarchy --Brief
  Reduces the average access latency
  The number of accesses remains unchanged
  If the data can fit in a faster memory, put it there

Multi-Level Memory Hierarchy --Data Placement
  Small and read-only
    – Hard-code it (as constants in the program)
  Small but needs updating
    – Local Memory
  Larger
    – Scratchpad
  Largest
    – SRAM
  (A declaration sketch follows below.)
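A minimal declaration sketch of this placement, assuming the microengine C memory-region qualifiers __declspec(local_mem), __declspec(scratch), and __declspec(sram); the table names and sizes are illustrative only:

  #define SMALL_TBL   16
  #define MID_TBL    256
  #define BIG_TBL   4096

  __declspec(local_mem) unsigned int small_rw_tbl[SMALL_TBL]; /* small, frequently updated */
  __declspec(scratch)   unsigned int mid_tbl[MID_TBL];        /* larger table              */
  __declspec(sram)      unsigned int big_tbl[BIG_TBL];        /* largest table             */

(The small read-only case is simply hard-coded as constants or a switch in the program itself.)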

Multi-Level Memory Hierarchy --Packet Data Types
  Packet-related data
    – Temporary data, valid only for a specific packet
    – Keep it in Local Memory
  Flow-related data
    – Related to a specific flow; exhibits spatial locality
    – Use wide-word accesses
  Application-related data
    – Valid for a specific application; exhibits temporal locality
    – Use a result cache

Split-Cache (Z. Liu, IET-COM 2007)
  Two separate hardware caches, one for application data and one for flow data

Data Cache --Brief
  A hardware cache mechanism that caches the data used for packet processing
    – App-Cache
    – Flow-Cache
  Not supported by the IXP2400, however (it would need additional hardware)

Data Cache --CAM + Local Memory
  The CAM working together with Local Memory acts like a hardware cache
  However, the number of CAM entries is limited
  Each CAM entry may be associated with several Local Memory cache entries
  (A sketch of this pattern follows below.)
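A minimal sketch of the pattern, in the style of the earlier CAM usage slide (the record layout and the use_record()/load_record_from_sram() helpers are assumptions for illustration):

  #define CAM_ENTRIES 16     /* one Local Memory record per CAM entry */

  __declspec(local_mem) unsigned int flow_rec[CAM_ENTRIES][4];

  cam_lookup_t r = cam_lookup( flow_key );

  if( r.hit == 1 )
  {
      /* hit: the record for this key is already cached in Local Memory */
      use_record( flow_rec[r.entry_num] );
  }
  else
  {
      /* miss: r.entry_num is the entry to replace; refill its Local
         Memory record from SRAM, then claim the CAM entry for this key */
      load_record_from_sram( flow_key, flow_rec[r.entry_num] );
      cam_write( r.entry_num, flow_key, 0 );
      use_record( flow_rec[r.entry_num] );
  }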

Memory Dependent Techniques --Hiding Overhead
  These techniques do not really reduce the overhead, but overlap it with other work
    – Hardware multi-threading
    – Asynchronous memory accesses

Hardware Multi-Threading
  A thread swaps itself out and lets another thread execute while its memory access is in flight
  Each thread keeps its own set of registers, so no stack save/restore is needed on a swap
  Round-robin scheduling
  No thread preemption (threads yield voluntarily)

Asynchronous Memory --Brief
  A thread is not blocked when it issues a memory request
  Thus, a thread can have multiple memory requests outstanding at a time

Asynchronous Memory --Example (1 Issue)

Read X
__wait_for_all( &sig_x )
Read Y
__wait_for_all( &sig_y )
// use X and Y
...

Asynchronous Memory --Example (2 Issues)

Read X
Read Y
__wait_for_all( &sig_x, &sig_y )
// use X and Y
...

(A concrete microengine C version of this pattern follows below.)
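A minimal microengine C version of the two-issue pattern, in the same style as the earlier sram_read examples (ADDR_X and ADDR_Y are illustrative addresses):

  __declspec(sram_read_reg) UINT32 X, Y;
  SIGNAL sig_x, sig_y;

  sram_read( &X, ADDR_X, 1, sig_done, &sig_x );   /* issue read of X ...          */
  sram_read( &Y, ADDR_Y, 1, sig_done, &sig_y );   /* ... and of Y without waiting */

  __wait_for_all( &sig_x, &sig_y );               /* one wait covers both reads   */

  /* use X and Y */

The two latencies overlap, so the thread pays roughly one memory round trip instead of two.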

Wide-Word Access + Multiple Issues
  (Figure: memory layout MEM_ADDR+0 through +28)

Wide-Word Access + Multiple Issues (1 LW, 2 Issues)
  (Figure: memory layout MEM_ADDR+0 through +28)

Wide-Word Access + Multiple Issues (2 LW, 2 Issues)
  (Figure: memory layout MEM_ADDR+0 through +28)

Wide-Word Access + Multiple Issues (4 LW, 2 Issues)
  (Figure: memory layout MEM_ADDR+0 through +28; a code sketch follows below.)
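Combining both ideas, a minimal sketch of the "4 LW, 2 issues" case (same style and assumptions as the earlier wide-word examples):

  __declspec(sram_read_reg) UINT32 A[4], B[4];
  SIGNAL sig_a, sig_b;

  sram_read( &A, MEM_ADDR,      4, sig_done, &sig_a );  /* LW 0..3 */
  sram_read( &B, MEM_ADDR + 16, 4, sig_done, &sig_b );  /* LW 4..7 */

  __wait_for_all( &sig_a, &sig_b );                     /* both reads complete */

  /* use A[0..3] and B[0..3] */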

Wide-Word Access + Multiple Issues (Experiment)

  Scheme           | Total Cycles | Average Cycles / LW
  1 LW * 1 issue   |              |
  2 LW * 1 issue   |              |
  4 LW * 1 issue   |              |
  8 LW * 1 issue   |              |
  1 LW * 2 issues  |              |
  2 LW * 2 issues  |              |
  4 LW * 2 issues  |              |
  1 LW * 4 issues  |              |
  2 LW * 4 issues  |              |
  1 LW * 8 issues  |              |

Reference (1)
  Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar, "Overcoming the memory wall in packet processing: hammers or ladders?", Proc. ANCS 2005.
  Duo Liu, Zheng Chen, Bei Hua, Nenghai Yu, Xinan Tang, "High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor", ACM TECS 2008.

Reference (2)
  Z. Liu, K. Zheng, B. Liu, "Hybrid cache architecture for high-speed packet processing", IET Communications (IET-COM), 2007.