Published by Robyn Liliana Harrell. Modified over 9 years ago.
1 A Fast On-Chip Profiler Memory
Roman Lysecky, Susan Cotterell, Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the National Science Foundation
2 Outline
Introduction
Problem Definition
Profiling Techniques
Pipelined Binary Search Tree
ProMem
Conclusions
3 Introduction
Goal: Determine # of Times Each Target Pattern Appears on the Bus
Our Solution: Add On-Chip Profiler Memory (ProMem) to Monitored Bus
–Accepts 1 pattern/cycle
–Keeps exact counts
[Figure: system-on-chip with Processor, I$, D$, Mem, Bridge, and Per. blocks; ProMem is attached to the embedded bus being monitored]
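4 Introduction
[Figure: profile-driven partitioning flow on a platform with Mem, Processor, Per., and FPGA. Profiling prog.c yields profile information (instructions executed for Loop A … Loop N); the loop with the most instructions executed, Loop A, is moved to HW through synthesis, and the FPGA is configured]
prog.c:
    void compute() {
        // small Loop A
        for(i=0;…;…) …
        // small Loop B
        for(x=0;…;…)
    }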
5 Introduction
Profiling Can Be Used to Solve Many Problems
–Optimization of frequently executed subroutines
–Mapping frequently executed code and data to non-interfering cache regions
–Synthesis of optimized hardware for common cases
–Identifying frequent loops to map to a small low-power loop cache
–Many others
6 Problem Definition
Objective
–Count the number of times each target pattern appears on bus B
Requirements
–Accept input patterns on every clock cycle
–Monitor any bus, e.g., deeply embedded buses in SOCs
–Non-intrusive
–Exact target pattern counts
Input Patterns P = {p_1, …, p_m}
Target Patterns TP = {tp_1, …, tp_m}
Target Pattern Counts CTP = {ctp_1, …, ctp_m}
TP      CTP
tp_1    11203
tp_2    8876
…       …
tp_m    ctp_m
[Figure: Mem, Processor, and Per. connected by Bus B, which carries input patterns p_1, p_2, …, p_m]
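The objective above can be stated as a small software reference model. This is only a sketch of the required result (exact counts, one input pattern consumed per cycle), not of the hardware; the function name and the sample trace values are made up for illustration.

```python
def count_target_patterns(bus_trace, target_patterns):
    """Return the exact count of each target pattern observed in bus_trace."""
    counts = {tp: 0 for tp in target_patterns}
    for pattern in bus_trace:          # one input pattern per clock cycle
        if pattern in counts:          # non-target patterns are ignored
            counts[pattern] += 1
    return counts


# Hypothetical bus trace; the values are illustrative only.
trace = [0x10, 0x20, 0x10, 0x30, 0x10, 0x20]
print(count_target_patterns(trace, [0x10, 0x20, 0x40]))
```

Any hardware profiler that meets the requirements should produce the same counts as this model for the same trace.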
7 Profiling Techniques - Software
Instrumenting Software
–Adding code to count frequencies of desired code regions
Problems
–Incurs runtime overhead
–Possibly changes program behavior
–Increases code size
[Figure: prog.c running on the Processor, instrumented with a counter: for( … ){ … ctp_m++; }]
8 Profiling Techniques - Software
Periodic Sampling
–Interrupt processor at periodic interval
–Read program counter and other internal registers
Problems
–Disruption of runtime behavior during interrupt
–Inaccurate
[Figure: prog.c running on the Processor with a periodic interrupt service routine: // ISR period = 10ms ISR{ //update profile info }]
9 Profiling Techniques - Software
Simulation
–Execute application on instruction set simulator (ISS)
–Simulator keeps track of profile information
Problems
–Difficult to model external environment, which leads to inaccuracy
–Extremely slow
[Figure: prog.c runs on the ISS, which produces profile information]
10 Profiling Techniques - Hardware
Logic Analyzer
–Probes placed directly on bus to be monitored
Problems
–Cannot monitor embedded buses
[Figure: Mem, Processor, and Per. on a bus carrying patterns p_1, p_2, …, p_m]
11 Profiling Techniques - Hardware
Processor Support
–Mainly event counters
–Monitored events include cache misses, pipeline stalls, etc.
Problems
–Few registers available
–Reconfiguration needed to obtain a complete profile
–Leads to inaccuracy
[Figure: Mem, Processor, and Per. on a bus carrying patterns p_1, p_2, …, p_m]
12 Profiling Techniques - Hardware
Content-addressable memories (CAMs)
–Fast search for a key in a large data set
–Returns the address at which the key resides in a memory
Types
–Fully Associative
–RAM coupled with a smart controller
[Figure: a CAM attached to the bus between Mem, Processor, and Per., observing patterns p_1, p_2, …, p_m]
13 Profiling Techniques - Hardware
Fully Associative CAMs
–Simultaneously compares every location with the key
Problems
–Does not scale well to larger memories
–Increased access time as CAM size grows
–Large power consumption
[Figure: the CAM compares each bus pattern against tp_1, tp_2, tp_3, …, tp_m in parallel, one comparator (=) per entry]
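A minimal behavioral sketch of the fully associative lookup, assuming one comparator per stored entry as the figure suggests. The function name and return shape are illustrative; the list comprehension stands in for comparisons that the hardware performs simultaneously.

```python
def cam_lookup(entries, key):
    """Model a fully associative CAM lookup.

    Returns (address of the first match or None, comparators used).
    Every stored entry is compared with the key on every lookup.
    """
    matches = [addr for addr, stored in enumerate(entries) if stored == key]
    comparators = len(entries)   # one comparator per entry, all active at once
    return (matches[0] if matches else None), comparators


print(cam_lookup(list("abcdefgh"), "c"))   # (2, 8)
```

The comparator count grows linearly with the number of entries, which is the scaling problem listed above.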
14 Profiling Techniques - Hardware
RAM coupled with a smart controller
–Efficient lookup data structure in memory, such as a binary tree or Patricia trie
Problems
–Multiple-cycle lookup
[Figure: the CAM is built from a controller (Ctrl) and an SRAM, attached to the bus between Mem, Processor, and Per.]
15 Observations
A 1-cycle lookup is not necessary
We only need to accept one input pattern every cycle
16 Queueing
Hold input patterns in queue until we are able to process them
Problems
–Does not work with patterns arriving every clock cycle
[Figure: Bus B feeds a FIFO placed in front of the CAM (Ctrl + SRAM)]
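Why queueing fails can be seen in a small simulation. The sketch below assumes a non-pipelined lookup that occupies the controller for a fixed number of cycles (the parameter name lookup_cycles is hypothetical) while one pattern arrives every cycle.

```python
from collections import deque


def fifo_occupancy(num_cycles, lookup_cycles):
    """Return the FIFO occupancy at the end of each cycle."""
    fifo = deque()
    busy = 0                      # cycles left on the lookup in progress
    history = []
    for cycle in range(num_cycles):
        fifo.append(cycle)        # a new pattern arrives every cycle
        if busy == 0:
            fifo.popleft()        # controller is free: start the next lookup
            busy = lookup_cycles
        busy -= 1
        history.append(len(fifo))
    return history


print(fifo_occupancy(8, 1))   # [0, 0, 0, 0, 0, 0, 0, 0]
print(fifo_occupancy(8, 4))   # [0, 1, 2, 3, 3, 4, 5, 6]
```

With a multi-cycle lookup the occupancy keeps growing, so no finite FIFO is enough when patterns arrive on every cycle.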
17 Pipelining
Implemented in processors so that an instruction can be executed every cycle
Can we use pipelining to solve our problem?
18 Pipelined CAM
Large CAMs require long access times
Partition large CAM into several smaller CAMs
–Requires pipelining to reduce access time
–Provides solution to access time problem
–Requires large area
–Large power consumption
[Figure: two CAMs separated by a pipeline register]
19 Pipelined CAM
Entries can be stored in a CAM in any order
–Requires sequential lookup in pipelined CAM approach
Is there a benefit to sorting the entries?
–Not necessary to search all entries
–Leads to faster lookup time
A tree provides an inherently sorted structure
–Search time remains a problem
–Can we pipeline the structure?
20 Pipelined Tree
Solves access time problem
–One memory access per level
Solves area problem
–Single comparator per level
–Each level grows by a factor of two
–For large memories, comparators are negligible
[Figure: a tree with one comparator (=) per level]
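The area argument can be made concrete with a short count. This is a sketch under the assumption of a full tree with n levels in which level s holds 2^s entries; the symbol n is introduced here for illustration.

```latex
\begin{align*}
\text{entries stored} &= \sum_{s=0}^{n-1} 2^{s} = 2^{n} - 1, \\
\text{comparators (pipelined tree)} &= n, \\
\text{comparators (fully associative CAM)} &= 2^{n} - 1 .
\end{align*}
```

For example, n = 10 levels hold 1023 entries with 10 comparators, where a fully associative CAM of the same capacity compares all 1023 locations.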
21 Pipelined Binary Search Tree
Root Node
Each node has at most two children
Left child > Parent
Right child < Parent
[Figure: binary search tree over the patterns a to k, drawn by level: h (root); d, j; b, f, i, k; a, c, e, g]
22 Pipelined Binary Search Tree
Searching for Input Pattern: f
Stage 0: f < h, go right
Stage 1: f > d, go left
Stage 2: f = f, Found!
[Figure: the tree laid out across Stage 0 to Stage 3, with the search path h, d, f highlighted]
23 Pipelined Binary Search Tree
Searching for Input Pattern: e
Stage 0: e < h, append 0 to address (address 0)
Stage 1: e > d, append 1 to address (address 01)
Stage 2: e < f, append 0 to address (address 010)
Stage 3: e = e, Found!
[Figure: the tree laid out across Stage 0 to Stage 3, each node labeled with its binary address within its stage; the path h, d, f, e is highlighted]
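The address computation in this example can be sketched in a few lines. The stage contents below are read off the example tree (h; d, j; b, f, i, k; a, c, e, g); representing unused locations as None and the function name search are assumptions made for illustration.

```python
# One list per stage; None marks an unused location (an assumption of this sketch).
STAGES = [
    ["h"],
    ["d", "j"],
    ["b", "f", "i", "k"],
    ["a", "c", "e", "g", None, None, None, None],
]


def search(pattern, stages=STAGES):
    """Return (stage, address bits) where pattern is stored, or None."""
    addr = 0
    for s, memory in enumerate(stages):
        stored = memory[addr]
        if stored is None:
            return None
        if pattern == stored:
            bits = format(addr, "b").zfill(s) if s > 0 else ""
            return s, bits
        # Append 0 when the pattern is smaller, 1 when it is larger.
        addr = (addr << 1) | (1 if pattern > stored else 0)
    return None


print(search("e"))   # (3, '010')
print(search("f"))   # (2, '01')
```

Each stage performs exactly one memory read and one comparison, and the address for the next stage is the current address with one bit appended.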
24 Pipelined Binary Search Tree
Searching for Input Patterns: e, f (e enters the pipeline first, f follows one stage behind)
Stage 0: e < h, append 0 to address (0); then f < h, append 0 to address (0)
Stage 1: e > d, append 1 to address (01); then f > d, append 1 to address (01)
Stage 2: e < f, append 0 to address (010); then f = f, Found!
Stage 3: e = e, Found!
[Figure: the tree laid out across Stage 0 to Stage 3, with both search paths highlighted]
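A cycle-by-cycle sketch of two patterns moving through the stages. It assumes the same example tree, one pipeline register per stage, and unused locations stored as None; the function name run_pipeline and the hit log format are illustrative.

```python
STAGES = [
    ["h"],
    ["d", "j"],
    ["b", "f", "i", "k"],
    ["a", "c", "e", "g", None, None, None, None],
]


def run_pipeline(patterns, stages=STAGES):
    """Feed one pattern per cycle; return (counts, [(cycle, stage, pattern)])."""
    n = len(stages)
    regs = [None] * n              # regs[s] = (pattern, address) at stage s
    counts, hits = {}, []
    feed = list(patterns)
    cycle = 0
    while feed or any(r is not None for r in regs):
        # A new pattern enters stage 0 every cycle while input remains.
        regs[0] = (feed.pop(0), 0) if feed else None
        next_regs = [None] * n
        for s in range(n):
            if regs[s] is None:
                continue
            pattern, addr = regs[s]
            stored = stages[s][addr]
            if stored == pattern:
                counts[pattern] = counts.get(pattern, 0) + 1
                hits.append((cycle, s, pattern))
            elif stored is not None and s + 1 < n:
                bit = 1 if pattern > stored else 0
                next_regs[s + 1] = (pattern, (addr << 1) | bit)
        regs = next_regs
        cycle += 1
    return counts, hits


print(run_pipeline(["e", "f"]))
```

In this model f is found in stage 2 in the same cycle that e is found in stage 3, so the two searches overlap and a new pattern is accepted on every cycle.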
25 Pipelined Binary Search Tree
Standard Memories
[Figure: each level of the tree (Stage 0 to Stage 3) is stored in its own standard memory; within a stage, nodes are located by the binary address built up during the search]
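One way to fill the per-stage memories from a sorted list of target patterns is sketched below. It assumes a full tree of 2^n − 1 patterns and the address convention of the search example (append 0 for smaller, 1 for larger); it is not taken from the slides, and the function name is illustrative.

```python
def build_stage_memories(sorted_patterns):
    """Split 2**n - 1 sorted patterns into n stage memories of sizes 1, 2, 4, ..."""
    count = len(sorted_patterns)
    n = (count + 1).bit_length() - 1
    if (1 << n) - 1 != count:
        raise ValueError("this sketch expects 2**n - 1 patterns")
    stages = [[None] * (1 << s) for s in range(n)]

    def place(lo, hi, stage, addr):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        stages[stage][addr] = sorted_patterns[mid]         # median becomes the node
        place(lo, mid - 1, stage + 1, addr << 1)           # smaller: append 0
        place(mid + 1, hi, stage + 1, (addr << 1) | 1)     # larger: append 1

    place(0, count - 1, 0, 0)
    return stages


print(build_stage_memories(list("abcdefg")))
# [['d'], ['b', 'f'], ['a', 'c', 'e', 'g']]
```

Stage s ends up with exactly 2^s locations, so each stage can be an ordinary memory addressed by the s bits accumulated so far.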
26 ProMem – Module Design
[Figure: ProMem stage s. Inputs p_s_i (Input Pattern), A_s_i (Search Address), and cen_i (Enable) are latched in pipeline regs p_s, A_s, and cen; outputs are p_s_o (Input Pattern), A_s+1_o (Search Address, Next Stage), and cen_o (Enable, Next Stage)]
27 ProMem – Module Design
Target Pattern Memory TPM_s (2^s × w), with ports addr, rd, dout
[Figure: ProMem stage s with pipeline regs p_s, A_s, cen and the target pattern memory TPM_s]
28 ProMem – Module Design
Search for Target Pattern
Compare (>, =)
–Target Pattern Found
–Target Pattern Not Found: Enable Next Stage
[Figure: ProMem stage s with pipeline regs p_s, A_s, cen, the target pattern memory TPM_s (2^s × w), and the comparator]
29 ProMem – Module Design
Target Pattern Count Memory CM_s (2^s × c), with ports wr, rd, addr, dout
[Figure: ProMem stage s with pipeline regs p_s, A_s, cen, the target pattern memory TPM_s (2^s × w), the comparator (>, =), and the count memory CM_s]
30 ProMem – Module Design
When Target Pattern Found: Update Count Value (+1)
[Figure: ProMem stage s with pipeline regs p_s, A_s, cen, TPM_s (2^s × w), the comparator (>, =), CM_s (2^s × c), and a +1 incrementer on the count]
31 ProMem – Module Design
Module components
–Pipeline Register
–Memories
–Module Controller
[Figure: ProMem stage s with pipeline regs p_s, A_s, cen, TPM_s (2^s × w), the comparator (>, =), CM_s (2^s × c), and the +1 incrementer, grouped into the three components above]
32 ProMem - Interface
Simple Interface
–Internal interface
  Enable signal
  Connection to monitored bus
–External interface
  Read enable
  Write enable
  Connection to ProMem pattern input bus
[Figure: ProMem attached to the bus between Mem, Processor, and Per., with signals ren, wen, addr, cen, and clk]
33 ProMem - Layout
Efficient Layout
–Achieved by simply abutting each module with the next
–Results in very short bus wires between modules
[Figure: ProMem stage s module (pipeline regs, TPM_s, comparator, CM_s, +1 incrementer) as the unit that is abutted]
34 ProMem Results – Area*
Module overhead only 1%
*Area obtained using the UMC 0.18 µm technology library provided by Artisan Components
35 ProMem Results – vs. CAM
CAM design is 46% larger than ProMem
36 ProMem Results – Timing vs. CAM
CAM access time grows with CAM size
ProMem access time remains constant (due to pipelining)
37 Conclusions
Introduced a new memory structure specifically for fast on-chip profiling
One pattern per cycle throughput
Simple interface to monitored bus
Efficient design is very scalable