1
Cache Automaton. Arun Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, David Blaauw, Dennis Sylvester, Reetuparna Das. University of Michigan – Ann Arbor. MICRO-50, Boston, USA – October 17th, 2017
2
Pattern matching in abundance …
Log Processing — [07/Dec/2016:11:04: ] "GET / HTTP/1.1" "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/ Firefox/49.0"
Natural Language Processing — / right/JJ to/[^\s]+ /  / the/[^\s]+ back/RB /  / [^/]+/DT longer/[^\s]+ /  / ,/[^\s]+ have/VB /  / [^/]+/VBD by/[^\s]+ /
XML Parsing — <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date> </publish_date> <description>An in-depth look at creating applications with XML.</description> </book>
Network Intrusion Detection — alert tcp $EXTERNAL_NET any -> $HOME_NET $HTTP_PORTS (msg:"PROTOCOL-SCADA Cogent DataHub server-side information disclosure"; flow:to_server,established; content:".asp."; nocase; http_uri; pcre:"/\x2easp\x2e($|\?)/iU"; metadata:service http; reference:cve, ; classtype:web-application-attack; sid:20174; rev:4;)
Video Decoding
Data Analytics — name variants: Donald Duck, Donald Fauntleroy Duck, Donald F Duck, D F Duck
Motif search — /([STX])(.{2}?)([DBEZX])/  /([RKX])(.{2,3}?)([DBEZX])(.{2,3}?)([YX])/  /([GX])([^EDRKHPFYW])(.{2}?)([STAGCNBX])([^P])/
Particle Path Tracking
3
Finite State Automata extract patterns
Figure: a three-state automaton over Σ = {x, \s, \n} (states S0, S1, S2; transitions labelled x, \s, \n)
4
Compute-Centric Architectures
Switch-Case

while ((c = getchar()) != EOF) {
    if (c == '\n') { putchar('\n'); state = S0; }
    else switch (state) {
        case S0: if (c != ' ') { putchar(c); state = S1; } break;   /* '\s' on the slide = space */
        case S1: if (c == ' ') state = S2; else putchar(c); break;
        case S2: break;
    }
}

Branch mispredictions ! Irregular memory accesses !
5
Compute-Centric Architectures
Table-Lookup

static const int T_table[3][3] = {
    /*        '\s'  '\n'  other */
    /* S0 */ {  S0,  S0,  S1 },
    /* S1 */ {  S2,  S0,  S1 },
    /* S2 */ {  S2,  S0,  S2 },
};

while ((c = getchar()) != EOF) {
    int id = (c == ' ') ? 0 : (c == '\n') ? 1 : 2;
    state = T_table[state][id];
    putchar(c);
}

Memory bandwidth bottlenecked ! Limited state transitions per cycle !
6
Memory-Centric Architectures
Micron’s Automata Processor (AP): 1 rank with 8 chips; 48k state transitions per cycle per chip; 256x faster than CPU; 170x faster than GPU (iNFAnt2) [ANMLZoo, Wadden et al. ’16]
7
In-memory automata processing
Figure: an NFA (states S1–S6, symbols b, i, a, with start and report states marked) and an equivalent automaton
8
In-memory automata processing
Figure: the same equivalent automata; in the right-hand form, each state has incoming transitions on one input symbol
9
In-memory automata processing
Figure: the equivalent homogeneous NFA (start, S1, transitions on b, i, a, report), in which every state has incoming transitions on a single input symbol
10
State representation: one-hot encoding over the 8-bit ASCII alphabet. Each state is a 256-bit column with a 1 in every row (symbol) it matches — e.g. row 97 for a state labelled 'a', row 98 for 'b', and all rows 0–255 for a '*' state.
11
In-memory automata processing
Figure: input symbols fed to the homogeneous NFA (start, S1; transitions on b, i, a; report state)
12
In-memory automata processing
Figure: input stream ". . a b a" fed to the homogeneous NFA. Step 1: State-Match
13
Massively Parallel State-Match
Figure: memory columns repurposed as FSM states (S0, S1, S2, …, Sn); row address repurposed as the input symbol (rows a, b, …, 255); an Active State Vector tracks the input stream ". . a b a"
14
Massively Parallel State-Match
Figure: the same array with the match bits filled in — each state column holds a 1 in the rows of the symbols it matches
15
Massively Parallel State-Match
Repurpose memory columns as FSM states S0 S1 S2 Sn a b 255 1 1 1 1 Repurpose row address as input symbol 1 1 1 1 1 1 Input symbols . . a b a Active State Vector
16
Massively Parallel State-Match
Figure: reading the row addressed by the current input symbol yields every state's match bit at once — a bit-parallel state-match
17
In-memory automata processing
Figure: input stream ". . a b a" fed to the homogeneous NFA. Step 1: State-Match; Step 2: State-Transition
18
Massively Parallel State-Transition
Figure: the bit-parallel state-match vector is combined (AND, &) with the Active State Vector to produce the next active states
19
Massively Parallel State-Transition
Figure: a custom interconnect propagates active states to their successors — a bit-parallel state-transition — and the result is masked by the state-match vector
20
Massively Parallel State-Transition
Figure: the complete array, with bit-parallel state-match in the SRAM rows and the custom interconnect performing state-transition on the Active State Vector
21
Are SRAM-based caches suitable for automata processing ?
Cache Automaton ?
! DRAM technology is energy inefficient, and the AP is slow (~133 MHz)
! Low packing density (12 MB of AP = 200 MB of regular DRAM)
! Off-chip host-to-accelerator communication overhead
Are SRAM-based caches suitable for automata processing ?
22
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
23
Opportunity
✓ SRAM-based caches are faster and more energy efficient than DRAM
✓ Integrated on processor dies with performance-optimized logic
! Is cache capacity an issue for large NFAs ?
24
Opportunity

                   Form Factor          Capacity   # States
DRAM               1 rank (8 devices)   200 MB     -
DRAM-based AP                           12 MB      384,000,000
Last-Level Cache   1 socket             25 MB      400,000,000
25
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
26
Challenges 1 Using cache hierarchy is slow
27
Using the cache hierarchy is slow !
L1 cache ~ 2 cycles L2 cache ~ 12 cycles L3 cache ~ 36 cycles Credits: Intel Low operating frequency ~ 111 MHz
28
Challenges 2 Efficient state-match and state-transition in cache
29
Column multiplexing reduces bit-parallelism
Figure: column multiplexing in SRAM arrays reduces the bit-level parallelism available for state-match (row decoder with rows 0–255; row address repurposed as input symbol; memory columns repurposed to store FSM states)
30
Scalable interconnect for state-transition in cache
Crossbar ?
! Overprovisioned: dynamic arbitration and multi-bit ports are unnecessary here
! Infeasible: a 100,000-state NFA would need a 100,000 x 100,000 crossbar
A scalable network architecture is required
31
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
32
In-situ computation cognizant of cache geometry can provide benefits
1
33
Intel Xeon Last-Level Cache
Figure: a 2.5 MB LLC slice (LS, CBOX) with ways 1–20; each way built from 32 kB data banks of 16 kB subarrays, with tag, state, and LRU arrays alongside
34
Intel Xeon Last-Level Cache
Figure: zooming into a 16 kB subarray: 64 I/O chunks (Chunk 0 … Chunk 63), SRAM with 4:1 and 2:1 column decoders
35
Intel Xeon Last-Level Cache
Figure: zooming further into an 8 kB SRAM array: a row decoder drives wordlines (WL, rows 0–255); bitline pairs (BL/BLB) read out the columns
36
Intel Xeon Last-Level Cache
Figure: the 8 kB SRAM arrays are accessed at 4 GHz
37
Sense-amplifier cycling enables highly parallel, low latency state-match
2
38
Accelerating state-match is challenging …
Column-multiplexing Read sequence ! 4 cycles to read 4 adjacent bits (state-match results)
39
Sense-amplifier cycling
Read sequence ! 4 cycles to read 4 adjacent bits (state-match results) Read sequence (optimized) ✓ < 2 cycles to read 4 adjacent bits (state-match results)
40
8T-based SRAM arrays can be repurposed to compactly encode state-transitions
3
41
Accelerating state-transition
Figure: a crosspoint built from an 8T SRAM cell — a storage cell holding the enable bit plus a 2T read stack. Compact 8T SRAM-based switches are the building block of the programmable interconnect
42
Real-world NFA can be grouped into dense regions with sparse connectivity
4
43
Hierarchical network architecture
Hierarchical switch topology scales to large NFAs: L-switches serve the dense regions of real-world NFAs, G-switches their sparse interconnections
44
Putting it together …
Figure: LLC slice with a Local Switch (L-switch) next to the CBOX; ways 1–20 of 16 kB subarrays (tag, state, LRU alongside); each subarray split into Array_H (PA[16] = 1) and Array_L (PA[16] = 0)
45
Cache Automaton Architecture
Figure: the same slice with a global switch (G-switch-1) added above the Local Switch
46
Cache Automaton Architecture
Figure: the slice with input/output buffers, an H-bus, and global switches G-switch-1 and G-switch-4 connected to the L-switch
47
Cache Automaton Architecture
Figure: detail of a 256-state partition (2 x 4 KB SRAM arrays): an 8-bit input symbol drives the row decoder (rows 0–255) over STE(0) … STE(255); the 256-bit match vector and Active State Vector meet in a 280x256 L-switch; report vector bits are emitted; 16b/8b links run to/from G-switch-1 and G-switch-4
48
Merging connected components
Figure: automata for the patterns car, care, cart, and camel (each with a '*' self-looping start state) merged so that they share common prefix states — merging connected components
49
Cache Automaton designs
Performance-optimized (CA_P): ✓ small connected components -> low-radix switches; ! redundant state activity
Space-optimized (CA_S): ✓ low footprint, no redundant state activity; ! large connected components -> high-radix switches and careful mapping
50
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
51
Hierarchical switch topology scales to large NFA
Figure: the compiler maps the dense regions and sparse connectivity of real-world NFAs onto the G-switches of the hierarchical topology
52
k-way graph partitioning
Compiler flow: connected components CC0 (75), CC1 (87), CC2 (174), CC3 (768), CC4 (4568) -> greedy packing -> k-way graph partitioning -> mapped onto 16 kB subarrays (Array_H / Array_L) across ways 0–3 of an LLC slice with local and global switches; leftover 32 kB data banks remain unmapped
53
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
54
Experimental Setup
✓ Multi-Threaded Virtual Automata Simulator (VASim)
✓ METIS – k-way graph partitioning
✓ ANMLZoo and Regex benchmark suites, 1/10 MB input data
✓ Memory compiler (28nm) – area, power, delay of 6T/8T SRAM
✓ Wire delay (global wires – 66 ps/mm, H-bus – 300 ps/mm [1])
[1] Das et al., “SLIP: reducing wire energy in the memory hierarchy”, ISCA ’15
55
Performance CA_P and CA_S achieve 9x and 15x speedup over AP respectively
56
Energy
60
Energy 3x lower energy per symbol compared to ideal AP model
61
Design Space: AP – 48k states; CA – 32k states
62
Design Space: CA designs provide higher average reachability and throughput at low area cost
63
Summary
✓ SRAM-based last-level caches can accelerate automata processing by 9-15x over the Micron AP
Cache Automaton:
Accelerating state-match — sense-amplifier cycling
Accelerating state-transition — programmable interconnect built from 8T SRAM switches
Compiler to automate mapping NFAs to cache arrays
64
Backup
65
Power