1
Cache Automaton. Arun Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, David Blaauw, Dennis Sylvester, Reetuparna Das. University of Michigan – Ann Arbor. MICRO-50, Boston, USA – October 17th, 2017
2
Pattern matching in abundance …
Log Processing — [07/Dec/2016:11:04: ] "GET / HTTP/1.1" "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/ Firefox/49.0"
Natural Language Processing — / right/JJ to/[^\s]+ /  / the/[^\s]+ back/RB /  / [^/]+/DT longer/[^\s]+ /  / ,/[^\s]+ have/VB /  / [^/]+/VBD by/[^\s]+ /
XML Parsing — <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date> </publish_date> <description>An in-depth look at creating applications with XML.</description> </book>
Network Intrusion Detection — alert tcp $EXTERNAL_NET any -> $HOME_NET $HTTP_PORTS (msg:"PROTOCOL-SCADA Cogent DataHub server-side information disclosure"; flow:to_server,established; content:".asp."; nocase; http_uri; pcre:"/\x2easp\x2e($|\?)/iU"; metadata:service http; reference:cve, ; classtype:web-application-attack; sid:20174; rev:4;)
Video Decoding
Data Analytics — name variants: Donald Duck, Donald Fauntleroy Duck, Donald F Duck, D F Duck
Motif search — /([STX])(.{2}?)([DBEZX])/  /([RKX])(.{2,3}?)([DBEZX])(.{2,3}?)([YX])/  /([GX])([^EDRKHPFYW])(.{2}?)([STAGCNBX])([^P])/
Particle Path Tracking
3
Finite State Automata extract patterns
Figure: a three-state automaton over Σ = {x, \s, \n} (states S0, S1, S2; transitions labelled x, \s, \n)
4
Compute-Centric Architectures
Switch-Case

while ((c = getchar()) != EOF) {
    if (c == '\n') { putchar('\n'); state = S0; }
    else switch (state) {
        case S0: if (c != ' ') { putchar(c); state = S1; } break;   /* '\s' on the slide = space */
        case S1: if (c == ' ') state = S2; else putchar(c); break;
        case S2: break;
    }
}

Branch mispredictions ! Irregular memory accesses !
5
Compute-Centric Architectures
Table-Lookup

static const int T_table[3][3] = {
    /*        '\s'  '\n'  other */
    /* S0 */ {  S0,  S0,  S1 },
    /* S1 */ {  S2,  S0,  S1 },
    /* S2 */ {  S2,  S0,  S2 },
};

while ((c = getchar()) != EOF) {
    int id = (c == ' ') ? 0 : (c == '\n') ? 1 : 2;
    state = T_table[state][id];
    putchar(c);
}

Memory bandwidth bottlenecked ! Limited state transitions per cycle !
6
Memory-Centric Architectures
Micron’s Automata Processor (AP): 1 rank with 8 chips; 48k state transitions per cycle per chip; 256x faster than CPU; 170x faster than GPU (iNFAnt2) [ANMLZoo, Wadden et al. ’16]
7
In-memory automata processing
Figure: an NFA (states S1–S6, symbols b, i, a, with start and report states marked) and an equivalent automaton
8
In-memory automata processing
Figure: the same equivalent automata; in the right-hand form, each state has incoming transitions on one input symbol
9
In-memory automata processing
Figure: the equivalent homogeneous NFA (start, S1, transitions on b, i, a, report), in which every state has incoming transitions on a single input symbol
10
State representation: one-hot encoding over the 8-bit ASCII alphabet. Each state is a 256-bit column with a 1 in every row (symbol) it matches — e.g. row 97 for a state labelled 'a', row 98 for 'b', and all rows 0–255 for a '*' state.
11
In-memory automata processing
Figure: input symbols fed to the homogeneous NFA (start, S1; transitions on b, i, a; report state)
12
In-memory automata processing
Figure: input stream ". . a b a" fed to the homogeneous NFA. Step 1: State-Match
13
Massively Parallel State-Match
Figure: memory columns repurposed as FSM states (S0, S1, S2, …, Sn); row address repurposed as the input symbol (rows a, b, …, 255); an Active State Vector tracks the input stream ". . a b a"
14
Massively Parallel State-Match
Figure: the same array with the match bits filled in — each state column holds a 1 in the rows of the symbols it matches
15
Massively Parallel State-Match
Repurpose memory columns as FSM states S0 S1 S2 Sn a b 255 1 1 1 1 Repurpose row address as input symbol 1 1 1 1 1 1 Input symbols . . a b a Active State Vector
16
Massively Parallel State-Match
Figure: reading the row addressed by the current input symbol yields every state's match bit at once — a bit-parallel state-match
17
In-memory automata processing
Figure: input stream ". . a b a" fed to the homogeneous NFA. Step 1: State-Match; Step 2: State-Transition
18
Massively Parallel State-Transition
Figure: the bit-parallel state-match vector is combined (AND, &) with the Active State Vector to produce the next active states
19
Massively Parallel State-Transition
Figure: a custom interconnect propagates active states to their successors — a bit-parallel state-transition — and the result is masked by the state-match vector
20
Massively Parallel State-Transition
Figure: the complete array, with bit-parallel state-match in the SRAM rows and the custom interconnect performing state-transition on the Active State Vector
21
Are SRAM-based caches suitable for automata processing ?
Cache Automaton ?
! DRAM technology is energy inefficient, and the AP is slow (~133 MHz)
! Low packing density (12 MB of AP = 200 MB of regular DRAM)
! Off-chip host-to-accelerator communication overhead
Are SRAM-based caches suitable for automata processing ?
22
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
23
Opportunity
✓ SRAM-based caches are faster and more energy efficient than DRAM
✓ Integrated on processor dies with performance-optimized logic
! Is cache capacity an issue for large NFAs ?
24
Opportunity

                   Form Factor          Capacity   # States
DRAM               1 rank (8 devices)   200 MB     -
DRAM-based AP                           12 MB      384,000,000
Last-Level Cache   1 socket             25 MB      400,000,000
25
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
26
Challenges 1 Using cache hierarchy is slow
27
Using the cache hierarchy is slow !
L1 cache ~ 2 cycles L2 cache ~ 12 cycles L3 cache ~ 36 cycles Credits: Intel Low operating frequency ~ 111 MHz
28
Challenges 2 Efficient state-match and state-transition in cache
29
Column multiplexing reduces bit-parallelism
Figure: column multiplexing in SRAM arrays reduces the bit-level parallelism available for state-match (row decoder with rows 0–255; row address repurposed as input symbol; memory columns repurposed to store FSM states)
30
Scalable interconnect for state-transition in cache
Crossbar ?
! Overprovisioned: dynamic arbitration and multi-bit ports are unnecessary here
! Infeasible: a 100,000-state NFA would need a 100,000 x 100,000 crossbar
A scalable network architecture is required
31
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
32
In-situ computation cognizant of cache geometry can provide benefits
1
33
Intel Xeon Last-Level Cache
Figure: a 2.5 MB LLC slice (LS, CBOX) with ways 1–20; each way built from 32 kB data banks of 16 kB subarrays, with tag, state, and LRU arrays alongside
34
Intel Xeon Last-Level Cache
Figure: zooming into a 16 kB subarray: 64 I/O chunks (Chunk 0 … Chunk 63), SRAM with 4:1 and 2:1 column decoders
35
Intel Xeon Last-Level Cache
Figure: zooming further into an 8 kB SRAM array: a row decoder drives wordlines (WL, rows 0–255); bitline pairs (BL/BLB) read out the columns
36
Intel Xeon Last-Level Cache
Figure: the 8 kB SRAM arrays are accessed at 4 GHz
37
Sense-amplifier cycling enables highly parallel, low latency state-match
2
38
Accelerating state-match is challenging …
Column-multiplexing Read sequence ! 4 cycles to read 4 adjacent bits (state-match results)
39
Sense-amplifier cycling
Read sequence ! 4 cycles to read 4 adjacent bits (state-match results) Read sequence (optimized) ✓ < 2 cycles to read 4 adjacent bits (state-match results)
40
8T-based SRAM arrays can be repurposed to compactly encode state-transitions
3
41
Accelerating state-transition
Figure: a crosspoint built from an 8T SRAM cell — a storage cell holding the enable bit plus a 2T read stack. Compact 8T SRAM-based switches are the building block of the programmable interconnect
42
Real-world NFA can be grouped into dense regions with sparse connectivity
4
43
Hierarchical network architecture
Hierarchical switch topology scales to large NFAs: L-switches serve the dense regions of real-world NFAs, G-switches their sparse interconnections
44
Putting it together …
Figure: LLC slice with a Local Switch (L-switch) next to the CBOX; ways 1–20 of 16 kB subarrays (tag, state, LRU alongside); each subarray split into Array_H (PA[16] = 1) and Array_L (PA[16] = 0)
45
Cache Automaton Architecture
Figure: the same slice with a global switch (G-switch-1) added above the Local Switch
46
Cache Automaton Architecture
Figure: the slice with input/output buffers, an H-bus, and global switches G-switch-1 and G-switch-4 connected to the L-switch
47
Cache Automaton Architecture
Figure: detail of a 256-state partition (2 x 4 KB SRAM arrays): an 8-bit input symbol drives the row decoder (rows 0–255) over STE(0) … STE(255); the 256-bit match vector and Active State Vector meet in a 280x256 L-switch; report vector bits are emitted; 16b/8b links run to/from G-switch-1 and G-switch-4
48
Merging connected components
Figure: automata for the patterns car, care, cart, and camel (each with a '*' self-looping start state) merged so that they share common prefix states — merging connected components
49
Cache Automaton designs
Performance-optimized (CA_P): ✓ small connected components -> low-radix switches; ! redundant state activity
Space-optimized (CA_S): ✓ low footprint, no redundant state activity; ! large connected components -> high-radix switches and careful mapping
50
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
51
Hierarchical switch topology scales to large NFA
Figure: the compiler maps the dense regions and sparse connectivity of real-world NFAs onto the G-switches of the hierarchical topology
52
k-way graph partitioning
Compiler flow: connected components CC0 (75), CC1 (87), CC2 (174), CC3 (768), CC4 (4568) -> greedy packing -> k-way graph partitioning -> mapped onto 16 kB subarrays (Array_H / Array_L) across ways 0–3 of an LLC slice with local and global switches; leftover 32 kB data banks remain unmapped
53
Outline Motivation In-Memory Automata Processing Opportunity
Challenges Cache Automaton Compiler Evaluation
54
Experimental Setup
✓ Multi-Threaded Virtual Automata Simulator (VASim)
✓ METIS – k-way graph partitioning
✓ ANMLZoo and Regex benchmark suites, 1/10 MB input data
✓ Memory compiler (28nm) – area, power, delay of 6T/8T SRAM
✓ Wire delay (global wires – 66 ps/mm, H-bus – 300 ps/mm [1])
[1] Das et al., “SLIP: reducing wire energy in the memory hierarchy”, ISCA ’15
55
Performance CA_P and CA_S achieve 9x and 15x speedup over AP respectively
56
Energy
60
Energy 3x lower energy per symbol compared to ideal AP model
61
Design Space: AP – 48k states; CA – 32k states
62
Design Space: CA designs provide higher average reachability and throughput at low area cost
63
Summary
✓ SRAM-based last-level caches can accelerate automata processing by 9-15x over the Micron AP
Cache Automaton:
Accelerating state-match — sense-amplifier cycling
Accelerating state-transition — programmable interconnect built from 8T SRAM switches
Compiler to automate mapping NFAs to cache arrays
64
Backup
65
Power