1
StreamBit: Sketching high-performance implementations of bitstream programs
Armando Solar-Lezama, Rastislav Bodik (UC Berkeley)
In collaboration with Kemal Ebcioglu and Vivek Sarkar (IBM T.J. Watson) and Rodric Rabbah (MIT)
2
Bitstream Programs
- Bitstream programs: a growing domain
  - Crypto: DES, Serpent, Rijndael, …
  - Coding in general
  - NSA/BitTwiddle
- Bitstream programs operate under strict constraints
  - Performance is very important
    - Resource-constrained environments like smartcards
    - Important component of server loads
  - Correctness is crucial
    - A subtle bug in a Blowfish implementation allowed over half the keys to be cracked in less than 10 minutes
- Efficient bitstream programs are hard to write
3
Current State of the Art
- Example from DES
  - Description from NIST document: "The 64 bits of the input block to be enciphered are first subjected to the following permutation, called the initial permutation IP: [IP table]. That is the permuted input has bit 58 of the input as its first bit, bit 50 as its second bit, and so on with bit 7 as its last bit."
  - [Figure: code from LibDES]
- Lots of hand compilation
  - Tedious
  - Error prone
  - Not portable: optimizing for a different word size is non-trivial
- Difficult to automate
  - Exponentially many choices
  - Greedy is bad
4
We can do better
- Separate specification from implementation
  - Specify the task without regard for performance
  - Specify the implementation without introducing bugs
- How?
  - A high-level program encodes the task description
  - A sequence of transformations leads from the task description to an equivalent low-level program that encodes an implementation
  - Transformations can be sketched!
[Diagram: Task description, transformed step by step into Implementation]
5
Running Example
"Drop every third bit in the bit stream."
- Exhibits many features of complicated permutations
  - Exponentially many choices
  - Greedy choice is suboptimal
  - Fast implementation can be sketched
[Diagram: functionality + sketch = FAST implementation. The SLOW version takes O(w) operations, the FAST version O(log w).]
6
What you gain
- Drop Third benchmark: speedups over naïve code with a 14-line sketch
  - 32 bit on a Pentium IV: 83.8%
  - 64 bit on an Itanium II: 233%
- DES benchmark
  - 32 bit on a Pentium IV with a 30-line sketch: 634% speedup over naïve
  - Over ¾ the performance of hand-optimized libDES
7
The DSL: StreamIt
- Task description is written in StreamIt
- StreamIt is a dataflow language
  - Filters are the main unit of abstraction
    - Statically defined I/O rate for each filter
    - Filters can "look ahead" in the input stream
  - Filters can be composed
    - Pipelines
    - Splitjoins (parallel computation between a split and a join)
- See [link not preserved] for more info
8
Task Description in StreamIt
- Example: "drop every third bit".

    filter dropThird {
      work push 2 pop 3 {
        for (int i=0; i<3; ++i) {
          x = pop();
          if (i<2) push(x);
        }
      }
    }

- The filter consumes a 3-bit chunk of input and produces a 2-bit chunk of output.
- The filter can be represented as a matrix (3 inputs, 2 outputs):

    [ 1 0 0 ]   [ x ]   [ x ]
    [ 0 1 0 ] * [ y ] = [ y ]
                [ z ]

The key message of this slide is that bitstream manipulations (filters) can be represented as linear matrix-based transformations.
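The matrix view above can be tried out directly. The following is a minimal sketch in Python, not StreamBit code: the names `DROP_THIRD`, `apply_filter`, and `run_stream` are hypothetical, and bits are modeled as a list of 0/1 values in stream order.

```python
# Hypothetical illustration: the 2x3 matrix from the slide, one row per output bit.
DROP_THIRD = [
    [1, 0, 0],
    [0, 1, 0],
]

def apply_filter(matrix, chunk):
    """Multiply a 0/1 matrix by a bit vector over GF(2).

    For a permutation-like matrix each row has a single 1, so each
    output bit is simply a copy of one input bit.
    """
    return [sum(m & b for m, b in zip(row, chunk)) % 2 for row in matrix]

def run_stream(matrix, bits):
    """Apply the filter to consecutive chunks, mimicking 'pop 3, push 2'."""
    n_in = len(matrix[0])
    out = []
    for start in range(0, len(bits) - n_in + 1, n_in):
        out.extend(apply_filter(matrix, bits[start:start + n_in]))
    return out
```

Each call to `apply_filter` corresponds to one firing of the `work` function: three bits are consumed and two are produced.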
9
From Text to Matrices
1. Start with a StreamIt program:

    t = peek(2);
    for (i=0; i<3; ++i) {
      x = peek(i);
      if (peek((i+1) % 3)) x = 1;
      if (i==1) push(x OR t);
      if (x) push(x AND t);
      else push(x);
    }

2. Unroll loops, propagate constants, and eliminate dead code:

    t = peek(2);
    x1 = peek(0);
    x2 = (peek(1)) ? 1 : x1;
    push((x2) ? (x2 AND t) : (x2));
    x3 = peek(1);
    x4 = (peek(2)) ? 1 : x3;
    push(x4 OR t);
    push((x4) ? (x4 AND t) : (x4));
    x5 = peek(2);
    x6 = (peek(0)) ? 1 : x5;
    push((x6) ? (x6 AND t) : (x6));

3. Simplify:

    tmp1 = !(in[1]) AND in[0];
    x2 = (in[1]) OR tmp1;
    tmp2 = (x2 AND in[2]);
    out[0] = (tmp2);
    tmp3 = !(in[2]) AND in[1];
    x4 = (in[2]) OR tmp3;
    tmp7 = (x4 OR in[2]);
    out[1] = tmp7;
    tmp4 = (x4 AND in[2]);
    out[2] = (tmp4);
    tmp5 = !(in[0]) AND in[2];
    x6 = (in[0]) OR tmp5;
    tmp6 = (x6 AND in[2]);
    out[3] = (tmp6);

4. Get a dataflow graph
10
From Text to Matrices
- Create the dataflow graph
- Label each node with its operation
- Separate into layers by node type
- Each layer corresponds to a matrix
  - Green layers are permutations
  - Pink layers are and/or/xor matrices
[Figure: dataflow graph from inputs In0, In1, In2 through nodes tmp1 to tmp7, x2, x4, x6 to outputs Out0 to Out3, regrouped into AND and OR layers]
11
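From Text to Matrices
[Figure: the layered dataflow graph (In0, In1, In2, AND layer, OR layer, Out0 to Out3) with one matrix per layer; matrix cells are drawn as 1 or 0]
- Legend for one matrix: + is and, * is or
- Legend for the other matrix: + is or, * is and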
12
Implementation in StreamIt
- Make each filter correspond to one basic operation available in the hardware
- Example:

    t1 = in AND 1100
    t2 = in SHIFTL 1
    t3 = t2 AND 0010
    out = t1 OR t3

[Diagram: in, duplicate, the mask and shift filters, or]
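The four-operation example above can be checked on a 4-bit word. This is a small Python sketch under the assumption that the leftmost binary digit is the first bit of the stream and that SHIFTL discards bits shifted out of the 4-bit word; the function name `drop_third_4bit` is made up for illustration.

```python
def drop_third_4bit(word):
    """Drop the third of four bits (x y z w becomes x y w 0).

    Assumes a 4-bit word whose most significant bit is the first stream bit.
    """
    t1 = word & 0b1100           # keep x and y where they are
    t2 = (word << 1) & 0b1111    # SHIFTL 1, truncated to the 4-bit word
    t3 = t2 & 0b0010             # w has moved into the third position
    return t1 | t3               # combine both halves
```

For the input `1011` the third bit is removed and the last bit moves up one position, leaving `1010`.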
13
Greedy Transformation
- Defines a default implementation for any program
- Leveraged by the user for custom implementations
- Example: Drop Third Bit
  - Unroll the filter
  - Decompose into filters operating on W=4 bits of input
  - Decompose into filters producing W=4 bits of output
14
Greedy Transformation (cont)
Why is it greedy?
- It tries to pack as many iterations of a filter into a word as it can
- It shifts each bit to its final position with a single shift
- It shifts together all the bits that have to be shifted by the same amount
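To see why this strategy costs O(w) operations for drop-third, the grouping rule can be modeled in a few lines of Python. This is an illustrative model only, not the StreamBit algorithm itself: it assumes a single w-bit word, with bit 0 (least significant) as the first stream bit, and the names `greedy_plan`, `apply_plan`, and `drop_third_reference` are hypothetical.

```python
def greedy_plan(w):
    """Group surviving bits by total shift distance: one (shift, mask) per group."""
    masks = {}
    for i in range(w):
        if i % 3 == 2:          # every third bit is dropped
            continue
        distance = i // 3       # number of dropped bits below position i
        masks[distance] = masks.get(distance, 0) | (1 << i)
    return sorted(masks.items())

def apply_plan(plan, word):
    """Move every group to its final position with a single shift."""
    out = 0
    for shift, mask in plan:
        out |= (word & mask) >> shift
    return out

def drop_third_reference(word, w):
    """Bit-by-bit reference implementation used for comparison."""
    out, pos = 0, 0
    for i in range(w):
        if i % 3 != 2:
            out |= ((word >> i) & 1) << pos
            pos += 1
    return out
```

Under these assumptions a 32-bit word needs 11 mask-and-shift groups, one per distinct distance, so the amount of work grows linearly with the word size.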
15
The Full Picture
- For every task there is a space of programs that describe it
- The Greedy Transformation algorithm defines a path from any program to an implementation
- A Halfway Transformation can be used to get a different implementation
- A Sketch can simplify the process of writing a transformation
[Diagram: the space of programs ordered by level of abstraction (high to low), from Task Description down to Implementations]
16
Halfway transformations
- Allow the user to guide lowering
  - Impart structure to the filter
  - Bring it closer to the desired implementation
- Add structure by decomposing into a pipeline of steps
  - Equivalent to specifying a matrix decomposition
  - filter = filter1 filter2 guarantees correctness
- This can be done hierarchically; detail is only added where necessary
The TSL operator PermutFactor instructs the system to decompose the filter named 'filter' into two filters. The first filter can move bits any way it desires, but it is prohibited from moving bits 1, 2, 16, 17, 33, or 34. The second filter must shift all the bits within a word by the same amount. To satisfy both constraints, the system determines that the first step must pack all the bits within each word, and the second step can then pack bits across words.
17
Halfway transformations: example
- Halfway transformation to specify the FAST bit-shifting algorithm
[Diagram: F decomposed into the pipeline F.F_1, F.F_2, F.F_3]
- User provides the high-level decomposition
- System takes care of lowering each F.F_i
- Correctness is guaranteed as long as [F.F_3] [F.F_2] [F.F_1] = F
- Avoid spelling out the decomposition: Sketch it!
18
Sketching: How it works
- Start with a sketch:

    SketchDecomp[
      [shift(0:31 by 0 || 1)],
      [shift(0:31 by 0 || 2)],
      [shift(0:31 by 0 || 4)],
      [shift(0:31 by 0 || 8)]
    ]( Filter );

- Define xi,j as the amount bit i will move on step j
- Semantic equivalence imposes linear constraints on the xi,j
- Many of the constraints in the sketch also impose linear constraints on the xi,j
- Solving the linear constraints produces a space of possible solutions
- Map the nonlinear constraints to this solution space
- Search
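One assignment of the xi,j that is consistent with this sketch for drop-third moves bit i by 2^j on step j exactly when bit j of its total distance is set. The Python sketch below illustrates that particular assignment; it is a closed-form shortcut chosen for illustration, not the constraint solver described on the slide. It assumes bit 0 (least significant) is the first stream bit, and the names `log_shift_plan` and `apply_log_plan` are hypothetical.

```python
def log_shift_plan(w):
    """Build one (amount, move_mask, stay_mask) triple per power-of-two step."""
    pos = {i: i for i in range(w) if i % 3 != 2}   # current position of each kept bit
    dist = {i: i // 3 for i in pos}                # total distance each bit must travel
    keep_mask = sum(1 << i for i in pos)           # clears the dropped bits up front
    steps = []
    amount = 1
    while amount <= max(dist.values(), default=0):
        movers = [i for i in pos if dist[i] & amount]
        move_mask = sum(1 << pos[i] for i in movers)
        stay_mask = sum(1 << pos[i] for i in pos if not dist[i] & amount)
        for i in movers:
            pos[i] -= amount
        # Two bits must never land on the same position.
        assert len(set(pos.values())) == len(pos)
        steps.append((amount, move_mask, stay_mask))
        amount <<= 1
    return keep_mask, steps

def apply_log_plan(plan, word):
    """Run the steps: each one shifts its movers by a single fixed amount."""
    keep_mask, steps = plan
    word &= keep_mask
    for amount, move_mask, stay_mask in steps:
        word = ((word & move_mask) >> amount) | (word & stay_mask)
    return word
```

For a 32-bit word this yields four steps with shift amounts 1, 2, 4, and 8, matching the four lines of the sketch, compared with one shift per distinct distance in the greedy approach.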
19
Solving sketches for permutations (1)
- Start with a sketch:

    SketchDecomp[
      [shift(1:8 by 0 || 1), shift(1 by 0)],
      [shift(1:8 by 0 || 2), shift(6,7,8 by ?)],
    ]( Filter );

- We solve M v = T
  - M and T define linear constraints
  - v specifies how much a bit moves at every step
- v = [S][a] + [P] will satisfy the linear constraints for any [a]
- We must find v satisfying the nonlinear constraints
[Figure: bits b1 b2 b4 b5 b7 b8 moved in Step 1 and Step 2, with the matrices M and T and the vectors P^T and S1^T to S4^T; matrix entries omitted]
20
Solving sketches for permutations (2)
- Sketch:

    SketchDecomp[
      [shift(1:8 by 0 || 1), shift(1 by 0)],
      [shift(1:8 by 0 || 2), shift(6,7,8 by ?)],
    ]( Filter );

- Let [v] = [S][a] + [P]; [v] must satisfy the nonlinear constraints
- Then [S][a] = [v-P], where
  [v-P] = (v1,2 - 0, v1,4 - 1, v1,5 - 1, v1,7 - 2, v1,8 - 2, v2,2 - 0, v2,4 - 0, v2,5 - 0, v2,7 - 0, v2,8 - 0, v2,1 - 0, v1,1 - 0)
  [Figure: matrix [S], entries omitted]
- Gaussian elimination leads to constraints on [v]:

    v1,1 = 0
    v2,1 = 0
    v1,2 + v2,2 = 0
    v1,4 + v2,4 - 1 = 0
    v1,5 + v2,5 - 1 = 0
    v2,7 - v2,8 = 0
    v2,8 + v1,8 - 2 = 0
    v2,8 + v1,7 - 2 = 0

- Nonlinear constraints from the sketch:

    v1,1 … v1,8 = (1 OR 0)
    v2,1 … v2,8 = (2 OR 0)

- A [v] satisfying all constraints is our solution
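Because this example is tiny, the constraints listed above can be checked by plain enumeration. The following Python sketch does that instead of the Gaussian elimination and search described on the slide; the names `BITS`, `TOTAL`, and `solve` are made up for illustration, and the total distances are read off the constraints above (bits 4 and 5 move by 1, bits 7 and 8 by 2).

```python
from itertools import product

BITS = [1, 2, 4, 5, 7, 8]                       # surviving bits (3 and 6 are dropped)
TOTAL = {1: 0, 2: 0, 4: 1, 5: 1, 7: 2, 8: 2}    # total distance each bit must travel

def solve():
    """Enumerate v1 in {0,1} and v2 in {0,2} per bit and keep consistent choices."""
    per_bit = list(product((0, 1), (0, 2)))     # (step-1 shift, step-2 shift)
    solutions = []
    for choice in product(per_bit, repeat=len(BITS)):
        v1 = {b: c[0] for b, c in zip(BITS, choice)}
        v2 = {b: c[1] for b, c in zip(BITS, choice)}
        if v1[1] != 0:                          # shift(1 by 0) in step 1
            continue
        if v2[7] != v2[8]:                      # shift(6,7,8 by ?): one shared amount
            continue
        if any(v1[b] + v2[b] != TOTAL[b] for b in BITS):
            continue                            # semantic equivalence with the filter
        solutions.append((v1, v2))
    return solutions
```

Under these assumptions exactly one assignment survives: bits 4 and 5 move by 1 in the first step, and bits 7 and 8 move by 2 in the second step.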
21
More on SketchDecomp
- The SketchDecomp function specifies each step as a list of constraints:

    SketchDecomp[ [constrList], [constrList], …]

- Constraints are of 4 types:
  - Type 1: specific shift amount
      shift( bitList by x)
  - Type 2: undetermined shift amount
      shift( bitList by ?)
  - Type 3: limited choice of shift amount
      shift( bitList by a || b || …)
  - Type 4: specify the position of the bits
      pos( bitList to posList)
- System ignores constraints on discarded bits
22
Programming with StreamBit
- StreamBit program is close to the original task description
- Clever algebraic transformation is expressed as a sketch
[Figure panels: Description from NIST document; StreamIt Code; Code from LibDES; Sketch]
- Sketch:

    SketchDecomp[
      [ shift(0:2:30 by -33), shift(33:2:63 by 33), shift(0:63 by 0 || 33 || -33) ],
      [ ]
    ] (DES Encoding.IP);
23
Productivity: The user study
- Participants were asked to program a small cipher with a 24-bit key
- Participants were divided at random into two groups
  - One group programmed in C
  - One group programmed in StreamIt
- Participants could keep submitting optimized versions of their code for as long as they wanted
24
Programs in C
- Wide variation in performance
- Speedups on the order of 1.5x through performance tuning
- Some people experienced performance degradation
25
Programs in StreamIt
- Performance was higher than the highest performance in C
- Very small variation in performance
- Development time comparable to C
  - Surprising given the infrastructure problems
  - Surprising given that people were not experienced in the language
26
What about Sketching?
- After the competition was over, SBit 2 was tuned using sketching
- The sketching process took about an hour
- Some of the optimizations attempted degraded performance
- Result: 3.5x the speed of the fastest C implementation
27
Productivity
- User still has to come up with the clever idea
  - People are good at clever ideas
  - Machines are good at tedious details
- Sketching allows the user to experiment with different implementations
  - Don't have to worry about bugs
28
Concluding Remarks
- StreamBit allows for
  - Task specification oblivious to performance
  - Implementation specification without bugs
- The same idea may apply in other domains
  - If people currently resort to very low-level coding
  - If some algebraic structure can be imposed on the task
  - Then the domain may be amenable to sketching