Programming by Sketching for Bit-Streaming Programs


1 Programming by Sketching for Bit-Streaming Programs
Don’t read the title. PERFORM IMPLEMENT
Sketching gives the programmer the ability to tell the compiler the high-level features of an implementation, with the system filling in the details from a separate high-level specification. This way, you can allow programmers to specify a task at a high level, but at the same time maintain control over how their task is implemented in a way that is independent from the high-level specification, that is safe, and that is economical. Sketching, in other words, is a way to produce clean, correct programs that are implemented the way we want them to be. This very much describes the software quality agenda of the last 50 years, and many other methodologies have been proposed to achieve this.
Armando Solar-Lezama, Ras Bodik (UC Berkeley); Rodric Rabbah (MIT); Kemal Ebcioglu (IBM)

2 Verification, synthesis, sketching
Verification: does your program implement the spec?
  user responsible for low-level implementation details
  redundancy: implementation restates aspects of spec
Synthesis: produce a program that implements the spec
  say what, not how
  say it only once
Sketching: synthesis + partially described implementation
  spec is executable → easy to debug
  programmer sketches the implementation
One of the traditional approaches has been verification. Verification has two drawbacks. First, it’s very hard, so often you settle for only partial correctness. The second drawback is a productivity issue: developers have to write a specification and independently produce a full implementation, and it’s up to the developer to make the implementation match the specification. A second approach is synthesis. You write the spec and leave it up to the system to develop the implementation from it. This gives you clean, correct programs that will often not perform the way you want them to, because you have no way to communicate low-level aspects of the implementation to the system. Sketching can be summarized as synthesis with a partially specified implementation. The user provides a specification, which can be executed, so it’s easy to debug and test. Once the spec is done, you can independently provide a sketch of the implementation that gives the high-level features but omits the hairy details. The compiler can then fill in those details to ensure the completed sketch implements the spec.

3 The sketching experience
program = completed sketch
So what do we mean by a sketch? A sketch is an incomplete description of the final product you want to implement. It contains the high-level features that you want your implementation to have, but it leaves it up to the system to fill in the details.
Transition: SKETCHING, AT LEAST FOR NOW, IS A DOMAIN-SPECIFIC METHODOLOGY

4 Case study domain: bit-stream programs
Manipulate a stream at the bit level
  crypto: DES, Serpent, AES, …, code breaking
  coding: error correction
Implementation gap
  easy to describe, difficult to implement
Operate under strict constraints
  performance is very important: up to 95% of server cycles spent in security-related processing
  correctness is crucial: a subtle bug in a Blowfish implementation allowed cracking over half the keys in less than 10 minutes
NEED PUNCHLINE
Let me show you now, on our running example, how sketching helps in this demanding domain.

5 Running example
DropThird: “Drop every third bit in the bit stream.”
  exhibits many features of complicated permutations
  implementation gap: the number of implementations is exponential in word size
In StreamBit, a fast implementation can be sketched
  SLOW: O(w)
  FAST: O(log w)
Diagram: functionality + sketch, combined to get a synthesized FAST implementation
Throughout the talk I will be using the following running example to illustrate how the system works. It’s called DropThird, and can be concisely stated as “Drop every third bit in a stream”, for example the parity bits. This simple example exhibits many of the problems of the more complicated bit manipulations. In particular, you have this implementation gap. The high-level description is very simple, but the implementation difficulty comes from the fact that the machine can only manipulate bits one whole word at a time. The performance will vary widely depending on how much we take advantage of this, and there are exponentially many choices for how to do this mapping.
3: For example, we can use logical shifts to shift one set of bits at a time. Using this scheme, the number of shifts is proportional to the length of the word.
4: By contrast, we can close half the gaps between bits on each timestep and get a scheme that takes time proportional to the log of the word size. This may not sound like much, but on a 64-bit machine, the scheme on the right can be 3.5 times faster than the scheme on the left. So where does sketching come in?
5-7: What our system allows you to do is to specify this task at a high level as simply “drop every third bit”, and then independently sketch the implementation strategy. The system can then combine the sketch with the high-level specification to give you the implementation you want.
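To make the SLOW scheme concrete, here is a minimal Python sketch of the DropThird specification next to a word-level version that moves one group of bits per shift. It is an illustration only, not StreamBit output: the 16-bit word size, the most-significant-bit-first convention, and the helper names (`to_bits`, `drop_third_spec`, `drop_third_slow`) are all assumptions made for this example.

```python
W = 16  # word size in bits (an assumption for this example)

def to_bits(x, w=W):
    """Bits of x, most significant first, so index 0 is the first stream bit."""
    return [(x >> (w - 1 - i)) & 1 for i in range(w)]

def drop_third_spec(bits):
    """Specification: drop every third bit (stream indices 2, 5, 8, ...)."""
    return [b for i, b in enumerate(bits) if i % 3 != 2]

def drop_third_slow(x, w=W):
    """Word-level version: one mask-and-shift per group, so O(w) shifts."""
    out = 0
    shifts = 0
    for group in range((w + 2) // 3):
        # The kept bits of a group sit at stream indices 3*group and 3*group+1.
        mask = 0
        for i in (3 * group, 3 * group + 1):
            if i < w:
                mask |= 1 << (w - 1 - i)
        # Every bit of this group moves `group` places toward the front.
        out |= (x & mask) << group
        shifts += 1
    return out, shifts
```

The result is left-aligned in the word: for a 16-bit input the first 11 bits are the kept bits and the last 5 are zero. The number of shift steps grows linearly with the word size, which is the cost the log-shifter avoids.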

6 Two sketches needed for DropThird
The log-shifter:
Decompose( [shift(1:16 by 0 || 1)], [shift(1:16 by 0 || 2)], [shift(1:16 by 0 || 4)] )
Smart packing of input stream to machine words:
Decompose( [shift(1:2 by 0), shift(17:18 by 0), shift(33:34 by 0)], [shift(1:16 by ?), shift(17:32 by ?), shift(33:48 by ?)] )
Two sketches synthesize a high-quality implementation:
32 bit on a Pentium IV: fold speedup
64 bit on an Itanium II: fold speedup
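The log-shifter sketch says that each stage moves every bit by either 0 or a power of two (1, then 2, then 4). The Python sketch below shows one way such a three-stage implementation could look for a single 16-bit word. It computes the masks directly from the binary digits of each bit's travel distance, which is a hand-written stand-in and not the constraint solving that StreamBit performs; the function names and the most-significant-bit-first convention are assumptions.

```python
W = 16  # word size in bits, matching shift(1:16 ...) in the sketch

def to_bits(x, w=W):
    """Bits of x, most significant first."""
    return [(x >> (w - 1 - i)) & 1 for i in range(w)]

def log_shifter_stages(w=W):
    """Return [(step, move_mask, stay_mask), ...] for steps 1, 2, 4, ..."""
    pos = [i for i in range(w) if i % 3 != 2]  # current index of each kept bit
    amt = [i // 3 for i in pos]                # total distance each bit travels
    stages = []
    step = 1
    while step <= max(amt):
        move = stay = 0
        new_pos = []
        for p, a in zip(pos, amt):
            bit = 1 << (w - 1 - p)
            if a & step:        # this binary digit of the distance is set
                move |= bit
                new_pos.append(p - step)
            else:
                stay |= bit
                new_pos.append(p)
        stages.append((step, move, stay))
        pos = new_pos
        step *= 2
    return stages

def drop_third_fast(x, stages):
    """Apply the stages: one mask, shift, and OR per stage, so O(log w) shifts."""
    for step, move, stay in stages:
        x = (x & stay) | ((x & move) << step)
    return x
```

For a 16-bit word the largest travel distance is 5 (binary 101), so three stages suffice. The dropped bits disappear in the first stage because neither mask selects them.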

7 … compare with Fortran
Sketch (size: 13 lines):
WSIZE=16;
subsequence = Unroll[WSIZE](subsequence);
subsequence = PermutFactor[
  [shift(1:2 by 0), shift(17:18 by 0), shift(33:34 by 0)],
  [shift(1:16 by ?), shift(17:32 by ?), shift(33:48 by ?)]
]( subsequence );
subsequence.subsequence_1=DiagSplit[WSIZE](subsequence);
for(i=0; i<3; ++i) {
  subsequence.subsequence_1.filter(i) = PermutFactor[
    [shift(1:16 by 0 || 1)],
    [shift(1:16 by 0 || 2)],
    [shift(1:16 by 0 || 4)]
  ]( subsequence.subsequence_1.filter(i) );
}
Fortran (100+ lines):
      DIMENSION MASKB1(INC), MASKB2(INC), MASKB3(INC), MASKB4(INC)
      DATA MASKB1 /Z'F81F03E07C0F81F0', Z'3E07C0F81F03E07C',
     $ Z'0F81F03E07C0F81F', Z'03E07C0F81F03E07', Z'C0F81F03E07C0F81', Z'F03E07C0F81F03E0',
     $ Z'7C0F81F03E07C0F8', Z'1F03E07C0F81F03E', Z'07C0F81F03E07C0F', Z'81F03E07C0F81F03',
     $ Z'E07C0F81F03E07C0'/
      DATA MASKB2 /Z'FFC003FF000FFC00', Z'3FF000FFC003FF00',
     $ Z'0FFC003FF000FFC0', Z'03FF000FFC003FC0', Z'FE001FF8007FE001', Z'FF8007FE001FF800',
     $ Z'7FE001FF8007FE00', Z'1FF8007FE001FF80', Z'07FE001FF8007FC0', Z'FC003FF000FFC003',
     $ Z'FF000FFC003FF000'/
c Move word 1 into position
      TC = ISHFT(CBUF(1,I), SLC(1))
c Move first part of word 2 into position
      TC = TC + ISHFT(CBUF(2,I), SRC(2))
c Move word 3 into position and output 1st word of output
      C(1,I+K) = TC + ISHFT(CBUF(3,I), SRC(3))
c Move last part of word 3 into position
      TC = ISHFT(CBUF(3,I), SLC(3))
c Move word 4 into position
      TC = TC + ISHFT(CBUF(4,I), SRC(4))
Compare this with specifying the implementation in a more traditional programming language. I can tell you the comparison is not pretty. In this case, you would have to compute all the bit masks and bitwise operations by hand, and if you want a different implementation strategy, you have to do it from scratch.

8 Exploring different implementations
A permutation from the DES cipher (64 bits → 64 bits)
32 bits
shift(1:64 by 0 || 33 || -33), shift(1:2:31 by -33), shift(34:2:64 by 33), [] // unspecified, synthesized
Sketch: trial and error with different implementations
What do I want to emphasize here? Trial and error with different implementations:
  the default implementation is in terms of shifts and masks
  can implement with table lookups
  what if I do a simple permutation in the first step?
  two-step filter is more efficient
  each sub-permutation can be implemented with the same table
  ¼ of the space with half the lookups!
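As background for the table-lookup option mentioned on this slide, the following Python sketch shows the general idea of implementing a fixed bit permutation with one lookup table per input byte. It is a generic illustration under assumptions: the permutation used in the usage note is a made-up 16-bit example and not the DES permutation, and the function names are hypothetical.

```python
def permute_bits(x, perm):
    """Direct version: output bit i (MSB first) is input bit perm[i]."""
    w = len(perm)
    out = 0
    for i, src in enumerate(perm):
        bit = (x >> (w - 1 - src)) & 1
        out |= bit << (w - 1 - i)
    return out

def build_tables(perm, chunk=8):
    """One table per input chunk; each entry is that chunk's contribution."""
    w = len(perm)
    tables = []
    for c in range(w // chunk):
        shift = w - chunk * (c + 1)  # where chunk c sits inside the word
        tables.append([permute_bits(v << shift, perm)
                       for v in range(1 << chunk)])
    return tables

def permute_with_tables(x, tables, chunk=8):
    """Look up each chunk and OR the contributions together."""
    w = chunk * len(tables)
    out = 0
    for c, table in enumerate(tables):
        shift = w - chunk * (c + 1)
        out |= table[(x >> shift) & ((1 << chunk) - 1)]
    return out
```

With a 16-bit permutation such as `perm = [(5 * i + 3) % 16 for i in range(16)]`, `build_tables(perm)` yields two 256-entry tables, and each permuted word then costs two lookups and one OR. Table size and lookup count trade off against each other through `chunk`, which is the kind of choice a sketch lets the programmer explore.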

9 How sketching works Now, a little bit about how sketching works.

10 The specification
An executable specification: a StreamIt program (synchronous dataflow)
To learn more about StreamIt, see their LCTES and PPOPP talks
Filters are represented internally as matrices
bit->bit filter DropThird { work push 2 pop 3 { push(pop()); push(pop()); pop(); } }
Each firing consumes a 3-bit chunk of input and produces 2 bits of output:
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix}$$
As I mentioned before, the user will start by providing a specification in the form of a high-level program written in StreamIt. Internally, the compiler will represent each of these filters as a matrix or a sequence of matrices.
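The matrix view of a filter can be tried out directly. The Python sketch below applies the 2×3 DropThird matrix to successive 3-bit chunks of a stream, mirroring the pop 3 / push 2 rates of the filter. The function names are assumptions for this example, and this is not how the StreamIt compiler represents filters internally beyond the matrix idea stated on the slide.

```python
# Rows are output bits, columns are input bits: out = M * in over Booleans.
DROP_THIRD = [
    [1, 0, 0],
    [0, 1, 0],
]

def apply_filter(matrix, bits):
    """Boolean matrix-vector product (AND for multiply, XOR for add)."""
    return [sum(m & b for m, b in zip(row, bits)) % 2 for row in matrix]

def run_stream(matrix, stream):
    """Fire the filter on each complete chunk of the input stream."""
    pop = len(matrix[0])  # bits consumed per firing
    out = []
    for k in range(0, len(stream) - pop + 1, pop):
        out.extend(apply_filter(matrix, stream[k:k + pop]))
    return out
```

For example, `run_stream(DROP_THIRD, [1, 1, 0, 0, 1, 1, 1, 0, 1])` fires three times and returns `[1, 1, 0, 1, 1, 0]`.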

11 The sketched synthesis problem
spec: sketch of implementation: ?
problem: synthesize target-machine code such that it follows the sketched implementation steps
Now we can frame the sketched synthesis problem as follows: given a specification written as a high-level program in StreamIt, and a sketch, can we synthesize target-machine code that follows the sketched implementation steps? We will address the problem one step at a time.

12 Easier problem 1: Base Compilation
spec: sketch of implementation: none problem: compile the spec for the target machine First, we start with an easier problem. Given a spec, and no sketch, how do we compile the spec into target-machine code?

13 Implementation expressible in StreamIt
relevant hardware instructions are dataflow filters: duplicate, or
filter(in) {          // in  = w x y z
  t1 = in AND 1100    // t1  = w x 0 0
  t2 = in SHIFTL 1    // t2  = x y z 0
  t3 = t2 AND 0010    // t3  = 0 0 z 0
  out = t1 OR t3      // out = w x z 0
  return out
}
The first key insight into this problem is the fact that the implementation itself can also be expressed in StreamIt, the same language we are using to express the spec. Here is an example.
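The four-instruction filter above can be checked exhaustively. This Python sketch transcribes it for a 4-bit word, with the one assumption that SHIFTL discards the bit shifted out of the 4-bit word; the function name is hypothetical.

```python
def drop_third_word4(word):
    """word is a 4-bit value w x y z, with w the most significant bit."""
    t1 = word & 0b1100          # w x 0 0
    t2 = (word << 1) & 0b1111   # x y z 0 (shift left inside a 4-bit word)
    t3 = t2 & 0b0010            # 0 0 z 0
    return t1 | t3              # w x z 0
```

For instance, `drop_third_word4(0b1011)` gives `0b1010`: the third bit is dropped, the fourth moves up one place, and the last position is padded with zero.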

14 Base compiler space of all programs implementations
Diagram: the space of all programs, ordered from the spec toward more decomposed programs (into low-level steps), with the region of implementations marked
This has some interesting implications. In particular, if we think of the space of all possible StreamIt programs, we can order them by how low-level they are, and we can separate those that are so low-level that each filter in them corresponds to one basic operation in our model machine. Now, we can focus on those that implement the task we are interested in, in this case DropThird. For each point in the space, there is a path consisting entirely of local substitutions on the AST of the program that will take you to a point in the implementation region of the space.

15 How It Works
Example: Drop Third Bit (word size W = 4 bits)
  unroll filter
  decompose into filters operating on W=4 bits of input
  decompose into filters producing W=4 bits of output
t1 = in AND 1100 t2 = in SHIFTL 1 t3 = t2 AND 0010 out = t1 OR t3
Diagram: the input o p q r s t u v w x y z is duplicated to per-word filters producing opr00000, 000suv00 and 000000xy, which are combined with or into o p r s u v x y
We can see this more concretely in the case of DropThird.
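The decomposition in the diagram can be written out for the 12-bit input o p q r s t u v w x y z. In the Python sketch below, each 4-bit input word is masked and shifted into its place in an 8-bit partial result, and the three partial results are joined with OR. The masks and shift amounts were worked out by hand for this example, and the function name is an assumption.

```python
def drop_third_12_to_8(w1, w2, w3):
    """w1 = o p q r, w2 = s t u v, w3 = w x y z (each a 4-bit value)."""
    # Stream indices 2, 5, 8, 11 (q, t, w, z) are dropped.
    part1 = ((w1 & 0b1100) << 4) | ((w1 & 0b0001) << 5)  # o p r 0 0 0 0 0
    part2 = ((w2 & 0b1000) << 1) | ((w2 & 0b0011) << 2)  # 0 0 0 s u v 0 0
    part3 = (w3 & 0b0110) >> 1                           # 0 0 0 0 0 0 x y
    return part1 | part2 | part3                         # o p r s u v x y
```

Each partial result depends on only one input word, which is what lets the three filters sit side by side behind a duplicate splitter and in front of an or joiner.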

16 Easier problem 2: adopting an implementation
spec: fragment of implementation (not a sketch): problem: synthesize target-machine code such that it uses the provided implementation steps

17 Adopting provided implementation
spec implementations

18 Adopting provided implementation (DropThird)
This slide also takes about two minutes to go through.

19 Finally, the sketched synthesis problem
spec: sketch of implementation: ?

20 Adopting provided implementation
spec implementations

21 Nudging the base compiler
? + This slide also takes about two minutes to go through.

22 Sketch resolved using Boolean matrix algebra
function to implement: M
sketch: [shift(1:16 by 0 || 1)] x [shift(1:16 by 0 || 2)] x [shift(1:16 by 0 || 4)]
completed sketch: M3 x M2 x M1
filter-to-pipeline decomposition = matrix factorization: M = M3 x M2 x M1 (guarantees correctness)
sketch gives constraints on factors
factorization approach: constraint solving followed by search
use constraints to narrow search space
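The identity M = M3 x M2 x M1 can be checked on the 16-bit DropThird example. The Python sketch below builds the target matrix M and three stage matrices in which every bit moves by 0 or by the stage's step (1, 2, 4), then multiplies them. The factors here are constructed directly from the binary digits of each bit's travel distance; this is an assumption made for illustration and does not reproduce StreamBit's constraint solving and search.

```python
W = 16  # word size in bits

def matmul(a, b):
    """Boolean matrix product (OR of ANDs)."""
    inner, cols = len(b), len(b[0])
    return [[int(any(row[k] and b[k][j] for k in range(inner)))
             for j in range(cols)] for row in a]

def target_matrix(w=W):
    """M[dst][src] = 1 when output bit dst is input bit src under DropThird."""
    m = [[0] * w for _ in range(w)]
    kept = [i for i in range(w) if i % 3 != 2]
    for dst, src in enumerate(kept):
        m[dst][src] = 1
    return m

def stage_matrices(w=W):
    """Factors M1, M2, M3: in stage k each bit moves by 0 or 2**k places."""
    pos = [i for i in range(w) if i % 3 != 2]  # current index of each kept bit
    amt = [i // 3 for i in pos]                # total distance each bit travels
    factors = []
    step = 1
    while step <= max(amt):
        f = [[0] * w for _ in range(w)]
        new_pos = []
        for p, a in zip(pos, amt):
            q = p - step if a & step else p
            f[q][p] = 1
            new_pos.append(q)
        factors.append(f)
        pos = new_pos
        step *= 2
    return factors
```

With `m1, m2, m3 = stage_matrices()`, the product `matmul(m3, matmul(m2, m1))` equals `target_matrix()`. In the sketch's terms, the "by 0 || 1" style constraints restrict which entries of each factor may be nonzero, and the product equation is what ties the completed sketch to the spec.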

23 Evaluation of Approach

24 Evaluation Goals
Time to First Solution: how quickly can we develop a first reference solution?
Performance of Base Compilation: can base compiled code compete with handwritten C?
Benefits of Sketching: how good is the quality of code produced through sketching?
Sketching vs. Expert Tuning: can sketching compete with professionally tuned code?
In this slide I describe how the user study was conducted, and in the next slide I describe the results. This takes about 3 minutes.

25 1) Time to develop the spec
How quickly can the specification (first solution) be developed?
The first question we want to answer is how quickly the original program can be developed using our system. This is very important because one of the main objections to any solution that involves learning a new language is that the learning curve for the language may trump the productivity gains the system delivers. In order to evaluate this, we organized a small user study where people were asked to implement a simple cipher based on 3 Feistel rounds. Some people were assigned to write in C and some were assigned to write in StreamIt. The graph shows the time to develop the first working version of the program, both for people using our compiler and for people using C.

26 2) Performance of Base Compilation
Can base compiled code compete with handwritten C?
Graph: StreamBit submissions and C submissions (each line = one programmer); time in hours
The second goal of our user study was to evaluate the performance of our base compilation algorithm. In other words, if I use StreamBit to implement a cipher and give it a description of the cipher, but I don’t bother to provide a sketch, how will the implementation fare compared with optimized C? The result can be seen in the graph. Note that the C participants were encouraged not just to produce a solution, but to try to optimize it; even then, they only barely reached half the throughput of the original StreamBit submissions.

27 3) Benefits of Sketching
How much performance can sketching get beyond baseline compilation?
Graph: C programmers, sketch-based implementation, expert-tuned C implementation; time (hours)
The performance of the base compilation without sketching is already pretty good, so how much more can it be improved through sketching? Quite a bit, it turns out. After the user study was done, I took one of the StreamBit submissions and used sketching to implement a set of pre-specified optimizations, and managed to almost triple the performance. As a point of comparison, Dave developed for us an optimized version in C, also starting from a working reference implementation.

28 4) Sketching vs. expert tuning
Can sketching compete with professionally tuned code?
Can we match the best DES implementation (libDES)?
  only 17% slower on 32-bit machines (we can fix this)
  faster on 64-bit machines
We’ve seen that the baseline compilation itself is very good, and that sketching can produce great performance improvements even beyond that, but how well does sketching do when compared with the “professionals”?
PUNCHLINE
processor Pentium 4 Pentium III Sparc IA64 IBM SP sketched vs. libDES 0.91 0.83 1.06 1.08

29 Related Approaches and Conclusion

30 Can we do without the sketch?
Diagram: functionality + FAST implementation
Search-based optimization (à la ATLAS)
  synthesize and evaluate all implementations
  unconstrained by the sketch, search space becomes huge
Classical optimization
  hard-code log-shifter as a typical optimization phase
  more work than writing the sketch, needs compiler expertise
CALCULATE SPACE or TIME

31 Sketches allow global control over compilation
Log-shifting Sketch Pack within words Sketch spec

32 Conclusion
Spec + sketch
  separation of aspects: correctness vs. performance
  separation of roles: domain experts (crypto) vs. system experts (perf. programmer)
Sketching worked in this domain because:
  large gap between specification and implementation
  algebra of program transformations
Other domains? Graphics, scientific kernels, media codecs
HIGH NOTE!!!!

