CS240B, Fall 2018

Task 4.1. Express the Flajolet-Martin distinct-count sketch as a user-defined aggregate named dcount_sketch, to be called in the same way as d_count. You can assume that a function LmostbitH(X) is available: for the value returned by a randomizing hash function H(X), it returns K, the position that contains a 1 while all the positions to its right are zeros. We will design a window aggregate that could, for example, be called as follows:

SELECT col_name1, dcount_sketch(col_name2) OVER (ROWS 99999 PRECEDING) FROM my_stream;
FM dcount_sketch

WINDOW AGGREGATE dcount_sketch(next Real) : Real
{ TABLE bitarray(bitpos int, bitvalue int);
  TABLE inwindow(wnext Real);
  INITIALIZE : {
    INSERT INTO bitarray VALUES (1,0), …, (64,0);
    UPDATE bitarray SET bitvalue=1 WHERE bitpos = LmostbitH(next);
  }
  ITERATE : { /* the system inserts the new tuple in inwindow at the end of iterate */
    UPDATE bitarray SET bitvalue=1 WHERE bitpos = LmostbitH(next);
    /* older tuples with the same bit position are superseded by the new one */
    DELETE FROM inwindow WHERE LmostbitH(wnext) = LmostbitH(next);
    INSERT INTO RETURN
      SELECT 2**MAX(bitpos) /* the estimated count */
      FROM bitarray WHERE bitvalue=1;
    % we could also delete weak bits, e.g., those that are less than max-8:
    % DELETE FROM inwindow WHERE LmostbitH(wnext) <
    %   (SELECT MAX(bitpos) FROM bitarray WHERE bitvalue=1) - 8
  }
  EXPIRE : { /* Expire is processed before iterate */
    UPDATE bitarray SET bitvalue=0
      WHERE bitpos = (SELECT LmostbitH(wnext) FROM inwindow WHERE oldest(inwindow));
  }
}
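
The interplay of ITERATE and EXPIRE may be easier to follow in a general-purpose language. The following is a minimal Python sketch of the same idea, not the UDA itself. Several names are assumptions that do not appear in the solution: lowest_set_bit and lmostbit_h are hypothetical stand-ins for LmostbitH, blake2b is an arbitrary choice for the hash H, and the inwindow table is replaced by a dictionary that records, for each bit position, the index of the most recent tuple that set it.

```python
import hashlib


def lowest_set_bit(h, nbits=64):
    """1-based position, counted from the right, of the lowest 1 bit of h."""
    if h == 0:  # no 1 bit at all: report the highest position (an assumption)
        return nbits
    return (h & -h).bit_length()


def lmostbit_h(x):
    """Hypothetical stand-in for LmostbitH: hash x to 64 bits, then locate the bit."""
    digest = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
    return lowest_set_bit(int.from_bytes(digest, "big"))


class WindowedDCountSketch:
    """Sliding-window bit-array sketch over the last `window` tuples."""

    def __init__(self, window):
        self.window = window   # number of most recent tuples covered
        self.seen = 0          # tuples processed so far
        self.last_seen = {}    # bit position -> index of the latest tuple that set it

    def add(self, x):
        self.seen += 1
        # A newer tuple supersedes older ones with the same bit position,
        # mirroring the DELETE FROM inwindow step in ITERATE.
        self.last_seen[lmostbit_h(x)] = self.seen
        # Clear positions whose latest tuple has left the window (EXPIRE).
        cutoff = self.seen - self.window
        self.last_seen = {p: i for p, i in self.last_seen.items() if i > cutoff}

    def estimate(self):
        # 2 ** (highest set position), as in the RETURN statement.
        return 2 ** max(self.last_seen) if self.last_seen else 0
```

Under these assumptions, WindowedDCountSketch(window=100000) would correspond to the ROWS 99999 PRECEDING window in the example query: call add(value) for each arriving tuple and read estimate() afterwards. The estimate follows the solution's 2**MAX(bitpos) rule and is therefore always a power of two.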
Task 4.2: Assume that you have a stream of temperature readings, temperature(Celsius Integer), that starts every day at 00:01 and ends at 23:59. At the end of each day, we want to have 10,000 temperature samples stored in a table tenKsamples(Rowno Integer, Celsius Integer). We do not know how many temperature readings will arrive each day, except that their number is significantly larger than 10,000. Please write a UDA that uses the reservoir algorithm to populate tenKsamples(Rowno, Celsius) with 10,000 random samples taken from temperature(Celsius Integer); the table is then processed and reset to empty at midnight. You can assume that the system supports a function random(K) which, given a positive integer K, returns a random integer between 1 and K.
AGGREGATE reservoir(next Integer) : Integer
{ TABLE tenKsamples(Rowno Integer, Celsius Integer) external;
  TABLE cntuples(cnt Integer);
  INITIALIZE : {
    INSERT INTO cntuples VALUES (1);
    INSERT INTO tenKsamples VALUES (1, next);
  }
  ITERATE : {
    UPDATE cntuples SET cnt = cnt+1;
    /* fill phase: the first 10,000 readings are stored directly */
    INSERT INTO tenKsamples
      SELECT cnt, next FROM cntuples WHERE cnt <= 10000;
    /* afterwards, draw r = random(cnt); the new reading replaces row r
       only when r <= 10000, i.e., with probability 10000/cnt */
    UPDATE tenKsamples SET Celsius = next
      WHERE Rowno = (SELECT random(cnt) FROM cntuples WHERE cnt > 10000);
  }
  TERMINATE : { % we might want to return the count
  }
}
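
To make the replacement rule concrete, here is a minimal Python sketch of reservoir sampling as used above. It is an illustration under assumptions, not part of the UDA: the function name reservoir_sample and the rng parameter are hypothetical, a Python list stands in for the tenKsamples table, and rng.randint(1, cnt) plays the role of random(cnt).

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    sample = []
    for cnt, item in enumerate(stream, start=1):
        if cnt <= k:
            sample.append(item)        # fill phase: keep the first k readings
        else:
            j = rng.randint(1, cnt)    # plays the role of random(cnt)
            if j <= k:                 # true with probability k / cnt
                sample[j - 1] = item   # overwrite row j
    return sample
```

For the task above, k would be 10000 and stream the day's temperature readings. Passing rng=random.Random(seed) makes a run repeatable. If the stream holds fewer than k items, all of them are returned in arrival order.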