


1 Flexible Approximate Counting
Scott A. Mitchell and David M. Day, Sandia National Laboratories (Scott presenting)
IDEAS’11, 15th International Database Engineering & Applications Symposium
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

2 Outline
What is approximate counting?
What’s new?
  – Functional form
  – Increment decision strategies
      Speed it up! Random number and bit generators
  – Inverse problem
      Find function given how high you want to count
(Focus on red since that’s what’s significant)

3 What is approximate counting?
Approximate counter C
  – Trade decreased memory for decreased accuracy
  – Standard (unsigned) integer or bit field, but C represents some bigger number N
  – Normal integers use log2(N) bits to represent 0..N
  – Counter C can use log2(log2(N)) bits to represent 1..log2(N)
      Accurate to within a factor of 2
      “Count” to 2^(2^8) using 8 bits
N = φ(C) function
unary → binary → floating point
Count using only the exponent

4 What is approximate counting?
Count occurrences of datastream objects, e.g. pairs of IP addresses
Problem
  – An object arrives; decide whether to increment
      What is N+1 if you only stored C?
      C=4, N=16. Choose 16+1 → 16 or 16+1 → 32?
Solution
  – Coin flipping: 16+1 → 32 with probability p = 1/(32-16)
  – Flajolet papers prove expected value and error are reasonable, 1985-2004+
  – Two sources of error
      Unavoidable: intermediate numbers not representable. Constant-factor approximation.
      Datastream: can’t view all the data at once, random decisions. Expected error bounds.
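
The coin-flipping rule above can be sketched in a few lines. This is a minimal illustration, assuming φ(C) = 2^C as in the C=4, N=16 example; the names `phi`, `increment` and `count_stream` are hypothetical and not taken from the talk.

```python
import random

def phi(c):
    """Value represented by counter c under base-2 counting: N = 2**c."""
    return 2 ** c

def increment(c, rng):
    """Advance c with probability p = 1 / (phi(c+1) - phi(c))."""
    p = 1.0 / (phi(c + 1) - phi(c))
    return c + 1 if rng.random() < p else c

def count_stream(n_items, seed=0):
    """Feed n_items arrivals to one approximate counter and return it."""
    rng = random.Random(seed)
    c = 0  # c = 0 represents phi(0) = 1, so phi(c) - 1 estimates the items seen
    for _ in range(n_items):
        c = increment(c, rng)
    return c
```

Each arrival raises φ(C) by 1 in expectation, so averaged over many runs φ(C) - 1 tracks the true count even though a single run is only accurate to a constant factor.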

5 Motivation
Old idea (memory-accuracy tradeoff) with some new uses
  – Morris 1978, one small register on a CPU
  – Today: big data, lots of counters
Data summarization
  – Approximate counting is useful by itself, for counting all objects
Database merge
  – Choose most efficient algorithm, pre-allocate memory
  – May be combined with other techniques
Bloom filters
  – Replace 1-bit with a small counter, Van Durme & Lall 2009
  – Spread counter into multiple bits of a Bloom filter, Talbot 2009
      Vary the number of bits for skewed data

6 Generalize Function
q-ary counting and Floating Point AC
ΔN = 2^C. Why base 2?
  – p = 2^-C → use fast random-bit source for increment decisions
Csűrös 2010
  – Treat counter as binary-exponent floating point number
      Exponent gives powers-of-two increment probabilities
      Significand gives better accuracy than base 2
  – Stair-step approximation to “q-ary” counting
  – I.e. restricted to 9 choices for 8-bit counters
  – First contribution
      Get these advantages…
      …without these restrictions
[Figure: 8-bit counter 0100 0110 split into an (8-d)-bit exponent and a d-bit significand]
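
A floating-point counter of this kind might be decoded as below. The slide does not give the formula; the sketch assumes the commonly quoted form f = (M + u)·2^t - M with M = 2^d, exponent t in the high bits and significand u in the low d bits. The function names are hypothetical.

```python
def decode(c, d):
    """Value represented by counter c with a d-bit significand (assumed formula)."""
    m = 1 << d                   # M = 2**d significand values per exponent
    t, u = c >> d, c & (m - 1)   # exponent (high bits), significand (low bits)
    return (m + u) * (1 << t) - m

def increment_probability(c, d):
    """Power-of-two increment probability 2**-t, set by the exponent alone."""
    return 2.0 ** -(c >> d)
```

Under this assumed formula the gap between consecutive counter values is exactly 2^t, so incrementing with probability 2^-t keeps the expected represented value in step with the true count, and d = 0 reduces to plain base-2 counting.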

7 Our Flexible AC
Flexible AC
  – Perfect counting below a threshold T, then
  – ΔN = a^(C-T), p = 1/a^(C-T), where a is any floating point value
  – a is small (<2), since 255 = log2(5.7e76)
  – Round ΔN to integer
Still get prior speedups
  – Round all ΔN to powers of two
  – If speed(RandomBit) < ½ speed(RandomNumber)
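
A rough sketch of the flexible scheme follows. It assumes that the gap at counter value C is 1 below the threshold T and round(a^(C-T)) from T onward, and that the counter saturates at 255; the exact indexing and the names `delta`, `phi`, `increment` are illustrative assumptions, not the authors' code.

```python
import random

def delta(c, a, t):
    """Gap phi(c+1) - phi(c): 1 below threshold t, then a**(c-t) rounded to an integer."""
    if c < t:
        return 1
    return max(1, round(a ** (c - t)))

def phi(c, a, t):
    """Value represented by counter c: the sum of all gaps below it."""
    return sum(delta(i, a, t) for i in range(c))

def increment(c, a, t, rng, c_max=255):
    """Count exactly while the gap is 1, otherwise advance with probability 1/gap."""
    if c >= c_max:
        return c                      # saturated counter (assumed behaviour)
    d = delta(c, a, t)
    if d == 1 or rng.random() < 1.0 / d:
        return c + 1
    return c
```

For example, `phi(255, 1.05, 16)` gives the largest value an 8-bit counter would reach with the hypothetical choice a = 1.05 and T = 16.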

8 Random Bit Generator
Many well-tested random number generators
  – Fewer random bit generators
Knuth vol. 2 eq. 10: very simple (fast!)
    A = 0x0102010081010101   // 64-bit constant
    X = X << 1               // shift left
    if overflow: X = X xor A
    RandomBit = X & 1        // lowest bit of X
  – A is your choice of primitive polynomial mod 2 with many one-bits: 8 out of 64, Rajski & Tyszer 2003
  – Every length-64 bit-sequence occurs once before repetition
Consider accuracy in terms of intended use. What matters for our application:
  – k one-bits in a row occurs 1 in 2^k times
  – Generated 2^47 bits; 42 one-bits in a row occurs 1 in 2^42 times, verified experimentally
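
The shift-and-xor generator above translates directly into Python once the 64-bit overflow is made explicit with a mask. The constant is copied from the slide and read as hexadecimal, which is an assumption; its primitivity is not checked here. `next_state` and `random_bits` are hypothetical names.

```python
MASK64 = (1 << 64) - 1
A = 0x0102010081010101  # constant from the slide, assumed to be hexadecimal

def next_state(x):
    """One step: shift left, and xor with A if a one-bit was shifted out."""
    overflow = x >> 63            # the bit that the shift pushes out of 64 bits
    x = (x << 1) & MASK64
    if overflow:
        x ^= A
    return x

def random_bits(seed, n):
    """Return n successive output bits (lowest bit of the state after each step)."""
    x = seed & MASK64
    if x == 0:
        raise ValueError("seed must be nonzero; the all-zero state never changes")
    bits = []
    for _ in range(n):
        x = next_state(x)
        bits.append(x & 1)
    return bits
```

Python integers are unbounded, so the mask stands in for the hardware overflow flag a C or assembly version would use.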

9 Speed Comparison
If this is embedded in a datastream application, speed may be important.
Random number generator is the bottleneck (the goal is just incrementing a counter!)
    if RandNumber < p: increment            // p = 2^(-k)
    if k RandomBits in a row: increment
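
The two increment tests being compared can be written side by side. In this sketch Python's `random.Random` stands in for both the random number generator and the random bit source, so it shows the logic only and says nothing about relative speed; the function names are hypothetical.

```python
import random

def decide_with_number(k, rng):
    """Increment with probability p = 2**-k using one random number."""
    return rng.random() < 2.0 ** -k

def decide_with_bits(k, rng):
    """Increment only if k random bits in a row are all one; stop at the first zero."""
    for _ in range(k):
        if rng.getrandbits(1) == 0:
            return False
    return True
```

The bit version usually stops after one or two bits, because the first zero bit already rules out an increment.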

10 Random Countdown Speedup
Why generate a random number every time?
  – Set countdown counter P
      P = number of times in a row RandNumber > p [no increment]
      This is the definition of a geometric distribution
  – Need one countdown counter per counter value (1..255), not per counter (billions)
  – Calculating P is (relatively) very expensive
      Fast on average if P is large, i.e. p is small
Hybrid algorithm
  – RandNumber < p? or RandomBit for small increments (large p)
  – Random Countdown for large increments (small p)
  – “small” means <10 or <22
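
A sketch of the random countdown follows, assuming P is drawn by inverting the geometric distribution and that one countdown is shared by all counters currently holding the same value. `sample_countdown` and `CountdownTable` are hypothetical names.

```python
import math
import random

def sample_countdown(p, rng):
    """Number of arrivals to skip before the next increment (geometric on 0, 1, 2, ...)."""
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()                    # uniform on (0, 1]
    return int(math.log(u) / math.log1p(-p))  # floor of log(u) / log(1 - p)

class CountdownTable:
    """One shared countdown per counter value, not per counter."""

    def __init__(self, probs, seed=0):
        self.rng = random.Random(seed)
        self.probs = probs                    # probs[c] = increment probability at value c
        self.remaining = [sample_countdown(p, self.rng) for p in probs]

    def should_increment(self, c):
        if self.remaining[c] > 0:
            self.remaining[c] -= 1            # cheap path: just count down
            return False
        self.remaining[c] = sample_countdown(self.probs[c], self.rng)
        return True
```

The logarithms are the expensive part, which fits the slide's point: the countdown pays off only when p is small enough that many arrivals are skipped per sample.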

11 Fixed Countdown Speedup
Why generate a random number at all?
  – Increment “1 in Δφ” times deterministically
      Slightly different value to get correct expected value
      Best possible accuracy if only one item
      Fastest
  – Relies on randomness of stream
      E.g. alternating items → bad counts
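
The deterministic variant can be sketched with a shared countdown that resets to a fixed value. The adjustment that the slide mentions for getting the correct expected value is not reproduced here, and `FixedCountdown` is a hypothetical name.

```python
class FixedCountdown:
    """Increment exactly once every deltas[c] arrivals at counter value c."""

    def __init__(self, deltas):
        self.deltas = deltas
        self.remaining = [d - 1 for d in deltas]

    def should_increment(self, c):
        if self.remaining[c] > 0:
            self.remaining[c] -= 1
            return False
        self.remaining[c] = self.deltas[c] - 1   # restart the fixed countdown
        return True

# Two items alternate, both counters sit at value 0, and the gap there is 2.
table = FixedCountdown([2])
decisions = [table.should_increment(0) for _ in range(8)]   # arrivals A, B, A, B, ...
a_hits = sum(decisions[0::2])   # increments credited to item A
b_hits = sum(decisions[1::2])   # increments credited to item B
```

In this toy run every increment lands on item B and none on item A, which is the kind of bad count an alternating stream produces when the stream itself supplies no randomness.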

12 Speed: RandomCount vs. FixedCount
Punchline
  – RandomCount = 1.5x FixedCount for Δφ = 255
  – RandomCount = ¼x RandomBit for Δφ = 172

13 How High Do You Want to Count?
Inverse problem (David M. Day)
Find a; never discussed in approximate counting literature
  – For some applications, determine by hand ahead of time
  – Our run-time solution
  – Inverse geometric sum: tricky case
      Find root > 1 for r(a)
      Initial guess depends on s compared to K, i.e. a^(K+1) vs. s·a vs. (s-1) constant
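
One plausible reading of the inverse geometric sum is written out below. It assumes, although the slide does not say so, that K+1 increments a^0, ..., a^K must add up to a target s; the symbols K and s are used in that assumed sense.

```latex
\sum_{i=0}^{K} a^{i} \;=\; \frac{a^{K+1}-1}{a-1} \;=\; s
\quad\Longrightarrow\quad
r(a) \;=\; a^{K+1} - s\,a + (s-1) \;=\; 0 .
```

Under this reading r(1) = 0 for every s, so a = 1 is always a root, which would explain why the root greater than 1 is the one wanted. The three terms of r(a) match the a^(K+1), s·a and (s-1) quantities named in the slide.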

14 Inverse Problem Alternatives
We’re only approximately counting,
  – So accuracy may not be important
We only calculate the function once,
  – So efficiency may not be important
(Application dependent)
  – Use the initial guesses
  – Use binary search or lookup table
  – Use N = φ(C) function with easier inverse
      E.g. exponential + linear function, but increments are too small for small C
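
The binary-search alternative might look like the sketch below. It assumes the target is a geometric sum 1 + a + ... + a^k = s with s > k + 1, which is an illustrative formulation and not necessarily the one used in the talk; `geometric_sum` and `solve_base` are hypothetical names.

```python
def geometric_sum(a, k, cap):
    """1 + a + ... + a**k by Horner's rule, stopping early once the sum exceeds cap."""
    total = 1.0
    for _ in range(k):
        total = total * a + 1.0
        if total > cap:
            break                 # early exit avoids float overflow for large a
    return total

def solve_base(s, k, tol=1e-12):
    """Find a > 1 with 1 + a + ... + a**k = s by bisection."""
    if k < 1 or s <= k + 1:
        raise ValueError("need k >= 1 and s > k + 1; otherwise a = 1 already suffices")
    lo, hi = 1.0, 2.0
    while geometric_sum(hi, k, s) < s:    # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if geometric_sum(mid, k, s) < s:
            lo = mid
        else:
            hi = mid
    return hi
```

For instance, `solve_base(1e12, 239)` would return the base needed for 239 approximate steps to span roughly 10^12. Since the function is calculated only once, the few dozen bisection steps are unlikely to matter.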

15 Conclusion
Flexible Approximate Counting provides
  – Customization of functional form
      At run-time, for maximum value to count to
  – Fast decisions of whether to increment
      If datastream is sufficiently random
        – Use fixed countdown
      Else
        – Switch to random countdown for large increments
      If speed is more important than accuracy for small increments
        – Use random bits and power-of-two increments
Random generator accuracy limits
  – Consider the intended use
      RandNumber: min r such that probability(u < r) ≈ r
      RandomBit: max k such that probability(k one-bits in a row) ≈ 2^-k
Thank you
  – Have a safe trip home

