
Presentation transcript:

Flexible Approximate Counting
Scott A. Mitchell (presenter) and David M. Day, Sandia National Laboratories
IDEAS'11: 15th International Database Engineering & Applications Symposium
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL

Outline
What is approximate counting?
What's new?
– Functional form
– Increment decision strategies: speed it up! Random number and bit generators
– Inverse problem: find the function given how high you want to count
(Focus on the items in red, since those are what's significant.)

What is approximate counting?
Approximate counter C
– Trade decreased memory for decreased accuracy
– Stored as a standard (unsigned) integer or bit field, but C represents some bigger number N
– Normal integers use log2(N) bits to represent 0..N
– Counter C can use log2(log2(N)) bits to represent 1..log2(N)
– Accurate to within a factor of 2
– "Count" to 2^(2^8) using 8 bits
N = φ(C) function
– unary → binary → floating point
– Count using only the exponent

What is approximate counting?
Count occurrences of datastream objects, e.g. pairs of IP addresses
Problem
– An object arrives; decide whether to increment
– What is N+1 if you only stored C? With C=4, N=16: choose 16+1 = 16 or 16+1 = 32?
Solution
– Coin flipping: 16+1 = 32 with probability p = 1/(32-16)
– Flajolet papers prove the expected value and error are reasonable
– Two sources of error
  Unavoidable: intermediate numbers are not representable. Constant-factor approximation.
  Datastream: can't view all the data at once, so decisions are random. Expected error bounds.
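The coin-flip rule can be sketched in a few lines of Python. This is a minimal illustration, assuming φ(C) = 2^C as in the C=4, N=16 example; the names `phi`, `increment`, and `count_stream` are hypothetical and do not come from the talk.

```python
import random

def phi(c: int) -> int:
    """Value represented by counter c, assuming phi(C) = 2^C."""
    return 2 ** c

def increment(c: int, rng: random.Random) -> int:
    """Advance c with probability 1 / (phi(c+1) - phi(c))."""
    delta = phi(c + 1) - phi(c)   # e.g. 32 - 16 = 16 when c = 4
    if rng.random() < 1.0 / delta:
        return c + 1
    return c

def count_stream(n_events: int, seed: int = 0) -> int:
    """Feed n_events arrivals to one counter and return the stored C."""
    rng = random.Random(seed)
    c = 0                         # phi(0) = 1
    for _ in range(n_events):
        c = increment(c, rng)
    return c
```

Each arrival raises φ(C) by Δ with probability 1/Δ, so the expected value of φ(C) grows by exactly 1 per event; φ(C) − φ(0) is therefore an unbiased estimate of the number of events seen.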

Motivation
Old idea (memory-accuracy trade-off) with some new uses
– Morris 1978: one small register on a CPU
– Today: big data, lots of counters
Data summarization
– Approximate counting is useful by itself, for counting all objects
Database merge
– Choose the most efficient algorithm, pre-allocate memory
– May be combined with other techniques
Bloom filters
– Replace each 1-bit with a small counter, Van Durme & Lall 2009
– Spread a counter into multiple bits of a Bloom filter, Talbot 2009; vary the number of bits for skewed data

Generalize Function
q-ary counting and Floating Point AC
ΔN = 2^C. Why base 2?
– p = 2^(-C), so a fast random-bit source can make the increment decisions
Csűrös 2010
– Treat the counter as a binary-exponent floating point number
  Exponent gives powers-of-two increment probabilities
  Significand gives better accuracy than base 2
– Stair-step approximation to "q-ary" counting
– I.e. restricted to 9 choices for 8-bit counters
First contribution
– Get these advantages without these restrictions
(Figure labels: d bits exponent, d bits significand)
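A floating point counter of this kind can be sketched as follows. This is an illustration only: the bit layout (exponent in the high bits, a d-bit significand in the low bits), the value formula, and the names `fp_value` and `fp_increment` are assumptions made for the example, not details taken from the talk.

```python
import random

def fp_value(x: int, d: int) -> int:
    """Estimated count for stored counter x with a d-bit significand.

    Assumed layout: exponent t = x >> d, significand u = low d bits.
    """
    t = x >> d
    u = x & ((1 << d) - 1)
    return ((1 << d) + u) * (1 << t) - (1 << d)

def fp_increment(x: int, d: int, rng: random.Random) -> int:
    """Advance x with probability 2^(-t), using t random bits."""
    t = x >> d
    # t == 0 means probability 1; otherwise all t bits must be zero.
    if t == 0 or rng.getrandbits(t) == 0:
        return x + 1
    return x
```

With this layout every step from x to x+1 changes the estimate by exactly 2^t, so the increment probability 2^(-t) keeps the estimate unbiased. Setting d = 0 reduces to pure base-2 counting, and a larger d gives finer steps between powers of two.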

Our Flexible AC
Flexible AC
– Perfect counting below a threshold T, then
– ΔN = a^(C-T), p = 1/a^(C-T), where a is any floating point value
– a is small (<2), since 255 = log2(5.7e76)
– Round ΔN to an integer
Still get prior speedups
– Round all ΔN to powers of two if speed(RandomBit) < ½ speed(RandomNumber)
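The flexible functional form might look like the sketch below. The indexing is an assumption (ΔN is taken as the step from C to C+1, and φ(C) as the sum of all earlier steps), and the names `flex_delta`, `flex_phi`, and `flex_increment` are hypothetical.

```python
import random

def flex_delta(c: int, a: float, t: int) -> int:
    """Step size from counter value c to c+1.

    Exact counting (step 1) below threshold t, then a^(c-t) rounded
    to an integer.
    """
    if c < t:
        return 1
    return max(1, round(a ** (c - t)))

def flex_phi(c: int, a: float, t: int) -> int:
    """Count represented by counter value c: the sum of earlier steps."""
    return sum(flex_delta(i, a, t) for i in range(c))

def flex_increment(c: int, a: float, t: int, rng: random.Random) -> int:
    """Advance c with probability 1 / flex_delta(c)."""
    if rng.random() < 1.0 / flex_delta(c, a, t):
        return c + 1
    return c
```

For example, `flex_phi(255, 1.1, 16)` gives the largest count an 8-bit counter reaches with base a = 1.1 and exact counting up to 16; raising a trades accuracy for range.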

Random Bit Generator
Many well-tested random number generators
– Fewer random bit generators
Knuth vol. 2 eq 10: very simple (fast!)
  A = x                 // 64-bit constant
  X = X << 1            // shift left
  If overflow X = X xor A
  RandomBit = X & 1     // lowest bit of X
– A is your choice of primitive polynomial mod 2 with many one-bits: 8 out of 64, Rajski & Tyszer 2003
– Every length-64 bit-sequence occurs once before repetition
Consider accuracy in terms of intended use. What matters for our application
– k one-bits in a row occurs 1 in 2^k times
– Generated 2^47 bits; 42 one-bits in a row occurs 1 in 2^42 times, verified experimentally
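The shift-and-xor recurrence translates directly to Python. In this sketch the register width and the constant A are parameters, because the transcript does not preserve the 64-bit constant used in the talk; the class name `ShiftRegisterBits` and the helper `period` are hypothetical.

```python
class ShiftRegisterBits:
    """Left-shift feedback register: X <<= 1; on overflow X ^= A."""

    def __init__(self, a: int, width: int, seed: int = 1):
        if seed % (1 << width) == 0:
            raise ValueError("seed must be nonzero")  # zero state never changes
        self.a = a
        self.mask = (1 << width) - 1
        self.top = 1 << (width - 1)
        self.x = seed & self.mask

    def next_bit(self) -> int:
        overflow = self.x & self.top      # bit that the shift pushes out
        self.x = (self.x << 1) & self.mask
        if overflow:
            self.x ^= self.a
        return self.x & 1                 # lowest bit of X

def period(gen: ShiftRegisterBits) -> int:
    """Number of steps until the register state first repeats."""
    start = gen.x
    steps = 0
    while True:
        gen.next_bit()
        steps += 1
        if gen.x == start:
            return steps
```

A small case that can be checked by hand: `ShiftRegisterBits(a=0b0011, width=4)` encodes the primitive polynomial x^4 + x + 1, and `period` returns 15, so every nonzero 4-bit state is visited once. A 64-bit generator needs a degree-64 primitive polynomial supplied by the caller.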

Speed Comparison
If this is embedded in a datastream application, speed may be important.
The random number generator is the bottleneck (the goal is incrementing a counter!)
  if RandNumber < p: increment                    // p = 2^(-k)
  if k RandomBits in a row are one: increment
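The two decision rules being compared can be written side by side. This sketch uses Python's standard `random` module as a stand-in for both generators, so it shows the logic only and says nothing about relative speed; the function names are hypothetical.

```python
import random

def increment_by_number(k: int, rng: random.Random) -> bool:
    """Decide with one random number: True with probability 2^(-k)."""
    return rng.random() < 2.0 ** -k

def increment_by_bits(k: int, rng: random.Random) -> bool:
    """Decide with up to k random bits: True only if all k are one."""
    for _ in range(k):
        if rng.getrandbits(1) == 0:
            return False          # early exit: half the time after one bit
    return True
```

The bit version stops at the first zero bit, so on average it draws fewer than two bits per decision regardless of k.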

Random Countdown Speedup
Why generate a random number every time?
– Set countdown counter P
  P = number of times in a row RandNumber > p [no increment]
  This is the definition of a geometric distribution
– Need one countdown counter per counter value (1..255), not per counter (billions)
– Calculating P is (relatively) very expensive
  Fast on average if P is large, i.e. p is small
Hybrid algorithm
– RandNumber < p? or RandomBit for small P
– Random Countdown for large P
– "small" means <10 or <22
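A random countdown can be sketched by drawing P from a geometric distribution with inverse-transform sampling. The sampling method is an assumption (the talk does not say how P is computed), and `sample_countdown` and `RandomCountdown` are hypothetical names.

```python
import math
import random

def sample_countdown(p: float, rng: random.Random) -> int:
    """Number of consecutive 'no increment' outcomes before the next
    increment, drawn from a geometric distribution with success rate p."""
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()                 # uniform in (0, 1], avoids log(0)
    return int(math.log(u) / math.log1p(-p))

class RandomCountdown:
    """One shared countdown per counter value, not per counter."""

    def __init__(self, probs, rng: random.Random):
        self.probs = probs                 # probs[c] = increment probability at value c
        self.rng = rng
        self.remaining = [sample_countdown(p, rng) for p in probs]

    def should_increment(self, c: int) -> bool:
        if self.remaining[c] > 0:
            self.remaining[c] -= 1         # cheap path: no random number drawn
            return False
        self.remaining[c] = sample_countdown(self.probs[c], self.rng)
        return True
```

The logarithms are only evaluated once per increment, so the cost is amortized over roughly 1/p arrivals; that is why the approach pays off when p is small.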

Fixed Countdown Speedup
Why generate a random number at all?
– Increment "1 in Δφ" times deterministically
  Slightly different value to get the correct expected value
  Best possible accuracy if there is only one item
  Fastest
– Relies on randomness of the stream
  E.g. alternating items give bad counts
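A deterministic countdown needs no random source at all. The sketch below increments on every Δφ-th request for a given counter value; it does not apply the "slightly different value" correction mentioned on the slide, whose details the transcript does not give. The class name `FixedCountdown` is hypothetical.

```python
class FixedCountdown:
    """Increment exactly once per deltas[c] requests at counter value c."""

    def __init__(self, deltas):
        self.deltas = deltas                       # deltas[c] = step size at value c
        self.remaining = [d - 1 for d in deltas]   # skips left before next increment

    def should_increment(self, c: int) -> bool:
        if self.remaining[c] > 0:
            self.remaining[c] -= 1
            return False
        self.remaining[c] = self.deltas[c] - 1     # restart the countdown
        return True
```

Because the countdown for a counter value is shared by all counters at that value, a regular arrival pattern can hand every increment to the same item, which is the "alternating items" failure case.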

Speed: RandomCount vs. FixedCount
Punchline
– RandomCount = 1.5x FixedCount for Δφ = 255
– RandomCount = ¼x RandomBit for Δφ = 172

How High Do You Want to Count?
Inverse problem (David M. Day)
Find a, never discussed in approximate counting literature
– For some applications, determine it by hand ahead of time
– Our run-time solution
– Inverse geometric sum is the tricky case
  Find the root >1 of r(a)
  Initial guess depends on s compared to K, i.e. a^(K+1) vs. s·a vs. the constant (s-1)

Inverse Problem Alternatives
We're only approximately counting,
– So accuracy may not be important
We only calculate the function once,
– So efficiency may not be important
(Application dependent)
– Use the initial guesses
– Use binary search or a lookup table
– Use an N=φ(C) function with an easier inverse
  E.g. exponential + linear function, but increments are too small for small C
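The binary-search alternative is easy to sketch. The example assumes the quantity to invert is the plain geometric sum 1 + a + ... + a^K = s with unrounded steps, which is a simplification; `geometric_sum` and `solve_base` are hypothetical names, and this is not the root-finding method from the talk.

```python
def geometric_sum(a: float, k: int) -> float:
    """1 + a + a^2 + ... + a^k."""
    return sum(a ** i for i in range(k + 1))

def solve_base(s: float, k: int, iterations: int = 100) -> float:
    """Find a > 1 with geometric_sum(a, k) == s by bisection."""
    if s <= k + 1:
        raise ValueError("need s > k + 1 so that the root exceeds 1")
    lo, hi = 1.0, 2.0
    while geometric_sum(hi, k) < s:       # grow the bracket until it contains s
        hi *= 2.0
    for _ in range(iterations):           # the sum is increasing in a for a > 0
        mid = (lo + hi) / 2.0
        if geometric_sum(mid, k) < s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For instance, `solve_base(7, 2)` converges to 2, since 1 + 2 + 4 = 7. The guard on s reflects that a = 1 always gives a sum of K + 1, so a base above 1 exists only for larger targets.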

Conclusion
Flexible Approximate Counting provides
– Customization of functional form
  At run-time, for the maximum value to count to
– Fast decisions of whether to increment
  If the datastream is sufficiently random
  – Use fixed countdown
  Else
  – Switch to random countdown for large increments
  If speed is more important than accuracy for small increments
  – Use random bits and power-of-two increments
Random generator accuracy limits
– Consider the intended use
  RandNumber: min r such that probability(u<r) ≈ r
  RandomBit: max k such that probability(k one-bits in a row) ≈ 2^(-k)
Thank you
– Have a safe trip home