1 Counter Braids: A novel counter architecture for network measurement

2 Data Measurement: Background
Accurate data measurement is needed in network and database systems.
- Internet backbones
  – Accounting/billing by ISPs: usage-based pricing
  – Traffic engineering
  – Network diagnostics and forensics: intrusion detection, denial-of-service attacks
  – Products: NetFlow (Cisco), cflowd (Juniper), NetStream (Huawei)
  – Ref. Lin, Xu, Kumar, Sung, Ramabhadran, Varghese, Estan, Shah, Iyer, Prabhakar, McKeown, etc.
- Database systems
  – Sketches, synopses of data streams
  – Usage and access logs
  – Ref. Cormode, Muthukrishnan, Babcock, Babu, Motwani, Widom, Charikar, Chen, Farach-Colton, Alon, etc.
- Data centers, cloud computing
  – Monitoring network usage, diagnostics, real-time load balancing, network planning
  – E.g. Amazon, Facebook, Google, Microsoft, Yahoo!

3 The Problem
- A flow is a sequence of packets with common properties, e.g. the same source-destination address, a common IP header 5-tuple, or a common prefix in the routing table.
  – An email, a web download, a video or voice stream, ...
- Problem: measure the number of bytes and packets sent by each flow in a measurement interval.
[Figure: packets from several flows arriving on a link, each incrementing its flow's packet counter.]

4 The Constraints
- Large number of active flows in a measurement interval
  – Several million flows over a 5-min interval on backbone links (CAIDA)
  – Problem: need lots of counters, i.e. a large memory
- Very high link speeds, so memory updates must be fast
  – E.g. the per-packet update time on 40 Gbps links is roughly 15 nanoseconds
  – Problem: need memory with fast access times
- Memories are either large (DRAM) or fast (SRAM); can't get both
  – DRAMs have access times of 50-60 ns
  – SRAMs have access times of 4.5-7 ns, but capacities of only around 50-60 Mb (Micron Tech.)

5 Building Counter Arrays: A naïve approach
- Allocate one counter per flow
  – When a packet arrives, look up its flow-id in memory and update the corresponding counter
- Problems
  – Too much space is taken up by counters, especially because not all flows are large
  – Need to perform the flow-to-counter association for every packet, i.e. need a perfect hash function
[Figure: each of flows 1-5 mapped to its own counter.]
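To make the naïve approach concrete, here is a minimal sketch (not from the slides; the flow-id tuples and names are illustrative): a hash table keyed by flow-id plays the role of the exact flow-to-counter association that must be consulted on every packet.

```python
# Minimal sketch of the naive approach: one exact counter per flow,
# keyed by the flow's 5-tuple. The dict stands in for the "perfect"
# flow-to-counter association maintained on every packet.
from collections import defaultdict

byte_counts = defaultdict(int)    # flow-id -> byte count
packet_counts = defaultdict(int)  # flow-id -> packet count

def on_packet(flow_id, packet_len):
    """Called for every packet: look up the flow and update its counters."""
    byte_counts[flow_id] += packet_len
    packet_counts[flow_id] += 1

# Example: three packets from two flows (illustrative 5-tuples).
on_packet(("10.0.0.1", "10.0.0.2", 6, 1234, 80), 1500)
on_packet(("10.0.0.1", "10.0.0.2", 6, 1234, 80), 40)
on_packet(("10.0.0.3", "10.0.0.2", 17, 53, 5353), 512)
print(dict(packet_counts))
```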

6 Previous Approaches
1. Hybrid SRAM-DRAM architecture (Shah et al. '01, Ramabhadran and Varghese '03, Zhao et al. '06): small counters in fast memory (SRAM) backed by full counters in slow memory (DRAM), with counter-management logic moving counts between them.
- Pros: obtains exact counts; reduces the total cost of counter memory.
- Cons:
  – Still need to know the flow-to-counter association
  – High-complexity counter management
[Figure: SRAM counters in front of DRAM counters, connected by counter-management logic.]

7 Previous Approaches
2. Approximate counting (Estan and Varghese '02): approximate large-flow detection in fast memory (SRAM).
- Measures data from large flows (elephants) and ignores small flows (mice)
  – Exploits the Pareto distribution of network traffic (Fang and Peterson '99, Feldmann et al. '00)
- Pros:
  – Strikes a neat trade-off: accuracy for simplicity
  – The flow-to-counter association becomes manageable
- Cons: flow counts are approximate.
[Figure: SRAM holding counters for the large flows only.]

8 Summary
- Exact counting methods: hybrid architecture, complex
- Approximate methods: focus on large flows, inaccurate
- Problems to address
  – Get rid of the flow-to-counter association problem
  – Save counter space (reduce memory size)

9 Getting Rid of Flow-to-Counter Association
- Use multiple hash functions
  – A single hash function leads to collisions.
  – However, one can use two or more hash functions and use the redundancy to recover the flow sizes.
- Need an efficient decoding algorithm for solving c = Mf
  – c: counter values, M: adjacency matrix, f: flow sizes
[Figure: flows hashed to several counters each; every counter stores the sum of the flows that hash to it.]
M =
0 1 1 0 0
1 1 0 0 0
0 0 1 1 0
1 0 1 0 0
0 0 0 1 1
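As an illustration of the counting stage behind c = Mf, the sketch below hashes each flow into k counters and accumulates flow sizes. The hash construction, the sizes, and the resulting adjacency matrix are assumptions for illustration; they do not reproduce the example matrix above.

```python
# Sketch of the counting stage: each flow is hashed to k counters and every
# counter holds the sum of the flows that hit it (c = Mf for the implied M).
import hashlib

def counter_indices(flow_id, k, m):
    """k counter indices for a flow (k independent hashes into m counters)."""
    return [int(hashlib.sha256(f"{flow_id}:{r}".encode()).hexdigest(), 16) % m
            for r in range(k)]

def count(flows, k, m):
    """flows: dict flow_id -> size. Returns the counter array c."""
    c = [0] * m
    for flow_id, size in flows.items():
        # In hardware this happens one packet at a time; here we add the
        # whole flow size at once. (The k indices may occasionally coincide.)
        for a in counter_indices(flow_id, k, m):
            c[a] += size
    return c

flows = {"f1": 2, "f2": 24, "f3": 3, "f4": 1, "f5": 3}
c = count(flows, k=2, m=5)
print(c)  # each entry is a sum of the flow sizes hashing to that counter
```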

10 Saving Counter Space
- “Braid” counters
  – Share the more significant bits
[Figure: the example flow counts 2, 24, 3, 1, 3 shown before and after their more significant bits are braided (shared).]

11 Updating Counter Braids
[Animation: per-packet updates of the braided counters for the example flows with counts 2, 24, 3, 1, 3.]

12 The Overall Architecture
- Mouse traps: many, shallow counters
- Elephant traps: few, deep counters
- Status bit: indicates overflowing flows
[Diagram: first-layer mouse traps feeding second-layer elephant traps, with status bits marking overflow.]
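A minimal sketch of one plausible per-packet update path consistent with this architecture: many shallow first-layer counters spill their more significant bits, via hashing, into a few deep second-layer counters, with a status bit recording the overflow. The layer sizes, counter widths, and hash functions are assumptions for illustration, not the parameters of the actual design.

```python
# Hedged sketch of a braided update: shallow layer-1 counters whose overflow
# (the more significant bits) is hashed into shared, deeper layer-2 counters.
import hashlib

L1_BITS = 4                       # width of a shallow layer-1 counter
M1, M2, K = 1024, 64, 2           # layer-1 counters, layer-2 counters, hashes

layer1 = [0] * M1
layer2 = [0] * M2
status = [False] * M1             # status bit: this layer-1 counter overflowed

def _hashes(key, k, m):
    return [int(hashlib.sha256(f"{key}:{r}".encode()).hexdigest(), 16) % m
            for r in range(k)]

def update(flow_id, amount=1):
    """Add `amount` to the flow's K layer-1 counters; spill any overflow
    (the significant bits) into shared layer-2 counters."""
    for a in _hashes(flow_id, K, M1):
        layer1[a] += amount
        carry, layer1[a] = divmod(layer1[a], 1 << L1_BITS)
        if carry:
            status[a] = True                  # remember that counter a spilled
            for b in _hashes(("L2", a), K, M2):
                layer2[b] += carry            # significant bits shared here
```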

13 System Diagram
[Diagram: Online stage: count packets using the CBs; the counter values live in the CB structure. Offline stage: the decoder takes the CB counter values and the list of flows and outputs the list of flow sizes (e.g. 2, 24, 3, 1, 3).]

14 Will Describe
- Decoding algorithms
- Sizing of Counter Braids
  – Comparison with the Linear Programming decoder

15 Decoding Algorithms
1. Typical set decoder
2. Message-passing decoder

16 Typical set decoder
- Assumptions on the flow size distribution P:
  1. At most power-law tails
  2. Decreasing digit entropy: writing f_i in digits, the entropy of the l-th digit f_i(l) is decreasing in l
- Satisfied by real traffic distributions
- Typical set decoder: let f' = (f_1', ..., f_n') and let P' be its empirical distribution. If f' is the unique vector that solves c = Mf with D(P' || P) < δ, then output f'; otherwise, output an error.
- Theorem (Asymptotic Optimality; Lu, Montanari, Prabhakar '07). For any rate r > H(P) there exists a sequence of reliable sparse Counter Braids with asymptotic rate r. Hence, with the typical set decoder, Counter Braids is optimal.
- However, the typical set decoder is too complex to implement.

17 Message Passing Decoder (Lu et al. '08)
- Linear-complexity decoding algorithm
- In each layer, the decoder solves the set of linear equations c = Mf
[Figure: layered decoding with shallow first-layer counters; the second-layer "flows" exist only conceptually and represent the carry-over between layers.]

18 A Practical Algorithm: Count-Min
- Count-Min (Cormode and Muthukrishnan '05)
  – Algorithm: estimate each flow's size as the minimum counter it hits.
  – The flow sizes for the example below would be estimated as 34, 34, 32.
- Pros: fast and easy to implement
- Cons:
  – Decisions are based on local information only; lots of counters are needed for accurate estimation
  – The error magnitude is unknown; in fact, we don't even know whether there is an error
[Figure: example counters with values 1, 1, 32, 34, 32; each flow is estimated by the minimum counter it hashes to.]
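A small self-contained sketch of Count-Min on a running example (the hash construction and sizes are illustrative): the counting stage is the same c = Mf accumulation as before, and the estimation stage takes the minimum counter each flow touches.

```python
# Sketch of Count-Min: count each flow into k counters, then estimate each
# flow as the minimum of the counters it hashes to.
import hashlib

def idx(flow_id, k, m):
    return [int(hashlib.sha256(f"{flow_id}:{r}".encode()).hexdigest(), 16) % m
            for r in range(k)]

def count_min(flows, k, m):
    c = [0] * m
    for fid, size in flows.items():          # counting stage (same as c = Mf)
        for a in idx(fid, k, m):
            c[a] += size
    # estimation stage: minimum over the counters each flow touches
    return {fid: min(c[a] for a in idx(fid, k, m)) for fid in flows}

flows = {"f1": 2, "f2": 24, "f3": 3, "f4": 1, "f5": 3}
print(count_min(flows, k=2, m=5))   # estimates are >= the true sizes
```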

19-21 Count-Min as Message Passing
[Figures: successive iterations of message passing on the example counters 1, 1, 32, 34, 32. Iteration 1 reproduces the Count-Min estimates, which are upper bounds on the true flow sizes; as message passing continues, the even iterations produce lower bounds on the true flow sizes.]

22 Message Passing Decoder
Initialize: μ_ia = 0 for all i and all a.
Iterations: for t = 1 to T,
  ν_ai(t) = max{ c_a − Σ_{j≠i} μ_ja(t−1), 1 }
  μ_ia(t) = min_{b≠a} ν_bi(t)   if t is odd
          = max_{b≠a} ν_bi(t)   if t is even
Final estimate:
  f̂_i(T) = min_a ν_ai(T)   if T is odd
         = max_a ν_ai(T)   if T is even
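A runnable sketch of this single-layer message-passing decoder. The example graph and flow sizes are illustrative, and flows are assumed to have size at least 1, as in the slides.

```python
# Sketch of the message-passing decoder of slide 22, for a single layer.
# edges[i] lists the counters flow i hashes to; c are the counter values.
def mp_decode(edges, c, T=10):
    n, m = len(edges), len(c)
    mu = {(i, a): 0 for i in range(n) for a in edges[i]}   # flow -> counter
    nu = {}                                                # counter -> flow
    members = [[] for _ in range(m)]                       # flows per counter
    for i in range(n):
        for a in edges[i]:
            members[a].append(i)
    for t in range(1, T + 1):
        for a in range(m):                                 # counter messages
            for i in members[a]:
                others = sum(mu[(j, a)] for j in members[a] if j != i)
                nu[(a, i)] = max(c[a] - others, 1)
        pick = min if t % 2 == 1 else max                  # odd: upper, even: lower
        for i in range(n):                                 # flow messages
            for a in edges[i]:
                mu[(i, a)] = pick(nu[(b, i)] for b in edges[i] if b != a)
    final = min if T % 2 == 1 else max
    return [final(nu[(a, i)] for a in edges[i]) for i in range(n)]

# Flows of sizes 2, 24, 3, 1, 3 hashed to 2 counters each (illustrative graph):
edges = [[0, 1], [0, 2], [1, 3], [2, 4], [3, 4]]
c = [2 + 24, 2 + 3, 24 + 1, 3 + 3, 1 + 3]   # c = Mf for f = (2, 24, 3, 1, 3)
print(mp_decode(edges, c, T=10))            # -> [2, 24, 3, 1, 3]
```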

23 Anti-monotonicity Property
Lemma (Anti-monotonicity Property; Lu et al. '08). If μ and μ' are such that for every i and a, μ_ia(t−1) ≤ μ'_ia(t−1) ≤ f_i, then ν_ai(t) ≥ ν'_ai(t) ≥ f_i. Consequently, since f̂(0) = 0, f̂(2t) ≤ f component-wise and f̂(2t) is component-wise non-decreasing. Similarly, f̂(2t+1) ≥ f and is component-wise non-increasing.
- Corollary: incorrect flow size estimates can be identified, since for them the upper bounds differ from the lower bounds.
- When does the gap close?
[Figure: upper and lower bounds on flow size plotted against flow index.]

24 Convergence: Sizing of Counter Braids

25 Convergence
Definitions:
- n flows and m counters
- k hash functions
- Error measure: P_err, the fraction of flows decoded incorrectly
- Number of counters per flow: β = m/n, where m is the number of counters and n is the number of flows

26 Threshold: β* = 0.72 (no. of flows = 10,000)
- β = 0.71: 21% of flows are not decoded, but we know their errors exactly
- β = 0.725: converges in 25 iterations
- β = 0.74: converges faster
[Figure: P_err vs. iteration number for the three values of β, with Count-Min's error shown for comparison.]

27 Threshold
Let λ = nk/m be the average degree of a counter node, let ε = P(f_i > 1), and let g_λ(x) be the density-evolution map derived in the following slides. Define λ* = sup{ λ ∈ R : x = g_λ(x) has no solution for x ∈ (0,1] }.
Theorem (The Threshold; Lu et al. '08). β* = k/λ* is such that, in the large-n limit,
(i) if β > β*, then P_err → 0: the lower bounds and upper bounds converge to the true flow sizes;
(ii) if β ≤ β*, then P_err > 0: a positive proportion of flows have lower bounds that differ from their upper bounds, i.e. some flows are incorrectly decoded.

28 Density Evolution
- Lemma (Locally Tree-like Property; Richardson and Urbanke, Modern Coding Theory). Consider running the algorithm for a finite number of iterations. The error probability, averaged over the class of random graphs, converges as the system size grows to the error probability on a finite tree, where messages are independent of each other.
- Flow nodes have degree k
- Counter nodes have random degrees following a Poisson distribution with mean λ = nk/m

29 Density Evolution
- Error probability of a counter-to-flow message, conditioned on the counter's degree: y_t = 1 − (1 − x_{t−1})^{j−1}, where x_{t−1} is the error probability of an incoming flow-to-counter message and j is the degree of the counter node, which has a Poisson distribution with mean λ = nk/m.

30 Density Evolution
- Error probability of a counter-to-flow message, averaged over j ~ Poisson(λ): y_t = 1 − e^{−λ x_{t−1}}, where the factor e^{−λ x} comes from the moment generating function of the Poisson distribution.

31 Density Evolution
- Error probability of a counter-to-flow message: y_t = 1 − e^{−λ x_{t−1}} (previous slide)
- Error probability of a flow-to-counter message:
  x_t = y_t^{k−1}       if t is odd
  x_t = ε · y_t^{k−1}   if t is even
  where k is the number of hash functions and ε = P(f > 1)

32 Density Evolution
- Error probability of a counter-to-flow message: y_t = 1 − e^{−λ x_{t−1}}
- Error probability of a flow-to-counter message: x_t = y_t^{k−1} if t is odd, x_t = ε·y_t^{k−1} if t is even, where k is the number of hash functions and ε = P(f > 1)
- Substituting y_t and composing one odd and one even iteration gives x_{t+2} = g_λ(x_t), with g_λ(x) = ε [1 − exp(−λ (1 − e^{−λx})^{k−1})]^{k−1}
- Now look for the fixed points of g_λ(x) = x

33-36 Computation of Threshold
- Above the threshold (β > β*), the decoding error goes to 0.
- Below the threshold (β < β*), the decoding error stays positive.
- The threshold itself is where g(x) becomes tangential to the diagonal (β = β*).
[Figures: g(x) vs. x for β > β*, β < β*, and β = β*; fraction of flows incorrectly decoded vs. iteration number.]
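The threshold can also be located numerically from the density-evolution map on slide 32: scan λ for the largest value at which x = g_λ(x) has no solution in (0,1], then set β* = k/λ*. Below is a minimal sketch under that reconstructed map; the values of ε and k are illustrative, and the exact form of g used in the paper may differ in detail.

```python
# Numerical sketch of the threshold computation: bisect on lambda for the
# largest value with no nonzero fixed point of g, then beta* = k / lambda*.
import math

def g(x, lam, k, eps):
    y_odd = 1.0 - math.exp(-lam * x)        # counter -> flow error prob
    x_odd = y_odd ** (k - 1)                # flow -> counter, odd step
    y_even = 1.0 - math.exp(-lam * x_odd)
    return eps * y_even ** (k - 1)          # flow -> counter, even step

def has_fixed_point(lam, k, eps, iters=5000):
    """Iterate from x = 1: the sequence decreases to the largest fixed point,
    which is nonzero exactly when decoding fails asymptotically."""
    x = 1.0
    for _ in range(iters):
        x = g(x, lam, k, eps)
    return x > 1e-9

def beta_star(k, eps, tol=1e-4):
    lo, hi = 0.01, 50.0                     # bracket for lambda*
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if has_fixed_point(mid, k, eps):
            hi = mid                        # fixed point exists: lambda too big
        else:
            lo = mid
    return k / lo                           # beta* = k / lambda*

print(beta_star(k=3, eps=0.5))              # counters per flow needed (illustrative)
```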

37 Comparison with Linear Programming Decoder

38 Relation to Compressed Sensing
- Compressed sensing: storing sparse vectors using random linear transformations
  – Baraniuk, Candes, Donoho, Gilbert, Hassibi, Indyk, Milenkovic, Muthukrishnan, Romberg, Strauss, Tanner, Tao, Tropp, Vershynin, Wainwright, et al.
- Problem statement: minimize ||f||_1 subject to c = Mf
  – LP decoding: worst-case cubic complexity
- We consider sparse non-negative solutions and compare the Linear Programming (LP) decoder and the Message Passing (MP) decoder
  – Given an ε-sparse vector f (with a proportion ε of non-zero entries), let m = n·δ*(ε) be the number of equations needed for each algorithm to recover f as n → ∞, with high probability over the space of random matrices and the space of vectors f
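For reference, a small sketch of the LP decoder on a toy instance using scipy.optimize.linprog; the matrix and flow sizes are illustrative (the same graph as the message-passing example), and since f ≥ 0, minimizing ||f||_1 is simply minimizing the sum of the entries.

```python
# Sketch of LP decoding: minimize ||f||_1 subject to Mf = c and f >= 0.
import numpy as np
from scipy.optimize import linprog

M = np.array([[1, 1, 0, 0, 0],     # counters x flows adjacency matrix
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1]])
f_true = np.array([2, 24, 3, 1, 3])
counts = M @ f_true                 # observed counter values

res = linprog(c=np.ones(M.shape[1]),        # objective: sum of flow sizes
              A_eq=M, b_eq=counts,
              bounds=(0, None), method="highs")
print(np.round(res.x, 3))
# Here M is square and nonsingular, so the solution is unique; in the
# compressed-sensing setting m < n and sparsity is what makes the LP pick f.
```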

39 Comparison (Lu, Montanari, Prabhakar '08)
Theorem (LP Threshold; Donoho and Tanner '05). Consider an m x n matrix M with columns drawn independently from a multivariate normal distribution on R^d with a nonsingular covariance matrix. For ε ∈ [0,1], the LP threshold δ*(ε) is numerically obtainable and is plotted in the graph, with a known asymptotic form as ε → 0.
Theorem (MP Threshold; Lu, Montanari, Prabhakar '08). Consider sparse matrices with given degree distributions. For any ε ∈ [0,1], δ*(ε) ≤ ε^{1/2}; a tighter expression (plotted in the graph) holds for ε ≤ 1/8.
[Figure: δ*(ε) for the LP decoder and for our algorithm.]

40 Implementation
Flow label collection (Lu, Prabhakar '09)
- Flow labels are collected and stored in DRAM, with a small fraction (e.g. 0.5%) missing
- The resulting problem is equivalent to decoding with noise
FPGA implementation (Lu, Luo, Prabhakar, submitted)
- Simple to implement
  – 2512 slice flip-flops and 5342 4-input LUTs on NetFPGA, with 2.8 MB of memory
  – 65-nm ASIC synthesis: 0.0487 mm² for logic and 15.66 mm² for memory
- Parallel architecture to achieve high throughput: 84 Gbps at a 125 MHz clock

41 Trace Simulation
CAIDA trace (2 hours of data). Measurement interval: 5 min. Number of flows: ~0.9 million.
- SRAM space (MB): Counter Braids 2.9†, Hash table 28.4*, Hybrid 18.9*, Approximate 0.144#
- Results: Counter Braids 99.535% exact and 0.465% with known error; Hash table exact; Hybrid exact; Approximate 0.9% error for the largest 3000 flows
† Includes Bloom filters for flow label collection and status bits for first-layer counters.
* Hash table and hybrid assume 1.5x over-provisioning using d-left hash tables.
# Multi-stage filters assume 4000 x 4 x 3 bytes and 3000 x 32 bytes in flow memory (100,000 flows, largest ~3000).
Decoding complexity: 1 million flows take 15 seconds to decode on a 2.6 GHz machine.

