Randomized Synopses for Query Assurance on Data Streams Ke Yi, Feifei Li, Marios Hadjieleftheriou, George Kollios, and Divesh Srivastava HKUST, Florida.


1 Randomized Synopses for Query Assurance on Data Streams. Ke Yi, Feifei Li, Marios Hadjieleftheriou, George Kollios, and Divesh Srivastava (HKUST, Florida State University, AT&T, Boston University). ICDE’08. Speaker: Zhang Ye. Supervisors: Dr. Nikos Mamoulis and Prof. David W. Cheung. Jan. 2010.

2 Outline. Introduction (Outsourced DSMS; Security issues); Problem definition; Solution; Experimental results; Extensions; Conclusion.

3 Introduction. Data Stream Management System (DSMS): handles continuously generated data (e.g. network, telephone, public transportation). The data must be processed efficiently in one pass. Example: network traffic analysis over tuples of (source IP, destination IP, packet size).

4 Introduction. Outsourced DSMS. Two parties: data owner and server. Three parties: data owner, server and client.

5 Introduction. Motivation for (three-party) outsourced DSMS: better QoS at every location (e.g. video streams), with one data owner, a few servers and many clients. Example from http://www.cse.ust.hk/~yike/pirs.pptx

6 Introduction. Outsourced DSMS model (three parties). The data owner and the servers can see the stream; users cannot. Users are resource-limited. (Figure: source of stream, data owner, servers, query results, users.)

7 Introduction. Security issues. Integrity: the server is lazy and returns random results to users; the server gives fake outputs to users; … Confidentiality (privacy): the server is interested in the data owner’s stream; …

8 Introduction (Confidentiality). What we need to do: computing on an encrypted database. (Figure: source of encrypted stream, data owner, server, authorized user, query result; data is exchanged in encrypted form, Enc(·).)

9 Introduction (Integrity). What we need to do: authentication. Based on the VO and the result, users can “recover” the message in the signature. Note that users cannot see the stream! (Figure: source of encrypted stream; the data owner sends the stream plus a signature of the data to the server; the server returns the query result, VO and signature to the user.)

10 Introduction. Motivation for (two-party) outsourced DSMS. If a user cannot afford the cost of developing a DSMS, an outsourced DSMS seems to be the only alternative. Most computer users do not have enough resources to monitor all the news, websites, etc. Materials from http://www.google.com/alerts

11 Introduction. Outsourced DSMS model (two parties). The data owner is resource-limited in terms of computation and storage. The stream is visible to both the data owner and the server. (Figure: source of stream, data owner, server, query result.)

12 Introduction. Security issues arising in (two-party) outsourced DSMS. Integrity: the server is lazy and outputs random results; the server may have competing interests and want to defraud the data owner; a process (e.g. a virus) on the server may also affect the result; even a communication error may change the result; … Confidentiality (privacy): the server may also be interested in what the data owner searches for.

13 Introduction (Confidentiality). Is that possible? “In 2005, the U.S. Department of Justice subpoenaed records of search terms from popular Web search engines.” [2] What do we need to do? (Figure: source of stream, data owner sending Enc(query), query result.)

14 Introduction (Confidentiality). Related work: Private Information Retrieval (PIR). We can solve the “confidentiality in ODSMS” problem if we can feed a keyword or a generic query to PIR. Setting: the server holds a public database $M_1,\dots,M_N$; the data owner picks $\sigma \in [1,N]$ and sends Enc(σ); the server returns “some info”, from which the data owner can recover $M_\sigma$. Note that $M_1,\dots,M_N$, rather than $M_\sigma$, can be seen by the data owner.

15 Introduction (Integrity). What do we need to do? The data owner computes a synopsis of the stream and checks whether the query result returned by the server is consistent with that synopsis. THIS PAPER! (Figure: source of stream, data owner, server, query result.)

16 Problem Definition. The query examined in this paper:
SELECT AGG(A_1), …, AGG(A_N) FROM T WHERE … GROUP BY G_1, …, G_M
where AGG is in {SUM, COUNT}. For example, in network traffic analysis:
SELECT SUM(packet_size) FROM IP_Trace GROUP BY source_ip, destination_ip

17 Problem Definition. The “GROUP BY” predicate partitions the streaming tuples into n groups. Data stream S: {(1,2), (3,2), (2,4), (3,-1), …, (i,j)}. For sum, a tuple (i,j) adds the amount j (where j may be positive or negative) to group i. The result is $V=(v_1,\dots,v_i,\dots,v_n)$:
(init) V = (0, 0, 0, …)
after (1,2): V = (2, 0, 0, 0, 0, 0)
after (3,2): V = (2, 0, 2, 0, 0, 0)
after (2,4): V = (2, 4, 2, 0, 0, 0)
…
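The update rule on this slide can be sketched in a few lines of Python. This is only an illustration of the group-by SUM semantics; the function name `aggregate` and the use of a dictionary for V are assumptions made for the sketch, not part of the presentation.

```python
from collections import defaultdict


def aggregate(stream):
    """Maintain the group-by SUM vector V: each tuple (i, j) adds j to group i."""
    v = defaultdict(int)
    for group, amount in stream:
        v[group] += amount  # amount may be positive or negative
    return dict(v)


# The first four tuples of the stream shown on the slide.
stream = [(1, 2), (3, 2), (2, 4), (3, -1)]
result = aggregate(stream)
print(result)
```

Keeping one counter per group in this way needs space proportional to the number of groups.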

18 Problem Definition. Assume that there exists an integer m such that $\sum_{i=1}^{n} v_i \le m$ at any time in the data stream S.

19 Problem Definition. The problem of Continuous Query Verification on data streams (CQV) is defined as follows: given a data stream S, a continuous query Q and a user-defined parameter $\delta \in (0, 1/2)$, build a synopsis X of v such that, for any time t, given any w and using X(v), we (1) raise an alarm with probability at least $1-\delta$ if $w \ne v$; (2) do not raise an alarm if $w = v$. (Figure: the data owner computes X(V) from the data stream S and receives w.)

20 Solution. Trivially, the client can execute exactly the same procedure as the server, maintaining all the $v_i$. This takes O(n) space, which motivates the search for a more space-efficient solution.

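21 Solution. Polynomial Identity Random Synopses (PIRS): (1) Let p be a prime and choose α uniformly at random from $\mathbb{F}_p$. (2) Define the synopsis $X(V) = X(v_1,\dots,v_n) = \prod_{i=1}^{n} (\alpha - i)^{v_i} \bmod p$. Note that, viewed as a function of α, X(V) is a polynomial of degree at most m.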

22 Solution. Generating X(V): the data owner chooses α and initializes X(V)=1. On receiving (i,u) from stream S, the data owner updates $X(V) \leftarrow X(V)\cdot(\alpha-i)^{u} \bmod p$. Verifying W returned by the server: the data owner computes $X(W) = \prod_{i=1}^{n} (\alpha-i)^{w_i} \bmod p$ and checks whether X(V)=X(W). If X(V)≠X(W), raise an alarm!

23 Solution. Example. We would like to verify a sum/group-by query with n=3 groups. Suppose we choose p=101 and α=37, which is kept by the data owner. Data stream S={(2,1),(3,3),(3,-2),(1,5)}, so V=(5,1,1). All arithmetic is modulo 101.
Generating X(V):
1. X(V)=1
2. X(V)=1*(37-2)^1=35
3. X(V)=35*(37-3)^3=35*15=20
4. X(V)=20*(37-3)^(-2)=79
5. X(V)=79*(37-1)^5=79
Verifying X(W) given W=(5,2,1): X(W)=(37-1)^5*(37-2)^2*(37-3)^1=38. X(W)≠X(V), so raise an alarm!

24 Solution. Example. We would like to verify a sum/group-by query with n=3 groups. Suppose we choose p=101 and α=37, which is kept by the data owner. Data stream S={(2,1),(3,3),(3,-2),(1,5)}, so V=(5,1,1). All arithmetic is modulo 101.
Generating X(V):
1. X(V)=1
2. X(V)=1*(37-2)^1=35
3. X(V)=35*(37-3)^3=35*15=20
4. X(V)=20*(37-3)^(-2)=79
5. X(V)=79*(37-1)^5=79
Verifying X(W) given W=(5,1,1): X(W)=(37-1)^5*(37-2)^1*(37-3)^1=79. X(W)=X(V), so no alarm is raised.
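The generate/verify procedure and the worked example can be reproduced with a small Python sketch. The class name `PIRS` and its method names are hypothetical, chosen only for this illustration; the sketch relies on Python 3.8+, where the built-in `pow` accepts a negative exponent together with a modulus, and it assumes α - i is nonzero modulo p whenever a negative update arrives.

```python
import random


class PIRS:
    """Minimal sketch of the synopsis X(V) = prod (alpha - i)^(v_i) mod p."""

    def __init__(self, p, alpha=None):
        self.p = p
        # alpha is drawn uniformly from F_p unless fixed for reproducibility.
        self.alpha = random.randrange(p) if alpha is None else alpha
        self.x = 1  # X(V) for the all-zero vector

    def update(self, i, u):
        # A negative u uses the modular inverse (raises ValueError if alpha - i == 0 mod p).
        self.x = self.x * pow(self.alpha - i, u, self.p) % self.p

    def synopsis_of(self, w):
        # Recompute the synopsis of a full vector w = (w_1, ..., w_n).
        x = 1
        for i, w_i in enumerate(w, start=1):
            x = x * pow(self.alpha - i, w_i, self.p) % self.p
        return x

    def verify(self, w):
        # True means "no alarm"; False means "raise an alarm".
        return self.synopsis_of(w) == self.x


# Parameters and stream taken from the example slides.
owner = PIRS(p=101, alpha=37)
for i, u in [(2, 1), (3, 3), (3, -2), (1, 5)]:
    owner.update(i, u)

print(owner.x)                  # X(V)
print(owner.verify([5, 2, 1]))  # wrong result
print(owner.verify([5, 1, 1]))  # correct result
```

The owner stores only `p`, `alpha` and the running value `x`, regardless of how many groups the query has.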

25 Solution. PIRS solves the CQV problem. No alarm if w=v: if w=v, then X(w)=X(v) for sure, so no alarm is raised. Alarm with probability at least 1-δ if w≠v: 1. X(V) and X(W) are polynomials in α of degree at most m, and so is f=X(V)-X(W). 2. A nonzero polynomial of degree at most m has at most m roots in $\mathbb{F}_p$. 3. No alarm is raised iff f(α)=0. Hence at most m values of α lead to “no alarm”; since α is chosen uniformly at random from $\mathbb{F}_p$, $\Pr[\text{no alarm} \mid w \ne v] \le m/p \le \delta$ whenever $p \ge m/\delta$.
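The counting argument can be checked by brute force on the small example (p=101, V=(5,1,1), W=(5,2,1)). The sketch below enumerates every possible α and lists those for which the two synopses collide; the helper name `synopsis` and the choice of m as the larger of the two exponent sums are assumptions made for this illustration.

```python
def synopsis(alpha, v, p):
    """X(v) = prod (alpha - i)^(v_i) mod p for a vector v = (v_1, ..., v_n)."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow(alpha - i, v_i, p) % p
    return x


p = 101
v = (5, 1, 1)  # true result
w = (5, 2, 1)  # wrong result returned by the server
m = max(sum(v), sum(w))  # degree bound of both polynomials (assumed for this sketch)

# Values of alpha for which the wrong answer would go undetected.
bad_alphas = [a for a in range(p) if synopsis(a, v, p) == synopsis(a, w, p)]

print(bad_alphas)           # the roots of f = X(V) - X(W)
print(len(bad_alphas) / p)  # actual miss probability
print(m / p)                # the bound m/p
```

Here f factors as (α-1)^5 (α-2)(α-3)(3-α), so only α in {1, 2, 3} miss the error, which is well within the bound of m values.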

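26 Solution. Space complexity of PIRS. What the data owner keeps: p, α and X(V), each of which can be encoded in $\lceil \log p \rceil$ bits. Constraints on p: $\Pr[\text{alarm}] \ge 1-\delta$ requires $p \ge m/\delta$; $X(v) = (\alpha-1)^{v_1}\cdots(\alpha-n)^{v_n}$ requires $p \ge n$. There always exists a prime p with $\max\{m/\delta, n\} \le p \le 2\max\{m/\delta, n\}$, so $\log p = O(\log(m/\delta) + \log n)$.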
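Choosing p can be sketched as a search for the smallest prime at or above max{m/δ, n}. The function names `is_prime` and `choose_prime` are hypothetical, trial division is used only because the illustrative numbers are tiny, and δ is converted to an exact fraction to avoid floating-point rounding; none of this is claimed to be how the authors pick p.

```python
import math
from fractions import Fraction


def is_prime(x):
    """Trial division; adequate for the small values in this sketch."""
    if x < 2:
        return False
    for d in range(2, math.isqrt(x) + 1):
        if x % d == 0:
            return False
    return True


def choose_prime(m, delta, n):
    """Smallest prime p with p >= max(m / delta, n)."""
    exact_delta = Fraction(delta).limit_denominator(10**6)
    lower = max(math.ceil(m / exact_delta), n)
    p = lower
    while not is_prime(p):
        p += 1
    return p


p = choose_prime(m=8, delta=0.1, n=3)
print(p)               # chosen prime
print(p.bit_length())  # bits needed for each value stored modulo p
```

With m=8, δ=0.1 and n=3 the lower bound is 80, and the search stops well before 2·80, as the slide's bound promises.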

27 Solution. Handling sliding windows. The X defined in PIRS has a “homomorphism” property: $X(v_1+v_2) = X(v_1)\cdot X(v_2)$. If we focus on sum/count (i.e. +) aggregation, … (Figure: v1, v2, v3.)
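The homomorphism property is easy to check numerically. The sketch below reuses p=101 and α=37 from the earlier example with two made-up segment vectors; the last step, dividing out the synopsis of an expired segment, is one possible way to exploit the property for a sliding window and is an assumption of this sketch rather than something stated on the slide.

```python
def synopsis(alpha, v, p):
    """X(v) = prod (alpha - i)^(v_i) mod p for a vector v = (v_1, ..., v_n)."""
    x = 1
    for i, v_i in enumerate(v, start=1):
        x = x * pow(alpha - i, v_i, p) % p
    return x


p, alpha = 101, 37
v1 = (2, 0, 1)  # aggregate of an older stream segment (illustrative)
v2 = (3, 1, 0)  # aggregate of a newer stream segment (illustrative)
total = tuple(a + b for a, b in zip(v1, v2))

x_total = synopsis(alpha, total, p)
x_product = synopsis(alpha, v1, p) * synopsis(alpha, v2, p) % p
print(x_total, x_product)  # X(v1 + v2) and X(v1) * X(v2)

# Dropping the expired segment v1: multiply by the modular inverse of X(v1).
x_window = x_total * pow(synopsis(alpha, v1, p), -1, p) % p
print(x_window, synopsis(alpha, v2, p))
```

The inverse exists only when X(v1) is nonzero modulo p, which holds here because α differs from every group index.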

28 Experimental results. Setup: implemented in C++ with the GMP library; 2.8 GHz CPU with 512KB L2 cache and 512MB main memory. Real data streams: the World Cup data stream (100M records) and an IP trace data stream from the AT&T backbone network (100M packets). Queries: count/sum (on response size) grouped by client id/object id for the World Cup data (50M groups); count/sum (on packet size) grouped by source IP/destination IP for the IP traces (7M groups).

29 Experimental results. Memory usage of the trivial solution: 84MB and 600MB (figure). PIRS uses 27 bytes all the time! Figure from http://www.cse.ust.hk/~yike/pirs.pptx

30 Experimental results. PIRS update time (figure). Namely, PIRS is able to process $10^6$ count queries ($10^5$ sum queries) per second!

31 Extensions. The following definition captures the semantics of continuous query verification with tolerance for a limited number of errors. For any $w, v \in \mathbb{N}^n$, let $E(w,v) = \{\, i \mid w_i \ne v_i \,\}$. Then $w =_r v$ iff $|E(w,v)| < r$, and $w \ne_r v$ otherwise. Given user-defined parameters $r \in \{1,\dots,n\}$ and $\delta \in (0,1/2)$, build a synopsis X of v such that, for any t, given any w and using X(V), we: 1. raise an alarm with probability at least $1-\delta$ if $w \ne_r v$; 2. do not raise an alarm if $w =_r v$.

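32 Extensions. PIRS^r: a grid of PIRS synopses with m layers and k buckets per layer (here m denotes the number of layers):
X_11, X_12, X_13, …, X_1k
X_21, X_22, X_23, …, X_2k
…
X_i1, X_i2, X_i3, …, X_ik
…
X_m1, X_m2, X_m3, …, X_mk
Parameters: $k = c_1 r^2$, $m = \log(1/\delta)$, each PIRS with $\delta' = 1/(c_2 r)$, where $c_1 = c_2 = 4.819$. Alarm condition: at least r buckets in any layer raise an alarm! Intuitively, if there are fewer than r errors, no layer will raise an alarm; if there are more than r errors, at least one layer will raise an alarm.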
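The layered structure can be sketched as follows. Everything beyond what the slide states is an assumption of this sketch: the class name `LayeredPIRS`, the salted use of Python's `hash` to assign groups to buckets, the independent α per bucket, and the tiny parameter values (which do not follow the k and m formulas above). Only non-negative updates are used, so no modular inverse is needed.

```python
import random


class LayeredPIRS:
    """Sketch: `layers` rows of `buckets` PIRS synopses; alarm if >= r buckets of a row differ."""

    def __init__(self, p, layers, buckets, r, seed=None):
        rng = random.Random(seed)
        self.p, self.k, self.r = p, buckets, r
        # Hypothetical hashing: one random salt per layer, mixed into hash().
        self.salts = [rng.getrandbits(64) for _ in range(layers)]
        self.alphas = [[rng.randrange(p) for _ in range(buckets)] for _ in range(layers)]
        self.x = [[1] * buckets for _ in range(layers)]

    def bucket(self, layer, i):
        return hash((self.salts[layer], i)) % self.k

    def update(self, i, u):
        # Fold the update into the one bucket that group i maps to in each layer.
        for layer in range(len(self.salts)):
            b = self.bucket(layer, i)
            a = self.alphas[layer][b]
            self.x[layer][b] = self.x[layer][b] * pow(a - i, u, self.p) % self.p

    def alarm(self, w):
        # Recompute every bucket from w and count mismatching buckets per layer.
        for layer in range(len(self.salts)):
            xw = [1] * self.k
            for i, w_i in enumerate(w, start=1):
                b = self.bucket(layer, i)
                xw[b] = xw[b] * pow(self.alphas[layer][b] - i, w_i, self.p) % self.p
            mismatches = sum(got != kept for got, kept in zip(xw, self.x[layer]))
            if mismatches >= self.r:
                return True
        return False


sketch = LayeredPIRS(p=1009, layers=3, buckets=8, r=2, seed=1)
for i, u in [(2, 1), (3, 3), (1, 5)]:
    sketch.update(i, u)

print(sketch.alarm([5, 1, 3]))  # correct result: no bucket differs
```

A single wrong entry changes at most one bucket per layer, so with r=2 it can never trigger the alarm, which is the tolerance the definition asks for.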

33 Conclusion. Verifying query results in an outsourced data stream setting is a problem that had not been addressed before. The authors propose a storage-efficient algorithm for detecting errors with high confidence while maintaining update efficiency. Their algorithm can be extended to detect a pre-defined number of arbitrary errors. However, their method seems to support only sum/count (“+”) aggregation, because of the algebraic property it relies on.

34 Thank you!

35 Reference. [2] J. Bethencourt, D. Song and B. Waters. New Techniques for Private Stream Searching. ACM Transactions on Information and System Security, 2009.

