Ely Porat Bar-Ilan University Group Testing and New Algorithmic Applications.

Theory of Big Data, Pattern matching, Game theory, Coding theory, Compressive sensing, Group testing, Distributed

Theory of Big Data: Bloom filters, Succinct data structures, Streaming algorithms, Sketching & LSH, Big databases

Group Testing Overview: test soldiers for a disease. WWII example: syphilis.

Group Testing Overview: test an army for a disease. WWII example: syphilis. What if only one soldier has the disease? We can pool blood samples and check whether at least one soldier in the pool has the disease.

Another motivation

More Motivations: Syphilis, HIV [Dor43]; Mapping genomes [BLC91, BBK+95, TJP00]; Quality control in product testing [SG59]; Searching files in storage systems [KS64]; Sequential screening of experimental variables [Li62]; Efficient contention resolution algorithms for multiple access communication [KS64, Wol85]; Data compression [HL00]; Software testing [BG02, CDFP97]; DNA sequencing [PL94]; Molecular biology [DH00, FKKM97, ND00, BBKT96]

Adaptive group testing Number of sick d ≤ 2

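Adaptive general case: number of sick ≤ d. Split the n soldiers into 2d groups and test each group. At most d groups are positive => at most n/2 soldiers remain. Run in recursion on those. Total: O(d log(n/d)) tests.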
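
The halving argument on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `adaptive_group_test`, the `pool_is_positive` oracle and the example data are hypothetical, and the final step (testing the last 2d or fewer candidates one by one) is one simple way to finish the recursion.

```python
from math import ceil
from typing import Callable, List, Set, Tuple


def adaptive_group_test(
    items: List[int], d: int, pool_is_positive: Callable[[List[int]], bool]
) -> Tuple[Set[int], int]:
    """Find the sick items with pooled tests, assuming at most d >= 1 are sick."""
    candidates = list(items)
    tests = 0
    while len(candidates) > 2 * d:
        # Split the current candidates into (at most) 2d groups and test each one.
        size = ceil(len(candidates) / (2 * d))
        groups = [candidates[i:i + size] for i in range(0, len(candidates), size)]
        survivors: List[int] = []
        for group in groups:
            tests += 1
            if pool_is_positive(group):
                survivors.extend(group)
        if len(survivors) == len(candidates):
            break  # every group positive: more than d sick, halving no longer applies
        candidates = survivors  # at most d groups survive, roughly half the items
    # Few candidates remain: test them one by one.
    sick: Set[int] = set()
    for item in candidates:
        tests += 1
        if pool_is_positive([item]):
            sick.add(item)
    return sick, tests


# Hypothetical example: 1000 soldiers, 3 of them sick.
truly_sick = {3, 500, 999}
found, used = adaptive_group_test(
    list(range(1000)), d=3,
    pool_is_positive=lambda pool: any(s in truly_sick for s in pool),
)
print(found, used)
```

Each round costs at most 2d tests and roughly halves the candidate set, which is where the O(d log(n/d)) total comes from.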

Non-adaptive group testing: all the tests are set in advance (a t × n test matrix).

Non-adaptive group testing: (t × n test matrix) × (unknown vector) = (result vector), using (and, or) matrix-vector multiplication.

Non-adaptive group testing: the t × n test matrix (columns 1, 2, 3, …, n) is to be designed; the vector x = (x_1, x_2, x_3, …, x_n) is unknown; the result vector r = (r_1, r_2, r_3, …, r_t) is observed. Upper bound: t = O(d^2 log n) [PR08]. Lower bound: t = Ω(d^2 log_d n) [DR82].
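
A minimal sketch of the (and, or) matrix-vector product, to make the roles of the designed matrix, the unknown x and the observed r concrete. The function name `run_tests` and the toy 4 × 6 matrix are assumptions made for illustration; the matrix is not one of the constructions cited on the slide.

```python
from typing import List, Sequence


def run_tests(matrix: Sequence[Sequence[int]], x: Sequence[int]) -> List[bool]:
    """r_i = OR over j of (matrix[i][j] AND x[j]): test i is positive
    exactly when it contains at least one defective item."""
    return [any(m and xi for m, xi in zip(row, x)) for row in matrix]


# Toy design: 6 items, 4 tests, every item sits in a distinct pair of tests.
matrix = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
x = [0, 0, 0, 1, 0, 0]       # unknown in practice: only item 3 is defective
r = run_tests(matrix, x)     # the observed vector
print(r)
```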

Non adaptive group testing

Each one has at least 2 good tests. This means that we can deal with one false-positive error. Dealing with any r errors can be done with a scheme of size O(rd log_d n + d^2 log n) [NPR11].

2-Stage group testing

We misclassified 2 soldiers. Using O(d log(n/d)) measurements, we will misclassify O(d) soldiers, whom we can easily test one by one in a second stage. This is a property of unbalanced expanders.

Adaptive vs. non-adaptive. Time: if one test takes a day to perform, adaptive testing might take a month, while 2-stage group testing takes 2 days. Store less to be checked later.

Group testing for Pattern Matching. Text: length n. Pattern: length m.

Group testing for Pattern Matching: part of a 20M€ consortium project supported by MOI (cyber security).

Motivation… Stock market

Motivation… Espionage. The rest we monitor.

Motivation… Viruses and malware. Software solutions: Snort 73.5Mb, ClamAV 1.48Gb. Using TCAMs: Snort 680Kb, ClamAV 25Mb. Our solution (software): Snort 51Kb, ClamAV 216Kb.

Motivation… Monitoring internet traffic

Group testing for Pattern Matching. Text: length n. Pattern: length m. Pattern matching with wildcards: O(n log m) [CH02]. Up to k mismatches [CEPR07, CEPR09]. Sketching Hamming distance [PL07, AGGP13]. Pattern matching in the streaming model [PP09].

Group testing for Pattern Matching. Text, pattern: up to k mismatches using a group testing scheme. Performing the tests is easy. However, how can we analyze the results?

Fast Decoding: the naive decoding takes O(nt) time.
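
For reference, the naive decoder can be sketched as follows: an item stays a candidate only if every test that contains it came back positive, so the whole t × n matrix is scanned once, which is the O(nt) cost. The name `naive_decode` and the toy matrix (6 items, each in a distinct pair of 4 tests, enough to identify a single defective) are assumptions for illustration.

```python
from typing import List, Sequence


def naive_decode(matrix: Sequence[Sequence[int]], results: Sequence[bool]) -> List[int]:
    """Return the items that appear in no negative test (O(n*t) work)."""
    t, n = len(matrix), len(matrix[0])
    candidates = []
    for j in range(n):
        # Item j is cleared as soon as one of its tests is negative.
        if all(results[i] for i in range(t) if matrix[i][j]):
            candidates.append(j)
    return candidates


matrix = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
observed = [False, True, True, False]   # outcome when only item 3 is defective
decoded = naive_decode(matrix, observed)
print(decoded)
```

With more defectives than the design supports, the same routine still returns a superset of the true defectives, which is the behaviour a second testing stage can exploit.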

Fast Decoding: we perform 3 GT schemes: (1) the original, (2) the first projection, (3) the second projection.

Fast Decoding: we first decode the projections. Then we check the d^2 options naively. In [NPR11] we manage to get a scheme with an optimal number of measurements and decoding time O(d^2 log^2 n) (using recursion and 2-stage GT). If we use the 2-stage GT scheme, we will have 4d^2 candidates to check.

Faster Decoding: according to the LW theorem, the number of candidates in the join is d^1.5. In [NPRR12] we show how to do the join in optimal time. This gives a scheme with an optimal number of measurements, which can be decoded in time O(d^(1+ε) poly(log n)).

Compressive Sensing (figure: a t × n measurement matrix)

Syphilis tests detect antibodies to the bacterium that causes syphilis. Someone who was sick in the past might have a negligible amount of antibodies. Different people have different amounts of antibodies. Therefore testing is usually done by thresholding.

Compressive Sensing (figure: the t × n matrix times the unknown vector gives the measurements)

Compressive Sensing, problem definition: find a matrix Ф and an algorithm A s.t.: In [PS12] we gave the first scheme with an optimal number of measurements and sublinear decoding time, for p=q=1. In [GLPS09, GNPRS13] we gave a randomized solution (foreach) for p=q=2 with sublinear decoding.

How Compressive Sensing Helps Massive Recommender Systems. Consider designing a recommender system for web pages: the time a user examines a page is an implicit rating; there are millions of users; each user examines thousands of pages throughout the year; it is hard to store and process the information.

Fingerprint Based Approach: each user a_i with item set C_i gets a fingerprint F_i (F_1 for a_1, C_1; F_2 for a_2, C_2; …; F_n for a_n, C_n). Similarity(a_i, a_j) ...

Sampling Approach: a1 has C1 = {a,c,d,f,h,l,m,n,p,r,s,t}, sample {c,l,t}; a2 has C2 = {a,b,c,f,h,l,m,n,o,p,r,s}, sample {f,m,s}. Regular sampling doesn't work.

Minwise hashing approach: a1 = {a,c,d,f,h,l,m,n,p,r,s,t} gives h(x): 5,3,7,9,2,8; a2 = {a,b,c,f,h,l,m,n,o,p,r,s} gives h(x): 5,4,3,7,2,8. [BHP09, BPR09, BP10, FPS11, FPS12, T13]

Min-wise hash function (figure: sets A and B)

Similarity of A and B: with a min-wise independent hash function we get a ±ε approximation with probability 1-δ.
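
A small sketch of how a min-wise sketch estimates similarity, using the two example sets from the sampling slide. Assumptions to note: the similarity measured here is the Jaccard similarity |A∩B|/|A∪B|, salted SHA-256 is used only as a convenient stand-in for a min-wise independent family, and the names `salted_hash`, `minhash_sketch` and `estimate_similarity` are illustrative.

```python
import hashlib
from typing import Iterable, List


def salted_hash(i: int, x: str) -> int:
    """i-th hash function, simulated by salting SHA-256 with the index i."""
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(items: Iterable[str], k: int) -> List[int]:
    """Keep, for each of the k hash functions, the minimum hash over the set."""
    items = list(items)
    return [min(salted_hash(i, x) for x in items) for i in range(k)]


def estimate_similarity(sketch_a: List[int], sketch_b: List[int]) -> float:
    """Fraction of hash functions on which the two minima agree."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)


C1 = set("acdfhlmnprst")
C2 = set("abcfhlmnoprs")
exact = len(C1 & C2) / len(C1 | C2)          # 10 common items out of 14
estimate = estimate_similarity(minhash_sketch(C1, 400), minhash_sketch(C2, 400))
print(exact, estimate)
```

More hash functions give a tighter estimate; the sketch length k is the knob behind the ±ε and 1-δ guarantee.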

Reducing sketching space [BP10]. Instead of Additional pairwise independent hash. It was discovered independently by Ping Li and Christian Konig.

Reducing sketching space [BP10] Our algorithm estimates

Reducing sketching space even further [BP10]: we are usually interested in the case where the sets are very similar. Assume J > 1-t => p > 1-0.5t. (Figure: A, B, A-B, CS.)

Reducing sketching space even further [BP10]: we are usually interested in the case where the sets are very similar. Assume J > 1-t => p > 1-0.5t. (Figure: A, B, A xor B, CS.) This gives an improvement of

Removing the min-wise independence requirement [BP11]: [KNW10] gave bits sketch for distinct count (F_0). Their sketch is not linear; however, given S(A) and S(B) one can calculate S(A+B) (which gives the size of the union).

Removing the min-wise independence requirement [BP11]: using F_2 instead of F_0 we managed to reduce the sketch size to Using more randomness we managed to remove factor

File sharing The naïve way

File sharing Torrent/Emule/Kazaa

File sharing. Source: Clients: Coupon collector: O(n log n). In practice it could be 7Gb instead of 1Gb.
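
The coupon collector cost can be checked numerically. The sketch below assumes a simplified model that is not spelled out on the slide: a client receives uniformly random packets out of n, one at a time, until it holds all of them. The function names are illustrative.

```python
import random


def expected_draws(n: int) -> float:
    """Expected number of uniform random draws to see all n packets: n * H_n."""
    return n * sum(1.0 / k for k in range(1, n + 1))


def simulate_draws(n: int, rng: random.Random) -> int:
    """One simulated run: draw random packet indices until all n were seen."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws


n = 1000
overhead = expected_draws(n) / n      # how many times the file size is transferred
print(overhead, simulate_draws(n, random.Random(0)) / n)
```

Under this model a file split into 1000 packets costs roughly seven to eight times its size in traffic, which is in line with the "7Gb instead of 1Gb" remark.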

Network coding

Source: packets 1, 2, …, i, …, n. Client 1: 3X_7 + 2X_17, 5X_2 + X_5 + 4X_10, … Client 2: 2X_1 + 3X_3 + X_17, … Client 3: … Client 4: … In a big field, n linear combinations will suffice. We require 1Gb upload for a 1Gb file.
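
A minimal sketch of why n random linear combinations suffice in a big field: the receiver solves a linear system by Gaussian elimination. Assumptions for illustration only: the field is the integers modulo the prime 2^31 - 1, each packet is a single field element rather than a long vector, and `encode`/`decode` are hypothetical names.

```python
import random
from typing import List, Optional, Tuple

P = 2_147_483_647  # the prime 2^31 - 1, standing in for "a big field"


def encode(packets: List[int], rng: random.Random) -> Tuple[List[int], int]:
    """Return random coefficients and the matching linear combination mod P."""
    coeffs = [rng.randrange(P) for _ in packets]
    return coeffs, sum(c * x for c, x in zip(coeffs, packets)) % P


def decode(combos: List[Tuple[List[int], int]], n: int) -> Optional[List[int]]:
    """Gauss-Jordan elimination mod P; None if the combinations have rank < n."""
    rows = [list(c) + [v] for c, v in combos]
    for col in range(n):
        pivot = next((r for r in range(col, len(rows)) if rows[r][col]), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P)          # modular inverse (Python 3.8+)
        rows[col] = [a * inv % P for a in rows[col]]
        for r in range(len(rows)):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


packets = [10, 20, 30, 40]
rng = random.Random(1)
combos = [encode(packets, rng) for _ in range(len(packets))]
decoded = decode(combos, len(packets))
while decoded is None:                            # unlucky singular draw: fetch one more
    combos.append(encode(packets, rng))
    decoded = decode(combos, len(packets))
print(decoded)
```

Any n combinations with independent coefficient vectors work, so a client never waits for one specific missing packet.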

Poison: Torrent/Emule/Kazaa

Signatures against poison: an MD5 signature S_i is computed for each packet i (1, 2, …, i, …, n), and the .torrent file holds S_1, S_2, …, S_n. We might receive a poisoned packet, but we won't forward it.

Signatures in network coding: with MD5, the .torrent file would need S_1, S_2, …, S_n, S(X1+X2), S(X1+X3), … There is an exponential number of options.

Zhao - Homomorphic signature: M is the matrix whose rows are the file packets 1, 2, …, n. We can find a vector u s.t. Mu=0. A correct packet v will be orthogonal to u: <v,u>=0.

Zhao - Homomorphic signature: we can find a vector u s.t. Mu=0. A correct packet v will be orthogonal to u: <v,u>=0. But if Eve knows u then she can find a v which is orthogonal to u. Solution: instead of sending u to everyone, send vector

Zhao - Homomorphic signature: given v which is a linear combination of the file's packets. It requires n+m power operations. In practice it takes more time than downloading.
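
The orthogonality check can be sketched with toy numbers. The public vector is not preserved in this transcript, so the sketch assumes the usual form (g^u_1, …, g^u_(n+m)), with verification testing whether the product of the public entries raised to the packet coordinates equals 1. The tiny parameters P=23, Q=11, G=4 and the example vectors are illustrative and far too small for real use.

```python
from typing import Sequence

P = 23   # toy modulus, P = 2*Q + 1
Q = 11   # prime order of the subgroup generated by G; packet entries live mod Q
G = 4    # element of order Q modulo P

file_vectors = [(1, 0, 2), (0, 1, 3)]   # rows of M (hypothetical file packets)
u = (9, 8, 1)                           # secret vector with M u = 0 (mod Q)
assert all(sum(a * b for a, b in zip(row, u)) % Q == 0 for row in file_vectors)

public_key = [pow(G, ui, P) for ui in u]   # assumed published instead of u itself


def verify(packet: Sequence[int]) -> bool:
    """Accept iff prod(public_key[i] ** packet[i]) == 1 (mod P),
    i.e. iff <packet, u> == 0 (mod Q). One modular power per coordinate."""
    acc = 1
    for h, v in zip(public_key, packet):
        acc = acc * pow(h, v, P) % P
    return acc == 1


good = tuple((3 * a + 5 * b) % Q for a, b in zip(*file_vectors))  # honest combination
bad = good[:-1] + ((good[-1] + 1) % Q,)                           # one poisoned entry
print(verify(good), verify(bad))
```

The loop performs one modular exponentiation per coordinate, which is the n+m power operations that make verification slower than the download itself.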

Selective verification [PW12]: packet i carries two signatures, S'_i and S''_i. If we have both signatures we can choose randomly which to check.

Problem Eve can combine signatures

Solution: use a linear error-correcting code on the packets 1, 2, …, n. We perform a Zhao signature on each block.

Analysis: q^n true combinations (of packets 1, 2, …, n). Legend: = defective (for our GT)

Analysis: Pr[one block passes the test] < q^n / q^(dn) = q^(-(d-1)n). Pr[r/2 out of r pass the test] < 2^r q^(-(d-1)r/2). (Figure labels: dn, n+m, r blocks.)

Analysis: Pr[one block passes the test] < q^n / q^(dn) = q^(-(d-1)n). Pr[r/2 out of r pass the test] < 2^r q^(-(d-1)r/2). Using a union bound, the probability that a bad packet exists is bounded by q^((n+m) + r/log q - (d-1)nr). In practice we improve Zhao's signature by a factor of 60.

Conclusion: group testing/compressive sensing is a very effective tool. We improved both constructions and achieved sublinear decoding time. Surprisingly important applications.