06/10/2015Applied Algorithmics - week81 Combinatorial Group Testing  Much of the current effort of the Human Genome Project involves the screening of.

Slides:

Advertisements

Similar presentations

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.

Advertisements

Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science.

Sampling From a Moving Window Over Streaming Data Brian Babcock * Mayur Datar Rajeev Motwani * Speaker Stanford University.

Mining Data Streams.

Study Group Randomized Algorithms 21 st June 03. Topics Covered Game Tree Evaluation –its expected run time is better than the worst- case complexity.

Resource-oriented Approximation for Frequent Itemset Mining from Bursty Data Streams SIGMOD’14 Toshitaka Yamamoto, Koji Iwanuma, Shoshi Fukuda.

COMP53311 Data Stream Prepared by Raymond Wong Presented by Raymond Wong

SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.

Heavy hitter computation over data stream

Evaluating Search Engine

Models and Security Requirements for IDS. Overview The system and attack model Security requirements for IDS –Sensitivity –Detection Analysis methodology.

Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

Guaranteed Smooth Scheduling in Packet Switches Isaac Keslassy (Stanford University), Murali Kodialam, T.V. Lakshman, Dimitri Stiliadis (Bell-Labs)

Computational Complexity, Physical Mapping III + Perl CIS 667 March 4, 2004.

Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,

Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.

Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.

CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.

Student Seminar – Fall 2012 A Simple Algorithm for Finding Frequent Elements in Streams and Bags RICHARD M. KARP, SCOTT SHENKER and CHRISTOS H. PAPADIMITRIOU.

One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.

Sublinear time algorithms Ronitt Rubinfeld Computer Science and Artificial Intelligence Laboratory (CSAIL) Electrical Engineering and Computer Science.

1 Physical Mapping --An Algorithm and An Approximation for Hybridization Mapping Shi Chen CSE497 04Mar2004.

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

By Graham Cormode and Marios Hadjieleftheriou Presented by Ankur Agrawal ( )

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.

Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.

TinyLFU: A Highly Efficient Cache Admission Policy

Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.6: Linear Models Rodney Nielsen Many of.

1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.

A Formal Analysis of Conservative Update Based Approximate Counting Gil Einziger and Roy Freidman Technion, Haifa.

MS 101: Algorithms Instructor Neelima Gupta

Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

The Bloom Paradox Ori Rottenstreich Joint work with Yossi Kanizo and Isaac Keslassy Technion, Israel.

The Bloom Paradox Ori Rottenstreich Joint work with Isaac Keslassy Technion, Israel.

The Misra Gries Algorithm. Motivation Espionage The rest we monitor.

Data Mining: Concepts and Techniques Mining data streams

CSCE 411H Design and Analysis of Algorithms Set 10: Lower Bounds Prof. Evdokia Nikolova* Spring 2013 CSCE 411H, Spring 2013: Set 10 1 * Slides adapted.

Lower bounds on data stream computations Seminar in Communication Complexity By Michael Umansky Instructor: Ronitt Rubinfeld.

Block-Based Packet Buffer with Deterministic Packet Departures Hao Wang and Bill Lin University of California, San Diego HSPR 2010, Dallas.

SCREAM: Sketch Resource Allocation for Software-defined Measurement Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat (CoNEXT’15)

An Algorithm for the Consecutive Ones Property Claudio Eccher.

Computational Molecular Biology

REU 2009-Traffic Analysis of IP Networks Daniel S. Allen, Mentor: Dr. Rahul Tripathi Department of Computer Science & Engineering Data Streams Data streams.

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

SketchVisor: Robust Network Measurement for Software Packet Processing

Mining Data Streams (Part 1)

Big Data Infrastructure

Algorithms for Big Data: Streaming and Sublinear Time Algorithms

Frequency Counts over Data Streams

The Stream Model Sliding Windows Counting 1’s

The Variable-Increment Counting Bloom Filter

Streaming & sampling.

Computational Molecular Biology

COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.

Optimal Elephant Flow Detection Presented by: Gil Einziger,

Qun Huang, Patrick P. C. Lee, Yungang Bao

Bin Sort, Radix Sort, Sparse Arrays, and Stack-based Depth-First Search CSE 373, Copyright S. Tanimoto, 2002 Bin Sort, Radix.

Approximate Frequency Counts over Data Streams

Feifei Li, Ching Chang, George Kollios, Azer Bestavros

Range-Efficient Computation of F0 over Massive Data Streams

Bin Sort, Radix Sort, Sparse Arrays, and Stack-based Depth-First Search CSE 373, Copyright S. Tanimoto, 2001 Bin Sort, Radix.

Introduction to Stream Computing and Reservoir Sampling

Heavy Hitters in Streams and Sliding Windows

By: Ran Ben Basat, Technion, Israel

Hash Functions for Network Applications (II)

Lu Tang , Qun Huang, Patrick P. C. Lee

(Learned) Frequency Estimation Algorithms

Presentation transcript:

06/10/2015Applied Algorithmics - week81 Combinatorial Group Testing  Much of the current effort of the Human Genome Project involves the screening of large DNA libraries to isolate clones containing a particular DNA sequence(s).  This screening is important for disease-gene mapping and also for large-scale clone mapping.  Efficient screening techniques can facilitate a broad range of basic and applied biological research.

06/10/2015Applied Algorithmics - week82 Combinatorial Group Testing  When screening DNA libraries, several factors determine the cost of identifying desired objects, including: 1) the same library is screened with many different probes. 2) it is expensive to prepare a pool for testing the first time, although once a pool is prepared, it can be screened many times with different probes. 3) screening one pool at a time is expensive. Screening pools in parallel with the same probe is cheaper. 4) there are constraints on pool sizes. If a pool contains too many different clones, then positive pools can become too dilute and could be mislabeled as negative pools.

06/10/2015Applied Algorithmics - week83 Combinatorial Group Testing  In non-adaptive group testing, one must decide exactly which pools to test before any testing occurs.  A non-adaptive group testing algorithm is sometimes referred to as a one-stage algorithm. Every parallel algorithm is non-adaptive.  An alternative two-stage algorithm is a nearly non-adaptive algorithm, in which, an initial batch of simultaneous tests is carried out, then using the information from the first stage, the second batch of simultaneous tests is computed and carried out.  Because of factors 1, 2, and, 3 (see slide 2) weakly-adaptive two- stage algorithms are generally preferred when screening DNA libraries.

06/10/2015Applied Algorithmics - week84 Combinatorial Group Testing  Combinatorial Group Testing refers to the situation in which one is given: A (very) large set of objects O, and an unknown (relatively small) subset P  O.  The task is to determine the content of P by asking minimal number of queries of the type “does P intersect Q?”, where Q is a subset of O.

06/10/2015Applied Algorithmics - week85 Example of Group Testing  Set of coins O={ }  Set P  O, where P={ }  Pool Q  O ; Query “does P  Q   ?” E.g., for Q={1,3,6} the answer is YES (1); And for Q={2,4,5} the answer is NO (0);

06/10/2015Applied Algorithmics - week86 Non-Adaptive CGT  Definition: A d-disjunct matrix, a.k.a, (d,d,n)-selector is defined as any n-column binary matrix M, such that: For any d columns c 1, c 2, …, c d in M there exist d rows r 1, r 2, …, r d, s.t., entries in M available on intersection of selected d columns and rows form a permutation matrix. I.e., it contains all different unit vectors c 1 c 2 ……………… c d r1r1 r2r2 rdrd

06/10/2015Applied Algorithmics - week87 2-disjunct matrix - example  2-disjunct matrix for n =8 items based on binary representation of numbers 0,1,…,  2-disjunct (d-disjunct) matrix can be used to determine P  O of cardinality 1 (d-1)

06/10/2015Applied Algorithmics - week88 Feedback Vector Population vector, where x stands for elements of P Feedback vector x x x-----x

06/10/2015Applied Algorithmics - week89 Evidence against P membership o ---x---- o ---x---- o ----x-- ? --x Feedback vector There is an evidence against membership in P for o items

06/10/2015Applied Algorithmics - week810 Non-adaptive Group Testing  The size of the d-disjunct matrix Lower bound  (d 2 log n/log d) Upper bound O(d 2 log (n/d))  [Dyachkov, Rykov & Rashad (’82, ’89)]  Theorem: The Combinatorial Group Testing problem, with ¦P¦=d-1, can be solved in one stage and O(d 2 log (n/d)) queries/tests.

06/10/2015Applied Algorithmics - week811 Fully Adaptive Group Testing  In contrast there exists fully adaptive Combinatorial Group Testing algorithm that uses O(d log (n/d)) tests (as well as stages)  The upper bound O(d log (n/d)) matches asymptotically the information theory lower bound for Combinatorial Group Testing with d unknown items, which is  (log ( ))  Can we achieve this bound in 2 stages? n d

06/10/2015Applied Algorithmics - week812 (d,m,n)-selectors  Definition: (d,m,n)-selectors is defined as any n- column binary matrix M, such that: For any columns c 1, c 2, …, c d in M there exist m rows r 1, r 2, …, r m, s.t., entries in M available on intersection of selected d columns and m rows form m different rows of permutation matrix of size d x d.  One can prove that there exist (d,m,n)-selectors of size (number of rows) O(d 2 ·log(n/d)/(d-m+1)) Recall that (d,d,n)-selectors correspond to d-disjunct matrices However, do far there is not known efficient deterministic construction of (d,m,n)-selectors!

06/10/2015Applied Algorithmics - week813 Weakly adaptive 2-stage algorithm  Stage 1: Apply (2d,d+1,n)-selector on population with ¦P¦< d Compute feedback vector Generate evidence against membership in P The number of the items without the evidence is < 2d  Stage 2: Check the items without the evidence in < 2d separate pools  Theorem: There is a 2-stage group testing algorithm for that works in time O(d·log(n/d)).

06/10/2015Applied Algorithmics - week814 Weakly adaptive 2-stage algorithm  Proof of the Theorem: In general the size of (d,m,n)-selector is O(d 2 ·log(n/d)/(d-m+1)) and in particular the size of (2d,d+1,n)-selector is O((2d) 2 ·log(n/2d)/(2d-(d+1)+1)) = O(d·log(n/d)). Proof by contradiction. Assume that the number of items without the evidence is ≥ 2d. And consider any 2d items without the evidence and their respective 2d columns in (2d,d+1,n)-selector. At least d+1 items (among 2d) will be separated from each other in the (2d,d+1,n)-selector, so even if d out of d+1 belong to P there is at least one item that should’ve gathered the evidence against membership in P. Which contradicts the assumption.

Streaming data sources Internet traffic monitoring Web logs and click streams Financial and stock market data Retail and credit card transactions Telecom calling records Sensor networks, surveillance RFID Instruction profiling in microprocessors Data warehouses (random access too expensive). 06/10/201515Applied Algorithmics - week8

Internet Traffic Analysis  Usage trends for engineering, provisioning, abuse detection, etc.  Discover sources of large traffic  Items = IP packets  ID = Flow ID E.g., sender’s IP address  Frequent items = Heavy hitters E.g., report all flows that consume more than 1% of the link bandwidth. Counting bytes, instead of number of occurrence. Stream of IP- Packets 06/10/201516Applied Algorithmics - week8

Stream Data  Rapid, continuous arrival: Several million packets/sec  Huge volume: > 50 TB of header data per day for Gigabit router  Real time response  Small memory: fast but costly SRAM  In the sea of data, spot unusual traffic patterns and anomalies Stream of IP- Packets 06/10/201517Applied Algorithmics - week8

18 Streaming model  Motivating Scenarios Data flowing directly from generating source “Infinite” stream cannot be stored Real-time requirements for analysis  Model Stream – at each step can request next input value Assume stream size n is finite/known (fix later) Memory size M << n, possibly O(1) size 06/10/2015Applied Algorithmics - week8

Finding majority in streaming model  Given a sequence of streamed n items and a constant size memory.  In a single pass, decide if some item in the stream is in clear majority (occurs >n/2 times)? n = 12, where item 9 is in clear majority 06/10/201519Applied Algorithmics - week8

Misra-Gries Algorithm  Adopt a single counter count = 0 and associated ID, and iterate if (count>0) then if (new item = stored ID) count++; then count ++ else count --; else store the new item with count = 1.  In the end, if counter > 0, associated ID links to the only candidate ID count /10/201520Applied Algorithmics - week8

A generalization: frequent items Finding k items, each occurring at least n/(k+1) times.  Sketch of the algorithm: maintain k items, and their associated counters; if next item x is one of the k, increment respective counter; else if a zero counter available, associate x with it and set count = 1; else (all counters non-zero) decrement all k counters..count ID k....ID 2 ID 1 ID 06/10/201521Applied Algorithmics - week8

Frequent Elements: Analysis A frequent item’s counter is decremented if all counters are full: it erases k+1 items. If x occurs > n/(k+1) times, then it cannot be completely erased. Similarly, x must get inserted at some point, because there are not enough items to keep it away. 06/10/201522Applied Algorithmics - week8

Problem of False Positives  False positives in Misra-Gries algorithm It identifies all true heavy hitters, but not all reported items are necessarily heavy hitters. How can we tell if the non-zero counters correspond to true heavy hitters or not?  A second pass is needed to verify.  False positives are problematic if heavy hitters are used for billing or punishment.  What guarantees can we achieve in one pass? 06/10/201523Applied Algorithmics - week8

Approximation Guarantees  Find heavy hitters with a guaranteed approximation error  E.g., Manku-Motwani (Lossy Counting) Suppose you want  -heavy hitters, i.e., items with freq >  N An approximation parameter , where  << . (E.g.,  =.01 and  =.0001;  = 1% and  =.01% ) Identify all items with frequency >  N No reported item has frequency < (  -  )N  The algorithm uses O(1/  log (  N)) memory 06/10/201524Applied Algorithmics - week8