Download presentation
1
Group Testing and Its Applications
Hung-Lin Fu 傅 恆 霖 Department of Applied Mathematics National Chiao Tung Univerity (交通大學應用數學系)
2
What is “Group Testing”?
A problem: I have a favor number in my mind and the number is in {1,2,3,…,63}. Can you find the number by asking (me) as few queries as possible?
3
The answer : At most 6 (1) Is the number larger than 32? Yes or Not?
(2) If the answer is “Yes”, then ask if the number is larger than 48 or not. On the other hand, if the answer is “Not”, then ask if the number is larger than 16 or not. Continued …
4
Another Idea 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
5
2 3 6 7 10 11 14 15 18 19 22 23 26 27 30 31 34 35 38 39 42 43 46 47 50 51 54 55 58 59 62 63
6
4 5 6 7 12 13 14 15 20 21 22 23 28 29 30 31 36 37 38 39 44 45 46 47 52 53 54 55 60 61 62 63
7
8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 40 41 42 43 44 45 46 47 56 57 58 59 60 61 62 63
8
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63
9
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
10
Formal Definitions Consider a set N of n items consisting of at most d positive (used to be called defective) items with the others being negative (used to be called good) items. A group test, sometimes called a pool, can be applied to an arbitrary set S of items with two possible outcomes; negative: all items in S are negative; positive: at least one positive item in S, not knowing which or how many.
11
Adaptive (Sequential) and Non-adaptive (Parallel)
Can you see the difference between the above two ways in finding the answer? The first one (adaptive or sequential): You ask the second question (query) after knowing the answer of the first one and continue …. That is, the previous knowledge will be used later. The second one (non-adaptive or parallel): You can ask all the questions (queries) at the same time.
12
Save Money or Time An adaptive (sequential) algorithm conducts the tests one by one and the outcomes of all previous tests can be used to set up the later test. (Save money!?) A non-adaptive algorithm specifies a set of tests in advance so that they can be conducted simultaneously; thus forbidding using the information of previous tests. (Save time!)
13
Blood Testing AIDS Tests (World war II)
14
One by one
15
Red is positive and Black is negative
16
Keep going
17
More positives (72 tests)
18
Test a group at one time
19
Some group is negative!
20
They are done!
21
Save a lot of Money! (Adaptive Idea)
Save 33-3 times
22
Non-adaptive Algorithm
We can use a matrix to describe a non-adaptive algorithm. Items are indexed by columns and the tests are indexed by rows. Therefore, the (i, j) entry is 1 if the item j is included in the pool i (for test), and 0 otherwise.
23
Incidence Matrix 1
24
Can we find positives from the above matrix?
Yes, we can if the number of positives is not too many, say at most 2, by running the 9 tests simultaneously corresponding to rows. The reason is that the union of (at most) 2 columns can not contain any other distinct column. (?) How to implement?
25
Ask at most 9 queries for at most 2 positives (Limited sizes)
1, 4, 7, 10 1, 5, 8, 11 1, 6, 9, 12 2, 4, 9, 11 2, 5, 7, 12 2, 6, 8, 10 3, 4, 8, 12 3, 5, 9, 10 3, 6, 7, 11
26
Human Genome Project
27
Human Genome Project Goals: Milestones:
■ identify all the approximate 30,000 genes in human DNA, ■ determine the sequences of the 3 billion chemical base pairs that make up human DNA, ■ transfer related technologies to the private sector, and ■ address the ethical, legal, and social issues (ELSI) that may arise from the project. Milestones: ■ 1990: Project initiated as joint effort of U.S. Department of Energy and the National Institutes of Health ■ June 2000: Completion of a working draft of the entire human genome ■ February 2001: Analyses of the working draft are published ■ April 2003: HGP sequencing is completed and Project is declared finished two years ahead of schedule U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003
28
How does the human genome stack up?
Organism Genome Size (Bases) Estimated Genes Human (人類) 3 billion 30,000 Laboratory mouse (白老鼠) 2.6 billion Mustard weed (A. thaliana) 100 million 25,000 Roundworm (C. elegans) 97 million 19,000 Fruit fly (果蠅) 137 million 13,000 Yeast (酵母菌) 12.1 million 6,000 Bacterium (大腸桿菌) 4.6 million 3,200 Human immunodeficiency virus (HIV) 9700 9
30
去氧核糖核酸 DNA: Deoxyribonucleic Acid Nucleotide (核苷酸) Adenine (A)
Thymine (T) Guanine (G) Cytosine (C)
31
計算分子生物學簡介 – DNA解序後,需分析訊息,但資料量30億對。 – 需要現代科技輔助。 主要課題 – 序列組合 – 序列分析 – 基因認定 – 生物資訊資料庫、種族樹建構、蛋白質三維結構推測…。
32
序列比對探究 說明 – 找序列中「相似」及「相異」的部份。 – 為何要比對序列? – 為何要使用電腦比對? – 如何使用電腦比對?
– 困難點:序列型態多樣、需建構不同的資料結構及演算法 (Algorithm)。
33
Example of Group Testing
In screening clone library the goal is to determine which clones in the clone library hybridize with a given probe in an efficient fashion. A clone is said to be positive if it hybridize with the probe(探針), and negative otherwise.
34
Complex Model An extension of classical group testing problem.
A set of complexes, each of which is a subset of basic items, is given and the property (positive or negative) of each complex is waiting to be determined. The corresponding group test determines whether a subset contains at least one positive complex (instead of just one positive item).
35
Competitive Group Testing
In classical group testing, the number of positive substances is usually assumed bounded while for competitive group testing, no information on the number of positive substances is given. A learning graph problem can be viewed as a competitive group testing on complex model with each complex of size two. If we are learning hypergraphs, then the size of complex is getting larger.
36
Graph Learning Problem
Let H be a family of labeled graphs (hidden) defined on the set [n] = {1, 2, …, n}. Referring to the information of “belonging to H”, we wish to identify G by edge-detecting queries, each of which tells whether a subset (with some constraints) of [n] induces an edge (or more info.) of G. Who is hiding there?
37
Motivated by applications in DNA physical mapping
38
Whose Idea? In 1996, A. Sorokin et al. proposed the following idea in Genome Research: For physical mapping, some cloned fragments are produced to cover the whole DNA molecule with some gaps and then some of them are assembled to form longer continuous fragments (contigs). However, the information about the mutual placement of the contigs on the DNA sequence is lost.
39
Shotgun Sequencing Shotgun sequencing is a throughput technique resulting in the sequencing of a large number of bacterial genomes, mouse genomes and the celebrated human genomes. In all such projectss, we are left with a collection of contigs that for special reasons cannot be assembled with general assembly algorithms. Continued …
40
Random shotgun approach
genomic segment cut many times at random (Shotgun) 6
41
Whole-genome shotgun sequencing
Short reads are obtained and covering the genome with redundancy and possible gaps. Circular genome
42
Reads are assembled into contigs with unknown relative placement.
43
Primers : (short) fragments of DNA characterizing ends of contigs.
44
A PCR (Polymerase Chain Reaction) reaction reveals if two primers are proximate (adjacent to the same gap). Multiplex PCR can treat multiple primers simultaneously and outputs if there is a pair of adjacent primers in the input set and even sometimes the number of such pairs.
45
Two primers of each contig are “mixed together”
Find a Hamiltonian cycle by PCRs!
46
Primers are treated independently.
Find a perfect matching by PCRs.
47
Goal Our goal is to provide an experimental protocol that identifies all pairs of adjacent primers with as few PCRs (queries) (or multiplex PCRs respectively) as possible.
48
Mathematical Models Hidden Graphs (Reconstructed)
49
Models Multi-vertex model Quantitative multi-vertex model
k-vertex model Quantitative k-multi-vertex model Learning a hidden graph by edge-detecting queries: 8
50
Example 3 4 8 7 1 2 5 6 G :
51
Q({1,2,3,4,5,6,7,8}) = 1 3 4 8 7 1 2 5 6
52
Q({1,2,3,4}) = 0 3 4 8 7 1 2 5 6
53
Q({1,2,3,4,5,7}) = 1 3 4 8 7 1 2 5 6
54
3 4 8 7 1 2 5 6 v = {5}, S \ {v} = {1, 2, 3, 4} Q({1,2,3,4,5}) = 1 v Q({5,1,2}) = 0 Q({5,3}) = 1 5 2 1 4 3 5 2 1 4 3
55
Then, … We can delete the edge and start it over again to find another edge until all the edges are found. Clearly, if the hidden graph we are looking for is of large size compare to n, then this algorithm is not going to be a good one. We may simply ask all the n(n-1)/2 2-subsets of [n].
56
Lower bound (Information theoretic)
Theorem. For any , edge-detecting queries are required to identify a graph drawn from the class of all graphs with vertices and edges. Proof. 18
57
Better Lower Bound! In reconstructing a graph with n vertices and m edges, all graphs of order n and size less than m must be considered as graphs to be reconstructed. (Using computation trees.) Theorem. log(0 i m(n(n-1)/2 chooses i)) edge-detecting queries are required to identify a graph drawn from the family of all graphs with n vertices and m edges.
58
Main Ideas of search We start with finding the maximal matching!
Then, PARTITION_OF_VERTEX_SET (Identify-Bipartite) If there are edges between two independent sets A and B, we may find all of the edges by using (a, B)-algorithm, a A. Finally, find HIDDEN_GRAPH.
59
5 Steps Algorithm
60
Using Designs (Affine Planes)
An affine plane of order q is a balanced incomplete block design, 2 - (q2, q, 1) design, such that the set of blocks can be partitioned into parallel classes. A Steiner triple system of order 9, STS(9), is an example of an affine plane of order 3. It is well-known that the existence of an affine plane of order q is equivalent to the existence of a complete set of mutually orthogonal Latin squares of order q which exists when q is a prime power.
61
Continued … Theorem. For each positive integer n there exists a prime p power such that 2n p2 n. There exists a pairwise balanced design (PBD), 2 – (n, K, 1) design, such that K contains positive integers not greater than p and the design has at most p(p+1) blocks. (?) All edges of a hidden graph defined on [n] are hiding in the blocks!
62
How man queries have we used?
The number of queries is at most m(log n + 3) + ’(G)(log n + 5) + 1 where m is the number of edges in G and ’(G) is the matching number. By testing the p(p+1) blocks of the PBD obtained above first and then reconstructing the edges in each block, we need at most m(log p + 3) + m(log p + 5) + p2 + p queries which is about m(log n) + 9m + 3n queries.
63
Open Problems (Group Testing)
What is the minimum number of tests we need in order to find at most two positives in 10,000 items? (懸賞 NT 2,000 dollars) How about three positives out of 10,000 items? (懸賞 NT 3,000 dollars) How about if mistakes (lies) happened sometime? Say, you face one mistake! (懸賞 NT 5,000 dollars)
64
Work harder!
65
Thank you for your attention!
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.