Group Testing and Its Applications

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Covers, Dominations, Independent Sets and Matchings AmirHossein Bayegan Amirkabir University of Technology.
Minimum Clique Partition Problem with Constrained Weight for Interval Graphs Jianping Li Department of Mathematics Yunnan University Jointed by M.X. Chen.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Graph Classes A Survey 張貿翔 中正大學資訊工程學系. Meeting Room Reservation One meeting room Reservation: –Gary10:00~11:00 –Mary10:30~12:00 –Jones13:30~14:30 –Amy14:00~15:30.
EBI European Bioinformatics Institute. EBI The European Bioinformatics Institute (EBI) part of EMBL is a centre for research and services in bioinformatics.
統計與生物資訊學 1 : 課程簡介 (Introduction) 陳光琦助理教授 (Kuang-Chi Chen)
Introduction to Approximation Algorithms Lecture 12: Mar 1.
Computational problems, algorithms, runtime, hardness
Chapter 0 Computer Science (CS) 計算機概論 教學目標 瞭解現代電腦系統之發展歷程 瞭解電腦之元件、功能及組織架構 瞭解電腦如何表示資料及其處理方式 學習運用電腦來解決問題 認知成為一位電子資訊人才所需之基本條 件 認知進階電子資訊之相關領域.
Hardness Results for Problems P: Class of “easy to solve” problems Absolute hardness results Relative hardness results –Reduction technique.
1 Advanced Chemical Engineering Thermodynamics Chapter 1 The phase equilibrium problem.
1 蛋白質簡介 暨南大學資訊工程系 2003/05/06. 2 蛋白質是由 20 種胺基酸所組成.
8-1 Chapter 8 技術與流程 組織的技術 製造業的核心技術 服務業的核心技術 非核心技術與組織管理 工作流程的相依性.
Approximation Algorithms
鄭瑞興的個人簡介 中山資工所 鄭瑞興.
法律系 系所科助之血淚辛酸史 劉蕙綺. 系上推行困難處 ( 學期初 ) 傳統習慣:法律系以教科書為主 很多老師沒有電子檔案 專、兼任老師使用平台的意願 因老師多為資深老師,因此在使用電腦部 份可能比較需要幫助 通常學生知道訊息的來源是藉由 BBS 或者 是系上的系板,使用意願會降低.
Analyzing Case Study Evidence
1 On the Benefits of Adaptivity in Property Testing of Dense Graphs Joint work with Mira Gonen Dana Ron Tel-Aviv University.
1 Algorithmic Aspects in Property Testing of Dense Graphs Oded Goldreich – Weizmann Institute Dana Ron - Tel-Aviv University.
演算法課程 (Algorithms) 國立聯合大學 資訊管理學系 陳士杰老師 Course 7 貪婪法則 Greedy Approach.
34.NP Completeness Hsu, Lih-Hsing. Computer Theory Lab. Chapter 34P.2 為何學 NP-Complete 50 年前的電腦 ( 一間房間大 ) 可解的問題的極限 與現在電腦 ( 個人電腦 ) 可解的極限理論上是相 同的, 除非電腦的製程有重大突破.
技術與流程 本章內容 組織的技術 製造業的核心技術 服務業的核心技術 非核心技術與組織管理 工作流程的相依性 Chapter 8
資訊教育 吳桂光 東海大學物理系助理教授 Tel: 3467 Office: ST223 Office hour: Tue, Fri. (10-11am)
Reporter gene. Figure DNA chip 簡介 DNA chip 的操作原 理是利用 DNA 序列的配 對特性, 讓樣品與晶片上 對應的鹼基 (base) 進行 雜合反應 (hybridization), 藉以檢測樣品中的基因 表現, 晶片上每平方公分.
國立清華大學資訊工程學系 資訊工程系 2009/11/03P-1 Quiz & Solution 09810CS_ Computer Systems & Application Fall.
1 Recombinant DNA 暨南大學資訊工程系 2003/04/29. 2 大綱 Cut — 限制酶 Paste —DNA 接合酶 Copy —PCR Search —Southern Blotting Reading —Sanger Method.
4-cycle Designs Hung-Lin Fu ( 傅恆霖 ) 國立交通大學應用數學系 Motivation The study of graph decomposition has been one of the most important topics in graph theory.
Goals of the Human Genome Project determine the entire sequence of human DNA identify all the genes in human DNA store this information in databases improve.
1 Physical Mapping --An Algorithm and An Approximation for Hybridization Mapping Shi Chen CSE497 04Mar2004.
From Haystacks to Needles AP Biology Fall Isolating Genes  Gene library: a collection of bacteria that house different cloned DNA fragments, one.
Physical Mapping of DNA Shanna Terry March 2, 2004.
MAPS OF DNA AND INTERVAL GRAPHS by Akshita Gurram.
Advances since Watson & Crick
Approximating the Minimum Degree Spanning Tree to within One from the Optimal Degree R 陳建霖 R 宋彥朋 B 楊鈞羽 R 郭慶徵 R
Graph Theory Chapter 7 Eulerian Graphs 大葉大學 (Da-Yeh Univ.) 資訊工程系 (Dept. CSIE) 黃鈴玲 (Lingling Huang)
Edge-disjoint induced subgraphs with given minimum degree Raphael Yuster 2012.
Graph Theory Chapter 6 Matchings and Factorizations 大葉大學 (Da-Yeh Univ.) 資訊工程系 (Dept. CSIE) 黃鈴玲 (Lingling Huang)
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
Computational Molecular Biology Non-unique Probe Selection via Group Testing.
Pooling designs for clone library screening in the inhibitor complex model Department of Mathematics and Science National Taiwan Normal University (Lin-Kou)
Private Approximation of Search Problems Amos Beimel Paz Carmi Kobbi Nissim Enav Weinreb (Technion)
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Information on Informatics 鍾青萍 P. Pete Chong 長庚大學 資管系客座教授 休士頓城中大學 馬太爾講座教授 ( 資管 )
Nonunique Probe Selection and Group Testing Ding-Zhu Du.
Graph Colouring L09: Oct 10. This Lecture Graph coloring is another important problem in graph theory. It also has many applications, including the famous.
Matrix Completion Problems for Various Classes of P-Matrices Leslie Hogben Department of Mathematics, Iowa State University, Ames, IA 50011
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
Computational Molecular Biology Non-unique Probe Selection via Group Testing.
Implicit Hitting Set Problems Richard M. Karp Erick Moreno Centeno DIMACS 20 th Anniversary.
Graph Theory Chapter 10 Coloring Graphs 大葉大學 (Da-Yeh Univ.) 資訊工程系 (Dept. CSIE) 黃鈴玲 (Lingling Huang)
Great Theoretical Ideas in Computer Science for Some.
Introduction to NP Instructor: Neelima Gupta 1.
An Algorithm for the Consecutive Ones Property Claudio Eccher.
Computational Molecular Biology
Learning Hidden Graphs Hung-Lin Fu 傅 恆 霖 Department of Applied Mathematics Hsin-Chu Chiao Tung Univerity.
Sequencing DNA 交通大學應用數學系 傅恆霖. Human Genome Project.
Group Testing – An Introduction 簡介群試理論
Search Methodology – An Introduction Hung-Lin Fu 傅 恆 霖.
Combinatorial Group Testing 傅恆霖應用數學系. Mathematics? You are you, you are the only one in the world just like you! Mathematics is mathematics, there is.
Group Testing and Its Applications
Group Testing and Its Applications
How to Search Efficiently?
Minimum Spanning Tree 8/7/2018 4:26 AM
Computational Molecular Biology
Grid-Block Designs and Packings
遺傳的原理I: 基因的基礎認識 參考章節: 第十二章 去氧核糖核酸的結構與功能 第十三章 從DNA到蛋白質 第十四章 基因的調控
Learning a hidden graph with adaptive algorithms
Presentation transcript:

Group Testing and Its Applications Hung-Lin Fu 傅 恆 霖 Department of Applied Mathematics National Chiao Tung Univerity (交通大學應用數學系)

What is “Group Testing”? A problem: I have a favor number in my mind and the number is in {1,2,3,…,63}. Can you find the number by asking (me) as few queries as possible?

The answer : At most 6 (1) Is the number larger than 32? Yes or Not? (2) If the answer is “Yes”, then ask if the number is larger than 48 or not. On the other hand, if the answer is “Not”, then ask if the number is larger than 16 or not. Continued …

Another Idea 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

2 3 6 7 10 11 14 15 18 19 22 23 26 27 30 31 34 35 38 39 42 43 46 47 50 51 54 55 58 59 62 63

4 5 6 7 12 13 14 15 20 21 22 23 28 29 30 31 36 37 38 39 44 45 46 47 52 53 54 55 60 61 62 63

8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 40 41 42 43 44 45 46 47 56 57 58 59 60 61 62 63

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63

16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

Formal Definitions Consider a set N of n items consisting of at most d positive (used to be called defective) items with the others being negative (used to be called good) items. A group test, sometimes called a pool, can be applied to an arbitrary set S of items with two possible outcomes; negative: all items in S are negative; positive: at least one positive item in S, not knowing which or how many.

Adaptive (Sequential) and Non-adaptive (Parallel) Can you see the difference between the above two ways in finding the answer? The first one (adaptive or sequential): You ask the second question (query) after knowing the answer of the first one and continue …. That is, the previous knowledge will be used later. The second one (non-adaptive or parallel): You can ask all the questions (queries) at the same time.

Save Money or Time An adaptive (sequential) algorithm conducts the tests one by one and the outcomes of all previous tests can be used to set up the later test. (Save money!?) A non-adaptive algorithm specifies a set of tests in advance so that they can be conducted simultaneously; thus forbidding using the information of previous tests. (Save time!)

Blood Testing AIDS Tests (World war II)

One by one

Red is positive and Black is negative

Keep going

More positives (72 tests)

Test a group at one time

Some group is negative!

They are done!

Save a lot of Money! (Adaptive Idea) Save 33-3 times

Non-adaptive Algorithm We can use a matrix to describe a non-adaptive algorithm. Items are indexed by columns and the tests are indexed by rows. Therefore, the (i, j) entry is 1 if the item j is included in the pool i (for test), and 0 otherwise.

Incidence Matrix 1

Can we find positives from the above matrix? Yes, we can if the number of positives is not too many, say at most 2, by running the 9 tests simultaneously corresponding to rows. The reason is that the union of (at most) 2 columns can not contain any other distinct column. (?) How to implement?

Ask at most 9 queries for at most 2 positives (Limited sizes) 1, 4, 7, 10 1, 5, 8, 11 1, 6, 9, 12 2, 4, 9, 11 2, 5, 7, 12 2, 6, 8, 10 3, 4, 8, 12 3, 5, 9, 10 3, 6, 7, 11

Human Genome Project

Human Genome Project Goals: Milestones: ■ identify all the approximate 30,000 genes in human DNA, ■ determine the sequences of the 3 billion chemical base pairs that make up human DNA, ■ transfer related technologies to the private sector, and ■ address the ethical, legal, and social issues (ELSI) that may arise from the project.   Milestones: ■ 1990: Project initiated as joint effort of U.S. Department of Energy and the National Institutes of Health ■ June 2000: Completion of a working draft of the entire human genome ■ February 2001: Analyses of the working draft are published ■ April 2003: HGP sequencing is completed and Project is declared finished two years ahead of schedule U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003

How does the human genome stack up? Organism Genome Size (Bases) Estimated Genes Human (人類) 3 billion 30,000 Laboratory mouse (白老鼠) 2.6 billion Mustard weed (A. thaliana) 100 million 25,000 Roundworm (C. elegans) 97 million 19,000 Fruit fly (果蠅) 137 million 13,000 Yeast (酵母菌) 12.1 million 6,000 Bacterium (大腸桿菌) 4.6 million 3,200 Human immunodeficiency virus (HIV) 9700 9

去氧核糖核酸 DNA: Deoxyribonucleic Acid Nucleotide (核苷酸) Adenine (A) Thymine (T) Guanine (G) Cytosine (C)

計算分子生物學簡介 – DNA解序後,需分析訊息,但資料量30億對。 – 需要現代科技輔助。 主要課題 – 序列組合 – 序列分析 – 基因認定 – 生物資訊資料庫、種族樹建構、蛋白質三維結構推測…。

序列比對探究 說明 – 找序列中「相似」及「相異」的部份。 – 為何要比對序列? – 為何要使用電腦比對? – 如何使用電腦比對? – 困難點:序列型態多樣、需建構不同的資料結構及演算法 (Algorithm)。

Example of Group Testing In screening clone library the goal is to determine which clones in the clone library hybridize with a given probe in an efficient fashion. A clone is said to be positive if it hybridize with the probe(探針), and negative otherwise.

Complex Model An extension of classical group testing problem. A set of complexes, each of which is a subset of basic items, is given and the property (positive or negative) of each complex is waiting to be determined. The corresponding group test determines whether a subset contains at least one positive complex (instead of just one positive item).

Competitive Group Testing In classical group testing, the number of positive substances is usually assumed bounded while for competitive group testing, no information on the number of positive substances is given. A learning graph problem can be viewed as a competitive group testing on complex model with each complex of size two. If we are learning hypergraphs, then the size of complex is getting larger.

Graph Learning Problem Let H be a family of labeled graphs (hidden) defined on the set [n] = {1, 2, …, n}. Referring to the information of “belonging to H”, we wish to identify G by edge-detecting queries, each of which tells whether a subset (with some constraints) of [n] induces an edge (or more info.) of G. Who is hiding there?

Motivated by applications in DNA physical mapping

Whose Idea? In 1996, A. Sorokin et al. proposed the following idea in Genome Research: For physical mapping, some cloned fragments are produced to cover the whole DNA molecule with some gaps and then some of them are assembled to form longer continuous fragments (contigs). However, the information about the mutual placement of the contigs on the DNA sequence is lost.

Shotgun Sequencing Shotgun sequencing is a throughput technique resulting in the sequencing of a large number of bacterial genomes, mouse genomes and the celebrated human genomes. In all such projectss, we are left with a collection of contigs that for special reasons cannot be assembled with general assembly algorithms. Continued …

Random shotgun approach genomic segment cut many times at random (Shotgun) 6

Whole-genome shotgun sequencing Short reads are obtained and covering the genome with redundancy and possible gaps. Circular genome

Reads are assembled into contigs with unknown relative placement.

Primers : (short) fragments of DNA characterizing ends of contigs.

A PCR (Polymerase Chain Reaction) reaction reveals if two primers are proximate (adjacent to the same gap). Multiplex PCR can treat multiple primers simultaneously and outputs if there is a pair of adjacent primers in the input set and even sometimes the number of such pairs.

Two primers of each contig are “mixed together” Find a Hamiltonian cycle by PCRs!

Primers are treated independently. Find a perfect matching by PCRs.

Goal Our goal is to provide an experimental protocol that identifies all pairs of adjacent primers with as few PCRs (queries) (or multiplex PCRs respectively) as possible.

Mathematical Models Hidden Graphs (Reconstructed)

Models Multi-vertex model Quantitative multi-vertex model k-vertex model Quantitative k-multi-vertex model Learning a hidden graph by edge-detecting queries: 8

Example 3 4 8 7 1 2 5 6 G :

Q({1,2,3,4,5,6,7,8}) = 1 3 4 8 7 1 2 5 6

Q({1,2,3,4}) = 0 3 4 8 7 1 2 5 6

Q({1,2,3,4,5,7}) = 1 3 4 8 7 1 2 5 6

3 4 8 7 1 2 5 6 v = {5}, S \ {v} = {1, 2, 3, 4} Q({1,2,3,4,5}) = 1 v Q({5,1,2}) = 0 Q({5,3}) = 1 5 2 1 4 3 5 2 1 4 3

Then, … We can delete the edge and start it over again to find another edge until all the edges are found. Clearly, if the hidden graph we are looking for is of large size compare to n, then this algorithm is not going to be a good one. We may simply ask all the n(n-1)/2 2-subsets of [n].

Lower bound (Information theoretic) Theorem. For any , edge-detecting queries are required to identify a graph drawn from the class of all graphs with vertices and edges. Proof. 18

Better Lower Bound! In reconstructing a graph with n vertices and m edges, all graphs of order n and size less than m must be considered as graphs to be reconstructed. (Using computation trees.) Theorem. log(0  i  m(n(n-1)/2 chooses i)) edge-detecting queries are required to identify a graph drawn from the family of all graphs with n vertices and m edges.

Main Ideas of search We start with finding the maximal matching! Then, PARTITION_OF_VERTEX_SET (Identify-Bipartite) If there are edges between two independent sets A and B, we may find all of the edges by using (a, B)-algorithm, a  A. Finally, find HIDDEN_GRAPH.

5 Steps Algorithm

Using Designs (Affine Planes) An affine plane of order q is a balanced incomplete block design, 2 - (q2, q, 1) design, such that the set of blocks can be partitioned into parallel classes. A Steiner triple system of order 9, STS(9), is an example of an affine plane of order 3. It is well-known that the existence of an affine plane of order q is equivalent to the existence of a complete set of mutually orthogonal Latin squares of order q which exists when q is a prime power.

Continued … Theorem. For each positive integer n there exists a prime p power such that 2n  p2  n. There exists a pairwise balanced design (PBD), 2 – (n, K, 1) design, such that K contains positive integers not greater than p and the design has at most p(p+1) blocks. (?) All edges of a hidden graph defined on [n] are hiding in the blocks!

How man queries have we used? The number of queries is at most m(log n + 3) + ’(G)(log n + 5) + 1 where m is the number of edges in G and ’(G) is the matching number. By testing the p(p+1) blocks of the PBD obtained above first and then reconstructing the edges in each block, we need at most m(log p + 3) + m(log p + 5) + p2 + p queries which is about m(log n) + 9m + 3n queries.

Open Problems (Group Testing) What is the minimum number of tests we need in order to find at most two positives in 10,000 items? (懸賞 NT 2,000 dollars) How about three positives out of 10,000 items? (懸賞 NT 3,000 dollars) How about if mistakes (lies) happened sometime? Say, you face one mistake! (懸賞 NT 5,000 dollars)

Work harder!

Thank you for your attention!