Bioinformatics, Vol.17 Suppl.1 (ISMB 2001). Probe Selection Algorithms with Applications in the Analysis of Microbial Communities. James Borneman et al. Summarized by Sun Kim.
Overview: The goal is to minimize the number of oligonucleotide probes needed to analyze populations of rRNA clones by hybridization experiments on DNA microarrays. Two heuristics based on optimization techniques are proposed: simulated annealing and Lagrangian relaxation.
Introduction: Analysis using rRNA genes: DGGE (Denaturing Gradient Gel Electrophoresis), T-RFLP (Terminal Restriction Fragment Length Polymorphism). Goal: to develop a high-throughput approach for the examination of microbial communities. Adopted strategy: oligonucleotide fingerprinting.
Oligonucleotide fingerprinting: (1) The rDNA clone libraries are constructed. (2) The clones are classified into clone types, or OTUs (Operational Taxonomic Units), by individual hybridization experiments on DNA microarrays with a series of short DNA oligonucleotides. (3) The nucleotide sequence of representative clones from each OTU can be obtained by DNA sequencing.
Work on Probe Selection: Oligonucleotide fingerprinting yields, for each clone, a binary vector (called a fingerprint) that describes which probes occur in the clone. The arrays provide a linear fluorescence response over a range of 0-4 occurrences of a probe sequence per clone, but this response is not consistent enough to provide statistically reliable information. Nevertheless, a non-binary model is also adopted in the strategy. Two models are considered: binary membership, and frequency of occurrences up to 4.
Basic Probe Selection Problem: Given a population C of m unknown rDNA clones, to analyze C we need to choose a set S of oligonucleotide probes of length l (clones: approximately 1500; l: between 6 and 10). A probe p distinguishes a pair of clones c and d if p is a substring of exactly one of c or d. Goal: find a smallest set S of length-l probes such that any two distinct clones c and d from C are distinguished by at least one probe in S.
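The definition of "distinguishes" above can be illustrated with a minimal Python sketch. The function names and the toy sequences are hypothetical and chosen only for illustration; real clones are far longer and probes have length 6 to 10.

```python
from itertools import combinations

def distinguishes(probe, c, d):
    """True if probe is a substring of exactly one of the clones c and d."""
    return (probe in c) != (probe in d)

def undistinguished_pairs(probes, clones):
    """Return index pairs of clones that no probe in `probes` distinguishes."""
    return [
        (i, j)
        for (i, c), (j, d) in combinations(enumerate(clones), 2)
        if not any(distinguishes(p, c, d) for p in probes)
    ]

# Toy data (hypothetical): three short "clones" and length-3 probes.
clones = ["ACGTAC", "ACGGTT", "TTGACG"]
one_probe = undistinguished_pairs(["GTA"], clones)
two_probes = undistinguished_pairs(["GTA", "GGT"], clones)
print(one_probe)   # pairs left undistinguished by {"GTA"}
print(two_probes)  # pairs left undistinguished by {"GTA", "GGT"}
```

A probe set S is a valid solution to the basic problem exactly when `undistinguished_pairs(S, clones)` is empty.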
Difficulties: We do not know the rDNA sequences in the population, so how can we compute the minimal probe set? Even if we did have the complete sequences of these clones, computing optimal probe sets for large data sets is computationally infeasible. A two-step approach is proposed to overcome these difficulties.
Two-step Approach: Step 1: choose a random subset C’ of t rDNA clones from the given population, where t is a parameter chosen by empirical study, and sequence the clones in C’. Step 2: compute an optimal, or near-optimal, probe set S for C’, and use S for analyzing the whole clone population. Intuition: if the random subset C’ is large enough, the computed probe set S will be close to optimal for the whole population. C’ may be augmented with known rDNA sequences available in databases (GenBank, Ribosomal Database). This paper focuses on Step 2.
Formulations of Probe Selection: MCPS (Minimum Cost Probe Set): the minimum number of probes that distinguish all given clones; approached with Lagrangian relaxation. MDPS (Maximum Distinguishing Probe Set): a set of k probes, where k is given, that maximizes the number of distinguished pairs of clones; approached with simulated annealing. Both are variants of the combinatorial optimization problem SET COVER [Hochbaum, 1997].
Previous Work: Selection criteria used so far: G+C content of the oligomers combined with their expected frequency [Fu et al., 1992; Cuticchia et al., 1993]; frequencies of the oligomers in the clones [Drmanac et al., 1996]; free energy and melting temperature [Li and Stormo, 2000]; information theory (entropy maximization) [Herwig et al., 2000]. This work: the first formulation of probe selection as an explicit optimization problem.
Formulations of Probe Selection and Optimization Techniques: Notation: C is the set of clones, together with a set of preselected length-l probes. For a probe p and a clone c, we count the number of occurrences of p in c. Given a set S of probes, the S-fingerprint of c is the vector of these values for the probes in S. A set S distinguishes two clones c and d if their S-fingerprints differ. The last quantity defined is the set of pairs of clones that are distinguished by S.
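The notation above can be made concrete with a short Python sketch of S-fingerprints under the two models from the earlier slide. The names `occurrences` and `fingerprint` are hypothetical. The sketch also assumes that occurrences may overlap and that "up to 4" means counts are capped at 4; the slides do not state either detail.

```python
def occurrences(probe, clone):
    """Count occurrences of probe in clone (overlaps allowed: an assumption)."""
    last_start = len(clone) - len(probe)
    return sum(clone.startswith(probe, i) for i in range(last_start + 1))

def fingerprint(probes, clone, binary=True, cap=4):
    """S-fingerprint of a clone: one value per probe in `probes`.

    binary=True  -> 1 if the probe occurs in the clone, else 0
    binary=False -> occurrence count, capped at `cap` (assumed reading of "up to 4")
    """
    counts = [occurrences(p, clone) for p in probes]
    if binary:
        return tuple(int(n > 0) for n in counts)
    return tuple(min(n, cap) for n in counts)

# Toy data (hypothetical): the probe occurs twice in c and once in d.
c, d = "ACACAC", "ACAGGG"
S = ["ACA"]
same_binary = fingerprint(S, c) == fingerprint(S, d)
same_counts = fingerprint(S, c, binary=False) == fingerprint(S, d, binary=False)
print(same_binary, same_counts)
```

In this toy case the binary fingerprints coincide while the count-based fingerprints differ, which is the sense in which the non-binary model can distinguish more pairs with the same probes.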
Formulations: MCPS is a special case of SET COVER [Hochbaum, 1997]. MDPS is a special case of MAXIMUM COVERAGE [Hochbaum, 1997].
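To see the connection to SET COVER, take the pairs of clones as the elements to be covered and let each probe stand for the set of pairs it distinguishes. The sketch below builds these sets for the binary model and runs the textbook greedy covering heuristic as a baseline. It is not claimed to be the algorithm of the paper or of Herwig et al.; all names and data are hypothetical.

```python
from itertools import combinations

def distinguished_pairs(probe, clones):
    """Set of clone index pairs (i, j) that `probe` distinguishes (binary model)."""
    return {
        (i, j)
        for i, j in combinations(range(len(clones)), 2)
        if (probe in clones[i]) != (probe in clones[j])
    }

def greedy_probe_set(candidates, clones, k=None):
    """Textbook greedy heuristic: repeatedly pick the probe covering most new pairs.

    k=None stops when every pair is covered (MCPS-style);
    a given k stops after k probes (MDPS-style).
    """
    cover = {p: distinguished_pairs(p, clones) for p in candidates}
    uncovered = set(combinations(range(len(clones)), 2))
    chosen = []
    while uncovered and (k is None or len(chosen) < k):
        best = max(candidates, key=lambda p: len(cover[p] & uncovered))
        if not cover[best] & uncovered:
            break  # remaining pairs cannot be distinguished by any candidate
        chosen.append(best)
        uncovered -= cover[best]
    return chosen, uncovered

# Toy data (hypothetical).
clones = ["ACGTAC", "ACGGTT", "TTGACG"]
candidates = ["ACG", "GTA", "GGT", "TTG"]
full, left_full = greedy_probe_set(candidates, clones)
one, left_one = greedy_probe_set(candidates, clones, k=1)
print(full, left_full)
print(one, left_one)
```

Note that the number of elements grows quadratically with the number of clones, which is why exact methods quickly become impractical.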
The Simulated Annealing Algorithm for MDPS: Neighborhood: two sets of probes are neighbors if one can be obtained from the other by substituting exactly one of the probes. Variants, according to the objective function used: SA+entropy, SA+pairs, SA+Largest.
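A minimal Python sketch of simulated annealing with this neighborhood follows. It uses the number of distinguished pairs as the objective, in the spirit of the SA+pairs variant. The cooling schedule, starting temperature, step count, and all names are assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from itertools import combinations

def num_distinguished(probes, clones):
    """Number of clone pairs whose binary fingerprints under `probes` differ."""
    prints = [tuple(p in c for p in probes) for c in clones]
    return sum(a != b for a, b in combinations(prints, 2))

def anneal(candidates, clones, k, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for k probes maximizing the number of distinguished pairs."""
    rng = random.Random(seed)
    current = rng.sample(candidates, k)
    score = num_distinguished(current, clones)
    best, best_score = list(current), score
    temp = t0
    for _ in range(steps):
        outside = [p for p in candidates if p not in current]
        if not outside:
            break
        # Neighbor: substitute exactly one probe of the current set.
        neighbor = list(current)
        neighbor[rng.randrange(k)] = rng.choice(outside)
        new_score = num_distinguished(neighbor, clones)
        delta = new_score - score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, score = neighbor, new_score
            if score > best_score:
                best, best_score = list(current), score
        temp *= cooling  # geometric cooling (an assumed schedule)
    return best, best_score

# Toy data (hypothetical): candidates are all length-3 substrings of the clones.
clones = ["ACGTAC", "ACGGTT", "TTGACG", "GGTACG"]
candidates = sorted({c[i:i + 3] for c in clones for i in range(len(c) - 2)})
best, best_score = anneal(candidates, clones, k=2)
print(best, best_score)
```

Swapping `num_distinguished` for another scoring function would give a different variant without changing the search loop.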
The Lagrangian Relaxation Algorithm for MCPS: LRSOLUTION: compute an optimal solution to the Lagrangian relaxation for a given Lagrangian multiplier. FEASIBLEEXTENSION: extend the solution obtained from LRSOLUTION to a feasible solution.
Subgradient optimization: finding a good multiplier vector.
LR algorithm: the constraint matrix is very large.
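The interplay of these pieces can be sketched with the standard textbook Lagrangian relaxation of unit-cost set cover, where each pair of clones contributes one covering constraint and one multiplier. This is a generic illustration, not the paper's implementation: the function names only echo LRSOLUTION and FEASIBLEEXTENSION, and the greedy extension, the diminishing step size, and the toy data are assumptions.

```python
def lr_solution(cover, lam):
    """Solve the relaxed problem for fixed multipliers: pick probes with negative reduced cost."""
    reduced = {p: 1.0 - sum(lam[e] for e in pairs) for p, pairs in cover.items()}
    chosen = {p for p, r in reduced.items() if r < 0}
    bound = sum(lam.values()) + sum(reduced[p] for p in chosen)  # lower bound
    return chosen, bound

def feasible_extension(cover, chosen, elements):
    """Greedily add probes until every pair is covered (an assumed strategy)."""
    chosen = set(chosen)
    uncovered = set(elements) - set().union(*(cover[p] for p in chosen))
    while uncovered:
        best = max(cover, key=lambda p: len(cover[p] & uncovered))
        if not cover[best] & uncovered:
            raise ValueError("some pairs cannot be distinguished by any probe")
        chosen.add(best)
        uncovered -= cover[best]
    return chosen

def subgradient(cover, elements, iterations=100, step=1.0):
    """Subgradient optimization over the multiplier vector, one multiplier per pair."""
    lam = {e: 0.0 for e in elements}
    best_set, best_bound = None, float("-inf")
    for it in range(1, iterations + 1):
        chosen, bound = lr_solution(cover, lam)
        best_bound = max(best_bound, bound)
        feasible = feasible_extension(cover, chosen, elements)
        if best_set is None or len(feasible) < len(best_set):
            best_set = feasible
        for e in elements:
            # Subgradient component: violation of the covering constraint for pair e.
            g = 1 - sum(1 for p in chosen if e in cover[p])
            lam[e] = max(0.0, lam[e] + (step / it) * g)
    return best_set, best_bound

# Toy instance (hypothetical): each probe maps to the clone pairs it distinguishes.
cover = {
    "GTA": {(0, 1), (0, 2)},
    "GGT": {(0, 1), (1, 2)},
    "TTG": {(0, 2), (1, 2)},
}
elements = [(0, 1), (0, 2), (1, 2)]
best_set, best_bound = subgradient(cover, elements)
print(sorted(best_set), best_bound)
```

The gap between `best_bound` and `len(best_set)` gives a rough measure of how far the feasible probe set may be from optimal. With one constraint per pair of clones, a real instance has a very large constraint matrix, so an explicit dictionary like `cover` would need a more compact representation.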
Experimental Results: Data sets: (1) 1158 small-subunit ribosomal genes from GenBank; (2) 131 large-subunit ribosomal genes from the Ribosomal Database Project II; (3) 5000 eubacteria samples; (4) 2000 eubacteria samples.
Data set 1
Data set 3 Binary distinguishability
Data set 3 Non-binary distinguishability
Data set 4
Results of the LR algorithm on data sets 1 and 2
Conclusions: Two heuristics are presented, based on simulated annealing and Lagrangian relaxation. They give promising results in comparison with the greedy algorithm [Herwig et al., 2000]. Future work: some variants of the algorithms; speeding up the LR algorithm.