1
Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays
Nazif Cihan Tas, ctas@cs
CMSC 838 Presentation
2
CMSC 838T – Presentation
Motivation
- DNA microarray techniques are used intensively for the identification of biological agents
  - Gene expression studies
  - Diagnostic purposes
    - Identification of microorganisms in samples
  - Item extraction
- Complex problem
  - Find the necessary probes and the temperature
  - Probe sets should reliably detect and differentiate target sequences
  - Large databases
  - NEW!! Homologous genes (how to find specific probes)
3
Talk Overview
- Overview of talk
  - Motivation
  - Problem Statement
  - Algorithm
  - Mathematical Aspects
  - Experimentation
  - Discussion
4
Problem Statement
- Specificity vs. Sensitivity
  - Specificity: the number of non-target matches is minimized
  - Sensitivity: the number of selected target sequences is maximized
- Original Problem:
5
Problem Statement
- Positive Probes
  - Database set S0, target set S1
  - For each sequence in S1, find at least one probe
  - For sequences in S0 - S1, try to avoid hybridization (but tolerate it if it happens)
  - High specificity: the number of non-target matches is minimized
  - High sensitivity: the number of covered target sequences is maximized
6
Problem Statement
- Negative Probes
  - Determine as few probes as possible which together hybridize with all sequences in S0 - S1 but with NONE in S1
  - High specificity: no sequence in S1 may hybridize
  - High sensitivity: the maximum number of sequences in S0 - S1 is covered
7
Problem cont.
- Extend Problem
- Specificity vs. Sensitivity
  - Specificity: no sequence in S1 may cross-hybridize with any negative probe
  - Sensitivity: the number of sequences covered in B must be maximized
8
Probe Design Constraints
- Sequence Related
  - Length of probes
  - Deviation of the melting temperatures of probe-target hybrids must be low (for physical reasons)
  - No self-complementary regions longer than four nucleotides (not descriptive enough)
  - The difference between the melting temperatures of target and non-target sequences must be larger than a predefined threshold (if too close, too hard to identify)
    - Ensuring a minimum number of mismatches is enough (homologous sequences)
- System Related
  - Execution time
  - Usability
9
Algorithm
- Overview
  - Probe Generation
  - Hybridization Prediction
  - Probe Selection
10
Algorithm: Probe Generation
- Subproblem: generate probe candidates for the sequences
  - Keep the set as small as possible without losing any optimal candidate (exclude infeasible ones)
- Suffix Tree
  - Why?
    - Allows fast recognition of repetitive subsequences
    - Identifies non-unique probes (i.e. with more than one target)
    - Efficient for memory and for T computation (reduces time)
  - How?
    - Tree is constructed from the sequences
    - Traversed (Watson-Crick complement)
11
Suffix Tree
- Input: TACTACA
  - Suffixes: TACTACA, ACTACA, CTACA, TACA, ACA, CA, A
- $ denotes end of string
- Constructed in linear time
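To make the suffix list on this slide concrete, here is a minimal sketch in Python. It builds a naive suffix trie, which takes quadratic time and space, not the linear-time suffix tree the slide refers to; the function names are illustrative only. Counting the leaves below a node shows how repetitive (non-unique) subsequences can be recognized.

```python
def suffixes(text):
    """Return all suffixes of text, longest first."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text, end="$"):
    """Naive suffix trie as nested dicts; end marks the end of the string."""
    text = text + end
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def count_occurrences(trie, pattern):
    """Number of positions at which pattern occurs in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    # Each leaf below this node is one suffix that starts with pattern.
    stack, leaves = [node], 0
    while stack:
        cur = stack.pop()
        if not cur:
            leaves += 1
        stack.extend(cur.values())
    return leaves


trie = build_suffix_trie("TACTACA")
print(suffixes("TACTACA"))
print(count_occurrences(trie, "TAC"))  # repeated subsequence
print(count_occurrences(trie, "ACA"))  # unique subsequence
```

A subsequence with more than one occurrence (here TAC) would be a non-unique probe candidate within this single sequence.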
12
Probe Generation
- Further Improvements
  - Filters applied for cut-off
    - Probe length (predefined)
    - G-C content (for temperature)
    - Self-complementarity
- Probes should not contain complements as subsequences
  - Finally, remove highly conserved (non-specific) regions
  - Insert into hashtables according to their lengths
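The cut-off filters above could be sketched as follows. This is an illustration under assumptions, not the presented implementation: the thresholds (`length=20`, `gc_min=0.4`, `gc_max=0.6`) are hypothetical defaults, and self-complementarity is approximated by checking whether any window of five nucleotides also occurs as its own reverse complement inside the probe (i.e. a self-complementary region longer than four nucleotides).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Watson-Crick reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of G and C nucleotides in seq."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_self_complement(seq, max_len=4):
    """True if seq contains a self-complementary region longer than max_len."""
    k = max_len + 1
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return any(reverse_complement(m) in kmers for m in kmers)


def passes_filters(probe, length=20, gc_min=0.4, gc_max=0.6, max_self=4):
    """Apply the length, G-C content and self-complementarity filters.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if len(probe) != length:
        return False
    if not gc_min <= gc_content(probe) <= gc_max:
        return False
    return not has_self_complement(probe, max_self)


print(passes_filters("ACCTGAACCTGAACCTGAAC"))
print(passes_filters("ACGTACGTACGTACGTACGT"))
```

Probes that survive the filters would then be grouped by length, for example in one hash set per probe length.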
13
Algorithm
14
Algorithm: Hybridization Prediction
- Subproblem: search for the right probe
  - Search is expensive, so intelligent hashing is used
- Design
  - A frame is moved over target and non-target sequences with several lengths
    - Previous algorithm (Kaderali 2002): use the suffix tree
  - At each step, hash values are calculated
  - On a hit, predict the melting temperature and store it in the hybridization matrix
  - If there are too many hits for a probe, it is not unique; remove it
  - Why intelligent?
    - Hash time is linear
    - Allows inexact matching because of hashing (no analysis)
- Parallelization
  - Several threads search for probe targets
    - Tree and hashtables are fixed
  - One thread writes to the final matrix
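The frame-and-hash search described above might look roughly like this. The sketch makes several simplifying assumptions: probes are stored in the orientation of the target strand, matching is exact only, melting temperature prediction is left out (a hit is just recorded), and `max_hits` is a hypothetical parameter standing in for the "too many hits" rule.

```python
from collections import defaultdict


def build_probe_tables(probes):
    """Group probe candidates into one hash set per probe length."""
    tables = defaultdict(set)
    for probe in probes:
        tables[len(probe)].add(probe)
    return tables


def find_hits(tables, sequences, max_hits=None):
    """Slide a frame of each probe length over every sequence.

    Returns a dict mapping each probe to the set of sequence ids it hits.
    Probes with more than max_hits hits are treated as non-unique and dropped.
    """
    hits = defaultdict(set)
    for seq_id, seq in sequences.items():
        for length, table in tables.items():
            for i in range(len(seq) - length + 1):
                frame = seq[i:i + length]
                if frame in table:  # hash lookup
                    hits[frame].add(seq_id)
    if max_hits is not None:
        hits = {p: s for p, s in hits.items() if len(s) <= max_hits}
    return dict(hits)


sequences = {"s1": "TACTACA", "s2": "GGTACGG", "s3": "AAAA"}
tables = build_probe_tables(["TAC", "ACA", "GGG"])
print(find_hits(tables, sequences))
print(find_hits(tables, sequences, max_hits=1))
```

Because the tables are only read during the search, several worker threads could scan different sequences at once, which matches the fixed tree and hashtables noted on the slide.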
15
Hybridization Prediction
- Empirical Simulation:
  - One million random probe-target pairings generated
  - Four mismatches, or one insertion or deletion plus one strong central mismatch, chosen
  - T < 20 °C for 93%
  - Complexity is O(|S0| |S1|)
    - The number of possible probe candidates is linear in |S1|
    - Each position of database S0 must be checked
16
Algorithm
17
Algorithm: Hybridization Prediction
- Complexity
- Inexact equality
  - Only the inner three bands of the DP matrix are computed
  - O(l), where l is the probe length
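The three-band idea can be illustrated with a banded edit distance. This is only a stand-in, assuming unit costs for mismatches, insertions and deletions; the scoring actually used in the presented method is not given on the slide. With `band=1` only the main diagonal and its two neighbours are filled, so the work is proportional to the sequence length.

```python
def banded_edit_distance(a, b, band=1):
    """Edit distance restricted to cells with |i - j| <= band.

    With band=1 only the inner three bands of the DP matrix are computed.
    The result is an upper bound on the unrestricted edit distance and is
    infinite when no alignment fits inside the band.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return inf
    # Row 0: aligning the empty prefix of a with b[:j] costs j insertions.
    prev = {j: j for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev.get(m, inf)


print(banded_edit_distance("ACGT", "AGGT"))  # one mismatch
print(banded_edit_distance("ACGT", "ACT"))   # one deletion
```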
18
Algorithm: Probe Selection
- Subproblem: use the hybridization matrix to finalize the probe selection
  - We have positive probes and negative probes to proceed with
- Algorithm
  - Analysis: for each probe candidate
    - g: number of matches in S1
    - b: number of matches in S0 - S1
    - t: highest melting point in S1
  - Probes for which the g or b value is too large are removed
  - Sort according to g, b and t
  - Apply depth-first search
- Advantages
  - Performs well (no comparison though)
  - Guarantees to choose all specific probes if any were found
- Disadvantages
  - Can NOT guarantee optimal selection in terms of coverage
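The analysis, filtering and sorting steps can be sketched as below; the depth-first search is not shown. The data layout (a dict mapping each probe to the sequences it hits and their predicted melting temperatures), the thresholds `max_g` and `max_b`, and the ascending sort order are all assumptions made for illustration, since the slide does not specify them.

```python
def probe_stats(hybridization, s1):
    """Compute (g, b, t) per probe from a probe -> {sequence id: Tm} mapping.

    g: matches in S1, b: matches outside S1, t: highest melting point in S1.
    """
    stats = {}
    for probe, matches in hybridization.items():
        target_tms = [tm for seq, tm in matches.items() if seq in s1]
        g = len(target_tms)
        b = len(matches) - g
        t = max(target_tms) if target_tms else float("-inf")
        stats[probe] = (g, b, t)
    return stats


def preselect(hybridization, s1, max_g, max_b):
    """Drop probes whose g or b is too large, then sort by (g, b, t).

    max_g, max_b and the ascending order are hypothetical choices.
    """
    stats = probe_stats(hybridization, s1)
    kept = [p for p, (g, b, _) in stats.items() if g <= max_g and b <= max_b]
    return sorted(kept, key=lambda p: stats[p])


hybridization = {
    "p1": {"a": 60.0, "x": 40.0},
    "p2": {"a": 58.0, "b": 61.0},
    "p3": {"b": 55.0, "x": 50.0, "y": 45.0},
}
print(probe_stats(hybridization, {"a", "b"}))
print(preselect(hybridization, {"a", "b"}, max_g=2, max_b=1))
```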
19
Negative Probe Selection
- Let S2 = S0 - S1 and let B be a subset of S2. The probes that detect S1 also detect some elements of B.
- Algorithm for Negative Probes
  - Apply probe generation and preselection to B
  - Conduct hybridization on B ∪ S1
  - Remove the probes which hybridize with S1
  - Sort the remaining probes according to their hit number
  - Successively select the probes which cover the most target sequences
- Guarantees optimal solution for coverage and probe number usage
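The negative probe steps above can be sketched as a greedy selection. This is an illustration under assumptions: the hit sets are taken as given (a dict from probe to the sequence ids it hybridizes with), ties are broken alphabetically, and the function name is hypothetical.

```python
def select_negative_probes(hits, s1, b):
    """Greedily pick negative probes that cover B without touching S1.

    hits maps each probe to the set of sequence ids it hybridizes with.
    Returns the selected probes and any sequences of B left uncovered.
    """
    s1, b = set(s1), set(b)
    # Remove every probe that hybridizes with a sequence in S1.
    candidates = {p: set(t) & b for p, t in hits.items() if not set(t) & s1}
    uncovered = set(b)
    selected = []
    while uncovered:
        # Pick the probe covering the most still-uncovered sequences.
        best = max(
            sorted(candidates),
            key=lambda p: len(candidates[p] & uncovered),
            default=None,
        )
        if best is None or not candidates[best] & uncovered:
            break  # no remaining probe adds coverage
        selected.append(best)
        uncovered -= candidates[best]
    return selected, uncovered


hits = {
    "n1": {"b1", "b2", "b3"},
    "n2": {"b3", "b4"},
    "n3": {"b4", "s1"},  # hybridizes with S1, so it is removed
    "n4": {"b1"},
}
print(select_negative_probes(hits, {"s1"}, {"b1", "b2", "b3", "b4"}))
```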
20
Algorithm: Probe Selection
21
Mathematical Aspects
22
Mathematical Aspects
23
Mathematical Aspects
24
Experimentation
- Parallelized on SMP platform
  - Classic worker-producer
  - Intel Dual Pentium III 933 MHz, 1 GB memory
- Test data
  - ssu rRNA of the ARB project
  - 20,282 ssu rRNA sequences
  - 1,401 < lengths < 4,179
  - 97% of them are shorter than 2,000 bases
25
Discussion
- High Performance
  - Execution time is linear in the size of the database and decreases if longer probes are used
- Low Memory Consumption
  - Depends on the size of the sequence selection, NOT on the database size
- Automatic Design of Group Probes and Negative Probes
- High Quality Probe Design
26
Discussion
- Comparison with previous work
  - vs. ARB
    - Not suited for large-scale probe design
  - vs. LCF
    - Does not consider highly conserved data
    - Memory consumption is high
    - Works well with short probes only
  - vs. others
    - Mostly cannot deal with insertions and deletions
    - Execution is slow