Selection of Optimal DNA Oligos for Gene Expression Arrays Reporter : Wei-Ting Liu Date : Nov. 7 2003.

Slides:



Advertisements
Similar presentations
Fast Algorithms For Hierarchical Range Histogram Constructions
Advertisements

Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.
Improved Models and Algorithms for Universal DNA Tag Systems continued … a.k.a. what did we do?
Modeling sequence dependence of microarray probe signals Li Zhang Department of Biostatistics and Applied Mathematics MD Anderson Cancer Center.
Genomic Repeat Visualisation Using Suffix Arrays Nava Whiteford Department of Chemistry University of Southampton
3 -1 Chapter 3 String Matching String Matching Problem Given a text string T of length n and a pattern string P of length m, the exact string matching.
1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.
Profiles for Sequences
Pattern Discovery in RNA Secondary Structure Using Affix Trees (when computer scientists meet real molecules) Giulio Pavesi& Giancarlo Mauri Dept. of Computer.
Finding approximate palindromes in genomic sequences.
Probe design for microarrays using OligoWiz. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical.
Improved Tag Set Design and Multiplexing Algorithms for Universal Arrays Ion Mandoiu Claudia Prajescu Dragos Trinca Computer Science & Engineering Department.
Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif Cihan Tas CMSC 838 Presentation.
© 2006 Pearson Addison-Wesley. All rights reserved13 B-1 Chapter 13 (excerpts) Advanced Implementation of Tables CS102 Sections 51 and 52 Marc Smith and.
Modern Information Retrieval Chapter 4 Query Languages.
Physical Mapping II + Perl CIS 667 March 2, 2004.
Introduction to computational genomics – hands on course Gene expression (Gasch et al) Unit 1: Mapper Unit 2: Aggregator and peak finder Solexa MNase Reads.
©2003/04 Alessandro Bogliolo Primer design. ©2003/04 Alessandro Bogliolo Outline 1.Polymerase Chain Reaction 2.Primer design.
Interdisciplinary Center for Biotechnology Research
PCR Primer Design Guidelines
Case Study. DNA Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known.
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
IN THE NAME OF GOD. PCR Primer Design Lecturer: Dr. Farkhondeh Poursina.
Nucleic Acid Secondarily Structure AND Primer Selection Bioinformatics
BLAST What it does and what it means Steven Slater Adapted from pt.
Strand Design for Biomolecular Computation
BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.
Dave Palmer Primer Design Dave Palmer
Probe Design Using Exact Repeat Count August 8th, 2007 Aaron Arvey.
Copyright OpenHelix. No use or reproduction without express written consent1.
From Structure to Function. Given a protein structure can we predict the function of a protein when we do not have a known homolog in the database ?
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Prof. Amr Goneid, AUC1 Analysis & Design of Algorithms (CSCE 321) Prof. Amr Goneid Department of Computer Science, AUC Part 8. Greedy Algorithms.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Evolutionary Algorithms for Finding Optimal Gene Sets in Micro array Prediction. J. M. Deutsch Presented by: Shruti Sharma.
A Software Tool for Generating Non-Crosshybridizing libraries of DNA Oligonucleotides Russell Deaton, junghuei Chen, hong Bi, and John A. Rose Summerized.
Probe Selection Problems in Gene Sequences. (C) 2003, SNU Biointelligence Lab, DNA Microarrays cDNA: PCR from.
Motif Search and RNA Structure Prediction Lesson 9.
Dynamic programming with more complex models When gaps do occur, they are often longer than one residue.(biology) We can still use all the dynamic programming.
Solution of Satisfiability Problem on a Gel-Based DNA computer Ji Yoon Park Dept. of Biochem Hanyang University.
From: Duggan et.al. Nature Genetics 21:10-14, 1999 Microarray-Based Assays (The Basics) Each feature or “spot” represents a specific expressed gene (mRNA).
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
RNA Structure Prediction
Introduction to Oligonucleotide Microarray Technology
Generalization of a Suffix Tree for RNA Structural Pattern Matching Tetsuo Shibuya Algorithmica (2004), vol. 39, pp Created by: Yung-Hsing Peng Date:
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
DNASequenceGenerator: A Program for the construction of DNA sequences Udo Feldkamp, Sam Saghafi, Wolfgang Banzhaf, Hilmar Rauhe DNA7 pp Summarized.
D. Darban, Ph.D Department of Microbiology School of Medicine Alborz University of Medical Sciences 1 Probe and Primer Design.
Genetic Algorithm. Outline Motivation Genetic algorithms An illustrative example Hypothesis space search.
Jason Gans Los Alamos National Laboratory Improved Assay-dependent Searching of Nucleic Acid Sequence Databases.
Fac. of Agriculture, Assiut Univ.
Polymerase Chain Reaction
Information Retrieval in Practice
POA Simulation MEC Seminar 임희웅.
PCR TECHNIQUE
CS 430: Information Discovery
Selection of Oligonucleotide Probes for Protein Coding Sequences
Lecture 4: Probe & primer design
DNA Library Design for Molecular Computation
Fitness measures for DNA Computing
RNA 2D and 3D Structure Craig L. Zirbel October 7, 2010.
On the Range Maximum-Sum Segment Query Problem
Information Theoretical Probe Selection for Hybridisation Experiments
Chap 3 String Matching 3 -.
Russell Deaton, junghuei Chen, hong Bi, and John A. Rose
Bioinformatics, Vol.17 Suppl.1 (ISMB 2001)
Nearest-neighbor model for Tm
Presentation transcript:

Selection of Optimal DNA Oligos for Gene Expression Arrays Reporter : Wei-Ting Liu Date : Nov

Abstract Selection of optimal DNA oligos - Based on sequence information and hybridization free energy - Minimize background hybridization (true expression) - Minimize the number of probes needed per gene (decrease the cost)

The Challenge of Probe Design The quality of data from the oligo chip experiment relies on optimal probes How to identify the optimun probes for each genes 1)Minimum hybridization free energy for that gene 2)Maximum hybridization free energy for all other genes in the hybridizing pool

Optimum probe : problems the energy is not computable from the sequence alone 1) Complete Structure of DNA 2) Concentration of genes 3) Complexity of algorithms : how the time and memory requirements change when a genome size increases Compute the theoretical energy from each gene to every possible hybridization partner. ( ignoring internal structure and concentration ) : computationally intractable

Approach Two Major Steps – Identify the set of candidate probes for each genes - Maximize the minimum number of mismatches to every other gene in the genome – Compare candidate probes to determine their hybridization energy : mismatch types and positions affect stability of oligo DNA hybridization. : having free energy (  G) for the correct target in aceeptable range :

Algorithm (1) make a suffix array of the coding sequences from a whole genome; (2) build a sequence landscape for every gene based on the sequence suffix array; (3) choose probe candidates based on sequence features and the sequence word rank values; (4) search for matching sequences in the whole genome; (5) locate match sequence positions in all genes;

Algorithm (6) calculate the free energy G and melting temperature (Tm) for each valid target (7) select match sequences that have stable hybridization structures with other targets in the genome

1. Sequence Suffix Array Suffix Array Data structure for efficiently searching whether S is a substring of string T. For a given string T, we construct the lexicographically sorted array of all its suffixes. ( For T = mississippi, the suffix array is ) Suffix array is a sorted list of all the suffixes of a sequence Build a sequence suffix array of both strands for all coding sequences of a genome, which are concatenated

In a suffix array, all suffixes of S are in the non- decreasing lexical order. For example, S=“ ATCACATCATCA ” ” 4 ATCACATCATCA S (1) 11 TCACATCATCA S (2) 7 CACATCATCA S (3) 2 ACATCATCA S (4) 9 CATCATCA S (5) 5 ATCATCA S (6) 12 TCATCA S (7) 8 CATCA S (8) 3 ATCA S (9) 10 TCA S (10) 6 CA S (11) 1 A S (12) 1 A 2 ACATCATCA S (4) 3 ATCA S (9) 4 ATCACATCATCA S (1) 5 ATCATCA S (6) 6 CA S (11) 7 CACATCATCA S (3) 8 CATCA S (8) 9 CATCATCA S (5) 10 TCA S (10) 11 TCACATCATCA S (2) 12 TCATCA S (7) i A

2. Sequence Landscape (1) A landscape represents the frequency in the database of all the words in the query sequence 15 base query sequence, same sequence as a database Every cell in the landscape,, indicates the frequency of a word (position j, length i) in the database

3. Selection of Short or Long Probe Candidates All potential probes are evaluated by the frequency of matching subwords in the database of other sequences Can use additional constraints – No single base (As, Ts, Cs, or Gs) exceeds 50% of the probe size – The length of any contiguous As and Ts or Cs and Gs region is less than 25% of the probe size – (G+C)% is between 40 and 60% of the probe sequence – No 15-long contiguous repeats anywhere in the entire coding sequence of the whole genome – No self-complementarity within the probe sequence

Candidates selection (1) Probe size from 20 bases to 70 bases A sequence landscape for each gene is used to select probes with low frequency in the rest of the genome – Approximate match  Use the landscape information to get an estimated rank of the number of approximate matches for each potential probe of each gene Top Q (from 10 to 20) probe candidates are selected

Candidates selection (2) Based on tests of many E.coli genes, found that summing the frequencies of words at each of several different word lengths (i.e. heights in the landscape) gives a good predictor of the number of matches in the background sequence  candidates probes are chosen by finding the probe-sized words within a gene sequence that minimize the sum of its sub-word frequencies

Candidates selection (3) Example – The words sequence : ATGCCA – ATG: beginning sub-word – f (ATG): the frequency of the word which occurs in coding sequence and its complement in the whole genome – Overlapping frequency of word s F(s) = f(ATG) + f(TGC) + f(GCC) + f(CCA)

Candidates selection (4) – i is column height of the gene landscape – j is the word position – M is the probe size and j < n-M – n is the sequence length of the gene For each word size i, the 10 positions with the lowest overlap frequency are saved All positions are ranked by the number of times they occur on those lists The highest ranking probes are predicted to have the fewest approximate matches elsewhere in

4. Mismatch Searching for Candidate Probes The approximate string search is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences Candidate probes in the coding regions of the genome with four or fewer mismatches (insertion, deletion, mismatch)

5. Localization of the Match Sequences in Each Gene Fast approximate string searching The matches can be assigned to their exact positions within the specific genes All gene positions can be built into the sorted array All match positions to the probe can be located in the specific gene by a binary search of the sorted array If the match sequences are across two genes, they are not valid

(6) Free Energy and Melting Temperature Calculation Consider DNA hybridization structure Using thermodynamic parameters could predict stability of DNA oligo hybridization approximately “SantaLucia,J.J. (1998) A unified view of polymer, dumbbel, and oligonucleotide DNA nearest-neighbor thermodynamics. Natl Acad. Sci. USA, 95, 1460– 1465.”

Melting Temperature Calculation Tm can be used as a parameter to evaluate probe hybridization behavior Tm is calculated for each probe as Enthalpy(Kcal/mol), : entropy(Kcal/mol) R: molar gas constant – c: total molar concentration of the annealing oligonucleotides

Melting temperature Melting temperature are specified by nearest neighbor thermodynamic ex:p=GCAT with △ (GGAT)= △ H(GG)+ △ H(GA)+ △ H(AT) = = 25.2(kcal/mol)

Probe Selection Uniqueness – Comparing potential probe sequences with the full-length sequences of other genes being monitored Hybridization characteristics – Tm among probes – GC content – Secondary structure – Number of mismatch

Thank you