Consolidating Software Tools for DNA Microarray Design and Manufacturing Mourad Atlas Nisar Hundewale Ludmila Perelygina Alex Zelikovsky.



Agenda
- Introduction
- DNA Array Flow (DAF)
- Benchmarks: Herpes B virus
- Experiments and Results
- Conclusion and Future Work

Motivation
Microarrays provide a tool for answering a wide variety of questions about the dynamics of cells:
- In which cells is each gene active?
- Under what environmental conditions is each gene active?
- How does the activity level of a gene change under different conditions (stage of the cell cycle, environmental conditions, diseases)?
- Which genes seem to be regulated together?

DNA Array Flow
Input: genome ID
1. Reading genomic data: download the genome sequence and extract ORFs in FASTA format.
2. Probe selection: for each gene G, find probes that hybridize to G at a given melting temperature T_M but do not hybridize to any other gene at that T_M.
3. Physical design: probe placement determines, for each probe, the site on the 2-D array surface where it is placed or synthesized; probe embedding embeds each probe into the deposition sequence.
4. Mask and array manufacturing: a photolithographic process is used for sequence masking.
5. Hybridization experiment: each probe binds to its target according to the base-complementarity rules.
6. Analysis of hybridization intensities: the hybridization signal can be measured by a laser scanner and converted to a quantitative value that can be read.

Reading genomic data

Reading Genomic Data
- Input the genome ID
- Download the genome sequence from GenBank (Bioperl)
- ORF extraction from the genome: GeneMark (Borodovsky, GaTech) or ORF Finder
- Extracting extra ORFs: ( )
- ORF Parser: ORFs in FASTA format
- Next stage: probe selection

ORF Extraction from genome
- Input: genome sequence downloaded from GenBank (Bioperl) for the given genome ID
- Tools: GeneMark (Borodovsky, GaTech) or ORF Finder
- Extracting extra ORFs: ( )
- Output: passed to the ORF Parser, which writes ORFs in FASTA format for probe selection

What is an ORF?
An open reading frame (ORF) is a subsequence of DNA that could potentially be transcribed into messenger RNA (mRNA).
Because of the differences between prokaryotic and eukaryotic transcription systems, there are two types of ORF:
1. Prokaryotic: start and stop codon
2. Eukaryotic: stop codon
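As a rough illustration of the "start and stop codon" case, the following sketch scans the three forward reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is not the method used by GeneMark or ORF Finder; the function name `find_orfs`, the `min_codons` threshold, and the restriction to the forward strand are assumptions made for this example.

```python
# Hypothetical minimal ORF scan (forward strand only); not GeneMark or ORF Finder.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop stretches.

    start is inclusive, end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    genome = "CCATGAAATAGGG"
    for start, end, frame in find_orfs(genome):
        # FASTA-style record, as the ORF Parser step would emit
        print(f">orf_{start}_{end}_frame{frame}")
        print(genome[start:end])
```

Each reported ORF can then be written as one FASTA record, which is the form the probe selection stage expects.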

ORF Parser
- Input: ORFs extracted from the genome (GeneMark (Borodovsky, GaTech) or ORF Finder; extracting extra ORFs: ( )), after the genome sequence has been downloaded from GenBank (Bioperl) for the given genome ID
- Output: ORFs in FASTA format, passed to probe selection


Probe Selection
Reading genomic data -> probe selection (Promide) -> physical design
- ORF preprocessing
- Choosing the best melting temperature
- ocand: find all candidates for the given temperature
- Output: pools of probes

Probe Selection Requirements
- Homogeneity: ensure that the probes can bind to their targets at the temperature of the experiment.
- Sensitivity: avoid self-hybridization, that is, ensure that the probes will not form a secondary structure. (Such a structure would prevent a probe from binding to its target.)
- Specificity:
  - the probes stay unique even after a few bases are changed
  - a probe must hybridize to one particular gene: for each gene G, find probes that (1) hybridize to G at a given temperature and (2) do not hybridize to any other gene at that temperature
  - avoid cross-hybridization
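The three requirements can be sketched as simple filters over candidate substrings of a gene. This is a deliberately crude stand-in, not Promide's method: melting temperature is estimated with the Wallace rule (2 degrees per A/T, 4 degrees per G/C), self-hybridization is approximated by looking for a short stretch whose reverse complement also occurs in the probe, and specificity is reduced to an exact-substring check against the other genes. All function names, the `stem` parameter, and the toy sequences are assumptions made for illustration.

```python
# Hypothetical probe filters; a simplification, not Promide's algorithm.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def wallace_tm(probe):
    """Wallace-rule melting temperature estimate for short oligos."""
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc


def may_self_hybridize(probe, stem=3):
    """True if some stem-length stretch can pair with another part of the probe."""
    return any(
        reverse_complement(probe[i:i + stem]) in probe
        for i in range(len(probe) - stem + 1)
    )


def candidate_probes(genes, target, length, tm_range, stem=3):
    """Substrings of genes[target] that pass all three filters, in sequence order."""
    lo, hi = tm_range
    seq = genes[target]
    others = [s for name, s in genes.items() if name != target]
    result = []
    for i in range(len(seq) - length + 1):
        probe = seq[i:i + length]
        if not lo <= wallace_tm(probe) <= hi:
            continue  # homogeneity: outside the temperature window
        if may_self_hybridize(probe, stem):
            continue  # sensitivity: possible secondary structure
        if any(probe in other for other in others):
            continue  # specificity: also occurs in another gene
        result.append(probe)
    return result


if __name__ == "__main__":
    genes = {"g1": "ACGTACGGA", "g2": "TTACGTAAC"}
    print(candidate_probes(genes, "g1", length=4, tm_range=(12, 14)))
```

A realistic specificity check would tolerate mismatches (the slide asks that probes stay unique even after a few bases change), which is where suffix-array based matching statistics come in.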

Why Promide?
Possible solutions:
- Li and Stormo 2001
- Kaderali and Schliep 2002
- Rahmann (Promide) 2003
They use the same data structure: the suffix array.
Promide handles truly large-scale datasets in a reasonable amount of time:
- Human GeneNest clusters: in 50 hours
- Neurospora crassa: a few hours with Promide, 1 week with Li and Stormo

ORF preprocessing
Classes of sequences:
- A Master sequence is a sequence we wish to design oligos for.
- A Background sequence is a sequence against which specificity is checked.
- Every Master is also a Background.

Choosing best melting temperature
For each candidate oligo (substring) of a Master, do:
- Check side constraints
- Compute specificity: optimal T_M alignment with every Background collection
  - Compute matching statistics: mims
  - Oligo candidate selection: ocand

Mask and array manufacturing

Mask and Array manufacturing
Arrays are synthesized on a wafer:
- Selectively expose array sites to light
- Flush the chip's surface with a solution of protected A, C, G, or T
- Repeat the last two steps until the desired probes are synthesized

Array manufacturing
[Figure, three slides: a 3×3 array of probes synthesized with the nucleotide deposition sequence ACG. Mask 1 exposes the sites that receive A, Mask 2 the sites that receive C, and Mask 3 the sites that receive G.]
- A Nucleotide Deposition Sequence defines the order of nucleotide deposition.
- A Probe Embedding specifies the steps of the deposition sequence that a probe uses to get synthesized.

Border Minimization Challenges
[Figure: the 3×3 array with nucleotide deposition sequence ACG; Mask 1 (A) has border = 8.]
- Border reduction: less unwanted illumination, higher chip yield

[Figure: lamp, mask, and array, showing intentionally exposed sites, unwanted illumination, and the border between them.]
- Problem: diffraction, internal reflection, scattering, internal illumination
- Occurs at sites near intentionally exposed sites
- Reduce border -> increase yield -> reduce cost
- Design objective: minimize the border
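The border of a single mask can be counted directly: it is the number of pairs of adjacent array sites in which exactly one site is exposed. The sketch below assumes a mask is given as a list of rows of 0/1 values (1 = exposed); the representation and the function names are illustrative choices, not part of the original flow.

```python
# Hypothetical mask representation: list of rows, 1 = exposed site, 0 = masked site.


def mask_border(mask):
    """Count horizontally or vertically adjacent site pairs with different exposure."""
    rows, cols = len(mask), len(mask[0])
    border = 0
    for r in range(rows):
        for c in range(cols):
            # Compare with the right neighbor and the lower neighbor only,
            # so every adjacent pair is counted exactly once.
            if c + 1 < cols and mask[r][c] != mask[r][c + 1]:
                border += 1
            if r + 1 < rows and mask[r][c] != mask[r + 1][c]:
                border += 1
    return border


def total_border(masks):
    """Total border over all masks of the deposition sequence."""
    return sum(mask_border(m) for m in masks)


if __name__ == "__main__":
    mask_a = [[1, 1, 0],
              [0, 0, 0],
              [0, 0, 0]]
    print(mask_border(mask_a))  # border of one mask
```

The design objective on the slide corresponds to minimizing `total_border` over all masks by choosing where probes sit and how they are embedded.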

Physical design

Physical Design
Between probe selection and mask and array manufacturing:
- Deposition sequence design
- Test control
- 2D probe placement
- 3D probe embedding

Physical Design
Probe placement:
- Similar probes should be placed close together
- Constructive placement
- Placement improvement operators
Probe embedding:
- Degrees of freedom (DOF) in probe embedding
- DOF exploitation for border conflict reduction

Border Reduction with Probe Placement
Probe placement: similar probes should be placed close together.
[Figure: the same probes and deposition sequence under two placements; optimizing the placement reduces the border from 8 to 4.]

Border Reduction in Probe Embedding
- Synchronous embedding: deposit one nucleotide in each group of "ACGT"
- Asynchronous embedding: no restriction
[Figure: the same probes and deposition sequence under two embeddings; the synchronous embedding has border = 4, the asynchronous embedding has border = 2.]
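The border values in the figure can be reproduced with a small sketch, assuming the figure shows two adjacent probes, CT and TA, and the periodic deposition sequence ACGTACGT (this reading of the flattened figure is an assumption). An embedding is represented as the list of deposition steps a probe uses; two neighboring probes conflict at every step that exactly one of them uses. The asynchronous variant shown here is the "as soon as possible" (leftmost) embedding, which is only one of the possible asynchronous embeddings.

```python
# Hypothetical helpers; embeddings are lists of 0-based deposition steps.


def synchronous_embedding(probe, period="ACGT"):
    """The i-th base is deposited in the i-th repetition of the period."""
    return [i * len(period) + period.index(base) for i, base in enumerate(probe)]


def asap_embedding(probe, deposition):
    """Leftmost embedding: each base uses the earliest step still available."""
    steps, pos = [], 0
    for base in probe:
        pos = deposition.index(base, pos)  # raises ValueError if the probe does not fit
        steps.append(pos)
        pos += 1
    return steps


def conflicts(steps_a, steps_b):
    """Number of steps at which exactly one of two adjacent probes is exposed."""
    return len(set(steps_a) ^ set(steps_b))


if __name__ == "__main__":
    deposition = "ACGT" * 2
    sync = conflicts(synchronous_embedding("CT"), synchronous_embedding("TA"))
    asap = conflicts(asap_embedding("CT", deposition), asap_embedding("TA", deposition))
    print(sync, asap)
```

Under the synchronous embedding the two probes share no deposition step, while the leftmost embedding lets both use the same T step, which halves the number of conflicts in this example.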

Physical Design Problem
- Given: n^2 probes
- Find: placement of the probes in n x n sites, and embedding of the probes
- Minimize: total border cost

Problem formulation for placement
2-dim (synchronous) Array Design Problem:
- Minimize the placement cost of the Hamming graph H (vertices = probes, distance = Hamming)
- on the 2-dim grid graph G2 (N x N array, edges between neighbors)
- Each probe (vertex of H) is assigned to a site (vertex of G2)
Hamming distance (P1, P2) = number of positions at which the nucleotides of P1 and P2 differ = border between P1 and P2 under synchronous embedding
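A minimal sketch of this objective, assuming equal-length probes stored in a list of rows (the names `hamming` and `placement_cost` are chosen for this example):

```python
# Hypothetical helpers for the synchronous placement objective.


def hamming(p, q):
    """Number of positions at which two equal-length probes differ."""
    if len(p) != len(q):
        raise ValueError("probes must have equal length")
    return sum(a != b for a, b in zip(p, q))


def placement_cost(grid):
    """Sum of Hamming distances over all horizontally and vertically adjacent sites."""
    rows, cols = len(grid), len(grid[0])
    cost = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                cost += hamming(grid[r][c], grid[r][c + 1])
            if r + 1 < rows:
                cost += hamming(grid[r][c], grid[r + 1][c])
    return cost


if __name__ == "__main__":
    grid = [["AC", "AG"],
            ["TC", "TG"]]
    print(placement_cost(grid))
```

Every placement algorithm discussed in the following slides can be compared by evaluating this one number on its output grid.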

Placement Objective: Minimize Border
- Sort the probes in lexicographical order
- Sorting the probes reduces discrepancies between adjacent probes
- Problem: how to place the 1-D ordering of probes onto the 2-D chip?
[Figure: five probes (Probe 1 to Probe 5) before and after lexicographic sorting.]

TSP + 1-Threading Placement
Hubbell, 1990s:
- Find a TSP tour/path over the given probes with Hamming distance
- Place the probes in the grid following the TSP tour
- Adjacent probes are similar
Hannenhalli, Hubbell, Lipshutz, Pevzner 2002:
- Place the probes according to 1-threading
- This further decreases the total border by 20%
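The slide does not say which TSP method is used. As a stand-in, the sketch below builds a path with the nearest-neighbor heuristic under Hamming distance; this is a simple greedy approximation, not an exact TSP solver, and the function names are assumptions.

```python
# Hypothetical nearest-neighbor path over probes; a greedy TSP approximation.


def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def nearest_neighbor_path(probes):
    """Start at the first probe and repeatedly move to the closest unvisited probe."""
    if not probes:
        return []
    remaining = list(probes[1:])
    path = [probes[0]]
    while remaining:
        current = path[-1]
        # min() keeps the earliest probe on ties, which makes the result deterministic.
        nxt = min(remaining, key=lambda p: hamming(current, p))
        remaining.remove(nxt)
        path.append(nxt)
    return path


if __name__ == "__main__":
    probes = ["AAAA", "TTTT", "AAAT", "TTTA"]
    print(nearest_neighbor_path(probes))
```

The resulting 1-D order still has to be laid out on the 2-D chip, which is the role of threading.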

Placement By Threading
[Figure: the ordered probes (Probe 1 to Probe 5) are threaded onto the chip.]
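One simple way to thread a 1-D order onto a grid is a serpentine (boustrophedon) path: fill the first row left to right, the next right to left, and so on, so that consecutive probes in the order always land on adjacent sites. The sketch below combines it with lexicographic sorting. Note that this is the plain row-by-row threading, not the 1-threading pattern of Hannenhalli et al.; the function name and the toy probes are assumptions.

```python
# Hypothetical sort + serpentine threading; simpler than 1-threading.


def serpentine_placement(probes, n_cols):
    """Sort probes lexicographically and lay them out row by row, reversing every other row."""
    ordered = sorted(probes)
    rows = [ordered[i:i + n_cols] for i in range(0, len(ordered), n_cols)]
    # Odd rows are reversed so the end of one row sits next to the start of the next.
    return [row if r % 2 == 0 else row[::-1] for r, row in enumerate(rows)]


if __name__ == "__main__":
    for row in serpentine_placement(["TT", "AA", "CG", "AC", "GG", "CA"], n_cols=3):
        print(row)
```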

Row-Epitaxial Placement Improvement
For each site position (i, j):
- Find the best probe, that is, the one that minimizes the border at (i, j)
- Move (switch) the best probe to (i, j) and lock it in this position
Row placement = sort + thread + row-epitaxial
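The loop above can be sketched as follows. Sites are visited in row-major order; for each site, the probe among the not yet locked sites that has the smallest Hamming distance to the already locked left and upper neighbors is swapped in. This sketch searches all unlocked sites and uses synchronous (Hamming) border cost; both choices, and the function names, are assumptions made to keep the example short.

```python
# Hypothetical row-epitaxial improvement pass on a grid of equal-length probes.


def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def row_epitaxial(grid):
    """Return an improved copy of grid; the input is left unchanged."""
    grid = [row[:] for row in grid]
    n_rows, n_cols = len(grid), len(grid[0])
    sites = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    for k, (r, c) in enumerate(sites):
        # Neighbors that were locked earlier in the row-major sweep.
        locked = []
        if c > 0:
            locked.append(grid[r][c - 1])
        if r > 0:
            locked.append(grid[r - 1][c])

        def border_at_site(site):
            probe = grid[site[0]][site[1]]
            return sum(hamming(probe, q) for q in locked)

        # Candidates are the current site and all later (unlocked) sites;
        # min() prefers the current site on ties, avoiding needless swaps.
        br, bc = min(sites[k:], key=border_at_site)
        grid[r][c], grid[br][bc] = grid[br][bc], grid[r][c]
    return grid


if __name__ == "__main__":
    start = [["AAA", "TTT"],
             ["AAT", "TTA"]]
    for row in row_epitaxial(start):
        print(row)
```

In this toy example the total Hamming border over adjacent sites drops from 8 to 6. Because each step is greedy, the pass is a heuristic and does not guarantee an optimal placement.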

Probe Embedding
[Figure: a hypothetical probe group embedded into a periodic ACGT deposition sequence in three ways: a synchronous embedding, an asynchronous embedding, and another embedding.]

Embedding Determines Border Conflicts
[Figure: the same probes and deposition sequence under a synchronous embedding and under an ASAP embedding, showing different border conflicts.]

Problem formulation
2-dim (synchronous) Array Design Problem:
- Minimize the placement cost of the Hamming graph H (vertices = probes, distance = Hamming)
- on the 2-dim grid graph G2 (N x N array, edges between neighbors)
3-dim (asynchronous) Array Design Problem:
- Minimize the cost of placement and embedding of the Hamming graph H' (vertices = probes, distance = Hamming between embedded probes)
- on the 2-dim grid graph G2 (N x N array, edges between neighbors)

Post-placement Optimization Methods
Asynchronous re-embedding after 2-dim placement:
- Greedy algorithm: while there exist probes to re-embed with gain, optimally re-embed the probe with the largest gain
- Batched greedy: speed-up by avoiding recalculations
- Chessboard algorithm: while there is gain, re-embed the probes in red sites, then re-embed the probes in green sites
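The step "optimally re-embed a probe" can be sketched as a dynamic program when the embeddings of the neighboring probes are held fixed. At each deposition step the probe either uses the step (allowed only if the nucleotide matches its next base), which conflicts with every neighbor that is not exposed at that step, or skips it, which conflicts with every neighbor that is exposed. The formulation, the function name, and the representation of embeddings as lists of step indices are assumptions made for this example.

```python
# Hypothetical DP for re-embedding one probe against fixed neighbor embeddings.


def optimal_reembedding(probe, deposition, neighbor_embeddings):
    """Return (conflicts, steps) for a minimum-conflict embedding of probe."""
    m, k = len(deposition), len(probe)
    d = len(neighbor_embeddings)
    # exposed[t] = number of neighbors that use deposition step t
    exposed = [sum(t in emb for emb in neighbor_embeddings) for t in range(m)]
    inf = float("inf")
    # dp[t][i] = min conflicts over the first t steps with i bases embedded
    dp = [[inf] * (k + 1) for _ in range(m + 1)]
    dp[0][0] = 0
    for t in range(m):
        for i in range(k + 1):
            if dp[t][i] == inf:
                continue
            skip = dp[t][i] + exposed[t]
            if skip < dp[t + 1][i]:
                dp[t + 1][i] = skip
            if i < k and deposition[t] == probe[i]:
                use = dp[t][i] + (d - exposed[t])
                if use < dp[t + 1][i + 1]:
                    dp[t + 1][i + 1] = use
    if dp[m][k] == inf:
        raise ValueError("probe cannot be embedded in the deposition sequence")
    # Trace back which steps were used.
    steps, i = [], k
    for t in range(m, 0, -1):
        used = (
            i > 0
            and deposition[t - 1] == probe[i - 1]
            and dp[t][i] == dp[t - 1][i - 1] + (d - exposed[t - 1])
        )
        if used:
            steps.append(t - 1)
            i -= 1
    return dp[m][k], steps[::-1]


if __name__ == "__main__":
    # Probe CT next to a neighbor TA embedded at steps 3 and 4 of ACGTACGT.
    print(optimal_reembedding("CT", "ACGTACGT", [[3, 4]]))
```

In the chessboard scheme, sites of the same color are never adjacent, so all probes of one color can be re-embedded independently with a routine like this while the other color stays fixed.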

Analysis of hybridization intensities

Experimental Study
In our experiments we considered the following parameters and measured the results for different values of each:
- Melting temperature: we chose 60 °C and 65 °C as the best melting temperatures for our DNA probe array.
- Number of candidates: we experimented with K = 1 and K = 2 candidates for each pool of probes.
- Chip size: we ran our experiments with two chip sizes, 50x50 and 60x60.
We report the number of conflicts and the runtime of each algorithm for the Herpes B virus and for simulated data.

Experiments Outline
- Genome ID -> Bioperl -> sequence in FASTA format
- ORF extraction (GeneMark) -> ORFs in FASTA format -> ORF Parser
- Select probes (Promide) -> pool of probes -> Probe Parser -> pools of probes in chip format
- Read pool / Genpool
- Placements: sorting, TSP, row placement
- Embedding: chessboard
- Chip: number of conflicts and CPU time for all algorithms

Results tables (T_M = 65, chip size 50x50; one table for K = 2 and one for K = 1)
- Columns: # Conflicts and CPU Time (sec), each for Herpes B virus and for simulated data
- Rows: Initial, Tsort, Tsp, Lalign, Reptx, Chessboard

Results tables (T_M = 65, chip size 60x60; one table for K = 1 and one for K = 2)
- Columns: # Conflicts and CPU Time (sec), each for Herpes B virus and for simulated data
- Rows: Initial, Tsort, Tsp, Lalign, Reptx, Chessboard

Conclusion and Future Work
Conclusion. Our experiments show:
- The genomic data follow the pattern predicted by simulated data.
- For the Herpes B virus, as for simulated data, increasing the number of candidates per probe (K) decreases the number of border conflicts produced by the probe placement algorithms.
- For the genomic data, the number of border conflicts is several times smaller than for simulated data.
- There is a trade-off between the number of border conflicts and the CPU time taken by the various physical design algorithms.
- We give a consolidated software solution for the entire DNA array flow.
- We explore all steps in a single automated suite of software tools.
Future work:
- Make the entire software suite available through web services.
- Users enter the name or ID of an organism, optionally set the required parameters, and the suite produces the DNA probe microarray chip layout.

Thank you