Reconstruction of DNA sequencing by hybridization Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang Institute of Applied Mathematics,


Reconstruction of DNA sequencing by hybridization Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang Institute of Applied Mathematics, AMSS, CAS

Bioinformatics. Human Genome Project. Large-molecule data in biology, such as DNA and protein. Draws on mathematics, computer science, information science, physics, system science, and management science as well as biology. Genomics: DNA sequencing, gene prediction, sequence alignment.

DNA Sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

DNA Sequencing (shotgun). The target DNA is cut many times at random. Forward-reverse linked reads (~500 bp) lie a known distance apart.

DNA Sequencing (SBH). DNA array (DNA chip) with 4^3 probes. Target DNA: AAATGCG
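To make the array size concrete, the probe set can be enumerated directly. This is only a sketch: the function name `all_probes` and the `alphabet` parameter are illustrative and do not come from the slides.

```python
from itertools import product

def all_probes(k, alphabet="ACGT"):
    """Return every k-mer over the alphabet, one per spot on the array."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

probes = all_probes(3)
print(len(probes))  # 4**3 = 64 probes for k = 3
```

The array grows as 4^k, so each extra base of probe length quadruples the number of spots.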

Sequencing by Hybridization. Hybridize the target to an array containing a spot for each possible k-tuple (k-mer). The spectrum of a sequence is the multiset of all its k-long substrings (k-tuples). Goal: reconstruct the sequence from its spectrum. Pevzner (1989): reconstruction is polynomial. But …
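The spectrum definition translates into a few lines of code. The sketch below assumes the multiset is represented as a `collections.Counter`; the function name `spectrum` is illustrative.

```python
from collections import Counter

def spectrum(seq, k):
    """Multiset of all k-long substrings (k-tuples) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Target from the DNA-array slide, with k = 3.
for kmer, count in sorted(spectrum("AAATGCG", 3).items()):
    print(kmer, count)
```

A sequence of length n contributes n - k + 1 k-tuples, counted with multiplicity.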

Uniqueness of Reconstruction. Different sequences can have the same spectrum: {ACT, CTA, TAC} → ACTAC or TACTA. Non-uniqueness probability.
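The ambiguity can be checked by brute force on a toy instance. The sketch below enumerates every sequence of a given length and keeps those whose set of k-tuples equals the given spectrum. The helper names are illustrative, and exhaustive search is only feasible for very short sequences.

```python
from itertools import product

def kmer_set(seq, k):
    """Set of k-tuples occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sequences_with_spectrum(spec, n, alphabet="ACGT"):
    """All length-n sequences whose k-tuple set equals spec (brute force)."""
    k = len(next(iter(spec)))
    candidates = ("".join(p) for p in product(alphabet, repeat=n))
    return [s for s in candidates if kmer_set(s, k) == spec]

print(sequences_with_spectrum({"ACT", "CTA", "TAC"}, 5))
# ['ACTAC', 'CTACT', 'TACTA']
```

Besides the two sequences named on the slide, the search also returns CTACT, which has the same three k-tuples.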

Experiment Errors. Hybridization experiments are error-prone. False negative error: a k-tuple appears in the target DNA but does not appear in its measured spectrum; a repeated k-tuple is also treated as a negative error. False positive error: a k-tuple does not appear in the target DNA but does appear in its measured spectrum.

Sequencing by Hybridization. Target DNA …… TTTTACGC …… → spectrum. Errors: positive (misread) / negative (missing, repetition). Ideal case: TTT TTA TAC ACG CGC. With errors: TTT TTA TAC ACG CGC TGA.
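The three error types on this slide can be separated mechanically when the target is known, as it is in a simulation. The following is a minimal sketch; the function name `classify_errors` and its return layout are assumptions made for illustration.

```python
from collections import Counter

def classify_errors(target, measured, k):
    """Compare a measured spectrum (a set of k-tuples) with the target's k-tuples."""
    counts = Counter(target[i:i + k] for i in range(len(target) - k + 1))
    ideal = set(counts)
    measured = set(measured)
    missing = ideal - measured            # false negatives: present but not observed
    misread = measured - ideal            # false positives: observed but not present
    repeated = {m for m, c in counts.items() if c > 1}  # multiplicity lost on the chip
    return missing, misread, repeated

measured = {"TTT", "TTA", "TAC", "ACG", "CGC", "TGA"}
print(classify_errors("TTTTACGC", measured, 3))
# (set(), {'TGA'}, {'TTT'})
```

Here TGA is the false positive, and TTT occurs twice in the target although the chip can report it only once.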

SBH Reconstruction Problem. In the case of error-free SBH experiments, a desired solution is simply a feasible solution that includes all k-tuples in the spectrum. In the general case, no information is available beyond the spectrum and the length of the target DNA, so a feasible solution composed of a maximum-cardinality subset of the spectrum is a reasonable desired solution.

SBH Reconstruction Problem. Ideal case (without repetitions and errors): equivalent to finding an Eulerian path in a corresponding graph (Pevzner, 1989), for which a linear-time algorithm exists (Fleischner, 1990). The general case is NP-hard; approaches include branch and bound and heuristics. Extensions: PSBH (positional SBH) and SBH with length error.
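For the error-free case, the Eulerian-path formulation can be sketched as follows. Each k-tuple becomes an edge from its (k-1)-prefix to its (k-1)-suffix, and a stack-based Hierholzer traversal spells out one sequence. This is an illustrative sketch under the assumption of an error-free spectrum, not the authors' implementation; the function name `reconstruct` is hypothetical.

```python
from collections import defaultdict

def reconstruct(kmers):
    """Spell one sequence using every k-tuple in `kmers` exactly once."""
    k = len(kmers[0])
    graph = defaultdict(list)      # (k-1)-mer -> list of successor (k-1)-mers
    balance = defaultdict(int)     # out-degree minus in-degree per node
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # An Eulerian path starts at the node with one surplus outgoing edge, if any.
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:                   # Hierholzer's algorithm, iterative form
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    seq = path[0] + "".join(n[-1] for n in path[1:])
    # Reject inputs that admit no Eulerian path through all k-tuples.
    if sorted(seq[i:i + k] for i in range(len(seq) - k + 1)) != sorted(kmers):
        raise ValueError("spectrum has no Eulerian reconstruction")
    return seq

print(reconstruct(["AAA", "AAT", "ATG", "TGC", "GCG"]))  # AAATGCG
```

When several Eulerian paths exist, the traversal returns just one of them, which is exactly the non-uniqueness issue raised earlier.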

Motivations. Give criteria that determine the most likely k-tuples at both ends and in the middle of all possible reconstructions of the target DNA; these criteria greatly reduce ambiguities in the reconstruction. Transform the negative errors into positive errors; this makes it easy to handle both types of errors. Separate the repetitions from both types of errors.

Methods. Estimate the number of k-tuples that do not occur in a solution. Adjacency matrix (connection matrix). Give a lower bound on the number of k-tuples that do not occur in any solution from k-tuple i to k-tuple j.
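The slide does not define the entries of the adjacency matrix. One common convention, assumed here purely for illustration, sets entry (i, j) to 1 when the last k-1 letters of k-tuple i equal the first k-1 letters of k-tuple j, so that j can directly follow i in a reconstruction. The function name `overlap_matrix` is hypothetical.

```python
def overlap_matrix(kmers):
    """0/1 matrix: entry [i][j] is 1 if kmers[j] can directly follow kmers[i]."""
    return [[int(a[1:] == b[:-1]) for b in kmers] for a in kmers]

kmers = ["TTT", "TTA", "TAC", "ACG", "CGC"]
for kmer, row in zip(kmers, overlap_matrix(kmers)):
    print(kmer, row)
```

For this spectrum the matrix is a chain TTT → TTA → TAC → ACG → CGC plus a self-loop on TTT.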

Methods. Determine the most likely k-tuples at both ends. Reconstruct from the most likely end pairs to obtain an upper bound for the SBH problem. Purge the end pairs that cannot yield a better solution than the current upper bound.

Methods. Transform the negative errors into positive errors. Artificial k-tuples: fill in all the possible gaps due to false negative errors. Negative error level: the maximal number of allowed consecutively missing k-tuples; limiting it reduces the number of artificial k-tuples.
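One way to picture the gap-filling step is sketched below. It assumes that two observed k-tuples u and v may be separated by g missing k-tuples whenever the last k-1-g letters of u equal the first k-1-g letters of v, and it emits the k-tuples that would lie between them. The slides do not give the authors' exact construction, so the function `artificial_kmers` and its overlap rule are assumptions.

```python
def artificial_kmers(spectrum, max_missing):
    """Candidate k-tuples bridging pairs of observed k-tuples.

    max_missing plays the role of the negative error level: the largest
    number of consecutive missing k-tuples that a bridge may fill.
    """
    spectrum = set(spectrum)
    k = len(next(iter(spectrum)))
    candidates = set()
    # Keep at least one overlapping letter, so gaps are capped at k - 2.
    for gap in range(1, min(max_missing, k - 2) + 1):
        overlap = k - 1 - gap
        for u in spectrum:
            for v in spectrum:
                if u[-overlap:] == v[:overlap]:
                    merged = u + v[overlap:]
                    # The k-tuples strictly between u and v fill the gap.
                    for i in range(1, gap + 1):
                        bridge = merged[i:i + k]
                        if bridge not in spectrum:
                            candidates.add(bridge)
    return candidates

# TAC is missing from the spectrum of TTTTACGC.
print(sorted(artificial_kmers({"TTT", "TTA", "ACG", "CGC"}, 1)))
# ['GCG', 'TAC']
```

The missing TAC is recovered, but the spurious GCG is added as well. This is the sense in which negative errors become positive errors, and a small negative error level keeps the number of such extras down.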

Computational Experiments. 109 DNA sequences from GenBank. Simulate the SBH experiments. Error models: random (probabilistic model) and systematic (one-base-mismatch model).
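A simulated SBH experiment under the two error models might look like the sketch below. The slides do not specify the models in detail, so everything here is assumed: the parameter names `p_neg` and `p_pos`, the use of independent per-probe error probabilities, and the reading of "one base mismatched" as false positives drawn from probes that differ from a true k-tuple in exactly one position.

```python
import random
from itertools import product

def ideal_spectrum(target, k):
    """Set of k-tuples occurring in the target."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}

def simulate_random(target, k, p_neg, p_pos, seed=0):
    """Probabilistic model: drop true k-tuples and add false ones independently."""
    rng = random.Random(seed)
    ideal = ideal_spectrum(target, k)
    measured = {m for m in sorted(ideal) if rng.random() >= p_neg}
    for p in product("ACGT", repeat=k):
        probe = "".join(p)
        if probe not in ideal and rng.random() < p_pos:
            measured.add(probe)
    return measured

def one_mismatch_neighbors(kmer):
    """All probes differing from kmer in exactly one base."""
    return {
        kmer[:i] + b + kmer[i + 1:]
        for i in range(len(kmer))
        for b in "ACGT"
        if b != kmer[i]
    }

print(sorted(simulate_random("AAATGCG", 3, p_neg=0.2, p_pos=0.05)))
print(sorted(one_mismatch_neighbors("ATG")))
```

Fixing the seed makes each simulated experiment reproducible, which helps when comparing reconstruction algorithms on the same noisy spectra.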

Conclusions. The ideal case (without repetitions and errors) can be solved in polynomial time (Pevzner, 1989). The general case is NP-hard. Design efficient algorithms.
Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang. A new approach to the reconstruction of DNA sequencing by hybridization. Bioinformatics, vol 19(1), pages 14-21.
Xiang-Sun Zhang, Ji-Hong Zhang and Ling-Yun Wu. Combinatorial optimization problems in the positional DNA sequencing by hybridization and its algorithms. System Sciences and Mathematics, vol 3 (in Chinese).
Ling-Yun Wu, Ji-Hong Zhang and Xiang-Sun Zhang. Application of neural networks in the reconstruction of DNA sequencing by hybridization. In Proceedings of the 4th ISORA, 2002.