Sequence Assembly for Single Molecule Methods Steven Skiena, Alexey Smirnov Department of Computer Science SUNY at Stony Brook {skiena,



The State of Sequence Assembly
The success of full genome sequencing implies that shotgun sequence assembly with current technologies is largely a solved problem.
With conventional sequencing technologies:
  read length: about 500 base pairs
  error rate: under 2%
  coverage: about 10 times for bacteria, about 30 times for humans
But single molecule sequencing methods promise to change these parameters significantly.

Single Molecule Sequencing Methods
Single molecule sequencing methods, such as those being developed by U.S. Genomics, promise much longer read lengths:
  read length: hundreds of thousands of bases?
  error rate: ?
"No free lunch hypothesis": we anticipate that the new technologies will (at least initially) have significantly higher error rates than current sequencing machines.
Our assumption: long, lousy reads.

Our Problems
What levels of coverage will be needed to get accurate sequence information from long noisy reads?
How do we efficiently assemble such long noisy reads?

Sequencing from Subsequences
Why subsequences? We anticipate that certain single molecule sequencing technologies will be prone to many base deletion errors.
Example: in the U.S. Genomics technology, sequence bases are replaced by tagged bases. Untagged bases are invisible, so each read is a subsequence of the original sequence.
We study the effect of per-base deletion frequencies on our ability to accurately reconstruct long sequences.
Our study revolves around this theoretical error model, but our algorithm can be easily generalized.
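The deletion-only error model described above can be simulated in a few lines. This is a minimal sketch, assuming each base is dropped independently with probability p; the names `make_read` and `is_subsequence` are illustrative and do not come from the slides.

```python
import random


def make_read(sequence: str, p: float, rng: random.Random) -> str:
    """Drop each base independently with probability p; survivors keep their order."""
    return "".join(base for base in sequence if rng.random() >= p)


def is_subsequence(read: str, sequence: str) -> bool:
    """True if read can be obtained from sequence by deleting characters."""
    remaining = iter(sequence)
    # 'base in remaining' consumes the iterator up to the first match,
    # so matches are forced to occur in left-to-right order.
    return all(base in remaining for base in read)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
target = "".join(rng.choice("ACGT") for _ in range(60))
reads = [make_read(target, 0.3, rng) for _ in range(5)]
```

Every simulated read is a subsequence of `target`, which is the property the assembly problem relies on.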

Notation
  n: length of the original sequence
  p: base deletion rate
  k: number of reads
  R_i: a read of the original sequence

Quality of Reconstruction Metric
Our score function is $\mathrm{score}(s') = 1 - \mathrm{ED}(s, s')/n$, where ED is the edit distance, s is the target sequence of length n, and s' is the sequence reconstructed from the reads.
An empty string has a score of 0.
The target string has a score of 1.
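A short sketch of the metric, assuming the score is 1 - ED(s, s')/n. That form is an assumption, chosen because it reproduces the two anchor values stated above (0 for the empty string, 1 for the target). The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def score(target: str, reconstructed: str) -> float:
    """Assumed form of the metric: 1 - ED(target, reconstructed) / len(target)."""
    return 1.0 - edit_distance(target, reconstructed) / len(target)
```

For example, `score("ACGT", "AGT")` is 0.75, since a single insertion repairs the reconstruction.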

Lower Bounds on Reconstruction Quality
k = 0: report a random string of some length. Computational experiments showed that reporting a string of length 0.6n gives the best results (score = 0.37).
k = 1: report this read; score = 1 - p, because (1 - p)n characters will be matched and the remaining pn will have to be inserted.

Lower Bounds on Reconstruction Quality (continued; figure not preserved)

Information Theory Bounds
What is the minimal number of reads that we need to reconstruct the sequence?
First, we need to know the number of sequences of length n in which a given read of length k occurs:
Each read gives us at most this number of bits of information:
Therefore, we will need at least this many reads:
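The argument above can be sketched numerically. The formulas below are one standard way to fill in the three quantities and are an assumption, not necessarily the expressions on the original slides. For an alphabet of size 4, the number of length-n sequences containing a fixed length-m string as a subsequence is $N(n, m) = \sum_{i=m}^{n} \binom{n}{i} 3^{n-i}$. A read then carries at most $2n - \log_2 N(n, m)$ bits, and at least $2n / (2n - \log_2 N(n, m))$ reads are needed. The code writes the read length as m, taken to be about (1 - p)n, to avoid clashing with k, the number of reads.

```python
from math import comb, log2


def supersequence_count(n: int, m: int, alphabet: int = 4) -> int:
    """Number of length-n strings that contain a fixed length-m string as a subsequence."""
    return sum(comb(n, i) * (alphabet - 1) ** (n - i) for i in range(m, n + 1))


def bits_per_read(n: int, m: int, alphabet: int = 4) -> float:
    """Upper bound on the information one read of length m gives about the target."""
    return log2(alphabet ** n) - log2(supersequence_count(n, m, alphabet))


def min_reads(n: int, p: float, alphabet: int = 4) -> float:
    """Lower bound on the number of reads for deletion rate p (assumes 0 <= p < 1)."""
    m = round((1 - p) * n)  # assumed typical read length under the deletion model
    return log2(alphabet ** n) / bits_per_read(n, m, alphabet)
```

Under these assumptions, `min_reads(400, 0.5)` is on the order of ten reads, while `min_reads(400, 0.75)` runs into the hundreds and keeps growing with n.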

Bounds on the Number of Reads
Conclusion: reconstruction becomes impossible for error rates higher than 75%, but remains possible at 50%.

Sequence Assembly Algorithm
We use a two-phase procedure:
  Insertion: align a read R_i with the consensus sequence C_{i-2} and build a new consensus C_{i-1}.
  Refining and cleanup: delete or reorder characters in the current consensus to better reflect the reads, and delete unused characters.
Diagram: R_1 and R_2 give C_1; R_3 is inserted into C_1 to give C_2; R_4 is inserted into C_2 to give C_3; then refine and cleanup.

Read Insertion
How do we choose the optimal alignment for inserting a new read into the current consensus C_i?
Pairwise align all reads against C_i and, for each position of C_i, count how many times each particular character was inserted at that position.
Align the read being inserted against the weighted consensus sequence, using the insertion weights generated in the previous step.
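A simplified sketch of the insertion step. It assumes an unweighted alignment: the read is aligned to the consensus along a longest common subsequence, and unmatched read characters are spliced in, which yields a shortest common supersequence of the two. The insertion weights described above are deliberately left out, and `insert_read` is a hypothetical name.

```python
def insert_read(consensus: str, read: str) -> str:
    """Merge read into consensus along an LCS alignment (unweighted simplification)."""
    n, m = len(consensus), len(read)
    # lcs[i][j] = length of the longest common subsequence of consensus[i:] and read[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if consensus[i] == read[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    merged = []
    i = j = 0
    while i < n and j < m:
        if consensus[i] == read[j]:        # shared character: emit it once
            merged.append(consensus[i])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            merged.append(consensus[i])    # consensus character the read lacks
            i += 1
        else:
            merged.append(read[j])         # read character missing from the consensus
            j += 1
    merged.extend(consensus[i:])
    merged.extend(read[j:])
    return "".join(merged)


def is_subsequence(read: str, sequence: str) -> bool:
    """Helper for checking that the merged consensus still contains a read."""
    remaining = iter(sequence)
    return all(base in remaining for base in read)
```

Merging `"ATCA"` into `"ACTAA"` this way gives a six-character consensus that contains both strings as subsequences. A weighted variant would break ties between equally long alignments using the per-position insertion counts.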

Consensus Refining
Pairwise alignment of reads is prone to two types of errors: inserting a pair of characters in the wrong order, and undersampling.
Solution: try a swap and a character doubling at each position, and keep the change if it improves the alignment score for some reads.
Example diagram (layout not preserved): ACTAA, ATA, ACA, ATCA, R_1, R_2, ACTAA, refine.
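The refinement idea can be sketched as a simple hill climb. This is an illustration under stated assumptions, not the authors' procedure: the alignment score of a read is taken to be its longest common subsequence length with the consensus, and a swap or doubling is kept only if it raises the total over all reads. All names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def total_score(consensus: str, reads: list[str]) -> int:
    """Assumed objective: sum of LCS lengths between the consensus and each read."""
    return sum(lcs_length(consensus, read) for read in reads)


def refine(consensus: str, reads: list[str]) -> str:
    """Apply adjacent swaps and character doublings while they improve the total score."""
    best, best_score = consensus, total_score(consensus, reads)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            candidates = [best[:i] + best[i] + best[i:]]  # double the character at i
            if i + 1 < len(best):                         # swap characters i and i + 1
                candidates.append(best[:i] + best[i + 1] + best[i] + best[i + 2:])
            for candidate in candidates:
                candidate_score = total_score(candidate, reads)
                if candidate_score > best_score:
                    best, best_score, improved = candidate, candidate_score, True
                    break
            if improved:
                break  # restart the scan from the left on the new consensus
    return best
```

The loop terminates because the total score strictly increases with each accepted change and is bounded by the combined length of the reads. For instance, `refine("ACT", ["ATC", "ATC"])` returns `"ATC"` after one swap.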

Clean up Procedure
Pairwise align all reads against the target to weight the positions of S by frequency of use.
Update the weights after each alignment to bias matches toward frequently used positions.
Delete all characters matched fewer than a certain number of times.
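A minimal sketch of the cleanup step, under the assumption that each alignment maximizes the number of matched characters first and the accumulated usage weight of the matched positions second. The names `matched_positions`, `cleanup` and `min_support` are hypothetical.

```python
def matched_positions(consensus: str, read: str, weights: list[int]) -> list[int]:
    """Consensus positions matched by read, preferring positions with higher weight."""
    n, m = len(consensus), len(read)
    # best[i][j] = (matches, weight sum) for consensus[i:] vs read[j:]; tuples compare lexicographically
    best = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            options = [best[i + 1][j], best[i][j + 1]]
            if consensus[i] == read[j]:
                matches, weight = best[i + 1][j + 1]
                options.append((matches + 1, weight + weights[i]))
            best[i][j] = max(options)

    positions = []
    i = j = 0
    while i < n and j < m:
        if consensus[i] == read[j]:
            matches, weight = best[i + 1][j + 1]
            if best[i][j] == (matches + 1, weight + weights[i]):
                positions.append(i)  # matching here is part of an optimal alignment
                i += 1
                j += 1
                continue
        if best[i][j] == best[i + 1][j]:
            i += 1  # skip a consensus character
        else:
            j += 1  # skip a read character
    return positions


def cleanup(consensus: str, reads: list[str], min_support: int) -> str:
    """Delete consensus characters matched by fewer than min_support reads."""
    weights = [0] * len(consensus)
    for read in reads:
        # weights are updated after every alignment, so later reads favor popular positions
        for i in matched_positions(consensus, read, weights):
            weights[i] += 1
    return "".join(c for c, w in zip(consensus, weights) if w >= min_support)
```

With consensus `"ACGTTA"`, reads `["ACGTA", "ACGTA", "ACTA"]` and `min_support=2`, the duplicated T that no read uses is removed, leaving `"ACGTA"`.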

Complexity Analysis
Each insertion step takes O(k * n^2) time.
Each refining step takes O(k * n^2) time.
Each cleanup step takes O(k * n^2) time.
Total: O(n_iter * k * n^2), where n_iter is the number of iterations.

Results
For base deletion rates as high as 40%, we can completely reconstruct sequences given high enough coverage (50 times coverage).
For larger error rates, our algorithm finds shorter supersequences of the reads; that is, several answers are consistent with the data, so exact reconstruction is impossible.
Here we ignored the possibility of insertion and substitution errors, but it is clear that our methods can adapt to different error models at lower error rates.

Results (continued; figure not preserved)

Future Work
We want to build your single molecule sequence assembler!
Our Stroll shotgun sequence assembler (Chen and Skiena) was used by Brookhaven National Laboratory to sequence the bacterium Borrelia burgdorferi.
We are particularly interested in identifying better error models for sequencing technologies currently under development.

The End
Questions?