Assembling Sequences Using Trace Signals and Additional Sequence Information Bastien Chevreux, Thomas Pfisterer, Thomas Wetter, Sandor Suhai Deutsches.

Slides:



Advertisements
Similar presentations
Honolulu, 23 rd of May 2011PESOS Evaluating the Compatibility of Conversational Service Interactions Sam Guinea and Paola Spoletini.
Advertisements

Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gao Song 2010/04/27. Outline Concepts Problem definition Non-error Case Edge-error Case Disconnected Components Simulated Data Future Work.
RNA Assembly Using extending method. Wei Xueliang
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
1 Image Completion using Global Optimization Presented by Tingfan Wu.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Mining SNPs from EST Databases Picoult-Newberg et al. (1999)
Overview of sequence database searching techniques and multiple alignment May 1, 2001 Quiz on May 3-Dynamic programming- Needleman-Wunsch method Learning.
Assembly.
Gene Finding Charles Yan.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Genome sequencing and assembling
Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman.
Sequence alignment, E-value & Extreme value distribution
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Sequence comparison: Local alignment
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Chapter 5 Multiple Sequence Alignment.
Sequence Alignment.
Multiple testing correction
Assembling Genomes BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Mouse Genome Sequencing
CUGI Pilot Sequencing/Assembly Projects Christopher Saski.
Phred/Phrap/Consed Analysis A User’s View Arthur Gruber International Training Course on Bioinformatics Applied to Genomic Studies Rio de Janeiro 2001.
A computational study of protein folding pathways Reducing the computational complexity of the folding process using the building block folding model.
Introduction to Short Read Sequencing Analysis
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
VAST 2011 Sebastian Bremm, Tatiana von Landesberger, Martin Heß, Tobias Schreck, Philipp Weil, and Kay Hamacher Interactive-Graphics Systems TU Darmstadt,
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake.
MicroRNA identification based on sequence and structure alignment Presented by - Neeta Jain Xiaowo Wang†, Jing Zhang†, Fei Li, Jin Gu, Tao He, Xuegong.
11 Overview Paracel GeneMatcher2. 22 GeneMatcher2 The GeneMatcher system comprises of hardware and software components that significantly accelerate a.
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Motion Planning in Games Mark Overmars Utrecht University.
Using SWARM service to run a Grid based EST Sequence Assembly Karthik Narayan Primary Advisor : Dr. Geoffrey Fox 1.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Rationale for searching sequence databases June 25, 2003 Writing projects due July 11 Learning objectives- FASTA and BLAST programs. Psi-Blast Workshop-Use.
Lettuce/Sunflower EST CGPDB project. Data analysis, assembly visualization and validation. Alexander Kozik, Brian Chan, Richard Michelmore. Department.
Human Genome.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Fragment Assembly 蔡懷寬 We would like to know the Target DNA sequence.
GigAssembler. Genome Assembly: A big picture
Database Similarity Search. 2 Sequences that are similar probably have the same function Why do we care to align sequences?
Predicting the Location and Time of Mobile Phone Users by Using Sequential Pattern Mining Techniques Mert Özer, Ilkcan Keles, Ismail Hakki Toroslu, Pinar.
Mojavensis: Issues of Polymorphisms Chris Shaffer GEP 2009 Washington University.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
Chapter 5 Sequence Assembly: Assembling the Human Genome.
Assembly S.O.P. Overlap Layout Consensus. Reference Assembly 1.Align reads to a reference sequence 2.??? 3.PROFIT!!!!!
ISA Kim Hye mi. Introduction Input Spectrum data (Protein database) Peptide assignment Peptide validation manual validation PeptideProphet.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Virginia Commonwealth University
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
A Fast Hybrid Short Read Fragment Assembly Algorithm
Fragment Assembly (in whole-genome shotgun sequencing)
Genome sequence assembly
Sequence comparison: Local alignment
Department of Computer Science
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
Sequence alignment, Part 2
Basic Local Alignment Search Tool
Assembling Genomes BCH339N Systems Biology / Bioinformatics – Spring 2016 Edward Marcotte, Univ of Texas at Austin.
Fragment Assembly 7/30/2019.
Presentation transcript:

Assembling Sequences Using Trace Signals and Additional Sequence Information Bastien Chevreux, Thomas Pfisterer, Thomas Wetter, Sandor Suhai Deutsches Krebsforschungszentrum Heidelberg

Problem definition

Introduction

?

Signal problems A or G ? 5 A or 4 A? 1 T or 2 T?

DNA problems Chemical properties –Coiling of DNA –Problems with dye chemistry Repetitive elements –Standard short term repeat (ALU, REPT etc.) –Long term repeats of sometimes several kb

Conventional assembly Re-para- metrisation Assembly ContigsReads Contig Join/Break Base editing Validation

Integrated Assembler-Editor Re-para- metrisation ContigsReads Contig Join/Break Base editing Validation Assembl er Automatic Editor

Assembler: Input Collection of reads –unknown relationship –unknown direction Each read –unknown error distribution –sequencing vector tagged –trace signal information –opt. base quality values –opt. quality clipping, marking HCRs (High Confidence Regions) –opt. standard repeats tagged –opt. template information

Assembly: Framework Establishing relationships of each read against each other results in full oversight over the whole assembly Problem: k reads -> time complexity O(k 2 ) Fast read comparison routines needed Smith-Waterman has O(mn), very slow

DNA-SAND algorithm Shift-AND algorithm: fault tolerant, O(cmn) modified Shift-AND for read comparison, DNA-SAND: fault tolerant, O(cn) with 0<c<12 high sensitivity and specificity –less than 0.75% missed overlaps –around 45-50% false positive hits

Assembly: Framework Fault tolerant Sandsieve principle: obvious mismatches discarded, potential matches remembered Check each read in forward and reverse complement direction

Overlap confirmation Evaluates potential overlaps Standard (banded) Smith-Waterman algorithm: max(O(bm), O(bn)) Rough calculation of SW match quality, eliminating false positive DNA-SAND matches Calculate an “alignment weight” for accepted overlaps

Overlap confirmation Rejected match –Out of band! –Overlap: 204 bases –Score: 133 –Score ratio: 65% Accepted match –Overlap: 196 bases –Score: 180 –Score ratio: 92% Weight:

Building a weighted graph Example: 6 reads All possible overlaps for 2 reads

Building a weighted graph Pruned by DNA-SAND

Building a weighted graph Smith-Waterman Prune Attribute - direction - weight

Building contigs Multiple alignment is too slow Building a consensus by iteratively aligning reads against existing consensus Important: –Order of read alignments –Finding good alignment candidates –Possibility to reject candidates

Interaction: Pathfinder & Contig Pathfinder: –search good starting point for contig building –find good alignment candidates to add to existing contig –always inspect alternative paths in overlap graph Contig: –accept reads that match to existing consensus –reject reads that do not match –find inconsistencies that ´build up slowly´ and mark these

Pathfinder: Strategy Finding starting points: –Search for node with a high number of reasonably weighted edges –Exclude edges below threshold Finding next alignment candidate: –Find reads with best nodes in contig –Recursively analyse best edges in graph

Contig: Strategy Align given read of given edge to existing contig at approximated position Accept read that match Reject reads that introduce –significantly higher error rates in contig than predicted by weighted edge –many non-editable errors in repetitive regions –inconsistencies with given template insert sizes

Contig: Raw

Contig: Edited

Contig: Raw

Contig: Edited

Repeat locator

High Confidence Regions

Extending HCRs ´beef up´existing contigs; trivial, very fast extend existing contigs; simple, quick find new contigs to build; bold, slow

Data preprocessing Fast read comparison Overlap confirmation Graph building Pathfinder Contig assembly Read extension Finished project Automatic editing Repeat marker

beta-testing almost completed assembler & editor in use to assemble projects up to reads first evaluation: human finished 35kb project (Golden Standard) without fine-tuning assembled contigs have 99,9x% identity whole genome shotgun with reads in preparation other applications like EST clustering? Status

Acknowledgements Prof. Rosenthal, Matthias Platzer, Uwe Menzel and the IMB Jena genome sequencing centre Bernd Drescher and Lion Biosciences AG, Heidelberg Canonical Homepage