GENOME ASSEMBLY Candidatus Carsonella ruddii

Problem: How can Eulerian graphs be used to assemble a genomic sequence?
■ Real-life scenario: multiple copies of the genome are fragmented at random points, giving reads with repeats and missing regions.
■ I simulated my own ‘reads’ starting from Candidatus Carsonella ruddii, one of the smallest genomes.
– Full genome available at NCBI
■ Data simulation: 2 programs, kmerComp and readsDict
kmerComp
Input: string, integer k (length of k-mers)
Output: dictionary (unordered) whose values are the k-mers of the k-mer composition
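The kmerComp step can be sketched in a few lines of Python. This is a minimal illustration, not the original program: the name kmer_comp and the choice of start positions as dictionary keys are assumptions, based on the numbered k-mer dictionary shown in the deck.

```python
def kmer_comp(text, k):
    """Return the k-mer composition of text as {start_position: k-mer}.

    Hypothetical re-creation of kmerComp; keying by start position is an
    assumption based on the numbered example in the slides.
    """
    # A string of length n contains n - k + 1 overlapping k-mers.
    return {i: text[i:i + k] for i in range(len(text) - k + 1)}


if __name__ == "__main__":
    for index, kmer in kmer_comp("ATGAATAAT", 4).items():
        print(index, kmer)
```

Consecutive k-mers overlap in k-1 characters, which is the property the adjacency list relies on later.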

Data simulation (cont.)
■ readsDict
Input: string, integer k (length of k-mers), integer c (number of copies)
Output: dictionary of reads; some k-mers may be missing and some repeated.
My aim was to replicate what experimentally obtained reads may look like, with c corresponding to the number of copies of the original DNA. A nested FOR loop runs through each of the copies (the outer FOR loop runs kmerComp c times), selecting a number of k-mers from each copy.
Example from k-mer dictionary:
0: 'ATGAATAATATTTTTGCAAAAATAA',
1: 'TGAATAATATTTTTGCAAAAATAAC',
2: 'GAATAATATTTTTGCAAAAATAACT',
3: 'AATAATATTTTTGCAAAAATAACTG',
4: 'ATAATATTTTTGCAAAAATAACTGC',
Example from reads dictionary:
0: 'TTTTTTTTTTTAAAAAAAAAAATAT',
1: 'CTAATAGAAAAATAATTTTTTATTT',
2: 'GAACAAAATGATATAAAAAAAATTA',
3: 'TATGTGCTGGGACTTTTATTAATTC',
4: 'TTTAATTTAACAATGGAAAAACAAA',
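A rough sketch of how readsDict might simulate reads is given below. The slides do not say how many k-mers are selected per copy, so the keep fraction, the seed parameter, and the function names are all assumptions made for illustration.

```python
import random


def kmer_comp(text, k):
    """k-mer composition of text as {start_position: k-mer} (hypothetical helper)."""
    return {i: text[i:i + k] for i in range(len(text) - k + 1)}


def reads_dict(text, k, c, keep=0.5, seed=None):
    """Simulate reads from c copies of text.

    keep (fraction of k-mers sampled per copy) and seed are assumed
    parameters; the original program's selection rule is not documented.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(c):                       # outer loop: one pass per copy
        kmers = list(kmer_comp(text, k).values())
        n = max(1, int(keep * len(kmers)))
        # Sampling without replacement drops some k-mers from this copy;
        # pooling several copies makes other k-mers appear more than once.
        reads.extend(rng.sample(kmers, n))
    rng.shuffle(reads)                       # reads arrive in no particular order
    return dict(enumerate(reads))


if __name__ == "__main__":
    for index, read in reads_dict("ATGAATAATATTTTTGC", 5, 3, seed=1).items():
        print(index, read)
```

With a small keep value and few copies, some k-mers never appear at all, which reproduces the "missing regions" problem described in the first slide.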

Steps towards genome assembly:
■ adjList (Eulerian graph)
Input: list of k-mers
Output: dictionary in which each prefix (the first k-1 characters of a k-mer) is paired with the corresponding suffix (the last k-1 characters)
ADJACENCY LIST FROM K-MERS:
'TTTTGTGTTGGAAAATAATGATTT': 'TTTGTGTTGGAAAATAATGATTTA',
'TTGCAGGAATAAATGCAGCTAGAA': 'TGCAGGAATAAATGCAGCTAGAAA'
ADJACENCY LIST FROM READS:
'TTGCAGGAATAAATGCAGCTAGAA': 'TGCAGGAATAAATGCAGCTAGAAA',
'GCTAAAAATATAATTTTATGTGCT': 'CTAAAAATATAATTTTATGTGCTG'
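The prefix-to-suffix pairing can be sketched as follows. The name adj_list is hypothetical, and storing a list of suffixes per prefix (rather than a single suffix, as in the examples above) is an assumption made so that repeated prefixes in the reads do not overwrite one another.

```python
from collections import defaultdict


def adj_list(kmers):
    """Map each (k-1)-character prefix to the list of suffixes that follow it.

    Each k-mer contributes one edge: prefix -> suffix. Keeping a list per
    prefix is an assumption; it preserves branches and repeated edges.
    """
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    print(adj_list(["ATG", "TGA", "TGC"]))
```

A prefix with more than one suffix is exactly the situation in which the path through the graph is no longer forced.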

Genome assembly (cont.)
■ StringR()
Worked only for the k-mer composition (when each k-mer had only one possible path).
Finds the first term (the one that is the suffix of no other term) and follows the path from there.
■ EulerianCycle
Input: adjacency list (dictionary)
Output: list of points (trajectory in the graph)
FOR loop: start at each of the possible unused edges; modify the list unusedEdges every time. Update a second list (edgeList) with each point it tries out.
Nested WHILE loop: runs through unusedEdges until none remain. 2 options: a point can be followed by only one other point, or by more than one (choose randomly).
At the end of the WHILE loop, print edgeList.
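The StringR idea (find the term that is nobody's suffix, then follow the only available path) might look roughly like this. It is a sketch for the unambiguous case only, assuming a dictionary that maps each prefix to exactly one suffix; the name string_r and the error handling are illustrative.

```python
def string_r(adj):
    """Rebuild a sequence from {prefix: suffix} when every node has one successor.

    Assumes a single unbranched path, as in the clean k-mer composition.
    """
    suffixes = set(adj.values())
    # The start node is the one prefix that never appears as a suffix.
    starts = [node for node in adj if node not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    sequence = node
    steps = 0
    while node in adj:
        node = adj[node]
        sequence += node[-1]      # each step adds one new character
        steps += 1
        if steps > len(adj):      # guard against walking a cycle forever
            raise ValueError("cycle detected")
    return sequence


if __name__ == "__main__":
    adj = {"ATG": "TGC", "TGC": "GCC", "GCC": "CCG", "CCG": "CGT", "CGT": "GTA"}
    print(string_r(adj))
```

Because a plain dictionary holds only one suffix per prefix, this approach breaks down as soon as the reads contain repeats or branches.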

Results and Evaluation
■ Difficulty in dealing with incomplete or ‘excessive’ information.
■ Difficulty in making ‘random’ decisions.
– In the program eulerianCycle, I was unable to deal with the ‘mistakes’: every time the program randomly took a path that was shorter than the number of edges (a dead end reached before every edge had been used), it would just give me an error.
■ I had expected the problems in the book chapter to be more clearly applicable to the sample genome.
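One standard way around the dead-end problem described above is Hierholzer's algorithm, which backtracks from dead ends instead of failing. The sketch below is not the eulerianCycle program from these slides; it assumes an adjacency list that maps each node to a list of successors, and the names eulerian_path and spell are hypothetical.

```python
from collections import defaultdict


def eulerian_path(adj):
    """Hierholzer's algorithm for a directed graph {node: [successors]}."""
    if not adj:
        return []
    graph = {node: list(succ) for node, succ in adj.items()}  # work on a copy
    balance = defaultdict(int)            # out-degree minus in-degree
    for node, succ in graph.items():
        balance[node] += len(succ)
        for nxt in succ:
            balance[nxt] -= 1
    # Start where there is one extra outgoing edge; otherwise start anywhere.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph.get(node):
            stack.append(graph[node].pop())   # follow any unused edge
        else:
            path.append(stack.pop())          # dead end: emit node, back up
    path.reverse()
    n_edges = sum(len(succ) for succ in adj.values())
    if len(path) != n_edges + 1:
        raise ValueError("graph has no Eulerian path")
    return path


def spell(path):
    """Join overlapping nodes of a path back into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    adj = {"AT": ["TG"], "TG": ["GG", "GC"], "GG": ["GT"], "GT": ["TG"]}
    print(spell(eulerian_path(adj)))
```

In the small example, a walk that steps from TG to GC first reaches a dead end with three edges unused; the stack lets the algorithm back up and splice in the remaining cycle rather than raising an error. It still fails, by design, when missing reads leave the graph without any Eulerian path.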