splitMEM: graphical pan-genome analysis with suffix skips. Shoshana Marcus, May 29, 2014.



Outline: 1. Overview 2. splitMEM Algorithm 3. Pan-genome Analysis

Objective. Input: several complete genomes (A, B, C, D), available today for many microbial species and in the near future for higher eukaryotes. Pan-genome: analyze multiple genomes of a species together. Output: a compressed de Bruijn graph. The graphical representation depicts how population variants relate to each other, especially where they diverge at branch points. How well conserved is a sequence? What are the network properties?

de Bruijn graph: one node for each distinct k-mer. A directed edge connects consecutive k-mers, so adjacent nodes overlap by k-1 bp. Self-loops and multi-edges are possible. Example sequences: AGAAGTCC, ATAAGTTA. Reconstruct the original sequence with an Eulerian path through the graph, visiting each edge once.
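The construction on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `de_bruijn_edges` and the choice k=3 are assumptions made here, not taken from the talk, and the two strings are the example sequences shown on the slide.

```python
from collections import Counter

def de_bruijn_edges(sequences, k):
    """Return (nodes, edges): the distinct k-mers, and a Counter of directed
    edges between k-mers that are consecutive in some input sequence."""
    nodes = set()
    edges = Counter()
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            edges[(a, b)] += 1  # multi-edges are kept as counts
    return nodes, edges

# k=3 is an arbitrary small value chosen for readability (assumption).
nodes, edges = de_bruijn_edges(["AGAAGTCC", "ATAAGTTA"], k=3)
print(len(nodes), "nodes;", sum(edges.values()), "edges")
print("AAG -> AGT multiplicity:", edges[("AAG", "AGT")])
```

The edge AAG -> AGT has multiplicity 2 because both sequences contain AAGT; this is the kind of shared segment the rest of the talk is concerned with.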

Compressed de Bruijn graph: merge non-branching chains of nodes, giving the minimum number of nodes that preserve path labels. It is usually built from the uncompressed graph; we build it directly in O(n log n) time and space.
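For comparison, here is a minimal sketch of the usual route mentioned above: build the uncompressed graph first, then merge non-branching chains. It is not the direct suffix-tree construction that splitMEM uses. The helper names `kmer_graph` and `compress`, and k=3, are assumptions for illustration.

```python
from collections import defaultdict

def kmer_graph(sequences, k):
    """Uncompressed de Bruijn graph as successor and predecessor sets."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred

def compress(sequences, k):
    """Return the sorted labels of the compressed nodes."""
    nodes, succ, pred = kmer_graph(sequences, k)

    def continues(a, b):
        # a -> b is mergeable only if it is a's sole exit and b's sole entry.
        return len(succ[a]) == 1 and len(pred[b]) == 1 and a != b

    def is_start(v):
        # v begins a chain unless it would be merged into its only predecessor.
        return not (len(pred[v]) == 1 and continues(next(iter(pred[v])), v))

    labels, seen = [], set()
    # Visit chain starts first; leftovers are isolated non-branching cycles.
    for v in sorted(nodes, key=lambda n: (not is_start(n), n)):
        if v in seen:
            continue
        seen.add(v)
        label = v
        while len(succ[v]) == 1:
            w = next(iter(succ[v]))
            if w in seen or not continues(v, w):
                break
            seen.add(w)
            label += w[-1]  # consecutive k-mers overlap by k-1 bases
            v = w
        labels.append(label)
    return sorted(labels)

compressed = compress(["AGAAGTCC", "ATAAGTTA"], k=3)
print(compressed)  # ['AAGT', 'AGAA', 'ATAA', 'GTCC', 'GTTA']
```

With these inputs the ten 3-mer nodes collapse to five, and AAGT is the single node shared by both sequences.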

Compressed de Bruijn graph: 9 strains of Bacillus anthracis, k=25

Compressed de Bruijn graph: 9 strains of Bacillus anthracis, k=1000

Outline: 1. Overview 2. splitMEM Algorithm 3. Pan-genome Analysis

Compressed de Bruijn graph. Input: AGAAGTCC$ATAAGTTA. Types of nodes: (i) repeatNodes, (ii) uniqueNodes.

Construct set of repeatNodes:
1. Build suffix tree of genome.
2. Mark internal nodes that are MEMs, length ≥ k.
3. Preprocess suffix tree for lowest marked ancestor (LMA) queries.
4. Compute repeatNodes in compressed de Bruijn graph by decomposing MEMs and extracting overlapping components, length ≥ k.

MEMs: a Maximal Exact Match (MEM) is an exact match within a sequence that cannot be extended left or right without introducing a mismatch. Example: TGCACGCAA
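The definition can be checked on the slide's example with a brute-force search. This is a quadratic sketch for illustration only, not the suffix-tree marking used by splitMEM; the function name `find_mems` and the `min_len` parameter are assumptions.

```python
def find_mems(s, min_len=1):
    """Return {substring: sorted start positions} for repeats that cannot be
    extended left or right in at least one pair of occurrences."""
    mems = {}
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] != s[j]:
                continue
            # Left-maximal: one more base to the left would match, so skip.
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Extend to the right while the two copies agree.
            length = 0
            while j + length < n and s[i + length] == s[j + length]:
                length += 1
            if length >= min_len:
                mems.setdefault(s[i:i + length], set()).update((i, j))
    return {m: sorted(pos) for m, pos in mems.items()}

result = find_mems("TGCACGCAA", min_len=2)
print(result)  # {'GCA': [1, 5]}
```

GCA is preceded by T and C and followed by C and A in its two occurrences, so neither side can be extended.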

1 MEM occurs twice: GCA. TGCAC…GGCA A

Overlapping MEMs: TGCCATCGCCAACCAT

Construct set of repeatNodes:
1. Build suffix tree of genome.
2. Mark internal nodes that are MEMs, length ≥ k.
3. Preprocess suffix tree for lowest marked ancestor (LMA) queries.
4. Compute repeatNodes in compressed de Bruijn graph by decomposing MEMs and extracting overlapping components, length ≥ k.

Split MEM to repeatNodes (figure)

splitMEM software: C++, open source. Input modes: single genome (FASTA file) or pan-genome (multi-FASTA file). Multi-k-mer: construct several compressed de Bruijn graphs without rebuilding the suffix tree.

Outline: 1. Overview 2. splitMEM Algorithm 3. Pan-genome Analysis

Pan-genome analysis of B. anthracis and E. coli. Examine graph properties: number of nodes, number of edges, and average degree; node length distribution; genome sharing among nodes; distribution of node distances to the core genome.
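As a rough sketch of the first two properties, the snippet below computes node count, edge count, average degree, and a node-length histogram for a small adjacency map. The toy graph, the function name `graph_properties`, and the reading of "average degree" as edges per node (average out-degree) are all assumptions made here; the talk does not state which definition it uses.

```python
from collections import Counter

def graph_properties(adj):
    """adj maps each node label to the list of its successor labels."""
    n_nodes = len(adj)
    n_edges = sum(len(targets) for targets in adj.values())
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        # Assumption: average degree taken as edges per node (out-degree).
        "avg_degree": n_edges / n_nodes,
        # Node length = number of bases in the node label.
        "length_hist": Counter(len(label) for label in adj),
    }

# Hypothetical toy graph: two entry nodes, one shared node, two exits.
toy = {
    "AGAA": ["AAGT"],
    "ATAA": ["AAGT"],
    "AAGT": ["GTCC", "GTTA"],
    "GTCC": [],
    "GTTA": [],
}
props = graph_properties(toy)
print(props)
```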

Pan-genome analysis: graphs of main chromosomes for 9 strains of Bacillus anthracis and a selection of 9 strains of Escherichia coli. Table columns: Species, K, Nodes, Edges, Avg. Degree, with three rows each for B. anthracis and E. coli (values not preserved in this transcript).

Pan-genome analysis of B. anthracis and E. coli. Examine graph properties: node length distribution; genome sharing among nodes; distribution of node distances to the core genome.

Pan-genome analysis: histogram of node lengths

Pan-genome analysis: histogram of node lengths. Spike at 2k: SNPs.

Pan-genome analysis of B. anthracis and E. coli. Examine graph properties: node length distribution; genome sharing among nodes; distribution of node distances to the core genome.

Pan-genome analysis: fraction of nodes with each level of genome sharing (panels: B. anthracis, E. coli)

Pan-genome analysis of B. anthracis and E. coli. Examine graph properties: node length distribution; genome sharing among nodes; distribution of node distances to the core genome.

Pan-genome analysis: the graph encodes the sequence context of segments. Core genome: subsequences that occur in at least 70% of the underlying genomes. Nodes can be further in terms of hops while closer by base pairs. Branch and bound search.
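The two ideas on this slide can be sketched as follows. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the toy graph, following edges only in their forward direction, and measuring distance as the total length of the nodes passed through between the start node and the first core node reached.

```python
def core_nodes(node_genomes, n_genomes, threshold=0.7):
    """Nodes whose sequence occurs in at least `threshold` of the genomes."""
    return {v for v, g in node_genomes.items()
            if len(g) / n_genomes >= threshold}

def distance_to_core(start, adj, length, core, max_hops):
    """Fewest base pairs from `start` to any core node within `max_hops`
    hops, or None if no core node is reachable in that radius."""
    best = None

    def search(v, hops, dist, path):
        nonlocal best
        if best is not None and dist >= best:
            return  # bound: this branch cannot beat the best found so far
        if v in core:
            best = dist
            return
        if hops == max_hops:
            return
        # Passing through an intermediate node costs its length in bp.
        step = 0 if v == start else length[v]
        for w in adj.get(v, ()):
            if w not in path:  # do not revisit nodes on the current path
                search(w, hops + 1, dist + step, path | {w})

    search(start, 0, 0, {start})
    return best

# Hypothetical example: core1 is 2 hops away but 1000 bp distant,
# core2 is 3 hops away but only 20 bp distant.
adj = {"s": ["a", "b"], "a": ["core1"], "b": ["c"], "c": ["core2"]}
length = {"s": 50, "a": 1000, "b": 10, "c": 10, "core1": 30, "core2": 30}
core = {"core1", "core2"}
print(distance_to_core("s", adj, length, core, max_hops=3))  # 20
print(distance_to_core("s", adj, length, core, max_hops=2))  # 1000
```

The example shows why a hop-limited search matters: the core node that is further away in hops is closer in base pairs, so the answer depends on the searched hop radius.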

Pan-genome analysis: node distances to core genome. Searched hop radius.

Summary: Identify pan-genome relationships graphically. Topological relationship between the suffix tree and the compressed de Bruijn graph. Direct construction of the compressed de Bruijn graph for a single genome or a pan-genome. Introduce suffix skips. Explore pan-genome graphs of B. anthracis and E. coli. Reference: Marcus, S, Lee, H, Schatz, MC (2014) SplitMEM: Graphical pan-genome analysis with suffix skips. bioRxiv.

Acknowledgments: Michael Schatz, Hayan Lee, Giuseppe Narzisi, James Gurtowski, Schatz Lab, IT department, Todd Heywood

Thank You!

Future work. Improve splitMEM software: reduce space using a compressed full-text index instead of a suffix tree; approximate indexing of strains to form a pan-genome graph; alignment of reads to the pan-genome. Biological applications: functional enrichment of core-genome and genome-specific segments; expand the study to a larger collection of microbes and larger genomes.