Indexing Graphs for Path Queries with Applications in Genome Research

Presentation transcript:

Indexing Graphs for Path Queries with Applications in Genome Research Presented by: Evan Stene Spring 2017

Paper Information. Authors: Jouni Sirén, Niko Välimäki, Veli Mäkinen. Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics. Date of publication: January 2014. Affiliations: Department of Computer Science, University of Chile; Research Programs Unit, University of Helsinki.

Background – Alignment: Part of the process of converting biological data into computer-readable data (sequencing). Reading is prone to error. The sequence being searched (the reference genome) will never match all queries exactly.

Combining Reference Sequences. Fig. 1. Pattern AGCTGTGT matching the multiple alignment when allowing it to change row when necessary. Notes: mention that small variations can be used with a single reference as well, and that backtracking is the alternative and is slow.

Background – Suffix Arrays: An array of all suffixes of a string, sorted in ascending lexicographic order. Allows finding a substring s in O(|s|) steps when searched through a BWT-based index (plain binary search over the array takes O(|s| log n)). A closed-form function exists to find the letter preceding a suffix. Useful for compressed indices, e.g. by storing only sampled values.
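
As an illustration of the suffix array idea, here is a minimal Python sketch. It builds the array naively (sorting whole suffixes, which is far too slow for genomes) and locates a pattern with two binary searches. The function names `build_suffix_array` and `find_occurrences` are made up for this example and do not come from the paper.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text, using binary search over sa."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "ABRACADABRA$"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ABRA")  # occurrences start at positions 0 and 7
```

All suffixes that begin with the pattern are adjacent in the sorted array, which is why a match is always reported as a contiguous range.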

Background – Prefix Doubling: Sort suffixes by their prefixes, iteratively. Iteration i uses prefixes of length 2^i. Each step effectively groups suffixes that share the same prefix. Suffixes not belonging to any group (those with a unique prefix) are already in sorted order. Requires up to lg(n) iterations.
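
The doubling idea can be sketched in a few lines of Python. This is a simple textbook-style version that re-sorts the whole array in every round, not the construction used in the paper; the function name is an assumption for this example.

```python
def suffix_array_prefix_doubling(text):
    """Build a suffix array by sorting suffixes on prefixes of length 1, 2, 4, ..."""
    n = len(text)
    if n == 0:
        return []

    rank = [ord(c) for c in text]  # ranks by the first character
    sa = list(range(n))
    k = 1
    while True:
        # Rank of the first k characters, then rank of the next k characters.
        # -1 marks "ran off the end", so shorter suffixes sort first.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Reassign ranks by position in the sorted array; equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank

        if rank[sa[-1]] == n - 1:  # every suffix has a unique rank: done
            return sa
        k *= 2
```

Each round doubles the compared prefix length, so the loop runs at most about lg(n) times before all ranks become unique.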

Background – Prefix Doubling Example
[Table residue: columns Index 1 to 12; a String row (surviving characters: $ A B R C D #); rows SA1, SA2, SA3, SA4 for the successive doubling iterations. Most cell values did not survive extraction. Surviving SA1 groups: ( 1 11 ) ( 2 9 ) ( 3 10 ).]
Notes: # and $ are the beginning- and end-of-string markers.
LF = C[T[SA[i] – 1]] + rank(SA[i]), where rank is counted within that letter.
E.g. for finding AB at 8, 9: T[SA[9] – 1] = A, C[A] = 1, rank(SA4[9]) = 1 since it is the first B.
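
The LF formula can be tried out with a short Python sketch. It uses 0-based indices and the string ABRACADABRA$ as a stand-in, so the numbers do not line up one-to-one with the 1-based table on the slide. The helper names are assumptions for this example.

```python
def bwt_from_text(text):
    """Return (suffix array, BWT) of text; text must end with a unique smallest terminator."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character of a row is the letter preceding that suffix
    # (text[-1] wraps around to the terminator for the suffix starting at 0).
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


def lf_mapping(bwt):
    """Return (C, lf) where lf[i] = C[bwt[i]] + number of earlier occurrences of bwt[i]."""
    counts = {}
    for c in bwt:
        counts[c] = counts.get(c, 0) + 1

    # C[c] = number of characters in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    seen, lf = {}, []
    for c in bwt:
        lf.append(C[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    return C, lf


def invert_bwt(bwt, terminator="$"):
    """Recover the text by walking LF from the row of the terminator-only suffix (row 0)."""
    _, lf = lf_mapping(bwt)
    i, out = 0, []
    for _ in range(len(bwt) - 1):
        out.append(bwt[i])  # letter preceding the current suffix
        i = lf[i]           # row of the suffix that starts with that letter
    return "".join(reversed(out)) + terminator
```

Following LF from a row moves to the suffix that is one character longer on the left, which is the step that backward search repeats for every pattern character.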

Motivation: Combine reference sequences to provide greater accuracy when aligning. Create a suffix-array-like structure using a directed acyclic graph (DAG). Relate the closed-form transformations of suffix arrays to the DAG. Compress the sorted graph so that it fits into local memory.

Building the Graph
G A C G T A – C T G
G A C G T A – – – G
G A T G T A – C T G
G A C – T A C C T G
Notes: use the G, T, A examples working back from the end, and the C at position 3.
Fig. 4. A reverse deterministic automaton corresponding to the first 10 positions of the multiple alignment in Fig. 1.
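
A rough Python sketch of the first step, turning the four alignment rows above into a graph, might look like this. It only merges identical letters in the same column and links each letter to the next non-gap letter in its row. It does not perform the determinization that produces the automaton in Fig. 4, and the node encoding `(column, letter)` is an assumption for this example. Gaps are written as `-` here.

```python
ALIGNMENT = [
    "GACGTA-CTG",
    "GACGTA---G",
    "GATGTA-CTG",
    "GAC-TACCTG",
]


def alignment_to_graph(rows, gap="-"):
    """Return (nodes, edges) where a node is (1-based column, letter)."""
    nodes, edges = set(), set()
    for row in rows:
        prev = None
        for col, ch in enumerate(row, start=1):
            if ch == gap:
                continue  # gaps produce no node; the edge skips over them
            node = (col, ch)
            nodes.add(node)
            if prev is not None:
                edges.add((prev, node))
            prev = node
    return nodes, edges


nodes, edges = alignment_to_graph(ALIGNMENT)
```

Column 3 yields two nodes (C and T), and the gap in the fourth row yields an edge from the C at column 3 directly to the T at column 5, which is how a pattern can "change row" while following a path.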

Prefix Sorting: For each node v in graph A, create the tuple (from(v), w, rank(v)), one for each successor w, and store the tuples in an array. Sort the tuples by rank. For tuples with unique ranks, set rank(v) = (rank(v), 0). All other tuples are combined as follows: a tuple with from(u) whose w equals from(v) can be combined with the tuple for v, giving rank(u,v) = (rank(u), rank(v)). Sort by the newly formed tuples and reassign each rank by its location in the array. Merge nodes with the same rank and from values.
Definitions: from(v) = first node in the path (forming the prefix). rank(v) = rank of the prefix starting from v among all other prefixes. w = successor of the last node in the path; w is 0 if there are no successors.
Notes: use the 1st A node from the previous slide as the example. The 1st part is doubling, the 2nd part is pruning.
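
The doubling part of these steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it extends every path in every round, so it omits the pruning of tuples whose rank is already unique as well as the final merging of nodes, and it uses `None` instead of 0 for "no successor". The names `initial_paths` and `double_paths` are made up for this example.

```python
def initial_paths(labels, successors):
    """One tuple (from, w, rank) per node and successor w; rank is the rank of the node label."""
    order = {c: r for r, c in enumerate(sorted(set(labels.values())), start=1)}
    paths = []
    for v, c in labels.items():
        succ = successors.get(v, [])
        if succ:
            paths.extend((v, w, order[c]) for w in succ)
        else:
            paths.append((v, None, order[c]))  # path ends at a sink
    return paths


def double_paths(paths):
    """Join each path with the paths starting at its successor, then re-rank."""
    by_start = {}
    for frm, nxt, rank in paths:
        by_start.setdefault(frm, []).append((nxt, rank))

    joined = set()
    for frm, nxt, rank in paths:
        if nxt is None:
            joined.add((frm, None, (rank, 0)))  # nothing to append; 0 sorts first
        else:
            for nxt2, rank2 in by_start[nxt]:
                joined.add((frm, nxt2, (rank, rank2)))

    # New rank = position of the rank pair among all distinct pairs.
    order = {pair: r for r, pair in enumerate(sorted({p[2] for p in joined}), start=1)}
    return sorted(((f, w, order[p]) for f, w, p in joined), key=lambda t: (t[2], t[0]))


# A chain graph spelling BANANA$ behaves like an ordinary string.
text = "BANANA$"
labels = dict(enumerate(text))
successors = {i: [i + 1] for i in range(len(text) - 1)}
paths = initial_paths(labels, successors)
for _ in range(3):  # path labels of length 2, 4, 8
    paths = double_paths(paths)
node_order = [frm for frm, _, _ in paths]
```

On the chain graph the resulting node order equals the suffix array of the string, which is the sense in which prefix sorting generalizes suffix sorting to graphs. On a branching graph a node can keep several tuples with different ranks, one per distinct path label.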

Prefix Sorting – Adding the Edges. Fig. 5. A prefix-sorted automaton built for the automaton in Fig. 4. The strings above nodes are prefixes p(v). Notes: use the middle T as the edge example; it belongs to 4 prefixes.

GCSA: The BWT is stored as the list of incoming edges.

Graph as Array: AGC < AGZ -> prefixes for nodes 0 and 1. Notes: mention that the offset is not stored in this example. Figures on this and the following slides by: Daehwan Kim infphilo@gmail.com

Searching Example

Example Continued…

Example Ends. Notes: this query had a unique match. If the range still contained more than one entry, the result would be ambiguous. If the range shrinks to 0 at any point, the query returns no matches or backtracking is required.
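
The three outcomes described here (unique match, ambiguous range, empty range) can be reproduced with a small backward-search sketch in Python. It runs on a plain string rather than on the GCSA graph, counts letters by scanning the BWT instead of using a rank structure, and uses function names invented for this example.

```python
def build_index(text):
    """Return (suffix array, BWT, C) for text ending in a unique smallest terminator."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total            # number of characters smaller than c
        total += bwt.count(c)
    return sa, bwt, C


def backward_search(bwt, C, pattern):
    """Return the half-open suffix-array range [lo, hi) of suffixes starting with pattern."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):  # process the pattern from its last letter
        if c not in C:
            return 0, 0
        lo = C[c] + bwt.count(c, 0, lo)
        hi = C[c] + bwt.count(c, 0, hi)
        if lo >= hi:             # range became empty: no match
            return 0, 0
    return lo, hi


sa, bwt, C = build_index("ABRACADABRA$")
unique = backward_search(bwt, C, "CAD")      # range of size 1
ambiguous = backward_search(bwt, C, "ABRA")  # range of size 2
missing = backward_search(bwt, C, "ABC")     # empty range
```

A range of size 1 pins the pattern to one position, a larger range lists every candidate position, and an empty range means an exact search has failed at that step.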

Compression. Notes: talk about how to get to a node using the bit vector and how to count the number of outgoing edges. The compression of the incoming edge letter is only possible on genomic data. The authors of the paper use a separate bit vector for each letter, with a 1 indicating that the letter appears at that index.
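
The per-letter bit vectors can be illustrated with a short Python sketch. It stores plain, uncompressed bits with a prefix-sum table for rank queries, so it shows only the layout, not the space savings of the compressed structures used in the paper. The class name, function name, and the sample data are assumptions for this example.

```python
class RankBitVector:
    """Uncompressed bit vector with constant-time rank via prefix sums."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)

    def rank1(self, i):
        """Number of 1 bits in positions [0, i)."""
        return self.prefix[i]


def letter_bit_vectors(incoming, alphabet="ACGT"):
    """One bit vector per letter; bit i is 1 if the letter labels an incoming edge of node i."""
    return {c: RankBitVector(int(c in letters) for letters in incoming) for c in alphabet}


# Hypothetical incoming-edge letters for five nodes in sorted order.
incoming = ["A", "CG", "", "T", "AC"]
vectors = letter_bit_vectors(incoming)
a_before_end = vectors["A"].rank1(5)  # how many nodes have an incoming A
```

With a four-letter alphabet only four such vectors are needed, which is why this layout suits genomic data.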

Comparison. Notes: the human genome version used is about 3.1 billion bp. Determinization = building the graph. Backbone = main reference sequence.

Comparison. Notes: 0 errors = exact matching. Some of the errors in GCSA are due to the difficulty of mapping certain highly repetitive regions.