Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.


Next Generation Sequencing. Today: Illumina NGS platform, FASTQ files. Sequence bioinformatics: hash tables, suffix arrays, Burrows-Wheeler transform.

Illumina. Slides from Kurt Strueber, Genome Center, MPIPZ Cologne.

Short Read Applications: Genotyping (goal: identify variations); RNA-seq, ChIP-seq, Methyl-seq (goal: classify, measure significant peaks). [Figure: short reads aligned against the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]

Illumina

Hash tables. Slides taken from Michael Main, University of Colorado.

Hash tables: The simplest kind of hash table is an array of records. This example has 701 records, stored at indices [0] to [700].

Each record has a special field, called its key. In this example, the key is a long integer field called Number.

The number might be a person's identification number, and the rest of the record has information about the person.

When a hash table is in use, some spots contain valid records, and other spots are "empty".

Inserting a new record: In order to insert a new record, the key must somehow be converted to an array index. The index is called the hash value of the key. In our case, the keys are short sequences, and the records contain their location in the genome.

Typical way to create a hash value: (Number mod 701). For the new record in this example, (Number mod 701) is 3.
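
The modulo rule can be sketched in a few lines of Python. The table size 701 comes from the slides; the function name `hash_value` and the example key are assumptions made for illustration (the key was picked only because it happens to land on index 3).

```python
TABLE_SIZE = 701  # number of slots in the slides' example table


def hash_value(number: int, table_size: int = TABLE_SIZE) -> int:
    """Map an integer key to an array index in the range [0, table_size)."""
    return number % table_size


# Hypothetical key, chosen only because it hashes to index 3.
example_key = 580625685
print(hash_value(example_key))
```

Any integer key works the same way; the result is always a valid index into the 701-slot array.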

The hash value is used for the location of the new record: here the record goes to [3].

Collisions: Here is another new record to insert, with a hash value of 2.

This is called a collision, because there is already another valid record at [2]. When a collision occurs, move forward until you find an empty spot.

The new record goes in the empty spot.
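
The "move forward until you find an empty spot" rule is commonly called linear probing. Below is a minimal sketch of insertion under that rule. The record layout (a key plus a text payload), the function name `insert`, and the wrap-around from the last slot back to [0] are assumptions, since the slides do not specify them.

```python
from typing import List, Optional, Tuple

Record = Tuple[int, str]  # (key, data); the data field is a hypothetical payload


def insert(table: List[Optional[Record]], key: int, data: str) -> int:
    """Insert (key, data) with linear probing and return the slot that was used."""
    size = len(table)
    index = key % size
    for _ in range(size):
        slot = table[index]
        if slot is None or slot[0] == key:
            table[index] = (key, data)
            return index
        # Collision: move forward one spot, wrapping around at the end (assumption).
        index = (index + 1) % size
    raise RuntimeError("hash table is full")


table: List[Optional[Record]] = [None] * 701
first = insert(table, 2, "record A")         # hashes to [2], which is empty
second = insert(table, 701 + 2, "record B")  # also hashes to [2], so it collides
print(first, second)
```

The second record hashes to the same index as the first, finds [2] occupied, and ends up in the next free slot.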

A Quiz: If the keys were short sequences, where would you place the sequence ATACCG? (NB: this is an ill-posed question.)
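
The question is ill-posed because no hash function for sequences has been defined yet. One possible choice, shown here purely as an assumption, is to read the sequence as a base-4 number (A=0, C=1, G=2, T=3) and then apply the same mod-701 rule as before. The names `BASE_CODE`, `sequence_to_number` and `hash_value` are illustrative.

```python
# Assumed 2-bit encoding of the four DNA bases; other encodings work equally well.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def sequence_to_number(sequence: str) -> int:
    """Interpret a DNA sequence as a base-4 integer."""
    number = 0
    for base in sequence:
        number = number * 4 + BASE_CODE[base]
    return number


def hash_value(sequence: str, table_size: int = 701) -> int:
    """Hash a DNA sequence to a slot of a table with table_size entries."""
    return sequence_to_number(sequence) % table_size


slot = hash_value("ATACCG")
print(slot)
```

With a different encoding or table size the sequence would land elsewhere, which is exactly why the quiz has no single answer.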

Searching for a Key: The data that's attached to a key can be found fairly quickly.

Calculate the hash value (here [2]). Check that location of the array for the key.

Keep moving forward until you find the key, or you reach an empty spot.

When the item is found, the information can be copied to the necessary location.
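
The lookup procedure mirrors insertion. The sketch below assumes the same layout as the slides' example (a 701-slot array, keys hashed with mod 701, forward probing); the function name `find`, the payload strings and the hand-filled table are hypothetical.

```python
from typing import List, Optional, Tuple

Record = Tuple[int, str]  # (key, data); the data field is a hypothetical payload


def find(table: List[Optional[Record]], key: int) -> Optional[str]:
    """Return the data stored under key, or None if the key is absent."""
    size = len(table)
    index = key % size  # calculate the hash value
    for _ in range(size):
        slot = table[index]
        if slot is None:
            return None  # reached an empty spot: the key is not in the table
        if slot[0] == key:
            return slot[1]  # found: hand back the attached information
        index = (index + 1) % size  # "not me": keep moving forward
    return None


# Hand-built table: keys 2, 703 and 1404 all hash to [2] and occupy [2], [3], [4].
table: List[Optional[Record]] = [None] * 701
table[2] = (2, "record A")
table[3] = (703, "record B")
table[4] = (1404, "record C")

print(find(table, 1404))
print(find(table, 2105))
```

The first lookup probes [2] and [3] before succeeding at [4]; the second probes [2], [3], [4] and stops at the empty slot [5].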

Summary:
- Hash tables store a collection of records with keys.
- The location of a record depends on the hash value of the record's key.
- When a collision occurs, the next available location is used.
- Searching for a particular key is generally quick.

Suffix Arrays: Suffix arrays were introduced by Manber and Myers in 1993. They are more space efficient than suffix trees. A suffix array for a string x of length m is an array of size m that specifies the lexicographic ordering of the suffixes of x. Idea: every substring is a prefix of a suffix.

Suffix Arrays: Example of a suffix array for acaaacatat$. Each entry is the starting position of that suffix in the search string.

Suffix Array Construction: Naive in-place construction is similar to insertion sort: insert all the suffixes into the array one by one, making sure that each newly inserted suffix is in its correct place. Running time complexity: O(m^2), where m is the length of the string. Manber and Myers give an O(m log m) construction in their 1993 paper.
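
As a minimal sketch, a suffix array can be built by sorting the starting positions by the suffix that begins there. This uses Python's built-in sort rather than the insertion procedure described above, and it is neither in-place nor the Manber and Myers algorithm. Positions are 0-based, and `$` sorts before the letters because of its ASCII code. The function name is an assumption.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the starting positions of all suffixes of text, in lexicographic order."""
    # Each key text[i:] is a full copy of the suffix, so this is only
    # suitable for short example strings.
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "acaaacatat$"
suffix_array = build_suffix_array(text)
for position in suffix_array:
    print(position, text[position:])
```

The loop prints each starting position next to its suffix, which reproduces the kind of table shown on the example slide.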

Suffix Array Construction: O(n) space, where n is the size of the database string. Space efficient; however, there is an increase in query time. Lookup query: binary search, O(m log n) time, where m is the size of the query. The time can be reduced to O(m + log n) using a more efficient implementation.

Suffix Array Search:

find(Pattern P in SuffixArray A):
    i = 0
    lo = 0, hi = length(A)
    for 0 <= i < length(P):
        binary search for x, y where P[i] = S[A[j]+i] for lo <= x <= j < y <= hi
        lo = x, hi = y
    return {A[lo], A[lo+1], ..., A[hi-1]}

Suffix Array Search: Search ‘is’ in mississippi$

Index  Position  Suffix
0      11        i$
1      8         ippi$
2      5         issippi$
3      2         ississippi$
4      1         mississippi$
5      10        pi$
6      9         ppi$
7      7         sippi$
8      4         sissippi$
9      6         ssippi$
10     3         ssissippi$
11     12        $

Examine the pattern letter by letter, reducing the range of occurrence each time.
- First letter i: occurs in indices from 0 to 3
- Second letter s: occurs in indices from 2 to 3
Done. Output: issippi$ and ississippi$
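
The search can be sketched in Python as follows. This variant runs two binary searches over the whole pattern (the O(m log n) lookup) instead of narrowing the range letter by letter, so it is a swapped-in technique that returns the same occurrences. Positions are 0-based, and `$` sorts before the letters here, so the row order differs slightly from the table above. Function names are assumptions.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort starting positions by their suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(pattern: str, text: str, suffix_array: List[int]) -> List[int]:
    """Return the sorted 0-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(row: int) -> str:
        start = suffix_array[row]
        return text[start:start + m]

    # First row whose m-letter prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First row whose m-letter prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "mississippi$"
sa = build_suffix_array(text)
hits = find("is", text, sa)
print(hits, [text[p:] for p in hits])
```

The two hits correspond to the suffixes ississippi$ and issippi$, matching the output on the slide once the positions are shifted from 1-based to 0-based.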

Summary: A suffix array can be built very fast. It can answer queries very fast, for example: how many times does ATG appear? Disadvantages:
- Can't do approximate matching
- Hard to insert new stuff dynamically (need to rebuild the array)


Bowtie: A Highly Scalable Tool for Post-Genomic Datasets (Slides by Ben Langmead)

Short Read Applications: Genotyping (goal: identify variations); RNA-seq, ChIP-seq, Methyl-seq (goal: classify, measure significant peaks). [Figure: short reads aligned against the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]

Short Read Applications: Finding the alignments is typically the performance bottleneck.

Short Read Alignment: Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists. This is an approximate answer to: where in the genome did the read originate? What is “good”? For now, we concentrate on:
- Fewer mismatches is better: read GATCAA against …TGATCATA… is better than read GAGAAT against …TGATCATA…
- Failing to align a low-quality base is better than failing to align a high-quality base: read GATcaT against …TGATATTA… is better than read GTACAT against …TGATcaTA…

Indexing: Genomes and reads are too large for direct approaches like dynamic programming, so indexing is required. Choice of index is key to performance:
- Suffix tree
- Suffix array
- Seed hash tables
- Many variants, incl. spaced seeds

Indexing: Genome indices can be big. For human: > 35 GBs, > 12 GBs. Large indices necessitate painful compromises:
1. Require big-memory machine
2. Use secondary storage
3. Build new index each run
4. Subindex and do multiple passes

Burrows-Wheeler Transform: Rotate the string by one character in each row, then sort the rows lexicographically to obtain the Burrows-Wheeler Matrix. The last column contains the characters preceding the characters in the first column.

Text T: acaacg$

Burrows-Wheeler Matrix:
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac

BWT(T), the last column: gc$aaac
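
The construction can be written directly as a short Python sketch: build all rotations, sort them, and read off the last column. It assumes the text ends with a unique `$` that sorts before every other character, and it materializes the whole matrix, which is only sensible for short examples. The function name `bwt` is an assumption.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via the explicitly sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


matrix = sorted("acaacg$"[i:] + "acaacg$"[:i] for i in range(7))
for row in matrix:
    print(row)
print(bwt("acaacg$"))
```

Printing the sorted rotations first makes it easy to check the result against the matrix on the slide.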

Burrows-Wheeler Transform: Reversible permutation used originally in compression. Once BWT(T) is built, all else shown here is discarded; the matrix will be shown for illustration only. In long texts, BWT(T) contains more repeated character occurrences than the original text, so it is easier to compress.

Burrows-Wheeler Transform: The property that makes BWT(T) reversible is “LF Mapping”: the i-th occurrence of a character in the Last column is the same text occurrence as the i-th occurrence in the First column. Reference: Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA 1994, Technical Report 124.

Burrows-Wheeler Transform: To recreate T from BWT(T), repeatedly apply the rule: T ← BWT[LF(i)] + T; i = LF(i), where LF(i) maps row i to the row whose first character corresponds to i's last per LF Mapping. This could be called the “unpermute” or “walk-left” algorithm.
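
A minimal sketch of the walk-left idea in Python follows. It computes LF for every row from character counts, starts in row 0 (the row that begins with `$`), and prepends the last-column character of the current row before moving on, which is a small reordering of the rule as written above. It assumes a single `$` terminator that sorts before all other characters. The function name is an assumption.

```python
from collections import Counter
from typing import Dict, List


def inverse_bwt(last_column: str) -> str:
    """Recover the original text (ending in '$') from its Burrows-Wheeler transform."""
    # Row in the first column where each character's block begins.
    counts = Counter(last_column)
    block_start: Dict[str, int] = {}
    total = 0
    for ch in sorted(counts):
        block_start[ch] = total
        total += counts[ch]

    # LF mapping: the i-th occurrence of ch in the last column is the
    # i-th occurrence of ch in the first column.
    seen: Counter = Counter()
    lf: List[int] = []
    for ch in last_column:
        lf.append(block_start[ch] + seen[ch])
        seen[ch] += 1

    # Walk left from row 0 until the terminator shows up in the last column.
    text = "$"
    row = 0
    while last_column[row] != "$":
        text = last_column[row] + text
        row = lf[row]
    return text


print(inverse_bwt("gc$aaac"))
```

Only the last column is needed as input; the first column is implied by sorting its characters, which is why the rest of the matrix can be discarded.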

BWT in Bioinformatics:
- Oligomer counting: Healy J et al: Annotating large genomes with exact word matches. Genome Res 2003, 13(10):
- Whole-genome alignment: Lippert RA: Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J Comp Bio 2005, 12(4):
- Smith-Waterman alignment to large reference: Lam TW et al: Compressed indexing and local alignment of DNA. Bioinformatics 2008, 24(6):

Comparison to Maq & SOAP

PC: 2.4 GHz Intel Core 2, 2 GB RAM. Server: 2.4 GHz AMD Opteron, 32 GB RAM.
Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10. SOAP not run on PC due to memory constraints.
Reads: FASTQ, 8.84 M reads from 1000 Genomes (Acc: SRR001115). Reference: Human (NCBI 36.3, contigs).

Program               CPU time      Wall clock time  Reads per hour  Peak virtual memory footprint  Bowtie speedup  Reads aligned (%)
Bowtie -v 2 (server)  15m:07s       15m:41s          33.8 M          1,149 MB                       -               67.4
SOAP (server)         91h:57m:35s   91h:47m:46s      0.08 M          13,619 MB                      351x            67.3
Bowtie (PC)           16m:41s       17m:57s          29.5 M          1,353 MB                       -               71.9
Maq (PC)              17h:46m:35s   17h:53m:07s      0.49 M          804 MB                         59.8x           74.7
Bowtie (server)       17m:58s       18m:26s          28.8 M          1,353 MB                       -               71.9
Maq (server)          32h:56m:53s   32h:58m:39s      0.27 M          804 MB                         107x            74.7

Bowtie delivers about 30 million alignments per CPU hour.

TopHat: Bowtie for RNA-seq. TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads using Bowtie, and then analyzes the mapping results to identify splice junctions between exons. Contact: Cole Trapnell.

Acknowledgements: NGS exercises were designed by Nicolas Delhomme, EMBL Heidelberg / University of Umeå.