Hash Algorithm and SSAHA Implementations
Zemin Ning
Production Software Group, Informatics

Outline of the Talk:
• The need for a fast search engine;
• SSAHA: Sequence Search and Alignment by Hashing Algorithm;
• Hash table;
• Sequence search based on the hash table;
• Search speed;
• Memory requirement;
• How to use the package.

Algorithms and Software Tools
• Algorithms
  - Dynamic programming;
  - Hash method;
  - Suffix tree;
  - …
• Software tools
  - FASTA;
  - BLAST;
  - Cross_Match;
  - MUMmer;
  - …
• CPU vs Memory

Smith-Waterman Algorithm
• Only works effectively when gap penalties are used.
• In the example shown:
  - match = +1
  - mismatch = -1/3
  - gap penalty $W_k = 1 + k/3$ ($k$ = extent of gap)
• Start with all cell values = 0.
• For each cell, look along the column above, along the row to the left, and at the direct diagonal neighbour, and keep the highest score once the alignment score or gap penalty is taken into account:

$$H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \right\}$$
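The recurrence on this slide can be tried out with a minimal sketch. It assumes the example scores given above (match +1, mismatch -1/3, gap penalty 1 + k/3) and returns only the best local score, with no traceback. The function name `smith_waterman_score` and its parameters are illustrative and are not taken from any SSAHA or cross_match code.

```python
def smith_waterman_score(a, b, match=1.0, mismatch=-1.0 / 3.0,
                         gap=lambda k: 1.0 + k / 3.0):
    """Best local alignment score using the general-gap recurrence."""
    n, m = len(a), len(b)
    # H[0][*] and H[*][0] stay 0: "start with all cell values = 0".
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = H[i - 1][j - 1] + s
            # Gap of length k ending in this column (look up the column).
            up = max(H[i - k][j] - gap(k) for k in range(1, i + 1))
            # Gap of length l ending in this row (look left along the row).
            left = max(H[i][j - l] - gap(l) for l in range(1, j + 1))
            H[i][j] = max(diag, up, left, 0.0)
            best = max(best, H[i][j])
    return best
```

Because every cell scans its whole row and column, this form costs O(n·m·(n+m)) time, which is one reason the talk turns to hashing for genome-scale searches.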

Suffix Tree Example
Mapping the string ababc into a suffix tree.
[Figure: suffix tree of ababc, drawn from a node labelled "root" with edge labels ab, abc, c, b, c, c]

Motivation for sequence indexing
- faster (economy)
- remove reliance on the external service and network delays (user independence)
- integrate fully with a database engine (convenience)
- exhaustive instead of heuristics (quality)
- enable different statistics in sequence evaluation (flexibility)

Objectives: With the SSAHA algorithm, we aim to achieve the following objectives:
(i) To develop a sequence search engine that searches genomic sequences at high speed and with acceptable accuracy;
(ii) To explore applications such as large-scale sequence assembly and single nucleotide polymorphism (SNP) detection;
(iii) To provide possible tools for sequence analysis based on the search engine.

Sequence Representation
Sequence $S = (s_1, s_2, \ldots, s_i, \ldots, s_m)$, $i = 1, 2, \ldots, m$
K-tuple: $(s_i s_{i+1} \ldots s_{i+k-1})$
Using two binary digits for each base, we may have the following representations: "A" = 00; "C" = 01; "G" = 10; "T" = 11.
For any of the $m/k$ non-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way:

$$E = \sum_{i=1}^{2k} \epsilon_i \, 2^{2k-i}, \qquad 0 \le E \le E_{max} = 2^{2k} - 1$$

where $\epsilon_i$ = 0 or 1, depending on the value of the sequence base (bits taken in sequence order, so the first base occupies the highest-order bits), and $E_{max}$ is the maximum value of the possible $E$ values.
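The two-bit encoding can be illustrated with a short sketch. It assumes the first base of the k-tuple takes the highest-order bits, which agrees with the 2-tuple table later in the talk (AA = 0, AC = 1, CA = 4, TT = 15). The names `BASE_BITS`, `encode_ktuple` and `e_max` are illustrative only.

```python
# Two binary digits per base, as on the slide.
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_ktuple(word):
    """Map a k-tuple of A/C/G/T to its integer value E."""
    e = 0
    for base in word:
        # Shift earlier bases up by two bits and append this base.
        e = (e << 2) | BASE_BITS[base]
    return e


def e_max(k):
    """Largest possible E for a k-tuple: 2^(2k) - 1."""
    return 4 ** k - 1
```

With this ordering a hash table for word length k needs `e_max(k) + 1` slots, for example 16 slots for k = 2.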

Non-overlap Hashing v Overlap Hashing (k = 12)

Sequence: ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC

Non-overlap hashing, W = N/k:
ATGGCGTGCAGT
CCATGTTCGGAT
CATTACGTAAGC

Overlap hashing, W = N-k+1:
ATGGCGTGCAGT
TGGCGTGCAGTC
GGCGTGCAGTCC
GCGTGCAGTCCA
CGTGCAGTCCAT
…
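The two word counts can be checked with a small sketch on the 36-base example sequence from this slide. The helper names `non_overlapping_words` and `overlapping_words` are illustrative only.

```python
def non_overlapping_words(seq, k):
    """Words starting at positions 0, k, 2k, ...: about N/k of them."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def overlapping_words(seq, k):
    """Words starting at every position: N - k + 1 of them."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


SEQ = "ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC"
non_overlap = non_overlapping_words(SEQ, 12)   # 36 / 12 = 3 words
overlap = overlapping_words(SEQ, 12)           # 36 - 12 + 1 = 25 words
```

Storing only the non-overlapping words of the subject keeps the table about k times smaller, while the query is still scanned with overlapping words so that a match is found in any reading frame.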

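Hash Table: a 2-tuple hashing table of S1, S2 and S3

S1 = (GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2 = (GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3 = (GGATCCCCTGTCCTCTCTGTCACATA)

| E | k-tuple | N_i | Indices and Offsets |
|---|---------|-----|---------------------|
| 0 | AA | 1 | (2, 19) |
| 1 | AC | 3 | (1, 9) (2, 5) (2, 11) |
| 2 | AG | 2 | (1, 15) (2, 35) |
| 3 | AT | 2 | (2, 13) (3, 3) |
| 4 | CA | 7 | (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23) |
| 5 | CC | 4 | (1, 21) (2, 31) (3, 5) (3, 7) |
| 6 | CG | 1 | (1, 5) |
| 7 | CT | 6 | (1, 23) (2, 39) (2, 43) (3, 13) (3, 15) (3, 17) |
| 8 | GA | 4 | (1, 3) (1, 17) (2, 15) (2, 25) |
| 9 | GC | 0 | |
| 10 | GG | 5 | (1, 25) (1, 31) (2, 17) (2, 29) (3, 1) |
| 11 | GT | 6 | (1, 1) (1, 27) (1, 29) (2, 1) (2, 37) (3, 19) |
| 12 | TA | 1 | (3, 25) |
| 13 | TC | 6 | (1, 7) (1, 11) (1, 19) (2, 23) (2, 41) (3, 11) |
| 14 | TG | 3 | (1, 13) (2, 7) (3, 9) |
| 15 | TT | 0 | |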
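The 2-tuple table of S1, S2 and S3 can be rebuilt with a short sketch. For readability it keys a dictionary by the k-tuple string rather than by the integer E, and it stores (sequence index, 1-based offset) pairs for non-overlapping words only. The function name `build_hash_table` is illustrative and is not the SSAHA implementation.

```python
from collections import defaultdict


def build_hash_table(sequences, k):
    """Map each non-overlapping k-tuple to its (index, offset) list."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        # Step by k so that only non-overlapping words are stored.
        for pos in range(0, len(seq) - k + 1, k):
            table[seq[pos:pos + k]].append((index, pos + 1))
    return table


S1 = "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG"
S2 = "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT"
S3 = "GGATCCCCTGTCCTCTCTGTCACATA"
table = build_hash_table([S1, S2, S3], k=2)
```

Looking up `table["CA"]` gives the seven entries listed for E = 4, and words that never occur, such as GC and TT, are simply absent.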

Query sequence: $S_q$ = (TGCAACAT), searched against the 2-tuple hash table of S1, S2 and S3.

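Array of index and offset data
Query sequence: $S_q$ = (TGCAACAT)

| k-tuple | f(t) | F(t) | -(t-1) | F_s(t) |
|---------|------|------|--------|--------|
| TG | (1, 13) | (1, 13) | 0 | (1, 5) |
| | (2, 7) | (2, 7) | 0 | (1, 13) |
| | (3, 9) | (3, 9) | 0 | (2, -2) |
| GC | (no entries) | | | |
| CA | (2, 3) | (2, 1) | -2 | (2, 1) |
| | (2, 9) | (2, 7) | -2 | (2, 1) |
| | (2, 21) | (2, 19) | -2 | (2, 4) |
| | (2, 27) | (2, 25) | -2 | (2, 7) |
| | (2, 33) | (2, 31) | -2 | (2, 7) |
| | (3, 21) | (3, 19) | -2 | (2, 7) |
| | (3, 23) | (3, 21) | -2 | (2, 7) |
| AA | (2, 19) | (2, 16) | -3 | (2, 16) |
| AC | (1, 9) | (1, 5) | -4 | (2, 16) |
| | (2, 5) | (2, 1) | -4 | (2, 19) |
| | (2, 11) | (2, 7) | -4 | (2, 22) |
| CA | (2, 3) | (2, -2) | -5 | (2, 25) |
| | (2, 9) | (2, 4) | -5 | (2, 28) |
| | (2, 21) | (2, 16) | -5 | (2, 31) |
| | (2, 27) | (2, 22) | -5 | (3, -3) |
| | (2, 33) | (2, 28) | -5 | (3, 9) |
| | (3, 21) | (3, 16) | -5 | (3, 16) |
| | (3, 23) | (3, 18) | -5 | (3, 18) |
| AT | (2, 13) | (2, 7) | -6 | (3, 19) |
| | (3, 3) | (3, -3) | -6 | (3, 21) |

The F_s(t) column is the sorted list of all F(t) entries read from top to bottom; it is not tied to the k-tuple on the same row. The four repeats of (2, 7) show that the query matches S2 starting at offset 7.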

Sequence Search
Sequence search is carried out using the generated hash table. Suppose we have a query sequence of length $n$, $S_q = (s_1, s_2, s_3, \ldots, s_n)$, and we want to find whether this sequence is one of the sequences in the database or a small segment of one of them. Based on $S_q$, we form an integer array of k-tuple values

$$E(t) = (E_1, E_2, \ldots, E_t, \ldots, E_{n+1-k})$$

where $t = 1, 2, \ldots, n+1-k$. Note that overlapping k-tuples are allowed for the query sequence when building this array. For each element $E(t)$, the hash table holds two arrays, one of sequence indices and one of offsets, each of length $N_t$ (the number of entry repeats):

$$f_1(t) = \{H_1(E(t),1), H_1(E(t),2), \ldots, H_1(E(t),N_t)\}$$
$$f_2(t) = \{H_2(E(t),1), H_2(E(t),2), \ldots, H_2(E(t),N_t)\}$$

Frequency Array

$$F_1(t) = f_1(t)$$
$$F_2(t) = \{H_2'(E(t),1), H_2'(E(t),2), \ldots, H_2'(E(t),N_t)\}$$

with

$$H_2'(E(t),i) = H_2(E(t),i) - (t-1), \qquad i = 1, 2, \ldots, N_t$$

The above calculation to adjust offsets should be done for every element in the array.
[Figure: subject and query sequences, with labels "t-1", "Match Start" and "Reference Point"]

64 Bit Machines
In order to carry out the search quickly and effectively, it is helpful in the computer code to combine these two integer arrays into a single long integer array. We are targeting implementations on 64-bit machines. The long integer array can be expressed as

$$F(t) = \{H(E(t),1), H(E(t),2), \ldots, H(E(t),N_t)\}$$

with

$$H(E(t),i) = 2^{32} H_1(E(t),i) + H_2'(E(t),i), \qquad i = 1, 2, \ldots, N_t$$

It is seen from the above equation that the offset value takes the low bits while the index part takes the high-order bits in the long integer.
[Figure: 64-bit word split into an "Index" field and an "Offset" field]
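The packing step can be sketched as follows. Python integers are unbounded, so the sketch only mimics the 64-bit layout; the decoding rule, which rounds to the nearest multiple of 2^32 so that small negative adjusted offsets survive, is an assumption of this sketch and is not taken from the SSAHA source. The names `pack` and `unpack` are illustrative.

```python
SHIFT = 32  # index in the high-order bits, offset in the low-order bits


def pack(index, offset):
    """H = 2^32 * H1 + H2', as in the equation on this slide."""
    return (index << SHIFT) + offset


def unpack(value):
    """Recover (index, offset); valid while |offset| < 2^31."""
    index = (value + (1 << (SHIFT - 1))) >> SHIFT
    return index, value - (index << SHIFT)


pairs = [(2, 7), (1, 13), (3, -3), (2, -2), (2, 7)]
packed_sorted = sorted(pack(i, o) for i, o in pairs)
```

A single integer sort of the packed values orders the hits by index first and by adjusted offset second, which is the order the repeat-counting step needs.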

Array Sorting
For the query sequence there are $n+1-k$ arrays in total, and they must all be combined into one single array:

$$F = \{F(1), F(2), \ldots, F(t), \ldots, F(n+1-k)\}$$

Finally, the array is sorted into ascending order, i.e. $F \to F_s$ with $F_{s,1} \le F_{s,2} \le \ldots \le F_{s,i} \le \ldots$, and the search results are determined by the number of repeated values in the array. If, within a section of the $F_s$ array, the repeat level is higher than a given threshold, there is a match between the query sequence and a sequence in the database.
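The whole search (look up each overlapping query word, subtract t-1 from the offsets, merge, sort, and count repeats) can be sketched on the S1, S2, S3 example with query TGCAACAT. The sketch keeps (index, offset) tuples instead of packed 64-bit integers and keys the table by k-tuple string; the function names and the use of `Counter` are illustrative choices, not the SSAHA implementation.

```python
from collections import Counter, defaultdict


def build_hash_table(sequences, k):
    """Non-overlapping k-tuples of the subjects -> (index, offset) lists."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for pos in range(0, len(seq) - k + 1, k):
            table[seq[pos:pos + k]].append((index, pos + 1))
    return table


def sorted_hits(query, table, k):
    """Build F from overlapping query words and return the sorted F_s."""
    hits = []
    for t in range(1, len(query) - k + 2):        # t = 1 .. n+1-k
        word = query[t - 1:t - 1 + k]
        for index, offset in table.get(word, []):
            hits.append((index, offset - (t - 1)))  # adjusted offset H2'
    hits.sort()
    return hits


def best_match(hits):
    """Most repeated (index, adjusted offset) and its repeat count."""
    return Counter(hits).most_common(1)[0]


subjects = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]
table = build_hash_table(subjects, k=2)
hits = sorted_hits("TGCAACAT", table, k=2)
match, repeats = best_match(hits)
```

Here `match` is (2, 7) with four repeats: the query occurs in S2 starting at offset 7. In practice a match is reported only when the repeat count reaches a chosen threshold.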

Power Law: CPU time v query length
Fig. 1 Normalized CPU time plotted against the number of k-tuples in the query (k = 12) using Quicksort.
Averaged length of frequency array: [equation not preserved], where $N_i$ is the average length of the entry repeats.

Speed and Resolution: Effects of k
Query file: 39,000 reads
Subject file: 1.5 Gbp of human DNA
[Table: k; $E_{max}$ + 1; CPU (get hash table) $T_1$ (s); CPU (search only) $T_2$ (s). Numeric values not preserved.]

SSAHA Memory
Memory for subject: $M_s = 4 N_s / k + 4 \cdot 2^{2k}$
Memory for query: $M_q = N_q$
Housekeeping: 10-20% of total
Total memory: $M = 1.2 \, (M_s + M_q)$
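The memory formulas can be turned into a quick estimator. The sketch assumes the results are in bytes (4-byte integers per stored word and per hash-table slot, one byte per query base), which the slide does not state explicitly, and uses the 20% housekeeping figure by default. The function name `ssaha_memory_bytes` is illustrative.

```python
def ssaha_memory_bytes(n_subject, n_query, k, housekeeping=0.2):
    """Estimate total memory from the slide's formulas (bytes assumed)."""
    m_subject = 4 * n_subject / k + 4 * 2 ** (2 * k)  # word list + table
    m_query = n_query
    return (1 + housekeeping) * (m_subject + m_query)


# Subject of 1.5 Gbp with k = 12; query size ignored for this estimate.
estimate = ssaha_memory_bytes(1_500_000_000, 0, 12)
```

The first term falls as k grows while the second grows as 4^k, so the word length trades subject storage against table size.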

SSAHA Memory: one array combining read index and offset
[Figure: array layout with entries labelled $R_i$, $R_i + j$ and $R_{i+1}$]

Matching Positions Found by SSAHA
[Figure: subject and query sequences, with labels "t-1", "Match Start" and "Reference Point"]

SSAHA2 = SSAHA + Cross_Match
SSAHA for matching seeds, cross_match for sequence alignment.
[Figure: subject region labelled "SSAHA seeds", "Edge length" on each side, and "Sequence for cross_match"]

SSAHA2 Command Line
./ssaha2 query_file subject_file
Options:
-kmer: length of kmer words; default kmer=12
-seeds: number of exact kmer words; default seeds=10
-align: '1' - show full alignment; '0' - no alignment; default '1'
-sense: '1' - search with higher sensitivity; '0' - normal; default '0'
-tags: '1' - show a tag of 'ALIGNMENT'; '0' - no tag; default '0'
-depth: number of reported hits with best alignment; default depth=50
-score: minimum score of Smith-Waterman; default score=30
-cut: number of word occurrences in the dataset; default cut=200
-memory: memory assigned in MB for cross_match; default memory=2000
-array: memory assigned in MB for storing frequency arrays; default array=4
-edge: extension of both ends on the subject; default edge=200
-best: report the best alignment from the hit list; default '0'
-start: start read from the query file; default start=0
-end: end read from the query file; default end = total number of reads in the query file

COOKBOOK
• BAC ends placement (find the best hit in the database):
  -seeds 14 -kmer 13 -align 0 -tags 1 -depth 5 -score 200 -cut 50000
• EST/cDNA alignment (produce splice on the subject sequence):
  -seeds 4 -kmer 13 -align 0 -tags 1 -depth 5 -score 20 -edge 20000
• Primer/gene Marks alignment (find the matches of short motifs to the database):
  -seeds 1 -kmer 13 -tags 1 -score 12 -skip 1 -sense 1 -cut 50000
• Search with higher sensitivity:
  -seeds 2 -kmer 13 -tags 1 -score 20 -sense 1 -cut 50000
• Both query and subject are large (q: 100Kb < query < 1MB; s: no limit):
  -seeds 50 -kmer 13 -tags 1 -score array 40 -memory 10000

Summary:
• Speed: fast enough to perform genomic-scale searches between large genomes;
• Memory: linear;
• Sensitivity: not as good as BLAST, but applicable in assembly and SNP detection.