Hashing Algorithm and its Applications in Bioinformatics By Zemin Ning Informatics Division The Wellcome Trust Sanger Institute.



Outline of the Talk:
- Research Background
- SSAHA: The Fastest Sequence Search Engine
  - Hash table
  - Sequence search based on the hash table
  - Various applications
- Euler Path: consensus generation
  - Euler Path
  - Consensus generation
  - SNP calling
- Phusion: the WGS assembler
  - Phusion pipeline
  - Reads grouping
  - Applications
- Current Research

Powder Simulation

Hair Dynamics Genetics and Human Hair Structure AFRICAN CAUCASIAN EAST ASIAN

Sequence Search and Alignment
- Algorithms
  - Dynamic programming
  - Suffix tree
  - Hash method
  - …
- Software tools
  - FASTA
  - BLAST
  - Cross_Match
  - Blat
  - …
- CPU vs Memory

Objectives: With the SSAHA algorithm, we aim to achieve the following objectives:
(i) To develop a sequence search engine that searches genomic sequences at high speed and with acceptable accuracy;
(ii) To explore applications such as large-scale sequence assembly and single nucleotide polymorphism (SNP) detection;
(iii) To provide possible tools for sequence analysis based on the search engine.

Automatic Sequencing ATGCAGGTCC …….

Sequence Representation
Sequence: $S = (s_1, s_2, \ldots, s_i, \ldots, s_m)$, $i = 1, 2, \ldots, m$
k-tuple: $(s_i s_{i+1} \ldots s_{i+k-1})$
Using two binary digits for each base, we may have the following representations: "A" = 00; "C" = 01; "G" = 10; "T" = 11
For any of the m/k non-overlapping k-tuples in the sequence, an integer E may be used to represent the k-tuple in a unique way:
$$E = \sum_{i=1}^{2k} \beta_i \, 2^{2k-i}, \qquad E_{\max} = 2^{2k} - 1$$
where $\beta_i$ = 0 or 1, depending on the value of the sequence base, and $E_{\max}$ is the maximum of the possible E values. With this bit order the first base of the k-tuple occupies the highest bits, so for k = 2, "AC" maps to E = 1 and "TT" to E = 15.
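The two-bit encoding described on this slide can be sketched in a few lines of Python. This is only an illustration: the names `BASE_CODE`, `encode_ktuple` and `e_max` are hypothetical, and the bit order (first base in the highest bits) is an assumption chosen so that the values agree with the 2-tuple hash table in the talk, where AC has E = 1 and CA has E = 4.

```python
# Two-bit codes from the slide: A = 00, C = 01, G = 10, T = 11.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_ktuple(ktuple):
    """Map a k-tuple to a unique integer E in the range 0 .. 4**k - 1."""
    e = 0
    for base in ktuple:
        # Shift the bits collected so far and append two bits for this base.
        e = (e << 2) | BASE_CODE[base]
    return e


def e_max(k):
    """Largest possible E for a given k (all 2k bits set)."""
    return 4 ** k - 1


if __name__ == "__main__":
    for word in ("AA", "AC", "CA", "GT", "TT"):
        print(word, encode_ktuple(word))
    print("E_max for k = 2:", e_max(2))
```

Because every k-tuple gets its own integer, E can be used directly as an array index, which is what makes the hash table lookup a constant-time operation.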

Hash Table: a 2-tuple hash table of S1, S2 and S3

S1 = (GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2 = (GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3 = (GGATCCCCTGTCCTCTCTGTCACATA)

E    k-tuple  N_i  Indices and Offsets (index, offset)
0    AA       1    (2, 19)
1    AC       3    (1, 9) (2, 5) (2, 11)
2    AG       2    (1, 15) (2, 35)
3    AT       2    (2, 13) (3, 3)
4    CA       7    (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23)
5    CC       4    (1, 21) (2, 31) (3, 5) (3, 7)
6    CG       1    (1, 5)
7    CT       6    (1, 23) (2, 39) (2, 43) (3, 13) (3, 15) (3, 17)
8    GA       4    (1, 3) (1, 17) (2, 15) (2, 25)
9    GC       0
10   GG       5    (1, 25) (1, 31) (2, 17) (2, 29) (3, 1)
11   GT       6    (1, 1) (1, 27) (1, 29) (2, 1) (2, 37) (3, 19)
12   TA       1    (3, 25)
13   TC       6    (1, 7) (1, 11) (1, 19) (2, 23) (2, 41) (3, 11)
14   TG       3    (1, 13) (2, 7) (3, 9)
15   TT       0
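The construction of this table can be sketched as follows. It is a minimal illustration, not the SSAHA implementation: `build_hash_table` is a hypothetical name, the table is keyed by the k-tuple string rather than by its integer E, and a Python dictionary stands in for the pointer and position arrays.

```python
from collections import defaultdict


def build_hash_table(sequences, k):
    """Record (index, offset) for every non-overlapping k-tuple.

    Index and offset are 1-based, as on the slide.
    """
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        # Step by k so that the k-tuples do not overlap.
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table


if __name__ == "__main__":
    s1 = "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG"
    s2 = "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT"
    s3 = "GGATCCCCTGTCCTCTCTGTCACATA"
    table = build_hash_table([s1, s2, s3], k=2)
    for ktuple in sorted(table):
        print(ktuple, len(table[ktuple]), table[ktuple])
```

Taking only non-overlapping k-tuples from the subject keeps the table roughly k times smaller than an index of every position, at the cost of having to slide the query one base at a time during the search.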

Query sequence: S_q = (TGCAACAT), looked up in the 2-tuple hash table of S1, S2 and S3.

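Array of index and offset data
Query sequence: S_q = (TGCAACAT)

f(t) lists the (index, offset) hits for the t-th overlapping k-tuple of the query, F(t) is the same list with each offset shifted by -(t-1), and F_s(t) is the list of all shifted hits sorted by index and then by offset. F_s(t) is a separate sorted list and is not aligned with the rows to its left.

k-tuple  f(t)      F(t)      -(t-1)  F_s(t)
TG       (1, 13)   (1, 13)   0       (1, 5)
         (2, 7)    (2, 7)    0       (1, 13)
         (3, 9)    (3, 9)    0       (2, -2)
GC       (no hits)
CA       (2, 3)    (2, 1)    -2      (2, 1)
         (2, 9)    (2, 7)    -2      (2, 1)
         (2, 21)   (2, 19)   -2      (2, 4)
         (2, 27)   (2, 25)   -2      (2, 7)
         (2, 33)   (2, 31)   -2      (2, 7)
         (3, 21)   (3, 19)   -2      (2, 7)
         (3, 23)   (3, 21)   -2      (2, 7)
AA       (2, 19)   (2, 16)   -3      (2, 16)
AC       (1, 9)    (1, 5)    -4      (2, 16)
         (2, 5)    (2, 1)    -4      (2, 19)
         (2, 11)   (2, 7)    -4      (2, 22)
CA       (2, 3)    (2, -2)   -5      (2, 25)
         (2, 9)    (2, 4)    -5      (2, 28)
         (2, 21)   (2, 16)   -5      (2, 31)
         (2, 27)   (2, 22)   -5      (3, -3)
         (2, 33)   (2, 28)   -5      (3, 9)
         (3, 21)   (3, 16)   -5      (3, 16)
         (3, 23)   (3, 18)   -5      (3, 18)
AT       (2, 13)   (2, 7)    -6      (3, 19)
         (3, 3)    (3, -3)   -6      (3, 21)

The run of four identical entries (2, 7) in F_s(t) marks the match: the query occurs in S2 starting at offset 7.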
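The shift-and-sort search shown in this table can be sketched as below. The function names are hypothetical, the table is again keyed by k-tuple strings, and `t` in the code counts from 0, so it plays the role of (t-1) on the slide. A run of equal (index, shifted offset) pairs in the sorted list indicates a candidate match.

```python
from collections import defaultdict
from itertools import groupby


def build_hash_table(sequences, k):
    """Hash the non-overlapping k-tuples of each subject sequence (1-based)."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table


def shifted_sorted_hits(query, table, k):
    """Look up every overlapping query k-tuple, shift offsets by -t, then sort."""
    hits = []
    for t in range(len(query) - k + 1):
        for index, offset in table.get(query[t:t + k], []):
            hits.append((index, offset - t))
    hits.sort()
    return hits


def longest_run(hits):
    """Return ((index, shifted offset), run length) for the longest run of equal hits."""
    runs = ((key, len(list(group))) for key, group in groupby(hits))
    return max(runs, key=lambda item: item[1])


if __name__ == "__main__":
    subjects = [
        "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
        "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
        "GGATCCCCTGTCCTCTCTGTCACATA",
    ]
    table = build_hash_table(subjects, k=2)
    hits = shifted_sorted_hits("TGCAACAT", table, k=2)
    print(hits)
    print(longest_run(hits))
```

Sorting dominates the cost of this step, which is consistent with the talk reporting CPU time against the number of query k-tuples when Quicksort is used.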

64-Bit Machines
In order to carry out the search quickly and effectively, it is helpful in the computer code to combine these two integer arrays (index and offset) into a single long integer array. We are targeting implementations on 64-bit machines. The long integer array can be expressed as
$$F(t) = \{H(E(t),1), H(E(t),2), \ldots, H(E(t),N_t)\}$$
with
$$H(E(t),i) = 2^{32} H_1(E(t),i) + H_2'(E(t),i), \qquad i = 1, 2, \ldots, N_t$$
It is seen from the above equation that the offset value $H_2'$ takes the low bits while the index part $H_1$ takes the high-order bits of the long integer.
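A minimal sketch of this packing, assuming non-negative index and offset values that each fit in 32 bits (how the talk handles negative shifted offsets is not stated, so that case is rejected here). The names `pack_hit` and `unpack_hit` are hypothetical.

```python
OFFSET_BITS = 32
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def pack_hit(index, offset):
    """Combine index (high 32 bits) and offset (low 32 bits) into one integer."""
    if not (0 <= index <= OFFSET_MASK and 0 <= offset <= OFFSET_MASK):
        raise ValueError("index and offset must fit in 32 unsigned bits")
    return (index << OFFSET_BITS) | offset


def unpack_hit(packed):
    """Recover (index, offset) from a packed 64-bit value."""
    return packed >> OFFSET_BITS, packed & OFFSET_MASK


if __name__ == "__main__":
    hits = [(3, 9), (1, 13), (2, 7), (2, 1)]
    packed = sorted(pack_hit(i, o) for i, o in hits)
    # Sorting the packed integers orders hits by index first, then by offset.
    print([unpack_hit(p) for p in packed])
```

The benefit of the single array is that one integer sort replaces a two-key sort, since the index in the high bits dominates the comparison.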

Power Law: CPU time vs query length
Fig. 1: Normalized CPU time plotted against the number of k-tuples in the query (k = 12) using Quicksort.

SSAHA Memory
Memory for subject: $M_s = 4 N_s / k + 4 \cdot 2^{2k}$
Memory for query: $M_q = N_q$
Housekeeping: 10-20% of the total
Total memory: $M_{total} = 1.2\,(M_s + M_q)$

The SSAHA Trace Server (serving SSAHA2 clients)
It aims to provide a near real-time (under 10 seconds) search service for a clustered 1.0 TB database. The solution can be extended by plugging in extra appliances.

The Seven Bridges of Konigsberg
- During the 18th century, the city of Konigsberg (in East Prussia) was divided into four sections (a, b, c and d) by the Pregel River. Seven bridges connected these regions.
- Question: Is it possible to walk about the city so as to cross each bridge exactly once and then return to the starting point?

Vertex Degree, Euler Circuit and Euler Path
Vertex degree: For an undirected graph G, the degree of a vertex is defined as the number of edges in the graph that are incident to that vertex.
Euler circuit: For an undirected graph G, if there is a circuit in G that traverses every edge of the graph exactly once, then G is said to have an Euler circuit.
Euler path: If there is an open trail from a to c in G and this trail traverses each edge in G exactly once, then the trail is called an Euler trail or Euler path.
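These definitions lead to the classical degree test: a connected undirected graph has an Euler circuit when every vertex has even degree, and an Euler path when exactly two vertices have odd degree. The sketch below applies that test to the Konigsberg bridges. The function name is hypothetical, and the assignment of the labels a to d to particular land masses is an assumption, since the slide figure is not available; only the degrees matter for the result.

```python
from collections import defaultdict


def euler_status(edges):
    """Classify a connected undirected multigraph as 'circuit', 'path' or 'none'."""
    if not edges:
        return "none"
    degree = defaultdict(int)
    neighbours = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        neighbours[u].add(v)
        neighbours[v].add(u)

    # Depth-first search to confirm that all vertices with edges are connected.
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in neighbours[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    if len(seen) != len(degree):
        return "none"

    odd = sum(1 for d in degree.values() if d % 2 == 1)
    if odd == 0:
        return "circuit"
    if odd == 2:
        return "path"
    return "none"


if __name__ == "__main__":
    # Seven bridges; 'a' is assumed to be the land mass with five bridges.
    konigsberg = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "c"),
                  ("a", "d"), ("b", "d"), ("c", "d")]
    print(euler_status(konigsberg))
```

All four land masses have odd degree (5, 3, 3, 3), so neither an Euler circuit nor an Euler path exists, which answers the question posed on the Konigsberg slide.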

Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
Vertices: k-tuples from the spectrum (8), shown in red on the slide: ATG, TGC, GCA, CAG, AGG, GGT, GTC, TCC
Edges: overlapping k-tuples (7)
Path: visits all vertices, corresponding to the sequence:
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC

Sequence Reconstruction - Euler path approach
Vertices: correspond to (k-1)-tuples (7): AT, TG, GG, GC, CG, GT, CA
Edges: correspond to k-tuples from the spectrum (8)
Path: visits all EDGES, corresponding to the sequence:
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
This path spells ATGGCGTGCA. The same spectrum also admits a second Euler path, which spells ATGCGTGGCA.
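The Euler path approach can be sketched with Hierholzer's algorithm, a standard way to find an Euler path; the talk does not say which algorithm was used, so this choice is an assumption, as is the function name. Vertices are (k-1)-tuples and each k-tuple of the spectrum is a directed edge from its prefix to its suffix.

```python
from collections import defaultdict


def euler_path_sequence(kmers):
    """Reconstruct one sequence whose k-tuple spectrum equals `kmers`."""
    graph = defaultdict(list)
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        out_deg[prefix] += 1
        in_deg[suffix] += 1

    # Start at the vertex with one more outgoing than incoming edge, if any.
    nodes = set(in_deg) | set(out_deg)
    start = next((n for n in nodes if out_deg[n] - in_deg[n] == 1), kmers[0][:-1])

    # Hierholzer's algorithm: follow unused edges, backtrack when stuck.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        raise ValueError("the spectrum does not form a single Euler path")
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
    print(euler_path_sequence(spectrum))
```

Which of the valid reconstructions is returned depends on the order in which edges are followed, so the sketch may print either of the two sequences that share this spectrum. Unlike the Hamiltonian path formulation, this search runs in time linear in the number of k-tuples.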

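SSAHA Type Hash Table
S1 = (ATGCAGGTCC), S2 = (ATCAGGTCC)
S3 = (ATGCTAGGTCC), S4 = (ATGCAGTTCC)

Each entry is (index, offset, link to the next), where the link is the E value of the k-tuple that follows in the same sequence and -1 marks the end of a sequence. Entries are listed in one column per sequence.

E    k-tuple  S1          S2          S3          S4
7    ATG      (1,1,28)                (3,1,28)    (4,1,28)
8    ATC                  (2,1,29)
10   AGT                                          (4,5,38)
11   AGG      (1,5,42)    (2,4,42)    (3,6,42)
19   TAG                              (3,5,11)
24   TTC                                          (4,7,32)
28   TGC      (1,2,45)                (3,2,46)    (4,2,45)
29   TCA                  (2,2,51)
32   TCC      (1,8,-1)    (2,7,-1)    (3,9,-1)    (4,8,-1)
38   GTT                                          (4,6,24)
40   GTC      (1,7,32)    (2,6,32)    (3,8,32)
42   GGT      (1,6,40)    (2,5,40)    (3,7,40)
45   GCA      (1,3,51)                            (4,3,51)
46   GCT                              (3,3,53)
51   CAG      (1,4,11)    (2,3,11)                (4,4,10)
53   CTA                              (3,4,19)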

Point to the Next - Hash Table Links
In the SSAHA type hash table of S1 to S4, the third number of every entry points to the next k-tuple of the same sequence. Following these links from the first k-tuple of a read regenerates the read, and a link of -1 ends the walk. For example, S1 is recovered as ATG (7) -> TGC (28) -> GCA (45) -> CAG (51) -> AGG (11) -> GGT (42) -> GTC (40) -> TCC (32) -> -1.

Consensus
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
CONS = (ATGCAGGTCC)

S1:   ATGC-AGGTCC
S2:   AT-C-AGGTCC
S3:   ATGCTAGGTCC
S4:   ATGC-AGTTCC
CONS: ATGC-AGGTCC
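One way to read this consensus path is as a majority vote over the "link to the next" entries of the hash table: from each k-tuple, move to the successor that most reads agree on. The sketch below follows that reading. It is an interpretation rather than the author's procedure, the function names are hypothetical, and links are stored as k-tuple strings (with `None` for the end of a read) instead of E values and -1.

```python
from collections import Counter, defaultdict


def build_link_table(reads, k=3):
    """Store (index, offset, next k-tuple) for every overlapping k-tuple."""
    table = defaultdict(list)
    for index, read in enumerate(reads, start=1):
        for start in range(len(read) - k + 1):
            nxt = read[start + 1:start + 1 + k]
            link = nxt if len(nxt) == k else None  # None marks the read end
            table[read[start:start + k]].append((index, start + 1, link))
    return table


def consensus(reads, k=3):
    """Walk the links, taking the most common successor at each step."""
    table = build_link_table(reads, k)
    current = Counter(read[:k] for read in reads).most_common(1)[0][0]
    sequence, visited = current, {current}
    while True:
        votes = Counter(link for _, _, link in table[current] if link is not None)
        if not votes:
            break
        successor = votes.most_common(1)[0][0]
        if successor in visited:
            break  # simplification: stop instead of looping through a repeat
        sequence += successor[-1]
        visited.add(successor)
        current = successor
    return sequence


if __name__ == "__main__":
    reads = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
    print(consensus(reads))
```

For the four example reads the walk follows ATG, TGC, GCA, CAG, AGG, GGT, GTC and TCC, so the minority variants in S2, S3 and S4 are outvoted at each branch point.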

eulerSNP
In polymorphic datasets of shotgun reads, eulerSNP combines the Euler path and the hashing algorithm to detect SNPs and replace them with the base that occurs most commonly at that location.
[Figure: alignment of example shotgun reads such as ATGC-AGGTCC and ATGCTAGGTCC.]

Phusion Assembler Pipeline
[Flowchart labels: Reads Group; Data Process; RPphrap - Contig; Shotgun Reads; Read-pair Tracker; Supercontig; FPC Mapping; RPjoin - Merge; PRono; Assembly]

Gap-Hash 4x3 (K = 12, k-mer word hashing)
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA

Contiguous base hash:
ATGGCGTGCAGT
TGGCGTGCAGTC
GGCGTGCAGTCC
GCGTGCAGTCCA
CGTGCAGTCCAT

Gapped base hash (4x3):
ATGGGCAGATGT
TGGCCAGTTGTT
GGCGAGTCGTTC
GCGTGTCCTTCG
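The gapped words on this slide are consistent with taking three blocks of four bases and skipping three bases between blocks. That gap length is inferred from the example words rather than stated in the talk, so it is an assumption in the sketch below, as are the function names.

```python
def contiguous_kmers(seq, k=12):
    """All overlapping contiguous k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, block=4, gap=3, n_blocks=3):
    """Words built from n_blocks blocks of `block` bases, skipping `gap` bases between them."""
    span = n_blocks * block + (n_blocks - 1) * gap  # bases covered by one word
    step = block + gap
    words = []
    for start in range(len(seq) - span + 1):
        words.append("".join(seq[start + b * step:start + b * step + block]
                             for b in range(n_blocks)))
    return words


if __name__ == "__main__":
    seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
    print(contiguous_kmers(seq)[:5])
    print(gapped_kmers(seq)[:4])
```

Both schemes produce 12-base words, but a gapped word spans 18 bases of the read, so a single mismatch falling in a skipped position does not destroy the hit.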

Zebrafish as a model organism
- Danio rerio
- Fish length: 3 cm long
- Estimated genome size: 1.55 Gb
- Easy to maintain: short generation time; can be kept at high densities
- Easy to manipulate: external fertilisation and development; transparent embryos

Sanger Institute WGS project started in spring
- DNA source: Tuebingen embryos
- WGS read insert sizes: kb
- BAC ends insert sizes: 165 – 175 kb
- Polymorphism: ~ day old embryos
- SNP density: one in every 200 bps
- Indel density: one in every 1500 bps
- Indel length: 2 – 30 bps

Acknowledgements:
- Jim Mullikin
- Yong Gu
- Adam Spargo
- Richard Durbin
- Kerstin Jekosch
- Sean Humphray
- Jane Rogers
- Sanger Systems Support
- Sanger Sequencing Facilities