Hashing Algorithm and its Applications in Bioinformatics By Zemin Ning Informatics Division The Wellcome Trust Sanger Institute
Outline of the Talk: Research Background SSAHA – The Fastest Sequence Search Engine - Hash table; - Sequence search based on the hash table; - Various applications. Euler Path – consensus generation - Euler Path; - Consensus generation; - SNP calling. Phusion – the WGS assembler: - Phusion pipeline; - Reads grouping; - Applications. Current Research
Powder Simulation
Hair Dynamics Genetics and Human Hair Structure AFRICAN CAUCASIAN EAST ASIAN
Sequence Search and Alignment Algorithms - Dynamic programming; - Suffix tree; - Hash method; - … Software tools - FASTA; - BLAST; - Cross_Match; - Blat; - … CPU vs Memory
Objectives: With the SSAHA algorithm, we aim to achieve the following objectives: (i) To develop a sequence search engine that searches genomic sequences at high speed and with acceptable accuracy; (ii) To explore applications such as large-scale sequence assembly and single nucleotide polymorphism (SNP) detection; (iii) To provide possible tools for sequence analysis based on the search engine.
Automatic Sequencing ATGCAGGTCC …….
Sequence Representation
Sequence S: $(s_1 s_2 \dots s_i \dots s_m)$, $i = 1, 2, \dots, m$
K-tuple: $(s_i s_{i+1} \dots s_{i+k-1})$
Using two binary digits for each base, we may have the following representations: “A” = 00; “C” = 01; “G” = 10; “T” = 11
For any of the m/k non-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way:
$E = \sum_{i=0}^{2k-1} \beta_i 2^i \le E_{max} = 2^{2k} - 1$
where $\beta_i$ = 0 or 1, depending on the value of the sequence base, and $E_{max}$ is the maximum value of the possible E values.
SSAHA index:
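The two-bit encoding above can be sketched as follows. This is a minimal illustration, not the SSAHA source code: the function names are hypothetical, and placing the first base in the highest-order bits is an assumption chosen so that the values agree with the 2-tuple table on the next slide (AC = 1, CA = 4).

```python
# Two binary digits per base, as listed on the slide.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_ktuple(ktuple: str) -> int:
    """Pack a k-tuple into an integer E (assumed: first base in the high bits)."""
    value = 0
    for base in ktuple:
        value = (value << 2) | BASE_CODE[base]
    return value


def max_code(k: int) -> int:
    """Largest possible E for a given k: 2^(2k) - 1."""
    return 2 ** (2 * k) - 1


print(encode_ktuple("AC"), encode_ktuple("CA"), encode_ktuple("TT"), max_code(2))
```

With k = 2 this gives 16 possible values (0 to 15), one per row of the hash table.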
Hash Table: a 2-tuple hash table of S1, S2 and S3
S1 = (GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2 = (GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3 = (GGATCCCCTGTCCTCTCTGTCACATA)

E    k-tuple  N_i  Indices and offsets
0    AA       1    (2, 19)
1    AC       3    (1, 9) (2, 5) (2, 11)
2    AG       2    (1, 15) (2, 35)
3    AT       2    (2, 13) (3, 3)
4    CA       7    (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23)
5    CC       4    (1, 21) (2, 31) (3, 5) (3, 7)
6    CG       1    (1, 5)
7    CT       6    (1, 23) (2, 39) (2, 43) (3, 13) (3, 15) (3, 17)
8    GA       4    (1, 3) (1, 17) (2, 15) (2, 25)
9    GC       0
10   GG       5    (1, 25) (1, 31) (2, 17) (2, 29) (3, 1)
11   GT       6    (1, 1) (1, 27) (1, 29) (2, 1) (2, 37) (3, 19)
12   TA       1    (3, 25)
13   TC       6    (1, 7) (1, 11) (1, 19) (2, 23) (2, 41) (3, 11)
14   TG       3    (1, 13) (2, 7) (3, 9)
15   TT
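A minimal sketch of how such a table could be built, assuming non-overlapping k-tuples, 1-based sequence indices and 1-based offsets as in the table. The function name is hypothetical, and the table is keyed on the k-tuple string rather than on its integer E to keep the example short.

```python
from collections import defaultdict

SUBJECTS = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]


def build_hash_table(sequences, k):
    """Map each non-overlapping k-tuple to its (index, offset) occurrences."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        # Step by k so that the stored k-tuples do not overlap.
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table


table = build_hash_table(SUBJECTS, 2)
for ktuple in ("CA", "TG", "GC"):
    print(ktuple, table.get(ktuple, []))
```

The rows for CA and TG come out as in the table, and GC has no occurrences.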
Query sequence: Sq = (TGCAACAT), searched against the 2-tuple hash table of S1, S2 and S3.
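Array of index and offset data
Query sequence: Sq = (TGCAACAT)
f(t): hits (index, offset) from the hash table for the k-tuple at query position t; F(t): the same hits with the offset shifted by -(t-1); F_s(t): all shifted hits, sorted.

k-tuple  f(t)      F(t)      -(t-1)  F_s(t)
TG       (1, 13)   (1, 13)    0      (1, 5)
         (2, 7)    (2, 7)     0      (1, 13)
         (3, 9)    (3, 9)     0      (2, -2)
GC
CA       (2, 3)    (2, 1)    -2      (2, 1)
         (2, 9)    (2, 7)    -2      (2, 1)
         (2, 21)   (2, 19)   -2      (2, 4)
         (2, 27)   (2, 25)   -2      (2, 7)
         (2, 33)   (2, 31)   -2      (2, 7)
         (3, 21)   (3, 19)   -2      (2, 7)
         (3, 23)   (3, 21)   -2      (2, 7)
AA       (2, 19)   (2, 16)   -3      (2, 16)
AC       (1, 9)    (1, 5)    -4      (2, 16)
         (2, 5)    (2, 1)    -4      (2, 19)
         (2, 11)   (2, 7)    -4      (2, 22)
CA       (2, 3)    (2, -2)   -5      (2, 25)
         (2, 9)    (2, 4)    -5      (2, 28)
         (2, 21)   (2, 16)   -5      (2, 31)
         (2, 27)   (2, 22)   -5      (3, -3)
         (2, 33)   (2, 28)   -5      (3, 9)
         (3, 21)   (3, 16)   -5      (3, 16)
         (3, 23)   (3, 18)   -5      (3, 18)
AT       (2, 13)   (2, 7)    -6      (3, 19)
         (3, 3)    (3, -3)   -6      (3, 21)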
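The search step can be sketched as below: look up every overlapping k-tuple of the query, shift each offset by the k-tuple's position in the query, sort, and look for runs of identical (index, shifted offset) pairs. This is an illustrative sketch with hypothetical function names, not the SSAHA implementation. Query positions are 0-based here, so the shift is -t rather than -(t-1).

```python
from collections import Counter, defaultdict

SUBJECTS = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]


def build_hash_table(sequences, k):
    """Non-overlapping k-tuples -> list of (index, offset), both 1-based."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table


def shifted_hits(query, table, k):
    """Collect hits for every overlapping query k-tuple, shift and sort them."""
    hits = []
    for t in range(len(query) - k + 1):
        for index, offset in table.get(query[t:t + k], []):
            hits.append((index, offset - t))
    hits.sort()
    return hits


table = build_hash_table(SUBJECTS, 2)
hits = shifted_hits("TGCAACAT", table, 2)
(best_index, best_offset), run_length = Counter(hits).most_common(1)[0]
print(len(hits), (best_index, best_offset), run_length)
```

The longest run is the four entries (2, 7): the query occurs in S2 starting at offset 7.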
64 Bit Machines
In order to carry out the search quickly and effectively, it helps in the computer code to combine these two integer arrays (indices and offsets) into a single long integer array. We are targeting implementations on 64-bit machines. The long integer array can be expressed as
$F(t) = \{H(E(t),1), H(E(t),2), \dots, H(E(t),N_t)\}$
with
$H(E(t),i) = 2^{32} H_1(E(t),i) + H_2'(E(t),i)$, $i = 1, 2, \dots, N_t$
It is seen from the above equation that the offset value takes the low-order bits while the index takes the high-order bits of the long integer.
[Diagram: long integer split into Index (high bits) and Offset (low bits)]
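A small sketch of the packing, with hypothetical helper names. It assumes the offset is non-negative and fits in 32 bits; a shifted offset that can go negative would need a bias or a signed layout, which the slide does not specify.

```python
INDEX_SHIFT = 32
OFFSET_MASK = (1 << INDEX_SHIFT) - 1


def pack(index: int, offset: int) -> int:
    """Combine index (high 32 bits) and offset (low 32 bits) into one integer."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 32 unsigned bits (assumption)")
    return (index << INDEX_SHIFT) | offset


def unpack(word: int):
    """Recover (index, offset) from a packed integer."""
    return word >> INDEX_SHIFT, word & OFFSET_MASK


hits = [(3, 9), (2, 7), (1, 13), (2, 1)]
packed = sorted(pack(index, offset) for index, offset in hits)
print([unpack(word) for word in packed])
```

Because the index sits in the high bits, sorting the packed integers orders the hits by index first and by offset second, so a single integer sort replaces a two-key sort.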
Power Law: CPU time vs query length Fig. 1 Normalized CPU time plotted against the number of k-tuples in the query (k=12) using Quicksort.
SSAHA Memory
Memory for subject: $M_s = 4 N_s / k + 4 \cdot 2^{2k}$
Memory for query: $M_q = N_q$
Housekeeping: 10-20% of total
Total memory: $M_{total} = 1.2 (M_s + M_q)$
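The estimate can be evaluated directly. The sketch below assumes the formulas give sizes in bytes (4-byte integers), which the slide does not state, and uses a hypothetical 3 Gb subject and 1 kb query purely as example inputs.

```python
def ssaha_memory(n_subject: int, n_query: int, k: int, housekeeping: float = 0.2) -> float:
    """Total memory estimate: (1 + housekeeping) * (M_s + M_q)."""
    m_subject = 4 * n_subject / k + 4 * 2 ** (2 * k)  # positions + pointer table
    m_query = n_query
    return (1 + housekeeping) * (m_subject + m_query)


# Hypothetical example inputs, not figures from the talk.
total = ssaha_memory(n_subject=3_000_000_000, n_query=1_000, k=12)
print(f"{total / 1e9:.2f} GB (assuming bytes)")
```

The first term shrinks as k grows while the second grows as $2^{2k}$, so the choice of k trades position storage against table size.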
The SSAHA Trace Server [Diagram labels: SSAHA2 Client, SSAHA2 Client] It aims to provide a near real-time (under 10 seconds) search service for a clustered 1.0 TB database. The solution can be extended by plugging in extra appliances.
The Seven Bridges of Konigsberg [Figure: the Pregel River and the four land regions a, b, c, d] During the 18th century, the city of Konigsberg (in East Prussia) was divided into four sections (a, b, c and d) by the Pregel River. Seven bridges connected these regions. Question: Is it possible to walk about the city so as to cross each bridge exactly once and then return to the starting point?
Vertex Degree, Euler Circuit and Euler Path Vertex degree: For an undirected graph G, the degree of a vertex is the number of edges incident to that vertex. Euler circuit: For an undirected graph G, if there is a circuit in G that traverses every edge of the graph exactly once, then G is said to have an Euler circuit. [Figure: example graph with vertices a, b, c, d, e, f] Euler path: If there is an open trail from a to c in G and this trail traverses each edge in G exactly once, then the trail is called an Euler trail or Euler path.
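For a connected undirected graph, counting odd-degree vertices settles the question: zero odd vertices gives an Euler circuit, exactly two gives an Euler path, anything else gives neither. The sketch below applies this to the seven bridges. The assignment of bridges to the labels a, b, c, d is an assumption (the figure is not reproduced here), and the function assumes the graph is connected.

```python
from collections import Counter

# Seven bridges as a multigraph; which region gets which label is assumed.
KONIGSBERG = [
    ("a", "b"), ("a", "b"),
    ("a", "c"), ("a", "c"),
    ("a", "d"), ("b", "d"), ("c", "d"),
]


def euler_type(edges):
    """Classify a connected undirected multigraph by its odd-degree count."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    if odd == 0:
        return "circuit"
    if odd == 2:
        return "path"
    return "none"


print(euler_type(KONIGSBERG))
```

All four regions have odd degree, so the walk asked for on the previous slide does not exist.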
Sequence Reconstruction - Hamiltonian path approach S=(ATGCAGGTCC) ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC [Figure: graph with vertices ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG] Vertices: k-tuples from the spectrum, shown in red (8); Edges: overlapping k-tuples (7); Path: visiting all vertices, corresponding to the sequence.
Sequence Reconstruction - Euler path approach Vertices: correspond to (k-1)-tuples (7): AT, GT, CG, CA, GC, TG, GG; Edges: correspond to k-tuples from the spectrum (8); Path: visiting all EDGES, corresponding to the sequence. Two sequences share this spectrum: ATGCGTGGCA and ATGGCGTGCA. ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA spells S=(ATGGCGTGCA).
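A minimal sketch of the Euler path approach for this spectrum, using Hierholzer's algorithm on the graph whose vertices are (k-1)-tuples and whose edges are the k-tuples. The function name is hypothetical, and the code assumes an Euler path exists. Which of the two valid sequences it returns depends on the order in which edges are visited.

```python
from collections import defaultdict

SPECTRUM = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]


def euler_path_sequence(spectrum):
    """Spell a sequence by walking every k-tuple edge exactly once."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per vertex
    for kmer in spectrum:
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1
    # Start at the vertex with one more outgoing than incoming edge, if any.
    starts = [v for v, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(graph))
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])


print(euler_path_sequence(SPECTRUM))
```

Unlike the Hamiltonian path formulation, this runs in time linear in the number of k-tuples.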
SSAHA Type Hash Table
S1=(ATGCAGGTCC), S2=(ATCAGGTCC), S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)

E    k-tuple  Indices, offsets and links to the next: (index, offset, E of next k-tuple)
7    ATG      (1, 1, 28) (3, 1, 28) (4, 1, 28)
8    ATC      (2, 1, 29)
10   AGT      (4, 5, 38)
11   AGG      (1, 5, 42) (2, 4, 42) (3, 6, 42)
19   TAG      (3, 5, 11)
24   TTC      (4, 7, 32)
28   TGC      (1, 2, 45) (3, 2, 46) (4, 2, 45)
29   TCA      (2, 2, 51)
32   TCC      (1, 8, -1) (2, 7, -1) (3, 9, -1) (4, 8, -1)
38   GTT      (4, 6, 24)
40   GTC      (1, 7, 32) (2, 6, 32) (3, 8, 32)
42   GGT      (1, 6, 40) (2, 5, 40) (3, 7, 40)
45   GCA      (1, 3, 51) (4, 3, 51)
46   GCT      (3, 3, 53)
51   CAG      (1, 4, 11) (2, 3, 11) (4, 4, 10)
53   CTA      (3, 4, 19)
Point to the Next - Hash Table Links S1=(ATGCAGGTCC), S2=(ATCAGGTCC), S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC) In the SSAHA type hash table for these sequences, the third value of each entry is the E value of the k-tuple that follows it in the same sequence; -1 marks the last k-tuple of a sequence.
Consensus
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
S1:   ATGC-AGGTCC
S2:   AT-C-AGGTCC
S3:   ATGCTAGGTCC
S4:   ATGC-AGTTCC
CONS: ATGC-AGGTCC
CONS=(ATGCAGGTCC)
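One plausible reading of consensus generation from the linked hash table is a majority vote at each step: start from the most common first k-tuple and repeatedly follow the link that most reads agree on. The sketch below illustrates that reading under stated assumptions. It is not necessarily the procedure used in the talk, the function names are hypothetical, and links are stored as the next k-tuple string (None at the end of a read) rather than as an E value.

```python
from collections import Counter, defaultdict

READS = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]


def build_linked_table(reads, k):
    """k-tuple -> list of (index, offset, next k-tuple or None), 1-based."""
    table = defaultdict(list)
    for index, read in enumerate(reads, start=1):
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for offset, kmer in enumerate(kmers, start=1):
            # kmers is 0-based, so kmers[offset] is the k-tuple after this one.
            nxt = kmers[offset] if offset < len(kmers) else None
            table[kmer].append((index, offset, nxt))
    return table


def consensus(reads, k):
    """Follow the most frequent link from each k-tuple until the reads end."""
    table = build_linked_table(reads, k)
    current = Counter(read[:k] for read in reads).most_common(1)[0][0]
    result, seen = current, {current}
    while True:
        votes = Counter(nxt for _, _, nxt in table[current])
        nxt = votes.most_common(1)[0][0]
        if nxt is None or nxt in seen:  # end of reads, or a cycle guard
            break
        result += nxt[-1]
        seen.add(nxt)
        current = nxt
    return result


print(consensus(READS, 3))
```

On the four example reads the vote follows ATG, TGC, GCA, CAG, AGG, GGT, GTC, TCC, the path shown on the slide.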
eulerSNP For polymorphic datasets of shotgun reads, eulerSNP uses a combination of the Euler path and hashing algorithms to detect SNPs and replace each one with the most common base at that position. ATGC--AGGTCCATGC--AGGTCC AT T CCAGGTCC AT T C--AGCTCC ATGCTAGGTCCATGCTAGGTCC ATGC--AGGTCCATGC--AGGTCC ATGCTAGGTCC ATGC--AGGTCC ATGCTAGGTCCATGCTAGGTCC
Phusion Assembler Pipeline [Flowchart boxes: Shotgun Reads; Data Process; Reads Group; RPphrap - Contig; Read-pair Tracker; RPjoin - Merge; Supercontig; FPC Mapping; PRono Assembly]
Kmer Word Hashing (K = 12)
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA
Contiguous Base Hash: ATGGCGTGCAGT, TGGCGTGCAGTC, GGCGTGCAGTCC, GCGTGCAGTCCA, CGTGCAGTCCAT
Gap-Hash4x3: ATGGGCAGATGT, TGGCCAGTTGTT, GGCGAGTCGTTC, GCGTGTCCTTCG
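Both word types can be generated with a few lines. The gapped pattern is inferred from the example words: 3 blocks of 4 bases, with 3 bases skipped between blocks, which again gives 12 hashed bases per word. That reading of "4x3", and the function names, are assumptions.

```python
SEQ = "ATGGCGTGCAGTCCATGTTCGGATCA"


def contiguous_words(seq, k=12):
    """All overlapping words of k contiguous bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_words(seq, take=4, skip=3, blocks=3):
    """Words made of `blocks` runs of `take` bases, skipping `skip` between runs."""
    step = take + skip
    span = blocks * take + (blocks - 1) * skip  # bases covered by one word
    words = []
    for start in range(len(seq) - span + 1):
        pieces = [seq[start + b * step:start + b * step + take] for b in range(blocks)]
        words.append("".join(pieces))
    return words


print(contiguous_words(SEQ)[:5])
print(gapped_words(SEQ)[:4])
```

A gapped word spans 18 bases of the read while hashing only 12 of them, so a mismatch that falls in a skipped position does not change the word.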
Zebrafish as a model organism - Danio rerio; - Fish length: 3 cm; - Estimated genome size: 1.55 Gb; - Easy to maintain: short generation time, can be kept at high densities; - Easy to manipulate: external fertilisation and development, transparent embryos. Sanger Institute WGS project started in spring - DNA source: Tuebingen embryos; - WGS read insert sizes: kb; - BAC ends insert sizes: 165 – 175 kb; - Polymorphism: ~ day old embryos; - SNP density: one in every 200 bp; - Indel density: one in every 1500 bp; - Indel length: 2 – 30 bp.
Acknowledgements: Jim Mullikin, Yong Gu, Adam Spargo, Richard Durbin, Kerstin Jekosch, Sean Humphray, Jane Rogers, Sanger Systems Support, Sanger Sequencing Facilities