Improved Alignment of Protein Sequences Based on Common Parts David Hoksza Charles University in Prague Department of Software Engineering Czech Republic.


1 Improved Alignment of Protein Sequences Based on Common Parts
David Hoksza
Charles University in Prague, Department of Software Engineering, Czech Republic

2 ISBRA 2008 Presentation Outline
- Similarity search in protein sequence databases
- Smith-Waterman algorithm
- Common parts
  - basic algorithm
  - reversed sequences
  - inexact search
- Experiments
- Conclusion

3 Similarity Measures
Comparing two strings of amino acids:
- Hamming distance
  - sequences of equal length
  - number of non-identical positions
- edit distance
  - minimal number of insert/update/delete operations needed to convert one sequence into the other
- weighted edit distance
  - takes into account the probability of updating one letter to the other
  - scoring (substitution) matrices: PAM, BLOSUM, …
  - different costs for opening/extending a gap
- global/local alignment
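The first two measures on this slide can be sketched in a few lines of Python. This is a minimal illustration only; the function names `hamming` and `edit_distance` are chosen here and do not come from the talk, and every insert, update, and delete is assumed to cost 1.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimal number of insert/update/delete operations turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]  # deleting all i letters of a[:i] yields the empty string
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # update x to y (free if equal)
        prev = cur
    return prev[-1]
```

The weighted edit distance replaces the unit costs with substitution-matrix scores and gap penalties, which is what the alignment algorithms on the following slides do.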

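4 Global Alignment
- Global alignment
  - aligning whole sequences
  - weighted edit distance
- Needleman-Wunsch
  - optimal alignment between 2 sequences a and b
  - distance matrix δ
  - gap cost σ
  - $s_{i,j}$: value of the optimal alignment of the prefixes of a and b of lengths i and j
  - $s_{0,j} = j\sigma$, $s_{i,0} = i\sigma$
  - $s_{i,j} = \max\{\, s_{i,j-1} + \sigma,\; s_{i-1,j} + \sigma,\; s_{i-1,j-1} + \delta(a_i, b_j) \,\}$, where the three terms correspond to adding a gap to a, adding a gap to b, and aligning $a_i$ with $b_j$
  - $s_{|a|,|b|}$: value of the optimal alignment
  - time complexity O(|a||b|)
- Example (BLOSUM 62, gap cost -1):
    NPHGIIMGLAE
    --HG--LGL--
  aligned pairs score +8 +6 +2 +6 +4 = 26, the six gap positions cost -6, total 20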
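The Needleman-Wunsch fill can be sketched as follows. This is a hedged illustration, not the author's code: `SLIDE_SCORES` contains only the substitution scores that appear on the slides (H-H 8, G-G 6, L-L 4, M-L 2, I-L 2), and every other pair falls back to a placeholder of -4, which is an assumption made here and not the real BLOSUM 62 value.

```python
# Substitution scores that appear on the slides; everything else is a placeholder.
SLIDE_SCORES = {("H", "H"): 8, ("G", "G"): 6, ("L", "L"): 4,
                ("M", "L"): 2, ("I", "L"): 2}


def slide_delta(x: str, y: str) -> int:
    """Score of aligning x with y; -4 is an assumed stand-in for unlisted pairs."""
    return SLIDE_SCORES.get((x, y), SLIDE_SCORES.get((y, x), -4))


def needleman_wunsch(a: str, b: str, delta=slide_delta, sigma: int = -1) -> int:
    """Value of the optimal global alignment of a and b with linear gap cost sigma."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * sigma for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * sigma]  # column 0: a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            cur.append(max(cur[j - 1] + sigma,   # adding gap to a
                           prev[j] + sigma,      # adding gap to b
                           prev[j - 1] + delta(a[i - 1], b[j - 1])))  # align a_i, b_j
        prev = cur
    return prev[-1]


print(needleman_wunsch("NPHGIIMGLAE", "HGLGL"))  # the slide's example pair
```

Only two rows are kept in memory because the final score needs nothing else; recovering the alignment itself would require the full matrix or a traceback structure.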

5 Local Alignment
- Local alignment
  - best global alignment of all pairs of subsequences of a and b
- Smith-Waterman
  - modification of Needleman-Wunsch allowing a "free ride" from the start by incorporating a zero value
  - $s_{0,j} = 0$, $s_{i,0} = 0$
  - $\max(s_{i,j})$: value of the optimal alignment
  - gap extending: σ
  - gap opening: ρ
- Example (BLOSUM 62, gap cost … -11):
    NPHGIIMGLAE
      HGL
  aligned pairs score +8 +6 +2 = 16
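A local alignment with separate gap-opening and gap-extending costs can be sketched as below. Several things are assumptions made for this sketch rather than statements from the slide: the first gapped position costs `gap_open` (ρ) and each further one `gap_extend` (σ), the default values -11 and -1 are illustrative, and `slide_delta` uses only the scores shown on the slides with a placeholder of -4 for all other pairs (not real BLOSUM 62 values).

```python
SLIDE_SCORES = {("H", "H"): 8, ("G", "G"): 6, ("L", "L"): 4,
                ("M", "L"): 2, ("I", "L"): 2}


def slide_delta(x: str, y: str) -> int:
    """Score of aligning x with y; -4 is an assumed stand-in for unlisted pairs."""
    return SLIDE_SCORES.get((x, y), SLIDE_SCORES.get((y, x), -4))


def smith_waterman(a: str, b: str, delta=slide_delta,
                   gap_open: int = -11, gap_extend: int = -1) -> int:
    """Best local alignment score of a and b with affine gap costs."""
    neg = float("-inf")
    m = len(b)
    s_prev = [0] * (m + 1)    # previous row of the score matrix s
    v_prev = [neg] * (m + 1)  # previous row of v: alignments ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        s_cur, v_cur = [0], [neg]
        h = neg               # h: alignments ending in a horizontal gap (current row)
        for j in range(1, m + 1):
            v = max(s_prev[j] + gap_open, v_prev[j] + gap_extend)
            h = max(s_cur[j - 1] + gap_open, h + gap_extend)
            diag = s_prev[j - 1] + delta(a[i - 1], b[j - 1])
            cell = max(0, diag, v, h)  # the 0 is the "free ride" restart
            s_cur.append(cell)
            v_cur.append(v)
            best = max(best, cell)
        s_prev, v_prev = s_cur, v_cur
    return best


print(smith_waterman("NPHGIIMGLAE", "HGL"))  # the slide's example pair
```

The auxiliary values `h` and `v` play the role of the h and v matrices mentioned later in the talk; only the current and previous rows are stored here.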

6 Speeding-up Database Search
- non-rigorous search
  - heuristic approaches trading off accuracy for speed
  - BLAST, FASTA
- rigorous search
  - indexing
    - weighted edit distance is not a metric in general → MAMs (metric access methods) not applicable
    - turning the distance into a metric: limited to q-grams
  - parallelism
    - run more alignments concurrently
      - MPSrch
    - distance computation itself
      - FPGA (field-programmable gate arrays)
      - instructions for parallelism

7 Common Alignment Matrices Parts
1. align $s_i$ with the query sequence
2. replace $s_i$ with $s_{i+1}$
3. start the alignment from the (n+1)st row
- do the same with the h and v matrices
- algorithm stays intact
- pre-step: sorting
- prefix ratio (PR): speed-up
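The three steps can be sketched as follows, under assumptions that the slide does not spell out: n is taken to be the length of the prefix that $s_{i+1}$ shares with $s_i$ after sorting, database sequences run along the matrix rows, a linear gap cost is used so that no h and v matrices are needed, and the prefix ratio is computed as the fraction of rows that did not have to be recomputed. The function names are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def search_with_common_prefixes(query, database, score, gap):
    """Local alignment score of every database sequence against query,
    reusing the matrix rows that belong to a prefix shared with the
    previously aligned sequence. Returns (scores, prefix_ratio)."""
    rows = [[0] * (len(query) + 1)]  # rows[k]: matrix row after k residues
    best = [0]                       # best[k]: maximum over rows 0..k
    prev, scores, computed = "", {}, 0
    for seq in sorted(database):     # pre-step: sorting
        n = common_prefix_len(prev, seq)
        del rows[n + 1:]             # keep rows 0..n, they are identical
        del best[n + 1:]
        for i in range(n, len(seq)): # start from the (n+1)st row
            above, row = rows[-1], [0]
            for j in range(1, len(query) + 1):
                row.append(max(0,
                               above[j - 1] + score(seq[i], query[j - 1]),
                               above[j] + gap,
                               row[j - 1] + gap))
            rows.append(row)
            best.append(max(best[-1], max(row)))
            computed += 1
        scores[seq] = best[-1]
        prev = seq
    total = sum(len(s) for s in database)
    ratio = (total - computed) / total if total else 0.0
    return scores, ratio


match = lambda x, y: 2 if x == y else -1  # illustrative scoring, not BLOSUM
scores, pr = search_with_common_prefixes(
    "NPHGIIMGLAE", ["HGLAE", "HGLGL", "HGIIM", "NPHG"], match, -1)
print(scores, pr)
```

The alignment recurrence itself is unchanged; only the starting row moves, which matches the slide's remark that the algorithm stays intact.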

8 Reversed Sequences
- the score of an alignment is independent of the direction of the alignment
  - possibility of aligning according to suffixes (prefixes of reversed sequences)
  - division of the database into 2 groups (prefixes, suffixes) by a greedy algorithm:
    1. building stage: divide a given percentage of the database randomly and the rest so that PR increases in every step
    2. shifting stage: move a random sequence to the opposite group if it would increase the overall PR
    - repeat step 2 n times
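One possible reading of the two-stage greedy division is sketched below. It is an interpretation, not the author's procedure: the slide does not say how PR is measured or how ties are broken, so this sketch measures the total number of residues shared with the sorted neighbour, places each remaining sequence in whichever group yields the larger total, and recomputes the totals naively. All names and default parameters are hypothetical.

```python
import random


def shared_prefix_total(seqs):
    """Sum of common-prefix lengths between neighbours in sorted order."""
    ordered = sorted(seqs)
    total = 0
    for prev, cur in zip(ordered, ordered[1:]):
        for x, y in zip(prev, cur):
            if x != y:
                break
            total += 1
    return total


def overall_shared(prefix_group, suffix_group):
    """Shared residues when the suffix group is stored reversed."""
    return (shared_prefix_total(prefix_group)
            + shared_prefix_total([s[::-1] for s in suffix_group]))


def split_database(database, seed_fraction=0.1, shifts=100, seed=0):
    """Divide database into a prefix group and a suffix group."""
    rng = random.Random(seed)
    pool = list(database)
    rng.shuffle(pool)
    k = int(len(pool) * seed_fraction)
    prefix_group, suffix_group = [], []
    # Building stage, part 1: assign a given fraction randomly.
    for s in pool[:k]:
        (prefix_group if rng.random() < 0.5 else suffix_group).append(s)
    # Building stage, part 2: put each sequence where the shared total grows most.
    for s in pool[k:]:
        as_prefix = overall_shared(prefix_group + [s], suffix_group)
        as_suffix = overall_shared(prefix_group, suffix_group + [s])
        (prefix_group if as_prefix >= as_suffix else suffix_group).append(s)
    # Shifting stage: move a random sequence only if the overall total increases.
    for _ in range(shifts):
        src, dst = ((prefix_group, suffix_group) if rng.random() < 0.5
                    else (suffix_group, prefix_group))
        if not src:
            continue
        current = overall_shared(prefix_group, suffix_group)
        idx = rng.randrange(len(src))
        s = src.pop(idx)
        dst.append(s)
        if overall_shared(prefix_group, suffix_group) <= current:
            dst.pop()           # no improvement: undo the move
            src.insert(idx, s)
    return prefix_group, suffix_group
```

Sequences in the suffix group would then be aligned in reversed form against the reversed query, which is valid because the alignment score does not depend on direction.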

9 Inexact Search
- bigger database (#sequences) → higher PR
- split sequences
  - → increase of database size proportional to the number of splits
  - → inaccuracy: sequences whose alignment spreads over the split might no longer be in the result
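The loss of accuracy can be illustrated with a small sketch. The helper `chop` is hypothetical, and the longest common substring is used here as a deliberately simplified stand-in for a local alignment score (exact matches only, no substitutions or gaps); it is not the scoring used in the talk.

```python
def chop(seq: str, pieces: int) -> list:
    """Split seq into at most `pieces` consecutive chunks of near-equal length."""
    size = max(1, -(-len(seq) // pieces))  # ceiling division, at least 1
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest exact match between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            # Extend the match ending at the previous residues, or reset to 0.
            cur.append(prev[j - 1] + 1 if x == y else 0)
        best = max(best, max(cur))
        prev = cur
    return best


whole = "NPHGIIMGLAE"
query = "GIIM"
parts = chop(whole, 2)
print(parts)
print(longest_common_substring(whole, query))
print(max(longest_common_substring(p, query) for p in parts))
```

In this example the match to the query straddles the split point, so neither chunk reaches the score of the unsplit sequence, while the database grows from one entry to two.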

10 Experimental Results
- UniProt DB
  - max. sequence length 3000 (99.9% of UniProt)
  - random subsets: 1,000; 5,000; 10,000; 15,000; 30,000; 50,000; 80,000; 100,000; 200,000; 500,000; 1,000,000
  - semantically motivated subsets: archaea, bacteria, fungi, human, invertebrates, mammals, plants, rodents, vertebrates, viruses
- Testing of
  - prefix ratio of
    - basic solution
    - reversed sequences
    - chopped sequences

11 Experiments - Prefix Ratio of Random Subsets and Taxonomic Divisions

12 Experiments - Reversed Sequences
Plot series: after the building stage; after the shifting stage; without reversed sequences

13 Experiments - Chopped Sequences

14 Conclusion
- We have proposed
  - a simple method for speeding up the database search of protein sequences by using common prefixes and suffixes
    - easy implementation with current methods
  - rigorous and non-rigorous versions of the algorithm
- We implemented
  - a modification of the Smith-Waterman algorithm
- Experimental results
  - we have shown up to 20% speed-up with the rigorous version of the algorithm

