Published by Loren Rice. Modified over 9 years ago.
1
Bioinformatics PhD course. Summary (approximate):
1. Biological introduction
2. Comparison of short sequences (<10,000 bp)
3. Comparison of large sequences (up to 250,000,000 bp)
4. Sequence assembly
5. Efficient data search structures and algorithms
6. Proteins...
2
2. Comparison of short sequences (<10,000 bp). Summary (more or less):
2.1 Dot matrix
2.2 Pairwise alignment
2.3 Hash algorithms
2.4 Multiple alignment
3
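2.1 Dot matrix. Given two sequences, how can we analyse their degree of identity? By searching for the parts that match: place S1 along one axis (position x) and S2 along the other (position y), and store in cell (x, y) a value 1/0: 1 if both characters coincide, 0 otherwise.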
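The construction described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot_matrix` and `render`, the example sequences, and the choice of putting S1 on the columns and S2 on the rows are assumptions, not part of the original slides.

```python
def dot_matrix(s1, s2):
    """Return m with m[y][x] = 1 if s1[x] == s2[y], else 0.

    Assumption: S1 runs along the columns (x) and S2 along the rows (y).
    """
    return [[1 if a == b else 0 for a in s1] for b in s2]


def render(m):
    """Draw the matrix as text: 'x' for a match, '.' for a mismatch."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in m)


if __name__ == "__main__":
    # Hypothetical toy sequences, chosen only to keep the output small.
    print(render(dot_matrix("acca", "acga")))
```

Every cell is filled by a single character comparison, so the cost of this plain version is length(S1) * length(S2).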
4
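2.1 Dot matrix (continued). Given two sequences, how can we analyse their degree of identity? By searching for the parts that match. [Figure: dot matrix of S1 against S2, with a stretch of consecutive positions starting at x in S1 and at y in S2 marked; each cell holds 1/0, 1 if both characters coincide.]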
5
2.1 Dot matrix. What is the cost of the algorithm? When are the matchings relevant?
Example: accaccacaccacaacgagcata … acctgagcgatat and acc..t
L = window length
Exact matching: m(i,j) = 1 iff S1(i..i+L) = S2(j..j+L)
Approximate matching (binary): m(i,j) = 1 iff k out of L characters coincide
Approximate matching (graded): m(i,j) = k iff k out of L characters coincide
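The three window rules above can be sketched as follows. This is a minimal sketch under stated assumptions: the names `window_matches` and `windowed_dot_matrix` are hypothetical, a window is taken to hold exactly L characters (the slide's `i..i+L` notation is ambiguous on this point), and "k out of L coincide" is read as "at least k".

```python
def window_matches(s1, s2, i, j, L):
    """Count positions t in [0, L) where s1[i + t] == s2[j + t]."""
    return sum(1 for a, b in zip(s1[i:i + L], s2[j:j + L]) if a == b)


def windowed_dot_matrix(s1, s2, L, k=None, graded=False):
    """Windowed dot matrix, indexed as m[i][j] with i in S1 and j in S2.

    graded=True  -> m[i][j] = number of coincidences in the window
    k=None       -> exact matching (all L characters must coincide)
    k=<int>      -> approximate matching (assumed: at least k coincide)
    """
    threshold = L if k is None else k
    m = []
    for i in range(len(s1) - L + 1):
        row = []
        for j in range(len(s2) - L + 1):
            count = window_matches(s1, s2, i, j, L)
            row.append(count if graded else int(count >= threshold))
        m.append(row)
    return m


if __name__ == "__main__":
    # Hypothetical toy input; prints the graded matrix for L = 3.
    for row in windowed_dot_matrix("acct", "acgt", 3, graded=True):
        print(row)
```

Each cell costs L comparisons here, which is where the length(S1) * length(S2) * L total comes from.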
6
2.1. Dot matrix: algorithm cost.
Example: accaccacaccacaacgagcata … acctgagcgatat and acc..t
Cost: length(S1) * length(S2) * L, in other words $O(n^2 L)$.
Is a cost of length(S1) * length(S2) possible? Can we then also say that the cost is $O(n^2)$, independent of L?
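One possible answer to the question on this slide, offered as a sketch rather than as the course's own solution: consecutive windows on the same diagonal share all but two character pairs, so the count for cell (i+1, j+1) can be obtained from cell (i, j) by dropping the pair that leaves the window and adding the pair that enters it. The function name `windowed_counts_sliding` and the convention that a window holds exactly L characters are assumptions.

```python
def windowed_counts_sliding(s1, s2, L):
    """Graded windowed dot matrix computed diagonal by diagonal.

    m[i][j] = number of t in [0, L) with s1[i + t] == s2[j + t].
    Only the first window of each diagonal is counted from scratch;
    every later cell is updated in constant time.
    """
    rows, cols = len(s1) - L + 1, len(s2) - L + 1
    if rows <= 0 or cols <= 0:
        return []
    m = [[0] * cols for _ in range(rows)]
    # Each diagonal starts in the first row or in the first column.
    starts = [(0, j) for j in range(cols)] + [(i, 0) for i in range(1, rows)]
    for i, j in starts:
        count = sum(1 for t in range(L) if s1[i + t] == s2[j + t])
        m[i][j] = count
        while i + 1 < rows and j + 1 < cols:
            count -= s1[i] == s2[j]          # pair leaving the window
            count += s1[i + L] == s2[j + L]  # pair entering the window
            i, j = i + 1, j + 1
            m[i][j] = count
    return m
```

With sequences of length about n, the from-scratch windows cost O(nL) in total and the constant-time updates cost O(n^2). Since L is at most n, the whole computation is O(n^2) regardless of the window length.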
7
2.1. Dot matrix: signals. [Figure panels: A: transposons; B: S1 = S2; C: random.] When are signals statistically significant?
8
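2.1. Dot matrix: statistical significance. We need to define a random model against which to compare the signals. Given a window of length L starting at position x in S1 and at position y in S2, define the random variable X = number of characters that coincide in the window. Then $\mathrm{Prob}(X=k)=\binom{L}{k}\,p^{k}(1-p)^{L-k}$, where p is the probability that a single pair of characters coincides. What is its expected value?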
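For the binomial model above, the expected value is E[X] = L·p, a standard property of the binomial distribution. The sketch below evaluates the distribution numerically. The function names are hypothetical, and the value p = 0.25 used in the example is an assumption (four equiprobable, independent nucleotides), not something stated on the slide.

```python
from math import comb


def match_count_pmf(k, L, p):
    """Prob(X = k): exactly k of the L window positions coincide."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)


def expected_matches(L, p):
    """E[X] computed directly from the distribution (equals L * p)."""
    return sum(k * match_count_pmf(k, L, p) for k in range(L + 1))


def tail_probability(k, L, p):
    """Prob(X >= k): chance of at least k coincidences in a random window."""
    return sum(match_count_pmf(t, L, p) for t in range(k, L + 1))


if __name__ == "__main__":
    L, p = 10, 0.25  # assumed window length and match probability
    print(expected_matches(L, p))
    print(tail_probability(8, L, p))
```

A signal with k coincidences in a window can then be judged against the random model by how small `tail_probability(k, L, p)` is.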