Mosaic exploring reticulate protein family evolution UQ, COMBIO AU, Brisbane 02-03-09 Maetschke/Kassahn.

2 motivation
The problem: evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...).
Describing reticulate (non-tree-like) phylogenetic relationships as trees may be an oversimplification.
Traditional methods: phylogenetic tree inference gets increasingly complex and is not suitable; phylogenetic networks are even more complex, and visualization is difficult.
mosaic: a fast method to analyze and visualize (phylogenetic) sequence relationships, applied to identify and study non-tree-like protein families. The aim is to perform whole-proteome scans for reticulate proteins.

3 n-grams & dot plots
Phylogenetics: alignment is the cornerstone, but classical alignment may fail for reticulate proteins.
"Alignment-free" methods: split the sequence into overlapping subsequences of length n.
Example: MSKRRMSVGQQTW... gives the 4-grams MSKR, SKRR, KRRM, RRMS, ...
Figure: n-gram dot plot of two sequences S1 and S2 built from segments A and B (AB, BA).
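The splitting step and the dot plot can be sketched in a few lines of Python. This is an illustrative sketch, not the mosaic implementation: the function names `ngrams` and `dot_plot` and the two toy sequences are assumptions made for the example.

```python
def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, in order of position."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def dot_plot(seq1, seq2, n=4):
    """Return all (i, j) where the n-gram at position i of seq1
    equals the n-gram at position j of seq2."""
    # Index the n-gram positions of seq2 once, so each lookup is O(1).
    positions = {}
    for j, gram in enumerate(ngrams(seq2, n)):
        positions.setdefault(gram, []).append(j)
    return [(i, j)
            for i, gram in enumerate(ngrams(seq1, n))
            for j in positions.get(gram, [])]


# Hypothetical toy sequences: two segments A and B in swapped order.
A, B = "MSKR", "KRRM"
print(ngrams("MSKRRMS"))
print(dot_plot(A + B, B + A))
```

With swapped segments the dots fall on two off-centre diagonals instead of the main one, which is the pattern a classical global alignment cannot represent.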

4 some real n-gram dot plots
4-grams are "unique" for a sequence (we talk about '4' later...).
Example sequence:
>AR_Pt
MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPE
AASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQ
QQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCV
PEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLS
SCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAP
TSSKDNYLGGTSTISDSAKELCKAV...
Figure panels: c=10, n=4; c=10, n=4; c=2, n=1.

5 another n-gram dot plot
Nuclear receptors:
DBD: DNA binding domain, two zinc finger motifs.
LBD: ligand binding domain.
AF-1/AF-2: transcriptional activation domains.
Figure: dot plot with the DBD and LBD regions marked.

6 n-gram sequence similarity s
S = set of n-grams, e.g. {AAGR, AGRK, GRKQ, ...}.
Given two sequences and their n-gram sets S1 and S2, s is the number of shared n-grams, normalized to the range [0...1].
Normalization with max corresponds to global alignment, with min to local alignment.
Example: {AAG, AGQ, GQQ} ∩ {GQQ, QQQ} = {GQQ}
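A minimal sketch of the similarity follows. The formula itself did not survive in the transcript, so the code assumes s = |S1 ∩ S2| / max(|S1|, |S2|) for the global variant and min in place of max for the local one; the names `ngram_set` and `ngram_similarity` are hypothetical.

```python
def ngram_set(seq, n=4):
    """Return the set of overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(seq1, seq2, n=4, mode="global"):
    """Shared n-grams, normalized to [0, 1].

    Assumption: the normalizer is the size of the larger n-gram set
    (mode="global") or of the smaller one (mode="local").
    """
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    if not s1 or not s2:
        return 0.0  # a sequence shorter than n has no n-grams
    norm = max if mode == "global" else min
    return len(s1 & s2) / norm(len(s1), len(s2))


# The slide's 3-gram example: {AAG, AGQ, GQQ} and {GQQ, QQQ} share {GQQ}.
print(ngram_similarity("AAGQQ", "GQQQ", n=3, mode="global"))
print(ngram_similarity("AAGQQ", "GQQQ", n=3, mode="local"))
```

Because it works on sets, the score ignores the order of the shared n-grams, which is why shuffled conserved segments still count as similar.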

7 n-gram similarity
Fast: linear w.r.t. the size of the n-gram sets (classical alignment is quadratic w.r.t. sequence length).
Easy to interpret (0.5 = half of the n-grams are shared).
No parameters (gap penalty, gap extension penalty, ...).
Can deal with shuffling of conserved segments and other "strange" cases (are they actually strange?).
Better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment-free methods can be as good as classical alignment for inference of phylogeny; Edgar 2004: MUSCLE, an n-gram based alignment method.)

8 why 4 and not 42
Hoehl 2008: n = 3...5.
Correlation between n-gram sequence similarity and species divergence times.
Standard deviation of sequence similarities.
Maximum AUC when distinguishing related and randomly shuffled sequences.
Figure labels: MR, r=0.93; 4.

9 phylogenetic networks
Different node and edge types.
Identification of reticulate events (e.g. recombination) is error-prone.
Computationally expensive.
Larger networks become messy.
Figures: T-Rex (Makarenkov et al. 2001); NeighborNet/SplitsTree (Bryant et al. 2004, Huson et al. 1998); Newick (Cardona et al. 2008).

10 larger networks - example
Figures: Huson et al. 2005; Bryant et al. 2004.

11 graph = ridiculogram
Layout dependent.
Distorted distances.
Random initialization.
Local minima.
Slow.
Figure: spring layout of nuclear receptors (GR, MR, PR, AR).

12 mosaic plot
Point size is similarity.
No distortions.
No random initialization.
Preserves full information.
Automatic clustering (spectral rearrangement).
No hard decision about the number of clusters.

13 spectral clustering
Affinity matrix: A = exp(-(1-S)**2/sig), where s_ij is the n-gram similarity between sequences i and j, and σ (sig) defines the neighborhood radius.
"Degree" matrix: D = diag(A.sum(axis=0))
Laplacian matrix: L = D-A
Eigenvector decomposition: e,v = eigh(L), with e the eigenvalues and v the eigenvectors.
v_2, the eigenvector for the 2nd smallest eigenvalue (Fiedler vector), indicates clusters and how well they are separated.
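The four expressions on this slide can be turned into a small runnable NumPy sketch. Two things are assumptions here: that `eigh` refers to `numpy.linalg.eigh` (which returns eigenvalues in ascending order), and that sorting the sequences by their Fiedler-vector entry is how the rearrangement is obtained. The function name `fiedler_order`, the value `sigma=0.5` and the toy matrix `S_demo` are made up for illustration.

```python
import numpy as np


def fiedler_order(S, sigma=0.5):
    """Return (order, fiedler) for a symmetric similarity matrix S in [0, 1]."""
    S = np.asarray(S, dtype=float)
    A = np.exp(-(1.0 - S) ** 2 / sigma)   # affinity matrix
    D = np.diag(A.sum(axis=0))            # "degree" matrix
    L = D - A                             # (unnormalized) Laplacian
    e, v = np.linalg.eigh(L)              # eigenvalues e in ascending order
    fiedler = v[:, 1]                     # eigenvector of 2nd smallest eigenvalue
    # Assumption: rearrange rows/columns by sorting on the Fiedler entries.
    return np.argsort(fiedler), fiedler


# Toy similarities: sequences 0 and 2 form one group, 1 and 3 the other.
S_demo = np.array([[1.0, 0.1, 0.9, 0.1],
                   [0.1, 1.0, 0.1, 0.9],
                   [0.9, 0.1, 1.0, 0.1],
                   [0.1, 0.9, 0.1, 1.0]])
order, fiedler = fiedler_order(S_demo)
print(order)
print(S_demo[np.ix_(order, order)])  # rearranged matrix, groups now adjacent
```

The sign of each Fiedler entry gives a two-way split, and the gap between the positive and negative entries hints at how well the two groups are separated.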

14 spectral rearrangement

15 recursive spectral rearrangement
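The transcript does not say how the recursion is carried out, so the following is only one plausible sketch: split the sequences by the sign of the Fiedler vector, then rearrange each half in the same way until a group is smaller than `min_size`. The splitting rule, the stopping rule, and the names `fiedler_vector` and `recursive_order` are all assumptions.

```python
import numpy as np


def fiedler_vector(S, sigma):
    """Fiedler vector of the Laplacian built from similarity matrix S."""
    A = np.exp(-(1.0 - S) ** 2 / sigma)
    L = np.diag(A.sum(axis=0)) - A
    _, v = np.linalg.eigh(L)              # eigenvalues in ascending order
    return v[:, 1]


def recursive_order(S, sigma=0.5, min_size=3, idx=None):
    """Return a row/column ordering obtained by recursive two-way splits."""
    S = np.asarray(S, dtype=float)
    if idx is None:
        idx = np.arange(S.shape[0])
    if len(idx) < min_size:               # assumed stopping rule
        return [int(i) for i in idx]
    f = fiedler_vector(S[np.ix_(idx, idx)], sigma)
    neg, pos = idx[f < 0], idx[f >= 0]    # assumed splitting rule: sign of entry
    if len(neg) == 0 or len(pos) == 0:    # no split possible: just sort
        return [int(i) for i in idx[np.argsort(f)]]
    return (recursive_order(S, sigma, min_size, neg)
            + recursive_order(S, sigma, min_size, pos))


# Toy similarities: sequences 0 and 2 form one group, 1 and 3 the other.
S_demo = np.array([[1.0, 0.1, 0.9, 0.1],
                   [0.1, 1.0, 0.1, 0.9],
                   [0.9, 0.1, 1.0, 0.1],
                   [0.1, 0.9, 0.1, 1.0]])
rec_order = recursive_order(S_demo)
print(rec_order)
```

Each level only looks at the similarities inside one group, so sub-clusters that are invisible in the global Fiedler vector can still be separated further down.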

16 spectral clustering
Takes "global" properties into account.
Fast and scales well.
No random initialization => single run.
Global minimum => single, unique solution.
Few parameters: L, σ (σ <= mean of distance matrix).
"Better" than k-means (works for non-spherical clusters) or single linkage hierarchical clustering (no chaining problem).
Clustering is NP-hard and spectral clustering is "just another approximation".
Recursive spectral clustering to improve cluster quality.

17 mosaic - demo

18 the end
Fast technique to visualize/analyze reticulate protein family evolution.
Matrix representation.
Spectral clustering.
n-gram similarity.
Many other applications.
Perl free!

19 questions ??

20 SCOP
Five families, randomly selected.

21 Nuclear receptors
Figure panels: ligand binding domain; N-terminal section; zinc-finger domain.

22 mosaic - examples

23 Full length sequence
Tree labels: GR, MR, PR, AR.
MrBayes v3.1.2, 10^6 generations, 4 chains, 240 CPU-hrs.

24 Zinc finger domain
Tree labels: AR, GR, MR, PR.
MrBayes v3.1.2, 10^6 generations, 4 chains, 9 CPU-hrs.

25 Ligand-binding domain
Tree labels: PR, AR, MR, GR.
MrBayes v3.1.2, 10^6 generations, 4 chains, 27 CPU-hrs.

26 Upstream region
Tree labels: ?
MrBayes v3.1.2, 10^6 generations, 4 chains, 87 CPU-hrs.

27 quality q
Given two sequences and their n-gram dot plot:
diag = set of dot sums along diagonals.
n = length of sequence.
q is normalized to the range [0...1]; normalization with max corresponds to global alignment, with min to local alignment.
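The formula for q is missing from the transcript, so the sketch below is a guess at its shape rather than the authors' definition. It computes the dot sums along the diagonals of an n-gram dot plot, which the slide does describe, and then assumes that q is the largest diagonal sum divided by the max (global) or min (local) number of n-gram positions of the two sequences. The slide normalizes by sequence length; the number of n-gram positions is used here so that identical sequences score 1. The names `diagonal_sums`, `quality` and the parameter `k` (the n-gram length, renamed to avoid clashing with the slide's n) are hypothetical.

```python
def diagonal_sums(seq1, seq2, k=4):
    """Count dots per diagonal of the k-gram dot plot.

    The key is the diagonal offset j - i, the value the number of dots on it.
    """
    positions = {}
    for j in range(len(seq2) - k + 1):
        positions.setdefault(seq2[j:j + k], []).append(j)
    sums = {}
    for i in range(len(seq1) - k + 1):
        for j in positions.get(seq1[i:i + k], []):
            sums[j - i] = sums.get(j - i, 0) + 1
    return sums


def quality(seq1, seq2, k=4, mode="global"):
    """Assumed form of q: largest diagonal sum over a length normalizer."""
    sums = diagonal_sums(seq1, seq2, k)
    if not sums:
        return 0.0
    norm = max if mode == "global" else min
    # Assumption: normalize by the number of k-gram positions.
    return max(sums.values()) / norm(len(seq1) - k + 1, len(seq2) - k + 1)


print(diagonal_sums("MSKRKRRM", "KRRMMSKR"))
print(quality("MSKRRMSV", "MSKRRMSV"))
```

Under this reading, a high s combined with a low q would mean that many n-grams are shared but scattered over several diagonals, as happens when conserved segments are shuffled.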

28 q over s

29 q-spectrum

30 n-gram dot plots
