Mosaic exploring reticulate protein family evolution UQ, COMBIO AU, Brisbane 02-03-09 Maetschke/Kassahn.

Presentation transcript:

mosaic exploring reticulate protein family evolution UQ, COMBIO AU, Brisbane Maetschke/Kassahn

2 motivation
the problem:
- evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...)
- describing reticulate (non-tree-like) phylogenetic relationships as trees may be an oversimplification
traditional methods:
- phylogenetic tree inference gets increasingly complex and is not suitable
- phylogenetic networks are even more complex and visualization is difficult
mosaic:
- fast method to analyze and visualize (phylogenetic) sequence relationships
- applied to identify and study non-tree-like protein families
- aim: perform whole-proteome scans for reticulate proteins

3 n-grams & dot plots
- "alignment free" methods: split the sequence into overlapping subsequences of length n
- example: MSKRRMSVGQQTW... gives the 4-grams MSKR, SKRR, KRRM, RRMS, ...
- phylogenetics: alignment is the cornerstone
- classical alignment may fail for reticulate proteins
[figure: n-gram dot plot of two sequences S1 and S2 that contain the segments A and B in swapped order (AB vs. BA)]
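
The n-gram splitting and the dot plot described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the mosaic implementation (which the deck says is written in Perl); the function names `ngrams` and `dot_plot` and the two example segments are assumptions made up for this sketch.

```python
def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, in sequence order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def dot_plot(seq1, seq2, n=4):
    """Return the set of (i, j) positions where the n-gram starting at
    position i of seq1 equals the n-gram starting at position j of seq2."""
    # Index the n-grams of seq2 by content so each lookup is a dict access.
    positions = {}
    for j, gram in enumerate(ngrams(seq2, n)):
        positions.setdefault(gram, []).append(j)
    dots = set()
    for i, gram in enumerate(ngrams(seq1, n)):
        for j in positions.get(gram, []):
            dots.add((i, j))
    return dots


# Hypothetical example: two segments A and B in swapped order (AB vs. BA).
seg_a, seg_b = "MSKRRMSV", "GQQTWLPE"
dots = dot_plot(seg_a + seg_b, seg_b + seg_a)
```

In this example the shared segments show up as two off-diagonal runs of dots, which is the AB/BA pattern the slide sketches and the kind of rearrangement a single classical alignment cannot represent.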

4 some real n-gram dot plots
- 4-grams are "unique" for a sequence (we talk about '4' later...)
- example sequence:
>AR_Pt
MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPE
AASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQ
QQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCV
PEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLS
SCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAP
TSSKDNYLGGTSTISDSAKELCKAV...
[figure: n-gram dot plots; panels labelled c=10 n=4, c=10 n=4 and c=2 n=1]

5 another n-gram dot plot: nuclear receptors
- DBD: DNA binding, two zinc finger motifs
- LBD: ligand binding domain
- AF-1/AF-2: transcriptional activation domains
[figure: dot plot with the DBD and LBD regions marked]

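6 n-gram sequence similarity s
- S = set of n-grams of a sequence, e.g. {AAGR, AGRK, GRKQ, ...}
- given two sequences and their n-gram sets $S_1$ and $S_2$, s is the number of shared n-grams, normalized so that s lies in [0...1]:
  $s = \frac{|S_1 \cap S_2|}{\max(|S_1|, |S_2|)}$
- max: global alignment; replace max by min for local alignment
- example: {AAG, AGQ, GQQ} ∩ {GQQ, QQQ} = {GQQ}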
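
A minimal Python sketch of this similarity follows. It assumes that s is the number of shared n-grams divided by the larger (global) or smaller (local) of the two n-gram set sizes, which is how the slide reads. The names `ngram_set` and `ngram_similarity` and the `mode` parameter are hypothetical, chosen for this illustration only.

```python
def ngram_set(seq, n=4):
    """Set of distinct overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(seq1, seq2, n=4, mode="global"):
    """Shared n-grams divided by the larger ("global") or smaller
    ("local") n-gram set size. The result lies in [0, 1]."""
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    if not s1 or not s2:
        return 0.0  # a sequence shorter than n has no n-grams
    norm = max if mode == "global" else min
    return len(s1 & s2) / norm(len(s1), len(s2))


# The slide's 3-gram example: {AAG, AGQ, GQQ} and {GQQ, QQQ} share {GQQ}.
s_global = ngram_similarity("AAGQQ", "GQQQ", n=3)                # 1 shared / 3
s_local = ngram_similarity("AAGQQ", "GQQQ", n=3, mode="local")   # 1 shared / 2
```

Building the two sets and intersecting them is linear in the number of n-grams, which is where the speed advantage over quadratic classical alignment comes from.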

7 n-gram similarity
- fast: linear wrt. size of n-gram sets (classical alignment is quadratic wrt. sequence length)
- easy to interpret (0.5 = half of the n-grams are shared)
- no parameters (gap penalty, gap extension penalty, ...)
- can deal with shuffling of conserved segments and other "strange" cases (Are they actually strange?)
- better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment free can be as good as classical alignment for inference of phylogeny; Edgar 2004: MUSCLE, an n-gram based alignment method)

8 why 4 and not 42
- Hoehl 2008: n=
- correlation between n-gram sequence similarity and species divergence times
- standard deviation of sequence similarities
- maximum AUC when distinguishing related and randomly shuffled sequences
[figure annotations: MR, r=0.93; 4]

9 phylogenetic networks
- different node and edge types
- identification of reticulate events (e.g. recombination) is error prone
- computationally expensive
- larger networks become messy
[figures: T-Rex (Makarenkov et al. 2001); NeighborNet/SplitsTree (Bryant et al. 2004, Huson et al. 1998); Newick (Cardona et al. 2008)]

10 larger networks - example
[figures: Huson et al. 2005; Bryant et al. 2004]

11 graph = ridiculugram
- layout dependent
- distorted distances
- random initialization
- local minima
- slow
[figure: nuclear receptors (GR, MR, PR, AR), spring layout]

12 mosaic plot
- point size is similarity
- no distortions
- no random initialization
- preserve full information
- automatic clustering (spectral rearrangement)
- no hard decision about number of clusters

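13 spectral clustering
- $s_{ij}$: n-gram similarity between sequences i and j
- affinity matrix: $A_{ij} = \exp\left(-\frac{(1 - s_{ij})^2}{\sigma}\right)$, where $\sigma$ defines the neighborhood radius
- "degree" matrix: $D = \mathrm{diag}\left(\sum_i A_{ij}\right)$
- Laplacian matrix: $L = D - A$
- eigenvector decomposition: $L v = e v$ (e: eigenvalues, v: eigenvectors)
- $v_2$: eigenvector for the 2nd smallest eigenvalue (Fiedler vector); indicates clusters and how well they are separated

    from numpy import exp, diag
    from numpy.linalg import eigh
    # S: NumPy array of pairwise n-gram similarities, sig: neighborhood radius
    A = exp(-(1-S)**2/sig)      # affinity matrix
    D = diag(A.sum(axis=0))     # "degree" matrix
    L = D-A                     # Laplacian matrix
    e,v = eigh(L)               # eigenvalues (ascending) and eigenvectors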
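
The four NumPy lines on this slide can be wrapped into a runnable sketch that also performs the rearrangement step: sorting the sequences by their Fiedler vector entry. The function name `fiedler_order`, the default choice of sigma (the mean of the distance matrix 1 - S, the upper bound suggested on slide 16) and the toy similarity matrix are assumptions for illustration, not taken from the mosaic source.

```python
import numpy as np


def fiedler_order(S, sigma=None):
    """Order items by the Fiedler vector of the graph Laplacian built
    from a symmetric similarity matrix S with entries in [0, 1]."""
    S = np.asarray(S, dtype=float)
    dist = 1.0 - S
    if sigma is None:
        # Assumption: use the mean distance as the neighborhood radius.
        sigma = max(dist.mean(), 1e-12)
    A = np.exp(-dist ** 2 / sigma)   # affinity matrix
    D = np.diag(A.sum(axis=0))       # "degree" matrix
    L = D - A                        # Laplacian matrix
    e, v = np.linalg.eigh(L)         # eigenvalues in ascending order
    fiedler = v[:, 1]                # eigenvector of 2nd smallest eigenvalue
    return np.argsort(fiedler), fiedler


# Toy data: two interleaved families (items 0, 2, 4 and items 1, 3, 5).
labels = np.array([0, 1, 0, 1, 0, 1])
S = np.where(labels[:, None] == labels[None, :], 0.9, 0.1)
np.fill_diagonal(S, 1.0)
order, fiedler = fiedler_order(S)
```

Reordering the rows and columns of S with `S[np.ix_(order, order)]` places each family in a contiguous block, which is the rearranged matrix a mosaic plot would display. The sign of the Fiedler vector is arbitrary, so either family may come first.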

14 spectral rearrangement

15 recursive spectral rearrangement

16 spectral clustering
- takes "global" properties into account
- fast and scales well
- no random initialization => single run
- global minimum => single, unique solution
- few parameters: L, σ (σ <= mean of distance matrix)
- "better" than k-means (works for non-spherical clusters) or single linkage hierarchical clustering (no chaining problem)
- clustering is NP-hard and spectral clustering is "just another approximation"
- recursive spectral clustering to improve cluster quality
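
The slides do not spell out how the recursion works. One plausible reading, sketched below purely as an assumption, is recursive bisection: split the items by the sign of the Fiedler vector, then rearrange each half in the same way until the groups are small. The names `fiedler_vector` and `recursive_order`, the `min_size` stopping rule and the single global sigma are all hypothetical choices.

```python
import numpy as np


def fiedler_vector(S, sigma):
    """Fiedler vector of the Laplacian built from similarity matrix S."""
    A = np.exp(-(1.0 - S) ** 2 / sigma)
    L = np.diag(A.sum(axis=0)) - A
    _, v = np.linalg.eigh(L)
    return v[:, 1]


def recursive_order(S, idx=None, sigma=None, min_size=3):
    """Return an ordering of the items of S obtained by recursively
    bisecting them at the sign change of the Fiedler vector."""
    S = np.asarray(S, dtype=float)
    if idx is None:
        idx = np.arange(S.shape[0])
    if sigma is None:
        # Assumption: one global sigma, the mean of the distance matrix.
        sigma = max((1.0 - S).mean(), 1e-12)
    if len(idx) < min_size:
        return idx.tolist()          # too small to split further
    f = fiedler_vector(S[np.ix_(idx, idx)], sigma)
    left, right = idx[f < 0], idx[f >= 0]
    if len(left) == 0 or len(right) == 0:
        return idx[np.argsort(f)].tolist()   # no split found, just sort
    return (recursive_order(S, left, sigma, min_size)
            + recursive_order(S, right, sigma, min_size))


# Toy data: two interleaved families (items 0, 2, 4 and items 1, 3, 5).
labels = np.array([0, 1, 0, 1, 0, 1])
S = np.where(labels[:, None] == labels[None, :], 0.9, 0.1)
np.fill_diagonal(S, 1.0)
order = recursive_order(S)
```

Each recursion level only reorders items inside a block found at the level above, so the coarse grouping is preserved while sub-families within a block get a chance to separate.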

17 mosaic - demo

18 the end
- fast technique to visualize/analyze reticulate protein family evolution
- matrix representation
- spectral clustering
- n-gram similarity
- many other applications
- Perl
- free!

19 questions ??

20 SCOP: five families randomly selected

21 Nuclear receptors
[figure panels: Ligand binding domain; N-terminal section; Zinc-finger domain]

22 mosaic - examples

23 Full length sequence: GR MR PR AR MrBayes v generations, 4 chains 240 CPU-hrs

24 Zinc finger domain AR GR MR PR MrBayes v generations, 4 chains 9 CPU-hrs

25 Ligand-binding domain PR AR MR GR MrBayes v generations, 4 chains 27 CPU-hrs

26 Upstream region ? MrBayes v generations, 4 chains 87 CPU-hrs

27 quality q
- given two sequences and their n-gram dot plot
- diag = set of dot sums along diagonals
- n = length of sequence
- max: global alignment; min: local alignment
- q in [0...1]

28 q over s

29 q-spectrum

30 n-gram dot plots