Teresa Przytycka NIH / NLM / NCBI

Slides:



Advertisements
Similar presentations
Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

Variational Methods for Graphical Models Micheal I. Jordan Zoubin Ghahramani Tommi S. Jaakkola Lawrence K. Saul Presented by: Afsaneh Shirazi.
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Gene duplication models and reconstruction of gene regulatory network evolution from network structure Juris Viksna, David Gilbert Riga, IMCS,
PHYLOGENETIC TREES Bulent Moller CSE March 2004.
Bayesian Networks, Winter Yoav Haimovitch & Ariel Raviv 1.
. Phylogenetic Trees (2) Lecture 13 Based on: Durbin et al 7.4, Gusfield , Setubal&Meidanis 6.1.
. Class 9: Phylogenetic Trees. The Tree of Life Evolution u Many theories of evolution u Basic idea: l speciation events lead to creation of different.
Best-First Search: Agendas
Orthology, paralogy and GO annotation Paul D. Thomas SRI International.
Predicting domain-domain interactions using a parsimony approach Katia Guimaraes, Ph.D. NCBI / NLM / NIH.
From Variable Elimination to Junction Trees
CSC5160 Topics in Algorithms Tutorial 2 Introduction to NP-Complete Problems Feb Jerry Le
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
. Computational Genomics 5a Distance Based Trees Reconstruction (cont.) Modified by Benny Chor, from slides by Shlomo Moran and Ydo Wexler (IIT)
HCS Clustering Algorithm
Towards uncovering dynamics of protein interaction networks Teresa Przytycka NIH / NLM / NCBI.
Haplotyping via Perfect Phylogeny Conceptual Framework and Efficient (almost linear-time) Solutions Dan Gusfield U.C. Davis RECOMB 02, April 2002.
Branch lengths Branch lengths (3 characters): A C A A C C A A C A C C Sum of branch lengths = total number of changes.
Network topology and evolution of hard to gain and hard to loose attributes Teresa Przytycka NIH / NLM / NCBI.
. Class 9: Phylogenetic Trees. The Tree of Life D’après Ernst Haeckel, 1891.
. Comput. Genomics, Lecture 5b Character Based Methods for Reconstructing Phylogenetic Trees: Maximum Parsimony Based on presentations by Dan Geiger, Shlomo.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Copyright N. Friedman, M. Ninio. I. Pe’er, and T. Pupko. 2001RECOMB, April 2001 Structural EM for Phylogentic Inference Nir Friedman Computer Science &
Evolution of Multidomain Proteins CS 374 – Lecture 10 Wissam Kazan.
Protein Classification A comparison of function inference techniques.
Phylogenetic trees Sushmita Roy BMI/CS 576
Fixed Parameter Complexity Algorithms and Networks.
Parsimony and searching tree-space Phylogenetics Workhop, August 2006 Barbara Holland.
Introduction to Bioinformatics Biological Networks Department of Computing Imperial College London March 18, 2010 Lecture hour 18 Nataša Pržulj
Phylogeny Ch. 7 & 8.
Today Graphical Models Representing conditional dependence graphically
Distance-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
1 Variable Elimination Graphical Models – Carlos Guestrin Carnegie Mellon University October 15 th, 2008 Readings: K&F: 8.1, 8.2, 8.3,
by d. gusfield v. bansal v. bafna y. song presented by vikas taliwal
The NP class. NP-completeness
Lecture 7: Constrained Conditional Models
Prof. Sin-Min Lee Department of Computer Science
Gonçalo Abecasis and Janis Wigginton University of Michigan, Ann Arbor
Distance based phylogenetics
Inference in Bayesian Networks
Conservation of Combinatorial Structures in Evolution Scenarios
Character-Based Phylogeny Reconstruction
Data Mining Lecture 11.
Greedy Algorithms / Interval Scheduling Yin Tat Lee
Multiple Alignment and Phylogenetic Trees
NP-Completeness Yin Tat Lee
Minimal Spanning Trees
Recitation 5 2/4/09 ML in Phylogeny
Bart M. P. Jansen June 3rd 2016, Algorithms for Optimization Problems
Stat 217 – Day 28 Review Stat 217.
The Tree of Life From Ernst Haeckel, 1891.
Algorithms (2IL15) – Lecture 2
How to Really Review Papers
Why Social Graphs Are Different Communities Finding Triangles
MATS Quantitative Methods Dr Huw Owens
Instructors: Fei Fang (This Lecture) and Dave Touretzky
CAP 5636 – Advanced Artificial Intelligence
CS 581 Tandy Warnow.
Gene Family Ancestral State Phylogenetic Profiling
CS 188: Artificial Intelligence
NP-Completeness Yin Tat Lee
Phylogeny.
CS 394C: Computational Biology Algorithms
Algorithms for Inferring the Tree of Life
Variable Elimination Graphical Models – Carlos Guestrin
Clustering.
Planting trees in random graphs (and finding them back)
GhostLink: Latent Network Inference for Influence-aware Recommendation
Presentation transcript:

Teresa Przytycka NIH / NLM / NCBI Network topology and evolution of hard to gain and hard to loose attributes Teresa Przytycka NIH / NLM / NCBI

Modeling network evolution Propose a model Generate a network using the model NO Wrong model! Does the measure agree ? Choose a measure of network topology YES “consistent” model (for the measure) Real network DIMACS, May , 2005

Network Measurements Degree distribution Diameter Clustering coefficient Distribution of connected / bi-connected components Distribution of networks motifs DIMACS, May , 2005

Choosing network measurement Propose a model Generate a network using the model Specific Informative ? Does the measure agree ? Choose measure/property of network topology YES “consistent” model (for the measure) Real network DIMACS, May , 2005

Degree distribution is not specific T. Przytycka, Yi-Kou Yu, short paper ISMB 2004 J. Comp. Biol. and Chem. 2004 Both models agree not only with other on a large interval but also with the data (data points not shown). The real data is in the interval (1,80) DIMACS, May , 2005

Network motifs . . . Fixed size motif Variable size motifs X X hole fan X hole DIMACS, May , 2005

Working with small fixed size motifs Enumerate all small size motifs in a network and observe which are under and over represented and try to understand the reasons. [Alon, Kashtan, Milo, Pinter, ….] Machine learning approach - use small size network motif to train a machine learning program to recognize networks generated by a given model. [Middendorf, Ziv, Wiggins, PANS 2005] DIMACS, May , 2005

Variable size motifs Have to focus on particular families (cannot handle all possibilities) Even within one family motifs can be enumerated efficiently (eg. fans) some are known to be hard – holes, cliques Which motives are may be interesting to examine? THIS TALK: holes and their role in identifying networks that corresponds to hard to gain and hard to loose characters. DIMACS, May , 2005

Character Overlap Graph Given: set of biological units, each described by a set of characters the units evolve by loosing and gaining characters Examples biological units characters multidomain proteins domains genes introns genomes continued DIMACS, May , 2005

Character overlap graph Characters = nodes Two nodes are connected by an edge if there is a unit which contains both characters Domain (overlap) graph T L IG FN3 SH3 SH2 FR -Wuchty 2001 Apic, Huber, Teichmann, 2oo3 -Przytycka, Davis, Song, Durand RECOM 2005 DIMACS, May , 2005

Characters hard to gain and Dollo parsimony Maximum Parsimony: Build a tree with taxa in the leaves and where internal nodes are labeled with inferred character state such that the total number of character insertions and deletion along edges is minimized. Dollo parsimony – only one insertion per character is allowed DIMACS, May , 2005

Example: + + - - # character changes = 9 DIMACS, May , 2005

If it is known that characters are hard to gain then use of Dollo parsimony is justified but if “hard to get” assumption is incorrect Dollo tree can still be constructed. Is there a topological signal that would indicate that the assumption is wrong? DIMACS, May , 2005

Conservative Dollo parsimony Przytycka, Davis, Song, Durand RECOM 2005 Every pair of domains that is inferred to belong to the same ancestral architecture (internal node) is observed in some existing protein (leaf) Motivation: Domains typically correspond to functional unit and multidomain proteins bring these units together for greater efficiency X DIMACS, May , 2005

Przytycka, Davis, Song, Durand RECOM 2005 Theorem: There exits a conservative Dollo parsimony if and only if character overlap graphs does not contain holes. (A graph without holes is also called chordal or triangulated) Przytycka, Davis, Song, Durand RECOM 2005 Comment 1: There exist a fast algorithm for testing chordality (Tarjan, Yannakakis, 1984) Comment 2: Computing Dollo tree is NP-complete (Day, Johnson, Sankoff, 186) DIMACS, May , 2005

Holes and intron overlap graphs Intron data: 684 KOGs (groups of orthologous genes form 8 genomes) [Rogozin, Wolf, Sorokin, Mirkin, Koonin, 2003] Only two graphs had holes. Possible explanations: Most of the graphs are small and it is just by chance? Something else? DIMACS, May , 2005

Assume that characters are hard to gain, if additionally they are hard to loose, what would the character overlap graph be like? DIMACS, May , 2005

Assume parsimony model where each character is gained once and lost at most once. Theorem: If each character gained once and is lost at most once than the character overlap graph is chordal (that is has no holes). Note no “if and only if” DIMACS, May , 2005

Informal justification B C D AB BC CD DA abcd abd bcd a) b) c) DIMACS, May , 2005

Applying the theorem to intron overlap graph and domains overlap graph DIMACS, May , 2005

Before we go further … We have a hammer that is too big for the KOG intron data. Why There are only 8 genomes – NP completeness of computing optimal Dollo tree is not a an issue here We know the taxonomy tree for these 8 species thus all one needs to do is to find the optimal (Dollo) labeling of such three which is computationally easy problem. Such fixed taxonomy tree analysis has been already done (by the group whose data we are using) But we hope to gain an insight into a general principles not particular application DIMACS, May , 2005

Intron data Concatenated intron data has been used (total 7236 introns; 1790 after removing singletons) But now the question if the graph is chordal has an obvious no answer (we seen already two examples while analyzing KOGs separately) Counting all holes is not an option We count holes of size four (squares) DIMACS, May , 2005

Przytycka, Davis, Song, Durand RECOM 2005 Domain Data Przytycka, Davis, Song, Durand RECOM 2005 Superfamily = all multidomain architectures that contain a fixed domain. We extracted 1140 of such superfamilies and constructed domain overlap graph for superfamily separately (considered graphs that have at least 4 nodes only). DIMACS, May , 2005

Null model Have the same number of biological units as the real model Each unit from the null model corresponds to one real unit and has the same number of characters as the real unit but randomly selected. DIMACS, May , 2005

Results Type of character overlap graph Number of squares in real data Number of squares in null model domains 251 55,983 introns 145,555 84,258 DIMACS, May , 2005

Where are these intron holes ? Multiple independent (?) deletions more than in null model C.elegans Arabidop. Plasmod. Drosoph. S.c. S.p. Human Anopheles DIMACS, May , 2005

Compare to the following picture DIMACS, May , 2005

Summary & Conclusions We identified network motifs that are very informative in establishing whether characters are hard to gain and hard to loose. We identified them following a graph theoretical reasoning rather than discovering difference between real and null model and then proposing an explanation. DIMACS, May , 2005

Acknowledgments Conservative Dollo tree theorem is part of a joint work on evolution of multidomain architecture done in collaboration with Dannie Durand and her lab (RECOMB 2005) Lab members Raja Jothi Elena Zotenko And NCBI journal club discussion group DIMACS, May , 2005