Gene and Protein Networks II Monday, April CSCI 4830: Algorithms for Molecular Biology Debra Goldberg
Outline 1. Recap 2. Confidence assessment, edge prediction (cont’d) 3. Predicting protein function 4. Predicting protein complexes/functional groups 5. Network integration 6. Caveats, cautions, practical issues
Summary of network models
–Random: not grown, low clustering, short distances, Poisson degree distribution
–Regular (lattice): high clustering, long distances
–Small world: high clustering, short distances
–Scale-free: power law degree distribution
–Hierarchical: high clustering, modular, power law degree distribution
There is information in a gene’s position in the network We can use this to predict Relationships –Interactions –Regulatory relationships Protein function –Process –Complex / “molecular machine”
Confidence assessment Can use topology to assess confidence if true edges and false edges have different network properties Assess how well each edge fits topology of true network Can also predict unknown relations
Prediction A v-w edge would have a high clustering coefficient [Figure: vertices v and w]
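A minimal sketch of how a candidate v-w edge could be scored by the overlap of the two neighborhoods. The slide does not give a formula, so the definition used here (shared neighbors divided by the smaller neighborhood size) is an assumption chosen for illustration, and the function name `edge_clustering` and the toy graph are hypothetical.

```python
def edge_clustering(adj, v, w):
    """Share of the smaller neighborhood that v and w have in common.

    This is one simple definition picked for illustration (an assumption),
    not necessarily the measure used in the lecture.
    """
    nv = adj[v] - {w}
    nw = adj[w] - {v}
    smaller = min(len(nv), len(nw))
    if smaller == 0:
        return 0.0
    return len(nv & nw) / smaller


# Toy undirected network stored as adjacency sets (hypothetical data).
adj = {
    "v": {"a", "b", "c"},
    "w": {"a", "b", "d"},
    "a": {"v", "w"},
    "b": {"v", "w"},
    "c": {"v"},
    "d": {"w"},
    "x": {"y"},
    "y": {"x"},
}

score_vw = edge_clustering(adj, "v", "w")  # v and w share a and b
score_vx = edge_clustering(adj, "v", "x")  # v and x share nothing
```

A pair such as v and w, with many shared neighbors, scores high and is a candidate for a predicted edge; a pair with no shared neighbors scores zero.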
2. Confidence assessment, edge prediction (cont’d)
Interaction generality Confidence measure for an edge based on the topology around its neighbors. Saito, Suzuki, and Hayashizaki 2002, 2003
Confidence assessment Integrate experimental details with local topology –Degree –Clustering coefficient –Degree of neighbors –Etc. Used logistic regression Bader, et al., Nature Biotechnology 2003
The synthetic lethal network has many triangles Xiaofeng Xin, Boone Lab
2-hop predictors for SSL
–SSL – SSL (S-S)
–Homology – SSL (H-S)
–Co-expressed – SSL (X-S)
–Physical interaction – SSL (P-S)
–2 physical interactions (P-P)
[Figure: vertices v and w]
S: Synthetic sickness or lethality (SSL) H: Sequence homology X: Correlated expression P: Stable physical interaction
Wong, et al., PNAS 2004
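The 2-hop predictors can be read as counts of intermediate genes u that connect v and w through two edges of given types. The sketch below counts such paths in a typed edge list; the function names, the data layout, and the toy edges are assumptions for illustration, not the procedure of Wong, et al.

```python
def build_typed_adjacency(edges):
    """Map each node to {neighbor: set of edge types}, treating edges as undirected."""
    adj = {}
    for a, b, kind in edges:
        adj.setdefault(a, {}).setdefault(b, set()).add(kind)
        adj.setdefault(b, {}).setdefault(a, set()).add(kind)
    return adj


def two_hop_count(adj, v, w, kind1, kind2):
    """Count intermediates u with a kind1 edge on one side and a kind2 edge on the other."""
    nv, nw = adj.get(v, {}), adj.get(w, {})
    count = 0
    for u in nv.keys() & nw.keys():
        if u in (v, w):
            continue
        first, second = nv[u], nw[u]
        if (kind1 in first and kind2 in second) or (kind2 in first and kind1 in second):
            count += 1
    return count


# Hypothetical typed edges, using the slide's letter codes S, H, X, P.
edges = [
    ("v", "a", "S"), ("a", "w", "S"),   # an S-S path through a
    ("v", "b", "P"), ("b", "w", "S"),   # a P-S path through b
    ("v", "c", "H"), ("c", "w", "X"),   # an H-X path, not one of the predictors
]
adj = build_typed_adjacency(edges)

ss_paths = two_hop_count(adj, "v", "w", "S", "S")
ps_paths = two_hop_count(adj, "v", "w", "P", "S")
hs_paths = two_hop_count(adj, "v", "w", "H", "S")
```

Counts like these could then serve as features for scoring whether v and w are likely to have an SSL relationship.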
Multi-color motifs S: Synthetic sickness or lethality H: Sequence homology X: Correlated expression P: Stable physical interaction R: Transcriptional regulation Zhang, et al., Journal of Biology 2005
3. Predicting protein function
Computationally predicting protein function Homology Machine Learning Graph-theoretic methods
Majority method Consider immediate neighbors “Guilt by association” –Schwikowski, et al., Nature Biotechnology 2001
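A minimal sketch of the "guilt by association" idea: tally the annotations of a protein's immediate neighbors and report the most frequent ones. The function name, the `top` parameter, and the toy data are illustrative assumptions.

```python
from collections import Counter


def majority_functions(adj, annotations, protein, top=3):
    """Return the most frequent annotations among the immediate neighbors."""
    counts = Counter()
    for neighbor in adj.get(protein, ()):
        counts.update(annotations.get(neighbor, ()))
    return counts.most_common(top)


# Hypothetical network: q is unannotated and has three annotated neighbors.
adj = {"q": ["a", "b", "c"], "a": ["q"], "b": ["q"], "c": ["q"]}
annotations = {
    "a": ["transport"],
    "b": ["transport", "signaling"],
    "c": ["transport"],
}

predicted = majority_functions(adj, annotations, "q")
```

Here every neighbor of q carries "transport", so that annotation ranks first with a count of 3.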
Neighborhood method How does frequency affect assignment? Consider a given radius –Hishigaki, et al., Yeast 2001
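One way to account for how common a function is overall is to compare the observed count within a given radius to the count expected from the background frequency. The sketch below uses a chi-square style score, (observed - expected)^2 / expected; treating this as the scoring rule is an assumption, and the function names and toy data are hypothetical.

```python
from collections import deque


def within_radius(adj, start, radius):
    """Breadth-first search returning all nodes at distance 1..radius from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return [n for n in dist if n != start]


def chi_square_scores(adj, annotations, protein, radius):
    """Score each function by (observed - expected)^2 / expected in the neighborhood."""
    annotated = [p for p in annotations if annotations[p]]
    total = len(annotated)
    freq = {}
    for p in annotated:
        for f in set(annotations[p]):
            freq[f] = freq.get(f, 0) + 1
    nearby = [n for n in within_radius(adj, protein, radius) if annotations.get(n)]
    scores = {}
    for f, count in freq.items():
        expected = len(nearby) * count / total
        if expected == 0:
            continue  # no annotated proteins nearby, nothing to score
        observed = sum(1 for n in nearby if f in annotations[n])
        scores[f] = (observed - expected) ** 2 / expected
    return scores


# Hypothetical chain q-a-b-c-d-e-g; q is the unannotated query protein.
adj = {
    "q": ["a"], "a": ["q", "b"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d", "g"], "g": ["e"],
}
annotations = {
    "a": ["kinase"], "b": ["kinase"],
    "c": ["transport"], "d": ["transport"], "e": ["transport"], "g": ["transport"],
}

scores = chi_square_scores(adj, annotations, "q", radius=2)
```

In this toy case "kinase" is rare overall but fills the whole radius-2 neighborhood, so it outscores the more common "transport". Note that this statistic also grows when a function is depleted, so a practical version would likely check that observed exceeds expected.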
Minimum Cut methods Minimize interactions between proteins with different annotations –Vazquez, et al., Nature Biotech –Karaoz, et al., PNAS 2004
Functional flow Use network flow algorithm to “transport” function annotation –Nabieva, et al., Bioinformatics 2005
A Markov Random Field method Function prediction based on – Frequency of each function – # neighbors – # of these neighbors with function in question Functional linkage graph Iterate twice – Letovsky and Kasif, Bioinformatics 2003
4. Predicting protein complexes/functional groups
Community structure Proteins in a community may be involved in a common process or function Communities are dense subgraphs with sparse interconnections
Hierarchical clustering (1) Using natural edge weights Gene co-expression e.g., Eisen MB, et al., PNAS 1998
Hierarchical clustering (2) Adjacency vector Function cluster: Tong et al., Science 2004 Find drug targets: Parsons et al., Nature Biotechnology 2004
Topological overlap A measure of neighborhood similarity l_{i,j} is 1 if there is a direct link between i and j, 0 otherwise Ravasz, et al., Science 2002
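The formula itself did not survive on this slide. The sketch below assumes the commonly cited form of the measure: the number of neighbors shared by i and j, plus l_{i,j}, divided by the smaller of the two degrees. The function name and the toy graph are hypothetical.

```python
def topological_overlap(adj, i, j):
    """(shared neighbors + l_ij) / min(degree of i, degree of j).

    Assumed form of the overlap measure; l_ij is 1 for a direct link, else 0.
    """
    l_ij = 1 if j in adj[i] else 0
    shared = len((adj[i] & adj[j]) - {i, j})
    smaller_degree = min(len(adj[i]), len(adj[j]))
    if smaller_degree == 0:
        return 0.0
    return (shared + l_ij) / smaller_degree


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}

overlap_ab = topological_overlap(adj, "a", "b")  # same triangle
overlap_cd = topological_overlap(adj, "c", "d")  # bridge between triangles
overlap_ae = topological_overlap(adj, "a", "e")  # different triangles, no link
```

Pairs inside the same tight module score near 1, while the bridge pair c-d scores lower, which is why the overlap matrix is a useful input to hierarchical clustering.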
Spectral clustering Compute adjacency matrix eigenvectors Each eigenvector defines a cluster: –Proteins with high magnitude contributions Bu, et al., Nucleic Acids Research 2003 [Figure panels: positive eigenvalue, negative eigenvalue]
Dense subgraphs
Spirin and Mirny, PNAS 2003
–Find fully connected subgraphs (cliques), OR
–Find subgraphs that maximize density: 2m / (n(n-1)), for a subgraph with n vertices and m edges
Bader and Hogue, BMC Bioinformatics 2003
–Weight vertices by neighborhood density, connectedness
–Find connected communities with high weights
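The density 2m / (n(n-1)) is the fraction of possible edges present among n vertices, so a clique has density 1. A small sketch of computing it for a chosen vertex set follows; the function name and toy graph are hypothetical, and the search for the subgraph that maximizes density is not shown.

```python
def subgraph_density(adj, nodes):
    """Density 2m / (n(n-1)) of the subgraph induced by the given nodes."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Each internal edge is seen from both endpoints, so halve the total.
    m = sum(len(adj[u] & nodes) for u in nodes) // 2
    return 2 * m / (n * (n - 1))


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}

triangle_density = subgraph_density(adj, ["a", "b", "c"])      # 3 of 3 possible edges
extended_density = subgraph_density(adj, ["a", "b", "c", "d"])  # 4 of 6 possible edges
```

Adding d to the triangle lowers the density from 1 to 2/3, which illustrates why a density-maximizing search tends to stop at module boundaries.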
“Betweenness” centrality Consider the shortest path(s) between all pairs of nodes “Betweenness” centrality of an edge is a measure of how many shortest paths traverse this edge Edges between communities have higher centrality Girvan, et al., PNAS 2002
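A sketch of edge betweenness for an unweighted, undirected network, using breadth-first search from every node and the standard back-accumulation of shortest-path shares. It is written for clarity on a toy graph, not as the implementation from the cited paper; the function name and data are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths crossing each edge, splitting credit among ties."""
    scores = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist = {source: 0}
        paths = {source: 1.0}   # count of shortest paths from source
        preds = {source: []}    # predecessors on shortest paths
        order = []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    paths[nb] = 0.0
                    preds[nb] = []
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    paths[nb] += paths[node]
                    preds[nb].append(node)
        # Walk back from the farthest nodes, passing path shares to predecessors.
        credit = {node: 0.0 for node in order}
        for node in reversed(order):
            for p in preds[node]:
                share = paths[p] / paths[node] * (1.0 + credit[node])
                scores[frozenset((p, node))] += share
                credit[p] += share
    # Every unordered pair was counted from both endpoints.
    return {edge: value / 2.0 for edge, value in scores.items()}


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}

scores = edge_betweenness(adj)
bridge = max(scores, key=scores.get)
```

The bridge c-d carries all 9 shortest paths between the two triangles, so it has the highest score; removing the top-scoring edge and repeating is the idea behind splitting a network into communities this way.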
Finding motifs
Motif function and aggregation
5. Network integration
Relationships between network data types Distinct data sources generally lead to better inferences. Associations not independent Errors independent
Various methods with varying goals
Incorporating experimental conditions Luscombe, et al., Nature 2004
Party and date hubs Protein interaction network Partition hubs by expression correlation of neighbors Han, et al., Nature 2004
Network connectivity Scale-free networks are: –Robust to random failures –Vulnerable to attacks on hubs Removing hubs quickly disconnects a network and reduces the size of the largest component Albert, et al., Nature 2000
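The effect of hub removal can be illustrated by measuring the largest connected component before and after deleting a node. The sketch below does this on a tiny hub-and-spoke network; the function name and data are hypothetical, and a toy graph this small only illustrates the idea rather than reproducing the cited result.

```python
def largest_component_size(adj, removed=()):
    """Size of the largest connected component after deleting the given nodes."""
    removed = set(removed)
    seen = set()
    best = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in removed and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


# Hub h linked to five nodes, plus one extra edge a-b.
adj = {
    "h": ["a", "b", "c", "d", "e"],
    "a": ["h", "b"], "b": ["h", "a"],
    "c": ["h"], "d": ["h"], "e": ["h"],
}

hub = max(adj, key=lambda n: len(adj[n]))          # highest-degree node
full_size = largest_component_size(adj)
after_leaf = largest_component_size(adj, ["e"])    # a "random failure" of a leaf
after_hub = largest_component_size(adj, [hub])     # a targeted attack on the hub
```

Losing a leaf shrinks the main component by one node, while losing the hub breaks it into fragments of at most two nodes.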
Removing date hubs shatters network into communities –Date hubs removed: many sub-networks –Party hubs removed: a single main component
Multiple species
Network alignment Across or within species Interaction network and genome sequence e.g., Ogata, et al., Nucleic Acids Research 2000
6. Caveats, cautions, practical issues
Bias: Protein abundance Abundant proteins are –More likely to be represented in some types of experiments –More likely to be essential Correlation between degree (hubs) and essentiality disappears or is reduced when corrected for protein abundance Bloom and Adami, BMC Evolutionary Biology 2003
Bias: Degree correlation Anti-correlation of degrees of interacting proteins disappears in unbiased data Coulomb, et al., Proceedings of the Royal Society B [Figure: average degree K against degree k, essential and non-essential]
Data quality and sparseness
No gold standard Insufficient highly-accurate data Gold-standards often used to train and validate Insufficient standardization of procedures
Significance
Final words Network analysis has become an essential tool for analyzing complex systems –There is still much biologists can learn from scientists in other disciplines –Network analysis is itself a new and evolving field