SEG5010 Presentation Zhou Lanjun.


1 SEG5010 Presentation Zhou Lanjun

2 A graph-based approach to systematically reconstruct human transcriptional regulatory modules
Xifeng Yan et al. ISMB 2007 SEG5010 Presentation 2019/2/24

3 Problem
Gene regulation includes the processes that cells and viruses use to turn the information in genes into gene products (Wikipedia). Common approach: derive coexpression clusters from a microarray dataset. Are coexpression clusters mined from multiple microarray datasets across diverse conditions more likely to form a transcriptional module?

4 Problem Mining frequent dense vertexset (FDVS)
The vertex set {d, e, f, g} is a frequent dense vertexset because more than 80% of its vertex pairs are connected in at least 2 of the 4 graphs (thick lines).
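The definition can be made concrete with a short check. The sketch below is only an illustration under an assumed reading of the definition: a vertex set counts as frequent dense when its induced subgraph has edge density at least `delta` in at least `min_support` of the graphs. The function names, the edge-set representation, and the toy graphs are hypothetical; they are not the graphs from the slide's figure.

```python
from itertools import combinations


def make_graph(edge_list):
    """Store an undirected graph as a set of frozenset vertex pairs."""
    return {frozenset(e) for e in edge_list}


def density(edges, vertices):
    """Fraction of vertex pairs in `vertices` that are joined by an edge."""
    pairs = [frozenset(p) for p in combinations(sorted(vertices), 2)]
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in edges) / len(pairs)


def is_frequent_dense(graphs, vertices, delta, min_support):
    """True if `vertices` has density >= delta in at least `min_support` graphs."""
    support = sum(1 for g in graphs if density(g, vertices) >= delta)
    return support >= min_support


# Toy data (assumed): {d, e, f, g} is a clique in g1 and misses one edge in g2.
g1 = make_graph([("d", "e"), ("d", "f"), ("d", "g"), ("e", "f"), ("e", "g"), ("f", "g")])
g2 = make_graph([("d", "e"), ("d", "f"), ("d", "g"), ("e", "f"), ("e", "g"), ("a", "b")])
g3 = make_graph([("a", "b"), ("d", "e")])
g4 = make_graph([("a", "c"), ("f", "g")])
graphs = [g1, g2, g3, g4]

result = is_frequent_dense(graphs, {"d", "e", "f", "g"}, delta=0.8, min_support=2)
```

In this toy input the set {d, e, f, g} passes the test, while {a, b, c, d} does not because it is sparse in every individual graph.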

5 Problem Why not directly use the summary graph?
One of the two dense subgraphs in the summary graph, {a, b, c, d}, is not dense in any original graph. Noise may become indistinguishable from true signal.

6 Problem Formulation

7 Problem Formulation

8 Mining frequent dense vertexsets
Important observation: given m graphs, a frequent dense vertexset with density δ and frequency θ must form a subgraph with density ≥ δθm in the summary graph. We can therefore start from the summary graph and mine its dense subgraphs first.
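The observation follows from a counting argument: if the vertex set has at least δ·s edges (s being its number of vertex pairs) in at least θm graphs, the summed weight over its pairs is at least δ·s·θm, so the average weight per pair is at least δθm. A minimal sketch, assuming that "density in the summary graph" means average edge weight per vertex pair; the helper names and toy graphs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def edge_set(edge_list):
    """Store an undirected graph as a set of frozenset vertex pairs."""
    return {frozenset(e) for e in edge_list}


def summary_graph(graphs):
    """Sum m unweighted graphs: weight = number of graphs containing the edge."""
    weights = Counter()
    for edges in graphs:
        weights.update(edges)
    return weights


def weighted_density(weights, vertices):
    """Average summary-graph weight over all vertex pairs in `vertices`."""
    pairs = [frozenset(p) for p in combinations(sorted(vertices), 2)]
    return sum(weights[p] for p in pairs) / len(pairs)


graphs = [
    edge_set([("d", "e"), ("d", "f"), ("e", "f")]),
    edge_set([("d", "e"), ("d", "f"), ("e", "f"), ("a", "b")]),
    edge_set([("a", "b"), ("d", "e")]),
    edge_set([("a", "c")]),
]
m = len(graphs)
delta, theta = 1.0, 0.5  # {d, e, f} is a full triangle in 2 of the 4 graphs

weights = summary_graph(graphs)
observed = weighted_density(weights, {"d", "e", "f"})
lower_bound = delta * theta * m
```

Here `observed` is 7/3 and `lower_bound` is 2.0, so the vertex set survives a density filter on the summary graph, as the observation requires.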

9 Mining frequent dense vertexsets cont’d

10 Mining frequent dense vertexsets cont’d
Benefits: significantly shrinks the search space; provides a good starting point for the refinement process. Defects: produces false patterns; fails to split large infrequent dense vertexsets; might break a true dense vertexset in half.

11 Mining frequent dense vertexsets cont’d
Discussion of noise tolerance. G': noise graph; G*: real graph; G: observed graph. The chance for a noise edge to have weight ≥ θm in a summary graph is:
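One way to evaluate such a probability, under an assumption that is not stated on the slide, is to treat a noise edge as appearing independently in each of the m graphs with probability q. Its summary-graph weight is then binomially distributed, and the chance of reaching weight ≥ θm is a binomial tail. The function name `noise_edge_tail` and the parameter values are illustrative.

```python
from math import ceil, comb


def noise_edge_tail(m, theta, q):
    """P(weight >= theta * m) when the weight is Binomial(m, q).

    Assumes independent noise with per-graph probability q (an assumption).
    """
    # Small epsilon guards against float artifacts such as 0.3 * 10 > 3.
    threshold = max(0, ceil(theta * m - 1e-9))
    return sum(
        comb(m, i) * q**i * (1 - q) ** (m - i) for i in range(threshold, m + 1)
    )


# Doubling q raises the tail probability by far more than a factor of two.
p_low = noise_edge_tail(m=10, theta=0.5, q=0.01)
p_high = noise_edge_tail(m=10, theta=0.5, q=0.02)
```

Comparing `p_low` and `p_high` shows how strongly the tail depends on the per-graph noise rate.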

12 Mining frequent dense vertexsets cont’d
The expected number of k-vertex dense subgraphs that could be formed by noise edges is expressed in terms of p = b(m, θ, q); s = k(k−1)/2, the number of vertex pairs; and P(k, l, d), the probability that a k-vertex, l-edge graph has minimum degree d (derived through simulation). The result is very sensitive to p.

13 Mining frequent dense vertexsets cont’d
Solutions: (1) divide the coexpression graphs into small groups (reducing m); (2) re-weight the summary graph to reduce the weights of noise edges.

14 Pipeline of NeMo

15 Partitioning
Group together graphs that are likely to contain at least one FDVS.

16 Re-weight edges in summary graph
Traditional summary graph: the weight w of an edge is the number of graphs, out of m, in which the two vertices are connected. Proposed method: a 'neighbor association' summary graph. Intuition: if two vertices share many small frequent dense subgraphs, they likely come from the same dense vertexset.

17 Re-weight edges in summary graph (cont’d)
Graphlets: there is a connection between the density of a graph and its k-graphlets.

18 Re-weight edges in summary graph (cont’d)
Let score(u,v) be the weight of edge (u,v) in a neighbor association graph. If u and v are in the same dense subgraph, score(u,v) should be close to 1. If u and v are not in the same dense subgraph, score(u,v) should be smaller. If u and v do not share any dense k-graphlet, score(u,v) should be set to 0.

19 Re-weight edges in summary graph (cont’d)
Given two vertices u and v in a large clique with n vertices, the maximum number of k-graphlets they share is $\binom{n-2}{k-2}$; after normalization, this value is given by Equation (4). When n >> k, (4) is close to 1.

20 Re-weight edges in summary graph (cont’d)
Let π_u be the set of frequent dense (k−1)-vertexlets that contain vertex u, and let π_{u,v} be the set of frequent dense k-vertexlets that contain both u and v. score(u,v) is defined from these two sets. Note that score(u,v) is in general not equal to score(v,u).
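One reading of this definition that is consistent with the surrounding slides, assumed here rather than quoted, is score(u,v) = |π_{u,v}| / |π_u|. It is 0 when u and v share no k-vertexlet, it is asymmetric because π_u and π_v differ, and in an n-clique it equals C(n−2, k−2) / C(n−1, k−2) = (n−k+1)/(n−1), which approaches 1 when n >> k. The sketch below treats every vertex subset of a clique as a frequent dense vertexlet, which is an illustrative simplification; all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def vertexlets(vertices, size):
    """All vertex subsets of the given size, as frozensets."""
    return [frozenset(c) for c in combinations(sorted(vertices), size)]


def score(u, v, k_lets, k_minus_1_lets):
    """Assumed form: |pi_{u,v}| / |pi_u| (0 if u is in no (k-1)-vertexlet)."""
    pi_u = [s for s in k_minus_1_lets if u in s]
    pi_uv = [s for s in k_lets if u in s and v in s]
    if not pi_u:
        return Fraction(0)
    return Fraction(len(pi_uv), len(pi_u))


# Clique case: every subset counts as a dense vertexlet.
n, k = 12, 4
clique = range(n)
clique_score = score(0, 1, vertexlets(clique, k), vertexlets(clique, k - 1))

# Asymmetry: "a" sits in more 2-vertexlets than "b" does.
k_lets = [frozenset("abc")]
k_minus_1_lets = [frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("ad")]
score_ab = score("a", "b", k_lets, k_minus_1_lets)
score_ba = score("b", "a", k_lets, k_minus_1_lets)
```

With n = 12 and k = 4 the clique score is 9/11, matching (n−k+1)/(n−1), and the second example gives score(a,b) = 1/3 but score(b,a) = 1/2.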

21 Re-weight edges in summary graph (cont’d)

22 Experiment Setup
105 human microarray datasets. An edge exists between two genes if their expression correlation is significant at a p-value below 0.01 (Zhou et al., 2002). In this study, only the top 2% most significant correlations with a p-value below 0.01 (a cutoff justified by Equation 2) are included in each graph.

23 Experimental results

24 Comparison with other approaches

25 Comparison with other approaches

26 Conclusions
NeMo is a novel graph-based algorithm that efficiently mines frequent dense vertexsets in a set of coexpression graphs. Its application to identifying frequent coexpression clusters across many microarray datasets was demonstrated. NeMo can also be applied to other biological relational graphs to find approximate network modules.

27 Conserved pathways within bacteria and yeast as revealed by global protein network alignment
Brian P. Kelley et al. PNAS 2003, Vol. 100, No. 20

28 Concepts
Pathway: a sequence of protein–protein interactions forming a connected path in the network.

29 Problem
Given that protein sequences and structures are conserved in and among species, are networks of protein interactions conserved as well? Is there some minimal set of interaction pathways required for all species? Can we measure evolutionary distance at the level of network connectivity rather than at the level of DNA or protein sequence? Mounting evidence suggests that conserved protein interaction pathways indeed exist and may be ubiquitous.

30 Method
PATHBLAST: an efficient computational procedure for aligning two protein interaction networks to identify their conserved interaction pathways.

31 Overview of The PATHBLAST Algorithm
The two networks are combined into a global alignment graph. Vertex: a pair of proteins (one from each network) having at least weak sequence similarity (BLAST E ≤ 10^-2). Edge: a conserved interaction.

32 BLAST
BLAST is one of the most widely used bioinformatics programs because it addresses a fundamental problem and its algorithm emphasizes speed over sensitivity. Examples of questions it can answer: Which bacterial species have a protein that is related in lineage to a certain protein with a known amino-acid sequence? Where does a certain sequence of DNA originate? What other genes encode proteins that exhibit structures or motifs such as ones that have just been determined?

33 The PATHBLAST Algorithm cont’d
The scoring function of a path P is defined in terms of: p(v), the probability of true homology within the protein pair represented by v; q(e), the probability that the protein-protein interactions represented by e are real; and p_random and q_random, the expected values of p(v) and q(e) over all vertices and edges in G.
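A log-odds form consistent with the quantities listed above, assumed here rather than quoted from the slide, sums log10(p(v)/p_random) over the path's vertices and log10(q(e)/q_random) over its edges. Under that form a path scores above 0 when its vertices and edges are more reliable than average. The function name and the numeric values are illustrative.

```python
from math import log10


def path_score(vertex_probs, edge_probs, p_random, q_random):
    """Log-odds score of a path (assumed form).

    vertex_probs: p(v) for each vertex on the path.
    edge_probs:   q(e) for each edge on the path.
    """
    vertex_term = sum(log10(p / p_random) for p in vertex_probs)
    edge_term = sum(log10(q / q_random) for q in edge_probs)
    return vertex_term + edge_term


# A 3-vertex, 2-edge path whose members are all better than background.
score = path_score([0.9, 0.8, 0.9], [0.5, 0.5], p_random=0.1, q_random=0.05)
```

Because the terms are additive over vertices and edges, the score of a path can be built up one step at a time, which is what makes a dynamic-programming search possible.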

34 The PATHBLAST Algorithm cont’d
p(v) is computed using Bayes' rule, $p(v) = p(H \mid E_v) = p(E_v \mid H)\,p(H) / p(E_v)$, where H is the event of true homology between the proteins represented by v; p(E_v) is the frequency of each E value over all v in G; p(E_v|H) is based on E values within the subset of vertices for which both proteins are in the same cluster of orthologous groups (COG); and p(H) is the overall frequency of vertices with proteins that are in the same COG.
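As a worked illustration of this Bayes' rule computation, the sketch below estimates each term as a frequency over vertices grouped into E-value bins. The binning, the function name, and all counts are hypothetical; only the three frequencies and Bayes' rule itself come from the slide.

```python
def homology_posterior(n_bin_same_cog, n_same_cog, n_bin_total, n_total):
    """p(H | E_v) = p(E_v | H) * p(H) / p(E_v), each term a frequency.

    n_bin_same_cog: same-COG vertices whose E value falls in this bin
    n_same_cog:     all same-COG vertices
    n_bin_total:    all vertices whose E value falls in this bin
    n_total:        all vertices in G
    """
    p_e_given_h = n_bin_same_cog / n_same_cog
    p_h = n_same_cog / n_total
    p_e = n_bin_total / n_total
    return p_e_given_h * p_h / p_e


# Made-up counts: 50 vertices in the bin, 40 of them in the same COG.
p_v = homology_posterior(n_bin_same_cog=40, n_same_cog=200, n_bin_total=50, n_total=1000)
```

With these counts p(v) comes out at 0.8, which is simply the share of vertices in the bin whose proteins belong to the same COG.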

35 The PATHBLAST Algorithm cont’d
The q(e) of each edge is computed from the underlying probabilities of the protein-protein interactions it represents. The paper estimates the probability of each interaction from the number of independent experimental studies reporting it, and then computes q(e) as the product of these probabilities.

36 The PATHBLAST Algorithm cont’d
Alignment procedure: identify the highest-scoring pathway alignment P* of fixed length L (L vertices and L−1 edges). If G is directed and acyclic, this can be accomplished in linear time using dynamic programming (DP). The base case is:
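A minimal sketch of such a dynamic program, under assumptions the transcript does not confirm: vertices and edges carry additive scores, the base case is a one-vertex path that scores that vertex alone, and each round extends the best paths by one edge. All names and the toy graph are hypothetical. Each of the L−1 rounds scans every edge once, so the running time is linear in the graph size for fixed L.

```python
def best_path_score(vertex_score, edge_score, length):
    """Highest score of a directed path with `length` vertices in a DAG.

    vertex_score: {v: score}; edge_score: {(u, v): score} for edge u -> v.
    Returns None if no path with that many vertices exists.
    """
    # Base case (assumed): a path of one vertex scores that vertex alone.
    best = dict(vertex_score)
    for _ in range(length - 1):
        extended = {}
        for (u, v), w in edge_score.items():
            if u in best:
                candidate = best[u] + w + vertex_score[v]
                if candidate > extended.get(v, float("-inf")):
                    extended[v] = candidate
        best = extended  # best[v]: top score of a path ending at v
    return max(best.values()) if best else None


vertex_score = {"a": 1.0, "b": 1.0, "c": 1.0}
edge_score = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 5.0}
top3 = best_path_score(vertex_score, edge_score, length=3)
```

The function returns only the score; storing a back pointer alongside each `extended[v]` entry would recover the path itself.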

37 The PATHBLAST Algorithm cont’d
Unfortunately, G is generally not acyclic. Instead, construct a sufficient number of directed acyclic subgraphs (5L!), then compute the highest-scoring paths for each.
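One common way to obtain a directed acyclic subgraph from an undirected graph, assumed here because the slide does not give the construction, is to draw a random ordering of the vertices and direct every edge from the lower-ranked endpoint to the higher-ranked one. A given path survives a single draw only if its vertices happen to appear in order, which is why many draws are needed. The function name and toy graph are hypothetical.

```python
import random


def random_acyclic_orientation(vertices, undirected_edges, rng):
    """Direct each edge from lower to higher rank in a random vertex order.

    Returns the directed edge list and the rank of each vertex. The result
    is acyclic because every edge strictly increases the rank.
    """
    order = list(vertices)
    rng.shuffle(order)
    rank = {v: i for i, v in enumerate(order)}
    directed = [
        (u, v) if rank[u] < rank[v] else (v, u) for u, v in undirected_edges
    ]
    return directed, rank


rng = random.Random(0)  # fixed seed so the example is reproducible
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
directed, rank = random_acyclic_orientation(vertices, edges, rng)
```

Repeating the draw and running the acyclic path search on each orientation, then keeping the best result overall, mirrors the repeated-subgraph strategy described on the slide.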

38 Thanks! Q&A

