Storytelling and Clustering for Cellular Signaling Pathways
M. Shahriar Hossain, Monika Akbar, Nicholas F. Polys
Department of Computer Science, Virginia Tech, Blacksburg, VA
2 Objective
- STKE Dataset: cell interactions through chemical signals
- Discover relationships between the pathways
- Graph structure: subgraph discovery problem
- Pathway relationships: clustering and storytelling
Myocyte Adrenergic Pathway (CMP_9043)
4 Dataset properties
5 Design Pipeline
[Pipeline diagram. Components: STKE Dataset, Preprocessor, Pathway Graphs, Frequent Subgraph Discovery, Frequent Subgraphs, Clustering, NN Storytelling]
6 Subsequent Candidate Generation
- Apriori: incremental approach [17]
- FSG [2]: generate a (k+1)-edge candidate subgraph by combining two k-edge subgraphs that share a common core subgraph of (k-1) edges.
- The cost of comparing subgraphs (and core subgraphs) is reduced by using the hash code of each subgraph object.
[Figure: subgraphs with nodes {m, n, o, l, p} and {m, n, o, p, q} combined into a candidate with nodes {l, m, n, o, p, q}]
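The join step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that node labels are unique, so a subgraph can be stored as a `frozenset` of `(source, target)` edges and two subgraphs are equal exactly when their edge sets are equal. The names `join_candidates` and `is_connected` are hypothetical. Hashing the `frozenset` plays the role of the per-subgraph hash code mentioned on the slide.

```python
from itertools import combinations

def is_connected(edges):
    """Return True if the edges form a single connected component (direction ignored)."""
    remaining = list(edges)
    if not remaining:
        return False
    seen = set(remaining.pop())
    changed = True
    while changed and remaining:
        changed = False
        for edge in list(remaining):
            if edge[0] in seen or edge[1] in seen:
                seen.update(edge)
                remaining.remove(edge)
                changed = True
    return not remaining

def join_candidates(k_subgraphs):
    """Join every pair of k-edge subgraphs that shares a (k-1)-edge core.

    Returns the set of (k+1)-edge candidates and the number of pairwise
    core comparisons that were needed to produce them.
    """
    candidates = set()
    comparisons = 0
    for a, b in combinations(k_subgraphs, 2):
        comparisons += 1
        core = a & b
        if len(core) == len(a) - 1:
            candidate = a | b
            if is_connected(candidate):
                candidates.add(candidate)  # duplicates collapse via hashing
    return candidates, comparisons

s1 = frozenset({("m", "n"), ("n", "o")})
s2 = frozenset({("n", "o"), ("o", "p")})
s3 = frozenset({("x", "y"), ("y", "z")})
candidates, comparisons = join_candidates([s1, s2, s3])
print(len(candidates), comparisons)
```

With 21 subgraphs at one level, each subgraph is checked against the other 20, so this pairwise loop performs 210 comparisons in total. That quadratic cost is what the extension-based approach later in the deck avoids.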
7 Subsequent Candidate Generation
Instance:
- Number of 5-edge subgraphs: 21
- Core subgraph comparisons for s1: 20
[Figure: s1 (nodes m, n, o, l, p) paired with the other 5-edge subgraphs; pairs without a common core are marked "Not generated"]
Master Pathway Graph (MPG)
Total unique nodes: 1205
Total relations: 1376
9 SEG - Subgraph Extension Generation
- Neighborhood extension; neighborhood list: {q, r, s}
- Comparison is not required.
- The subgraph is extended from physical evidence.
[Figure: the subgraph with nodes {l, m, n, o, p} is extended by q, by r, or by s, its neighbors in the graph over {l, m, n, o, p, q, r, s}]
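A minimal sketch of the neighborhood extension idea follows, assuming the Master Pathway Graph is available as a set of `(source, target)` edges with unique node labels. The helper names `build_node_index` and `extend_subgraph` are hypothetical, and the sketch tracks neighboring edges rather than the neighboring nodes listed on the slide. Every candidate is produced by adding one edge that actually touches the current subgraph in the MPG, so no pairwise comparison between subgraphs is needed.

```python
from collections import defaultdict

def build_node_index(mpg_edges):
    """Map each node to the set of MPG edges incident to it."""
    index = defaultdict(set)
    for u, v in mpg_edges:
        index[u].add((u, v))
        index[v].add((u, v))
    return index

def extend_subgraph(subgraph, node_index):
    """Return all (k+1)-edge extensions of a k-edge subgraph.

    `subgraph` is a frozenset of edges. An extension adds one MPG edge
    that shares a node with the subgraph and is not already part of it.
    """
    nodes = {n for edge in subgraph for n in edge}
    neighborhood = set()
    for n in nodes:
        neighborhood |= node_index[n]
    neighborhood -= subgraph
    return {subgraph | {edge} for edge in neighborhood}

# Toy MPG mirroring the slide: q, r and s are neighbors of the subgraph.
mpg = {("l", "m"), ("m", "n"), ("n", "o"), ("o", "p"),
       ("p", "q"), ("o", "r"), ("n", "s")}
index = build_node_index(mpg)
current = frozenset({("l", "m"), ("m", "n"), ("n", "o"), ("o", "p")})
for candidate in extend_subgraph(current, index):
    print(sorted(candidate - current))
```

The work per subgraph is proportional to the size of its neighborhood rather than to the number of other subgraphs at the same level.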
11 Subgraph Discovery
[Table: k, number of subgraphs generated, time (sec.); first row: k = 1, 1,376 subgraphs, time "Existing"; remaining values not recoverable]
min_sup = 2%
What is so novel about pruning edges?
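As a rough illustration of how a minimum support threshold is usually applied in frequent subgraph mining, the sketch below counts the pathways that contain every edge of a subgraph. It assumes that support is measured as the share of pathways containing the subgraph, which the slides do not state explicitly. The function names and the toy pathways are hypothetical.

```python
import math

def support(subgraph, pathways):
    """Number of pathways whose edge set contains every edge of the subgraph."""
    return sum(1 for edges in pathways.values() if subgraph <= edges)

def min_support_count(min_sup_percent, n_pathways):
    """Smallest number of pathways that satisfies a percentage threshold."""
    return math.ceil(min_sup_percent * n_pathways / 100)

def is_frequent(subgraph, pathways, min_sup_percent):
    """True if the subgraph meets the threshold over the given pathways."""
    needed = min_support_count(min_sup_percent, len(pathways))
    return support(subgraph, pathways) >= needed

# Toy data: three pathways given as edge sets (illustrative only).
pathways = {
    "A": {("m", "n"), ("n", "o")},
    "B": {("m", "n")},
    "C": {("x", "y")},
}
print(support(frozenset({("m", "n")}), pathways))
```

Under this reading, a candidate whose support falls below the threshold is pruned and is not extended further.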
12 ‘Importance Factor’ of a subgraph: sfipf
- sf: subgraph frequency
- ipf: inverse pathway frequency
- Computed for the i-th subgraph in the j-th pathway
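The formula itself is not reproduced here, so the sketch below is only an assumption: it weights subgraphs by analogy with the tf-idf scheme from text mining, multiplying the frequency of subgraph i in pathway j by log(N / pf_i), where N is the number of pathways and pf_i is the number of pathways that contain subgraph i. The authors' exact definition may differ. The function name `sfipf_matrix` is hypothetical.

```python
import math

def sfipf_matrix(sf, n_pathways):
    """Weight a subgraph-by-pathway frequency matrix, tf-idf style.

    sf[i][j] is the frequency of subgraph i in pathway j.
    ASSUMPTION: weight = sf[i][j] * log(n_pathways / pf_i), where pf_i is
    the number of pathways in which subgraph i occurs at least once.
    """
    weights = []
    for row in sf:
        pf = sum(1 for value in row if value > 0)
        ipf = math.log(n_pathways / pf) if pf else 0.0
        weights.append([value * ipf for value in row])
    return weights

# Subgraph 0 occurs in every pathway, subgraph 1 in only one of four.
sf = [
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
for row in sfipf_matrix(sf, n_pathways=4):
    print(row)
```

Under this assumed weighting, a subgraph that appears in every pathway gets weight zero, while a subgraph specific to a few pathways is emphasized. Each pathway's column can then serve as its feature vector for clustering.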
13 Dataset Properties (sfipf)
Number of edges in MPG = 1376
Total pathways = 50
16 Subgraph Discovery
[Table: k, number of subgraphs, time saved (%), attempts saved (%)]
Overall attempts saved = 89.52%
Overall time saved = 99.39%
18 Clustering
- Hierarchical Agglomerative Clustering (HAC)
- k-means
- Unsupervised measure of cluster validity: Average Silhouette Coefficient (ASC) [19]
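For reference, the average silhouette coefficient can be computed directly from its textbook definition: for each item, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b). The sketch below is a plain-Python illustration with a caller-supplied distance function. The function name and the toy data are hypothetical, and it assumes at least two clusters and that a and b are never both zero.

```python
def average_silhouette(points, labels, dist):
    """Average silhouette coefficient of a clustering.

    points: list of items, labels: cluster label per item,
    dist: function returning the distance between two items.
    Requires at least two clusters; singleton clusters contribute 0.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # silhouette of a singleton is taken as 0
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)

values = [0.0, 1.0, 10.0, 11.0]
distance = lambda x, y: abs(x - y)
print(average_silhouette(values, [0, 0, 1, 1], distance))  # well separated
print(average_silhouette(values, [0, 1, 0, 1], distance))  # poorly separated
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping clusters. This makes the measure usable for comparing HAC and k-means results, or for choosing the number of clusters, without ground-truth labels.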
22 Pathway Relations (StoryTelling)
- Bidirectional search
- Cover tree for nearest-neighbor (NN) queries
[Figure: the search expands from S through p1, p2, p3 and from T through p7, p8, p9]
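The idea of connecting a start pathway S to a target pathway T through a chain of similar intermediates can be sketched as a bidirectional breadth-first search over a nearest-neighbor graph. The sketch below is illustrative only. It assumes each pathway is described by a set of features (for example, the identifiers of the frequent subgraphs it contains), uses Jaccard distance as the dissimilarity, and replaces the cover tree with a brute-force neighbor scan to stay self-contained. All function names and the toy data are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two feature sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def knn_graph(features, k):
    """Undirected graph linking each item to its k nearest neighbors (brute force)."""
    graph = {name: set() for name in features}
    for name, feats in features.items():
        ranked = sorted(
            (other for other in features if other != name),
            key=lambda other: (jaccard_distance(feats, features[other]), other),
        )
        for other in ranked[:k]:
            graph[name].add(other)
            graph[other].add(name)
    return graph

def stitch(parents, meet):
    """Build the chain start..meet..target from the two parent maps."""
    chain, node = [], meet
    while node is not None:
        chain.append(node)
        node = parents[0][node]
    chain.reverse()
    node = parents[1][meet]
    while node is not None:
        chain.append(node)
        node = parents[1][node]
    return chain

def find_story(graph, start, target):
    """Alternate BFS layers from both ends; return a connecting chain or None."""
    if start == target:
        return [start]
    parents = ({start: None}, {target: None})
    frontiers = [[start], [target]]
    side = 0
    while frontiers[0] and frontiers[1]:
        next_frontier = []
        for node in frontiers[side]:
            for neighbor in sorted(graph[node]):
                if neighbor in parents[side]:
                    continue
                parents[side][neighbor] = node
                if neighbor in parents[1 - side]:
                    return stitch(parents, neighbor)
                next_frontier.append(neighbor)
        frontiers[side] = next_frontier
        side = 1 - side
    return None

features = {
    "S": {"a", "b", "c"}, "p1": {"b", "c", "d"}, "p2": {"c", "d", "e"},
    "p3": {"d", "e", "f"}, "T": {"e", "f", "g"},
}
print(find_story(knn_graph(features, k=2), "S", "T"))
```

Searching from both ends keeps the explored frontiers smaller than a one-sided search would. In a larger dataset, the brute-force neighbor scan is the part that a cover tree would speed up.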
Day-to-day life example
From: Roman Holiday
To: Terminator 3
[Images from Roman Holiday and from Terminator 3]
24 Examples in STKE
28 Future Directions
- Compare our SEG graph methods with text-based clustering and storytelling.
- Examine the costs and benefits of combining text mining and graph mining techniques.
29 References
[1] Science Signaling, The Signal Transduction Knowledge Environment (STKE), "The Database of Cell Signaling".
[2] Kuramochi, M., and Karypis, G., "An efficient algorithm for discovering frequent subgraphs", IEEE Transactions on KDE, Vol. 16(9), September 2004.
[3] Breslin, T., Krogh, M., Peterson, C., and Troein, C., "Signal transduction pathway profiling of individual tumor samples", BMC Bioinformatics, June 29.
[4] Kumar, D., Ramakrishnan, N., Helm, R. F., and Potts, M., "Algorithms for Storytelling", IEEE Transactions on KDE, Vol. 20(6), June 2008.
[5] Ratprasartporn, N., Cakmak, A., and Ozsoyoglu, G., "On Data and Visualization Models for Signaling Pathways", 18th SSDBM, 2006.
[6] Xu, X., and Yu, Y., "Modeling and Verifying WNT Signaling Pathway", 3rd Intl. Conf. on ICNC, 2007, Vol. 2.
[7] Schreiber, F., "Comparison of metabolic pathways using constraint graph drawing", 1st Asia-Pacific Bioinformatics Conf. on Bioinfo., Australia, Vol. 19, 2003.
[8] Abello, J., van Ham, F., and Krishnan, N., "ASK-GraphView: A Large Scale Graph Visualization System", IEEE Transactions on Visualization and Computer Graphics, Vol. 12(5), 2006.
[9] Miyake, S., Tohsato, A., Takenaka, Y., and Matsuda, H., "A clustering method for comparative analysis between genomes and pathways", 8th Intl. Conf. on Database Systems for Advanced Applications, March 2003.
30 References
[10] Yan, X., and Han, J., "gSpan: graph-based substructure pattern mining", IEEE ICDM, 2002.
[11] Cohen, M., and Gudes, E., "Diagonally Subgraphs Pattern Mining", 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2004.
[12] Ketkar, N., Holder, L., Cook, D., Shah, R., and Coble, J., "Subdue: Compression-based Frequent Pattern Discovery in Graph Data", ACM KDD Workshop on Open-Source Data Mining, August 2005.
[13] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD Intl. Conf. on Management of Data, Canada, 1996.
[14] Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S., "Constrained K-means Clustering with Background Knowledge", ICML 2001.
[15] Lin, F., and Hsueh, C. M., "Knowledge map creation and maintenance for virtual communities of practice", Intl. Journal of Information Processing and Management, ACM, Vol. 42(2), 2006.
[16] Beygelzimer, A., Kakade, S., and Langford, J., "Cover trees for nearest neighbor", ICML 2006.
[17] Agrawal, R., and Srikant, R., "Fast Algorithms for Mining Association Rules", Intl. Conf. on Very Large Data Bases, Santiago, Chile, September 1994.
[18] Agrawal, R., Mehta, M., Shafer, J., Srikant, R., Arning, A., and Bollinger, T., "The Quest Data Mining System", KDD'96, USA, 1996.
[19] Tan, P. N., Steinbach, M., and Kumar, V., "Introduction to Data Mining", Addison-Wesley, April 2005.
31 Thank You