Published by Shawn Porter. Modified over 9 years ago.
1
Graph and Topological Structure Mining on Scientific Articles
Fan Wang, Ruoming Jin, Gagan Agrawal and Helen Piontkivska
The Ohio State University, Kent State University
Presenter: Fan Wang, The Ohio State University
2
Outline
Introduction
Topological Structure Mining
Data Preprocessing and Graph Representations
Experiment Results and Pattern Analysis
Conclusion
3
Introduction
Huge number of genes in the literature
Associated with a targeted disease or functionality
Finding interactions among genes manually
  – Time consuming
  – Error prone
4
Introduction
Well-known relationships among chemokine ligands
Mining these relations from literature documents
Mining frequent patterns from graph datasets
  – Convenient representation
  – Lots of research in subgraph mining
5
Introduction
Our Goal
  – Find commonly occurring interactions
  – Represent them visually
Capture the co-occurrence of scientific terms
Graph representation of scientific documents
Mining frequent topological structures
6
Outline
Introduction
Topological Structure Mining
Data Preprocessing and Graph Representations
Experiment Results and Pattern Analysis
Conclusion
7
Topological Structure Mining
Disadvantages of subgraph mining
  – Exact matching
  – Missing potential patterns
Focusing on the topological relationship
Incorporating approximate matching
8
Topological Structure Mining
[Figure: three example graphs labeled Y, G and X]
G is a subgraph of Y
X is a (0,3) topological structure of Y
9
Topological Structure Mining
Definition
  – Given a collection of graphs, two parameters l and h, and a threshold θ, an (l,h)-topological structure whose support is greater than or equal to θ is called a frequent topological structure.
TSMiner, the algorithm implemented in our KDD05 paper, finds frequent topological structures in a given set of graphs
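The definition above can be illustrated with a small sketch. This is not TSMiner. It rests on several assumptions the slides do not state: node labels are unique within each graph, so a pattern node maps to the graph node with the same label. A pattern edge counts as matched when its endpoints are joined by a shortest path with between l and h intermediate nodes. Support is counted as the number of graphs containing the structure. The function names are hypothetical.

```python
from collections import deque


def hop_distance(adj, src, dst):
    """Shortest-path length in edges from src to dst, or None if unreachable."""
    if src not in adj or dst not in adj:
        return None
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def contains_structure(adj, pattern_edges, l, h):
    """Assumed relaxed match: every pattern edge (u, v) must correspond to a
    shortest path in the graph with between l and h intermediate nodes."""
    for u, v in pattern_edges:
        dist = hop_distance(adj, u, v)
        if dist is None or dist == 0:
            return False
        if not (l <= dist - 1 <= h):
            return False
    return True


def support(graphs, pattern_edges, l, h):
    """Number of graphs in the collection that contain the structure."""
    return sum(contains_structure(g, pattern_edges, l, h) for g in graphs)


def is_frequent(graphs, pattern_edges, l, h, theta):
    """A structure is frequent when its support is at least theta."""
    return support(graphs, pattern_edges, l, h) >= theta
```

Under these assumptions, setting l = h = 0 reduces the check to exact edge matching. A larger h lets an edge in the pattern stand for a short path in the document graph, which is the approximate matching the talk motivates.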
10
Our Work
Using topological structure mining
Challenges
  – How to create graphs?
  – What are the keywords?
  – How to insert edges into graphs?
11
Outline
Introduction
Topological Structure Mining
Data Preprocessing and Graph Representations
Experiment Results and Pattern Analysis
Conclusion
12
Data Preprocessing and Graph Representation
One graph for each document
Nodes are keywords of interest
Edges inserted based on occurrence of the keywords
Run topological structure mining algorithm
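A minimal sketch of the document-to-graph step, assuming a naive regular-expression sentence splitter, exact token matching against a keyword set, and an edge whenever two keywords share a sentence. The helper names are hypothetical, and the authors' actual tokenization and matching rules are not given in the slides.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def keywords_in(sentence, dictionary):
    """Keywords from the dictionary that appear as whole tokens in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return {t for t in tokens if t in dictionary}


def document_graph(text, dictionary):
    """Build one graph per document: nodes are keywords, and an undirected
    edge (stored as a sorted pair) links keywords found in the same sentence."""
    nodes, edges = set(), set()
    for sentence in split_sentences(text):
        found = keywords_in(sentence, dictionary)
        nodes |= found
        for a, b in combinations(sorted(found), 2):
            edges.add((a, b))
    return nodes, edges
```

Running `document_graph` over every article yields the collection of graphs that a mining algorithm would then take as input.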
13
Data Preprocessing
Four dictionaries of keywords
  – Short Dictionary
      321 genes expressed between prostate epithelial and stromal cells
  – Long Dictionary
      2600 human genes found in SuperArray's DNA microarray experiment
  – Confusion Dictionary
      Gene names easily confused with ordinary words
  – GO Dictionary
      GO terms (molecular function, biological process and cellular component)
14
Graph Representations
Edge Construction Methods
  – Sentence-based Method
      Two keywords in one sentence
  – Mutual Information Method
      The mutual information of two keywords is greater than a threshold
  – Sliding Window Method
      Two keywords located within a sliding window of a pre-defined size
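The sliding window and mutual information rules might be sketched as follows. Both functions are illustrative assumptions, not the authors' code. The window is measured in tokens. Mutual information is approximated by pointwise mutual information (PMI) estimated from sentence-level counts, since the slides do not say how it is computed.

```python
import math
from collections import Counter
from itertools import combinations


def window_edges(tokens, dictionary, window_size):
    """Link two distinct keywords when both fit inside one window of
    window_size consecutive tokens (positions differ by < window_size)."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in dictionary]
    edges = set()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i < window_size:
            edges.add(tuple(sorted((a, b))))
    return edges


def mutual_information_edges(sentences, dictionary, threshold):
    """Link two keywords when their PMI exceeds the threshold.
    sentences is a list of token lists; probabilities are estimated as the
    fraction of sentences containing a keyword (or both keywords)."""
    dictionary = set(dictionary)
    n = len(sentences)
    single, pair = Counter(), Counter()
    for tokens in sentences:
        found = sorted(set(tokens) & dictionary)
        single.update(found)
        pair.update(combinations(found, 2))
    edges = set()
    for (a, b), n_ab in pair.items():
        # PMI = log2( p(a,b) / (p(a) * p(b)) ) with counts over n sentences
        pmi = math.log2(n_ab * n / (single[a] * single[b]))
        if pmi > threshold:
            edges.add((a, b))
    return edges
```

Unlike the sentence-based rule, the window rule can link keywords across a sentence boundary. The PMI rule drops pairs that co-occur no more often than chance would predict.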
15
Outline
Introduction
Topological Structure Mining
Data Preprocessing and Graph Representations
Experiment Results and Pattern Analysis
Conclusion
16
Experiment Results
Focusing on articles containing at least one of the 5 genes
  – CCL5, TF, IGF1, MYLK, IGFBP3
Generating a graph for each article
Finding frequent topological structures
17
Three Edge Construction Methods
20
Results
Sliding window method wins
  – Largest number of frequent patterns
  – Best scalability
Topological structure mining gives us more frequent patterns
A large number doesn't mean high biological significance
21
Pattern Analysis
Patterns that can ONLY be found by topological structure mining
Patterns that can ONLY be found by the sliding window method
Restoring nodes reveals interesting patterns
22
Outline
Introduction
Topological Structure Mining
Data Preprocessing and Graph Representations
Experiment Results and Pattern Analysis
Conclusion
23
Conclusion
Sliding window method is the best
  – The largest number of frequent patterns
  – The highest quality of frequent patterns
Topological structures found correspond well to known relationships
Topological mining is a very valuable tool for biological researchers
24
Three Edge Construction Methods
Interestingness of Edges
  – Counting the number of distinct edges
  – Computing the average interestingness of edges for all patterns found by using each edge construction method