
1 Mining Scientific Data Sets Using Graphs. George Karypis, Department of Computer Science & Engineering, University of Minnesota (Michihiro Kuramochi & Mukund Deshpande)

2 NGDM-02 Outline: Mining Scientific Data Sets; Opportunities & Challenges; Using Graphs and Mining Them; Pattern Discovery in Graphs; Putting Patterns to Good Use; Going Forward

3 NGDM-02 Data Mining In Scientific Domain: Data mining has emerged as a critical tool for knowledge discovery in large data sets. It has been extensively used to analyze business, financial, and textual data sets. The success of these techniques has renewed interest in applying them to various scientific and engineering fields: astronomy, life sciences, ecosystem modeling, fluid dynamics, structural mechanics, …

4 NGDM-02 Challenges in Scientific Data Mining: Most existing data mining algorithms assume that the data is represented via transactions (sets of items), sequences of items or events, multi-dimensional vectors, or time series. Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations cannot be accurately modeled using such frameworks (e.g., numerical simulations, 3D protein structures, chemical compounds). We need algorithms that operate on scientific datasets in their native representation.

5 NGDM-02 How to Model Scientific Datasets? There are two basic choices: (1) treat each dataset/application differently and develop custom representations/algorithms, or (2) employ a new way of modeling such datasets and develop algorithms that span different applications. What should be the properties of this general modeling framework? It should be abstract compared with the original raw data, yet powerful enough to capture the important characteristics. Labeled directed/undirected topological/geometric graphs and hypergraphs fit these requirements.

6 NGDM-02 Modeling Data With Graphs… Going Beyond Transactions: Graphs are suitable for capturing arbitrary relations between the various elements. Mapping: Element → Vertex; Element’s Attributes → Vertex Label; Relation Between Two Elements → Edge; Type Of Relation → Edge Label; Relation between a Set of Elements → Hyper Edge; Data Instance → Graph Instance. Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.
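
The element-to-vertex mapping on this slide can be sketched as a small data structure. This is a minimal illustration rather than anything taken from the talk: the class name `LabeledGraph`, its methods, and the toy molecule (heavy atoms only) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected simple graph with labeled vertices and edges (illustrative)."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u == v:
            raise ValueError("a simple graph has no self-loops")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("add both endpoints before adding the edge")
        # A frozenset key makes (u, v) and (v, u) the same undirected edge.
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        return sorted(w for e in self.edge_labels if v in e for w in e if w != v)


# Toy data instance: atoms become vertices, bonds become edges.
ethanol = LabeledGraph()
for atom_id, element in enumerate(["C", "C", "O"]):
    ethanol.add_vertex(atom_id, element)   # element's attribute -> vertex label
ethanol.add_edge(0, 1, "single")           # type of relation -> edge label
ethanol.add_edge(1, 2, "single")
print(ethanol.neighbors(1))
```

Keying edges by `frozenset` is one simple way to keep the graph undirected and free of multiple edges; a hypergraph variant would allow keys with more than two vertices.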

7 NGDM-02 Example: Protein 3D Structure. PDB: 1MWP, N-Terminal Domain of the Amyloid Precursor Protein (Alzheimer's disease amyloid A4 protein precursor)

8 NGDM-02 Example: Fluid Dynamics. Vertices → Vortices; Edges → Proximity

9 NGDM-02 Graph Mining. Goal: develop algorithms to mine and analyze graph data sets: finding patterns in these graphs; finding groups of similar graphs (clustering); building predictive models for the graphs (classification). Applications: structural motif discovery, high-throughput screening, protein fold recognition, VLSI reverse engineering, and a lot more… Beyond scientific applications: semantic web, mining relational profiles, behavioral modeling, intrusion detection, citation analysis, …

10 NGDM-02 Finding Frequent Patterns in Graphs: A pattern is a relation between an object’s elements that recurs over and over again, e.g., common structures in a family of chemical compounds or proteins, or similar arrangements of vortices at different “instances” of turbulent fluid flows. There are two general ways to formally define a pattern in the context of graphs: arbitrary subgraphs (connected or not) and induced subgraphs (connected or not). Frequent pattern discovery translates to frequent subgraph discovery…

11 NGDM-02 Finding Frequent Subgraphs: Input and Output. Input: a database of graph transactions and a minimum support threshold σ. Each transaction is an undirected simple graph (no loops, no multiple edges) with labels associated with its vertices and edges. Transactions need not be connected. Output: frequent subgraphs that satisfy the minimum support constraint. Each frequent subgraph is connected.
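
To make the support constraint concrete, the sketch below counts support for the simplest possible patterns, single labeled edges, which need no isomorphism test. It is an illustration under stated assumptions: σ is treated as a fraction of the transactions, each transaction is a hypothetical `(vertex_labels, edges)` pair, and the function names are invented for this example.

```python
from collections import Counter


def edge_patterns(graph):
    """Return the distinct single-edge patterns occurring in one transaction."""
    vlabels, edges = graph
    patterns = set()
    for (u, v), elabel in edges.items():
        # Sort endpoint labels so that C-O and O-C are the same pattern.
        a, b = sorted((vlabels[u], vlabels[v]))
        patterns.add((a, elabel, b))
    return patterns


def frequent_edges(database, min_support):
    """Patterns contained in at least min_support * |database| transactions."""
    counts = Counter()
    for graph in database:
        # A set is counted once per transaction, however often the edge occurs.
        counts.update(edge_patterns(graph))
    threshold = min_support * len(database)
    return {p: c for p, c in counts.items() if c >= threshold}


database = [
    ({0: "C", 1: "C", 2: "O"}, {(0, 1): "single", (1, 2): "single"}),
    ({0: "C", 1: "O"}, {(0, 1): "double"}),
    ({0: "C", 1: "C", 2: "C"}, {(0, 1): "single", (1, 2): "single"}),
]
print(frequent_edges(database, min_support=0.6))
```

Support here is the number of transactions that contain a pattern at least once, not the number of occurrences, which matches the transaction setting described on the slide.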

12 NGDM-02 FSG Frequent Subgraph Discovery Algorithm Follows an Apriori-style level-by-level approach and grows the patterns one edge-at-a-time.

13 NGDM-02 Computational Challenges: Simple operations become complicated and expensive when dealing with graphs. Candidate generation: to determine whether two candidates can be joined, we need subgraph isomorphism to check that they share a common subgraph; there is no obvious way to reduce the number of times the same subgraph is generated; graph isomorphism is needed for redundancy checks; and joining two frequent subgraphs can lead to multiple candidate subgraphs. Candidate pruning: checking the downward closure property requires subgraph isomorphism. Frequency counting: checking whether a transaction contains a candidate subgraph requires subgraph isomorphism. Key to FSG’s computational efficiency: it uses an efficient algorithm to determine a canonical labeling of a graph and uses these “strings” to perform identity checks; it uses a sophisticated candidate generation algorithm that reduces the number of times each candidate is generated; and it uses an augmented TID-list based approach to speed up frequency counting.
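
The TID-list idea can be sketched as follows. This is a simplified reading, not FSG's actual implementation: a candidate can only occur in transactions that contain all of its (k-1)-subgraphs, so intersecting their transaction-ID sets limits, and sometimes avoids entirely, the expensive containment tests. The function name, its parameters, and the `contains` callback (a stand-in for a real subgraph isomorphism test) are assumptions.

```python
def count_with_tid_lists(candidate, parent_tids, database, min_count, contains):
    """Return the TID set of `candidate`, or None if it cannot be frequent.

    parent_tids: list of TID sets, one per frequent (k-1)-subgraph of candidate.
    contains:    callable(transaction, candidate) -> bool (hypothetical oracle).
    """
    # Only transactions holding every (k-1)-subgraph can hold the candidate.
    possible = set.intersection(*parent_tids)
    if len(possible) < min_count:
        return None  # pruned without running a single containment test
    tids = {tid for tid in possible if contains(database[tid], candidate)}
    return tids if len(tids) >= min_count else None


# Toy stand-in: a "graph" is a set of edge names and containment is a subset
# test. Real graphs need subgraph isomorphism instead.
database = {
    0: {"ab", "bc", "cd"},
    1: {"ab", "bc"},
    2: {"bc", "cd"},
    3: {"ab", "cd"},
}
tids_ab = {0, 1, 3}
tids_bc = {0, 1, 2}
result = count_with_tid_lists(
    {"ab", "bc"}, [tids_ab, tids_bc], database, 2, lambda t, c: c <= t
)
print(result)
```

In this toy run only two of the four transactions are ever tested, which is the saving the TID lists are meant to provide.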

14 NGDM-02 Candidate Generation Based On Core Detection Multiple candidates for the same core!

15 NGDM-02 Candidate Generation Based On Core Detection Multiple cores between two (k-1)-subgraphs

16 NGDM-02 Canonical Labeling [Figure: example graph with vertices v0, v1, v2, v3 labeled B and v4, v5 labeled A, with two labels listed.] Label = “1 01 011 0001 00010”; Label = “1 11 100 1000 01000”
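
A canonical label can be illustrated with a brute-force sketch: write down a code string for every vertex ordering and keep the lexicographically smallest one, so that isomorphic graphs always receive the same string. This is not the efficient algorithm the talk refers to, since it tries all n! orderings. The code format (vertex labels followed by the columns of the upper-triangular adjacency matrix), the choice of the minimum, single-character vertex labels, and unlabeled edges are all assumptions made for the example.

```python
from itertools import permutations


def code_for_order(order, vlabels, edge_set):
    """Code string of the graph when its vertices are listed in `order`."""
    labels = "".join(vlabels[v] for v in order)
    columns = []
    for j in range(1, len(order)):
        # Column j of the upper-triangular adjacency matrix (rows 0..j-1).
        columns.append("".join(
            "1" if frozenset((order[i], order[j])) in edge_set else "0"
            for i in range(j)
        ))
    return labels + " " + " ".join(columns)


def canonical_label(vlabels, edges):
    """Smallest code over all vertex orderings (exponential, small graphs only)."""
    edge_set = {frozenset(e) for e in edges}
    return min(code_for_order(p, vlabels, edge_set) for p in permutations(vlabels))


# The same path A-B-B written with two different vertex numberings.
g1 = ({0: "A", 1: "B", 2: "B"}, [(0, 1), (1, 2)])
g2 = ({0: "B", 1: "B", 2: "A"}, [(0, 1), (0, 2)])
print(canonical_label(*g1) == canonical_label(*g2))
```

Once every graph carries such a string, checking whether two candidates are the same subgraph reduces to a string comparison.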

17 NGDM-02 DTP Dataset (chemical compounds) (Random 100K transactions)

18 NGDM-02 DTP Dataset

19 NGDM-02 Topology Is Not Enough (Sometimes) Graphs arising from physical domains have a strong geometric nature. This geometry must be taken into account by the data-mining algorithms. Geometric graphs. Vertices have physical 2D and 3D coordinates associated with them.

20 NGDM-02 gFSG: Geometric Extension of FSG. Same input and same output as FSG. Finds frequent geometric connected subgraphs. Uses a geometric version of (sub)graph isomorphism: the mapping of vertices can be translation, rotation, and/or scaling invariant, and the matching of coordinates can be inexact as long as they are within a tolerance radius R (R-tolerant geometric isomorphism).

21 NGDM-02 Use Of Geometry: Transformation-invariant signatures enable quick identity checks: the normalized sum of distances from the center to each vertex, and a sorted list of edge angles. Signatures are compared using a threshold. There is no canonical labeling. For subgraph isomorphism, vertex coordinates also act as strong constraints that narrow the search space: not only the vertex and edge labels but also the coordinates must match. R-tolerance makes the problem of finding all patterns extremely hard: patterns that are 2R-isomorphic to each other will not be counted properly.
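
The first signature can be sketched as below. The slide does not say how the sum is normalized, so this example assumes one plausible choice: distances are measured from the centroid, and the sum is divided by the number of vertices times the largest distance, which makes the value invariant to translation, rotation, and uniform scaling. The function names and the tolerance value are likewise assumptions.

```python
import math


def distance_signature(points):
    """Sum of centroid distances, normalized by n * largest distance (assumed)."""
    n = len(points)
    dim = len(points[0])
    center = [sum(p[k] for p in points) / n for k in range(dim)]
    dists = [
        math.sqrt(sum((p[k] - center[k]) ** 2 for k in range(dim)))
        for p in points
    ]
    longest = max(dists)
    if longest == 0.0:
        return 0.0  # all vertices coincide
    return sum(dists) / (n * longest)


def signatures_match(sig_a, sig_b, tol=1e-6):
    """Compare two signatures up to a threshold, as the slide suggests."""
    return abs(sig_a - sig_b) <= tol


def transform(points, angle, scale, shift):
    """Rotate by `angle`, scale uniformly, then translate (2D helper)."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (scale * (c * x - s * y) + shift[0], scale * (s * x + c * y) + shift[1])
        for x, y in points
    ]


shape = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0)]
moved = transform(shape, math.radians(30), 2.5, (3.0, -7.0))
print(signatures_match(distance_signature(shape), distance_signature(moved)))
```

A matching signature is only a quick filter: different shapes can share a value (every rectangle scores 1.0 under this normalization), so a match still has to be confirmed by the geometric isomorphism test.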

22 NGDM-02 Performance of gFSG Different number of transactions randomly sampled from DTP dataset Average transaction size about 23 Minimum support 1.0%

23 NGDM-02 Example A discovered pattern NSC 4960 NSC 191370 NSC 40773 NSC 164863 NSC 699181

24 NGDM-02 Putting Patterns to Good Use…

25 NGDM-02 Drug Development Cycle: Idea for drug target → Drug screening / rational drug design / direct synthesis → Small scale production → Laboratory and animal testing → Production for clinical trials → File IND

26 NGDM-02 Graph Classification Approach: Graph Databases → (1) Discover Frequent Sub-graphs → (2) Select Discriminating Features → (3) Transform Graphs into Feature Representation → (4) Learn a Classification Model
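
Steps 2 and 3 of this pipeline can be sketched in a few lines. The sketch makes strong simplifying assumptions: each graph is reduced to a set of single-edge patterns standing in for discovered frequent subgraphs, and a feature counts as discriminating when its support differs between the two classes by at least `min_gap`. The talk does not state the selection criterion, so both the criterion and all names here are hypothetical.

```python
def select_discriminating(graphs, labels, features, min_gap=0.75):
    """Keep features whose class-wise support differs by at least min_gap."""
    pos = [g for g, y in zip(graphs, labels) if y == 1]
    neg = [g for g, y in zip(graphs, labels) if y == 0]
    selected = []
    for f in features:
        support_pos = sum(f in g for g in pos) / len(pos)
        support_neg = sum(f in g for g in neg) / len(neg)
        if abs(support_pos - support_neg) >= min_gap:
            selected.append(f)
    return selected


def to_feature_vectors(graphs, features):
    """Binary vector per graph: 1 if the graph contains the feature, else 0."""
    return [[1 if f in g else 0 for f in features] for g in graphs]


# Each toy "graph" is the set of single-edge patterns it contains.
graphs = [
    {("C", "single", "N"), ("C", "single", "C")},
    {("C", "single", "N")},
    {("C", "single", "C")},
    {("C", "single", "C"), ("C", "double", "O")},
]
labels = [1, 1, 0, 0]  # 1 = active, 0 = inactive (made-up classes)
all_features = sorted(set().union(*graphs))
chosen = select_discriminating(graphs, labels, all_features)
print(to_feature_vectors(graphs, chosen))
```

The resulting vectors can be handed to any standard classifier for step 4; with real subgraph features the `f in g` test would be replaced by subgraph isomorphism.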

27 NGDM-02 Chemical Compound Datasets. Predictive Toxicology Challenge (PTC): predicting toxicity (carcinogenicity) of compounds; bioassays on four kinds of rodents; 4 classification problems; approx. 400 chemical compounds. DTP AIDS Antiviral Screen (AIDS): predicting anti-HIV activity of compounds; assay to measure protection of human cells against HIV infection; 3 classification problems; approx. 40,000 chemical compounds. Anthrax: predicting binding ability of compounds with the anthrax toxin; expensive molecular dynamics simulation; collaboration with Dr. Frank Lebeda, USAMRIID; approx. 35,000 chemical compounds.

28 NGDM-02 Comparison with PTC and HIV (female mice) (HIV screening)

29 NGDM-02 Anthrax

30 NGDM-02 Comparison of Topological & Geometric Features on Anthrax Dataset

31 NGDM-02 Most Discriminating Subgraphs: (a) On Toxicology (PTC) Dataset; (b) On AIDS Dataset; (c) On Anthrax Dataset

32 NGDM-02 Moving Forwards: Graphs provide a powerful mechanism to represent relational and physical datasets. They can be used as a quick prototyping tool to test whether data-mining techniques can help a particular application area and problem. Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them…

33 NGDM-02 Research on Pattern Discovery: Robust algorithms for mining 3D geometric graphs (extensive applications in proteomics). Algorithms for finding approximate patterns (allow for a limited number of changes; there is always variation in the physical world). Algorithms to find patterns in single large graphs (what is a pattern? what is its support? do we allow for overlap?). …

34 NGDM-02 Research on Classification: Position-specific models (a substructure at the surface of the protein is in general more important than the same substructure when it is buried). Efficient graph-based kernel methods (algebra of graphs?). …

35 NGDM-02 Research on Clustering: Efficient methods to compute graph similarities (spectral properties? graph edit distance?). Graph consensus representations. Multiple graph “alignments”. …

36 NGDM-02 Graphs, graphs, and more graphs… Graphs with multi-dimensional labels. Stream graphs (phone-network connections). Hypergraphs (compact representation of set relations). Benchmarks and real-life test cases!

37 NGDM-02 Thank you! http://www.cs.umn.edu/~karypis

