Mining Scientific Data Sets Using Graphs George Karypis Department of Computer Science & Engineering University of Minnesota (Michihiro Kuramochi & Mukund Deshpande)

Presentation transcript:

Mining Scientific Data Sets Using Graphs George Karypis Department of Computer Science & Engineering University of Minnesota (Michihiro Kuramochi & Mukund Deshpande)

NGDM-02 Outline
- Mining Scientific Data-sets
- Opportunities & Challenges
- Using Graphs and Mining them
- Pattern Discovery in Graphs
- Putting Patterns to Good Use
- Going Forward

Data Mining In Scientific Domain
- Data mining has emerged as a critical tool for knowledge discovery in large data sets.
- It has been extensively used to analyze business, financial, and textual data sets.
- The success of these techniques has renewed interest in applying them to various scientific and engineering fields:
  - Astronomy
  - Life sciences
  - Ecosystem modeling
  - Fluid dynamics
  - Structural mechanics
  - …

Challenges in Scientific Data Mining
- Most existing data mining algorithms assume that the data is represented via
  - Transactions (set of items)
  - Sequence of items or events
  - Multi-dimensional vectors
  - Time series
- Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations cannot be accurately modeled using such frameworks.
  - e.g., numerical simulations, 3D protein structures, chemical compounds, etc.
- Need algorithms that operate on scientific datasets in their native representation

How to Model Scientific Datasets?
- There are two basic choices
  - Treat each dataset/application differently and develop custom representations/algorithms.
  - Employ a new way of modeling such datasets and develop algorithms that span across different applications!
- What should be the properties of this general modeling framework?
  - Abstract compared with the original raw data.
  - Yet powerful enough to capture the important characteristics.
- Labeled directed/undirected topological/geometric graphs and hypergraphs

Modeling Data With Graphs… Going Beyond Transactions
Graphs are suitable for capturing arbitrary relations between the various elements.

  Data                                  Graph
  Element                               Vertex
  Element's Attributes                  Vertex Label
  Relation Between Two Elements         Edge
  Type Of Relation                      Edge Label
  Relation between a Set of Elements    Hyper Edge
  Data Instance                         Graph Instance

Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.
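
The element-to-graph mapping on this slide can be sketched as a small data structure. This is only an illustration: the class name `LabeledGraph`, its methods, and the example fragment are assumptions, not part of the talk.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledGraph:
    """Undirected simple graph whose vertices and edges carry labels."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u == v:
            raise ValueError("simple graphs have no self-loops")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added first")
        # A frozenset key makes (u, v) and (v, u) the same edge.
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        return sorted(w for e in self.edge_labels if v in e for w in e if w != v)

# Hypothetical data instance: atoms are elements, bonds are relations.
g = LabeledGraph()
for vid, atom in enumerate(["C", "O", "H", "H"]):
    g.add_vertex(vid, atom)
g.add_edge(0, 1, "double")
g.add_edge(0, 2, "single")
g.add_edge(0, 3, "single")
```

Here the vertex label plays the role of the element's attributes and the edge label encodes the type of relation; a hyperedge would need a key over more than two vertices.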

Example: Protein 3D Structure
- PDB: 1MWP
- N-Terminal Domain Of The Amyloid Precursor Protein
- Alzheimer's disease amyloid A4 protein precursor

Example: Fluid Dynamics
- Vertices → Vortices
- Edges → Proximity

Graph Mining
Goal: Develop algorithms to mine and analyze graph data sets.
- Finding patterns in these graphs
- Finding groups of similar graphs (clustering)
- Building predictive models for the graphs (classification)
Applications
- Structural motif discovery
- High-throughput screening
- Protein fold recognition
- VLSI reverse engineering
- A lot more …
Beyond Scientific Applications
- Semantic web
- Mining relational profiles
- Behavioral modeling
- Intrusion detection
- Citation analysis
- …

Finding Frequent Patterns in Graphs
- A pattern is a relation between the object's elements that recurs over and over again.
  - Common structures in a family of chemical compounds or proteins.
  - Similar arrangements of vortices at different "instances" of turbulent fluid flows.
  - …
- There are two general ways to formally define a pattern in the context of graphs:
  - Arbitrary subgraphs (connected or not)
  - Induced subgraphs (connected or not)
- Frequent pattern discovery translates to frequent subgraph discovery…

Finding Frequent Subgraphs: Input and Output
Input
- Database of graph transactions.
- Undirected simple graph (no loops, no multiple edges).
- Each graph transaction has labels associated with its vertices and edges.
- Transactions may not be connected.
- Minimum support threshold σ.
Output
- Frequent subgraphs that satisfy the minimum support constraint.
- Each frequent subgraph is connected.
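
To make the support threshold concrete, here is a minimal sketch restricted to one-edge patterns, where containment needs no subgraph isomorphism test. The function names, the triple-based transaction format, and the toy database are assumptions made for illustration.

```python
from collections import Counter

def edge_pattern(label_u, edge_label, label_v):
    # Undirected edge: order the endpoint labels so that
    # (C, single, O) and (O, single, C) are the same pattern.
    a, b = sorted((label_u, label_v))
    return (a, edge_label, b)

def frequent_single_edges(transactions, min_support):
    """transactions: list of graphs, each a list of
    (label_u, edge_label, label_v) triples.
    Returns {pattern: support} for patterns with support >= min_support."""
    counts = Counter()
    for edges in transactions:
        # A pattern counts once per transaction, however often it occurs in it.
        counts.update({edge_pattern(*e) for e in edges})
    n = len(transactions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

# Hypothetical database of three graph transactions.
db = [
    [("C", "single", "O"), ("C", "single", "H")],
    [("O", "single", "C"), ("C", "double", "O")],
    [("C", "single", "H"), ("C", "single", "H")],
]
freq = frequent_single_edges(db, min_support=0.6)
```

Support here is the fraction of transactions that contain the pattern, so repeated occurrences inside one transaction do not raise it.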

FSG: Frequent Subgraph Discovery Algorithm
Follows an Apriori-style level-by-level approach and grows the patterns one edge at a time.

Computational Challenges
Simple operations become complicated & expensive when dealing with graphs…
- Candidate generation
  - To determine if we can join two candidates, we need to perform subgraph isomorphism to determine if they have a common subgraph.
  - There is no obvious way to reduce the number of times that we generate the same subgraph.
  - Need to perform graph isomorphism for redundancy checks.
  - The joining of two frequent subgraphs can lead to multiple candidate subgraphs.
- Candidate pruning
  - To check the downward closure property, we need subgraph isomorphism.
- Frequency counting
  - Subgraph isomorphism for checking containment of a frequent subgraph
Key to FSG's computational efficiency:
- Uses an efficient algorithm to determine a canonical labeling of a graph and uses these "strings" to perform identity checks.
- Uses a sophisticated candidate generation algorithm that reduces the number of times each candidate is generated.
- Uses an augmented TID-list based approach to speed up frequency counting.
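
One plausible reading of TID-list based counting is sketched below: a candidate can only occur in transactions that contain all of its parent subgraphs, so intersecting the parents' transaction-ID sets limits (or avoids) the expensive containment tests. The function names and the `contains` callback are hypothetical, and plain item sets stand in for graphs so that the example stays runnable.

```python
def candidate_tids(parent_tid_lists):
    """Transactions that contain every parent subgraph of a candidate."""
    return set.intersection(*(set(t) for t in parent_tid_lists))

def supporting_tids(candidate, parent_tid_lists, transactions, contains, min_count):
    tids = candidate_tids(parent_tid_lists)
    if len(tids) < min_count:
        return set()  # pruned without running a single containment test
    # Only the surviving transactions pay for the containment check
    # (subgraph isomorphism in a real graph miner).
    return {t for t in tids if contains(transactions[t], candidate)}

# Toy stand-in: item sets instead of graphs, subset test instead of
# subgraph isomorphism.
transactions = {0: {"a", "b", "c"}, 1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}
checked = []

def contains(transaction, candidate):
    checked.append(transaction)
    return candidate <= transaction

tids = supporting_tids({"a", "b", "c"}, [{0, 1, 3}, {0, 2, 3}],
                       transactions, contains, min_count=2)
```

In this toy run only transactions 0 and 3 are tested, rather than all four.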

Candidate Generation Based On Core Detection
Multiple candidates for the same core!

Candidate Generation Based On Core Detection
Multiple cores between two (k-1)-subgraphs

Canonical Labeling
[Figure: graph with vertices v0, v1, v2, v3 labeled B and v4, v5 labeled A, with two "Label = …" strings whose values were not preserved in the transcript.]
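
The idea of a canonical label can be illustrated with a brute-force sketch: build a string from the vertex labels and the upper triangle of the adjacency matrix for every vertex ordering, and keep the smallest. This is not FSG's actual procedure, which the slides describe only as efficient; the function name, the single-character labels, and the `"0"` marker for a missing edge are assumptions.

```python
from itertools import permutations

def canonical_label(vertex_labels, edges):
    """vertex_labels: list of single-character labels, index = vertex id.
    edges: dict mapping (u, v) pairs to single-character edge labels.
    Returns the lexicographically smallest code over all vertex orderings."""
    n = len(vertex_labels)
    adj = {}
    for (u, v), label in edges.items():
        adj[(u, v)] = adj[(v, u)] = label  # undirected
    best = None
    for perm in permutations(range(n)):
        labels = "".join(vertex_labels[p] for p in perm)
        upper = "".join(adj.get((perm[i], perm[j]), "0")
                        for i in range(n) for j in range(i + 1, n))
        code = labels + "|" + upper
        if best is None or code < best:
            best = code
    return best

# The same graph under two vertex numberings, plus a different graph.
g1 = canonical_label(["A", "B", "B"], {(0, 1): "x", (0, 2): "y"})
g2 = canonical_label(["B", "A", "B"], {(1, 0): "y", (1, 2): "x"})
g3 = canonical_label(["A", "B", "B"], {(0, 1): "x", (1, 2): "y"})
```

Two graphs are isomorphic exactly when their codes are equal, which turns identity checks into string comparisons. The n! orderings make this sketch usable only for very small graphs.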

DTP Dataset (chemical compounds) (Random 100K transactions)

DTP Dataset

Topology Is Not Enough (Sometimes)
- Graphs arising from physical domains have a strong geometric nature.
  - This geometry must be taken into account by the data-mining algorithms.
- Geometric graphs.
  - Vertices have physical 2D and 3D coordinates associated with them.

gFSG: Geometric Extension Of FSG
- Same input and same output as FSG
  - Finds frequent geometric connected subgraphs
- Geometric version of (sub)graph isomorphism
  - The mapping of vertices can be translation, rotation, and/or scaling invariant.
  - The matching of coordinates can be inexact as long as they are within a tolerance radius of r.
  - r-tolerant geometric isomorphism.
[Figure: vertices labeled A, B, C]
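
A much simplified sketch of tolerance-based matching follows. It handles translation invariance only (by aligning centroids, which is an assumption about how the translation is fixed), works in 2D, and checks vertex labels and coordinates but not edges; rotation, scaling, and edge consistency are left out. All names are hypothetical.

```python
from itertools import permutations
from math import dist

def centered(points):
    """Translate 2D points so that their centroid is at the origin."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]

def r_tolerant_match(labels_a, pts_a, labels_b, pts_b, r):
    """True if some label-preserving vertex mapping puts every centered
    point of A within distance r of its image in B."""
    if sorted(labels_a) != sorted(labels_b):
        return False
    a, b = centered(pts_a), centered(pts_b)
    for perm in permutations(range(len(b))):
        if all(labels_a[i] == labels_b[j] and dist(a[i], b[j]) <= r
               for i, j in enumerate(perm)):
            return True
    return False

labels = ["A", "B", "C"]
shape = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Same shape, shifted and listed in a different vertex order.
shifted_labels = ["B", "C", "A"]
shifted = [(6.0, 5.0), (5.0, 6.0), (5.0, 5.0)]
# As above, but vertex B is displaced by 0.3.
distorted = [(6.3, 5.0), (5.0, 6.0), (5.0, 5.0)]
```

With `r = 0.1` the shifted copy matches and the distorted one does not; loosening to `r = 0.25` accepts the distorted copy as well, which shows how the tolerance radius controls what counts as the same pattern.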

Use Of Geometry
- Transformation-invariant signatures enable quick identity checks
  - Normalized sum of distances from the center to each vertex
  - A sorted list of edge angles
  - To compare signatures, use a certain threshold
  - No canonical labeling
- For the subgraph isomorphism, coordinates of vertices also work as strong constraints to narrow down the search space of combinations.
  - Not only the vertex and edge labels but also the coordinates must now be matched.
- r-tolerance makes the problem of finding all patterns extremely hard.
  - Patterns that are 2r-isomorphic to each other will not be counted properly
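
The first signature can be sketched as follows. The slide does not say how the sum is normalized, so dividing by the largest center-to-vertex distance is an assumption chosen to make the value invariant to translation, rotation, and uniform scaling; the function names and the 2D setting are also assumptions.

```python
from math import cos, dist, pi, sin

def distance_signature(points):
    """Sum of centroid-to-vertex distances divided by the largest one."""
    n = len(points)
    center = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    d = [dist(p, center) for p in points]
    longest = max(d)
    return sum(d) / longest if longest > 0 else 0.0

def signatures_agree(sig_a, sig_b, threshold):
    # Signatures are compared with a threshold, not for exact equality.
    return abs(sig_a - sig_b) <= threshold

def transform(points, angle, scale, shift):
    """Rotate, scale uniformly, then translate (used only for the demo)."""
    return [(scale * (x * cos(angle) - y * sin(angle)) + shift[0],
             scale * (x * sin(angle) + y * cos(angle)) + shift[1])
            for x, y in points]

triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
moved = transform(triangle, pi / 3, 2.5, (4.0, -7.0))
equilateral = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]
```

A signature match is only a necessary condition: shapes with different signatures cannot be the same pattern, while shapes with close signatures still need the full geometric isomorphism test.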

Performance of gFSG
- Different number of transactions randomly sampled from DTP dataset
- Average transaction size about 23
- Minimum support 1.0%

Example
A discovered pattern
[Figure labels: NSC 4960 and four further NSC entries whose numbers were not preserved in the transcript.]

Putting Patterns to Good Use…

Drug Development Cycle
- Idea for drug target
- Drug screening / rational drug design / direct synthesis
- Small scale production
- Laboratory and animal testing
- Production for clinical trials
- File IND

Graph Classification Approach
Input: Graph Databases
1. Discover Frequent Sub-graphs
2. Select Discriminating Features
3. Transform Graphs in Feature Representation
4. Learn a Classification Model
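
Steps 2 and 3 of this pipeline can be sketched under strong simplifications. Each graph is represented by the set of frequent patterns already found in it, and features are ranked by the difference in support between the two classes; that scoring rule, the function names, and the pattern names are all assumptions for illustration, since the slides do not state how features are selected.

```python
def class_support(pattern, graphs):
    """Fraction of the given graphs that contain the pattern."""
    return sum(pattern in g for g in graphs) / len(graphs)

def select_features(patterns, positives, negatives, k):
    # Assumed scoring rule: prefer patterns whose support differs most
    # between the classes; ties are broken alphabetically.
    def key(p):
        gap = abs(class_support(p, positives) - class_support(p, negatives))
        return (-gap, p)
    return sorted(patterns, key=key)[:k]

def to_feature_vector(graph, selected):
    """Binary vector: 1 if the graph contains the pattern, else 0."""
    return [1 if p in graph else 0 for p in selected]

# Hypothetical data: each graph is the set of patterns it contains.
positives = [{"ring", "nitro"}, {"ring", "nitro", "amine"}]
negatives = [{"ring"}, {"ring", "amine"}]
selected = select_features(["amine", "nitro", "ring"], positives, negatives, k=2)
vectors = [to_feature_vector(g, selected) for g in positives + negatives]
```

The resulting vectors can be handed to any standard vector-based classifier, which is what makes step 4 independent of the graph representation.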

Chemical Compound Datasets
- Predictive Toxicology Challenge (PTC)
  - Predicting toxicity (carcinogenicity) of compounds.
  - Bio assays on four kinds of rodents
  - 4 classification problems; approx. 400 chemical compounds.
- DTP AIDS Antiviral Screen (AIDS)
  - Predicting anti-HIV activity of compounds.
  - Assay to measure protection of human cells against HIV infection.
  - 3 classification problems; approx. 40,000 chemical compounds.
- Anthrax
  - Predicting binding ability of compounds with the anthrax toxin.
  - Expensive molecular dynamics simulation
  - Collaboration with Dr. Frank Lebeda, USAMRIID
  - Approx. 35,000 chemical compounds

Comparison with PTC and HIV
[Figure panels: (female mice), (HIV screening)]

Anthrax

Comparison of Topological & Geometric Features on Anthrax Dataset

Most Discriminating Subgraphs
(a) On Toxicology (PTC) Dataset
(b) On AIDS Dataset
(c) On Anthrax Dataset

Moving Forwards
- Graphs provide a powerful mechanism to represent relational and physical datasets.
- Can be used as a quick prototyping tool to test out whether or not data-mining techniques can help a particular application area and problem.
- Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them…

Research on Pattern Discovery
- Robust algorithms for mining 3D geometric graphs
  - extensive applications in proteomics
- Algorithms for finding approximate patterns
  - allow for a limited number of changes
  - there is always variation in the physical world
- Algorithms to find patterns in single large graphs
  - what is a pattern?
  - what is its support?
  - do we allow for overlap?
- …

Research on Classification
- Position specific models
  - a substructure at the surface of the protein is in general more important than the same substructure being buried
- Efficient graph-based kernel methods
  - algebra of graphs?
- …

Research on Clustering
- Efficient methods to compute graph similarities
  - spectral properties?
  - graph edit distance?
- Graph consensus representations
- Multiple graph "alignments"
- …

Graphs, graphs, and more graphs…
- Graphs with multi-dimensional labels
- Stream graphs
  - phone-network connections
- Hypergraphs
  - compact representation of set relations
- Benchmarks and real-life test cases!

Thank you!