
CPT-S 415 Big Data Yinghui Wu EME B45

CPT-S 415 Big Data, special topic: Data Mining & Graph Mining
- Classification
- Graph classification
  - discriminative feature-based classification
  - kernel-based classification

Classification

Classification vs. prediction
- Classification: predicts categorical class labels; classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis

Classification in two steps
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction
  - Accuracy rate is the percentage of test-set samples that are correctly classified by the model
  - The test set must be independent of the training set, otherwise the estimate is biased by over-fitting
- Flow: training data → build classification model → predict on test data
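To make the two-step process concrete, here is a minimal scikit-learn sketch; the tiny numeric dataset, the 25% test split, and the choice of a decision tree are illustrative assumptions, not part of the lecture material.

```python
# Minimal sketch of the two-step workflow (model construction, then model usage).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy feature matrix X (already encoded as numbers) and class labels y.
X = [[25, 0, 0], [38, 1, 1], [45, 0, 1], [29, 1, 0],
     [52, 0, 0], [33, 1, 1], [41, 0, 0], [23, 1, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes", "no", "yes"]

# Step 1: model construction on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- classify the held-out test set and estimate accuracy.
# Keeping the test set independent of the training set guards against an
# over-fitted accuracy estimate.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```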

Existing classification methods
- Decision trees (e.g., a tree testing age, student, and credit rating to reach yes/no leaves)
- Support vector machines
- Bayesian networks (e.g., a network over Family History, Smoker, Lung Cancer, Emphysema, Positive X-Ray, Dyspnea)
- Neural networks
- and many more…

Classification by decision trees
- A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes
  - Tree pruning: identify and remove branches that reflect noise or outliers
- Use of a decision tree: classify an unknown sample by testing its attribute values against the tree

Algorithm for decision tree induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
  - There are no samples left
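The basic algorithm fits in a few dozen lines of Python. This is an illustrative ID3-style sketch (categorical attributes only, information gain as the selection measure, majority voting at exhausted leaves), not the exact classroom implementation.

```python
# Greedy, top-down, recursive decision tree induction (ID3-style sketch).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy from partitioning on attribute attr."""
    gain = entropy(labels)
    n = len(labels)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def build_tree(rows, labels, attrs):
    # Stop: all samples in one class, or no attributes left (majority vote).
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree

# Example call: build_tree(rows, labels, ["age", "student", "credit_rating"]),
# where rows is a list of attribute-value dicts and labels the class labels.
```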

Training dataset (figure): the buys_computer example and the decision tree learned from it, with age at the root (branches <=30, 31..40, >40), a student test (no/yes) under one branch, and a credit rating test (fair/excellent) under another.

Extracting classification rules from trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
Example:
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31…40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
Over-fitting?
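Rule extraction is just an enumeration of root-to-leaf paths. Below is a small sketch, assuming trees are stored in the nested-dict form of the induction sketch above; the hard-coded example tree mirrors the rules listed on this slide.

```python
# Enumerate one IF-THEN rule per root-to-leaf path of a nested-dict decision tree.
tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                "31...40": "yes",
                ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}}}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):            # leaf: emit the accumulated conjunction
        conj = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        print(f'IF {conj} THEN buys_computer = "{node}"')
        return
    (attr, branches), = node.items()          # one attribute test per internal node
    for value, child in branches.items():
        extract_rules(child, conditions + ((attr, value),))

extract_rules(tree)
```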

Information gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Assume there are two classes, P and N (e.g., yes/no)
- Let the set of examples S contain p elements of class P and n elements of class N
- The amount of information needed to decide whether an arbitrary example in S belongs to P or N is
  I(p, n) = -(p / (p+n)) log2(p / (p+n)) - (n / (p+n)) log2(n / (p+n))

Information gain in decision tree induction
- Assume that using attribute A, the set S is partitioned into sets {S1, S2, …, Sv}
- If Si contains pi examples of P and ni examples of N, the entropy, or expected information needed to classify objects across all subtrees Si, is
  E(A) = sum over i of ((pi + ni) / (p + n)) * I(pi, ni)
- The encoding information that would be gained by branching on A is
  Gain(A) = I(p, n) - E(A)
Worked example (S: 9+, 5-, I = 0.940):
- Splitting on Humidity gives subsets S1 (3+, 4-, I = 0.985) and S2 (6+, 1-, I = 0.592):
  Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
- Splitting on Wind gives subsets (6+, 2-, I = 0.811) and (3+, 3-, I = 1.00):
  Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048
http://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf
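The worked example can be reproduced numerically; the helper below implements the two-class entropy I(p, n) from the previous slide, with the counts taken from this slide.

```python
# Reproduce the information-gain numbers from the worked example (entropy in bits).
import math

def info(p, n):
    """I(p, n): bits needed to decide the class of an example drawn from S."""
    total = p + n
    return sum(-x / total * math.log2(x / total) for x in (p, n) if x)

I_S = info(9, 5)                                                  # about 0.940
gain_humidity = I_S - (7/14) * info(3, 4) - (7/14) * info(6, 1)   # about 0.151
gain_wind     = I_S - (8/14) * info(6, 2) - (6/14) * info(3, 3)   # about 0.048
print(round(gain_humidity, 3), round(gain_wind, 3))
```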

Molecular structure classification
Task: predict whether molecules are toxic, given a set of known examples (figure: labeled toxic and non-toxic molecular graphs, plus unknown molecules to be classified).

Classification
- Task: assigning class labels from a discrete class label set Y to input instances in an input space X
  - Example: Y = {toxic, non-toxic}, X = {valid molecular structures}
- Train the classification model using the training data
- Assign the unknown (test) data to appropriate class labels using the model
- Errors: misclassified data instances (test error); unclassified data instances

Graph classification methods
- Graph classification: structure-based / frequent-feature-based
- Graph classification: direct product kernel
  - Predictive Toxicology example dataset
- Vertex classification: Laplacian kernel

Feature-based graph classification

Discriminative frequent pattern-based classification
Pipeline: training instances (positive and negative) → mine discriminative frequent patterns (feature construction) → feature space transformation → model learning → prediction model → applied to test instances.

Pattern-based classification on transactions
Mine frequent itemsets from the transactions (e.g., with min_sup = 3, the itemsets AB, AC, and BC each reach support 3), then augment the original attributes A, B, C with one binary feature per frequent itemset (AB, AC, BC); the class label is unchanged, and any classifier can be trained on the augmented table.

Pattern-based classification on graphs
Mine frequent subgraphs (e.g., g1 and g2 with min_sup = 2) from the labeled graphs (active vs. inactive), then transform each graph into a binary feature vector recording which frequent subgraphs it contains. The resulting representation is not confined to rule-based learning: the most discriminative features can be selected, and any classifier can be trained on them (related work includes emerging patterns).

Substructure-based graph classification
- Basic idea
  - Extract graph substructures f1, …, fn
  - Represent a graph with a feature vector x = (x1, …, xn), where xi is the frequency of substructure fi in that graph
  - Build a classification model on these vectors
- Different features and representative work
  - Fingerprints
  - MACCS keys
  - Tree and cyclic patterns [Horvath et al.]
  - Minimal contrast subgraphs [Ting and Bailey]
  - Frequent subgraphs [Deshpande et al.; Liu et al.]
  - Graph fragments [Wale and Karypis]
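A toy sketch of the substructure-based representation follows. To keep it runnable, the "substructures" here are just labeled edges (pairs of node labels); a real system would use mined fingerprints, fragments, or frequent subgraphs as the features.

```python
# Turn each graph into a feature vector of substructure frequencies.
from collections import Counter

def edge_multiset(graph_edges):
    """graph_edges: list of (label_u, label_v) pairs; order-normalized."""
    return Counter(tuple(sorted(e)) for e in graph_edges)

def to_feature_vector(graph_edges, features):
    counts = edge_multiset(graph_edges)
    return [counts[f] for f in features]

features = [("C", "C"), ("C", "N"), ("C", "O")]         # chosen substructure features
g1 = [("C", "C"), ("C", "N"), ("C", "O"), ("C", "C")]    # a toy "molecule"
g2 = [("C", "O"), ("C", "O"), ("C", "N")]
print(to_feature_vector(g1, features))   # [2, 1, 1]
print(to_feature_vector(g2, features))   # [0, 1, 2]
```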

Recall: two basic frequent pattern mining methods
- Apriori: join two size-k patterns into a size-(k+1) pattern, e.g., itemsets {a,b,c} + {a,b,d} → {a,b,c,d}
- Pattern growth: depth-first search; grow a size-k pattern into a size-(k+1) one by adding one element

Fingerprints (fp-n)
- Enumerate all paths up to length l, and certain cycles, in a chemical compound
- Hash each such feature to position(s) in a fixed-length bit-vector
Fingerprints are typically produced by enumerating all paths up to a certain length in a chemical compound; certain cycles are also identified. All these substructures are then mapped (hashed) to positions in a fixed-length bit-vector.
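A rough fingerprint sketch, assuming the paths have already been enumerated and are represented by their label strings; the hash function and the 1024-bit length are arbitrary illustrative choices.

```python
# Hash enumerated path labels into a fixed-length fingerprint bit-vector.
import hashlib

def fingerprint(path_labels, n_bits=1024):
    bits = [0] * n_bits
    for path in path_labels:
        pos = int(hashlib.md5(path.encode()).hexdigest(), 16) % n_bits
        bits[pos] = 1
    return bits

fp = fingerprint(["C-C", "C-N", "C-C-O", "N-C-C-O"])
print(sum(fp), "bits set out of", len(fp))
```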

MACCS keys (MK)
- Fragments identified by domain experts as useful for biological activity
- Each fragment forms a fixed dimension in the descriptor space
MACCS keys were first introduced by MDL Inc. and consist of fragments identified by domain experts that are useful for biological activity. They have been a fairly successful set of descriptors in the pharmaceutical industry.

Cycles and trees (CT) [Horvath et al., KDD'04]
- Bounded cyclicity using biconnected components
- Identify the biconnected components of the chemical compound (each contains a fixed number of cycles)
- Delete the biconnected components from the compound; the left-over parts are trees
Cyclic pattern kernels have been employed to classify chemical compounds and were shown to be quite effective.

Frequent subgraphs (FS) [Deshpande et al., TKDE'05]
- Discovering features: topological features, captured by the graph representation
- Run frequent subgraph discovery on the chemical compounds with a minimum support threshold; each discovered subgraph carries its support in the positive and negative classes (e.g., +ve: 30%, -ve: 5%)
Frequent substructures (subgraphs) were used as descriptors to classify chemical compounds by Deshpande et al. Frequent substructures are subgraphs that are present in at least a certain fraction of the compounds in the database; this fraction is a user-supplied parameter called the minimum support.

Frequent subgraph-based classification [Deshpande et al., TKDE'05]
- Frequent subgraphs: a graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
- Feature generation
  - Frequent topological subgraphs, mined by FSG
  - Frequent geometric subgraphs with 3D shape information
- Feature selection: sequential covering paradigm
- Classification
  - Use an SVM to learn a classifier based on the feature vectors
  - Assign different misclassification costs to different classes to address the skewed class distribution
http://www.kernel-machines.org/
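The final classification step might look like the following scikit-learn sketch; the feature vectors, labels, and class weights are placeholders chosen only to illustrate assigning different misclassification costs to the two classes.

```python
# SVM on frequent-subgraph feature vectors, with per-class misclassification costs.
from sklearn.svm import SVC

X = [[2, 1, 0], [1, 1, 0], [0, 0, 3], [0, 1, 2], [2, 0, 0], [0, 0, 2]]
y = [+1, +1, -1, -1, +1, -1]

# Heavier penalty for misclassifying the (assumed rarer) positive class.
clf = SVC(kernel="linear", class_weight={+1: 5.0, -1: 1.0}).fit(X, y)
print(clf.predict([[1, 1, 0], [0, 0, 1]]))
```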

Graph fragments (GF) [Wale and Karypis, ICDM'06]
- Tree fragments (TF): at least one node of the fragment has degree greater than 2; no cycles
- Path fragments (PF): all nodes have degree less than or equal to 2; no cycles
- Acyclic fragments (AF): TF ∪ PF; acyclic fragments are also termed free trees

Graph fragments (GF) [Wale and Karypis, ICDM'06]
- All graph substructures up to a given length (size, or number of bonds)
- Determined dynamically → dataset-dependent descriptor space
- Complete coverage → descriptors for every compound
- Precise representation → one-to-one mapping
- Complex fragments → arbitrary topology
- A recurrence relation generates the graph fragments of length l

Minimal contrast subgraphs [Ting and Bailey, SDM'06]
- A contrast graph is a subgraph appearing in one class of graphs and never in another class
  - Minimal if none of its subgraphs are contrasts
  - May be disconnected: allows a succinct description of differences, but requires a larger search space
- Main idea
  - Find the maximal common edge sets (these may be disconnected)
  - Apply a minimal hypergraph transversal operation to derive the minimal contrast edge sets from the maximal common edge sets
  - Minimal contrast vertex sets must be computed separately and then combined (minimal union) with the minimal contrast edge sets

Contrast subgraph example (figure): a positive example graph and a negative example graph over labeled vertices v0–v4 and edges e0–e4 (graphs A and B); graphs C, D, and E show the resulting contrast subgraphs.

Kernel-based graph classification

Classification with graph structures
- Graph classification (between-graph): each full graph is assigned a class label
  - Example: molecular graphs, classified as toxic or non-toxic
- Vertex classification (within-graph): within a single graph, each vertex is assigned a class label
  - Example: a webpage (vertex) / hyperlink (edge) graph, e.g., a WSU domain graph whose vertices are labeled faculty, course, or student

Relating graph structures to classes?
- Two-step process
  - Devise a kernel that captures the property of interest
  - Apply a kernelized classification algorithm, using that kernel function
- Graph kernels covered
  - Classification of graphs: direct product kernel
  - Classification of vertices: Laplacian kernel
- Used with support vector machines (SVMs), one of the better-known kernelized classification techniques

Graph kernels
- A kernel is a type of similarity function: K(x, y) > 0 is the "similarity" of x and y
- A feature representation f defines a kernel: with f(x) = (f1(x), f2(x), …, fk(x)),
  K(x, y) = f(x) · f(y) = sum over i of fi(x) fi(y)
- Motivation: kernel-based learning methods do not need to access the data points themselves; they rely only on the kernel function between data points, so they can be applied to any complex structure for which a kernel function is available
- Basic idea: map each graph to some significant set of patterns, and define a kernel on the corresponding sets of patterns
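The identity K(x, y) = f(x)·f(y) written out with NumPy: once graphs are mapped to pattern-count vectors, the kernel matrix is just the matrix of pairwise dot products. The three count vectors below are made up for illustration.

```python
# Kernel induced by an explicit feature map: pairwise dot products of feature vectors.
import numpy as np

F = np.array([[2, 1, 0],      # f(g1): pattern counts for graph g1
              [0, 1, 2],      # f(g2)
              [1, 1, 1]])     # f(g3)

K = F @ F.T                   # K[i, j] = f(g_i) . f(g_j)
print(K)
```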

Graph kernel-based classification
Pipeline: training instances (positive and negative) → feature mapping and kernelized similarity estimation via graph kernel functions → model learning → prediction model → applied to test instances.
More features or more training examples?

Walk-based kernels (RW kernel)
Intuition: two graphs are similar if they exhibit similar patterns when performing random walks.
(Figure: two graphs whose random-walk vertex distributions concentrate on corresponding regions are judged similar; a third graph whose random-walk vertices are evenly distributed is judged not similar to them.)

Random walk kernel
- Basic idea: count the matching random walks between the two graphs
- Marginalized kernels (Gärtner '02, Kashima et al. '02, Mahé et al. '04):
  K(G1, G2) = sum over h1, h2 of p1(h1) p2(h2) K_walk(h1, h2)
  where h1 and h2 are paths (walks) in graphs G1 and G2, p1 and p2 are probability distributions on paths, and K_walk is a kernel between paths, e.g., one that checks whether the label sequences of the two walks match

Direct product kernel: intuition
- Input graphs: G1 = (V1, E1) and G2 = (V2, E2)
- Direct product vertices: V(G1 × G2) = {(v1, v2) : v1 in V1, v2 in V2} (restricted to pairs with matching labels, if the graphs are labeled)
- Direct product edges: ((u1, u2), (v1, v2)) is an edge of G1 × G2 iff (u1, v1) is an edge of G1 and (u2, v2) is an edge of G2

Direct product graph: example (figure shows two labeled input graphs, Type-A and Type-B).

Direct product graph example (figure): the adjacency matrix of the direct product of Type-A and Type-B, indexed by vertex pairs (one copy of Type-B's vertices A–E for each of Type-A's vertices A–D).
Intuition: multiply each entry of Type-A's adjacency matrix by the entire adjacency matrix of Type-B (the Kronecker product).

Direct product kernel
- The feature space is indexed by all possible walks
- The common walks of G1 and G2 are exactly the walks of the direct product graph G1 × G2
- With A× the adjacency matrix of G1 × G2 and a decay factor lambda chosen so that the series converges,
  K(G1, G2) = sum over i, j of [ sum over l >= 0 of lambda^l A×^l ]_ij = sum over i, j of [ (I - lambda A×)^(-1) ]_ij
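A compact NumPy sketch of this computation; it ignores vertex labels (so the product graph is the plain Kronecker product of the adjacency matrices) and fixes an arbitrary decay value, both simplifying assumptions.

```python
# Direct product (walk-counting) kernel via the Kronecker product of adjacency matrices.
import numpy as np

def direct_product_kernel(A1, A2, lam=0.1):
    Ax = np.kron(A1, A2)                         # adjacency matrix of the product graph
    n = Ax.shape[0]
    walks = np.linalg.inv(np.eye(n) - lam * Ax)  # sum_{l>=0} lam^l * Ax^l (geometric series)
    return walks.sum()                           # sum over all entry pairs (i, j)

A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])   # a toy triangle
A2 = np.array([[0, 1], [1, 0]])                     # a toy single edge
print(direct_product_kernel(A1, A2))
```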

Kernel matrix
- Compute the direct product kernel for all pairs of graphs in the set of known examples
- This matrix is used as the input to the SVM (or any other kernelized data mining method) to create the classification model
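Once the kernel matrix is available, it can be fed to an SVM directly. Below is a sketch using scikit-learn's precomputed-kernel mode; the kernel values and labels are stand-ins.

```python
# Train an SVM from a precomputed Gram (kernel) matrix instead of feature vectors.
import numpy as np
from sklearn.svm import SVC

K_train = np.array([[9., 3., 1.],        # kernel values among the known graphs
                    [3., 8., 2.],
                    [1., 2., 7.]])
labels = ["toxic", "non-toxic", "non-toxic"]

clf = SVC(kernel="precomputed").fit(K_train, labels)

# At prediction time, pass the kernel values between test graphs and training graphs.
K_test = np.array([[8., 2., 1.]])        # one test graph vs. the three training graphs
print(clf.predict(K_test))
```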

Summary
- Basics of data mining: what are several basic tasks in data mining? How does it compare with statistics? With databases?
- A wide range of interesting applications (pattern recognition, business rule mining, scientific data, …)
- Graph mining: graph pattern mining, classification, clustering
  - Frequent graph pattern mining: Apriori methods, pattern growth
  - Graph classification: feature-based, kernel-based
  - Model learning: decision trees, SVMs, …

Related techniques
- Graph mining: frequent subgraph mining, anomaly detection
- Kernels: alternatives to the direct product and other "walk-based" kernels
- gBoost: an extension of "boosting" for graphs
  - Progressively collects "informative" frequent patterns to use as features for classification / regression
  - Also considered a frequent subgraph mining technique (similar to gSpan)
- Tree kernels: similarity of graphs that are trees

Related techniques
- Decision trees
  - Classification model → a tree of conditionals on variables, where leaves represent class labels
  - The input space is typically a set of discrete variables
- Bayesian belief networks
  - Produce a directed acyclic graph structure, using Bayesian inference to generate the edges
  - Each vertex (a variable/class) is associated with a probability table indicating the likelihood of an event or value occurring, given the values of its determined (parent) variables
- Support vector machines
  - Traditionally used for classification of real-valued vector data
  - Support kernel functions operating on vectors

Related: ensemble classification
Ensemble learning: algorithms that build multiple models to enhance stability and reduce selection bias. Some examples:
- Bagging: generate multiple models from samples of the input set (drawn with replacement); evaluate by averaging / voting over the models
- Boosting: generate multiple weak models; weight each model's contribution by some measure of its accuracy
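Both flavors are available off the shelf; the following sketch runs scikit-learn's bagging and AdaBoost implementations on a synthetic dataset (all settings are arbitrary illustrative choices).

```python
# Bagging vs. boosting on synthetic data, compared by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bagging: many trees, each fit on a bootstrap sample; predictions are voted/averaged.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: a sequence of weak learners, each weighted by its accuracy.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```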

Related techniques: evaluating and comparing classifiers
A very brief, "typical" classification workflow:
1. Partition the data into training and test sets.
2. Build the classification model using only the training set.
3. Evaluate the accuracy of the model using only the test set.
Modifications to the basic workflow:
- Multiple rounds of training and testing (cross-validation)
- Multiple classification models built (bagging, boosting)
- More sophisticated sampling (all of the above)
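A side-by-side sketch of the basic workflow and its cross-validated variant, again on synthetic data chosen only for illustration.

```python
# Single train/test split vs. 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Basic workflow: one train/test partition, one accuracy number.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
single = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr).score(X_te, y_te)

# Modification: multiple rounds of training/testing (5-fold cross-validation).
folds = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5)
print(f"single split: {single:.3f}   5-fold mean: {folds.mean():.3f}")
```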

A big picture: graph mining
- Frequent subgraph mining (FSM)
  - Apriori-based: AGM, FSG, PATH
  - Pattern-growth-based: gSpan, MoFa, GASTON, FFSM, SPIN
  - Closed subgraph mining: CloseGraph
- Variant subgraph pattern mining
  - Coherent subgraph mining
  - Dense subgraph mining: CloseCut, Splat, CODENSE
  - Approximate methods
- Applications of frequent subgraph mining
  - Indexing and search: GraphGrep, Daylight, gIndex, Grafil
  - Classification: SUBDUE, GBI, kernel methods (graph kernels)
  - Clustering: CSA, CLAN

Papers to review
- Yan, Xifeng, and Jiawei Han. "gSpan: Graph-based substructure pattern mining." ICDM, 2002.
- Inokuchi, Akihiro, Takashi Washio, and Hiroshi Motoda. "An apriori-based algorithm for mining frequent substructures from graph data." Principles of Data Mining and Knowledge Discovery (PKDD), Springer, 2000, pp. 13-23.
- Yan, Xifeng, and Jiawei Han. "CloseGraph: Mining closed frequent graph patterns." KDD, 2003.
- Kudo, Taku, Eisaku Maeda, and Yuji Matsumoto. "An application of boosting to graph classification." Advances in Neural Information Processing Systems (NIPS), 2004.
- Kashima, Hisashi, and Akihiro Inokuchi. "Kernels for graph classification." ICDM Workshop on Active Mining, 2002.
- Saigo, Hiroto, et al. "gBoost: A mathematical programming approach to graph classification and regression." Machine Learning 75.1 (2009): 69-89.
- Horváth, Tamás, Thomas Gärtner, and Stefan Wrobel. "Cyclic pattern kernels for predictive graph mining." KDD, 2004.
- Ting, Roger Ming Hieng, and James Bailey. "Mining minimal contrast subgraph patterns." SDM, 2006.