
Martin Ralbovský, KIZI FIS VŠE, 6.12.2007

The GUHA method
- Provides a general framework for retrieving interesting information from data
- Strong foundations in logic and statistics
- One of the main principles of the method is to provide "everything interesting" to the user

Decision trees
- One of the best-known classification methods
- Several well-known algorithms exist for constructing decision trees (ID3, C4.5, …)
- Algorithm outline: iterate over the attributes; in each step choose the best attribute for branching and create a node from it (see the sketch below)
- The single best decision tree is the output
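To make the classic greedy step concrete, here is a minimal C# sketch of ID3-style attribute selection by information gain. It is illustrative only: the class and method names (Id3Sketch, BestAttribute) and the string[] row encoding are assumptions, not code from the slides or from Ferda.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class Id3Sketch
    {
        // Entropy of a class-label multiset: H = -sum_i p_i * log2(p_i).
        static double Entropy(IEnumerable<string> labels)
        {
            var counts = labels.GroupBy(l => l).Select(g => (double)g.Count()).ToList();
            double total = counts.Sum();
            return counts.Sum(c => -(c / total) * Math.Log(c / total, 2));
        }

        // Information gain of splitting 'rows' on attribute column 'attr':
        // entropy before the split minus the weighted entropy of the parts.
        internal static double InformationGain(List<string[]> rows, int attr, int classIdx)
        {
            double before = Entropy(rows.Select(r => r[classIdx]));
            double after = rows.GroupBy(r => r[attr])
                               .Sum(g => (double)g.Count() / rows.Count
                                         * Entropy(g.Select(r => r[classIdx])));
            return before - after;
        }

        // The classic greedy step: pick the single best attribute for branching.
        public static int BestAttribute(List<string[]> rows,
                                        IEnumerable<int> candidateAttrs, int classIdx)
            => candidateAttrs.OrderByDescending(a => InformationGain(rows, a, classIdx))
                             .First();
    }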

Making decision trees the GUHA way (Petr Berka)
- A decision tree can be viewed as a GUHA verification/hypothesis
- But the classic algorithm puts only one tree in the output
- Hence a modification of the initial algorithm – the ETree procedure
- We do not branch on the best attribute, but on the n best attributes (see the sketch below)
- In each iteration, nodes suitable for branching are selected from the existing trees and branched
- Only sound decision trees go to the output
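In that framing, the ETree change to the greedy step is a one-liner in spirit: take the n best attributes instead of the single best one, spawning one candidate tree per attribute. A sketch that could be added to the hypothetical Id3Sketch class above (the slides order attributes by χ² rather than information gain; both fit the same shape):

    // ETree-style step (sketch): the n best attributes by the ordering
    // criterion, each producing one new candidate tree to explore.
    public static List<int> BestAttributes(List<string[]> rows,
                                           IEnumerable<int> candidateAttrs,
                                           int classIdx, int n)
        => candidateAttrs
            .OrderByDescending(a => InformationGain(rows, a, classIdx))
            .Take(n)        // n = 1 recovers the classic single-tree algorithm
            .ToList();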

ETree parameters (Petr Berka)
- Criterion for attribute ordering: χ² (a sketch of the statistic follows below)
- Trees: maximal tree depth (parameter); allow only full-length trees; number of attributes for branching
- Branching: minimal node frequency; minimal node purity; criterion for stopping the branching (frequency, purity, frequency OR purity)
- Sound trees: confusion matrix, F-measure + any 4ft-quantifier in Ferda
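For reference, a minimal sketch of Pearson's χ² statistic on a contingency table (e.g. attribute value × class frequencies in a node). This is the generic textbook computation, not the Ferda implementation:

    using System;

    static class ChiSquared
    {
        // Pearson's chi-squared statistic for an r x c contingency table:
        // sum over all cells of (observed - expected)^2 / expected,
        // where expected[i,j] = rowSum[i] * colSum[j] / total.
        public static double Statistic(long[,] observed)
        {
            int r = observed.GetLength(0), c = observed.GetLength(1);
            var rowSum = new double[r];
            var colSum = new double[c];
            double total = 0;
            for (int i = 0; i < r; i++)
                for (int j = 0; j < c; j++)
                {
                    rowSum[i] += observed[i, j];
                    colSum[j] += observed[i, j];
                    total += observed[i, j];
                }
            double chi2 = 0;
            for (int i = 0; i < r; i++)
                for (int j = 0; j < c; j++)
                {
                    double expected = rowSum[i] * colSum[j] / total;
                    if (expected > 0)
                        chi2 += Math.Pow(observed[i, j] - expected, 2) / expected;
                }
            return chi2;
        }
    }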

How to branch I
Attribute1 = {A, B, C}, Attribute2 = {1, 2}
[Diagrams of branching variants a) and b): trees with nodes labeled by the values A, B, C]

How to branch II
[Diagrams continuing the example, variants a) and b): the A, B, C nodes branched further by the Attribute2 values 1 and 2]

Pseudocode of the algorithm

    Stack<Tree> stack = new Stack<Tree>();   // LIFO stack of candidate trees
    stack.Push(MakeSeedTree());              // the seed tree: root node only
    while (stack.Count > 0)
    {
        Tree processTree = stack.Pop();
        // branch the tree at every node still suitable for branching
        foreach (Node n in NodesForBranching(processTree))
        {
            stack.Push(CreateTree(processTree, n));
        }
        // a tree that passes the quality criterion goes to the output
        if (QualityTree(processTree))
        {
            PutToOutput(processTree);
        }
    }
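One plausible reading of NodesForBranching, driven by the parameters listed earlier; the Tree/Node members (Leaves, Frequency, Purity, Depth) are hypothetical names, as the slides do not show this code:

    // Hypothetical sketch: a leaf qualifies for branching while it is
    // frequent enough, not yet pure enough, and below the maximal depth.
    IEnumerable<Node> NodesForBranching(Tree tree) =>
        tree.Leaves.Where(n =>
            n.Frequency >= minNodeFrequency   // e.g. 1% of the data rows
            && n.Purity < minNodePurity       // purity = majority-class share in the node
            && n.Depth < maxTreeDepth);       // never branch past the maximal depth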

Implementation in Ferda
- Instead of creating a new data mining tool, the modularity of Ferda was exploited
- Existing data preparation boxes are reused
- 4ft-quantifiers can be used to measure the quality of trees
- The MiningProcessor (bit string generation engine) is used as well

ETree task example
- Existing data preparation boxes
- 4ft-quantifiers
- ETree task box

Output + settings example …

Experiment 1 – Barbora
- Barbora bank, cca 6,100 clients; classification of client status from:
  - Loan amount
  - Client district
  - Loan duration
  - Client salary
- Number of attributes for branching = 4
- Minimal node purity = 0.8
- Minimal node frequency = 61 (1% of the data)

Results – Barbora
[Table per tree depth: Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis – values not preserved]
Performance: 36 verifications/sec

Experiment 2: Forest tree cover
- UCI KDD dataset for classification (10K sample)
- Classification of tree cover based on these characteristics:
  - Wilderness area
  - Elevation
  - Slope
  - Horizontal + vertical distance to hydrology
  - Horizontal distance to fire point
- Number of attributes for branching: 1, 3, 5
- Minimal node purity: 0.8
- Minimal node frequency: 100 (1% of the dataset)

Results – Forest tree cover
- Attributes for branching: 1 – performance: 39 VPS
- Attributes for branching: 3 – performance: 86 VPS
- Attributes for branching: 5 – performance: 71 VPS
[Tables per setting: Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis – values not preserved]

Experiment 3: Forest tree cover
- Construction of trees for the whole dataset (cca 600K rows)
- Question: does increasing the number of attributes for branching result in better trees?
- Tree depth = 3, the other parameters the same as in experiment 2
- Number of attributes for branching = 1: best hypothesis 0.30, 6 VPS (bit strings in cache)
- Number of attributes for branching = 4: best hypothesis 0.52, 2 VPS (bit strings in cache)

Verifications: 4FT vs. ETree
- On tasks over data tables of similar length: 4FT (in Ferda) approx. [value not preserved] VPS, ETree about 70 VPS
- The ETree verification is far more complicated:
  - In addition to computing the quantifier, χ² is computed for each node suitable for branching
  - Hard operations (sums) instead of easy operations (conjunctions, …; see the sketch below)
  - Not only verification of a tree, but also construction of the trees derived from it
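To illustrate the cheap side of that comparison: a 4ft frequency is essentially a conjunction of bit strings, i.e. a bitwise AND followed by a bit count. A minimal C# sketch, illustrative only and not Ferda's MiningProcessor code:

    using System.Numerics;

    static class BitStringSketch
    {
        // Frequency of "a AND b": one AND plus one popcount per 64-bit word.
        public static int ConjunctionCount(ulong[] a, ulong[] b)
        {
            int count = 0;
            for (int i = 0; i < a.Length; i++)
                count += BitOperations.PopCount(a[i] & b[i]);
            return count;
        }
    }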

Further work
- How new/known is the method?
- Boxes for attribute selection criteria
- A classification box
- Better result browsing + result reduction
- Optimization
- Elective classification – each tree has a vote (Petr Berka)
- Experiments with various data sources
- Decision trees from fuzzy attributes
- Better estimation of the number of relevant questions