1
Relevant Characteristics Extraction from Semantically Unstructured Data
PhD title: Data Mining in Unstructured Data
Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2006
2
Contents
- Prerequisites
- Correlation of the SVM kernel's parameters
  - Polynomial kernel
  - Gaussian kernel
- Feature selection using Genetic Algorithms
  - Chromosome encoding
  - Genetic operators
- Meta-classifier with SVM
  - Non-adaptive method: Majority Vote
  - Adaptive methods
    - Selection based on Euclidean distance
    - Selection based on cosine
- Initial data set scalability
- Choosing training and testing data sets
- Conclusions and further work
3
Prerequisites
- Reuters database processing: 806,791 total documents, 126 topics, 366 regions, 870 industry codes
- Industry category selection – "System Software": 7,083 documents (4,722 training / 2,361 testing), 19,038 attributes (features), 24 classes (topics)
- Data representation: Binary, Nominal, Cornell SMART
- Classifier using Support Vector Machine techniques (kernels)
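The three document representations listed above can be sketched as term-weighting functions. This is a minimal illustration; the exact Cornell SMART variant (a doubly damped logarithmic term frequency) is an assumption, since the slide names the schemes without formulas:

```python
import math

def binary(tf):
    # Binary: 1 if the term occurs in the document at all, else 0
    return 1 if tf > 0 else 0

def nominal(tf, max_tf):
    # Nominal: term frequency normalized by the document's most frequent term
    return tf / max_tf if max_tf > 0 else 0.0

def cornell_smart(tf):
    # Assumed Cornell SMART form: damped logarithmic term frequency
    return 0.0 if tf == 0 else 1 + math.log(1 + math.log(tf))
```

Each function maps the raw term frequency of one feature to its weight in the document vector.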
4
Correlation of the SVM kernel’s parameters Polynomial kernel Gaussian kernel
5
Correlation of the polynomial kernel's parameters
- Commonly used polynomial kernel, with d the degree of the kernel and b the offset
- Our suggestion: correlate the two parameters as b = 2 * d
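The suggested correlation can be sketched as code. This assumes the commonly used polynomial kernel form K(x, y) = (x·y + b)^d, which the slide refers to but does not spell out:

```python
import numpy as np

def polynomial_kernel(x, y, d):
    # Assumed common form: K(x, y) = (x . y + b)^d,
    # with the offset tied to the degree as suggested: b = 2 * d
    b = 2 * d
    return (np.dot(x, y) + b) ** d
```

With the correlation in place, only the degree d remains to be tuned.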
6
Bias – Polynomial kernel
7
Correlation of the Gaussian kernel's parameters
- Commonly used Gaussian kernel
- C usually represents the dimension of the feature set
- Our suggestion: use n, the number of distinct features greater than 0
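A sketch of the suggested parameterization, assuming the Gaussian kernel form K(x, y) = exp(−‖x − y‖² / (n · C)); the exact placement of n and C in the exponent is an assumption, since the slide gives only the parameter names:

```python
import numpy as np

def gaussian_kernel(x, y, C):
    # n: number of features that are nonzero in at least one of the two
    # vectors ("distinct features greater than 0") -- an assumption
    n = np.count_nonzero((x != 0) | (y != 0))
    # Assumed kernel form: exp(-||x - y||^2 / (n * C))
    return np.exp(-np.sum((x - y) ** 2) / (n * C))
```

For sparse text vectors, n is far smaller than the full 19,038-dimensional feature space, which keeps the exponent in a useful range.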
8
n – Gaussian kernel
9
Feature selection using Genetic Algorithms
- Chromosome encoding; Fitness(c_i) = SVM(c_i)
- Methods of selecting parents: Roulette Wheel, Gaussian selection
- Genetic operators: Selection, Mutation, Crossover
10
Methods of selecting the parents
- Roulette Wheel: each individual is assigned a slice of the wheel proportional to its fitness
- Gaussian: maximum value m = 1 and dispersion σ = 0.4
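The two parent-selection methods can be sketched as follows. The Roulette Wheel part follows the slide directly; how the Gaussian with m = 1 and σ = 0.4 enters the selection (here: re-weighting normalized fitness before spinning the wheel) is an assumption:

```python
import math
import random

def roulette_wheel(fitness):
    # Each individual occupies a slice of the wheel proportional
    # to its fitness; a uniform draw picks the winning slice
    total = sum(fitness)
    r = random.uniform(0, total)
    acc = 0.0
    for i, f in enumerate(fitness):
        acc += f
        if acc >= r:
            return i
    return len(fitness) - 1

def gaussian_selection(fitness, m=1.0, sigma=0.4):
    # Assumed scheme: normalize fitness to [0, 1], weight each value
    # with a Gaussian centered at m, then spin the wheel on the weights
    best = max(fitness)
    weights = [math.exp(-((f / best - m) ** 2) / (2 * sigma ** 2))
               for f in fitness]
    return roulette_wheel(weights)
```

Here the fitness of a chromosome would be the SVM accuracy obtained with the feature subset it encodes.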
11
The process of obtaining the next generation
12
GA_FS versus SVM_FS for 1309 features
13
Training time, polynomial kernel, d = 2, NOM
14
GA_FS versus SVM_FS for 1309 features
15
Training time, Gaussian kernel, C=1.3, BIN
16
Meta-classifier with SVM
Set of SVMs:
- Polynomial, degree 1, Nominal
- Polynomial, degree 2, Binary
- Polynomial, degree 2, Cornell SMART
- Polynomial, degree 3, Cornell SMART
- Gaussian, C = 1.3, Binary
- Gaussian, C = 1.8, Cornell SMART
- Gaussian, C = 2.1, Cornell SMART
- Gaussian, C = 2.8, Cornell SMART
Upper limit: 94.21%
17
Meta-classifier selection – accuracy (%) by number of features (475 / 1309 / 2488 / 8000) and representation (BIN / NOM / SMART)

Polynomial kernel (by degree):

Features:         475                 1309                2488                8000
Degree      BIN   NOM   SMART   BIN   NOM   SMART   BIN   NOM   SMART   BIN   NOM   SMART
P1.0      82.69 86.64 82.22   81.45 86.69 80.99   82.35 86.30 82.09   80.71 81.95 80.93
P2.0      85.28 86.52 85.11   86.64 85.03 87.11   86.47 85.75 86.64   85.96 85.37 86.01
P3.0      85.54 85.62 85.54   85.79 84.35 86.51   85.28 84.56 85.11   84.43 84.64 82.60
P4.0      81.62 85.79 79.41   74.61 81.54 71.84   78.99 81.79 36.41   76.05 82.56 74.22
P5.0      75.88 85.50  8.59   72.22 80.73  8.34   72.86 81.16  6.81   74.61 80.48 80.42

Gaussian kernel (by C):

Features:         475                 1309                2488                8000
C           BIN   NOM   SMART   BIN   NOM   SMART   BIN   NOM   SMART   BIN   NOM   SMART
C1.0      83.07 41.00 83.16   82.99 41.05 82.99   82.18 41.39 82.52   82.09 43.85 82.26
C1.3      83.63 40.71 83.98   83.74 41.05 83.57   83.11 41.30 83.37   82.35 43.81 82.56
C1.8      82.77 40.62 82.77   83.24 41.09 84.30   82.94 41.39 82.94   82.48 43.98 82.69
C2.1      82.73 40.62 82.82   83.11 41.30 83.83   82.86 41.43 82.99   82.31 44.07 82.65
C2.8      82.71 40.62 82.79   83.01 41.09 83.66   82.80 41.39 82.88   82.11 44.01 81.54
18
Meta-classifier methods
- Non-adaptive method: Majority Vote – each classifier votes a specific class for the current document
- Adaptive methods: compute the similarity between the current sample and the error samples from the self-queue
  - Selection based on Euclidean distance: first good classifier / the best classifier
  - Selection based on cosine: first good classifier / the best classifier / using the average
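One plausible reading of the "best classifier" variants of the two adaptive methods is sketched below: each classifier keeps a queue of samples it previously misclassified, and the meta-classifier picks the classifier whose past errors are least similar to the current sample. The queue handling and the exact selection criterion are assumptions, since the slide only names the methods:

```python
import numpy as np

def sbed_best(sample, error_queues):
    # SBED (assumed): distance from the current sample to each
    # classifier's nearest past error; pick the classifier whose
    # errors are farthest away
    def nearest_error(queue):
        if not queue:
            return float("inf")
        return min(np.linalg.norm(sample - e) for e in queue)
    return max(range(len(error_queues)),
               key=lambda i: nearest_error(error_queues[i]))

def sbcos_best(sample, error_queues):
    # SBCOS (assumed): same idea with cosine similarity; pick the
    # classifier whose past errors are least similar to the sample
    def max_cos(queue):
        if not queue:
            return -1.0
        return max(float(np.dot(sample, e) /
                         (np.linalg.norm(sample) * np.linalg.norm(e)))
                   for e in queue)
    return min(range(len(error_queues)),
               key=lambda i: max_cos(error_queues[i]))
```

The "first good classifier" variants would instead stop at the first classifier whose similarity to past errors falls under a threshold, trading some accuracy for speed.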
19
Selection based on Euclidean distance
20
Selection based on cosine
21
Comparison between SBED and SBCOS
23
Initial data set scalability Decision function Support vectors Representative vectors
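The link between the decision function and the support vectors can be made explicit. For a standard two-class SVM the decision function depends only on the support vectors (a sketch in standard SVM notation; the symbols are not given on the slide):

```latex
f(x) = \operatorname{sgn}\!\left( \sum_{i \in SV} \alpha_i \, y_i \, K(x_i, x) + b \right)
```

Only samples with $\alpha_i > 0$ (the support vectors) contribute, which is why the training set can be reduced to the groups represented by those vectors without changing the decision boundary much.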
24
Initial data set scalability
1. Normalize each sample (7053)
2. Group the initial set based on distance (4474)
3. Take the relevant vector of each group (4474)
4. Use the relevant vectors in the classification process
5. Select only the support vectors (847)
6. Take the samples grouped in the selected support vectors (4256)
7. Make the classification (with 4256 samples)
25
Initial data set scalability
1. Normalize each sample (7053 samples)
2. Group the initial set based on distance (4474 groups)
3. Take the relevant vector of each group (4474 vectors)
4. Use the relevant vectors in the classification process
5. Select only the support vectors (874 support vectors)
6. Take the samples grouped in the selected support vectors (4256 samples)
7. Make the classification (with 4256 samples)
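The first three steps of this reduction can be sketched as code. The greedy distance-based grouping below is an assumption; the slide states that samples are normalized and grouped by distance but not which grouping algorithm is used:

```python
import numpy as np

def reduce_training_set(samples, threshold):
    # Step 1: normalize each sample to unit length
    X = [s / np.linalg.norm(s) for s in samples]
    # Steps 2-3 (assumed greedy grouping): assign each sample to the
    # first representative closer than `threshold`, else it starts a
    # new group and becomes that group's relevant vector
    reps, groups = [], []
    for i, x in enumerate(X):
        for g, r in enumerate(reps):
            if np.linalg.norm(x - r) < threshold:
                groups[g].append(i)
                break
        else:
            reps.append(x)
            groups.append([i])
    return reps, groups
```

The SVM is then trained on the representatives only; afterwards, the groups whose representatives became support vectors are expanded back into their member samples for the final classification.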
26
Polynomial kernel – 1309 features, NOM
27
Gaussian kernel – 1309 features, CS
28
Training time
29
Choosing training and testing data sets
31
Conclusions – other results
- Using our parameter correlation: 3% better for the polynomial kernel, 15% better for the Gaussian kernel
- Reduced the number of features to between 2.5% (475) and 6% (1309) of the original set
- GA_FS is faster than SVM_FS
- Best configurations: polynomial kernel with nominal representation and a small degree; Gaussian kernel with Cornell SMART representation
- The Reuters database is linearly separable
- SBED is better and faster than SBCOS
- Classification accuracy decreases by only 1.2% when the data set is reduced
32
Further work
- Feature extraction and selection
  - Association rules between words (Mutual Information)
  - Synonymy and polysemy problems – using families of words (WordNet)
- Web mining application
- Classifying larger text data sets – a better method of grouping data
- Using classification and clustering together
33
Steps for the classification process
1. Reuters database → group vectors of documents
2. Feature selection (SVM_FS); multi-class classification with polynomial kernel, degree 1 → reduced set of documents
3. Feature selection: Random / Information Gain / SVM_FS / GA_FS
4. Multi-class classification with SVM (polynomial kernel, Gaussian kernel) → classification accuracy; meta-classification accuracy
5. Select only support vectors → one-class classification with SVM (polynomial kernel, Gaussian kernel)
6. Web pages → feature extraction (stop-words, stemming) → document representation
7. Clustering with SVM (polynomial kernel, Gaussian kernel)
8. Meta-classification with SVM: non-adaptive method; adaptive methods (SBED, SBCOS)