Relevant characteristics extraction from semantically unstructured data
PhD title: Data mining in unstructured data
Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2006
Contents
- Prerequisites
- Correlation of the SVM kernel's parameters
  - Polynomial kernel
  - Gaussian kernel
- Feature selection using Genetic Algorithms
  - Chromosome encoding
  - Genetic operators
- Meta-classifier with SVM
  - Non-adaptive method: Majority Vote
  - Adaptive methods
    - Selection based on Euclidean distance
    - Selection based on cosine
- Initial data set scalability
- Choosing training and testing data sets
- Conclusions and further work
Prerequisites
- Reuters database processing: total documents, 126 topics, 366 regions, 870 industry codes
- Industry category selection: "system software" – 7083 documents (4722 training / 2361 testing), attributes (features), 24 classes (topics)
- Data representation: Binary, Nominal, Cornell SMART
- Classifier using Support Vector Machine techniques and kernels
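The three document representations named above (Binary, Nominal, Cornell SMART) are not defined on the slide. A minimal sketch of the term-weighting schemes these labels usually denote, assuming the standard forms (binary presence, frequency normalised by the document's maximum frequency, and the SMART damped-logarithm weighting):

```python
import math

def binary_weight(tf):
    # Binary: 1 if the term occurs in the document, 0 otherwise
    return 1 if tf > 0 else 0

def nominal_weight(tf, max_tf):
    # Nominal: term frequency normalised by the largest
    # term frequency in the same document
    return tf / max_tf if max_tf > 0 else 0.0

def cornell_smart_weight(tf):
    # Cornell SMART: damped logarithmic term frequency
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))
```

The exact formulas used in the thesis may differ; these are the common textbook variants.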
Correlation of the SVM kernel's parameters
- Polynomial kernel
- Gaussian kernel
Polynomial kernel
- Commonly used kernel
- d – degree of the kernel
- b – the offset
- Our suggestion for correlating the parameters: b = 2 * d
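Assuming the standard polynomial kernel form K(x1, x2) = (x1 · x2 + b)^d, the suggested parameter correlation b = 2 * d can be sketched as:

```python
import numpy as np

def polynomial_kernel(x1, x2, d):
    # Polynomial kernel with the suggested offset correlation b = 2 * d,
    # so only the degree d has to be tuned
    b = 2 * d
    return (np.dot(x1, x2) + b) ** d
```

With this correlation the kernel has a single free parameter (the degree), which is what makes the tuning experiments on the slides feasible.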
Bias – Polynomial kernel
Gaussian kernel parameter's correlation
- Gaussian kernel: commonly used kernel
- C – usually represents the dimension of the data set
- Our suggestion: use n, the number of distinct features greater than 0
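A sketch of this idea, assuming the common Gaussian kernel parameterisation K(x1, x2) = exp(-||x1 - x2||² / C), and one plausible reading of n as the count of features that are non-zero in either of the two sparse sample vectors:

```python
import numpy as np

def gaussian_kernel(x1, x2, C):
    # Gaussian (RBF) kernel; C scales the squared Euclidean distance
    diff = x1 - x2
    return np.exp(-np.dot(diff, diff) / C)

def suggested_n(x1, x2):
    # Suggested correlation: instead of the full dimension of the set,
    # use n = number of features greater than 0 (assumed here to mean
    # non-zero in either sample; the exact definition is not on the slide)
    return np.count_nonzero((x1 > 0) | (x2 > 0))
```

For sparse text vectors n is much smaller than the full feature dimension, which keeps the exponent in a useful range.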
n – Gaussian kernel
Feature selection using Genetic Algorithms
- Chromosome encoding
- Fitness(c_i) = SVM(c_i)
- Methods of selecting parents: Roulette Wheel, Gaussian selection
- Genetic operators: Selection, Mutation, Crossover
Methods of selecting the parents
- Roulette Wheel: each individual is represented by a slice of the wheel proportional to its fitness
- Gaussian selection: maximum value (m = 1) and dispersion (σ = 0.4)
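The Roulette Wheel scheme described above can be sketched as follows (a minimal version; the chromosome encoding and fitness evaluation via the SVM are outside this snippet):

```python
import random

def roulette_wheel_select(population, fitness):
    # Spin the wheel: each chromosome occupies a slice proportional
    # to its fitness, so fitter chromosomes are picked more often
    total = sum(fitness)
    pick = random.uniform(0, total)
    cumulative = 0.0
    for chromosome, f in zip(population, fitness):
        cumulative += f
        if pick <= cumulative:
            return chromosome
    return population[-1]  # guard against floating-point rounding
```

Repeated calls yield the parent pool for crossover and mutation in the next generation.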
The process of obtaining the next generation
GA_FS versus SVM_FS for 1309 features
Training time, polynomial kernel, d= 2, NOM
GA_FS versus SVM_FS for 1309 features
Training time, Gaussian kernel, C=1.3, BIN
Meta-classifier with SVM
Set of SVMs:
- Polynomial, degree 1, Nominal
- Polynomial, degree 2, Binary
- Polynomial, degree 2, Cornell SMART
- Polynomial, degree 3, Cornell SMART
- Gaussian, C=1.3, Binary
- Gaussian, C=1.8, Cornell SMART
- Gaussian, C=2.1, Cornell SMART
- Gaussian, C=2.8, Cornell SMART
Upper limit (94.21%)
Meta-classifier selection
(Diagram: polynomial (P) and Gaussian (C) SVM classifiers, each over the Binary (BIN), Nominal (NOM) and Cornell SMART representations)
Meta-classifier methods
- Non-adaptive method: Majority Vote – each classifier votes for a class for the current document
- Adaptive methods – compute the similarity between the current sample and the error samples from the classifier's self-queue
  - Selection based on Euclidean distance (SBED): first good classifier / the best classifier
  - Selection based on cosine (SBCOS): first good classifier / the best classifier / using the average
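The "first good classifier" variant of SBED could be sketched as below. The threshold, the queue contents, and the fallback policy are assumptions; the slide only states that similarity to past error samples drives the selection:

```python
import numpy as np

def select_classifier_sbed(sample, classifiers, error_queues, threshold):
    # Selection Based on Euclidean Distance, "first good classifier":
    # return the first classifier whose recorded error samples are all
    # farther than `threshold` from the current sample, i.e. a classifier
    # that has not been wrong on documents similar to this one.
    for idx, errors in enumerate(error_queues):
        if all(np.linalg.norm(sample - e) > threshold for e in errors):
            return classifiers[idx]
    return classifiers[0]  # assumed fallback when no classifier qualifies
```

The cosine variant (SBCOS) would replace the Euclidean distance with cosine similarity and flip the comparison accordingly.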
Selection based on Euclidean distance
Selection based on cosine
Comparison between SBED and SBCOS
Initial data set scalability
- Decision function
- Support vectors
- Representative vectors
Initial data set scalability
1. Normalize each sample (7053 samples)
2. Group the initial set based on distance (4474 groups)
3. Take the relevant vector of each group (4474 vectors)
4. Use the relevant vectors in the classification process
5. Select only the support vectors (874 support vectors)
6. Take the samples grouped in the selected support vectors (4256 samples)
7. Make the classification (with 4256 samples)
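The normalisation and distance-based grouping steps above can be sketched as follows. The grouping rule (first representative within a fixed radius) and the radius parameter are assumptions; the slide does not specify the clustering criterion:

```python
import numpy as np

def reduce_training_set(samples, radius):
    # Step 1: normalise each sample to unit length
    normalized = [s / np.linalg.norm(s) for s in samples]
    # Steps 2-3: group samples by distance to a group's representative
    # ("relevant") vector; keep one representative per group
    groups, representatives = [], []
    for s in normalized:
        for gi, r in enumerate(representatives):
            if np.linalg.norm(s - r) <= radius:
                groups[gi].append(s)
                break
        else:
            groups.append([s])
            representatives.append(s)
    return representatives, groups
```

Training on the representatives, keeping only the groups whose representatives become support vectors, and then retraining on those groups' members yields the reduced classification problem described in steps 4-7.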
Polynomial kernel – 1309 features, NOM
Gaussian kernel – 1309 features, CS
Training time
Choosing training and testing data sets
Conclusions – other results
- Using our parameter correlation: 3% better for the polynomial kernel, 15% better for the Gaussian kernel
- Reduced number of features, between 2.5% (475) and 6% (1309)
- GA_FS is faster than SVM_FS
- Polynomial kernel with nominal representation and small degree
- Gaussian kernel with Cornell SMART representation
- The Reuters database is linearly separable
- SBED is better and faster than SBCOS
- Classification accuracy decreases by 1.2% when the data set is reduced
Further work
- Feature extraction and selection
  - Association rules between words (Mutual Information)
  - Synonymy and polysemy problems – using families of words (WordNet)
- Web mining application
- Classifying larger text data sets – a better method of grouping data
- Using classification and clustering together
Steps for the Classification Process
- Reuters databases → group vectors of documents
- Feature extraction: stop-words removal, stemming, document representation
- Feature selection: Random, Information Gain, SVM_FS, GA_FS
  - SVM_FS: multi-class classification with polynomial kernel, degree 1; select only support vectors → reduced set of documents
- Multi-class classification with SVM (polynomial kernel, Gaussian kernel) → classification accuracy
- Meta-classification with SVM: non-adaptive method; adaptive methods (SBED, SBCOS) → meta-classification accuracy
- One-class classification with SVM (polynomial kernel, Gaussian kernel)
- Web pages → clustering with SVM (polynomial kernel, Gaussian kernel)