Lab 1 Getting started with CLOP and the Spider package.

CLOP Tutorial
CLOP = Challenge Learning Object Package.
Based on the Spider, developed at the Max Planck Institute.
Two basic abstractions:
- Data object
- Model object

CLOP Data Objects
At the Matlab prompt:
  cd
  use_spider_clop;
  X=rand(10,8);
  Y=[ ]';
  D=data(X,Y); % constructor
  [p,n]=get_dim(D)
  get_x(D)
  get_y(D)

CLOP Model Objects
D is a data object previously defined.
  model = kridge; % constructor
  [resu, model] = train(model, D);
  resu, model.W, model.b0
  Yhat = D.X*model.W' + model.b0
  testD = data(rand(3,8), [ ]');
  tresu = test(model, testD);
  balanced_errate(tresu.X, tresu.Y)

Hyperparameters and Chains
A model often has hyperparameters:
  default(kridge)
  hyper = {'degree=3', 'shrinkage=0.1'};
  model = kridge(hyper);
Models can be chained:
  model = chain({standardize,kridge(hyper)});
  [resu, model] = train(model, D);
  tresu = test(model, testD);
  balanced_errate(tresu.X, tresu.Y)
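The chaining idea can be illustrated outside Matlab. The following is a minimal Python sketch, not CLOP code: the class names `Standardize`, `NearestCentroid` and `Chain` are hypothetical, the toy classifier merely stands in for kridge or svc, and `train` returns only the outputs (CLOP's `train` also returns the trained model). It shows the assumed contract: each step is trained on the output of the previous one, and at test time the data flow through the same trained steps.

```python
from statistics import mean, pstdev


class Standardize:
    """Preprocessing step: center and scale each feature with training statistics."""

    def train(self, X, y):
        columns = list(zip(*X))
        self.mu = [mean(col) for col in columns]
        self.sd = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
        return self.test(X)

    def test(self, X):
        return [[(v - m) / s for v, m, s in zip(row, self.mu, self.sd)] for row in X]


class NearestCentroid:
    """Toy classifier (stand-in for kridge/svc): positive score means class +1."""

    def train(self, X, y):
        self.pos = [mean(col) for col in zip(*[r for r, t in zip(X, y) if t == 1])]
        self.neg = [mean(col) for col in zip(*[r for r, t in zip(X, y) if t == -1])]
        return self.test(X)

    def test(self, X):
        def sq_dist(u, v):
            return sum((a - b) ** 2 for a, b in zip(u, v))

        return [sq_dist(row, self.neg) - sq_dist(row, self.pos) for row in X]


class Chain:
    """Apply the steps in order; each step sees the output of the previous one."""

    def __init__(self, steps):
        self.steps = steps

    def train(self, X, y):
        for step in self.steps:
            X = step.train(X, y)
        return X

    def test(self, X):
        for step in self.steps:
            X = step.test(X)
        return X


X_train = [[0.0, 100.0], [1.0, 110.0], [4.0, 300.0], [5.0, 310.0]]
y_train = [-1, -1, 1, 1]
model = Chain([Standardize(), NearestCentroid()])
train_scores = model.train(X_train, y_train)
test_scores = model.test([[0.5, 105.0], [4.5, 305.0]])
print([1 if s > 0 else -1 for s in test_scores])
```

The point of the design is that the test data are standardized with the statistics learned on the training data, which the chain guarantees by construction.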

Hyper-parameters
http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html
Kernel methods: kridge and svc:
  $k(x, y) = (\mathrm{coef0} + x \cdot y)^{\mathrm{degree}} \exp(-\mathrm{gamma}\, \|x - y\|^2)$
  $k_{ij} = k(x_i, x_j)$
  $k_{ii} \leftarrow k_{ii} + \mathrm{shrinkage}$
Naïve Bayes: naive: none
Neural network: neural: units, shrinkage, maxiter
Random Forest: rf (windows only): mtry
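To make the role of each kernel hyperparameter concrete, here is a small Python sketch of the kernel formula on this slide, reading the product inside the parentheses as a dot product. It is an illustration only, not CLOP's implementation; the function names `clop_style_kernel` and `kernel_matrix` are made up for this example.

```python
import math


def clop_style_kernel(x, y, coef0=1.0, degree=1, gamma=0.0):
    """k(x, y) = (coef0 + x . y)^degree * exp(-gamma * ||x - y||^2)."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return (coef0 + dot) ** degree * math.exp(-gamma * sq_dist)


def kernel_matrix(X, shrinkage=0.0, **hyper):
    """Build K with k_ij = k(x_i, x_j), then add shrinkage to the diagonal."""
    n = len(X)
    K = [[clop_style_kernel(X[i], X[j], **hyper) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += shrinkage
    return K


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Polynomial kernel of degree 3 (gamma=0 switches the exponential factor off).
K_poly = kernel_matrix(X, shrinkage=0.1, coef0=1, degree=3, gamma=0)
# Pure Gaussian kernel (degree=0 switches the polynomial factor off).
K_rbf = kernel_matrix(X, shrinkage=1, coef0=1, degree=0, gamma=1)
```

With `gamma=0` the kernel is purely polynomial and with `degree=0` it is purely Gaussian, so one formula covers both families; `shrinkage` acts as a ridge added to the diagonal of the kernel matrix.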

Lab 2 Getting started with the NIPS 2003 FS challenge

The Datasets
- Arcene: cancer vs. normal with mass-spectrometry analysis of blood serum.
- Dexter: filter texts about corporate acquisition from Reuters collection.
- Dorothea: predict which compounds bind to Thrombin from KDD cup.
- Gisette: OCR digit “4” vs. digit “9” from NIST.
- Madelon: artificial data.

Data Preparation
- Preprocessing and scaling to numerical range 0 to 999 for continuous data and 0/1 for binary data.
- Probes: Addition of “random” features distributed similarly to the real features.
- Shuffling: Randomization of the order of the patterns and the features.
- Baseline error rates (errate): Training and testing on various data splits with simple methods.
- Test set size: Number of test examples needed using the rule of thumb n_test = 100/errate.
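As a quick worked example of the test-set-size rule of thumb, the Python sketch below evaluates n_test = 100/errate for a few baseline error rates. It assumes errate is expressed as a fraction (0.05 for 5%); the function name `required_test_size` is made up for this illustration.

```python
import math


def required_test_size(errate):
    """Rule of thumb from the slide: n_test = 100 / errate (errate as a fraction)."""
    if not 0 < errate <= 1:
        raise ValueError("errate must be a fraction in (0, 1]")
    # The tiny offset guards against floating-point noise before rounding up.
    return math.ceil(100 / errate - 1e-9)


for errate in (0.20, 0.05, 0.01):
    print(f"baseline errate {errate:.0%}: about {required_test_size(errate)} test examples")
```

The lower the baseline error rate, the more test examples are needed, which is why the easier datasets call for the larger test sets.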

Data Statistics
Dataset    Size     Type            Features  Training Examples  Validation Examples  Test Examples
Arcene     8.7 MB   Dense
Gisette    22.5 MB  Dense
Dexter     0.9 MB   Sparse integer
Dorothea   4.7 MB   Sparse binary
Madelon    2.9 MB   Dense

ARCENE
ARCENE is the cancer dataset.
Sources: National Cancer Institute (NCI) and Eastern Virginia Medical School (EVMS).
Three datasets: 1 ovarian cancer, 2 prostate cancer, all preprocessed similarly.
Task: Separate cancer vs. normal.

DEXTER
DEXTER filters texts.
Sources: Carnegie Group, Inc. and Reuters, Ltd.
Preprocessing: Thorsten Joachims.
Task: Filter “corporate acquisition” texts.
Example text: NEW YORK, October 2, 2001 – Instinet Group Incorporated (Nasdaq: INET), the world’s largest electronic agency securities broker, today announced that it has completed the acquisition of ProTrader Group, LP, a provider of advanced trading technologies and electronic brokerage services primarily for retail active traders and hedge funds. The acquisition excludes ProTrader’s proprietary trading business. ProTrader’s 2000 annual revenues exceeded $83 million.

DOROTHEA
DOROTHEA is the Thrombin dataset.
Sources: DuPont Pharmaceuticals Research Laboratories and KDD Cup.
Task: Predict compounds that bind to Thrombin.

GISETTE
GISETTE contains handwritten digits.
Source: National Institute of Standards and Technologies (NIST).
Preprocessing: Yann LeCun and collaborators.
Task: Separate digits “4” and “9”.

MADELON
MADELON is random data.
Source: Isabelle Guyon, inspired by Simon Perkins et al.
Type of data: Clusters on the vertices of a hypercube.

Performance Measures
- Confusion matrix: entries a, b for the first class and c, d for the second class, where a and d count correct classifications and b and c count errors.
- Balanced Error Rate (BER): the average of the error rates for each class: BER = 0.5*(b/(a+b) + c/(c+d)).
- Area Under Curve (AUC): the area under the ROC curve obtained by plotting a/(a+b) against d/(c+d) for each confidence value, starting at (0,1) and ending at (1,0).
- Fraction of Features (FF): the ratio of the number of features selected to the total number of features in the dataset.
- Fraction of Probes (FP): the ratio of the number of “garbage features” (probes) selected to the total number of features selected.
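The four measures can be computed in a few lines. The Python sketch below is an illustration under stated assumptions, not the challenge's scoring code: labels are taken to be -1/+1, with a and b counting the class -1 examples (correct and wrong) and c and d the class +1 examples (wrong and correct), and the AUC is computed with the equivalent pairwise-ranking formula rather than by tracing the curve.

```python
def balanced_error_rate(y_true, y_pred):
    """BER = 0.5 * (b/(a+b) + c/(c+d)) for labels in {-1, +1}."""
    pairs = list(zip(y_true, y_pred))
    a = sum(1 for t, p in pairs if t == -1 and p == -1)  # class -1, correct
    b = sum(1 for t, p in pairs if t == -1 and p == 1)   # class -1, wrong
    c = sum(1 for t, p in pairs if t == 1 and p == -1)   # class +1, wrong
    d = sum(1 for t, p in pairs if t == 1 and p == 1)    # class +1, correct
    return 0.5 * (b / (a + b) + c / (c + d))


def auc(y_true, scores):
    """Area under the ROC curve: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fraction_of_features(n_selected, n_total):
    """FF: selected features over all features in the dataset."""
    return n_selected / n_total


def fraction_of_probes(n_probes_selected, n_selected):
    """FP: probes among the selected features over all selected features."""
    return n_probes_selected / n_selected


y_true = [-1, -1, -1, -1, 1, 1]
scores = [-2.0, -1.0, -0.5, 0.3, 0.8, -0.2]
y_pred = [1 if s > 0 else -1 for s in scores]
print(balanced_error_rate(y_true, y_pred), auc(y_true, scores))
```

On this toy example two of six examples are misclassified (plain error rate 1/3), but the BER is 0.375 because the smaller positive class weighs as much as the larger negative class.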

BER distribution

Power of Feature Selection
Dataset    Best frac. feat  Actual frac. probes
ARCENE     5%               30%
DEXTER     1.5%             50%
DOROTHEA   0.3%             50%
GISETTE    18%              50%
MADELON    1.6%             96%

Visualization
1) Create a heatmap of the data matrix: show(D.train);
2) Look at individual patterns: browse(D.train);
3) Make a scatter plot of the first 2 features: show(D.train);
4) Visualize the results:
   [Dat,Model]=train(model, D.train);
   Dat=test(Model, D.valid);
   roc(Dat);

BER = f(threshold) (DOROTHEA)
[Figure: BER as a function of the threshold theta, on the training set and on the test set.]
No bias adjustment: test BER=22.54%; with bias adjustment: test BER=12.37%.
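The effect of adjusting the threshold can be reproduced on toy data. The Python sketch below scans candidate thresholds and keeps the one with the lowest training BER. It is only an assumed illustration of the idea behind the bias adjustment; how CLOP's `bias` object actually chooses the threshold is not specified here, and the function names are hypothetical.

```python
def ber(y_true, y_pred):
    """Balanced error rate for labels in {-1, +1}."""
    neg_err = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    pos_err = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    n_neg = sum(1 for t in y_true if t == -1)
    n_pos = sum(1 for t in y_true if t == 1)
    return 0.5 * (neg_err / n_neg + pos_err / n_pos)


def ber_at_threshold(y_true, scores, theta):
    """Classify as +1 when the score exceeds theta, then compute the BER."""
    return ber(y_true, [1 if s > theta else -1 for s in scores])


def best_threshold(y_true, scores):
    """Scan midpoints between sorted distinct scores; return (theta, BER) with lowest BER."""
    values = sorted(set(scores))
    candidates = [values[0] - 1.0]
    candidates += [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)
    return min(
        ((theta, ber_at_threshold(y_true, scores, theta)) for theta in candidates),
        key=lambda pair: pair[1],
    )


# Imbalanced toy data: the scores of the rare positive class are shifted below zero.
y = [-1, -1, -1, -1, -1, -1, 1, 1]
scores = [-3.0, -2.5, -2.0, -1.8, -1.5, -1.2, -1.0, 0.5]
print(ber_at_threshold(y, scores, 0.0))
print(best_threshold(y, scores))
```

With the default threshold of zero, one of the two positives is missed, so the BER is 0.25; moving the threshold into the gap between the classes brings it to zero on this toy set. In practice the threshold is tuned on training data and the gain must be checked on held-out data.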

ROC curve (DOROTHEA)
[Figure: ROC curve, sensitivity vs. specificity. AUC=0.91.]

Feature Selection (MADELON, pval_max=0.1)
[Figure: W vs. rank and FDR vs. rank.]

Heat map ARCENE

Scatter plots (ARCENE)
  chain({standardize, s2n('f_max=2'), normalize, my_svc})
  Test BER=49%
  chain({standardize, s2n('f_max=1100'), normalize, gs('f_max=2'), my_svc})
  Test BER=29.37%

Lab 3 Playing with FS filters and classifiers on Madelon and Dexter

Lab 3 software
1) Try the examples in Lab3 README.m
2) Taking inspiration from the examples, write a new feature ranking filter object. Choose one in Chapter 3 or invent your own.
3) Provide the pvalue and FDR (using a tabulated distribution or the probe method).
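As a starting point for step 2, here is a Python sketch of a simple ranking filter based on the usual signal-to-noise criterion, |mu+ - mu-| / (sigma+ + sigma-), computed per feature. It illustrates what a filter has to produce (one score per feature and a ranking); it is not the code of CLOP's `s2n` object, whose exact formula is assumed rather than quoted, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def s2n_scores(X, y):
    """Signal-to-noise score per feature: |mean(+) - mean(-)| / (std(+) + std(-))."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        spread = pstdev(pos) + pstdev(neg)
        # The small constant avoids a division by zero for constant features.
        scores.append(abs(mean(pos) - mean(neg)) / (spread + 1e-12))
    return scores


def rank_features(scores):
    """Feature indices sorted from most to least relevant."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Feature 0 separates the classes well, feature 2 weakly, feature 1 is noise.
X = [
    [5.0, 1.0, 2.0],
    [6.0, 0.0, 3.0],
    [5.5, 1.0, 2.5],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.5],
    [0.5, 0.0, 2.5],
]
y = [1, 1, 1, -1, -1, -1]
scores = s2n_scores(X, y)
ranking = rank_features(scores)
print(ranking)
```

A new filter only needs to replace the scoring function; the ranking step and the selection of the top f_max features stay the same.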

Filters: see chapter 3

(Use Matlab corrcoef. Gives the same results as Ttest, classes are (gives the same results as Ttest. Important for the pvalues: the Fisher criterion needs to be multiplied by num_patt_per_class or use (ranksum test)

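Evaluation of pval and FDR
Ttest object:
- computes pval analytically
- FDR ~ pval*n/n_sc
probe object:
- takes any feature ranking object as an argument (e.g. s2n, relief, Ttest)
- pval ~ n_sp/n_p
- FDR ~ pval*n/n_sc
Here n is the number of candidate features, n_sc the number of selected candidates, n_p the number of probes and n_sp the number of selected probes.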
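The probe method can be sketched in a few lines of Python. This is an illustration under assumptions, not the code of CLOP's `probe` object: the pvalue of a candidate feature is estimated as the fraction of probes scoring at least as high, and the FDR is taken to be the expected number of false positives, pval*n, divided by the number of candidates selected so far, n_sc (an assumed reading of the formula on this slide, capped at 1).

```python
def probe_pval_fdr(candidate_scores, probe_scores):
    """For each candidate, in decreasing score order, return (index, pval, fdr)."""
    n = len(candidate_scores)    # number of candidate (real) features
    n_p = len(probe_scores)      # number of probes
    order = sorted(range(n), key=lambda j: candidate_scores[j], reverse=True)
    results = []
    for n_sc, j in enumerate(order, start=1):
        # Probes that would be selected along with the top n_sc candidates.
        n_sp = sum(1 for s in probe_scores if s >= candidate_scores[j])
        pval = n_sp / n_p
        fdr = min(1.0, pval * n / n_sc)
        results.append((j, pval, fdr))
    return results


candidate_scores = [6.1, 0.4, 0.6, 2.0]
probe_scores = [0.1, 0.5, 0.3, 0.7, 0.2, 0.45, 0.35, 0.15, 0.25, 0.05]
for index, pval, fdr in probe_pval_fdr(candidate_scores, probe_scores):
    print(f"feature {index}: pval={pval:.2f}, FDR={fdr:.2f}")
```

The scores can come from any ranking criterion, which is what makes the probe method applicable to filters such as Relief that have no tabulated distribution.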

Analytic vs. probe
[Figure: FDR vs. rank for Arcene, Dexter, Dorothea, Gisette and Madelon. Red: analytic; blue: probe.]

Relief vs. Ttest (Madelon)
[Figure: pval vs. rank and FDR vs. rank, for Ttest and Relief.]

Lab 4 Playing with feature construction on Gisette

Gisette
Handwritten digits.
Goal: get familiar with the data and result formats. Make a first submission.
- Easiest learning machines (LM) for Gisette: naive and svc.
- Best preprocessing: normalize.
- Easiest feature selection method: s2n.
- Many training examples (6000). Unsuitable for kridge unless subsampling is used.
- Many features (5000). Select features before running neural or rf.

Baseline Model
baselineGisette (BER=1.8%, feat=20%)
  my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
  my_model=chain({normalize, s2n('f_max=1000'), my_classif});

Baseline methods
baselineGisette (CV=1.91%, test=1.80%, feat=20%)
  my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
  my_model=chain({normalize, s2n('f_max=1000'), my_classif});
baselineGisette2 (CV=1.34%, test=1.17%, feat=20%)
  my_model=chain({s2n('f_max=1000'), normalize, my_classif});
pixelGisette (CV=1.31%, test=0.91%)
  my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});
  my_model=chain({normalize, my_classif});

Convolutions (GISETTE, pixelGisette_exp_conv)
  prepro=my_model{1};
  show(prepro.child);
  DD=test(prepro,D.train);
  browse_digit(DD.X, D.train.Y);
  chain({convolve(exp_ker({'dim1=9', 'dim2=9'})), normalize, my_classif})

Principal
Filter bank object retaining the first f_max principal components of the data.
Filter bank containing templates corresponding to f_max cluster centers.

Hadamard bank
@hadamard_bank: Filter bank object performing a Hadamard transform.
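For readers unfamiliar with the Hadamard transform, the Python sketch below computes the unnormalized one-dimensional transform with the standard fast butterfly algorithm. It illustrates the transform itself only; how `@hadamard_bank` lays out its filters or handles image dimensions is not described here, and the function name is made up for this example.

```python
def hadamard_transform(values):
    """Unnormalized fast Walsh-Hadamard transform; the length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart into sums and differences.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


signal = [1, 0, 1, 0]
coefficients = hadamard_transform(signal)
print(coefficients)
```

Each output coefficient is the correlation of the input with one pattern of +1 and -1 values, so the transform acts as a bank of fixed binary filters; applying it twice returns the input multiplied by its length.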

Fourier
Two-dimensional Fourier transform.

Epilogue: Becoming a pro and playing with other datasets

Baseline Methods for the Feature Extraction Class
Isabelle Guyon
BACKGROUND
We present supplementary course material complementing the book “Feature Extraction, Fundamentals and Applications”, I. Guyon et al. Eds., to appear in Springer. Classical algorithms of feature extraction were reviewed in class. More attention was given to feature selection than to feature construction because of the recent success of methods involving a large number of "low-level" features. The book includes the results of the NIPS 2003 feature selection challenge. The students learned techniques employed by the best challengers and tried to match the best performances. A Matlab® toolbox was provided with sample code. The students could make post-challenge entries to:
- Challenge: Good performance + few features.
- Tasks: Two-class classification.
- Data split: Training/validation/test.
- Valid entry: Results on all 5 datasets.
DATASETS
Dataset    Size (MB)  Type    Features  Training  Validation  Test
Arcene     8.7        Dense
Gisette    22.5       Dense
Dexter     0.9        Sparse
Dorothea   4.7        Sparse
Madelon    2.9        Dense
METHODS
- Scoring: Ranking according to test set balanced error rate (BER), i.e. the average of the positive class error rate and the negative class error rate. Ties broken by the feature set size.
- Learning objects: CLOP learning objects implemented in Matlab®. Two simple abstractions: data and algorithm. Download:
- Task of the students: Baseline method provided, with BER0 performance and n0 features. Get BER<BER0, or BER=BER0 but n<n0. Extra credit for beating the best challenge entry. OK to use the validation set labels for training.
RESULTS
ARCENE: cancer diagnosis
Best BER=11.9±1.2% - n0=1100 (11%) - BER0=14.7%
  my_svc=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});
  my_model=chain({standardize, s2n('f_max=1100'), normalize, my_svc})
DEXTER: text categorization
Best BER=3.30±0.40% - n0=300 (1.5%) - BER0=5%
  my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
  my_model=chain({s2n('f_max=300'), normalize, my_classif})
DOROTHEA: drug discovery
Best BER=8.54±0.99% - n0=1000 (1%) - BER0=12.37%
  my_model=chain({TP('f_max=1000'), naive, bias});
GISETTE: digit recognition
Best BER=1.26±0.14% - n0=1000 (20%) - BER0=1.80%
  my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
  my_model=chain({normalize, s2n('f_max=1000'), my_classif});
MADELON: artificial data
Best BER=6.22±0.57% - n0=20 (4%) - BER0=7.33%
  my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
  my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif})
CONCLUSIONS
The performances of the challengers could be matched with the CLOP library. Simple filter methods (S2N and Relief) were sufficient to get a space dimensionality reduction comparable to what the winners obtained. SVMs are easy to use and generally work better than other methods. We experimented with adding prior knowledge about the task on Gisette and could outperform the winners. Further work includes using prior knowledge for other datasets.

Best student results

Agnostic Learning vs. Prior Knowledge challenge
Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley, Olivier Guyon, and many other volunteers, see
Open until August 1st, 2007

Datasets
Dataset  Domain          Type           Features  Training Examples  Validation Examples  Test Examples
ADA      Marketing       Dense
GINA     Digits          Dense
HIVA     Drug discovery  Dense
NOVA     Text classif.   Sparse binary
SYLVA    Ecology         Dense

ADA
ADA is the marketing database.
Task: Discover high revenue people from census data. Two-class problem.
Source: Census bureau, “Adult” database from the UCI machine-learning repository.
Features: 14 original attributes including age, workclass, education, marital status, occupation, native country. Continuous, binary and categorical features.

GINA
GINA is the digit database.
Task: Handwritten digit recognition. Separate the odd from the even digits. Two-class problem with heterogeneous classes.
Source: MNIST database formatted by LeCun and Cortes.
Features: 28x28 pixel map.

HIVA
HIVA is the HIV database.
Task: Find compounds active against the AIDS HIV infection. We reduced it to a two-class problem (active vs. inactive), but provide the original labels (active, moderately active, and inactive).
Data source: National Cancer Inst.
Data representation: The compounds are represented by their 3D molecular structure.

NOVA
NOVA is the text classification database.
Task: Classify newsgroups into politics or religion vs. other topics.
Source: The 20-Newsgroup dataset from the UCI machine-learning repository.
Data representation: The raw text with an estimated words of vocabulary.
Example text: Subject: Re: Goalie masks Lines: 21 Tom Barrasso wore a great mask, one time, last season. He unveiled it at a game in Boston. It was all black, with Pgh city scenes on it. The "Golden Triangle" graced the top, along with a steel mill on one side and the Civic Arena on the other. On the back of the helmet was the old Pens' logo the current (at the time) Pens logo, and a space for the "new" logo. A great mask done in by a goalie's superstition. Lori

SYLVA
SYLVA is the ecology database.
Task: Classify forest cover types into Ponderosa pine vs. everything else.
Source: US Forest Service (USFS).
Data representation: Forest cover type for 30 x 30 meter cells encoded with 108 features (elevation, hill shade, wilderness type, soil type, etc.)

BER distribution (March 1st)
[Figure: BER distributions for the agnostic learning and prior knowledge entries.]
The black vertical line indicates the best ranked entry (only the last 5 entries of each participant were ranked).
Beware of overfitting!

CLOP models

Preprocessing and FS

Model grouping
  for k=1:10
    base_model{k}=chain({standardize, naive});
  end
  my_model=ensemble(base_model);

CLOP models (best entrant)
Juha Reunanen, cross-indexing-7
Dataset  CLOP models selected
ADA      2*{sns,std,norm,gentleboost(neural),bias}; 2*{std,norm,gentleboost(kridge),bias}; 1*{rf,bias}
GINA     6*{std,gs,svc(degree=1)}; 3*{std,svc(degree=2)}
HIVA     3*{norm,svc(degree=1),bias}
NOVA     5*{norm,gentleboost(kridge),bias}
SYLVA    4*{std,norm,gentleboost(neural),bias}; 4*{std,neural}; 1*{rf,bias}
sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown)

CLOP models (2nd best entrant)
Hugo Jair Escalante Balderas, BRun
Dataset  CLOP models selected
ADA      {sns, std, norm, neural(units=5), bias}
GINA     {norm, svc(degree=5, shrinkage=0.01), bias}
HIVA     {std, norm, gentleboost(kridge), bias}
NOVA     {norm,gentleboost(neural), bias}
SYLVA    {std, norm, neural(units=1), bias}
sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown)
Note: entry Boosting_1_001_x900 gave better results, but was older.