Analyze/StripMiner™ Overview

- To obtain an idiot's guide, type "analyze > readme.txt"
- Standard Analyze Scripts
- Predicting on Blind Data
- PLS (Please Listen to Svante Wold)
- LOO, BOO and n-Fold Cross-Validation
- Error Measures
- Albumin Data Set and Feature Selection
- Bio-Informatics

Feature Selection
- Sensitivity Analysis
- Genetic Algorithms
- Correlation GA (GAFEAT)
- Method specific

Learning Modes
- Bootstrapping
- Bagging
- Boosting
- Leave-one-out cross-validation

Data Processing
- Interface with RECON
- Different Scaling Modes
- Outlier detection/data cleansing

Visualization
- Correlation Plots
- 2-D Sensitivity Plots
- Outlier Visualization Plots
- Different Scaling Options
- Cluster Ranking Plots
- Standard ROC curves
- Continuous ROC curves

Modeling
- ANN (Neural Networks)
- SVM (Support Vector Machines)
- PLS (Partial-Least Squares)
- GA-based regression clustering
- PCA regression
- Local Learning
- Outlier Detection (GAMOL)

Code Specifics
- Tight classic C code (< lines)
- Script-Based Shell Program
- Runs on all Platforms
- Ultra Fast

Use
- TransScan
- GE
- KODAK
- Doppler broadening
- Macro-Economics Analysis

Analyze/StripMiner™ Coding Philosophy

- Standard C code that compiles on all platforms (WINDOWS™ and Linux)
- Supporting visualizations use Java and/or gnuplot
- Flexible GUI with sample problems and demos
- Fastest code possible with efficient memory requirements
- Long history of code use with a variety of users for troubleshooting
- Flexible code based on scripts and operators
- Operates on a numeric standard data-mining format file

Practical Tips for PCA

- The NIPALS algorithm assumes the features are zero-centered (see the sketch below)
- It is standard practice to apply Mahalanobis scaling to the data
- PCA regression does not consider the response data
- The t's are called the scores
- It is common practice to drop 4-sigma outlier features (if there are many features)
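A minimal sketch of the NIPALS iteration in Python with NumPy, illustrating the zero-centering the algorithm expects; the function name and defaults are illustrative, not the Analyze/StripMiner internals.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-8, max_iter=500):
    """NIPALS PCA: extracts scores (the t's) and loadings one component at a time."""
    X = X - X.mean(axis=0)                         # NIPALS assumes zero-centered features
    T = np.zeros((X.shape[0], n_components))       # score vectors
    P = np.zeros((X.shape[1], n_components))       # loading vectors
    for k in range(n_components):
        t = X[:, np.argmax(X.var(axis=0))].copy()  # start from a high-variance column
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)                  # loading for the current score
            p /= np.linalg.norm(p)
            t_new = X @ p                          # updated score
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, k], P[:, k] = t, p
        X = X - np.outer(t, p)                     # deflate before the next component
    return T, P
```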

StripMiner Script Examples

- PCA visualization (pca.bat)
- Pharma-plot (pharma.bat)
- Prediction for iris with PCA (iris.bat)
- Bootstrap prediction for iris (iris_boo.bat)
- Predicting with an external test set (iris_ext.bat)
- PLS and ROC curve for the iris problem (roc.bat)
- Leave-One-Out PLS for HIV (loo_hiv.bat)
- Feature selection for HIV (prune.bat)
- Starplots (star.bat)

File Flow for the pca.bat Script

- num_eg.txt contains the number of PCAs (2-10)
- Usually the data are first Mahalanobis scaled (option #-3: "PLS scaling", data only); a scaling sketch follows
- [Flow diagram: num_eg.txt, stats.txt, la_sscala.txt, iris.txt]
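Here "Mahalanobis scaling" is read as the usual autoscaling (zero mean, unit standard deviation per feature); a minimal sketch under that assumption, applying the training-set statistics to any external test set:

```python
import numpy as np

def mahalanobis_scale(train, test=None):
    """Autoscale each feature to zero mean, unit variance, using training statistics."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                     # guard against constant (non-changing) features
    if test is None:
        return (train - mu) / sd
    return (train - mu) / sd, (test - mu) / sd
```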

File Flow for the pharma.bat Script

- num_eg.txt has to contain a 4 for a pharmaplot
- Use pharmaplot.m for visualization in MATLAB
- Adjust the color-setting threshold in pharmaplot.m
- [Flow diagram: num_eg.txt, stats.txt, la_sscala.txt, dmatrix.txt, a.txt, pharmaplot]

File Flow for the iris.bat Script: Predicting Class

- For the random seed in the splitting routine, don't use 0 (it preserves the order); the convention is sketched below
- The test set is really only for validation purposes (the answer is known)
- Note: descaling from PLS uses the la_sscala.txt file
- Notice the q2, Q2, and RMSE error measures
- [Flow diagram: num_eg.txt, stats.txt, la_sscala.txt, a.txt, cmatrix.txt, dmatrix.txt, resultss.xxx, resultss.ttt, results.xxx, results.ttt]
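A rough Python sketch of the seed convention described above (seed 0 keeps the rows in their original order); the function and its signature are illustrative, not the Analyze splitting routine itself.

```python
import numpy as np

def seeded_split(X, y, n_test, seed):
    """Shuffle-split into training and test sets; seed 0 preserves row order."""
    idx = np.arange(len(y))
    if seed != 0:                         # assumed convention: 0 means "do not shuffle"
        np.random.default_rng(seed).shuffle(idx)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```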

File Flow for the iris_boo.bat Script: Bootstrap Validation for Estimating Prediction Confidence

- We use bootstrap cross-validation (e.g., leave 7 out, 100 times); a sketch follows
- Use the MATLAB script dos_mbotw on results.ttt to display results for the test set
- Use the MATLAB script dos_mbotw on resultss.xxx to display results for the training set
- Notice the q2, Q2, and RMSE error measures
- [Flow diagram: num_eg.txt, stats.txt, la_sscala.txt, a.txt, resultss.xxx, resultss.ttt, results.ttt]
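A minimal leave-n-out bootstrap loop in Python, with scikit-learn's PLSRegression standing in for the Analyze model; names and defaults are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def bootstrap_validate(X, y, n_leave_out=7, n_rounds=100, n_components=4, seed=1):
    """Repeatedly hold out a small random test set to estimate prediction confidence."""
    rng = np.random.default_rng(seed)
    preds, trues = [], []
    for _ in range(n_rounds):
        test = rng.choice(len(y), size=n_leave_out, replace=False)
        train = np.setdiff1d(np.arange(len(y)), test)
        model = PLSRegression(n_components=n_components).fit(X[train], y[train])
        preds.append(model.predict(X[test]).ravel())
        trues.append(y[test])
    return np.concatenate(preds), np.concatenate(trues)
```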

Error Measure Criteria

For the training set we use:
- RMSE: root mean square error for the training set
- r2: correlation coefficient for the training set
- R2: PRESS R2

For the validation/test set we use:
- RMSE: root mean square error for the validation set
- q2: 1 - r2 (with r2 computed on the test set)
- Q2: PRESS/SD
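The validation-set measures above translate directly into code; a minimal Python sketch (PRESS is the predicted residual sum of squares, SD the total sum of squares about the mean):

```python
import numpy as np

def error_measures(y_true, y_pred):
    """Validation-set error measures: RMSE, q2 = 1 - r^2, Q2 = PRESS / SD."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    r = np.corrcoef(y_true, y_pred)[0, 1]          # correlation of truth vs. prediction
    press = np.sum((y_true - y_pred) ** 2)         # predicted residual sum of squares
    sd = np.sum((y_true - y_true.mean()) ** 2)     # sum of squares about the mean
    return {"RMSE": rmse, "q2": 1.0 - r ** 2, "Q2": press / sd}
```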

Script for Scaling with an External Test Set

- Option 3305: scatterplot (Java) / scatterplot (gnuplot)
- Option 3313: errorplot (Java) / errorplot (gnuplot)

Docking Ligands is a Nonlinear Problem

PLS, K-PLS, SVM, ANN Feature Selection (data strip mining)

Binding affinities to human serum albumin (HSA): log K'hsa

- Gonzalo Colmenarejo, GlaxoSmithKline; J. Med. Chem. 2001, 44
- 84 training molecules, 10 testing (1 left out)
- 551 Wavelet + PEST + MOE descriptors
- Widely different compounds
- Acknowledgements: Sean Ekins (Concurrent), N. Sukumar (Rensselaer)

Script for ALBUMIN_LOO.BAT: PLS-LOO Validation for the Albumin Data

- PLS-LOO stands for leave-one-out PLS cross-validation (see the sketch below)
- The training set is in cmatrix.ori and the external validation set in dmatrix.ori
- The external validation set has -999 or 0 in the activity field
- Note that we create generic labels and that there is a test set
- Notice the dropping of non-changing features and 4-sigma outliers
- Notice the acrobatics for displaying metrics (visualize with dos_mbotw)
- [Files: cmatrix.ori, dmatrix.ori, num_eg.txt, stats.txt, la_sscala.txt, a.txt, results.xxx, results.ttt, sel_lbls.txt, bbmatrixx.txt, bbmatrixxx.txt]
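A minimal leave-one-out PLS loop in Python, again with scikit-learn's PLSRegression as a stand-in for the Analyze PLS engine:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

def pls_loo(X, y, n_components=4):
    """Each sample is predicted by a PLS model trained on all the other samples."""
    y_hat = np.empty(len(y), dtype=float)
    for train, test in LeaveOneOut().split(X):
        model = PLSRegression(n_components=n_components).fit(X[train], y[train])
        y_hat[test] = model.predict(X[test]).ravel()
    return y_hat
```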

PLS Feature Selection Script for the Albumin Data

- Do several iterative prunings, typically leave 7 out, 100 times
- Use different seeds
- Example schedule of selected feature counts: 400, 300, 200, 150, 120, 100, 80, 60, 50, 45, ... (a sketch of the pruning loop follows)
- [Files: aa.pat, aa.tes, bbmatrixx.txt, bbmatrixxx.txt, sel_lbls.txt, select.txt]
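A sketch of the iterative pruning idea in Python, using the magnitude of the PLS regression coefficient as a stand-in for the bootstrapped sensitivity score that Analyze computes; the schedule mirrors the example counts above.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def prune_features(X, y, schedule=(400, 300, 200, 150, 120, 100, 80, 60, 50, 45),
                   n_components=4):
    """Iteratively keep the features ranked highest by a simple sensitivity proxy."""
    keep = np.arange(X.shape[1])
    for n_keep in schedule:
        if n_keep >= keep.size:
            continue                               # schedule step larger than what is left
        model = PLSRegression(n_components=n_components).fit(X[:, keep], y)
        score = np.abs(model.coef_).ravel()        # |coefficient| as sensitivity proxy
        keep = keep[np.argsort(score)[::-1][:n_keep]]
    return keep                                    # indices of the surviving features
```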

STARPLOT.BAT: Starplot for Selected Features for the Albumin Data

- First generate bbmatrixxx.txt, which contains all sensitivities for (e.g.) 30 bootstraps, using PLS bootstrap option 33
- Generate starplot.txt from bbmatrixxx.txt using option 3320
- Use the MATLAB routine starplot.m (operates on starplot.txt and sel_lbls.txt); a rough matplotlib stand-in is sketched below
- [Files: sel_lbls.txt, aa.pat, bbmatrixxx.txt, starplot.txt]
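A rough Python/matplotlib stand-in for starplot.m, plotting the mean bootstrap sensitivity of each selected feature on one spoke of a polar ("star") plot; purely illustrative of the idea, not a port of the MATLAB routine.

```python
import numpy as np
import matplotlib.pyplot as plt

def starplot(sensitivities, labels):
    """Star plot: one spoke per feature, radius = mean sensitivity over bootstraps."""
    mean_s = sensitivities.mean(axis=0)            # average over bootstrap runs
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
    ax = plt.subplot(polar=True)
    ax.plot(np.r_[angles, angles[0]], np.r_[mean_s, mean_s[0]])  # close the polygon
    ax.set_xticks(angles)
    ax.set_xticklabels(labels)
    plt.show()
```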