Introduction to Generalised Low-Rank Models and Missing Values Jo-fai (Joe) Chow, Data Scientist, joe@h2o.ai, @matlabulous. Based on work by Anqi Fu, Madeleine Udell, Corinne Horn, Reza Zadeh and Stephen Boyd.
About H2O.ai H2O is an open-source, distributed machine learning library written in Java, with APIs in R, Python, Scala and REST/JSON. It is produced by H2O.ai in Mountain View, CA. H2O.ai's advisers include Trevor Hastie, Rob Tibshirani and Stephen Boyd of Stanford.
About Me 2005–2015: Water Engineer (consultant for utilities; EngD research). 2015–present: Data Scientist (Virgin Media, Domino Data Lab, H2O.ai).
About This Talk Overview of generalised low-rank models (GLRM). Four application examples: basics; accelerating machine learning; visualising clusters; imputing missing values. Q & A.
GLRM Overview GLRM is an extension of well-known matrix factorisation methods such as Principal Component Analysis (PCA). Unlike PCA, which is limited to numerical data, GLRM can also handle categorical, ordinal and Boolean data. Given: a data table A with m rows and n columns. Find: a compressed representation as numeric matrices X and Y, where k is a small user-specified rank. Y holds k archetypal features created from the columns of A; X holds the rows of A expressed in the reduced feature space. GLRM can approximately reconstruct A from the product XY. Memory reduction: storing X and Y takes m·k + k·n values instead of the m·n values in A.
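The factorisation on this slide can be sketched in a few lines of plain Python. The numbers below are made up for illustration (not H2O output): a 4 × 3 table A is approximated by X (4 × 2) times Y (2 × 3).

```python
# Toy sketch of the low-rank idea: A ≈ X · Y with k = 2.
# X and Y here are hypothetical values, not fitted by any solver.

def matmul(X, Y):
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

X = [[1, 0],
     [0, 1],
     [1, 1],
     [2, 1]]          # rows of A in the reduced (k = 2) feature space
Y = [[1, 2, 3],
     [4, 5, 6]]       # k = 2 archetypal features over the 3 columns

A_hat = matmul(X, Y)  # approximate reconstruction of A
print(A_hat)          # [[1, 2, 3], [4, 5, 6], [5, 7, 9], [6, 9, 12]]
```

Each row of the reconstruction is a mix of the two archetypes in Y, weighted by the corresponding row of X.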
GLRM Key Features Memory: compress large data sets with minimal loss in accuracy. Speed: reduced dimensionality means shorter model training time. Feature engineering: condensed features can be analysed visually. Missing data imputation: reconstructing the data set automatically imputes missing values.
GLRM Technical References Paper: arxiv.org/abs/1410.0342. Other resources: H2O World video tutorials.
Example 1: Motor Trend Car Road Tests The “mtcars” dataset in R. Original data table A: m = 32 rows, n = 11 columns.
Example 1: Training a GLRM Check the objective value to confirm convergence.
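The convergence check on this slide can be illustrated with a minimal alternating-minimisation sketch, the optimisation idea behind GLRM training: fix Y and solve for X, fix X and solve for Y, and watch the objective fall. This is a rank-1, quadratic-loss toy in pure Python on made-up data; H2O's solver handles general ranks, losses and regularisers.

```python
# Toy alternating minimisation for A ≈ x · y (rank k = 1, squared loss).
# Not H2O's implementation; just the convergence behaviour the slide shows.

def als_rank1(A, iters=20):
    m, n = len(A), len(A[0])
    y = [1.0] * n                      # initial archetype
    losses = []
    for _ in range(iters):
        # update each x_i with y fixed (closed-form least squares)
        x = [sum(A[i][j] * y[j] for j in range(n)) / sum(v * v for v in y)
             for i in range(m)]
        # update each y_j with x fixed
        y = [sum(A[i][j] * x[i] for i in range(m)) / sum(v * v for v in x)
             for j in range(n)]
        # objective: squared reconstruction error over the whole table
        losses.append(sum((A[i][j] - x[i] * y[j]) ** 2
                          for i in range(m) for j in range(n)))
    return x, y, losses

A = [[2.0, 4.0, 6.0],
     [3.0, 6.0, 9.0],
     [1.0, 2.0, 3.0]]                  # an exactly rank-1 toy table
_, _, losses = als_rank1(A)
```

Because each half-step solves its subproblem exactly, the objective never increases; on this exactly rank-1 toy it reaches zero.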
Example 1: X and Y from GLRM With k = 3: X is 32 × 3 (the rows of mtcars in the reduced space); Y is 3 × 11 (the archetypal features).
Example 1: Summary A ≈ XY. Memory reduction: A holds 32 × 11 = 352 values, while X and Y together hold 32 × 3 + 3 × 11 = 129, roughly a 63% saving.
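The memory arithmetic behind this summary is simple enough to check directly (mtcars: 32 rows, 11 numeric columns, compressed with k = 3):

```python
# Back-of-the-envelope memory saving for the mtcars GLRM example.
m, n, k = 32, 11, 3
full = m * n                 # values stored in A
compressed = m * k + k * n   # values stored in X plus Y
print(full, compressed)      # 352 129
print(f"saving ~ {1 - compressed / full:.0%}")  # saving ~ 63%
```

The saving grows with the table: for a tall table (m much larger than n), storage drops roughly by a factor of n / k.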
Example 2: ML Acceleration About the dataset: R package “mlbench”; multi-spectral scanner image data; about 6,000 samples; predictors x1 to x36; response: 6 classes (different types of soil). Goal: use GLRM to compress the predictors.
Example 2: Use GLRM to Speed Up ML Set k = 6 to reduce the 36 predictors to 6 features.
Example 2: Random Forest Train a vanilla H2O Random Forest model with … Full data set (36 predictors) Compressed data set (6 predictors)
Example 2: Results Comparison (10-fold cross-validation) Raw data (36 predictors): training time 4 min 26 s; log loss 0.24553; accuracy 91.80%. GLRM-compressed data (6 predictors): training time 1 min 24 s; log loss 0.25792; accuracy 90.59%. Benefits of GLRM: shorter training time; quick insight before running models on the full data set.
Example 3: Clusters Visualisation About the dataset: the same multi-spectral scanner image data as in example 2 (predictors x1 to x36). Use GLRM to compress the predictors to a 2-D representation, then use the 6 classes to colour the clusters.
Example 3: Clusters Visualisation (Figure: scatter plot of the 2-D GLRM representation, coloured by class.)
Example 4: Imputation “mtcars” again (the same dataset as in example 1), with 50% of the values randomly replaced by missing values.
Example 4: GLRM with NAs When we reconstruct the table from the product XY, the missing values are automatically imputed.
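The imputation mechanism can be sketched on a toy table: fit X and Y against the observed cells only, then read the missing cell off the reconstruction x_i · y_j. This is a rank-1, quadratic-loss, pure-Python toy on made-up data, not H2O's implementation.

```python
# Toy GLRM-style imputation: alternating least squares over observed
# cells only; the reconstruction fills the hole. Hypothetical data.

A = [[1.0, 2.0, 3.0],
     [2.0, 4.0, None],   # None marks the missing cell (true value 6.0)
     [3.0, 6.0, 9.0]]
m, n = len(A), len(A[0])

x, y = [1.0] * m, [1.0] * n
for _ in range(200):
    # each update sums only over the observed cells of its row/column
    x = [sum(A[i][j] * y[j] for j in range(n) if A[i][j] is not None) /
         sum(y[j] ** 2 for j in range(n) if A[i][j] is not None)
         for i in range(m)]
    y = [sum(A[i][j] * x[i] for i in range(m) if A[i][j] is not None) /
         sum(x[i] ** 2 for i in range(m) if A[i][j] is not None)
         for j in range(n)]

imputed = x[1] * y[2]     # the reconstruction at the missing position
print(round(imputed, 3))  # close to the hidden true value 6.0
```

Because the fit never touches the missing cell, the reconstruction there is a genuine prediction from the low-rank structure of the observed data.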
Example 4: Results Comparison Absolute difference between the original and the imputed values. With 50% of the values missing we are asking GLRM to do a difficult job, yet the imputation results look reasonable.
Conclusions GLRM is a great tool for data pre-processing: use it to save memory, speed up machine learning, visualise clusters and impute missing values. Include it in your data pipeline.
Any Questions? Contact: joe@h2o.ai, @matlabulous, github.com/woobe. Slides & code: github.com/h2oai/h2o-meetups. H2O in London: meetups / office (soon); www.h2o.ai/careers. Help: docs & tutorials at www.h2o.ai/docs and university.h2o.ai.