An update on scikit-learn

Slides:

Advertisements

Similar presentations

Random Forest Predrag Radenković 3237/10

Advertisements

scikit-learn Machine Learning in Python Vandana Bachani

Scikit-learn: Machine learning in Python

Sarah Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON.

Ensemble Learning (2), Tree and Forest

Machine Learning Usman Roshan Dept. of Computer Science NJIT.

STUDENTLIFE PREDICTIVE MODELING Hongyu Chen Jing Li Mubing Li CS69/169 Mobile Health March 2015.

 The Weka The Weka is an well known bird of New Zealand..  W(aikato) E(nvironment) for K(nowlegde) A(nalysis)  Developed by the University of Waikato.

Practical Machine Learning Pipelines with MLlib Joseph K. Bradley March 18, 2015 Spark Summit East 2015.

I'm thinking of a number. 12 is a factor of my number. What other factors MUST my number have?

Gap filling of eddy fluxes with artificial neural networks

Combining multiple learners Usman Roshan. Bagging Randomly sample training data Determine classifier C i on sampled data Goto step 1 and repeat m times.

Today Ensemble Methods. Recap of the course. Classifier Fusion

TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio.

USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu.

Data analysis tools Subrata Mitra and Jason Rahman.

Locally Linear Support Vector Machines Ľubor Ladický Philip H.S. Torr.

CSC 478 Programming Data Mining Applications Course Summary Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.

Combining multiple learners Usman Roshan. Decision tree From Alpaydin, 2010.

Python Lab Matplotlib - I Proteomics Informatics, Spring 2014 Week 9 25 th Mar, 2014

Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.

Machine Learning Usman Roshan Dept. of Computer Science NJIT.

Data Mining, Machine Learning, Data Analysis, etc. scikit-learn

A Smart Tool to Predict Salary Trends of H1-B Holders

Machine Learning with Spark MLlib

Building Machine Learning System with Python

Machine Learning for Data Certification at CMS

Lecture 17. Boosting¶ CS 109A/AC 209A/STAT 121A Data Science: Harvard University Fall 2016 Instructors: P. Protopapas, K. Rader, W. Pan.

Trees, bagging, boosting, and stacking

Chapter 6 Classification and Prediction

TeamMember1 TeamMember2 Machine Learning Project 2016/2017-2

Boosting and Additive Trees

Data Mining 101 with Scikit-Learn

Basic machine learning background with Python scikit-learn

Prepared by Kimberly Sayre and Jinbo Bi

A brief introduction to neural network

Machine Learning Week 1.

CIKM Competition 2014 Second Place Solution

Classifying enterprises by economic activity

CMPT 733, SPRING 2016 Jiannan Wang

CIKM Competition 2014 Second Place Solution

Balance Scale Data Set This data set was generated to model psychological experimental results. Each example is classified as having the balance scale.

Brief Intro to Python for Statistics

Machine Learning with Weka

Application of Logistic Regression Model to Titanic Data

The Assistive System Progress Report 2 Shifali Kumar Bishwo Gurung

Implementing AdaBoost

Analytics: Its More than Just Modeling

An Experimental Study of the Potential of Using Small

Data Mining, Machine Learning, Data Analysis, etc. scikit-learn

Recitation #2 Tel Aviv University 2017/2018 Slava Novgorodov

Data Mining, Machine Learning, Data Analysis, etc. scikit-learn

Recitation #2 Tel Aviv University 2016/2017 Slava Novgorodov

Dr. Sampath Jayarathna Cal Poly Pomona

Python for Data Analysis

OHDSI Gold Standard Phenotype Library Working Group

CMPT 733, SPRING 2017 Jiannan Wang

The Student’s Guide to Apache Spark

Manisha Panta, Avdesh Mishra, Md Tamjidul Hoque, Joel Atallah

Stock Predictions Project Presentation

Igor Stančin, Alan Jović to: {igor.stancin,

Machine Learning in Business John C. Hull

Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017

Machine Learning for Cyber

Machine Learning and Its Applications in Molecular Biophysics Jacob Andrzejczyk and Harish Vashisth Department of Chemical Engineering, University of New.

An introduction to Machine Learning (ML)

An introduction to neural network and machine learning

Machine Learning for Cyber

Machine Learning for Cyber

Machine Learning for Cyber

Presentation transcript:

An update on scikit-learn 0.20 and beyond Andreas Müller Columbia University, scikit-learn

What is scikit-learn?

Basic API review

X = y = Representing Data one sample outputs / labels one feature 1.1 2.2 3.4 5.6 1.0 6.7 0.5 0.4 2.6 1.6 2.4 9.3 7.3 6.4 2.8 1.5 0.0 4.3 8.3 3.5 8.1 3.6 4.6 5.1 9.7 7.9 3.7 7.8 3.2 6.3 1.6 2.7 4.4 0.5 0.2 5.6 6.7 X = y = outputs / labels one feature

clf = RandomForestClassifier() clf.fit(X_train, y_train) Training Data Model Training Labels Test Data Prediction y_pred = clf.predict(X_test) Test Labels Evaluation clf.score(X_test, y_test)

Basic API estimator.fit(X, [y]) estimator.predict estimator.transform Classification Preprocessing Regression Dimensionality reduction Clustering Feature selection Feature extraction

Cross -Validated Grid Search from sklearn.model_selection import GridSearchCV from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y) param_grid = {'C': 10. ** np.arange(-3, 3), 'gamma': 10. ** np.arange(-3, 3)} grid = GridSearchCV(SVC(), param_grid=param_grid) grid.fit(X_train, y_train) grid.score(X_test, y_test) y_pred = grid.predict(X_test)

Pipelines

Pipelines from sklearn.pipeline import make_pipeline pipe = make_pipeline(StandardScaler(), SVC()) pipe.fit(X_train, y_train) pipe.predict(X_test) param_grid = {'svc__C': 10. ** np.arange(-3, 3), 'svc__gamma': 10. ** np.arange(-3, 3)} pipe = make_pipeline(StandardScaler(), SVC()) grid = GridSearchCV(pipe, param_grid=param_grid, cv=5) grid.fit(X_train, y_train)

0.20.0

OneHotEncoder for Strings

ColumnTransformer numeric_features = ['age', 'fare'] numeric_transformer = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())]) categorical_features = ['embarked', 'sex', 'pclass'] categorical_transformer = make_pipeline( SimpleImputer(strategy='constant', fill_value='missing'), OneHotEncoder()) preprocessor = make_column_transformer( (numeric_features, numeric_transformer), (categorical_features, categorical_transformer), remainder='drop') clf = make_pipeline(preprocessor, LogisticRegression())

PowerTransformer

Missing Value treatment Scalers SimpleImputer MissingIndicator

TransformedTargetRegressor

OpenML Dataset Loader

Loky A Robust and reusable Executor https://loky.readthedocs.io/en/stable/ An alternative for multiprocessing.pool.Pool and concurrent.futures.ProcessPoolExecutor No need for if __name__ == "__main__": in scripts Deadlock free implementation Consistent spawn behavior No random crashes with odd BLAS / OpenMP libraries

Global config and working memory

Gradient Boosting Early Stopping Activated by “n_iter_no_change”

SGD et al early stopping

Better defaults All random forests: n_estimators from 10 to 100 (in 0.22) Cross-validation: cv from 3 to 5 (in 0.22) Grid-Search: iid to False (in 0.22) Remove iid (in 0.24)

LogisticRegression defaults solver=’lbfgs’ (from ‘liblinear’) multiclass=’auto’ (from ‘ovr’)

A note on deprecations ...

Matplotlib based tree plotting Graphviz Our own!

Work In Progress So now, lets talk to the somewhat further-out future. What are the plans and ideas we have?

A (draft) Roadmap https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018

Dropping Python 2.7 (and 3.4!)

A new __repr__ Old new

A new __repr__ compared to old:

Loky / OpenMP

Imbalanced-learn integration

Pandas & Feature Names

amueller.github.io @amuellerml @amueller t3kcit @gmail.com Alright, buy my book. You can also find a lot of free videos and notebooks about machine learning and scikit-learn on my website. All the code from the book is BSD-licensed and on github. Thank you for your attention and I’m happy to take any questions!