An update on scikit-learn


An update on scikit-learn 0.20 and beyond Andreas Müller Columbia University, scikit-learn

What is scikit-learn?

Basic API review

Representing Data
X: a 2D array of shape (n_samples, n_features), one row per sample and one column per feature
y: a 1D array of outputs / labels, one entry per sample
[slide shows an example numeric data matrix X alongside its label vector y]

clf = RandomForestClassifier()
clf.fit(X_train, y_train)        # Training Data + Training Labels -> Model
y_pred = clf.predict(X_test)     # Test Data -> Prediction
clf.score(X_test, y_test)        # Test Data + Test Labels -> Evaluation
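The slide assumes pre-existing train/test splits. A minimal self-contained run of the same fit/predict/score pattern, using the built-in iris dataset as stand-in example data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)          # train on the training split
y_pred = clf.predict(X_test)       # predict labels for the test split
acc = clf.score(X_test, y_test)    # mean accuracy on the test split
```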

Basic API
estimator.fit(X, [y])
estimator.predict: classification, regression, clustering
estimator.transform: preprocessing, dimensionality reduction, feature selection, feature extraction
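The fit/transform side of the API can be sketched with a standard scaler on a tiny hand-made array (the data here is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                    # learn per-feature mean and standard deviation
X_scaled = scaler.transform(X)   # center and rescale each column
```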

Cross-Validated Grid Search

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
param_grid = {'C': 10. ** np.arange(-3, 3),
              'gamma': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(SVC(), param_grid=param_grid)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)
y_pred = grid.predict(X_test)

Pipelines

Pipelines

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)

param_grid = {'svc__C': 10. ** np.arange(-3, 3),
              'svc__gamma': 10. ** np.arange(-3, 3)}
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

0.20.0

OneHotEncoder for Strings
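As of 0.20, OneHotEncoder accepts string-valued columns directly instead of requiring integer-encoded categories. A minimal sketch with made-up color labels:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["red"]])

enc = OneHotEncoder()                     # works on strings out of the box
X_onehot = enc.fit_transform(X).toarray() # densify the sparse result for display
# enc.categories_ holds the learned category labels per column
```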

ColumnTransformer

numeric_features = ['age', 'fare']
numeric_transformer = make_pipeline(SimpleImputer(strategy='median'),
                                    StandardScaler())
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder())
preprocessor = make_column_transformer(
    (numeric_features, numeric_transformer),
    (categorical_features, categorical_transformer),
    remainder='drop')
clf = make_pipeline(preprocessor, LogisticRegression())

PowerTransformer
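PowerTransformer (new in 0.20) applies a Yeo-Johnson or Box-Cox power transform to make skewed features more Gaussian. A minimal sketch on synthetic log-normal (right-skewed) data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))   # heavily right-skewed example data

pt = PowerTransformer()             # method='yeo-johnson' by default
X_gauss = pt.fit_transform(X)       # roughly Gaussian; standardized by default
```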

Missing Value Treatment: Scalers, SimpleImputer, MissingIndicator
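SimpleImputer fills in missing entries while MissingIndicator produces a boolean mask of where they were. A minimal sketch on a small hand-made array:

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

imp = SimpleImputer(strategy="mean")
X_filled = imp.fit_transform(X)   # NaNs replaced by per-column means

ind = MissingIndicator()
mask = ind.fit_transform(X)       # True where a value was originally missing
```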

TransformedTargetRegressor
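TransformedTargetRegressor (new in 0.20) transforms the target before fitting and maps predictions back automatically. A minimal sketch, fitting a linear model in log-space on a synthetic exponential target:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(1, 10, size=(100, 1))
y = np.exp(0.3 * X.ravel())       # target is exponential in X

# The regressor sees log(y); predictions are mapped back through exp.
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
y_pred = reg.predict(X)
```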

OpenML Dataset Loader

Loky: a robust and reusable Executor
https://loky.readthedocs.io/en/stable/
- An alternative to multiprocessing.pool.Pool and concurrent.futures.ProcessPoolExecutor
- No need for if __name__ == "__main__": in scripts
- Deadlock-free implementation
- Consistent spawn behavior
- No random crashes with odd BLAS / OpenMP libraries

Global config and working memory
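The global configuration is set with sklearn.set_config and inspected with sklearn.get_config. A minimal sketch; the 512 MiB value is just an illustration:

```python
import sklearn

# assume_finite skips NaN/inf input validation for speed;
# working_memory (in MiB) caps temporary arrays in chunked computations.
sklearn.set_config(assume_finite=True, working_memory=512)
cfg = sklearn.get_config()                # snapshot of the current settings

sklearn.set_config(assume_finite=False)   # restore the safer default
```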

Gradient Boosting Early Stopping, activated by the n_iter_no_change parameter
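With n_iter_no_change set, gradient boosting holds out a validation fraction and stops adding stages once the score stops improving. A minimal sketch on synthetic data; the fitted attribute n_estimators_ reports how many stages were actually built:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

gb = GradientBoostingClassifier(n_estimators=1000,       # generous upper bound
                                n_iter_no_change=5,      # patience in stages
                                validation_fraction=0.1,
                                random_state=0)
gb.fit(X, y)
# gb.n_estimators_ is typically far below 1000 when early stopping kicks in
```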

SGD et al early stopping
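SGD-based estimators grew a similar early_stopping option in 0.20. A minimal sketch on synthetic data; n_iter_ reports how many epochs actually ran:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, random_state=0)

sgd = SGDClassifier(max_iter=1000,
                    early_stopping=True,      # hold out a validation set
                    n_iter_no_change=5,       # patience in epochs
                    validation_fraction=0.1,
                    random_state=0)
sgd.fit(X, y)
```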

Better defaults
- All random forests: n_estimators from 10 to 100 (in 0.22)
- Cross-validation: cv from 3 to 5 (in 0.22)
- Grid search: iid to False (in 0.22); iid removed (in 0.24)

LogisticRegression defaults: solver='lbfgs' (was 'liblinear'), multi_class='auto' (was 'ovr')
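In releases where this change has landed, a plain LogisticRegression() already picks up the new solver without any arguments:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()   # solver defaults to 'lbfgs' in recent releases
```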

A note on deprecations ...

Matplotlib-based tree plotting: previously Graphviz, now our own!
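This refers to sklearn.tree.plot_tree, which renders a fitted tree with matplotlib alone, with no Graphviz install needed. A minimal sketch on the iris dataset, using a headless backend so no display is required:

```python
import matplotlib
matplotlib.use("Agg")            # headless backend: render without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

fig, ax = plt.subplots(figsize=(8, 5))
artists = plot_tree(tree, ax=ax, filled=True)  # one annotation per tree node
```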

Work In Progress So now, let's talk about the somewhat further-out future: what are the plans and ideas we have?

A (draft) Roadmap https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018

Dropping Python 2.7 (and 3.4!)

A new __repr__ [slide compares the old and new estimator representations]

A new __repr__ compared to old:
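The difference is controlled by the print_changed_only setting: when enabled, an estimator's repr shows only parameters that differ from their defaults. A minimal sketch comparing the two:

```python
import sklearn
from sklearn.linear_model import LogisticRegression

sklearn.set_config(print_changed_only=True)
short_repr = repr(LogisticRegression(C=10.0))  # only non-default parameters
sklearn.set_config(print_changed_only=False)
full_repr = repr(LogisticRegression(C=10.0))   # every parameter spelled out
sklearn.set_config(print_changed_only=True)    # restore the newer default
```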

Loky / OpenMP

Imbalanced-learn integration

Pandas & Feature Names

amueller.github.io @amuellerml @amueller t3kcit@gmail.com Alright, buy my book. You can also find a lot of free videos and notebooks about machine learning and scikit-learn on my website. All the code from the book is BSD-licensed and on GitHub. Thank you for your attention, and I'm happy to take any questions!