Published by Pedro Soto. Modified over 9 years ago.
1
scikit-learn: Machine Learning in Python
Vandana Bachani
Spring 2012
2
Outline
- What is scikit-learn?
- How can it be useful to the lab?
- There are other packages too!
- Features
- Usage
- Conclusion
3
What is scikit-learn?
- scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib).
- A comprehensive package for all machine learning needs.
- Faster
- Accuracy? If you have the right data, it is pretty loyal.
Ref:
4
Ref: http://scikit-learn.org/stable/
5
How can it be useful to the lab?
Our daily jobs:
- Regression/Prediction
- Text Classification
- Text Feature Extraction
- Text Feature Selection (using Chi-Square and other metrics)
- Cross-Validation (K-Fold)
- Clustering (K-Means, etc.)
Maybe in future:
- Image Classification
All in one package!
6
There are other packages too!
NLTK | Orange | scikit-learn
Machine Learning + Text Processing + … | Machine Learning + visualizations | Machine Learning + Machine Learning
Mature (Book exists!) | Naïve and sophisticated | New, still developing
Documentation: Not so great. | Good. Sufficient code examples. | Documentation: Very good, but incomplete
Lacks in functionality (w.r.t. ML), old school | Lacks a lot of functionality (unsupervised learning) | Almost complete w.r.t. machine learning + additional utilities
Complicated to use | Easy to use | Easy and intuitive to use
Good Metrics Support
Rest API
No API support
7
Features: Linear Models
Regression (Predicting Continuous Values)
- Example: Prices of houses (Boston house dataset)
- Linear, Ridge, Lasso (for sparse coefficients, useful in the field of compressed sensing), LARS (very high-dimensional data), Bayesian
Classification
- Logistic Regression, Stochastic Gradient Descent
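As a rough illustration of how the regression models listed above differ, the sketch below fits plain least squares, Ridge and Lasso to the same data. The synthetic dataset and the alpha values are assumptions made up for this example (they are not from the talk), and the code assumes a recent scikit-learn release.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data (an assumption for illustration): y depends only on the
# first two of five features, plus a little Gaussian noise.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ true_coef + 0.1 * rng.randn(200)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),  # L2 penalty: shrinks all coefficients a little
    "lasso": Lasso(alpha=0.1),  # L1 penalty: can set coefficients exactly to zero
}

coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model.coef_
    print(name, np.round(model.coef_, 2))
```

On data like this, Lasso is expected to zero out the three irrelevant features, which is the "sparse coefficients" property mentioned on the slide.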
8
Features: Support Vector Machines
Classification
- SVC (one-vs-one), LinearSVC (one-vs-rest)
Regression
- SVR
Density Estimation & Outlier Detection (unsupervised learning)
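A minimal sketch of the SVM estimators named above, assuming a recent scikit-learn release. The iris dataset and the parameter values (`nu`, `gamma`, `max_iter`) are choices made for this example rather than anything prescribed by the talk.

```python
from sklearn import datasets
from sklearn.svm import SVC, LinearSVC, OneClassSVM

iris = datasets.load_iris()
X, y = iris.data, iris.target

# SVC handles multi-class problems with a one-vs-one scheme,
# LinearSVC with a one-vs-rest scheme.
svc = SVC(kernel="linear").fit(X, y)
linear_svc = LinearSVC(dual=False, max_iter=10000).fit(X, y)
svc_acc = svc.score(X, y)            # accuracy on the training data
linear_acc = linear_svc.score(X, y)
print("SVC:", svc_acc, "LinearSVC:", linear_acc)

# OneClassSVM is unsupervised: it is fitted without labels and
# predicts +1 for inliers and -1 for outliers.
detector = OneClassSVM(nu=0.1, gamma="scale").fit(X)
labels = detector.predict(X)
print("flagged as outliers:", int((labels == -1).sum()))
```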
9
Features: Unsupervised Learning
Clustering
- K-Means, Mean Shift, Spectral Clustering
- Ward (hierarchical, constructs tree)
Manifold Learning
- Dimensionality Reduction (for visualization, etc.)
Novelty and Outlier Detection
- Uses SVM
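To make the clustering entry concrete, here is a small K-Means sketch. The three well-separated synthetic blobs are an assumption for illustration, and the code assumes a recent scikit-learn release.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated 2-D blobs of 50 points each
# (made up for this example).
rng = np.random.RandomState(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.randn(50, 2) for c in centers])

# n_clusters must be chosen up front; n_init restarts guard against
# a poor random initialisation.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.round(km.cluster_centers_, 1))
print(km.labels_[:5], km.labels_[50:55], km.labels_[100:105])
```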
10
Features: Miscellaneous
Nearest neighbors
- Unsupervised, Classification
Decision Trees
- Classification, Regression
Gaussian Processes
- Regression
Metrics
- metrics.roc_curve(y_true, y_score)
- metrics.precision_recall_fscore_support(...)
joblib and pickle
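The two metric functions and the pickle-based persistence mentioned above can be sketched as follows. The labels, scores and the tiny training set are invented for the example, and the code assumes a recent scikit-learn release.

```python
import pickle
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Hand-written labels and scores (illustrative only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve returns false/true positive rates at each score threshold.
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
auc = metrics.auc(fpr, tpr)
print("AUC:", auc)

# precision_recall_fscore_support works on hard predictions.
y_pred = [0, 1, 1, 1]
precision, recall, f1, _ = metrics.precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(precision, recall, f1)

# A fitted estimator can be serialised with pickle and restored later.
X = [[0.0], [1.0], [2.0], [3.0]]
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
restored = pickle.loads(pickle.dumps(clf))
print(restored.predict(X))
```

joblib offers `dump` and `load` functions that play the same role as pickle and are usually preferred for estimators holding large numpy arrays.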
11
Features: Cross-Validation, Datasets, Feature Extraction
Cross-Validation
- cross_validation.KFold(n, k[, indices])
Datasets
Feature Extraction
- Text
  - feature_extraction.text.WordNGramAnalyzer([...])
  - feature_extraction.text.CharNGramAnalyzer([...])
- Image
  - feature_extraction.image.extract_patches_2d(...)
Feature Selection
- feature_selection.chi2(X, y)
- feature_selection.SelectKBest(score_func[, k])
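The names on this slide come from the scikit-learn release that was current at the time of the talk. The sketch below shows the same three steps (n-gram text features, chi-square feature selection, K-fold splitting) under the assumption of a recent release, where `KFold` lives in `sklearn.model_selection` and word or character n-grams are configured through `CountVectorizer`. The four toy documents and their labels are made up for illustration.

```python
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus (hypothetical): 0 = sports, 1 = music.
docs = [
    "the team won the match",
    "the striker scored a goal",
    "the band released an album",
    "the singer played a concert",
]
labels = [0, 0, 1, 1]

# Word unigrams and bigrams; analyzer="char" would give character n-grams.
vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print("n_samples: %d, n_features: %d" % X.shape)

# Keep the 5 features with the highest chi-square score against the labels.
selector = SelectKBest(chi2, k=5)
X_best = selector.fit_transform(X, labels)
print("after selection:", X_best.shape)

# K-fold split: each fold is a (train_indices, test_indices) pair.
folds = list(KFold(n_splits=2).split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```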
12
Usage
Linear Regression
>>> from sklearn import linear_model
>>> clf = linear_model.LinearRegression()
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
Classification
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2")
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, class_weight=None, eta0=0.0, fit_intercept=True,
        learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
        penalty='l2', power_t=0.5, rho=0.85, seed=0, shuffle=False, verbose=0)
13
Usage: SVC & Cross-Validation
>>> from sklearn import datasets
>>> from sklearn import svm
>>> from sklearn import cross_validation
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear')
>>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([ , , , , ])
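The `cross_validation` module used on this slide was later removed from scikit-learn. Assuming a recent release, the same experiment can be sketched with `cross_val_score` imported from `sklearn.model_selection`. The exact scores depend on the installed version, so none are quoted here.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear")

# cv=5 splits the data into 5 folds; each fold is used once as the
# held-out test set, giving one accuracy score per fold.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print("mean accuracy: %0.2f" % scores.mean())
```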
14
Sample Code

from time import time
from sklearn import metrics
from sklearn.feature_extraction.text import Vectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# trainData, testData and categories are defined elsewhere (not shown on the slide)
data_train, data_test = trainData.data, testData.data
y_train, y_test = trainData.target, testData.target

print "Extracting features from the training dataset"
t0 = time()
# can use a specific analyzer to be passed to vectorizer
# by default WordNGramAnalyzer is used
vectorizer = Vectorizer()
X_train = vectorizer.fit_transform(data_train)
print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape

print "Extracting features from the test dataset"
X_test = vectorizer.transform(data_test)
print "n_samples: %d, n_features: %d" % X_test.shape

def classify(clf, X_train, y_train, X_test, y_test):
    # time training and prediction separately
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print "train time: %0.3fs" % train_time
    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print "test time: %0.3fs" % test_time
    print "classification report:"
    print metrics.classification_report(y_test, pred, target_names=categories)

penalty = "l2"
# LinearSVC can be tried with L1, L2 penalties
print "LinearSVC"
linearSVC = LinearSVC(loss='l2', penalty=penalty, C=1000, dual=False, tol=1e-3)
classify(linearSVC, X_train, y_train, X_test, y_test)

# SGDClassifier
print "SGDClassifier"
sgdClf = SGDClassifier(alpha=.0001, n_iter=50, penalty=penalty)
classify(sgdClf, X_train, y_train, X_test, y_test)

print "NaiveBayes - Bernoulli"
bernoulliNBClf = BernoulliNB(alpha=.01)
classify(bernoulliNBClf, X_train, y_train, X_test, y_test)
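For readers on Python 3 and a recent scikit-learn release, the same workflow might look roughly like the sketch below. Several things here are assumptions rather than the talk's own code: `TfidfVectorizer` stands in for the older `Vectorizer`, `max_iter` replaces `n_iter`, LinearSVC's default squared-hinge loss replaces `loss='l2'`, and the toy corpus and category names are hypothetical placeholders for the lab's data.

```python
from time import time
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the real train/test data.
categories = ["SPORTS", "MUSIC"]
data_train = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the team",
    "the band released a new album",
    "the singer played a live concert",
    "the album topped the music charts",
]
y_train = [0, 0, 0, 1, 1, 1]
data_test = ["the team scored a goal", "the band played a concert"]
y_test = [0, 1]

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(data_train)
X_test = vectorizer.transform(data_test)
print("n_samples: %d, n_features: %d" % X_train.shape)


def classify(clf, X_train, y_train, X_test, y_test):
    """Fit clf, time training and prediction separately, return predictions and report."""
    t0 = time()
    clf.fit(X_train, y_train)
    print("train time: %0.3fs" % (time() - t0))
    t0 = time()
    pred = clf.predict(X_test)
    print("test time: %0.3fs" % (time() - t0))
    report = metrics.classification_report(
        y_test, pred, target_names=categories, zero_division=0
    )
    return pred, report


classifiers = {
    "LinearSVC": LinearSVC(penalty="l2", C=1000, dual=False, tol=1e-3),
    "SGDClassifier": SGDClassifier(alpha=1e-4, max_iter=50, penalty="l2", random_state=0),
    "BernoulliNB": BernoulliNB(alpha=0.01),
}

results = {}
reports = {}
for name, clf in classifiers.items():
    print(name)
    results[name], reports[name] = classify(clf, X_train, y_train, X_test, y_test)
    print(reports[name])
```

With a corpus this small the reported scores mean little; the point is only the shape of the pipeline: vectorize, fit, predict, report.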
15
Sample Results
SGDClassifier
train time: 1.505s
test time: 0.023s
classification report:
              precision  recall  f1-score  support
TECHNOLOGY
IDIOMS
POLITICAL
MUSIC
GAMES
SPORTS
MOVIES
CELEBRITY
avg / total
16
Conclusion
- If you are a Python person: seems like a good library.
- NLTK + scikit-learn should make an excellent pair for our lab.
- Good documentation wins!
17
Thanks