Getting Python To Learn From Only Parts Of Your Data

Getting Python To Learn From Only Parts Of Your Data
Ami Tavory, Final, Israel
PyConTW 2013

Outline: Introduction; Model Selection And Cross Validation; Model Assessment And The Bootstrap; Conclusions

Prevalence Of Computing & Python – Two Implications
General computing: tasks & calculations unlimited by human limitations.
Python (& high-level languages): low bar for new-area experimentation.
In the context of scientific Python: a challenge and an opportunity. As time goes by...

Low Bar For Experimentation & Model Selection
Web developer example. Site revenues seem inversely related to clicks-to-purchase & page load time. Can we quantify this using SciPy? Polyfit? Linear regression (unconstrained, ridge, lasso, elastic net, total)? Random forests? Etc.
Sample data (Clicks To Purchase, Page Load Time (secs.), Daily Revenue (K $)): 3 1 20 2 0.2 400 0.4 40 0.1 500 10

Computing Power & Model Assessment
(Another web developer example.) "The XYZ framework processes requests in under 2.5 msec." Transaction processing times (msecs.):
1.71 2.55 12.23 3.42 2.58 0.03 2.46 0.01 2.56 1.17 1.46 1.22 1.51 3.6 1.9 23.99 1.12 2.73 2.21 1.81 2.22 2.63 8.13 2.23 0.00 2.0 2.34 3.2 3.4
Without computers, we have classical statistics: with high probability, the mean is between 1.89 and 4.63.
With computers, there is no need to limit ourselves to low-computation statistics. We can, e.g., assess the median and its confidence.

This Talk
Two problems, a common solution: Getting Python To Learn From Only Parts Of Your Data.
The talk will use no math, and short NumPy / SciPy / SciKits snippets.

Outline Introduction Model Selection And Cross Validation Model Assessment And The Bootstrap Conclusions

Example Domain – Hooke's Law
Basic experiment with springs. Hooke's Law: force is proportional to displacement, F ~ x (F = force, x = displacement).
>>> from numpy import *
>>> from scipy import *
>>> from matplotlib.pyplot import *
>>>
>>> # measure 8 displacements
>>> x = arange(1, 9)
>>> # note measurement errors
>>> F = x + 3 * random.randn(8)
>>> plot(x, F, 'bo', label = 'Experimental')
[<matplotlib.lines.Line2D object at 0x31c5f90>]
>>> show()

Example Domain – Hooke's Law – Experiment Results (figure: scatter plot of F against x)

Example Domain – Hooke's Law – Experiment & Theoretical Results (figure: the scatter plot of F against x, with the theoretical line F = x)

Example Predictor – Polynomial Fitting
Polynomial fit (scipy.polyfit):
>>> help(polyfit)
polyfit(x, y, deg)
    Least squares polynomial fit.
    Fit a polynomial ``p(x) = p[0] * x**deg + ... + p[deg]`` of degree `deg`
    to points `(x, y)`. Returns a vector of coefficients `p` that minimises
    the squared error.

Example Predictor – Polynomial Fitting Applied
Polynomial fit (scipy.polyfit):
>>> x = arange(1, 9)
>>> F = x + 3 * random.randn(8)
>>> # Fit the data to a straight line.
>>> print polyfit(x, F, 1)
[ 1.13 -0.23]
>>> # F ~ 1.13 * x - 0.23

The Problem – Which Model?
>>> x = arange(1, 9)
>>> F = x + 3 * random.randn(8)
>>> # Fit the data to a straight line.
>>> print polyfit(x, F, 1)
[ 1.13 -0.23]
>>> # F ~ 1.13 * x - 0.23
Cheating! We knew in advance that the model was a straight line. What types of problems are we considering? E.g., finding a model relating # clicks and load time to revenue.

The Problem – Alternatives?
>>> x = arange(1, 9)
>>> F = x + 3 * random.randn(8)
>>> # Fit the data to a straight line.
>>> print polyfit(x, F, 1)
[ 1.13 -0.23]
>>> # F ~ 1.13 * x - 0.23
>>> # Alternatively, fit the data to a parabola.
>>> print polyfit(x, F, 2)
[-0.18  2.56 -2.38]
>>> # F ~ -0.18 * x^2 + 2.56 * x - 2.38

The Problem – Alternatives In Depth – Straight Line (figure: the linear fit over the data)

The Problem – Alternatives In Depth – Parabola (figure: the parabolic fit over the data)

The Problem – Alternatives Comparison

The Solution – Observation
Getting Python To Learn From Only Parts Of Your Data: to assess models, treat some of the training data as test data.

The Solution – Cross Validation
On average, the linear model has lower error than the parabolic model.

Let's See It In Code...
>>> from sklearn.cross_validation import *
>>>
>>> for tr, te in KFold(8):
...     print tr, te
...
[3 4 5 6 7] [0 1 2]
[0 1 2 6 7] [3 4 5]
[0 1 2 3 4 5] [6 7]
Assume:
>>> # build_model(train_x, train_F) builds a model from train_x, train_F
>>> # find_err(model, test_x, test_F) evaluates model on test_x, test_F
>>> err = 0
>>> for tr, te in KFold(8):
...     m = build_model(x[tr], F[tr])
...     err += find_err(m, x[te], F[te])

Outline: Introduction; Model Selection And Cross Validation; Model Assessment And The Bootstrap; Conclusions

Example Domain – Processing Time
Transaction processing times (msecs.) of XYZ. What is the "typical" processing time?
>>> t = [1.71, 2.55, 12.23, 3.42, 2.58, 0.03, 2.46, 0.01, 2.56, 1.17,
...      1.46, 1.22, 1.51, 3.6, 1.9, 23.99, 1.12, 2.73, 2.21, 1.81,
...      2.22, 2.73, 2.63, 8.13, 2.23, 0.00, 2.0, 2.34, 3.2, 1.9, 3.4]

Example Domain – Processing Time (figure: the data visualized)

Example Domain And Model – Mean
The mean is 3.25, with confidence interval 0.7. But what about outliers?

Example Domain And Model – Trimmed Mean

The Problem – Model Assessment
The trimmed mean is 2.27. But what is its confidence interval?

The Problem – Model Assessment 1.71 2.55 12.23 3.42 0.03 0.01 2.58 2.46 1.17 1.22 2.56 1.46 3.6 1.9 1.51 1.12 1.81 23.99 2.73 2.0 2.34 2.63 2.23 2.22 8.13 2.21 3.2 0.00 3.4 2.27

The Problem – Model Assessment 2.27

The Solution – Observation
Getting Python To Learn From Only Parts Of Your Data: create new datasets by randomly sampling the original data (with replacement).
(figure: the sample values and their trimmed mean, 2.27)

The Solution – Bootstrapping
(figures: several datasets resampled, with replacement, from the original sample; the trimmed mean recomputed on each; the spread of these estimates around 2.27)

Let's See It In Code...
>>> from numpy import *
>>> from scipy.stats.mstats import *
>>> from scikits.bootstrap.bootstrap import *
>>>
>>> t = [1.71, 2.55, 12.23, 3.42, 2.58, 0.03, 2.46, 0.01, 2.56, 1.17,
...      1.46, 1.22, 1.51, 3.6, 1.9, 23.99, 1.12, 2.73, 2.21, 1.81,
...      2.22, 2.73, 2.63, 8.13, 2.23, 0.00, 2.0, 2.34, 3.2, 1.9, 3.4]
>>> print mean(t), trimmed_mean(t)
3.25967741935 2.2664
>>> print ci(t, trimmed_mean)
[ 1.7604  3.2816]

Outline: Introduction; Model Selection And Cross Validation; Model Assessment And The Bootstrap; Conclusions

Conclusions
Computers & Python availability implies that:
we are faced with varied, configurable prediction techniques;
we are unconstrained by pen-and-paper-only statistical methods.
We can use Python's scientific libraries to address this: cross validation and bootstrapping.

Thanks! Thank you for your time!