2 CSCI 347, Data Mining. Evaluation: Training and Testing, Section 5.1, pages 147-150

3 Question There are many learning algorithms. How can we compare their effectiveness?

4 Evaluation Process
• Have two large datasets, a training dataset and a testing dataset, both representative of the underlying problem.
• Run a wide variety of algorithms on the training dataset. Different algorithms will produce different models, i.e. find different patterns; trying many algorithms makes it likely that a hidden pattern will be found by at least one of them.
• Test each model on the test dataset to see which performs best (a sketch follows).
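A minimal sketch of this process in Python with scikit-learn (an assumed toolkit; the slides do not prescribe one). The synthetic dataset, the particular classifiers, and the 70/30 split are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a large, representative dataset
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Try a variety of algorithms so a hidden pattern is likely
# to be found by at least one of them.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)                 # learn on the training set
    acc = accuracy_score(y_test, model.predict(X_test))  # judge on the held-out test set
    print(f"{name}: test accuracy = {acc:.3f}")
```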

5 Independent and Representative Data As long as the training and testing datasets are independent of each other and representative of the underlying problem, the performance measured on the test dataset is likely to match reality. That is, when the model is used on new data, the error rate should be close to the rate estimated on the test set.

6 How Much Data How much data is enough? It depends on:
• The algorithms being used
• The complexity of the data
• The required success rate; in some cases the cost of misclassification is much more serious than in others
• The relative frequency of the possible outcomes

7 Statistics Statisticians have spent years developing tests for determining the smallest sample that can be used to produce an accurate model.

8 Quote “Data mining is useful when the sheer volume of data obscures patterns that might be detectable in smaller databases. Generally start with tens of thousands, if not millions of pre-classified records. If data is scarce, data mining is unlikely to be useful.” From Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, by Berry and Linoff.

9 Rare Class Values The target variable might represent something relatively rare:
• Prospects responding to a direct-mail offer
• Credit card holders committing fraud
• Newspaper subscribers canceling their subscription in a given month

10 Recommendation The training set should be balanced, with equal numbers of each of the outcomes. A smaller, balanced sample is preferable to a larger one with a very low proportion of the rare outcome; a sketch of simple undersampling follows.
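One common way to achieve this balance is to undersample the majority class. The helper below is a hypothetical illustration, not from the slides, assuming NumPy arrays X (features) and y (class labels):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the subsample is reproducible

def balance_by_undersampling(X, y):
    """Return a subset of (X, y) with equal numbers of every class label."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()  # size of the rarest class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Example: 1,000 rows with roughly 2% positives; after balancing,
# both classes have the same (small) count.
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.02).astype(int)
X_bal, y_bal = balance_by_undersampling(X, y)
print(np.bincount(y_bal))  # equal counts per class (exact value depends on the seed)
```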

11 Limited Quality Data Quality data that is representative of the underlying problem is hard to come by.

12 Satellite Imagery of Sea Ice

13 Electrical Load Forecasting We want to predict future demand for power as far in advance as possible. Accurate predictions allow fine-tuning of:
• Operating reserves
• Maintenance scheduling
• Fuel inventory management
Data was collected over 15 years. Major holidays, such as Thanksgiving, Christmas, and New Year's Day, show significant variation from normal loads. (Page 24)

14 Diagnosis of Electromechanical Failures Preventative maintenance of electromechanical devices such as motors and generators can forestall failures that disrupt industrial processes. The available data covered 600 faults, each comprising a set of measurements along with an expert's diagnosis, representing 20 years of experience. Half were unsatisfactory for various reasons and had to be discarded; the remainder were used as training examples. (Page 25)

15 Validation Data Set In many cases three datasets are needed:
• A training dataset for training and selecting among the learning algorithms
• A validation dataset for setting parameters on the chosen learning algorithm
• A testing dataset for determining the accuracy of the final model
A sketch of such a three-way split follows.
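A minimal sketch of a three-way split, again using scikit-learn; the 60/20/20 proportions are an illustrative assumption, not prescribed by the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# First carve off 60% for training, then split the remaining 40%
# evenly into validation (20%) and test (20%).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Fit each candidate on (X_train, y_train), choose the winner and tune its
# parameters on (X_val, y_val), then measure accuracy once on (X_test, y_test).
```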

16 Maximizing Training Once the error rate has been measured on the test set, re-bundle all three datasets and train the final model on all of the data, but don't re-measure the error rate! The retrained model has now seen the test data, so any new measurement would be optimistically biased.
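Continuing the names from the three-way-split sketch above (best_model is a hypothetical stand-in for whichever candidate won on the validation set):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

best_model = DecisionTreeClassifier(random_state=0)  # stand-in for the winning candidate

# Re-bundle training, validation, and test data and refit on everything.
X_all = np.concatenate([X_train, X_val, X_test])
y_all = np.concatenate([y_train, y_val, y_test])
best_model.fit(X_all, y_all)

# Do NOT call best_model.score(X_test, y_test) now: the model has seen
# that data, so the score would no longer estimate real-world error.
```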

17 How Close to the True Success Rate? Toss a coin 100 times and get 75 heads. Estimated success rate: f = 75/100 = 0.75. Toss a coin 1,000 times and get 750 heads. Estimated success rate: f = 750/1000 = 0.75. The two estimates are identical, but the one based on 1,000 tosses should be closer to the true rate.

18 Confidence Intervals Statistical theory provides confidence intervals for the true underlying proportion. Toss a coin 100 times and get 75 heads: estimated success rate f = 75/100 = 0.75; with 80% confidence, the true success rate lies in [0.691, 0.801]. Toss a coin 1,000 times and get 750 heads: estimated success rate f = 750/1000 = 0.75; with 80% confidence, the true success rate lies in [0.732, 0.767]. A sketch reproducing these intervals follows.
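These intervals can be reproduced with the Wilson score interval for a Bernoulli proportion (my assumption about the formula behind the slide's numbers; z = 1.28 is the two-sided normal quantile for 80% confidence):

```python
import math

def wilson_interval(successes, n, z=1.28):
    """80%-confidence Wilson score interval for a success proportion."""
    f = successes / n                       # observed success rate
    center = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

print(wilson_interval(75, 100))    # ~(0.691, 0.801), as on the slide
print(wilson_interval(750, 1000))  # ~(0.732, 0.767), as on the slide
```

Note how the interval narrows as n grows even though f stays at 0.75: more tosses mean a more trustworthy estimate.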

