Published by Grace Shepherd. Modified over 9 years ago.
1
Science in Business Data Mining?
Background: support managerial decision making
Is there a science to data mining (with CI-methods)?
  YES, but it depends (and it may be empirical wizardry driven by efficiency rather than effectiveness!)
Outline
1. Data Mining in Business & Management
2. Rules established in business practices vs. data mining?
   1. Statistics vs. data-driven modelling
   2. A personal view
3. How to develop meta-knowledge
Sven F. Crone, Lancaster University Management School, Research Centre for Forecasting
2
Business Data Mining?
Main areas for Data Mining:
  Finance: Credit risk (personal & corporate)
  Marketing: Customer Relationship Management (= Direct Marketing, Database Marketing)
Application examples: Churn Prediction, Credit Scoring, Direct Marketing (adapted from Berry and Linoff (2004) and Olafson et al (2006))
3
Best practices
Credit Scoring
  Small & balanced classes
    Use 2000 of minority class
    Use undersampling
  Discretise all (!) variables
    Binary dummies / WOE to capture non-linearity
  Use logistic regression
Cross-Selling
  Large & imbalanced sample
    Use large sample sizes
    Original (imbalanced) class distribution
…
A personal view:
  Data selection is best using prior domain knowledge (use filters)
  Pre-processing more important than method [Crone et al, 2006; Keogh 2002]
  (Balanced) sampling & pre-processing is method dependent
  Best practices exist & are domain dependent (e.g. homogeneous datasets in credit scoring)
  Flat Maximum effect [Lovie & Lovie, 1986]
GAP
  Extensive use of expert domain knowledge
  Efficient solution ≠ best
  Practitioners & consultants use statistics
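The credit scoring practices named on this slide (undersampling to balanced classes, then discretising variables and coding each bin by its weight of evidence, WOE) can be sketched roughly as follows. This is only an illustration and not the speaker's procedure: the function names, the dictionary row format, the additive smoothing constant, and the sign convention (WOE as the log of the share of goods over the share of bads) are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def undersample(rows, label_key, seed=0):
    """Randomly drop rows from larger classes until all classes are equal in size.

    `rows` is a list of dicts; `label_key` names the class label (hypothetical format).
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced


def weight_of_evidence(bins, labels, smoothing=0.5):
    """WOE per bin: ln(share of goods in the bin / share of bads in the bin).

    Assumed coding: label 1 = bad, label 0 = good.
    `smoothing` is added to every cell so empty cells do not produce log(0).
    """
    goods = Counter(b for b, y in zip(bins, labels) if y == 0)
    bads = Counter(b for b, y in zip(bins, labels) if y == 1)
    all_bins = sorted(set(bins))
    total_good = sum(goods.values()) + smoothing * len(all_bins)
    total_bad = sum(bads.values()) + smoothing * len(all_bins)
    woe = {}
    for b in all_bins:
        good_share = (goods[b] + smoothing) / total_good
        bad_share = (bads[b] + smoothing) / total_bad
        woe[b] = math.log(good_share / bad_share)
    return woe
```

Replacing each discretised variable by its WOE value gives a logistic regression a single numeric input per variable while still capturing a non-linear relationship between the raw variable and the outcome, which is the motivation the slide gives for binary dummies / WOE.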
4
How to derive (meta-)knowledge?
Lessons from other disciplines: Time Series Forecasting
  More "evidence-based methods" [Armstrong 2000]
  Empirical evidence
    Conditions under which methods perform well (multiple hypotheses)
    Domain-specific competitions (valid & reliable)
    Multiple out-of-sample evaluations (≠ single fold, one origin)
    Multiple homogeneous datasets from one domain
    Use of valid benchmark methods & unbiased error measures
    Honour the domain & decision context (active learning, cost sensitive)
  Replications
    Studies must allow replications: document all steps / parameters
STOP FINE-TUNING / MARGINAL EXTENSION OF SINGLE METHOD ON SINGLE TOY DATASET
Develop solutions for the domain (Why make life harder?)
Where to start? Follow a high-impact approach!
  Identify most prominent application domains (e.g. credit risk)
  Select promising application domains for CI-methods
  Get corporate sponsor & run competition
  Analyse conditions (!) using meta-studies!
  Embed findings as methodology in SOFTWARE
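The call for multiple out-of-sample evaluations and valid benchmark methods can be illustrated with a rolling-origin evaluation, as used in time series forecasting. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, the naive last-value forecast stands in for a benchmark, and mean absolute error is used only for simplicity, not because the slides prescribe it.

```python
def naive_forecast(history, horizon):
    """Benchmark: repeat the last observed value for every step ahead."""
    return [history[-1]] * horizon


def mean_forecast(history, horizon):
    """Candidate method for the example: forecast the historical mean."""
    return [sum(history) / len(history)] * horizon


def rolling_origin_mae(series, forecaster, first_origin, horizon):
    """Mean absolute error pooled over every forecast origin.

    For each origin, the forecaster sees only series[:origin] and is scored
    on the next `horizon` observations, so no evaluation uses future data.
    """
    errors = []
    for origin in range(first_origin, len(series) - horizon + 1):
        history = series[:origin]
        actual = series[origin:origin + horizon]
        predicted = forecaster(history, horizon)
        errors.extend(abs(a - p) for a, p in zip(actual, predicted))
    if not errors:
        raise ValueError("series too short for the given origin and horizon")
    return sum(errors) / len(errors)


def relative_mae(series, forecaster, benchmark, first_origin, horizon):
    """Error of a method relative to the benchmark (below 1 beats the benchmark)."""
    method = rolling_origin_mae(series, forecaster, first_origin, horizon)
    base = rolling_origin_mae(series, benchmark, first_origin, horizon)
    return method / base
```

Reporting the error relative to a simple benchmark, pooled over many origins and ideally over many series from one domain, is one way to avoid conclusions that rest on a single split of a single dataset.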
5
Literature
Ian Ayres (2007) Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart, Bantam
Thomas H. Davenport, Jeanne G. Harris (2007) Competing on Analytics: The New Science of Winning, Harvard Business School Press
Fildes, Nikolopoulos, Crone, Syntetos (2009) Forecasting and Operational Research: a Review, JORS, forthcoming
Finlay, Crone (under review) Sampling issues in Credit Scoring: the effect of sample size and sample distribution on predictive accuracy, EJOR
Keogh, Kasetty (2002, 2004) On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration, SIGKDD'02 & Data Mining Journal