Science in Business Data Mining?

Presentation transcript:

Science in Business Data Mining?
Sven F. Crone, Lancaster University Management School, Research Centre for Forecasting

Background: support managerial decision making
Is there a science to data mining (with CI-methods)?
YES, but it depends (and it may be empirical wizardry driven by efficiency rather than effectiveness!)

Outline
1. Data Mining in Business & Management
2. Rules established in business practices vs. data mining?
   1. Statistics vs. data-driven modelling
   2. A personal view
3. How to develop meta-knowledge

Business Data Mining?

Main areas for Data Mining:
- Finance: Credit risk (personal & corporate)
- Marketing: Customer Relationship Management (= Direct Marketing, Database Marketing)

Example applications: Churn Prediction, Credit Scoring, Direct Marketing (adapted from Berry and Linoff (2004) and Olafson et al (2006))

Best practices

Credit Scoring
- Small & balanced classes
  - Use 2000 of minority class
  - Use undersampling
- Discretise all (!) variables
  - Binary dummies / WOE to capture non-linearity
- Use logistic regression

Cross-Selling
- Large & imbalanced sample
  - Use large sample sizes
  - Original (imbalanced) class distribution
- …

A personal view:
- Data selection is best using prior domain knowledge (use filters)
- Pre-processing more important than method [Crone et al, 2006; Keogh 2002]
- (Balanced) sampling & pre-processing is method dependent
- Best practices exist & are domain dependent (e.g. homogeneous datasets in credit scoring)
- Flat Maximum effect [Lovie & Lovie, 1986]

GAP → Extensive use of expert domain knowledge → efficient solution ≠ best
Practitioners & consultants use statistics
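The credit scoring practices listed above (undersampling to balanced classes, then weight-of-evidence coding of discretised variables) can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the talk: the function names, the `bad` label key, the `tenure_bin` variable, the synthetic data and the additive `smoothing` constant are all assumptions, and the sign convention ln(share of goods / share of bads) is one common choice among several.

```python
import math
import random
from collections import Counter


def undersample(rows, label_key="bad", n_minority=2000, seed=0):
    """Keep up to n_minority minority-class rows plus an equal number of majority rows."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    k = min(n_minority, len(minority))
    return rng.sample(minority, k) + rng.sample(majority, k)


def weight_of_evidence(rows, bin_key, label_key="bad", smoothing=0.5):
    """WOE per bin: ln(share of goods in the bin / share of bads in the bin).

    `smoothing` is an assumed additive constant that avoids log(0) for empty cells.
    """
    goods = Counter(r[bin_key] for r in rows if r[label_key] == 0)
    bads = Counter(r[bin_key] for r in rows if r[label_key] == 1)
    bins = set(goods) | set(bads)
    total_good = sum(goods.values()) + smoothing * len(bins)
    total_bad = sum(bads.values()) + smoothing * len(bins)
    return {
        b: math.log(((goods[b] + smoothing) / total_good)
                    / ((bads[b] + smoothing) / total_bad))
        for b in bins
    }


# Synthetic applicants (illustrative only): shorter tenure means a higher bad rate.
demo_rng = random.Random(1)
bad_rate = {"short": 0.20, "medium": 0.08, "long": 0.03}
applicants = []
for _ in range(5000):
    tenure = demo_rng.choice(sorted(bad_rate))
    is_bad = 1 if demo_rng.random() < bad_rate[tenure] else 0
    applicants.append({"tenure_bin": tenure, "bad": is_bad})

balanced = undersample(applicants, n_minority=2000)
woe = weight_of_evidence(balanced, "tenure_bin")
for name in ("short", "medium", "long"):
    print(name, round(woe[name], 3))
```

The WOE values would then replace the raw categories as inputs to a logistic regression, which lets a linear model pick up a non-linear relationship between the original variable and the default rate.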

How to derive (meta-)knowledge?

Lessons from other disciplines: Time Series Forecasting
More "evidence-based methods" [Armstrong 2000]
- Empirical Evidence
  - Conditions under which methods perform well (multiple hypotheses)
  - Domain specific competitions (valid & reliable)
  - Multiple out-of-sample evaluations (≠ single fold, one origin)
  - Multiple homogeneous datasets from one domain
  - Use of valid benchmark methods & unbiased error measures
  - Honour the domain & decision context (active learning, cost sensitive)
- Replications
  - Studies must allow replications: document all steps / parameters

→ STOP FINE-TUNING / MARGINAL EXTENSION OF SINGLE METHOD ON SINGLE TOY DATASET
→ Develop solutions for domain (Why make life harder?)

Where to start? → Follow high impact approach!
- Identify most prominent application domains (e.g. credit risk)
- Select promising application domains for CI-methods
- Get corporate sponsor & run competition
- Analyse conditions (!) using meta-studies!
- Embed findings as methodology in SOFTWARE
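The point about multiple out-of-sample evaluations against a valid benchmark can be illustrated with a rolling-origin evaluation, sketched below under stated assumptions. The function names, the choice of mean absolute error as the error measure, the naive and mean forecasters and the toy series are all hypothetical; the slides do not prescribe any of them.

```python
def naive_forecast(history, horizon):
    """Benchmark method: repeat the last observed value for every step ahead."""
    return [history[-1]] * horizon


def mean_forecast(history, horizon):
    """Candidate method for comparison: forecast the mean of the history."""
    level = sum(history) / len(history)
    return [level] * horizon


def rolling_origin_mae(series, forecaster, first_origin, horizon=1):
    """Mean absolute error pooled over every forecast origin from first_origin on.

    Each origin uses only the observations before it, so every error is
    out-of-sample; pooling over origins avoids relying on a single split.
    """
    errors = []
    for origin in range(first_origin, len(series) - horizon + 1):
        history = series[:origin]
        actual = series[origin:origin + horizon]
        forecast = forecaster(history, horizon)
        errors.extend(abs(f - a) for f, a in zip(forecast, actual))
    return sum(errors) / len(errors)


# Toy series (illustrative only).
sales = [12, 14, 13, 15, 17, 16, 18, 21, 20, 22]
for label, method in (("naive", naive_forecast), ("mean", mean_forecast)):
    print(label, rolling_origin_mae(sales, method, first_origin=5, horizon=2))
```

Reporting a candidate method next to the naive benchmark over many origins, and ideally over many homogeneous series from one domain, is what separates this from a single-fold, single-origin comparison.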

Literature

- Ian Ayres (2007) Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart, Bantam
- Thomas H. Davenport, Jeanne G. Harris (2007) Competing on Analytics: The New Science of Winning, Harvard Business School Press
- Fildes, Nikolopoulos, Crone, Syntetos (2009) Forecasting and Operational Research – a Review, JORS, forthcoming
- Finlay, Crone (under review), Sampling issues in Credit Scoring – the effect of sample size and sample distribution on predictive accuracy, EJOR
- Keogh, Kasetty (2002, 2004) On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration, SIGKDD'02 & Data Mining Journal