Data Mining in Action: A Case Study Drew Minkin Archaeus Design Systems minkind@archaeusdesign.com SQL Saturday #52 - Colorado
Drew Minkin Past: Analytics Architect at Zilliant; Senior Consultant, Fujitsu; 6+ years Microsoft Services; Escalation Engineer; Dedicated Field Engineer (“Alliance”); Local speaker for SQL and BI; Lecturer, SMU’s Business School BI Grad Cert Program. Present: Business Intelligence Architect at FiServ ISV; Predictive Analytics Architect, Archaeus Design Systems
Agenda: Data Mining Intro; DM Methodology; Data Concepts; Validating and Testing Models; Applying Output with Scorecards
For Future Reference http://archive.ics.uci.edu/ml/ http://www.kdnuggets.com/
Data Mining Intro
Data Mining Intro: Methodology; Architecture; Information Flow; Technologies
Data Mining in the BI Spectrum
Data Mining Information Flow
Data Mining Architecture
Data Mining Methodology
Methodology: Problem Definition; Data Modeling; Data Discovery; Analytics Modeling; Applied Analytics; Model Validation
Problem Description: business case and non-technical details of the predictive analytics inquiry. Business objectives and success criteria; requirements, assumptions and constraints; project plan, risks and contingencies; data mining goals and success criteria; terminology, tools and techniques
Data Discovery: analysis of source data for structural and content gaps. Data collection report; data description report; data exploration report; data quality report
Data Modeling: selection and manipulation of source data into a conformed entity input ready for formal exploration. Dataset, dictionary and rationale; data cleansing report; derived attributes; generated, merged and reformatted data
Analytics Modeling: research and analysis of patterns and creation of data mining models. Model; modeling technique; modeling assumptions; test design; parameter settings; model description
Model Trials: testing data mining models using different algorithms and validating statistical significance. Revised parameter settings; model validation plan; model assessment
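The trial step can be sketched in a few lines. This is a minimal illustration, not the deck's own procedure: it assumes cases are plain dictionaries, that a "model" is any callable returning a predicted value, and that simple holdout accuracy is an acceptable first assessment. The names `holdout_split` and `accuracy` are hypothetical.

```python
import random


def holdout_split(cases, test_fraction=0.3, seed=0):
    """Shuffle cases reproducibly and split them into (train, test) lists."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(model, test_cases, target):
    """Fraction of test cases where the model's prediction equals the target column."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    hits = sum(1 for case in test_cases if model(case) == case[target])
    return hits / len(test_cases)
```

Running the same `accuracy` call against several candidate models on the same held-out cases gives a like-for-like comparison; a significance test on the difference would be a separate step.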
Applied Analytics: integration of models with new data. Deployment plan; monitoring and maintenance plan; final report; final presentation; experience documentation
Data Concepts
Cases: What We Study Case: the set of columns you want to analyze (e.g. Age, Gender, Region, Annual Spending). Case Key: the unique ID of a case. A column has: Data Type; Content Type. And optionally: Distribution; Discretization; Related Columns; Flags (e.g. NOT NULL)
Column Data Types We don’t care about detailed low-level types. DM only uses: Text, Long, Boolean, Double, Date; and, for some 3rd-party algorithms: Time and Sequence
Column Content Types Common: DISCRETE (Red, Blue); CONTINUOUS ($6,511.49); DISCRETIZED (1-5, 6-20, 21+). Denotes a key: KEY. For special purposes: KEY SEQUENCE; KEY TIME; ORDERED; CYCLICAL
Column Usage Some algorithms interpret this in different ways, but in general columns are used as: Input (for predicting another column); PREDICT (both predicted and used as input for predicting others); PREDICT_ONLY (predicted, not used as input). Columns can be input, predictable, or both
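The column concepts above (data type, content type, usage) can be modeled as a small data structure. This is only an illustrative sketch in Python, not the product's API: the class names, the `Usage.KEY` member, and the example column names other than Age, Gender, Region and Annual Spending are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    KEY = "KEY"
    DISCRETE = "DISCRETE"
    CONTINUOUS = "CONTINUOUS"
    DISCRETIZED = "DISCRETIZED"


class Usage(Enum):
    KEY = "key"                    # identifies the case; neither input nor predicted
    INPUT = "input"                # used only to predict other columns
    PREDICT = "predict"            # predicted, and also an input for other predictions
    PREDICT_ONLY = "predict_only"  # predicted, never used as an input


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str  # one of: Text, Long, Boolean, Double, Date
    content_type: ContentType
    usage: Usage


def input_columns(columns):
    """Names of columns a model may read when predicting."""
    return [c.name for c in columns if c.usage in (Usage.INPUT, Usage.PREDICT)]


def predictable_columns(columns):
    """Names of columns a model may predict."""
    return [c.name for c in columns if c.usage in (Usage.PREDICT, Usage.PREDICT_ONLY)]


# Hypothetical case definition built from the example columns on the Cases slide.
CUSTOMER_CASE = [
    Column("CustomerId", "Long", ContentType.KEY, Usage.KEY),
    Column("Age", "Long", ContentType.DISCRETIZED, Usage.INPUT),
    Column("Gender", "Text", ContentType.DISCRETE, Usage.INPUT),
    Column("Region", "Text", ContentType.DISCRETE, Usage.PREDICT),
    Column("AnnualSpending", "Double", ContentType.CONTINUOUS, Usage.PREDICT_ONLY),
]
```

Here Region appears in both lists, which is the "input or predictable or both" case.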
Discretization When you don’t need to analyze the full continuous range, DM automatically converts the data into buckets (5 by default). Techniques: AUTOMATIC; CLUSTERS; EQUAL_AREAS; THRESHOLDS
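To make the bucketing idea concrete, here is a minimal sketch of equal-population bucketing, which is the intuition behind an equal-areas technique. It is an approximation written for illustration, not the engine's actual algorithm; `equal_area_edges` and `bucket_of` are hypothetical names.

```python
from bisect import bisect_right


def equal_area_edges(values, buckets=5):
    """Return buckets-1 cut points so each bucket holds roughly the same number of values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // buckets] for i in range(1, buckets)]


def bucket_of(value, edges):
    """Index of the bucket a value falls into (0 is the lowest bucket)."""
    return bisect_right(edges, value)
```

With heavily skewed data, equal-population buckets end up with very different widths, which is one reason to compare techniques on the same column.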
Distribution If you know the distribution of your data (you should), indicate it: NORMAL (typical Gaussian bell curve); LOG NORMAL (most values at the “beginning” of the scale); UNIFORM (flat line: equally likely or perfectly random). Other distributions can exist, but you cannot indicate them; the algorithm will work fine
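If the distribution is not known up front, a rough check can suggest which flag fits better. The sketch below is a heuristic of my own, not something the deck prescribes: it compares the skewness of the raw values with the skewness of their logarithms and only chooses between NORMAL and LOG NORMAL. The function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def skewness(values):
    """Population skewness: mean of the cubed standardized deviations."""
    mean = fmean(values)
    spread = pstdev(values)
    if spread == 0:
        return 0.0
    return fmean([((v - mean) / spread) ** 3 for v in values])


def suggest_distribution_flag(values):
    """Suggest "LOG NORMAL" when the log-transformed values are more symmetric than the raw ones."""
    if min(values) <= 0:
        return "NORMAL"  # logarithm undefined, so the log-normal check does not apply
    raw = abs(skewness(values))
    logged = abs(skewness([math.log(v) for v in values]))
    return "LOG NORMAL" if logged < raw else "NORMAL"
```

A histogram remains the better first look; this only automates the "most values at the beginning of the scale" observation.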
Nested Cases Nested Case: a case containing a table column (e.g. Purchases of a Customer). Used for analyzing patterns in a relationship. It has a Nested Key, which is not a “relational” foreign key! Normally, the Nested Key is a column you want to analyze, e.g. Product Name or Model
Testing and Validation
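A nested case can be pictured as one row per customer that carries its own small table of purchases, keyed by the thing being analyzed. The sketch below assumes hypothetical field names (`customer_id`, `purchases`) and flat purchase rows of the form (customer_id, product_name, quantity).

```python
from collections import defaultdict


def build_nested_cases(customers, purchase_rows):
    """Attach to each customer case a nested table keyed by product name (the nested key)."""
    nested = defaultdict(dict)
    for customer_id, product_name, quantity in purchase_rows:
        table = nested[customer_id]
        table[product_name] = table.get(product_name, 0) + quantity
    return [
        {**customer, "purchases": dict(nested.get(customer["customer_id"], {}))}
        for customer in customers
    ]
```

Note that the nested table is keyed by product name rather than by a purchase row ID, which reflects the point that the nested key is the column under analysis and not a relational foreign key.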
Algorithms and Use Cases Use cases: Classification; Segmentation; Association; Forecasting; Text Analysis; Advanced Data Exploration; Estimation. Algorithms: Association Rules; Clustering; Decision Trees; Linear Regression; Logistic Regression; Naïve Bayes; Neural Nets; Sequence Clustering; Time Series
Algorithms and Use Cases Features (Yes/No per algorithm): Drillthrough; PMML; DM Dimension. Algorithms: Association; Clustering; Decision Trees; Linear Regression; Logistic Regression; Naive Bayes; Neural Network; Sequence Clustering; Time Series
Varying Model Input
Fields to Notice AVGGIFT: average dollar amount of gifts to date; INCOME: household income; LASTGIFT: last donation amount; MAXRAMNT: dollar amount of largest gift to date; MINRAMNT: dollar amount of smallest gift to date; RAMNTALL: dollar amount of lifetime gifts to date; WEALTH1: wealth rating; WEALTH2: wealth rating; STATE: state abbreviation (a nominal/symbolic field)
Fields to Notice Donor Rank DOMAIN/Cluster code: a nominal or symbolic field that can be broken down by byte. 1st byte = urbanicity level of the donor's neighborhood: U = Urban; C = City; S = Suburban; T = Town; R = Rural. 2nd byte = socio-economic status (SES) of the neighborhood: 1 = Highest SES; 2 = Average SES; 3 = Lowest SES; except for Urban communities, where 1 = Highest SES; 2 = Above average SES; 3 = Below average SES; 4 = Lowest SES.
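The byte-by-byte breakdown of the DOMAIN code translates directly into a small decoder, which is one way to derive two separate attributes from the single field. The lookup values come from the slide; the function name `decode_domain` and the choice to return None for unrecognized codes are assumptions.

```python
URBANICITY = {"U": "Urban", "C": "City", "S": "Suburban", "T": "Town", "R": "Rural"}

# Socio-economic status (2nd byte) for every urbanicity level except Urban.
SES_DEFAULT = {"1": "Highest SES", "2": "Average SES", "3": "Lowest SES"}

# Urban communities use a four-level scale instead.
SES_URBAN = {
    "1": "Highest SES",
    "2": "Above average SES",
    "3": "Below average SES",
    "4": "Lowest SES",
}


def decode_domain(code):
    """Split a two-character DOMAIN code into (urbanicity, SES); None if it cannot be decoded."""
    code = code.strip().upper()
    if len(code) != 2:
        return None
    urbanicity = URBANICITY.get(code[0])
    ses_table = SES_URBAN if code[0] == "U" else SES_DEFAULT
    ses = ses_table.get(code[1])
    if urbanicity is None or ses is None:
        return None
    return urbanicity, ses
```

Treating the two parts as separate discrete inputs lets a model pick up an urbanicity effect independently of the SES effect.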
Results from Discrete Donation
Results from Discretized Donation
Scorecarding Demo
The Future of SQL Data Mining + = http://dejasu.wordpress.com/2008/01/28/knowledge-wisdom-other/question_mark.jpg
Acknowledgements: www.crisp-dm.org; www.sqlserverdatamining.com; Masao Okada; Rafal Lukawiecki; Eugene A. Asahara