1
“I Don’t Need Enterprise Miner”
David Yeo, Ph.D. SAS Institute (Canada) Inc. Copyright © 2011, SAS Institute Inc. All rights reserved.
2
Overview
- The Case Against Using Enterprise Miner
- The Case For Using Enterprise Miner
- Questions
3
The Case Against Using Enterprise Miner
The arguments for coding rather than using Enterprise Miner (EM) are typified by the following statements:
- “I like to code.”
- “I don’t want to lose the time invested developing my code.”
- “My code has proven reliable in the past.”
- “I understand what is going on in my code; I don’t fully understand what is going on in Enterprise Miner.”

“I like to code”
- You don’t have to give up coding. Just use EM to handle the tedious preparation tasks (e.g. imputation).

“I don’t want to lose the time invested developing my code”
- You can still use your favorite SAS programs and models:
  - SAS Code node (code editor window)
  - Model Import node (import and assess non-EM models)

“My code has proven reliable in the past”
- Every time you rebuild or update an existing model, “all bets are off”; the new model may be quite different from your tried-and-true program.
- EM is designed to help you build reliable models.

“I understand my code … but not EM”
- EM is complex, but it is not a “black-box” solution.
- Each node in EM has a SAS PROC behind it, often the same one you would have used had you coded the model using SAS/STAT.
- The available setting options and results listings are familiar.
- Education: both basic and advanced training are available for EM.
4
The Case For Using Enterprise Miner
- Intuitive “drag-and-drop” interface.
- Simplify tedious data preparation tasks.
- Implement powerful advanced modeling techniques.
- Integrate decision theory into your decisions.
- Incorporate your favorite SAS programs and procedures.
- Use Enterprise Miner as a code generator.

EM offers many features for both novice and experienced modelers:
- For less experienced modelers, the drag-and-drop interface and built-in defaults substantially speed up model creation and testing. This can substantially reduce their training time, freeing up more experienced modelers to work on their own projects.
- For experienced SAS users, EM provides tools and built-in capabilities to speed up tedious tasks like dummy coding and missing value imputation.
- EM also offers powerful modeling techniques that are not available in SAS/STAT, e.g. decision trees and neural networks.
- It also allows you to apply decision theory (i.e. profit and prior probability).
- It is even possible to run your existing SAS code and models within EM.
- Finally, EM can be used simply to generate code (in various formats).
5
Intuitive “Drag-and-Drop” Interface
- EM is frequently promoted on the basis of its ease of use, and for good reason.
- It features an intuitive drag-and-drop interface in which “nodes”, representing processing steps in the analysis, are strung together into a “flow diagram”.
- Familiar SAS procedures often underlie these nodes, which means that the results reported by the nodes will also be familiar to you.
- These procedures can offer a staggering array of settings and options. Fortunately, these options have been set to reasonable default values in EM, speeding up model construction (particularly for novice modelers).
- EM also provides extensive online documentation on the interface and nodes, as well as context-sensitive help to define the various node settings.

Summary:
- Sensible defaults facilitate rapid model construction.
- Extensive documentation and context-sensitive help.
6
Simple Statistical Graphics
- Offers an extensive range of plots, including histograms, scatterplots, contour plots, and even 3-D rotating plots.
- Often the graphs are fully interconnected.

Exploration of your data is a first priority when modeling.
- In the old days you practically needed a Ph.D. to generate SAS graphics.
- With the emergence of guided graphics construction in products like Enterprise Guide and Enterprise Miner, graphs are now easy to generate.
- For instance, each node has an Explore option that allows you to easily display the distribution of the input and target variables.
- Moreover, these plots are fully interconnected, allowing you to explore the relationship between variables.
- There are also displays that highlight the relationship between the input and target variables by overlaying the target on each input (MultiPlot node).
7
Automatic Design (Dummy) Coding
- Nominal and ordinal variables are automatically design (a.k.a. dummy) coded for use in subsequent models.
- Either “effect” or “reference cell” coding can be specified.

[Figure: design matrix in which a variable with levels A through I is coded into indicator columns DA through DI.]

- In many models (e.g. regression), all nominal variables must be transformed into a set of binary indicators before the variables can be used.
- This is known as design coding, and it can be a tedious task if done manually.
- The OUTDESIGN= option in PROC GLMMOD can be used to generate a design matrix for several SAS procedures (e.g. PROC GLM).
- In EM, once the level of each variable has been declared, the dummy coding of nominal and ordinal variables is performed automatically.
- Either the “effect” or the “reference cell” dummy coding variant can be specified.
...
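To make the two coding variants concrete, here is a minimal Python sketch of reference cell and effect coding for a single nominal variable. The function name `design_code`, the `D` column prefix, and the choice of the last sorted level as the reference level are assumptions made for illustration; this does not reproduce EM's or PROC GLMMOD's actual output.

```python
def design_code(values, coding="reference"):
    """Design-code one nominal variable.

    Assumption: the last level in sorted order is the reference level.
    - reference cell coding: the reference level gets a row of zeros
    - effect coding: the reference level gets a row of -1s
    Returns (column_names, rows).
    """
    levels = sorted(set(values))
    kept, ref = levels[:-1], levels[-1]
    names = [f"D{lvl}" for lvl in kept]  # hypothetical naming scheme
    rows = []
    for v in values:
        if v == ref and coding == "effect":
            rows.append([-1] * len(kept))
        else:
            # One indicator per non-reference level; all zeros for the reference level.
            rows.append([1 if v == lvl else 0 for lvl in kept])
    return names, rows


ref_names, ref_rows = design_code(["A", "B", "C"], coding="reference")
eff_names, eff_rows = design_code(["A", "B", "C"], coding="effect")
```

With three levels, both variants produce two columns; they differ only in how the reference level is represented.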
8
Variable Selection
SAS Enterprise Miner offers an extensive set of variable selection methods:
- Sequential (stepwise) selection
- R-square or chi-square based selection
- Split search selection
- Variable importance in the projection
- Variable clustering

Identifying an appropriate set of input variables (variable selection) is clearly one of the most important model construction tasks.
- The goal is to identify and remove “irrelevant” and “redundant” variables.
- Underscoring its importance, EM offers many variable selection techniques to guide the identification and removal of irrelevant input variables:
  - All the classic regression “stepwise” methods are available: Forward, Backward, and Stepwise.
  - There is also a specialty node that performs either an enhanced variant of forward selection or a simple variant of split search (tree) selection.
  - Alternatively, split search selection can be performed directly using a decision tree, which offers the advantage of retaining nonlinearities.
  - Finally, if partial least squares regression is used, a measure of variable importance (known as VIP) can guide variable selection.
- A Variable Clustering node, which implements PROC VARCLUS, has also been provided to help you detect and remove redundant input candidates.
- The Variable Clustering node groups redundant variables, allowing you to select one variable from each group and throw out the rest.
9
Missing Value Imputation
- Synthetic (e.g. mean, mode).
- Estimation (e.g. distribution, decision tree).

[Figure: two panels, labeled “Synthetic distribution” and “Estimation, xi = f(x1, …, xp)”.]

Another potentially challenging pre-modeling task is missing value imputation.
- If simple “synthetic” imputation is used, in which the missing value is replaced by a constant (e.g. the mean), then the challenge is minimized.
- But synthetic methods inevitably distort the underlying variable distribution.
- More sophisticated methods use estimates based either on the non-missing inputs or on the distribution itself to determine the appropriate value.
- EM offers several of these sophisticated imputation techniques, including tree-based imputation and distribution-based imputation.
- These more sophisticated missing value estimation techniques would be extremely challenging and time consuming to code from scratch.
- EM can also automatically generate indicator variables that are set to 1 if the associated value was imputed (and to 0 otherwise).
- This is particularly important if the “missing-ness” is not random.
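As a small illustration of the simplest case described above, the following Python sketch performs mean imputation and creates the matching indicator variable. The function and variable names are hypothetical, and missing values are assumed to be represented as `None`; tree-based and distribution-based imputation are not shown.

```python
from statistics import mean


def impute_mean_with_indicator(values):
    """Replace missing values (None) with the mean of the observed values.

    Also returns an indicator list: 1 where the value was imputed, 0 otherwise.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


income = [40.0, None, 60.0, 50.0, None]  # made-up example data
imputed_income, m_income = impute_mean_with_indicator(income)
```

Keeping the indicator alongside the imputed column lets a later model pick up any signal carried by the missingness itself.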
10
Variable Transformation
- Simple (e.g. log) and advanced (e.g. optimal binning).

[Figure: two panels. Original Scale: skewed distribution, standard regression, true association. Transformed Scale: more symmetric distribution, standard regression.]

Transforming non-normal variables is also often a modeling prerequisite.
- For instance, transformation may be necessary to make nonlinear input-output relationships more linear.
- Transformations can also help reduce the distorting impact of outliers.
- EM offers both simple transformation options (e.g. LOG, SQRT) and more complex transformation options such as:
  - Maximize Normality: automatically select the transformation that makes the variable as normally distributed as possible.
  - Maximize Correlation: automatically select the transformation that maximizes the input variable’s correlation with the target.
- As with the more sophisticated imputation methods, it would likely be difficult and time consuming to code these complex transformation methods from scratch.
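The idea behind a “maximize correlation” option can be sketched in a few lines of Python: try several candidate transformations and keep the one whose transformed input correlates most strongly with the target. The candidate set, the function names, and the example data below are assumptions for illustration; the source does not say which candidates or criterion details EM uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical candidate set; log and sqrt require strictly positive inputs.
CANDIDATES = {"identity": lambda x: x, "log": math.log, "sqrt": math.sqrt}


def best_transform(xs, ys):
    """Return the candidate name with the largest absolute correlation, plus all scores."""
    scores = {
        name: abs(pearson([f(x) for x in xs], ys))
        for name, f in CANDIDATES.items()
    }
    return max(scores, key=scores.get), scores


x = [1, 10, 100, 1000, 10000]   # strongly right-skewed input (made up)
y = [0.0, 1.1, 1.9, 3.2, 3.9]   # target that is roughly linear in log(x)
best_name, scores = best_transform(x, y)
```

Here the log transformation wins because the target grows by a roughly constant amount each time the input is multiplied by ten.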
11
Association Analysis
Forms simultaneous or sequential associations.

Transactions:
  1: A B C
  2: A C D
  3: B C D
  4: A D E
  5: B C E

Rule                 Support   Confidence
A implies D          2/5       2/3
C implies A          2/5       2/4
A implies C          2/5       2/3
B and C implies D    1/5       1/3

- As is generally well known, EM offers a number of modeling algorithms that are not available elsewhere in Base SAS or SAS/STAT.
- For instance, the Association node in EM performs “market basket analysis”, an algorithm that attempts to find associations between things, e.g. customers who purchased Product1 and Product2 are also likely to purchase Product3 and Product.
- These associations can be either at one point in time or across time.
- The derived association rules can be used to guide upsell/cross-sell decisions, bundling decisions, product placement decisions, etc.
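The support and confidence figures on this slide can be checked with a short Python sketch. It assumes the slide's letters group into the five three-item transactions A B C, A C D, B C D, A D E, and B C E; the function name is hypothetical. Support is the share of all transactions containing both sides of the rule, and confidence is the share of transactions containing the left side that also contain the right side.

```python
from fractions import Fraction

# Assumed grouping of the slide's items into five transactions.
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]


def support_confidence(lhs, rhs, transactions=TRANSACTIONS):
    """Return (support, confidence) of the rule lhs -> rhs as exact fractions.

    Raises ZeroDivisionError if no transaction contains lhs.
    """
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    return Fraction(n_both, len(transactions)), Fraction(n_both, n_lhs)


a_implies_d = support_confidence("A", "D")
c_implies_a = support_confidence("C", "A")
bc_implies_d = support_confidence("BC", "D")
```

Note that `Fraction` reduces 2/4 to 1/2, so the confidence of “C implies A” prints as 1/2.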
12
Decision Trees
Enterprise Miner implements all of the major decision tree variants, i.e. CART, CHAID, and entropy-based.

A decision tree is another popular advanced modeling technique that is only available in EM (and JMP).
Trees are particularly popular during initial data analysis because:
- They are highly resistant to the “curse of dimensionality”, selecting only relevant inputs from a potentially long list of input candidates.
- They do not require missing value imputation (so the data is not distorted).
- They can efficiently fit complex nonlinear input-output relationships.
- They produce IF..THEN rules that are easy to interpret, helping you to understand the underlying relationships in your data.

EM implements all of the major tree variants (i.e. CART, CHAID, and entropy), as well as hybrids of these classic approaches.
In EM, trees can even be constructed interactively, affording you complete control over both the growing and pruning phases.
13
Consolidation Trees
Combines categorical levels that have a similar outcome.

[Figure: consolidation tree over the levels A through J of a nominal input, with groups such as ABCDJ and EFGHI.]

- Trees can be effectively applied in a wide range of non-modeling tasks.
- As already noted, decision trees can be productively used for tasks like variable selection and missing value imputation.
- They can also be used to effect dimension change.
- For example, consolidation trees collapse nominal inputs in a principled way, minimizing the number of dummy variables input into subsequent models.
14
Neural Networks
- PROC NEURAL is one of SAS’ most powerful statistical procedures (it’s a universal approximator)!
- Available neural network architectures include: MLP, RBF, VQ, SOM, and functional-link networks.

[Figure: network diagram with inputs x1 and x2, hidden-layer units H1, H2, and H3, and output Y.]

- EM contains a number of neural network algorithms, including MLP, RBF, VQ, SOM, and functional-link networks (DMNEURAL).
- Based on an attempt to model what biological brains do, neural networks are nonlinear models capable of fitting any input-output relationship.
- In other words, a neural network is a “universal approximator”.
- In fact, the procedure underlying EM’s Neural Network node (NEURAL) is one of SAS’ most powerful statistical procedures.
- It can do most of what the GLM, REG, GENMOD, CATMOD, and LOGISTIC procedures can do … and more.
...
15
Combined Models
- Perturb and combine methodology (ensemble model).
- Combine class probability model and continuous-valued prediction model (two-stage model).
- Combines predictions from multiple models to create a single consensus prediction.

In addition to providing a set of advanced modeling algorithms, EM allows you to combine the predictions of multiple models into a single estimate.
- The Ensemble node, for example, allows you to combine the predictions of several different models, or multiple versions of one model, into a single estimate.
- The integrated perturb-and-combine strategy facilitates both “bagging” (iterative resampling) and “boosting” (weighted resampling).
- And the Two Stage node combines the probability that a prospect will respond to a campaign with the expected revenue they will provide.
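The arithmetic behind these two ways of combining models can be sketched in Python. This is an illustrative assumption, not the Ensemble or Two Stage node's actual algorithm: the consensus prediction is taken to be a plain average of the individual models' probabilities, and the two-stage value is taken to be the response probability multiplied by the predicted revenue given a response. All names and numbers are made up.

```python
def ensemble_average(probabilities):
    """Consensus prediction as the simple mean of several models' probabilities."""
    return sum(probabilities) / len(probabilities)


def two_stage_expected_value(p_respond, revenue_if_respond):
    """Expected revenue for a prospect: P(respond) * E[revenue | respond]."""
    return p_respond * revenue_if_respond


# Three hypothetical models scoring the same prospect.
consensus = ensemble_average([0.2, 0.4, 0.6])

# A prospect with a 10% response probability and $50 predicted revenue if they respond.
expected_revenue = two_stage_expected_value(0.10, 50.0)
```

Ranking prospects by expected revenue rather than by response probability alone can change who is contacted, since a less likely responder may still be worth more.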
16
Prior Probability
Enterprise Miner applies prior probability information to correct probability estimates for oversampling.

[Figure: matrices of decision/action by actual class, adjusted for priors.]

- Due to time pressures, the principles of decision theory can easily be overlooked during model construction and assessment.
- Overlooking these principles can be extremely hazardous to your model’s health.
- For instance, if your data was oversampled, all of your probability estimates will be wrong unless corrected by applying the population prior probabilities.
- EM offers you the ability to enter and apply prior probability information.
- When prior probabilities are entered, the probability estimates of all models are simultaneously corrected.
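The correction for oversampling can be illustrated with the standard prior-adjustment formula for a binary target: each class's estimated probability is reweighted by the ratio of its population proportion to its sample proportion, and the result is renormalized. The Python sketch below applies that formula; the function and parameter names are hypothetical, and the source does not spell out how EM performs the adjustment internally.

```python
def adjust_for_priors(p, sample_rate, population_rate):
    """Correct an event probability estimated on oversampled data.

    p               -- model's estimated event probability on the sample scale
    sample_rate     -- proportion of events in the (oversampled) training data
    population_rate -- prior proportion of events in the population
    """
    event = p * population_rate / sample_rate
    nonevent = (1 - p) * (1 - population_rate) / (1 - sample_rate)
    return event / (event + nonevent)


# Trained on a 50/50 sample although only 5% of the population are events:
corrected = adjust_for_priors(0.5, sample_rate=0.5, population_rate=0.05)
```

A case the model rates as a coin flip on the balanced sample corresponds to the 5% base rate in the population, which is why uncorrected estimates overstate the event probability.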
17
Profit Matrix
The profit matrix sets the optimal decision cutoff value.

[Figure: profit matrix with decisions (solicit, ignore) against outcomes (primary event, secondary event); the solicit column holds 15.14 for the primary event and -0.68 for the secondary event. A formula gives the cutoff for the predicted probability p̂ in terms of the TP, FN, TN, and FP consequences.]

- EM also offers the ability to enter values into a profit/loss matrix, in order to reflect the implied consequences of the various outcomes.
- The above profit matrix, for instance, reflects the expectation that a prospect who is correctly identified as a donor will return $15.14 profit: Profit = donation amount - contact cost = $15.82 - $0.68 = $15.14.
- The matrix also states that mistakenly identifying a non-responder as a responder will cost you 68 cents (therefore profit = -$0.68).
- Entering profit information has an important secondary effect: it sets the decision cutoff to the Bayesian optimal value.
- Here the Bayesian optimal cutoff would be 0.68/15.82 ≈ 0.043. This means that any predicted probability greater than or equal to 0.043 would be declared a “solicit” recommendation; otherwise “ignore”.

Bayesian optimal decision threshold:
  p̂ ≥ 0.68/15.82: solicit
  p̂ < 0.68/15.82: ignore
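The cutoff of 0.68/15.82 follows from comparing the expected profit of the two decisions. The Python sketch below works this out under one labeled assumption: the “ignore” column of the profit matrix is zero for both outcomes, which is consistent with the cutoff stated on the slide but is not shown explicitly in the transcript. Soliciting is preferred when p*TP + (1-p)*FP >= p*FN + (1-p)*TN, which rearranges to p >= (TN - FP) / ((TP - FN) + (TN - FP)). The function names are hypothetical.

```python
def optimal_cutoff(tp, fn, fp, tn):
    """Probability threshold at which 'solicit' and 'ignore' have equal expected profit.

    tp, fp -- profit of soliciting a primary / secondary event
    fn, tn -- profit of ignoring a primary / secondary event
    """
    return (tn - fp) / ((tp - fn) + (tn - fp))


def decide(p_hat, cutoff):
    """Apply the threshold rule to a predicted probability."""
    return "solicit" if p_hat >= cutoff else "ignore"


# Assumption: ignoring a prospect yields zero profit either way (fn = tn = 0).
cutoff = optimal_cutoff(tp=15.14, fn=0.0, fp=-0.68, tn=0.0)
```

Because a missed donor costs far more than a wasted mailing, the break-even probability is low: it pays to solicit anyone with more than about a 4.3% chance of donating.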
18
Conforming Profit
If no profit matrix is available, use “conforming profit” to properly set the Bayesian optimal cutoff value.

[Figure: conforming profit matrix with decisions (solicit, ignore) against outcomes (primary event, secondary event); the true positive entry is 1/π1 and the true negative entry is 1/π0, with the resulting cutoff rule for the predicted probability p̂.]

- If no profit/loss matrix is provided, EM defaults to a matrix that minimizes the number of misclassifications made.
- The implied Bayesian optimal cutoff for the default profit matrix is 0.5.
- But EM now offers the option of using a “conforming” profit matrix.
- In a conforming profit matrix, the consequence of a true positive is set to the reciprocal of the proportion of the one-event in the population.
- The true negative consequence is set to the reciprocal of the proportion of the zero-event in the population.
- The false positive and false negative consequences are set to zero.
- This sets the cutoff to the Bayesian optimal value, where π1 is the population proportion of the primary event and π0 is the proportion of the secondary event.
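To see what cutoff a conforming profit matrix implies, the following Python sketch plugs its entries into the usual expected-profit comparison. It assumes the symbols lost from the transcript are π1 and π0 (the population proportions of the primary and secondary events) and that π0 = 1 - π1; the function names are hypothetical. Under those assumptions the threshold works out to π1 itself, whereas the default matrix gives 0.5.

```python
def cutoff_from_matrix(tp, fn, fp, tn):
    """Threshold where choosing the primary decision starts to pay off."""
    return (tn - fp) / ((tp - fn) + (tn - fp))


def conforming_cutoff(pi1):
    """Cutoff implied by a conforming profit matrix for event proportion pi1."""
    pi0 = 1.0 - pi1                      # assumption: two classes only
    tp, tn = 1.0 / pi1, 1.0 / pi0        # reciprocals of the class proportions
    return cutoff_from_matrix(tp=tp, fn=0.0, fp=0.0, tn=tn)


# Default matrix (1 for each correct decision, 0 otherwise) versus conforming matrix.
default_cutoff = cutoff_from_matrix(tp=1.0, fn=0.0, fp=0.0, tn=1.0)
rare_event_cutoff = conforming_cutoff(0.05)
```

For a rare event the difference matters: with a 0.5 cutoff almost no case would ever be flagged, while a cutoff equal to the base rate flags cases that are more likely than average to be events.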
19
Adding SAS Programs
- A SAS Code node can run any data step or licensed SAS procedure right within the data flow diagram.
- This allows you to add SAS procedures and custom code not currently available as nodes in Enterprise Miner.
- It also means you do not have to give up your favorite and familiar SAS programs and procedures!

[Figure: SAS Code node editor window, annotated “Your SAS code goes here.”]

- EM is a superset of standard SAS (it sits on top of Base SAS and SAS/STAT).
- The SAS Code node provides an editor window that allows you either to read in existing SAS programs or to write your own SAS code from scratch.
- A number of macro variables have been defined to make your programming task easier (e.g. EM_INTERVAL_INPUTS).
- Through a SAS Code node it is possible to implement SAS data step code, or to access any SAS procedure for which you are licensed, from within EM.
- This means that your existing SAS code “gems” can still be used in EM.
20
Automated Model Assessment
- Simultaneous assessment of multiple models using both statistical and graphical information.
- Can assess models on either training or holdout data.
- Offers a wide array of model selection options, including ASE, c-statistic (ROC index), and misclassification rate.

As advanced modelers know, model assessment must be done with care.
- Ideally, assessment should be performed using holdout data, as assessing a model on training data leads to overly optimistic expectations.
- EM’s Assessment node allows you to simultaneously compare multiple models against the training, validation, or test data set.
- A wide range of statistics can be used for model selection.
- Of course, the statistic selected should reflect the type of prediction desired:
  - Decisions (easiest): use misclassification rate/accuracy.
  - Rankings: use the c-statistic (a.k.a. ROC index).
  - Estimates (hardest): use average squared error.
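The three statistics named above measure different things, which a small Python sketch on made-up holdout data makes visible. The function names, the 0.5 decision cutoff, and the example scores are assumptions for illustration; EM's own definitions (for instance, how ASE is averaged across target levels) may differ in detail.

```python
def misclassification_rate(y, p, cutoff=0.5):
    """Share of cases whose decision (p >= cutoff) disagrees with the actual class."""
    return sum((pi >= cutoff) != bool(yi) for yi, pi in zip(y, p)) / len(y)


def average_squared_error(y, p):
    """Mean squared difference between the actual class (0/1) and the estimate."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def c_statistic(y, p):
    """Share of (event, non-event) pairs ranked correctly; ties count one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y_holdout = [1, 0, 1, 0]          # actual classes (made up)
p_holdout = [0.9, 0.4, 0.3, 0.2]  # predicted event probabilities (made up)

mis = misclassification_rate(y_holdout, p_holdout)
ase = average_squared_error(y_holdout, p_holdout)
c = c_statistic(y_holdout, p_holdout)
```

Misclassification only looks at which side of the cutoff a score falls on, the c-statistic only at the ordering of scores, and average squared error at the scores' actual values, which is why it is the most demanding of the three.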
21
Enterprise Miner as a Code Generator
The entire data flow diagram can be output as:
- Base SAS code (SAS/STAT is not required)
- HTML code
- C code

Finally, to facilitate model integration into a production environment, the entire EM diagram can be output in one of three code forms:
- Base SAS code (SAS/STAT is not required)
- C code
- Java code

In other words, EM is a code generator.
- The entire flow is captured, including any imputations, transformations, etc.
- This means that EM can be used simply to prepare your input data, even if an externally (SAS/STAT) coded model will ultimately be used in production.
22
Questions
Contact Information: David Yeo, Ph.D., SAS Institute (Canada) Inc.