Decision Analysis Lecture 11


Decision Analysis Lecture 11 Tony Cox My e-mail: tcoxdenver@aol.com Course web site: http://cox-associates.com/DA/

Agenda
Recommended readings
Problem set 9 solutions
Problem set 10
Where are we going?
Wrap-up on multi-arm bandit (MAB): Thompson sampling
Wrap-up on adaptive decisions
Optimal stopping and optimal replacement
Bayesian networks, CART, partial dependence plots
Influence diagrams

Recommended Readings
Skim Bell (1988), pages 1416-1418, on single-attribute utility theory: http://www.people.hbs.edu/dbell/one%20switch.pdf
Skim Abbas (2010), pages 62-67 and 74-77, on multiattribute utility theory (MAUT): http://create.usc.edu/sites/default/files/publications/tutorialmau.pdf

Homework #9, Problem 1: Updating a uniform prior
Starting from a uniform prior, U[0, 1], for success probability, you observe 22 successes in 30 trials. What is your Bayesian posterior probability that the success probability is greater than 0.5?
Bayesian updating: P(p > 0.5 | x = 22, n = 30) = 1 - pbeta(0.5, 23, 9) = 0.994663
Uses pbeta with updated parameters x + 1 and n - x + 1 based on x successes in n trials, x = 22, n = 30
Posterior is beta(x + 1, n - x + 1) = beta(23, 9).
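The conjugate update for Problem 1 can be checked directly in R. This is a minimal sketch; the variable names (a_post, b_post, p_gt_half) are illustrative and do not come from the slides.

```r
# Beta-binomial conjugate update, starting from a uniform prior Beta(1, 1)
x <- 22   # observed successes
n <- 30   # number of trials

a_post <- 1 + x        # first shape parameter of the posterior: 23
b_post <- 1 + n - x    # second shape parameter of the posterior: 9

# Posterior probability that the success probability exceeds 0.5
p_gt_half <- 1 - pbeta(0.5, a_post, b_post)
print(round(p_gt_half, 6))
```

The second parameter counts failures (n - x) plus one, not trials plus two, which is why pbeta is called with 9 rather than 32.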

Homework #9, Problem 2: Spare parts
In a manufacturing plant, it costs $10/day to stock 1 spare part, $20/day to stock 2 spare parts, etc. ($10 per spare part per day). There are 50 machines in the plant. Each machine breaks with probability 0.004 per machine per day. (More than one machine can fail on the same day.) If a spare part is available (in stock) when a machine breaks, it can be repaired immediately, and no production is lost. If no spare part is available when a machine breaks, the machine is idle until a new part can be delivered (1-day lag), and $65 of production is lost. How many spare parts should the plant manager keep in stock to minimize expected loss?

Spare parts solution Number of failures in a day has binomial distribution with n = 50 machines and p = 0.004 per machine per day. Mean is np= 50*0.004 = 0.2 failures per day, so we expect about one breakdown every 5 days The cost of keeping 1 spare is $10/day The cost per machine failure is $65 if no spare is available, else 0

Spare parts (cont.)
With 0 spares, the average loss per day due to failures is ($65 per unrepaired failure)*(0.2 failures/day) = $13/day
With 1 spare, average loss/day ≈ $10/day + $65*(1 - pbinom(1, 50, 0.004)) = $11.12
Exact cost < $10 + 65*dbinom(2,50,0.004) + 2*65*dbinom(3,50,0.004) + 3*65*dbinom(4,50,0.004) + sum(c(4:50))*65*dbinom(5,50,0.004) = $11.34
With 2 spares, average loss > $20/day
So, optimal decision is: Stock 1 spare
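The bounds on this slide can be replaced by an exact expectation. The sketch below assumes (this is not stated on the slides) that the stock is back at s spares every morning, so each day's loss is $65 times the number of failures in excess of s. The names expected_cost and best_spares are illustrative.

```r
# Expected daily cost of stocking s spares
n_machines      <- 50
p_fail          <- 0.004
holding_cost    <- 10   # $ per spare per day
lost_production <- 65   # $ per failure with no spare available

k   <- 0:n_machines                      # possible numbers of failures in a day
pmf <- dbinom(k, n_machines, p_fail)     # binomial probabilities of k failures

expected_cost <- function(s) {
  unrepaired <- pmax(k - s, 0)           # failures that find no spare in stock
  holding_cost * s + lost_production * sum(unrepaired * pmf)
}

spares <- 0:3
costs  <- sapply(spares, expected_cost)
best_spares <- spares[which.min(costs)]

print(round(costs, 2))
best_spares
```

Under this assumption the exact cost with 1 spare falls between the slide's approximation ($11.12) and its upper bound ($11.34), and stocking 1 spare remains the minimum.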

Homework #9 discussion problem for April 11 (uncollected/ungraded) Choice set: Take or Do Not Take Chance set (states): Sunshine or Rain P(Sunshine) = p = 0.6 Utilities of act-state pairs: u(Take, Sunshine) = 80 u(Take, Rain) = 80 u(Do Not Take, Sunshine) = 100 u(Do Not Take, Rain) = 0

Homework #9 discussion problem (uncollected/ungraded) If p = 0.6, find EU (Take) and EU(Don’t Take) using Netica Goal is to see how Netica deals with decisions and expected utilities May also try it via simulation Update these EUs if a forecast (with error probability 0.2) predicts rain

The Umbrella Problem Choice set: Take or Do Not Take Chance set (states): Sunshine or Rain P(Sunshine) = p = 0.6 Utilities of act-state pairs: u(Take, Sunshine) = 80 u(Take, Rain) = 80 u(Do Not Take, Sunshine) = 100 u(Do Not Take, Rain) = 0

Netica solution without forecast known

Netica solution with forecast of rain Netica automatically updates EU of different decisions, given observations. Influence diagrams are just as easy to create and solve as Bayesian networks

Solving by simulation in R: Part A (no forecast)
# Initialize variables
weather = utility = NULL; p = 0.6; n = 100000;
# simulate weather state
weather = rbinom(n, 1, p);
# calculate and report simulated probability of sun
psun = mean(weather); psun
# calculate and report utilities of each action
EUtake = psun*80+(1- psun)*80;
EUleave = psun*100+(1- psun)*0;
EUtake; EUleave

Solving by simulation in R: Part A (no forecast), console output
> # Initialize variables
> weather = utility = NULL; p = 0.6; n = 100000;
> # simulate weather state
> weather = rbinom(n, 1, p);
> # calculate and report simulated probability of sun
> psun = mean(weather); psun
[1] 0.59519
> # calculate and report utilities of each action
> EUtake = psun*80+(1- psun)*80;
> EUleave = psun*100+(1- psun)*0;
> EUtake; EUleave
[1] 80
[1] 59.519

Solving by simulation in R: Part B (with forecast)
# Initialize variables
weather = forecastsun = utility = NULL; p = 0.6; n = 100000;
# simulate weather states and forecasts
weather = rbinom(n, 1, p);
forecast_sun_if_rain = rbinom(n, 1, 0.2);
forecast_sun_if_sun = rbinom(n,1, 0.8);
forecastsun = weather*forecast_sun_if_sun + (1 - weather)* forecast_sun_if_rain;
forecastRain=1-forecastsun
# calculate and report desired conditional probability
psunifForecastRain = sum(weather*forecastRain)/sum(forecastRain); psunifForecastRain
# calculate and report utilities of each action
EUtake = psunifForecastRain*80+(1- psunifForecastRain)*80;
EUleave = psunifForecastRain*100+(1- psunifForecastRain)*0;
EUtake; EUleave

Solving by simulation in R: Part B (with forecast), console output
> # Initialize variables
> weather = forecastsun = utility = NULL; p = 0.6; n = 100000;
> # simulate weather states and forecasts
> weather = rbinom(n, 1, p);
> forecast_sun_if_rain = rbinom(n, 1, 0.2);
> forecast_sun_if_sun = rbinom(n,1, 0.8);
> forecastsun = weather*forecast_sun_if_sun + (1 - weather)* forecast_sun_if_rain;
> forecastRain=1-forecastsun
> # calculate and report desired conditional probability
> psunifForecastRain = sum(weather*forecastRain)/sum(forecastRain); psunifForecastRain
[1] 0.2764975
> # calculate and report utilities of each action
> EUtake = psunifForecastRain*80+(1- psunifForecastRain)*80;
> EUleave = psunifForecastRain*100+(1- psunifForecastRain)*0;
> EUtake; EUleave
[1] 80
[1] 27.64975
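The simulated values can be cross-checked against the exact answer from Bayes' rule. This is a sketch using the umbrella problem's numbers (P(Sunshine) = 0.6, forecast error probability 0.2); the utility matrix u and the helper eu are illustrative names, not part of the lecture code.

```r
# Exact expected utilities for the umbrella problem
p_sun <- 0.6   # prior probability of sunshine
err   <- 0.2   # probability that the forecast is wrong

# Rows are acts, columns are states
u <- matrix(c(80, 80,
              100, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("Take", "Leave"), c("Sun", "Rain")))

# Expected utility of each act, given a probability of sunshine
eu <- function(ps) drop(u %*% c(ps, 1 - ps))

eu_prior <- eu(p_sun)   # Part A: no forecast

# Part B: Bayes' rule for P(Sun | forecast of rain)
p_forecast_rain <- p_sun * err + (1 - p_sun) * (1 - err)
p_sun_given_rain_forecast <- p_sun * err / p_forecast_rain
eu_post <- eu(p_sun_given_rain_forecast)

eu_prior
eu_post
```

The exact posterior probability of sunshine is 0.12/0.44, about 0.273, so the simulated 0.2765 and EU of about 27.6 for leaving the umbrella are within sampling error of the exact values.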

Homework #10 (optional) (Due by 4:00 PM, April 18 if you want it graded) A deep sea oil drilling platform has a normally distributed lifetime (until failure) with a mean of 30 years and a standard deviation of 4 years. While it is operating, the platform produces oil worth $10M per year. Voluntarily stopping operations and closing down the platform costs $0. Having the platform fail while still in use leads to involuntary closure and a cost of $50M. At what age should the platform be voluntarily shut down (if it has not yet failed)? Hint: Continue until marginal benefit < expected marginal cost of continuing Hint: Use a hazard function calculator for normally distributed lifetimes, e.g., http://reliabilityanalyticstoolkit.appspot.com/normal_distribution

Where are we going? Student observation 1: “Here is my solution for homework 9. I have to say the course is really getting deeper and deeper. And I'm trying to understand all the content.” Student observation 2: If I could totally understand and use a few of these techniques, it might be more useful than seeing more. (Exploration vs. exploitation trade-off.) Goals: Be able to use key techniques (e.g., computing EMV with known probabilities and values) that are relatively simple and useful. Be aware of more advanced methods and “big ideas” that may prove useful; know what to look for or ask for as a manager

Top-level ideas Many valuable decision problems can be solved using the philosophy of simulation-optimization: Try different decisions, evaluate their probable consequences, choose the one with best (EMV or EU-maximizing) probability distribution of consequences Both the probability modeling (“simulation”) and the search for a best decision or decision rule can become very technical For business applications, understanding what problems can be solved and how to formulate them is the most important part. (The rest is “just software” or using appropriate expertise)

MAB Thompson sampling (cont.)

Thompson sampling and adaptive Bayesian control: Bernoulli trials Basic idea: Choose each of the k actions according to the probability that it is best Estimate the probability via Bayes’ rule It is the mean of the posterior distribution Use beta conjugate prior updating for “Bernoulli bandit” (0-1 reward, fail/succeed) Sample from posterior for each arm, 1… k; choose the one with highest sample value. Update & repeat. S = success F = failure http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012
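The sample, choose, update loop for a Bernoulli bandit can be sketched in a few lines of R. The true success probabilities, the number of rounds, and all variable names below are hypothetical choices for illustration; they are not taken from the slides or from Agrawal and Goyal (2012).

```r
# Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors on each arm
set.seed(1)
true_p   <- c(0.3, 0.5, 0.7)   # hypothetical success probabilities, unknown to the learner
k        <- length(true_p)
n_rounds <- 2000

successes <- rep(0, k)   # S counts per arm
failures  <- rep(0, k)   # F counts per arm

for (t in seq_len(n_rounds)) {
  # Sample one value from each arm's posterior, Beta(S + 1, F + 1)
  draws <- rbeta(k, successes + 1, failures + 1)
  # Play the arm with the highest sampled value
  arm <- which.max(draws)
  # Observe a 0-1 reward and update that arm's counts
  reward <- rbinom(1, 1, true_p[arm])
  successes[arm] <- successes[arm] + reward
  failures[arm]  <- failures[arm] + 1 - reward
}

pulls <- successes + failures
posterior_mean <- (successes + 1) / (pulls + 2)
pulls
round(posterior_mean, 3)
```

Because each arm is chosen when its posterior sample is the largest, an arm is played with exactly the current posterior probability that it is best, so exploration fades on its own as the posteriors sharpen.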

Thompson sampling: General stochastic (random) rewards Second idea: Generalize to arbitrary reward distribution (normalized to the interval [0, 1]) by considering a trial a “success” with probability equal to its reward http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012
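The reduction from a general reward in [0, 1] to a Bernoulli update can be sketched as follows. The function name update_beta and the uniform rewards used to exercise it are hypothetical; only the "success with probability equal to the reward" step comes from the slide.

```r
# Convert a reward in [0, 1] into a Bernoulli "success" and update a Beta posterior
update_beta <- function(alpha, beta, reward) {
  stopifnot(reward >= 0, reward <= 1)
  success <- rbinom(1, 1, reward)   # success with probability equal to the reward
  c(alpha = alpha + success, beta = beta + 1 - success)
}

set.seed(2)
post    <- c(alpha = 1, beta = 1)   # uniform prior
rewards <- runif(200)               # hypothetical stream of normalized rewards
for (r in rewards) {
  post <- update_beta(post[["alpha"]], post[["beta"]], r)
}

post
post[["alpha"]] / sum(post)   # posterior mean, an estimate of the mean reward
```

The pseudo-trial has the same expected value as the original reward, so the Beta posterior tracks the arm's mean reward and the Bernoulli algorithm can be reused unchanged.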

Thompson sampling with complex online actions
Main idea: Embed simulation-optimization in Thompson sampling loop
Θ = state space, S
Sample the states
Applications: Job scheduling (assigning jobs to machines); web advertising with reward depending on sets of ads shown
Y = observation, h = reward, X = random variable depending on θ
Updating posteriors can be done efficiently using a sampling-based approach (particle filtering)
Gopalan et al., 2014 http://jmlr.org/proceedings/papers/v32/gopalan14.pdf

Comparing methods In simulation experiments, Thompson sampling works well with batch updating, even with slowly or occasionally changing rewards and other realistic complexities. Beats UCB1 in many but not all comparisons More practical than UCB1 for batch updating because it keeps experimenting (trying actions with some randomness) between updates. http://engineering.richrelevance.com/recommendations-thompson-sampling/

MAB variations
Contextual bandits: See signal before acting
Constrained contextual bandits: Actions constrained
Adversarial bandits: Adaptive adversaries (Bubeck and Slivkins, 2012, https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/COLT12_BS.pdf)
Restless bandits: Probabilities change
Gittins index maximizes expected discounted reward, not easy to compute
Correlated bandits

Wrap-up on MAB problems
Adaptive Bayesian learning works well in simple environments, including many of practical interest
The resulting rules are *much* simpler to implement than previous methods (e.g., Gittins index policies)
Sampling-based approaches (Thompson, particle filtering, etc.) promote computationally practical “online learning”

Wrap-up on adaptive learning No need for a causal model Learn act-consequence probabilities and optimal decision rules directly Assumes a stationary (or slowly changing) decision environment, known choice set, immediate feedback (reward) following action Works very well when these assumptions are met: low-regret learning is possible

Optimal stopping

Optimal stopping decision problems Suppose that a decision-maker (d.m.) faces a random sequence of opportunities How long to wait for best one? When to stop and commit to a final choice? Examples: Selling a house, hiring a new employee, accepting a job offer, replacing a component, shuttering an aging facility, taking a parking spot, etc. Other optimal stopping problems: Least-cost policies for replacing aging components

Hazard functions: Conditional rate of failure given survival so far
Let T = length of life for a component (or person, or time until first occurrence of an event, etc.)
T is a random variable with cdf F(t) = Pr(T < t) and survival function S(t) = 1 – F(t) = Pr(T > t)
The pdf for T is then f(t) = F’(t) = dF(t)/dt
The hazard function for T is defined as: h(t) = lim_{dt → 0} Pr(t < T < t + dt | T > t)/dt
h(t) = f(t)/S(t) = f(t)/[1 – F(t)]
Interpretation: “instantaneous failure rate”
h(t)dt ≈ Pr(occurs in next dt | survival until t)
In discrete time, dt = 1, no limit is taken
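The identity h(t) = f(t)/S(t) can be checked numerically against the limit definition. The sketch below uses a normal lifetime with mean 30 and standard deviation 4 (the homework #10 parameters); the evaluation age t0 = 32 and the step dt are arbitrary illustrative choices.

```r
# Hazard function of a normally distributed lifetime: h(t) = f(t) / S(t)
hazard_normal <- function(t, mean, sd) {
  dnorm(t, mean, sd) / pnorm(t, mean, sd, lower.tail = FALSE)
}

mu <- 30; sigma <- 4   # lifetime parameters from homework #10
t0 <- 32               # arbitrary age at which to evaluate the hazard

h_exact <- hazard_normal(t0, mu, sigma)

# Limit definition with a small but finite dt:
# Pr(t < T < t + dt | T > t) / dt
dt <- 1e-4
h_approx <- (pnorm(t0 + dt, mu, sigma) - pnorm(t0, mu, sigma)) /
  pnorm(t0, mu, sigma, lower.tail = FALSE) / dt

c(exact = h_exact, finite_difference = h_approx)
```

Both values are about 0.285 per year: a platform that has survived to age 32 fails during the next year at a rate of roughly 28.5% per year.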

Using hazard functions to guide decisions
The shape of the hazard function can often guide decisions, e.g.:
If h(t) is increasing, then optimal time to stop is when h(t) reaches a certain threshold
If h(t) is decreasing, then best decision is either don’t start or else continue until failure occurs
Normal distribution hazard function calculator is at http://reliabilityanalyticstoolkit.appspot.com/normal_distribution
SPRT and other calculators: http://reliabilityanalyticstoolkit.appspot.com/
www.wolfram.com/mathematica/new-in-9/enhanced-probability-and-statistics/define-a-distribution-given-its-hazard-function.html
https://www.ncss.com/software/ncss/survival-analysis-in-ncss/
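For an increasing hazard, the threshold rule can be sketched numerically instead of with an online calculator. The example applies one reading of the homework #10 hint (continue while the marginal benefit, $10M per year, exceeds the expected marginal cost, h(t) times $50M) to a normal lifetime with mean 30 and standard deviation 4. Treat it as an illustration of the rule under that myopic reading, not as an official solution; the search interval of 20 to 40 years is an assumption.

```r
# Stop when the hazard first reaches benefit / failure cost (increasing-hazard case)
hazard_normal <- function(t, mean, sd) {
  dnorm(t, mean, sd) / pnorm(t, mean, sd, lower.tail = FALSE)
}

benefit_per_year <- 10   # $M per year while operating
failure_cost     <- 50   # $M if the platform fails in use
threshold <- benefit_per_year / failure_cost   # stop when h(t) >= 0.2 per year

# Confirm the hazard is increasing on the search interval, so the rule applies
ages <- 20:40
hazard_is_increasing <- all(diff(hazard_normal(ages, 30, 4)) > 0)

# Solve h(t) = threshold for t
root <- uniroot(function(t) hazard_normal(t, 30, 4) - threshold,
                interval = c(20, 40))
t_stop <- root$root

hazard_is_increasing
t_stop
```

The same pattern works for any lifetime distribution with an increasing hazard: swap in its density and survival function and change the threshold to match the costs.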

Example: optimal age replacement
The lifetime T of a component is a random variable with known distribution
Suppose it costs $10 to replace the component before it fails and $50 to replace it if it fails.
When should the component be voluntarily replaced (if not failed yet)?
Answer can be calculated by minimizing long-run expected cost per unit time, i.e., expected cost per replacement cycle divided by expected cycle length (or by equating marginal benefit to marginal cost for continuing), but calculations are detailed and soon get tedious
Alternative: Google “optimal replacement age calculator”
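The age-replacement calculation is short once it is handed to a numerical optimizer. The slide does not specify the lifetime distribution, so the sketch below assumes a Weibull lifetime with shape 2 and scale 10 (an increasing hazard) purely for illustration; the costs of $10 and $50 are from the slide. It minimizes expected cost per cycle divided by expected cycle length.

```r
# Optimal age replacement: minimize long-run cost per unit time
c_planned <- 10   # cost of replacing before failure
c_failure <- 50   # cost of replacing after failure

# ASSUMPTION: Weibull lifetime, shape 2 and scale 10 (not given on the slide)
shape <- 2
scale <- 10
surv <- function(t) pweibull(t, shape, scale, lower.tail = FALSE)

cost_rate <- function(age) {
  # Expected cost of one cycle: planned replacement if the part survives to `age`
  cycle_cost <- c_planned * surv(age) + c_failure * (1 - surv(age))
  # Expected cycle length: E[min(T, age)] = integral of S(u) from 0 to age
  cycle_length <- integrate(surv, lower = 0, upper = age)$value
  cycle_cost / cycle_length
}

opt <- optimize(cost_rate, interval = c(0.1, 40))
t_replace <- opt$minimum     # replacement age that minimizes cost per unit time
min_rate  <- opt$objective   # cost per unit time at that age

c(replace_at = t_replace, cost_rate = min_rate,
  run_to_failure_rate = cost_rate(40))
```

With these assumed parameters, planned replacement at about half the scale parameter costs noticeably less per unit time than running each component to failure.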

Optimal age replacement calculator http://www.reliawiki.org/index.php/Optimum_Replacement_Time_Example

Optimal selling of an asset If offers arrive sequentially from a known distribution and costs of waiting are known, then an optimal decision boundary (blue) can be constructed to maximize EMV Sell when red line first hits blue decision boundary W(t) = price series S(t) = maximum price so far http://file.scirp.org/Html/9-1040163_25151.htm

Optimal stopping: Variations Offers arrive sequentially from an unknown distribution Bayesian updating provides solutions Time pressure: Must sell by deadline, or fixed number of offers With or without being able to go back to previous offers Sell when blue line first hits green decision boundary http://file.scirp.org/Html/9-1040163_25151.htm

Wrap-up on optimal stopping and statistical decision theory Many valuable decision problems can be solved using the philosophy of simulation-optimization: Try different decisions, evaluate their probable consequences Choose the one with best (EMV or EU-maximizing) probability distribution of consequences Finding a best decision or decision rule can become very technical Use appropriate software or on-line calculators For business applications, understanding how to formulate decision problems and solve them with software can create high value in practice

Introduction to descriptive analytics using information-based algorithms (BN learning, CART trees, randomForest, partial dependence plots)

Descriptive analytics: What’s going on? What is the current situation? Attribution: How much harm/loss/opportunity cost is being caused by X? Causes are often unobserved or uncertain What has changed recently? (Why?) Example: More extreme event reports caused by real change or by media? Change-point analysis (CPA) algorithms What should we worry about? How is this year’s season shaping up?

Air pollution example in CAT*
Load data in Excel, click Excel to R to send it to R
Los Angeles air basin
1461 days, 2007-2010 (Lopiano et al., 2015, thanks to Stan Young for data)
PM2.5 data from CARB
Elderly mortality (“AllCause75”) from CA Department of Health
Daily min and max temps & max relative humidity from ORNL and EPA
Risk question: Does PM2.5 exposure increase elderly mortality risk? If so, by how much? (P(death | PM2.5) = ?)
*CAT = Causal Analytics Toolkit, http://cox-associates.com/downloads/

Classification Trees
A powerful, popular method for data mining, machine learning, and CPT estimation
In R, ctree (in the party package), rpart, and other functions provide classification and regression tree (CART-style) algorithms
Can download applet from: www.cs.ubc.ca/labs/lci/CIspace/dTree/
Basic idea: “Always ask the most informative question next” for reducing uncertainty (conditional entropy) about the dependent variable.
Partitions a set of cases into groups or clusters (the leaf nodes) with similar conditional probabilities for dependent variable.

Air pollution-mortality example: Classification tree descriptive analytics tmin, tmax, month, year, MAXRH are identified as potential predictors of AllCause75 (elderly mortality) PM2.5 does not appear in this tree AllCause75 = elderly mortality count per day is conditionally independent of PM2.5 in this analysis, given the other variables in the tree Making year and month into categorical variables changes the tree but not this conclusion.

How a CART tree works: Concept
Basic idea: Always ask the most informative question next, given answers so far.
Questions are represented by splits in tree
Leaf nodes show conditional means (or conditional distributions) of dependent variable
Internal nodes show significance level for split: how significant are differences between conditional distributions
Can handle continuous, categorical, ordered categorical, and binary variables and missing values
Reduces prediction error for dependent variable
Stop this “recursive partitioning” when further questions (splits in tree) do not significantly improve prediction.
Classification & Regression Tree (CART) algorithm
Which variables are informative in predicting dependent variable can depend on what other predictors are included. Taking out month makes PM2.5 informative.
Some refinements:
Grow a large tree and prune back to minimize cross-validation error
Fit multiple trees to random (bootstrap) subsets of data and let them vote on, or average, their predictions (“bagging”)
Over-train on mis-predicted cases (“boosting”)
Average predictions from many trees (“RandomForest” ensemble prediction)
Join prediction “patches” together smoothly (MARS)
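Recursive partitioning can be tried outside CAT with the rpart package, which ships with standard R installations. The Los Angeles data set from the lecture is not reproduced here, so this sketch uses R's built-in airquality data (daily New York ozone readings) as a stand-in; the choice of Ozone as dependent variable is an assumption made only for illustration.

```r
# Fit a regression tree with rpart on a built-in stand-in data set
library(rpart)

# Ozone is the dependent variable; the rest are candidate predictors
fit <- rpart(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

# Each row of fit$frame is a node; "<leaf>" marks the tips of the tree
leaves <- fit$frame[fit$frame$var == "<leaf>", c("n", "yval")]
names(leaves) <- c("n_cases", "conditional_mean")

print(fit)   # text display of the splits (the "questions")
leaves       # number of cases and conditional mean of Ozone at each leaf
```

Each leaf row gives the same two quantities read off a plotted tree: how many cases match the path of splits, and the conditional mean of the dependent variable for those cases.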

How to read a CART tree
Tips of the tree (“leaf nodes”) give the conditional expected value (for continuous variables) or conditional distributions (for discrete variables) of the dependent variable, given the value ranges of the variables along the path from the top of the tree (the “root node”) to the leaf node.
The dependent variable is called “y” by default in the tree
Branches (“splits”) show the value ranges being conditioned on
The tips of the tree also show how many cases in the data set are described by the combination of values leading to that leaf node. These are the “n” values at the leaf nodes
Example: n = 92 days had tmin < 43 degrees and PM2.5 < 14 µg/m3. An average of y = 141.696 elderly people per day died on days with this description.

Creating a CART tree in CAT Select dependent variable first, by clicking on a column heading in the Data sheet Select predictor columns using Ctrl + click Can select in any order Click on desired analysis (Tree) Output appears on new tab (If unsure what analyses to do after columns selected, click on Analyze.)

Classification/Decision Tree Algorithms in Practice
Different decision tree (= classification tree) algorithms use different splitting, stopping, and pruning criteria.
Examples of tree algorithms:
CHAID: chi-square automatic interaction detection
CART: Classification and regression trees, allows continuous as well as discrete variables
MARS: Multivariate adaptive regression splines, smooths the relations at tree-tips so that they fit together
KnowledgeSeeker allows multiple splits, variable types
ID3, C5.0, etc.

Recursively identify parents (or potential parents) for each node.

Causal graph structure #1
[Figure: causal graph with nodes Q3h reviewed check list; Net1_3 Helped me choose products; Q7_1 End-to-end experience; Q3f offered to set up on-line billing; NET1_5 Wait time ok; Q3b offered to demonstrate; Q2uu acknowledged when entered store; Q2jjj questions answered]

Classification-tree Tests of CI In a classification tree, the dependent variable (root node) is conditionally independent of all variables not in the tree, given the variables that are in the tree (at least as far as the tree-growing heuristic can discover). Starting with a childless node (output node), we can recursively seek direct parents of all nodes.
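The tree-based screen for conditional independence amounts to listing which candidate predictors never appear in a split. A minimal sketch with rpart on R's built-in airquality data follows; the data set and the choice of Ozone as the output node are stand-ins for illustration, not the lecture's example.

```r
# Which candidate predictors does the tree leave out?
library(rpart)

predictors <- c("Solar.R", "Wind", "Temp", "Month", "Day")
fit <- rpart(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality)

# Variables used in at least one split (fit$frame$var also contains "<leaf>")
in_tree <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

# Candidates the tree-growing heuristic found uninformative, given those in the tree
not_in_tree <- setdiff(predictors, in_tree)

in_tree       # potential parents of the output node
not_in_tree   # candidates for conditional independence from the output node
```

Absence from the tree is only heuristic evidence: it depends on the splitting and stopping criteria, and on which other predictors were offered. To continue the recursion, each variable in in_tree becomes the dependent variable of its own tree.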