Math 6330: Statistical Consulting Class 2


Math 6330: Statistical Consulting Class 2 Tony Cox tcoxdenver@aol.com University of Colorado at Denver Course web site: http://cox-associates.com/6330/

Student introductions
Name
Affiliation (academic, professional)
Technical interests
Any special expertise in data analysis areas
Any projects or data sets of interest
Thoughts on Assignment 1? (How does PM2.5 affect elderly mortality in this data set?)
Hopes and goals for course

Assignment 1
Download data set Sample1.xlsx from http://cox-associates.com/6330/
Analyze the data to answer the following client question: “Is there evidence that high concentrations of fine particulate matter (PM2.5) increase daily elderly mortality counts (AllCause75)? If so, how large is the effect?”
E-mail questions to tcoxdenver@aol.com

Assignment 2 (Due January 31)
Download data set Class2DataBenzene.xlsx from http://cox-associates.com/6330/
Setting: 3 factories in China, many workers, some measured more than once (“splits”). Some missing data. The detection limit for benzene in air is 0.2 ppm.
Analyze the data to answer the following client question: “Is there evidence that low concentrations of benzene in air (e.g., AB < 1) produce disproportionately more toxic metabolites (PH, CA, HQ for phenol, catechol, hydroquinone) or total urinary metabolites (UB) than higher concentrations? What is the shape of the low-concentration relation between AB and each metabolite?”
Background: Who cares, and why? http://retractionwatch.com/2013/04/08/environmental-scientists-call-for-retraction-of-oil-industry-funded-paper-on-benzene-exposure/
E-mail questions to tcoxdenver@aol.com

Reminder: Goals for student projects
Extract good problems from available knowledge and data
“Good” = high value of analysis = large improvement in decisions, results, etc.
Apply high-value techniques to produce valuable answers and insights
Unexpected directions are ok!
Present the results so that the potential value is actually delivered
If possible, document impact and next steps

Some high-value consulting tools – Beyond clustering and regression
Classification and regression trees (CART)
Random Forest
Bayesian networks
Influence diagrams
Predictive analytics
Causal analytics
State transition models
Dynamic simulation modeling
Markov Decision Processes (MDPs)
Partially observable MDPs
Simulation-optimization

Components of a successful project
Problem statement and motivation
Data
Analysis plan/narrative
Tools and software
Results: Reports and displays
Presentation: What did we learn?
Evaluation: What was the impact?
Proposed next steps

High-value statistical consulting skills

Components of a consulting engagement
1. Agreed-to problem statement or question
2. Understanding of why it matters, underlying goals, decisions, or questions
3. Data that are relevant (maybe) for answering the question
4. Methods: Tools, analyses, software
5. Results and interpretation. Caveats/limitations
6. Report to client (summarizes 1-5)
7. Proposed next steps (usually) – Builds on 1-6

Key steps in consulting
Vision: Define and agree on success – goals and measures
What you measure is what you get
Clarify objectives
Generate alternatives
Compare/evaluate alternatives
Make recommendations, show why
Evaluate performance

Toward higher-value analytics
Reorientation: From solving well-posed problems to discovering how to act more effectively
Descriptive analytics: What’s happening?
Predictive analytics: What’s (probably) coming next?
Causal analytics: What can we do about it?
Prescriptive analytics: What should we do?
Evaluation analytics: How well is it working?
Learning analytics: How to do better?
Collaboration: How to do better together?

High-value statistical skills
Describe current situation
Predict what is likely to happen next if we do not take action
Predict what is likely to happen next if we take different actions
Optimize decisions about what to do
Evaluate how well current policies are working
Learn to improve current policies

Introduction to descriptive analytics

Descriptive analytics: What’s going on?
What is the current situation?
Attribution: How much harm/loss/opportunity cost is being caused by X?
Causes are often unobserved or uncertain
What has changed recently? (Why?)
Example: Are more extreme event reports caused by real change or by media coverage?
Change-point analysis (CPA) algorithms
What should we worry about?
How is this year’s season shaping up?
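To make the change-point idea concrete, here is a minimal sketch of one very simple form of it: finding the single split of a series that minimizes the total within-segment sum of squared errors. This is an illustrative assumption, not the specific CPA algorithm used in the course; the function names and the report counts are made up, and the sketch does not test whether the detected change is statistically significant.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_change_point(series, min_segment=2):
    """Return (k, cost) for the split series[:k] / series[k:] with the
    smallest total within-segment SSE. Assumes a single shift in the mean."""
    best_k, best_cost = None, float("inf")
    for k in range(min_segment, len(series) - min_segment + 1):
        cost = sse(series[:k]) + sse(series[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

# Hypothetical yearly counts of extreme-event reports (made-up numbers).
reports = [10, 12, 11, 9, 10, 11, 20, 22, 19, 21, 20]
k, cost = best_change_point(reports)
print("change after index", k, "SSE without split:", sse(reports), "with split:", cost)
```

A large drop in SSE at the chosen split suggests that something changed at that point; it does not say whether the cause was a real change or a change in reporting.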

Air pollution example: Classification tree descriptive analytics
tmin, tmax, month, year, MAXRH are potential predictors of AllCause75 (elderly mortality)
PM2.5 does not appear in this tree
AllCause75 is conditionally independent of PM2.5 in this analysis, given the other variables in the tree
Making year and month into categorical variables changes the tree but not this conclusion.

How a CART tree works
Basic idea: Always ask the most informative question next, given answers so far.
Questions are represented by splits in tree
Leaf nodes show conditional means (or conditional distributions) of dependent variable
Internal nodes show significance level for split: how significant are differences between conditional distributions
Reduces prediction error for dependent variable
Stop this “recursive partitioning” when further questions (splits in tree) do not significantly improve prediction.
Classification & Regression Tree (CART) algorithm
Some refinements:
  Grow a large tree and prune back to minimize cross-validation error
  Fit multiple trees to random subsets of data and let them vote for best splits (“bagging”)
  Over-train on mis-predicted cases (“boosting”)
  Average predictions from many trees (“RandomForest” ensemble prediction)
  Join prediction “patches” together smoothly (MARS)
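The recursive partitioning described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the course's software: it chooses each split by the largest reduction in squared error and stops when the reduction falls below a hypothetical `min_gain` threshold, whereas the slide describes stopping based on significance tests. The data (temperature, PM2.5, mortality count) are made up so that mortality depends on temperature only.

```python
from statistics import mean

def sse(ys):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, ys):
    """Return (feature, threshold, gain) for the most informative question,
    where gain is the reduction in SSE; feature is None if nothing helps."""
    parent = sse(ys)
    best = (None, None, 0.0)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            gain = parent - sse(left) - sse(right)
            if gain > best[2]:
                best = (f, t, gain)
    return best

def grow(rows, ys, min_gain=1.0):
    """Recursively partition until no split improves SSE by at least min_gain."""
    f, t, gain = best_split(rows, ys)
    if f is None or gain < min_gain:
        return {"leaf": mean(ys)}  # leaf node stores the conditional mean
    left = [(r, y) for r, y in zip(rows, ys) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, ys) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left], min_gain),
        "right": grow([r for r, _ in right], [y for _, y in right], min_gain),
    }

def predict(tree, row):
    """Follow the splits down to a leaf and return its conditional mean."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Made-up rows: (tmax, pm25); made-up response: daily mortality count.
rows = [(30, 5), (32, 20), (35, 8), (38, 25), (85, 6), (88, 22), (90, 9), (95, 24)]
ys = [10, 10, 10, 10, 14, 14, 14, 14]
tree = grow(rows, ys)
print(tree)
```

In this toy data set the tree splits only on the first feature (tmax) and never on the second (pm25), which mirrors the kind of conclusion drawn in the air pollution example: a variable that does not help prediction, given the others, does not appear in the tree.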

Bayesian Networks (BNs) show information relations among variables
BNs provide a high-level roadmap for descriptive analytics
Each node has a conditional probability table (CPT) (or regression model, CART tree, etc.) describing how the conditional probabilities of its values depend on other variables.
If no arrow connects two variables, then they are conditionally independent of each other, given the other variables in the BN.
Omitted variables can create statistical dependencies
Conditioning on variables can also sometimes create dependencies
Information principle for causality: Causes are not conditionally independent of their effects.
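As a small illustration of CPTs and conditional independence, the sketch below builds the joint distribution of a three-node chain X -> Y -> Z over binary variables and checks it by brute-force enumeration. The network structure, the probabilities, and the function names are all hypothetical, chosen only to show the idea: X and Z have no arrow between them, so Z is independent of X once Y is known, even though X and Z are dependent when Y is ignored.

```python
from itertools import product

def joint_chain(p_x, p_y_given_x, p_z_given_y):
    """Joint P(x, y, z) for the chain X -> Y -> Z with binary variables.
    p_x is P(X=1); the dicts map a parent value to P(child=1 | parent)."""
    joint = {}
    for x, y, z in product((0, 1), repeat=3):
        px = p_x if x else 1 - p_x
        py = p_y_given_x[x] if y else 1 - p_y_given_x[x]
        pz = p_z_given_y[y] if z else 1 - p_z_given_y[y]
        joint[(x, y, z)] = px * py * pz
    return joint

def prob(joint, event, given=lambda s: True):
    """P(event | given), computed by summing the joint over matching states."""
    den = sum(p for s, p in joint.items() if given(s))
    num = sum(p for s, p in joint.items() if given(s) and event(s))
    return num / den

# Hypothetical CPT entries (made-up numbers).
joint = joint_chain(0.3, {0: 0.2, 1: 0.8}, {0: 0.1, 1: 0.7})

def z_is_1(s):
    return s[2] == 1

# Given Y, knowing X does not change the probability of Z.
p_z_y1_x0 = prob(joint, z_is_1, lambda s: s[1] == 1 and s[0] == 0)
p_z_y1_x1 = prob(joint, z_is_1, lambda s: s[1] == 1 and s[0] == 1)

# Without conditioning on Y, X is informative about Z.
p_z_x0 = prob(joint, z_is_1, lambda s: s[0] == 0)
p_z_x1 = prob(joint, z_is_1, lambda s: s[0] == 1)

print("P(Z=1 | Y=1, X=0) =", p_z_y1_x0, " P(Z=1 | Y=1, X=1) =", p_z_y1_x1)
print("P(Z=1 | X=0) =", p_z_x0, " P(Z=1 | X=1) =", p_z_x1)
```

Enumeration like this only scales to a handful of variables; it is meant to show what the CPTs encode, not how BN software performs inference.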