Data Mining: An Overview David Madigan

Overview
– Brief Introduction to Data Mining
– Data Mining Algorithms
– Specific Examples
  – Algorithms: Disease Clusters
  – Algorithms: Model-Based Clustering
  – Algorithms: Frequent Items and Association Rules
– Future Directions, etc.

Of “Laws”, Monsters, and Giants…
– Moore’s law: processing “capacity” doubles every 18 months: CPU, cache, memory
– Its more aggressive cousin: disk storage “capacity” doubles every 9 months
What do the two “laws” combined produce? A rapidly growing gap between our ability to generate data and our ability to make use of it.

What is Data Mining?
Finding interesting structure in data
– Structure: refers to statistical patterns, predictive models, hidden relationships
Examples of tasks addressed by Data Mining:
– Predictive Modeling (classification, regression)
– Segmentation (data clustering)
– Summarization
– Visualization

Ronny Kohavi, ICML 1998

Stories: Online Retailing

Chapter 4: Data Analysis and Uncertainty
– Elementary statistical concepts: random variables, distributions, densities, independence, point and interval estimation, bias & variance, MLE
– Model (global, represents prominent structures) vs. Pattern (local, idiosyncratic deviations)
– Frequentist vs. Bayesian
– Sampling methods

Bayesian Estimation
e.g. beta-binomial model: with prior θ ~ Beta(α, β) and x successes observed in n Bernoulli trials, the posterior is θ | x ~ Beta(α + x, β + n − x)
Predictive distribution: p(x_new | x) = ∫ p(x_new | θ) p(θ | x) dθ
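A minimal sketch of these two quantities in Python; the prior hyperparameters and data below are illustrative:

    def beta_binomial_posterior(alpha, beta, x, n):
        # Conjugate update: Beta(alpha, beta) prior + x successes in n trials
        return alpha + x, beta + (n - x)

    def predictive_success_prob(alpha, beta, x, n):
        # Posterior predictive probability that the next trial succeeds
        # (the mean of the posterior Beta distribution)
        a, b = beta_binomial_posterior(alpha, beta, x, n)
        return a / (a + b)

    # Uniform prior Beta(1, 1); observe 7 successes in 10 trials
    print(beta_binomial_posterior(1, 1, 7, 10))    # (8, 4)
    print(predictive_success_prob(1, 1, 7, 10))    # 0.666...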

Issues to do with p-values
– Using thresholds of 0.05 or 0.01 regardless of sample size
– Multiple testing (e.g. Friedman (1983) selecting highly significant regressors from noise)
– Subtle interpretation
Jeffreys (1980): “I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened.”

p-value as measure of evidence
– Schervish (1996): “if hypothesis H implies hypothesis H', then there should be at least as much support for H' as for H.” - not satisfied by p-values
– Grimmet and Ridenhour (1996): “one might expect an outlying data point to lend support to the alternative hypothesis in, for instance, a one-way analysis of variance.” - the value of the outlying data point that minimizes the significance level can lie within the range of the data

Chapter 5: Data Mining Algorithms
“A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns” (Hand, Mannila, and Smyth)
– “well-defined”: can be encoded in software
– “algorithm”: must terminate after some finite number of steps

Algorithm Components
1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when the data are too large to reside in memory)

Backpropagation data mining algorithm
[Figure: feed-forward network with inputs x1, x2, x3, x4, hidden units h1, h2, and output y: a 4-2-1 architecture]
– vector of p input values multiplied by a p × d1 weight matrix
– resulting d1 values individually transformed by a non-linear function
– resulting d1 values multiplied by a d1 × d2 weight matrix

Backpropagation (cont.)
Parameters: the weights (the entries of the weight matrices)
Score: e.g. sum of squared errors, S = Σ_i (y(i) − ŷ(i))²
Search: steepest descent; search for structure?
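A runnable sketch of these components for the 4-2-1 network above, assuming a tanh non-linearity and a squared-error score (both illustrative choices; the slide shows the formulas as images):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: p = 4 inputs, scalar target
    X = rng.normal(size=(100, 4))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

    # Parameters: p x d1 and d1 x d2 weight matrices (d1 = 2, d2 = 1)
    W1 = rng.normal(scale=0.5, size=(4, 2))
    W2 = rng.normal(scale=0.5, size=(2, 1))

    eta = 0.1    # step size for steepest descent
    for step in range(2000):
        # Forward pass: multiply, transform non-linearly, multiply again
        H = np.tanh(X @ W1)         # hidden values, shape (n, d1)
        yhat = (H @ W2).ravel()     # outputs, shape (n,)
        err = yhat - y

        # Score: sum of squared errors
        if step % 500 == 0:
            print(step, round(float(np.sum(err ** 2)), 2))

        # Backward pass: gradients of the mean squared error
        g2 = H.T @ err[:, None] / len(y)
        g1 = X.T @ ((err[:, None] * W2.T) * (1 - H ** 2)) / len(y)

        # Search: steepest descent update
        W1 -= eta * g1
        W2 -= eta * g2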

Models and Patterns
Models:
– Prediction
  – Linear regression
  – Piecewise linear
  – Nonparametric regression
  – Classification: logistic regression, naïve Bayes/TAN/Bayesian networks, neural networks, support vector machines, trees, etc.
– Probability Distributions
  – Parametric models
  – Mixtures of parametric models
  – Graphical Markov models (categorical, continuous, mixed)
– Structured Data
  – Time series: Markov models, Mixture Transition Distribution models, hidden Markov models
  – Spatial models

Markov Models
First-order: p(y_t | y_1, …, y_(t−1)) = p(y_t | y_(t−1)), e.g. y_t = g(y_(t−1)) + e_t
g linear ⇒ standard first-order auto-regressive model
[Figure: chain y_1 → y_2 → y_3 → … → y_T]

First-Order HMM/Kalman Filter
[Figure: hidden chain x_1 → x_2 → x_3 → … → x_T, with each observation y_t depending on the hidden state x_t]
Note: to compute p(y_1, …, y_T) we need to sum/integrate over all possible state sequences...
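For a discrete HMM this sum need not be done by brute force: the forward recursion computes p(y_1, …, y_T) in O(T K²) time. A small sketch, with illustrative matrices:

    import numpy as np

    def hmm_likelihood(pi, A, B, obs):
        # p(y_1, ..., y_T) for a discrete HMM via the forward algorithm.
        # pi: initial state distribution (K,)
        # A:  transitions, A[i, j] = p(x_{t+1} = j | x_t = i)
        # B:  emissions,  B[i, v] = p(y_t = v | x_t = i)
        alpha = pi * B[:, obs[0]]             # joint p(x_1, y_1)
        for y in obs[1:]:
            alpha = (alpha @ A) * B[:, y]     # sum out the previous state
        return alpha.sum()                    # sum out the final state

    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
    B = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
    print(hmm_likelihood(pi, A, B, [0, 1, 1, 0]))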

Bias-Variance Tradeoff
[Figure: two fitted curves, one high bias/low variance, one low bias/high variance]
“overfitting” - modeling the random component
Score function should embody the compromise

The Curse of Dimensionality
X ~ MVN_p(0, I)
Gaussian kernel density estimation; bandwidth chosen to minimize MSE at the mean
Suppose we want the relative MSE at the mean to be less than 0.1. The number of data points required grows explosively with dimension:

Dimension | # data points
1         | 4
2         | 19
3         | 67
6         | 2,790
10        | 842,000
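A Monte Carlo sketch of this effect in Python; the normal-reference bandwidth h = n^(−1/(d+4)) below is a stand-in for the MSE-minimizing bandwidth on the slide, so the numbers only roughly track the table:

    import numpy as np

    rng = np.random.default_rng(1)

    def rel_mse_at_mean(d, n, reps=200):
        # Relative MSE of a Gaussian-kernel density estimate at x = 0,
        # for N(0, I_d) data, with a simple normal-reference bandwidth.
        true = (2 * np.pi) ** (-d / 2)        # N(0, I) density at the mean
        h = n ** (-1.0 / (d + 4))             # normal-reference rate
        ests = []
        for _ in range(reps):
            X = rng.normal(size=(n, d))
            # KDE at 0: average of Gaussian kernels centered at the data
            sq = (X ** 2).sum(axis=1)
            k = np.exp(-sq / (2 * h ** 2)) / ((2 * np.pi) ** (d / 2) * h ** d)
            ests.append(k.mean())
        ests = np.array(ests)
        return ((ests - true) ** 2).mean() / true ** 2

    for d, n in [(1, 4), (2, 19), (3, 67)]:
        print(d, n, round(rel_mse_at_mean(d, n), 2))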

Patterns
Global:
– Clustering via partitioning
– Hierarchical clustering
– Mixture models
Local:
– Outlier detection
– Changepoint detection
– Bump hunting
– Scan statistics
– Association rules

Scan Statistics via Permutation Tests
[Figure: a curve representing a road, with each “x” marking an accident; a red “x” denotes an injury accident, a black “x” no injury]
Is there a stretch of road where there is an unusually large fraction of injury accidents?

Scan with Fixed Window
If we know the length of the “stretch of road” that we seek, we could slide a window of that length along the road and find the most “unusual” window location.
[Figure: the same road, with a rectangular window over one stretch of accidents]

How Unusual is a Window?
Let p_W and p_¬W denote the true probability of being red inside and outside the window respectively. Let (x_W, n_W) and (x_¬W, n_¬W) denote the corresponding counts.
Use the GLRT for comparing H_0: p_W = p_¬W versus H_1: p_W ≠ p_¬W:

λ = sup_p p^(x_W + x_¬W) (1 − p)^((n_W + n_¬W) − (x_W + x_¬W)) / sup_(p_W, p_¬W) p_W^(x_W) (1 − p_W)^(n_W − x_W) p_¬W^(x_¬W) (1 − p_¬W)^(n_¬W − x_¬W)

−2 log λ here has an asymptotic chi-square distribution with 1 df; λ measures how unusual a window is.

Permutation Test
Since we look at the smallest λ over all window locations, we need the distribution of the smallest λ under the null hypothesis that there are no clusters.
Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x’s.
[Figure: several random relabellings of the x colors, each with its smallest λ]
Look at the position of the observed smallest λ in this distribution to get the scan statistic p-value (e.g., if the observed smallest λ is the 5th smallest of the 1,000 values, the p-value is 0.005).
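A compact sketch of the fixed-window version of this permutation test; the data, window length, and number of relabellings below are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def log_lambda(x_w, n_w, x_o, n_o):
        # log GLRT statistic for H0: p_W = p_notW vs. H1: p_W != p_notW
        def ll(x, n):   # maximized binomial log-likelihood
            out = 0.0
            if x > 0:
                out += x * np.log(x / n)
            if x < n:
                out += (n - x) * np.log(1 - x / n)
            return out
        return ll(x_w + x_o, n_w + n_o) - ll(x_w, n_w) - ll(x_o, n_o)

    def smallest_lambda(colors, w):
        # Most unusual fixed-length window: minimize log lambda
        n, x = len(colors), int(colors.sum())
        return min(
            log_lambda(int(colors[i:i + w].sum()), w,
                       x - int(colors[i:i + w].sum()), n - w)
            for i in range(n - w + 1)
        )

    # 1 = injury ("red"), 0 = no injury; a run of injuries in the middle
    colors = np.array([0] * 15 + [1] * 3 + [0, 1, 1, 0, 1] + [0] * 15)
    obs = smallest_lambda(colors, w=6)

    perms = [smallest_lambda(rng.permutation(colors), w=6) for _ in range(999)]
    rank = 1 + sum(p <= obs for p in perms)
    print("scan statistic p-value:", rank / 1000)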

Variable Length Window
No need to use a fixed-length window; examine all possible windows up to, say, half the length of the entire road.
[Figure: road with variable-length windows; O marks an accident, with fatal and non-fatal accidents distinguished by color]

Spatial Scan Statistics
The spatial scan statistic uses, e.g., circles instead of line segments

Spatial-Temporal Scan Statistics
The spatial-temporal scan statistic uses cylinders, where the height of the cylinder represents a time window

Other Issues
– Poisson model also common (instead of the Bernoulli model)
– Covariate adjustment
– Andrew Moore’s group at CMU: efficient algorithms for scan statistics

Software: SaTScan + others

Association Rules: Support and Confidence
Find all the rules Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {Y & Z}
– confidence, c: conditional probability that a transaction containing Y also contains Z
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and the overlap who buy both]
With minimum support 50% and minimum confidence 50%, we have:
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)

Mining Association Rules—An Example

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Min. support 50%; min. confidence 50%. Frequent itemsets: {A} (75%), {B} (50%), {C} (50%), {A, C} (50%).
For rule A ⇒ C:
– support = support({A & C}) = 50%
– confidence = support({A & C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.
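These definitions are easy to compute directly; a small sketch over the transaction database above:

    transactions = [
        {"A", "B", "C"},   # TID 2000
        {"A", "C"},        # TID 1000
        {"A", "D"},        # TID 4000
        {"B", "E", "F"},   # TID 5000
    ]

    def support(itemset):
        # Fraction of transactions containing every item in itemset
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        # Conditional probability of rhs given lhs
        return support(set(lhs) | set(rhs)) / support(lhs)

    print(support({"A", "C"}))         # 0.5
    print(confidence({"A"}, {"C"}))    # 0.666...
    print(confidence({"C"}, {"A"}))    # 1.0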

Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
Use the frequent itemsets to generate association rules.

The Apriori Algorithm
Join Step: C_k is generated by joining L_(k-1) with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Pseudo-code:
  C_k: candidate itemsets of size k
  L_k: frequent itemsets of size k

  L_1 = {frequent items};
  for (k = 1; L_k != ∅; k++) do begin
      C_(k+1) = candidates generated from L_k;
      for each transaction t in database do
          increment the count of all candidates in C_(k+1) that are contained in t;
      L_(k+1) = candidates in C_(k+1) with min_support;
  end
  return ∪_k L_k;
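A runnable sketch of the same join/prune/count loop (candidate generation here is one of several reasonable implementations):

    from itertools import combinations

    def apriori(transactions, min_support=0.5):
        n = len(transactions)

        def freq(c):
            return sum(c <= t for t in transactions) / n >= min_support

        items = {i for t in transactions for i in t}
        L = {frozenset([i]) for i in items if freq(frozenset([i]))}
        result = list(L)
        k = 1
        while L:
            # Join step: unions of members of L_k that form (k+1)-itemsets
            C = {a | b for a in L for b in L if len(a | b) == k + 1}
            # Prune step: every k-subset of a candidate must be in L_k
            C = {c for c in C
                 if all(frozenset(s) in L for s in combinations(c, k))}
            # Scan: keep candidates with minimum support
            L = {c for c in C if freq(c)}
            result.extend(L)
            k += 1
        return result

    db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(db))   # frequent itemsets {A}, {B}, {C}, {A, C}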

The Apriori Algorithm — Example
[Figure: worked example on database D; scan D to count candidate 1-itemsets C1 and keep the frequent ones as L1; join L1 with itself to form C2, scan D again to get L2; repeat for C3 and L3]

Association Rule Mining: A Road Map
– Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
  – age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
– Single-dimensional vs. multi-dimensional associations (see examples above)
– Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
– Various extensions (thousands!)

Model-based Clustering Padhraic Smyth, UCI

Mixtures of {Sequences, Curves, …}
p(D_i) = Σ_k p(c_k) p(D_i | c_k)
Generative model:
– select a component c_k for individual i
– generate data according to p(D_i | c_k)
– p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
[Note: given p(D_i | c_k), we can define an EM algorithm]

Example: Mixtures of SFSMs
A simple model for traversal on a Web site (equivalent to first-order Markov with an end state)
Generative model for large sets of Web users: different behaviors ⇒ mixture of SFSMs
The EM algorithm is quite simple: weighted counts
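A sketch of that EM in Python, using a plain mixture of first-order Markov chains as a stand-in for the SFSM-with-end-state model; the toy sequences, number of components, and smoothing constant are all illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def em_markov_mixture(seqs, n_states, K, iters=50):
        # EM for a mixture of first-order Markov chains.
        # seqs: list of state-index sequences; K: number of components.
        pi = np.full(K, 1.0 / K)                                   # weights
        T = rng.dirichlet(np.ones(n_states), size=(K, n_states))   # transitions
        for _ in range(iters):
            # E-step: responsibility of each component for each sequence
            R = np.zeros((len(seqs), K))
            for i, s in enumerate(seqs):
                for k in range(K):
                    loglik = sum(np.log(T[k, a, b]) for a, b in zip(s, s[1:]))
                    R[i, k] = np.log(pi[k]) + loglik
                R[i] = np.exp(R[i] - R[i].max())
                R[i] /= R[i].sum()
            # M-step: weighted transition counts
            pi = R.mean(axis=0)
            C = np.full((K, n_states, n_states), 1e-3)   # light smoothing
            for i, s in enumerate(seqs):
                for a, b in zip(s, s[1:]):
                    C[:, a, b] += R[i]
            T = C / C.sum(axis=2, keepdims=True)
        return pi, T, R

    # Toy data: two browsing behaviors over 3 page categories
    seqs = ([[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1]] * 5
            + [[2, 2, 2, 2], [2, 2, 1, 2, 2]] * 5)
    pi, T, R = em_markov_mixture(seqs, n_states=3, K=2)
    print(np.round(pi, 2))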

WebCanvas: Cadez, Heckerman, et al., KDD 2000

Discussion
– What is data mining? Hard to pin down - who cares?
– Textbook statistical ideas with a new focus on algorithms
– Lots of new ideas too

Privacy and Data Mining Ronny Kohavi, ICML 1998

Analyzing Hospital Discharge Data David Madigan Rutgers University

Comparing Outcomes Across Providers Florence Nightingale wrote in 1863: “In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purposes of comparison…I am fain to sum up with an urgent appeal for adopting some uniform system of publishing the statistical records of hospitals.”

Data
– Data of various kinds are now available; e.g. data concerning all Medicare/Medicaid hospital admissions in the standard format UB-92; covers >95% of all admissions nationally
– Considerable interest in using these data to compare providers (hospitals, physician groups, physicians, etc.)
– In Pennsylvania, large corporations such as Westinghouse and Hershey Foods are a motivating force and use the data to select providers.

[Table: UB-92 discharge record fields, including identifiers and admission data (SYSID, YEAR, QUARTER, HREGION, MAID, ADTYPE, ADSOURCE, ADHOUR, ADMDX, ADDOW), patient demographics (PTSEX, ETHNIC, RACE, AGE, AGECAT, PRIVZIP, COUNTY, STATE), discharge status (DCSTATUS, LOS, DCHOUR, DCDOW), diagnoses (PDX, SDX1–SDX8, ECODE), procedures with day-of-week fields (PPX, SPX1–SPX5, PPXDOW, SPX1DOW–SPX5DOW), providers and payers (REFID, ATTID, OPERID, PAYTYPE1–PAYTYPE3, ESTPAYER, NAIC), charges (TOTALCHG, ROOMCHG, ANCLRCHG, DRUGCHG, EQUIPCHG, SPECLCHG, MISCCHG, PROFCHG, NONCVCHG), and severity/grouping fields (DRGHOSP, DRGHC4, MDCHC4, APRMDC, APRDRG, APRSOI, APRROM, MQSEV, MQNRSP, MQGCLUST, MQGCELL, CANCER1, CANCER2)]
Pennsylvania Healthcare Cost Containment Council, n=800,000

Risk Adjustment
– Discharge data like these allow for comparisons of, e.g., mortality rates for the CABG procedure across hospitals.
– Some hospitals accept riskier patients than others; a fair comparison must account for such differences.
– PHC4 (and many other organizations) use “indirect standardization”

Hospital Responses

p-value computation
n = 463; suppose the actual number of deaths = 40 and the expected number e = 29.56
p-value = P(X ≥ 40), where X is the death count under the null (e.g. X ~ Binomial(463, 29.56/463))
p-value < 0.05
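A quick check of this computation, assuming a binomial null for the death count (a Poisson(29.56) null gives nearly the same answer):

    from math import comb

    n, deaths, e = 463, 40, 29.56
    p = e / n

    # Exact upper-tail binomial probability P(X >= deaths)
    pval = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(deaths, n + 1))
    print(round(pval, 3))   # about 0.03, i.e. < 0.05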

Concerns
– Ad-hoc groupings of strata
– Adequate risk adjustment for outcomes other than mortality? Sensitivity analysis? Hopeless?
– Statistical testing versus estimation
– Simpson’s paradox

Hospital A:
Risk Cat. | N   | Rate | Actual Number | Expected Number
Low       | 800 | 1%   | 8             | 8 (1%)
High      | 200 | 8%   | 16            | 10 (5%)
SMR = 24/18 = 1.33; p-value = 0.07

Hospital B:
Risk Cat. | N   | Rate | Actual Number | Expected Number
Low       | 200 | 1%   | 2             | 2 (1%)
High      | 800 | 8%   | 64            | 40 (5%)
SMR = 66/42 = 1.57; p-value < 0.001

Note: the two hospitals have identical risk-specific death rates (1% low, 8% high), yet their SMRs differ because of case mix.
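A sketch of the indirect-standardization arithmetic behind this table; the Poisson null used for the p-value is an assumption (the slide does not say which test produced its p-values), so the values need not match exactly:

    from math import exp, factorial

    def smr_and_pvalue(strata):
        # strata: list of (n, observed_rate, reference_rate) per risk category
        observed = sum(n * r_obs for n, r_obs, _ in strata)
        expected = sum(n * r_ref for n, _, r_ref in strata)
        # Upper-tail Poisson p-value for the observed count
        lam = expected
        pval = 1 - sum(exp(-lam) * lam**k / factorial(k)
                       for k in range(int(observed)))
        return observed / expected, pval

    # Reference (statewide) rates: 1% low risk, 5% high risk
    hosp_a = [(800, 0.01, 0.01), (200, 0.08, 0.05)]
    hosp_b = [(200, 0.01, 0.01), (800, 0.08, 0.05)]

    for name, h in [("A", hosp_a), ("B", hosp_b)]:
        smr, p = smr_and_pvalue(h)
        print(name, round(smr, 2), round(p, 4))
    # Same stratum-specific rates, different SMRs: Simpson's paradox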

Hierarchical Model Patients -> physicians -> hospitals Build a model using data at each level and estimate quantities of interest

Bayesian Hierarchical Model MCMC via WinBUGS

Goldstein and Spiegelhalter, 1996

Discussion
– Markov chain Monte Carlo + compute power enable hierarchical modeling
– Software is a significant barrier to the widespread application of better methodology
– Are these data useful for the study of disease?