
1 Medical Data Mining
Carlos Ordonez
University of Houston, Department of Computer Science

2 Outline
Motivation
Main data mining techniques:
– Constrained Association Rules
– OLAP Exploration and Analysis
Other classical techniques:
– Linear Regression
– PCA
– Naïve Bayes
– K-Means
– Bayesian Classifier

3 Motivation: why inside a DBMS?
DBMSs offer a level of security unavailable with flat files.
Databases have built-in features that optimize extraction and simple analysis of datasets.
We can increase the complexity of these analysis methods while still keeping the benefits offered by the DBMS.
We can analyze large amounts of data efficiently.

4 Our approach
– Avoid exporting data outside the DBMS
– Exploit SQL and UDFs
– Accelerate computations with query optimization and by pushing processing into main memory

5 Constrained Association Rules
Association rules: a technique for identifying patterns in datasets using confidence
– Looks for relationships between the variables
– Detects groups of items that frequently occur together in a given dataset
– Rules have the form X => Y: the set of items X is often found in conjunction with the set of items Y

6 The Constraints
Group constraint: determines which variables can occur together in the final rules
Item constraint: determines which variables will be used in the study, allowing the user to ignore some variables
Antecedent/consequent constraint: determines the side of the rule on which a variable can appear
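To make the constraints concrete, here is a minimal Python sketch that mines pairwise rules X => Y under all three constraints. The transactions and the constraint tables (USE, SIDE, GROUP) are hypothetical illustrations, not the system's actual encoding; the group constraint is read here as forbidding items of the same group from appearing together in one rule.

```python
from itertools import combinations

# Hypothetical constraint tables (illustration only).
USE = {"age>70", "smoke=1", "LAD>=70"}                   # item constraint
SIDE = {"age>70": "A", "smoke=1": "A", "LAD>=70": "C"}   # antecedent/consequent
GROUP = {"age>70": 1, "smoke=1": 2, "LAD>=70": 3}        # group constraint

def mine_rules(transactions, min_sup=0.05, min_conf=0.7):
    n = len(transactions)
    def sup(items):                      # fraction of transactions containing items
        return sum(items <= t for t in transactions) / n
    rules = []
    for x, y in combinations(sorted(USE), 2):
        for a, c in ((x, y), (y, x)):    # candidate rule a => c
            if SIDE[a] == "C" or SIDE[c] == "A":
                continue                 # item on the wrong side of the rule
            if GROUP[a] == GROUP[c]:
                continue                 # same group: cannot occur together
            s_ac, s_a = sup({a, c}), sup({a})
            if s_a and s_ac >= min_sup and s_ac / s_a >= min_conf:
                rules.append((a, c, s_ac, s_ac / s_a))
    return rules

# Each transaction is the set of discretized items true for one patient.
tx = [{"age>70", "smoke=1", "LAD>=70"}, {"smoke=1", "LAD>=70"}, {"age>70"}]
for a, c, s, conf in mine_rules(tx):
    print(f"{a} => {c}  support={s:.2f}  confidence={conf:.2f}")
```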

7 Experiment
Input dataset: p=25 attributes, n=655 records
Three types of attributes:
– P: perfusion measurements
– R: risk factors
– D: heart disease measurements

8 Experiments
This table summarizes the impact of constraints on the number of patterns and the running time.

9 Experiments
This figure shows groups of rules predicting no heart disease.

10 Experiments
This figure shows groups of rules predicting heart disease.

11 Experiments
These figures show selected cover rules, predicting absence or presence of disease.

12 OLAP Exploration and Analysis
Definition:
– Input table F with n records
– Cube dimensions: D = {D1, D2, ..., Dd}
– Measure dimensions: A = {A1, A2, ..., Ae}
– In OLAP processing, the basic idea is to compute aggregations on a measure Ai by subsets of dimensions G, G ⊆ D.
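As an illustration of aggregating a measure by every subset G ⊆ D, this sketch enumerates the group-by lattice with pandas; the fact table, dimension names, and measure are invented for the example.

```python
from itertools import chain, combinations
import pandas as pd

# Toy fact table F; dimension and measure names are made up.
F = pd.DataFrame({
    "SEX":   [0, 1, 1, 0, 1],
    "SMOKE": [1, 1, 0, 0, 1],
    "AGE70": [0, 0, 1, 1, 1],
    "LAD":   [40.0, 75.0, 80.0, 20.0, 90.0],   # measure A_i
})
D = ["SEX", "SMOKE", "AGE70"]

# Every non-empty subset G of the cube dimensions (the group-by lattice).
subsets = chain.from_iterable(combinations(D, r) for r in range(1, len(D) + 1))
for G in subsets:
    agg = F.groupby(list(G))["LAD"].agg(["count", "mean", "var"])
    print(f"-- subcube by {G}")
    print(agg)
```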

13 OLAP Exploration and Analysis
Example:
– Cube with three dimensions (D1, D2, D3)
– Each face represents a subcube on two dimensions
– Each cell represents a subcube on one dimension

14 OLAP Statistical Tests
We proposed the use of statistical tests on pairs of OLAP subcubes to analyze their relationship.
Statistical tests allow us to show mathematically that a pair of subcubes are significantly different from each other.

15 OLAP Statistical Tests
The null hypothesis H0 states μ1 = μ2, and the goal is to find groups where H0 can be rejected with high confidence 1-p. The alternative hypothesis H1 states μ1 ≠ μ2. We use a two-tailed test, which allows finding a significant difference on both tails of the Gaussian distribution in order to compare means in any order (μ1 > μ2 or μ2 > μ1). The test relies on the following equation to compute a random variable z:
z = (μ1 - μ2) / sqrt(σ1²/n1 + σ2²/n2)
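A small sketch of this test on two subcubes, assuming the standard unpooled two-sample z statistic above; the stenosis values are made up.

```python
import math

def two_tailed_z(x1, x2):
    # z = (mu1 - mu2) / sqrt(var1/n1 + var2/n2), two-tailed Gaussian p-value
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((x - m1) ** 2 for x in x1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in x2) / (n2 - 1)
    z = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
    p = math.erfc(abs(z) / math.sqrt(2))       # P(|Z| >= |z|) under H0
    return z, p

# Stenosis measurements for two subcubes, e.g. (SMOKE=1) vs (SMOKE=0).
z, p = two_tailed_z([72.0, 81.0, 90.0, 68.0], [35.0, 42.0, 30.0, 51.0])
print(f"z={z:.2f}  p={p:.4f}")   # reject H0 when p <= 0.01 for confidence 0.99
```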

16 Experiments
n = 655, d = 21, e = 4
Includes patient information, habits, and perfusion measurements as dimensions.
Measures are the stenosis, or amount of narrowing, of the four main arteries of the human heart.

17 Experiment Evaluation
Heart data set: group pairs with significant measure differences at p=0.01.

18 Experiment Evaluation
Summary of medical results at p=0.01. The most important dimensions are OLDYN, SEX, and SMOKE.

19 Comparing Reliability of OLAP Statistical Tests and Association Rules
Both techniques were altered to bring them onto the same plane for comparison:
– Association Rules: added post-process pairing
– OLAP Statistical Tests: added constraints
Cases under study:
– Association Rules (HH): both rules have high confidence
AdmissionAfterOpen(1), AorticDiagnosis(0/1) => NetMargin(0/1)
High confidence, but also high p-value
Data is crowded around the AR boundary point

20 Comparing Reliability of OLAP Statistical Tests and Association Rules
Association Rules: High/High
– The data is crowded around the boundary point for Association Rules
– The two Gaussians are not significantly different
– Conclusion: both techniques agree; OLAP Statistical Tests are more reliable

21 Comparing Reliability of OLAP Statistical Tests and Association Rules
Association Rules: Low/Low
– Once again the boundary point comes into play
– The two Gaussians are not significantly different
– Conclusion: both techniques agree

22 Comparing Reliability of OLAP Statistical Tests and Association Rules
Association Rules: High/Low
– Ambiguous

23 Results from TMHS dataset
Mainly a financial dataset:
– Revolves around the opening of a new medical center for treating heart patients
Results from Association Rules:
– Found 4051 rules with confidence >= 0.7 and support >= 5%
– AfterOpen=1, Elder=1 => Low Charges: after the center opened, the elderly enjoyed low charges
– AfterOpen=0, Elder=1 => High Charges: before the center opened, the elderly were associated with high charges
Results from OLAP Statistical Tests:
– Found 1761 pairs with p-value <= 5%
– Walk-in, insurance (commercial/medicare) => charges (high/low): the amount of total charges to a patient depends on his/her insurance when the admission source is a walk-in
– AorticDiagnosis=0, AdmissionSource (Walk-in/Transfer) => lengthOfStay (low/high): if the diagnosis is not aortic disease, then the length of stay depends on how the patient was admitted

24 Machine Learning techniques
– PCA
– Regression: Linear and Logistic
– Naïve Bayes
– Bayesian classification

25 Principal Component Analysis
Dimensionality reduction technique for high-dimensional data (e.g. microarray data).
Exploratory data analysis, by finding hidden relationships between attributes.
Assumptions:
– Linearity of the data.
– Statistical importance of mean and covariance.
– Large variances have important dynamics.

26 Principal Component Analysis
Rotation of the input space to eliminate redundancy:
– Most variance is preserved.
– Minimal correlation between attributes.
U^T X is the new rotated space. Select the k most representative components of U (k < d).
Solving PCA is equivalent to solving SVD, defined by the eigen-problem:
X = U E V^T
X X^T = U E^2 U^T
where U holds the left eigenvectors, E the eigenvalues, and V the right eigenvectors.
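The equivalence can be checked directly in numpy; this minimal sketch stores the data as a d x n matrix X (one column per record), rotates it, and keeps the k most representative components. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 200))                 # d=8 attributes, n=200 records
X = X - X.mean(axis=1, keepdims=True)         # center each attribute

U, E, Vt = np.linalg.svd(X, full_matrices=False)   # X = U E V^T
k = 2
Z = U[:, :k].T @ X                            # rotated, reduced data (k x n)

# The eigen-problem from the slide: X X^T = U E^2 U^T
assert np.allclose(X @ X.T, U @ np.diag(E**2) @ U.T)
print("variance preserved by k components:", (E[:k]**2).sum() / (E**2).sum())
```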

27 PCA Example
Table: PCA loadings of components U1–U8 for the thyroid dataset attributes: age, gender, on_thyroxine, query_thyroxine, on_antithyroid_med, sick, pregnant, surgery, I131_treatment, query_hypothyroid, query_hyperthyroid, lithium, goitre, tumor, hypopituitary, psych.

28 PCA Example
Table: PCA loadings of components U1–U8 for the heart dataset attributes: age, chol, claudi, diab, fhcad, gender, hta, hyplpd, pangio, pcarsur, pstroke, smoke, lad, lcx, lm, rca.

29 Linear Regression
There are two main applications for linear regression:
– Prediction or forecasting of the output or variable of interest Y
– Fitting a model from the observed Y and the input variables X; for values of X given without an accompanying value of Y, the model can be used to predict the output of interest Y
Given an input dataset X = {x1, x2, ..., xn} with d dimensions Xa, and the response or variable of interest Y, linear regression finds a set of coefficients β to model:
Y = β0 + β1 X1 + ... + βd Xd + ε
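A minimal numpy sketch of both uses: fit β from observed (X, Y), then predict Y for new values of X. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 655, 4                                 # sizes echo the slides; data is fake
X = rng.normal(size=(n, d))
Y = 1.5 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=n)

A = np.column_stack([np.ones(n), X])          # prepend intercept column for beta_0
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)  # least squares: min ||A b - Y||^2

# Prediction for new inputs without an accompanying Y.
X_new = rng.normal(size=(3, d))
Y_pred = np.column_stack([np.ones(3), X_new]) @ beta
print("beta:", beta.round(3), "predictions:", Y_pred.round(2))
```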

30 Linear Regression with SSVS
Bayesian variable selection:
– Quantify the strength of the relationship between Y and a number of explanatory variables Xa.
– Assess which Xa may have no relevant relationship with Y.
– Identify which subsets of the Xa contain redundant information about Y.
The goal is to find the subset of explanatory variables Xγ which best predicts the output Y, with the regression model Y = βγ Xγ + ε.
We use Gibbs sampling, an MCMC algorithm, to estimate the probability distribution π(γ|Y,X) of a model fitting the output variable Y. Other techniques, like stepwise variable selection, perform only a partial search for the model that best explains the output variable. Stochastic Search Variable Selection finds the best "likely" subsets of variables based on their posterior probabilities.
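The sketch below is a compact Gibbs sampler over γ, assuming a Zellner g-prior with factor c and a uniform prior over models; it is a simplification of SSVS for illustration, not the paper's UDF-based implementation.

```python
import numpy as np
from collections import Counter

def log_marginal(Y, X, gamma, c=100.0):
    # log p(Y | gamma) under a Zellner g-prior, sigma^2 integrated out
    n, idx = len(Y), np.flatnonzero(gamma)
    ssr = Y @ Y
    if idx.size:
        Xg = X[:, idx]
        coef = np.linalg.solve(Xg.T @ Xg, Xg.T @ Y)
        ssr -= (c / (1 + c)) * (Y @ Xg @ coef)
    return -0.5 * idx.size * np.log(1 + c) - 0.5 * n * np.log(ssr)

def gibbs_select(Y, X, iters=2000, burn=200, seed=0):
    rng = np.random.default_rng(seed)
    gamma = np.zeros(X.shape[1], dtype=int)
    visits = Counter()
    for it in range(iters):
        for j in range(X.shape[1]):          # one Gibbs sweep over gamma_1..gamma_d
            lp = []
            for bit in (0, 1):
                gamma[j] = bit
                lp.append(log_marginal(Y, X, gamma))
            gamma[j] = int(rng.random() < 1.0 / (1.0 + np.exp(lp[0] - lp[1])))
        if it >= burn:
            visits[tuple(np.flatnonzero(gamma))] += 1
    return visits.most_common(5)             # visit counts estimate pi(gamma|Y,X)
```

The most-visited subsets approximate the posterior π(γ|Y,X), which is what the Gamma/Prob tables on the next slides report.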

31 Linear Regression in the DBMS
Bayesian variable selection is implemented completely inside the DBMS with SQL and UDFs for efficient use of memory and processor resources.
Our algorithms and the storage layouts of tables in the DBMS have a significant impact on execution performance.
Compared to the statistical package R, our implementation scales to large data sets.

32 Linear regression: Experimental results
Parameters: Variables: 21, n = 655, Y: rca, c = 100, it = 10000, burn = 1000

Gamma                 | Prob     | rSquared
0,1,3,8,12,13,16,19   | 0.012333 | 0.826227
0,1,3,8,12,13         | 0.011778 | 0.838421
0,1,3,6,8,12,13       | 0.011556 | 0.832125
0,1,3,6,8,12,13,17    | 0.010333 | 0.826885
0,1,3,8,9,12,13,16,19 | 0.008889 | 0.821647
0,1,3,6,8,9,12,13     | 0.008    | 0.826993
0,1,3,8,12,13,17      | 0.007222 | 0.833006
0,1,3,6,8,13,17       | 0.006889 | 0.833852
0,1,3,6,8,9,13        | 0.006778 | 0.838573
0,1,3,6,8,9,12,13,17  | 0.006556 | 0.821839

Parameters: Variables: 21, n = 655, Y: lad, c = 100, it = 10000, burn = 1000

Gamma          | Prob     | rSquared
0,1,14,18      | 0.061556 | 0.768594
0,1,13,14,18   | 0.028556 | 0.7652
0,1,8,14,18    | 0.022889 | 0.765396
0,1,9,14,18    | 0.014444 | 0.766478
0,1,6,14,18    | 0.013222 | 0.766782
0,1,3,14,18    | 0.011667 | 0.767118
0,1,14,16,18   | 0.010111 | 0.767645
0,1,14,17,18   | 0.01     | 0.767105
0,1,14,18,21   | 0.008667 | 0.768276
0,1,8,13,14,18 | 0.008333 | 0.762457

Variable-to-gamma mapping: age=1, chol=2, claudi=3, diab=4, fhcad=5, gender=6, hta=7, hyplpd=8, pangio=9, pcarsur=10, pstroke=11, smoke=12, il=13, ap=14, al=15, la=16, as_=17, sa=18, li=19, si=20, is_=21

33 Linear regression: Experimental results
Parameters: d(γ0) = 1, dimensions = 4918, n = 295, iterations = 1000, c = 1, y = Cens
Cancer microarray data, where the gamma values are the gene numbers.

Gamma                                                             | Probability | rSquared
0,3,4,52,99,196,287,1833,1857,2115,2563,2601,3720,3924,4854,4879 | 0.761239    | 0.00664
0,3,4,52,99,196,287,1833,1857,2563,2601,3924,4854,4879           | 0.108891    | 0.006756
0,3,4,52,99,196,287,1833,1857,2115,2563,2601,3924,4854,4879      | 0.050949    | 0.006702
0,3,4,52,99,196,287,1833,3924,4854,4879                          | 0.041958    | 0.006771
0,3,4,52,99,196,287,1833,2563,2601,3924,4854,4879                | 0.027972    | 0.006758
0,3,4,52,99,196,287,1833,4854                                    | 0.002997    | 0.006836
0,3,4,52,99,196,287,1833,4854,4879                               | 0.001998    | 0.006776
0,3,4,52,99,196,287,1833,2601,3924,4854,4879                     | 0.001998    | 0.006758
0,3,4,99,196,287,1833,4854                                       | 0.000999    | 0.006924

34 Logistic Regression
Similar to linear regression, but the data is fitted to a logistic curve. This technique is used to predict the probability of occurrence of an event.
P(Y=1|x) = π(x)
π(x) = 1 / (1 + e^(-g(x))), where g(x) = β0 + β1 X1 + β2 X2 + ... + βd Xd
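A tiny sketch of scoring with a fitted model; the coefficient subset (intercept, AGE, SEX, SMOKE) is taken loosely from the table on the next slide and is illustrative only.

```python
import math

def predict_prob(x, beta):
    # pi(x) = 1 / (1 + exp(-g(x))), g(x) = b0 + b1*x1 + ... + bd*xd
    g = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-g))

beta = [-2.19, 0.036, 0.40, 0.31]    # intercept, AGE, SEX, SMOKE (subset)
x = [67, 1, 1]                       # one patient: AGE=67, SEX=1, SMOKE=1
p = predict_prob(x, beta)
print(f"P(Y=1 | x) = {p:.2f}")       # classify as 1 when p >= 0.5
```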

35 Logistic Regression: Experimental results
Model: med655; Train n = 491, d = 15, y = LAD>=70%; Test n = 164

Name      | Coefficient  | Name | Coefficient
Intercept | -2.191237293 | LI   | -0.090759713
AGE       | 0.035740648  | LA   | -0.210152957
SEX       | 0.40150077   | AP   | 0.600745945
HTA       | 0.279865571  | AS_  | 0.264413463
DIAB      | 0.060630279  | SA   | 0.342609744
CHOL      | 0.001882748  | SI   | 0.04750216
SMOKE     | 0.31437235   | IS_  | -0.159692182
AL        | 0.198138067  | IL   | 0.446180853

Accuracy (med655): Global 70, Class-0 74, Class-1 67

36 Naïve Bayes (NB)
Naïve Bayes is one of the most popular classifiers:
– Easy to understand.
– Produces a simple model structure.
– Robust, with a solid mathematical background.
– Can be computed incrementally.
– Classification is achieved in linear time.
– However, it makes an independence assumption among attributes.
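A minimal Gaussian Naïve Bayes in numpy: one Gaussian per class per dimension (the independence assumption), fitted from sufficient statistics gathered in a single pass, which is also why the model can be maintained incrementally. A generic sketch, not the SQL implementation.

```python
import numpy as np

class GaussianNB:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior, self.mean, self.var = {}, {}, {}
        for g in self.classes:               # per-class sufficient statistics
            Xg = X[y == g]
            self.prior[g] = len(Xg) / len(X)
            self.mean[g] = Xg.mean(axis=0)
            self.var[g] = Xg.var(axis=0) + 1e-9   # guard against zero variance
        return self

    def predict(self, X):                    # argmax_g log p(g) + sum_a log N(x_a)
        scores = []
        for g in self.classes:
            ll = -0.5 * (np.log(2 * np.pi * self.var[g])
                         + (X - self.mean[g]) ** 2 / self.var[g]).sum(axis=1)
            scores.append(ll + np.log(self.prior[g]))
        return self.classes[np.argmax(scores, axis=0)]

# Usage: acc = (GaussianNB().fit(Xtr, ytr).predict(Xte) == yte).mean()
```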

37 Bayesian Classifier
Why Bayesian:
– A Bayesian classifier based on class decomposition using EM clustering.
– Robust models with good accuracy and low over-fit.
– The classifier adapts to skewed distributions and overlapping sets of data points by building local models based on clusters.
– The EM algorithm is used to fit the mixtures per class.
– The Bayesian classifier is composed of a mixture of k distributions or clusters per class.

38 Bayesian Classifier Based on K-Means (BKM)
Motivation:
– Bayesian classifiers are accurate and efficient.
– A generalization of the Naïve Bayes algorithm.
– Model accuracy can be tuned by varying the number of clusters, setting class priors, and making a probability-based decision.
– EM is a distance-based clustering algorithm.
– Two phases are involved: building the predictive model, and scoring a new data set based on the computed predictive model (sketched below).
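A simplified sketch of the BKM idea covering the two phases listed above: run k-means per class, then score with a per-class mixture of k diagonal Gaussians weighted 1.0/k plus the class prior. It assumes numeric numpy arrays and omits the paper's SQL and EM details.

```python
import numpy as np

def fit_bkm(X, y, k=4, iters=20, seed=0):
    # Phase 1: build the model (k-means per class g).
    rng = np.random.default_rng(seed)
    model = {}
    for g in np.unique(y):
        Xg = X[y == g]
        C = Xg[rng.choice(len(Xg), size=k, replace=False)]   # initial centroids
        for _ in range(iters):
            lab = ((Xg[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
            C = np.array([Xg[lab == j].mean(0) if (lab == j).any() else C[j]
                          for j in range(k)])
        V = np.array([Xg[lab == j].var(0) + 1e-3 if (lab == j).any()
                      else np.ones(X.shape[1]) for j in range(k)])
        model[g] = (C, V, len(Xg) / len(X))   # cluster means, variances, class prior
    return model

def score_bkm(model, X):
    # Phase 2: score new data, argmax_g log prior(g) + log mixture likelihood.
    def log_mix(C, V):
        ll = -0.5 * (np.log(2 * np.pi * V[None]) +
                     (X[:, None, :] - C[None]) ** 2 / V[None]).sum(-1)
        return np.logaddexp.reduce(ll, axis=1) - np.log(C.shape[0])  # weight 1/k
    gs = sorted(model)
    scores = np.stack([log_mix(*model[g][:2]) + np.log(model[g][2]) for g in gs])
    return np.array(gs)[scores.argmax(axis=0)]
```

With k = 1 the mixture collapses to a single Gaussian per class, which is the sense in which BKM generalizes Naïve Bayes.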

39 Example
A medical dataset is used, with n = 655 rows and a varying number of clusters k. The dataset has d = 25 dimensions, which include the diseases to be predicted, risk factors, and perfusion measurements. Null values in a dimension have been replaced with the mean of that dimension. Here we predict accuracy for LAD and RCA (2 diseases). Accuracy is best at k = 8.

40 Example: medical
med655: n = 655, d = 15, g = 0,1. G represents whether the patient developed heart disease or not.
wbcancer: n = 569, d = 7, g = 0,1. G represents whether the cancer is benign or malignant. Features describe characteristics of cell nuclei obtained from an image of a breast mass.

Accuracy:
Dataset  | Model | Global | Class-0 | Class-1
med655   | NB    | 67     | 83      | 53
med655   | BKM   | 62     | 53      | 70
wbcancer | NB    | 93     | 91      | 95
wbcancer | BKM   | 93     | 84      | 97

41 BKM & NB Models

NB: med655 (per class g: mean and variance)
g | stat | AGE    | SEX  | HTA  | CHOL    | SMOKE
0 | MEAN | 58.6   | 0.64 | 0.4  | 219.47  | 0.57
0 | VAR  | 147.92 | 0.23 | 0.24 | 1497.45 | 0.25
1 | MEAN | 63.9   | 0.74 | 0.45 | 218.34  | 0.62
1 | VAR  | 128.5  | 0.19 | 0.25 | 957.69  | 0.24

NB: wbcancer
g | stat | x3     | x5   | x12  | x18      | x26
0 | MEAN | 115.71 | 0.1  | 1.2  | 0.02     | 0.37
0 | VAR  | 438.72 | 0    | 0.26 | 3.35E-05 | 0.03
1 | MEAN | 78.18  | 0.09 | 1.22 | 0.01     | 0.18
1 | VAR  | 136.45 | 0    | 0.37 | 3.58E-05 | 0.01

BKM: med655 (cluster means per class g and cluster j)
g | j | AGE  | SEX  | HTA  | CHOL | SMOKE
0 | 1 | 4.49 | 0    | 0.97 | 5.3  | 1.82
0 | 2 | 4.36 | 2.08 | 1.07 | 5.49 | 0.48
0 | 3 | 5.09 | 0.08 | 1.25 | 6.35 | 0.21
0 | 4 | 5.1  | 2.08 | 0.37 | 5.59 | 1.78
1 | 1 | 6.28 | 1.75 | 0.96 | 6.97 | 2.06
1 | 2 | 6.45 | 1.31 | 0.74 | 6.98 | 0
1 | 3 | 4.64 | 1.82 | 0.88 | 7.24 | 2.06
1 | 4 | 4.7  | 1.75 | 1.03 | 7.04 | 0

BKM: wbcancer
g | j | x3   | x5   | x12  | x18  | x26
0 | 1 | 6.56 | 8.27 | 2.1  | 2.97 | 2.8
0 | 2 | 5.44 | 7.32 | 2.02 | 2.07 | 1.63
0 | 3 | 4.68 | 8.94 | 2.18 | 2.46 | 3.12
0 | 4 | 5.42 | 8.37 | 4.18 | 3.89 | 1.79
1 | 1 | 6.29 | 6.12 | 2.12 | 0.96 | 1.06
1 | 2 | 6.97 | 7.12 | 2.16 | 3.07 | 3.59
1 | 3 | 5.92 | 7.83 | 2.45 | 1.9  | 1.74
1 | 4 | 7.49 | 6.68 | 1.48 | 1.49 | 2.02

42 Cluster Means and Weights
Means are assigned around the global mean based on Gaussian initialization. The table below shows the means of clusters over 9 dimensions (d). The weight of a cluster is given by 1.0/k, where k is the number of clusters.

Class | AGE  | SEX   | DIAB  | HYPLPD | FHCAD  | SMOKE | CHOL | LA     | AP     | Weight
0     | 60   | 0.721 | 0.209 | n/a    | 0.116  | 0.698 | 185  | -0.178 | -0.331 | 0.0754
0     | 76.5 | 0.632 | 0.08  | 0.488  | 0.056  | 0.488 | 223  | -0.225 | -0.37  | 0.219
0     | 42.2 | 0.754 | 0.029 | 0.667  | 0.261  | 0.58  | 224  | -0.505 | -0.715 | 0.121
0     | 65.1 | 0.753 | 0.193 | 0.602  | 0.0904 | 0.566 | 223  | -0.22  | -0.375 | 0.291
0     | 56.5 | 0.652 | 0.261 | 0.217  | 0.261  | 0.565 | 139  | -0.379 | -0.527 | 0.0404
0     | 54.2 | 0.729 | 0.132 | 0.583  | 0.104  | 0.66  | 223  | -0.26  | -0.519 | 0.253
1     | 51.9 | 0.533 | 0.2   | 0.933  | 0.267  | 0.733 | 269  | 0.0233 | -0.577 | 0.176
1     | 59.7 | 0.333 | n/a   | 0.889  | 0      | 0.667 | 318  | -0.494 | -0.748 | 0.212
1     | 48   | 0.4   | 0.2   | 0.8    | 0.2    | 0.8   | 201  | -0.68  | -0.462 | 0.0588
1     | 67.1 | 0.444 | 0.222 | 0.889  | 0.111  | 0.593 | 252  | -0.474 | -0.645 | 0.318
1     | 53   | 0.5   | 0     | 1      | n/a    | 0.75  | 456  | -0.512 | n/a    | 0.0471
1     | 72.7 | 0.75  | 0.313 | 0.438  | 0      | 0.625 | 202  | -0.782 | -0.229 | 0.188

43 Prediction Accuracy Varying k (Same Clusters k per Class)

Dimensions = 21 (perfusion measurements + risk factors)
k      | Accuracy for LAD | Accuracy for RCA
k = 2  | 65.8%            | 66.5%
k = 4  | 67.90%           | 68.82%
k = 6  | 69.89%           | 70.42%
k = 8  | 75.11%           | 72.67%
k = 10 | 68.35%           | 70.23%

Dimensions = 9 (perfusion measurements)
k      | Accuracy for LAD | Accuracy for RCA
k = 2  | 73.13%           | 67.63%
k = 4  | 73.37%           | 67.90%
k = 6  | 74.80%           | 69.80%
k = 8  | 77.07%           | 72.06%
k = 10 | 72.34%           | 68.93%

44 The DBMS Group
Students:
– Zhibo Chen
– Carlos Garcia-Alvarado
– Mario Navas
– Sasi Kumar Pitchaimalai
– Ahmad Qwasmeh
– Rengan Xu
– Manish Limaye

45 Publications
1. Ordonez C., Chen Z., Evaluating Statistical Tests on OLAP Cubes to Compare Degree of Disease, IEEE Transactions on Information Technology in Biomedicine 13(5): 756-765, 2009.
2. Chen Z., Ordonez C., Zhao K., Comparing Reliability of Association Rules and OLAP Statistical Tests, ICDM Workshops 2008: 8-17.
3. Ordonez C., Zhao K., A Comparison between Association Rules and Decision Trees to Predict Multiple Target Attributes, Intelligent Data Analysis (IDA), to appear in 2011.
4. Navas M., Ordonez C., Baladandayuthapani V., On the Computation of Stochastic Search Variable Selection in Linear Regression with UDFs, Proc. IEEE ICDM Conference, 2010.
5. Navas M., Ordonez C., Baladandayuthapani V., Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs, Proc. ACM KDD Workshop on Large-scale Data Mining: Theory and Applications (LDMTA), 2010.
6. Ordonez C., Pitchaimalai S.K., Bayesian Classifiers Programmed in SQL, IEEE Transactions on Knowledge and Data Engineering (TKDE) 22(1): 139-144, 2010.
7. Pitchaimalai S.K., Ordonez C., Garcia-Alvarado C., Comparing SQL and MapReduce to Compute Naive Bayes in a Single Table Scan, Proc. ACM CIKM Workshop on Cloud Data Management (CloudDB), 2010.
8. Navas M., Ordonez C., Efficient Computation of PCA with SVD in SQL, Proc. KDD Workshop on Data Mining using Matrices and Tensors, 2009.

