DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN HEALTH SCIENCES BKSinha Ex-Faculty, ISI, Kolkata April 17, 2012.



What is Data Integration ? Integration of Multiple Indicators : Existence of several different indicators. Desired : to provide an AGGREGATE or Over-all Measure….by an objective and statistically sound approach. Multiple Criteria Decision Making [MCDM] : Advocated by Hwang & Yoon (1981) : Multiple Attribute Decision Making : Methods & Applications : A State-of-the-Art Survey. Springer-Verlag, Berlin

HEALTH ISSUES Air / Surface / Water Pollution : Different Sources & Their Effects ******************* US EPA : TRI Data Base : Toxics Release Inventory [TRI] Data. EPA’s 33/50 Program : TRI Data for 17 Chemicals over many years : for 50 States & DC

TOXICS RELEASE INVENTORY [TRI] : US EPA TOXIC CHEMICALS…… BENZENE, CADMIUM, CARBON TETRACHLORIDE, CHLOROFORM, CYANIDE, LEAD, MERCURY, NICKEL, TOLUENE, M-XYLENE… TRI Data…..expressed as % … Less the Better….More the Worse

NATURE OF DATA & PROBLEM
States VS Chemicals : TRI Data [Coded]
State   Benzene
I        7%
II      12%
III     17%
IV       9%
V       14%
VI       6%
VII     15%
VIII    16%
Q. Which State is the Least Hit by Benzene ? Ans. VI
Q. AND Worst Hit ? Ans. III
Single Chemical….. NO PROBLEM AT ALL TO RANK THE STATES FROM BEST TO WORST...

ADD ONE MORE CHEMICAL...
States VS Chemicals : TRI Data
State   Benzene   Cadmium
I        7%       13%
II      12%        9%
III     17%        4%
IV       9%       11%
V       14%       10%
VI       6%       11%
VII     15%        9%
VIII    16%       11%
Q. Combine the Two Chemicals : Which State is Worst ? How to Combine ?

AND ADD MORE…… States \ Chemicals [TRI Data] Be Cd Ca Tr Ch Cy … I : 7% 13% 21% 2% 34% 21% … CONCEPT OF DATA MATRIX $X = ((X_{ij}))$, $1 \le i \le K$; $1 \le j \le N$ : K Locations & N Data Sources. DATA INTEGRATION FOR OVER-ALL TRI INDEX FOR GLOBAL COMPARISON

Data Matrix
States VS Chemicals : TRI Data
State   Be    Cd    Ca    Tr    Ch    Cy    L
I        7%   13%   21%    2%   34%   21%   17%
II      12%    9%   18%    3%   42%   28%   11%
III     17%    4%   23%    7%   22%   19%   23%
IV       9%   11%   17%    5%   25%   23%   19%
V       14%   10%   13%    8%   21%   19%   25%
VI       6%   11%   19%    5%   33%   21%   22%
VII     15%    9%   13%    4%   38%   19%   28%
VIII    16%   11%   10%    5%   33%   20%   25%

Application Areas Disease Prevalence Statistics Disease Symptom Statistics Health Statistics Demographic Statistics Human Development Index Statistics *********** Data Integration : Common Problem Techniques are quite general ….

Nature of Data Locations versus Features : Quantitative data providing impacts of features on the locations based on similarity principle Purpose : Overall Ranking of Locations based on Combined Evidence from a Pool of Features Features may / may not have equal importance in the process of ‘combining evidences’

Aggregate Methods….. Some kind of “aggregate” …..pooling of TRI Data to a single value for each State for over-all comparison TRI Data Total TRI for I = 115 [over 7 features] Average TRI for I = 16.43% Compute Average for Each State & Compare the averages across all states
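
The totals and averages quoted above can be reproduced directly from the data matrix. The following is a minimal sketch, not part of the original slides; the variable names (`STATES`, `TRI`, `totals`, `averages`, `ranking`) are illustrative.

```python
# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# Pool the seven features of each State into a single value.
totals = {state: sum(row) for state, row in zip(STATES, TRI)}
averages = {state: sum(row) / len(row) for state, row in zip(STATES, TRI)}

# Smaller average TRI means a better-placed State ("less the better").
ranking = sorted(STATES, key=lambda state: averages[state])

for state in ranking:
    print(f"{state:>4}  total={totals[state]:3d}  average={averages[state]:.2f}%")
```

Ties in the average (States I and III both total 115) are left in data order here; a tie-breaking rule would have to be chosen separately.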

Aggregate Methods…. AM…..GM…..HM….. Use of Median as Representative of TRI TRI Data I…Median = 17% II…Median = 12% III….19% ETC ETC….. Q. ARE ALL CHEMICALS EQUALLY HARMFUL ? Ans. Possibly NOT ! Q. Are all Features Equally Important ?
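
As a rough illustration of the alternatives listed above, the sketch below computes the arithmetic, geometric and harmonic means and the median for each State using Python's standard `statistics` module. The dictionary name `summary` is an illustrative choice, not something from the slides.

```python
from statistics import geometric_mean, harmonic_mean, mean, median

# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# One row of summary measures per State, all unweighted.
summary = {
    state: {
        "AM": mean(row),
        "GM": geometric_mean(row),
        "HM": harmonic_mean(row),
        "median": median(row),
    }
    for state, row in zip(STATES, TRI)
}

for state, s in summary.items():
    print(f"{state:>4}  AM={s['AM']:.2f}  GM={s['GM']:.2f}  "
          f"HM={s['HM']:.2f}  median={s['median']}")
```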

Concept of Weight….. Subject Specialist’s Knowledge….. Choice of Weights : Rel. Importance Wts.[TRI] : Interpretation of weights….. Total of Weights = 25.0 Rel. Wts : 2.0/25 = 8%, 3.5/25 =14% etc etc….for all chemicals….. Total of Rel Wts. = 1 OR 100 % Use Rel. Wts. to compute Weighted AM, GM, HM etc
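
A weighted arithmetic mean based on relative weights might be sketched as follows. Only the first two raw weights (2.0 and 3.5) and the total of 25.0 appear in the slides; assigning them to Be and Cd, and the remaining five values in `RAW_WEIGHTS`, are hypothetical placeholders chosen purely so that the total is 25.0.

```python
# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# HYPOTHETICAL raw weights: only 2.0 and 3.5 (and the total 25.0) come from
# the slides; the other five values are made up for illustration.
RAW_WEIGHTS = [2.0, 3.5, 4.0, 3.0, 5.0, 3.5, 4.0]

# Relative weights sum to 1 (i.e. 100%).
rel_weights = [w / sum(RAW_WEIGHTS) for w in RAW_WEIGHTS]

# Weighted arithmetic mean of the TRI values for each State.
weighted_am = {
    state: sum(w * x for w, x in zip(rel_weights, row))
    for state, row in zip(STATES, TRI)
}

for state, value in weighted_am.items():
    print(f"{state:>4}  weighted AM = {value:.2f}%")
```

With real subject-specialist weights the same code applies unchanged; only `RAW_WEIGHTS` would differ.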

Use of Ranks….. Convert Scores into Ranks for Each Item
TRI Data Matrix : Convert into Rank Matrix
Benzene [Cadmium etc etc similarly]
State   TRI Score   Rank
I        7%         2
II      12%         4
III     17%         8
IV       9%         3
V       14%         5
VI       6%         1
VII     15%         6
VIII    16%         7

Rank Matrix….
States VS Chemicals : TRI Data ….ranks
State   Be   Cd Ca Tr Ch Cy L
I       2
II      4
III     8
IV      3    etc etc etc
V       5    for each chemical
VI      1
VII     6
VIII    7
Then use “aggregate” methods based on ranks
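
The conversion of the full data matrix into a rank matrix could be sketched as below. The slides do not say how ties are handled (Cadmium has tied values), so giving tied values the average of their ranks is an assumption here, and `average_ranks` is a hypothetical helper name.

```python
# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]


def average_ranks(values):
    """Rank values from smallest (1) to largest; ties share the average rank.

    The tie rule is an assumption; the slides only show a tie-free column.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Rank each chemical (column) separately, then put the ranks back row-wise.
rank_columns = [average_ranks(list(col)) for col in zip(*TRI)]
rank_matrix = [list(row) for row in zip(*rank_columns)]

# One simple "aggregate" on ranks: the rank sum per State (smaller is better).
rank_sums = {state: sum(row) for state, row in zip(STATES, rank_matrix)}
print(rank_sums)
```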

Why Ranks ….. Raw Scores…… Aggregate methods are sensitive to Outliers….too high or too low values… Extreme Values…. Use of Trimmed Mean Ranking…..recommended for Robust Results……
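
A trimmed mean drops the most extreme values before averaging. The sketch below trims one value from each end of a State's seven TRI values; the function name `trimmed_mean` and the choice of trimming exactly one value per end are illustrative assumptions.

```python
def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and the k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k trims away every value")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)


# State I from the TRI data matrix (Be, Cd, Ca, Tr, Ch, Cy, L), in %.
state_i = [7, 13, 21, 2, 34, 21, 17]

plain = sum(state_i) / len(state_i)   # ordinary average, uses 2% and 34%
robust = trimmed_mean(state_i, k=1)   # ignores the 2% and the 34%

print(f"average = {plain:.2f}%, trimmed mean = {robust:.2f}%")
```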

Less Known Methods…. TOPSIS METHOD ELECTRE METHOD [computation-intensive…..] Concepts : TOPSIS Method Features…..Locations…. Ideal Location Anti-Ideal Location Distance from Ideal….from Anti-ideal Within Feature Variation Composite Index

TOPSIS METHOD : Technique for Order of Preference by Similarity to Ideal Solution. Uses Concepts of Ideal & Anti-Ideal Locations, Distance from Ideal & Anti-Ideal Locations, Weight of Features, Sum of Squares for each feature

Philosophy for TOPSIS : TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution). In the absence of a natural course of action for an over-all summary measure and ranking….the next best alternative course of action would be to assign the top rank to the one which has the shortest distance from the ideal and the farthest distance from the anti-ideal…..

Concepts: Ideal & Anti-Ideal States
States VS Chemicals : TRI Data
State        Be    Cd    Ca    Tr    Ch    Cy    L
I             7%   13%   21%    2%   34%   21%   17%
II           12%    9%   18%    3%   42%   28%   11%
III          17%    4%   23%    7%   22%   19%   23%
IV            9%   11%   17%    5%   25%   23%   19%
V            14%   10%   13%    8%   21%   19%   25%
VI            6%   11%   19%    5%   33%   21%   22%
VII          15%    9%   13%    4%   38%   19%   28%
VIII         16%   11%   10%    5%   33%   20%   25%
*********************************************************
Ideal         6%    4%   10%    2%   21%   19%   11%
Anti-Ideal   17%   13%   23%    8%   42%   28%   28%
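
Since less TRI is better, the Ideal row is the column-wise minimum and the Anti-Ideal row the column-wise maximum. A minimal sketch (variable names are illustrative):

```python
# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

columns = list(zip(*TRI))  # one tuple per chemical

# "Less the better": the hypothetical best State takes every column minimum,
# the hypothetical worst State takes every column maximum.
ideal = [min(col) for col in columns]
anti_ideal = [max(col) for col in columns]

print("Ideal     :", ideal)
print("Anti-Ideal:", anti_ideal)
```

For an indicator where more is better, the roles of `min` and `max` would swap for that column.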

Ideal & Anti-Ideal…… Hypothetical Locations! Abs. Best / Worst States ……Hypothetical Setting up the Limits for others….. Ranking of the others….. Better - Placed States ? Closer to Ideal : Distance from Ideal….small AND ALSO Far from Anti-Ideal : Distance from Anti-Ideal…Large

Concepts of Distance….. Euclidean Distance…..
Ideal      :  6%   4%  10%   2%  21%  19%  11%
Anti-Ideal : 17%  13%  23%   8%  42%  28%  28%
Squared Distance between Ideal & Anti-Ideal = (6-17)^2 + (4-13)^2 + …. + (11-28)^2

Computations…. Distance between Location & ID OR NID….
I   :  7%  13%  21%   2%  34%  21%  17%
ID  :  6%   4%  10%   2%  21%  19%  11%
NID : 17%  13%  23%   8%  42%  28%  28%
Sq. Dis. [I vs ID] : (7-6)^2 = 1, (13-4)^2 = 81, ……
Sq. Dis. [I vs NID] : (7-17)^2 = 100, (13-13)^2 = 0, ……
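
The unweighted squared distances can be worked out term by term as follows. This is only a sketch of the arithmetic shown above; `squared_terms` is a hypothetical helper name.

```python
# State I, Ideal (ID) and Anti-Ideal (NID) rows from the slides, in %.
state_i = [7, 13, 21, 2, 34, 21, 17]
ideal = [6, 4, 10, 2, 21, 19, 11]
anti_ideal = [17, 13, 23, 8, 42, 28, 28]


def squared_terms(a, b):
    """Feature-by-feature squared differences between two rows."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


terms_vs_ideal = squared_terms(state_i, ideal)     # starts 1, 81, ...
terms_vs_anti = squared_terms(state_i, anti_ideal)  # starts 100, 0, ...

sq_dist_ideal = sum(terms_vs_ideal)
sq_dist_anti = sum(terms_vs_anti)
sq_dist_ideal_anti = sum(squared_terms(ideal, anti_ideal))

print(terms_vs_ideal, sq_dist_ideal)
print(terms_vs_anti, sq_dist_anti)
print(sq_dist_ideal_anti)
```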

Sq. Dist. Comp. vs Ideal [table : Locations I – VIII vs Features (Chemicals); values not preserved]

Sq. Dist. Comp. vs Anti-Ideal [table : Locations I – VIII vs Features (Chemicals); values not preserved]

Choice of Weights….. Wts.[TRI] : Rel. Wts : Sum of Squares for each feature over all locations. Be : 7^2+12^2+17^2+9^2+14^2+6^2+15^2+16^2 = 1276

Computation of Feature-wise Sum of Squares [table : Features [Be] [Cd] [Ca] [Tr] [Ch] [Cy] [L] vs Sum of Squares; only the entry for [L], 3818, is preserved]
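
The feature-wise sums of squares follow mechanically from the data matrix. The sketch below recomputes them; it gives 1276 for Be and 3818 for L. Names such as `sum_of_squares` are illustrative.

```python
# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
CHEMICALS = ["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# Sum of squared TRI values of each chemical over all eight locations.
sum_of_squares = {
    chem: sum(x ** 2 for x in col)
    for chem, col in zip(CHEMICALS, zip(*TRI))
}

for chem, ss in sum_of_squares.items():
    print(f"[{chem}] {ss}")
```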

Formation of Composite Indices…. Ingredients Distances, Weights & Sum of Squares Composite Index [CI] : 2 Components derived from Ideal & Anti-Ideal locations For Each Location : Added over all Features Sq.Distance x Wt of Feature Divided by Sum of Squares of feature

Two Components…

u’s and v’s….min. & max.

Details of Computations….. State I : $L_2[\mathrm{I}, \mathrm{IDR}] = [(7-6)^2 \times 0.08 / \ldots]^{1/2}$ ; $L_2[\mathrm{I}, \mathrm{NIDR}] = [(7-17)^2 \times 0.08 / \ldots]^{1/2}$ ; CI = Composite Index = $L_2[\mathrm{I}, \mathrm{IDR}] / \{L_2[\mathrm{I}, \mathrm{IDR}] + L_2[\mathrm{I}, \mathrm{NIDR}]\}$. It is a RATIO between 0 and 1 ….the smaller the ratio, the better the placement of the State in over-all comparison across states …..
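
Putting the ingredients together, the composite index might be computed as in the sketch below: for each State, the squared distance of each feature is multiplied by the feature's relative weight, divided by the feature's sum of squares, added over features, and square-rooted. The raw weights are HYPOTHETICAL apart from the first two (2.0 and 3.5, total 25.0), so the resulting ranking is for illustration only.

```python
from math import sqrt

# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# HYPOTHETICAL raw weights: only 2.0, 3.5 and the total 25.0 are in the slides.
RAW_WEIGHTS = [2.0, 3.5, 4.0, 3.0, 5.0, 3.5, 4.0]
weights = [w / sum(RAW_WEIGHTS) for w in RAW_WEIGHTS]

columns = list(zip(*TRI))
ideal = [min(col) for col in columns]        # less is better
anti_ideal = [max(col) for col in columns]
sum_sq = [sum(x ** 2 for x in col) for col in columns]


def weighted_l2(row, reference):
    """sqrt( sum_j  weight_j * (x_j - ref_j)^2 / SS_j )."""
    return sqrt(sum(
        w * (x - r) ** 2 / ss
        for x, r, w, ss in zip(row, reference, weights, sum_sq)
    ))


d_ideal = [weighted_l2(row, ideal) for row in TRI]
d_anti = [weighted_l2(row, anti_ideal) for row in TRI]

# CI = distance to Ideal / (distance to Ideal + distance to Anti-Ideal).
ci = [di / (di + da) for di, da in zip(d_ideal, d_anti)]

# Smaller CI = closer to the Ideal = better placement.
order = sorted(range(len(STATES)), key=lambda i: ci[i])
rank = {STATES[i]: position + 1 for position, i in enumerate(order)}

for i in order:
    print(f"{STATES[i]:>4}  CI={ci[i]:.3f}  rank={rank[STATES[i]]}")
```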

Computational Details : Ideal [table : Locations I – VIII vs Features; entries are Sq. Distance wrt Ideal x Weight / SS of Feature; values not preserved]

Computational Details : Anti-Ideal [table : Locations vs Features; entries are Sq. Dis. wrt Anti-Ideal x Weight / SS of Feature; values not preserved]

Final Ranking Table… [table : States I – VIII vs L_2[., IDR], L_2[., NIDR], CI, Rank; values not preserved]

Choice of Weights…. Internal & External Importance of Environmental factors…. Use of Shannon’s Entropy Measure $\varepsilon$ …. Define $p_{ij} = X_{ij} / \sum_i X_{ij}$ = proportion…. Compute for each item $\varepsilon(j) = -\sum_i p_{ij} \ln p_{ij} / \ln(K)$. Use $w(j) = (1 - \varepsilon(j)) / \sum_r (1 - \varepsilon(r))$. Alternatively….use $w(j)$ proportional to $cv^2$ of Item j …coeff of variation [cv] computed from the data matrix…..
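
Both data-driven weighting schemes can be sketched in a few lines. The entropy weights follow the formulas above with K = 8 locations. For the cv-based weights, the slides do not say whether the population or the sample variance is meant; the population variance is assumed here. Function names are illustrative.

```python
from math import log
from statistics import mean, pvariance

# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]


def entropy_weights(matrix):
    """w(j) = (1 - e(j)) / sum_r (1 - e(r)), e(j) = normalised Shannon entropy."""
    k = len(matrix)  # number of locations
    diversity = []
    for col in zip(*matrix):
        total = sum(col)
        props = [x / total for x in col]
        entropy = -sum(p * log(p) for p in props if p > 0) / log(k)
        diversity.append(1 - entropy)
    return [d / sum(diversity) for d in diversity]


def cv2_weights(matrix):
    """Weights proportional to the squared coefficient of variation.

    Assumption: population variance (divide by K), not sample variance.
    """
    cv2 = [pvariance(col) / mean(col) ** 2 for col in zip(*matrix)]
    return [c / sum(cv2) for c in cv2]


w_entropy = entropy_weights(TRI)
w_cv = cv2_weights(TRI)

print("entropy weights:", [round(w, 3) for w in w_entropy])
print("cv^2 weights   :", [round(w, 3) for w in w_cv])
```

Under both schemes a feature that varies more across locations receives a larger weight, since a feature with identical values everywhere cannot discriminate between locations.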

Extensions….. Ranking depends critically on Choice of Distance Measure & Choice of Weights Distance Measure : Squared Distance [L 2 ] Mean Deviation : L 1 – Norm
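
To see how the choice of distance measure can affect the ranking, the sketch below computes the composite index under an L1-type and an L2-type distance with equal weights. The slides do not spell out the L1 formula; scaling each absolute deviation by the column sum of absolute values (the L1 analogue of the sum of squares) is an assumption, as is the helper name `composite_index`.

```python
# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]


def composite_index(matrix, weights, power):
    """CI per row using an L_power distance (power=1 or power=2).

    Each term is weight * |x - ref|^power / sum_i |x_i|^power; for power=1
    this scaling is an ASSUMED analogue of the sum-of-squares divisor.
    """
    columns = list(zip(*matrix))
    ideal = [min(col) for col in columns]
    anti = [max(col) for col in columns]
    scale = [sum(abs(x) ** power for x in col) for col in columns]

    def dist(row, ref):
        total = sum(
            w * abs(x - r) ** power / s
            for x, r, w, s in zip(row, ref, weights, scale)
        )
        return total ** (1 / power)

    result = []
    for row in matrix:
        d_ideal, d_anti = dist(row, ideal), dist(row, anti)
        result.append(d_ideal / (d_ideal + d_anti))
    return result


equal_weights = [1 / 7] * 7  # equal importance, for illustration only
ci_l1 = composite_index(TRI, equal_weights, power=1)
ci_l2 = composite_index(TRI, equal_weights, power=2)

ranking_l1 = sorted(STATES, key=lambda s: ci_l1[STATES.index(s)])
ranking_l2 = sorted(STATES, key=lambda s: ci_l2[STATES.index(s)])

print("L1 ranking:", ranking_l1)
print("L2 ranking:", ranking_l2)
```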

Reversal of Roles of Rows & Cols.

Pollution Data on 50 US States

US Pollution Data [contd.]....

US Pollution Data [contd.]….

Results of TOPSIS Analysis : Two Sets of Weights [Entropy & CV] & Two Distance Measures [L-1 & L-2]

Results …contd…..

Results…contd….

Questions….. Q1. Can the indicators be expressed in original units of measurement or we only need % ? Ans. Yes….original units will do since the formulae indicate unit-free computations. Also see US Original Pollution Data at the end. Q2. What about interdependence among the indicators ?

Questions…. Ans. It is believed that the indicators are seemingly uncorrelated. If there is any functional dependence, only the smallest subset of them should be used. Q3. What about PCA ? Ans. That won’t lead to ranking of the locations. Also it will be difficult to interpret the linear combinations of the indicators.
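
The answer to Q1 (unit-free computations) can be checked numerically: rescaling one indicator, as if it were reported in different units, should leave the composite index unchanged for fixed weights. The sketch below uses equal weights and an arbitrary factor of 10 on the first column; both choices are illustrative assumptions.

```python
from math import sqrt

# TRI data matrix from the slides: rows are States I..VIII,
# columns are Be, Cd, Ca, Tr, Ch, Cy, L (values in %).
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]


def composite_index(matrix, weights):
    """CI = d(Ideal) / (d(Ideal) + d(Anti-Ideal)) with the weighted L2 distance."""
    columns = list(zip(*matrix))
    ideal = [min(col) for col in columns]
    anti = [max(col) for col in columns]
    sum_sq = [sum(x ** 2 for x in col) for col in columns]

    def dist(row, ref):
        return sqrt(sum(
            w * (x - r) ** 2 / ss
            for x, r, w, ss in zip(row, ref, weights, sum_sq)
        ))

    return [dist(row, ideal) / (dist(row, ideal) + dist(row, anti))
            for row in matrix]


weights = [1 / 7] * 7  # equal weights, for illustration only

# Re-express the first indicator in other units (arbitrary factor of 10).
rescaled = [[10.0 * row[0]] + row[1:] for row in TRI]

ci_original = composite_index(TRI, weights)
ci_rescaled = composite_index(rescaled, weights)

max_change = max(abs(a - b) for a, b in zip(ci_original, ci_rescaled))
print(f"largest change in CI after rescaling: {max_change:.2e}")
```

The invariance holds because each squared distance is divided by the sum of squares of the same feature, so any constant unit factor cancels.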