DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN HEALTH SCIENCES BKSinha Ex-Faculty, ISI, Kolkata April 17, 2012.

1 DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN HEALTH SCIENCES BKSinha Ex-Faculty, ISI, Kolkata April 17, 2012

2 What is Data Integration?
Integration of multiple indicators: several different indicators exist, and the aim is an aggregate or over-all measure obtained by an objective and statistically sound approach.
Multiple Criteria Decision Making [MCDM], advocated by Hwang & Yoon (1981): Multiple Attribute Decision Making: Methods and Applications: A State-of-the-Art Survey. Springer-Verlag, Berlin.

3 HEALTH ISSUES
Air / Surface / Water Pollution: Different Sources & Their Effects
*******************
US EPA: Toxics Release Inventory [TRI] Data Base
EPA's 33/50 Program
TRI Data for 17 Chemicals over the years 1987-1994, for 50 States & DC

4 TOXICS RELEASE INVENTORY [TRI]: US EPA
TOXIC CHEMICALS: BENZENE, CADMIUM, CARBON TETRACHLORIDE, CHLOROFORM, CYANIDE, LEAD, MERCURY, NICKEL, TOLUENE, M-XYLENE…
TRI Data expressed as %: the less the better, the more the worse.

5 NATURE OF DATA & PROBLEM
States vs Chemicals: TRI Data [Coded]

State   Benzene
I       7%
II      12%
III     17%
IV      9%
V       14%
VI      6%
VII     15%
VIII    16%

Q. Which State is the least hit by Benzene? Ans. VI
Q. And the worst hit? Ans. III
With a single chemical there is no problem at all in ranking the States from best to worst.

6 ADD ONE MORE CHEMICAL...
States vs Chemicals: TRI Data

State   Benzene   Cadmium
I       7%        13%
II      12%       9%
III     17%       4%
IV      9%        11%
V       14%       10%
VI      6%        11%
VII     15%       9%
VIII    16%       11%

Q. Combine the two chemicals: which State is the worst? How should they be combined?

7 AND ADD MORE……
States \ Chemicals [TRI Data]

State   Be    Cd    Ca    Tr    Ch    Cy    …
I       7%    13%   21%   2%    34%   21%   …

CONCEPT OF DATA MATRIX: $X = ((X_{ij}))$, $1 \le i \le K$, $1 \le j \le N$, for K Locations & N Data Sources
DATA INTEGRATION FOR OVER-ALL TRI INDEX FOR GLOBAL COMPARISON

8 Data Matrix
States vs Chemicals: TRI Data

State   Be    Cd    Ca    Tr    Ch    Cy    L
I       7%    13%   21%   2%    34%   21%   17%
II      12%   9%    18%   3%    42%   28%   11%
III     17%   4%    23%   7%    22%   19%   23%
IV      9%    11%   17%   5%    25%   23%   19%
V       14%   10%   13%   8%    21%   19%   25%
VI      6%    11%   19%   5%    33%   21%   22%
VII     15%   9%    13%   4%    38%   19%   28%
VIII    16%   11%   10%   5%    33%   20%   25%

9 Application Areas
Disease Prevalence Statistics
Disease Symptom Statistics
Health Statistics
Demographic Statistics
Human Development Index Statistics
***********
Data Integration is the common problem; the techniques are quite general.

10 Nature of Data
Locations versus Features: quantitative data giving the impact of each feature on each location, based on the similarity principle
Purpose: over-all ranking of locations based on combined evidence from a pool of features
Features may or may not have equal importance in the process of 'combining evidence'

11 Aggregate Methods…..
Some kind of "aggregate": pooling of the TRI data into a single value for each State, for over-all comparison
Total TRI for State I = 115 [over 7 features]
Average TRI for State I = 16.43%
Compute the average for each State and compare the averages across all States
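The total and average on this slide can be reproduced from the slide 8 data matrix. The sketch below is only an illustration; the names `STATES`, `TRI`, `totals`, `averages` and `ranking` are chosen here and do not come from the slides.

```python
# TRI data matrix from slide 8 (rows: States I..VIII; columns: Be Cd Ca Tr Ch Cy L).
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# Pool each State's seven values into a total and an arithmetic mean.
totals = {state: sum(row) for state, row in zip(STATES, TRI)}
averages = {state: totals[state] / len(row) for state, row in zip(STATES, TRI)}

# Less is better for TRI, so sort ascending: best-placed State first.
ranking = sorted(STATES, key=lambda state: averages[state])

for state in ranking:
    print(f"{state:5s} total={totals[state]:4d}  average={averages[state]:.2f}%")
```

With equal importance for all chemicals, ranking by total and ranking by average give the same order.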

12 Aggregate Methods….
AM, GM, HM [arithmetic, geometric and harmonic means]…..
Use of the Median as representative of TRI: I… Median = 17%, II… Median = 12%, III… Median = 19%, etc.
Q. Are all chemicals equally harmful? Ans. Possibly NOT!
Q. Are all features equally important?
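The alternative aggregates named here are all available in Python's standard `statistics` module (3.8 or later for `fmean` and `geometric_mean`). The following is a small sketch on the first three States; the variable names are illustrative only.

```python
import statistics

# First three rows of the slide 8 data matrix (Be Cd Ca Tr Ch Cy L).
TRI = {
    "I": [7, 13, 21, 2, 34, 21, 17],
    "II": [12, 9, 18, 3, 42, 28, 11],
    "III": [17, 4, 23, 7, 22, 19, 23],
}

summary = {}
for state, row in TRI.items():
    summary[state] = {
        "AM": statistics.fmean(row),            # arithmetic mean
        "GM": statistics.geometric_mean(row),   # geometric mean
        "HM": statistics.harmonic_mean(row),    # harmonic mean
        "median": statistics.median(row),
    }

for state, values in summary.items():
    print(state, {name: round(value, 2) for name, value in values.items()})
```

For positive data the three means always satisfy AM >= GM >= HM, so the choice of mean can change how strongly small values pull a State's score down.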

13 Concept of Weight…..
Subject specialist's knowledge: the choice of weights reflects relative importance
Wts. [TRI]: 2.0 3.5 1.0 4.5 5.0 7.0 2.0
Interpretation of weights: Total of Weights = 25.0
Rel. Wts.: 2.0/25 = 8%, 3.5/25 = 14%, etc. for all chemicals
Total of Rel. Wts. = 1 or 100%
Use the Rel. Wts. to compute weighted AM, GM, HM, etc.
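As a sketch of how the relative weights enter the aggregates, the code below normalises the slide's raw weights and computes weighted means for State I. The weighted GM and HM formulas used here are the usual textbook forms and are an assumption, since the slide does not spell them out; all names are illustrative.

```python
import math

RAW_WEIGHTS = [2.0, 3.5, 1.0, 4.5, 5.0, 7.0, 2.0]   # Be Cd Ca Tr Ch Cy L
STATE_I = [7, 13, 21, 2, 34, 21, 17]                 # row I of the data matrix

# Relative weights: each raw weight divided by the total (25.0), so they sum to 1.
rel_weights = [w / sum(RAW_WEIGHTS) for w in RAW_WEIGHTS]


def weighted_am(row, weights):
    """Weighted arithmetic mean: sum of w_j * x_j."""
    return sum(w * x for w, x in zip(weights, row))


def weighted_gm(row, weights):
    """Weighted geometric mean: exp of the weighted mean of logs (x_j > 0)."""
    return math.exp(sum(w * math.log(x) for w, x in zip(weights, row)))


def weighted_hm(row, weights):
    """Weighted harmonic mean: reciprocal of the weighted mean of reciprocals."""
    return 1.0 / sum(w / x for w, x in zip(weights, row))


am_i = weighted_am(STATE_I, rel_weights)
gm_i = weighted_gm(STATE_I, rel_weights)
hm_i = weighted_hm(STATE_I, rel_weights)
print(f"State I: weighted AM={am_i:.2f}, GM={gm_i:.2f}, HM={hm_i:.2f}")
```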

14 Use of Ranks…..
Convert scores into ranks for each item: the TRI data matrix becomes a rank matrix [Benzene, Cadmium, etc.]

State   Benzene TRI   Rank
I       7%            2
II      12%           4
III     17%           8
IV      9%            3
V       14%           5
VI      6%            1
VII     15%           6
VIII    16%           7

15 Rank Matrix….
States vs Chemicals: TRI Data ….ranks

State   Be   Cd   Ca   Tr   Ch   Cy   L
I       2    …
II      4    …
III     8    …
IV      3    …
V       5    …
VI      1    …
VII     6    …
VIII    7    …

[etc. for each chemical]
Then use "aggregate" methods based on ranks
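The conversion from scores to ranks can be sketched as follows. The slides do not say how ties are handled (Cadmium, for example, has three States at 11%), so the average-rank convention used here is an assumption, and the helper name `average_ranks` is made up for this illustration.

```python
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1   # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Rank within each chemical (column), then transpose back to States x Chemicals.
rank_columns = [average_ranks(list(col)) for col in zip(*TRI)]
rank_matrix = [list(row) for row in zip(*rank_columns)]

# One possible aggregate on ranks: the mean rank of each State.
mean_rank = {s: sum(r) / len(r) for s, r in zip(STATES, rank_matrix)}
for state in sorted(STATES, key=lambda s: mean_rank[s]):
    print(f"{state:5s} mean rank = {mean_rank[state]:.2f}")
```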

16 Why Ranks…..
Aggregate methods applied to raw scores are sensitive to outliers, i.e. extreme values that are too high or too low.
Use of Trimmed Mean
Ranking is recommended for robust results.
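A trimmed mean drops the most extreme values before averaging. The slide does not say how much to trim, so dropping one value at each end is purely an assumption for this sketch, and `trimmed_mean` is an illustrative helper name.

```python
def trimmed_mean(values, trim=1):
    """Mean after discarding the `trim` smallest and `trim` largest values."""
    if 2 * trim >= len(values):
        raise ValueError("trim removes every observation")
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)


# State II from the data matrix: the 42% chloroform value pulls the plain mean up.
state_ii = [12, 9, 18, 3, 42, 28, 11]
plain = sum(state_ii) / len(state_ii)
trimmed = trimmed_mean(state_ii, trim=1)
print(f"plain mean = {plain:.2f}%, trimmed mean = {trimmed:.2f}%")
```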

17 Less Known Methods….
TOPSIS METHOD
ELECTRE METHOD [computation-intensive…..]
Concepts of the TOPSIS Method: Features and Locations; Ideal Location; Anti-Ideal Location; Distance from Ideal and from Anti-Ideal; Within-Feature Variation; Composite Index

18 TOPSIS METHOD
Technique for Order Preference by Similarity to Ideal Solution
Uses the concepts of:
Ideal & Anti-Ideal Locations
Distance from Ideal & Anti-Ideal Locations
Weight of Features
Sum of Squares for each feature

19 Philosophy for TOPSIS
TOPSIS (Technique for Order Preference by Similarity to Ideal Solution)
In the absence of a natural course of action for an over-all summary measure and ranking, the next best course of action is to assign the top rank to the location that has the shortest distance from the ideal and the farthest distance from the anti-ideal.

20 Concepts: Ideal & Anti-Ideal States
States vs Chemicals: TRI Data

State        Be    Cd    Ca    Tr    Ch    Cy    L
I            7%    13%   21%   2%    34%   21%   17%
II           12%   9%    18%   3%    42%   28%   11%
III          17%   4%    23%   7%    22%   19%   23%
IV           9%    11%   17%   5%    25%   23%   19%
V            14%   10%   13%   8%    21%   19%   25%
VI           6%    11%   19%   5%    33%   21%   22%
VII          15%   9%    13%   4%    38%   19%   28%
VIII         16%   11%   10%   5%    33%   20%   25%
*********************************************************
Ideal        6%    4%    10%   2%    21%   19%   11%
Anti-Ideal   17%   13%   23%   8%    42%   28%   28%
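Because less TRI is better, the Ideal row is the column-wise minimum of the data matrix and the Anti-Ideal row is the column-wise maximum. A minimal sketch, with illustrative variable names:

```python
CHEMICALS = ["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# zip(*TRI) transposes the matrix, giving one tuple of eight values per chemical.
columns = list(zip(*TRI))
ideal = [min(col) for col in columns]        # best (smallest) value of each chemical
anti_ideal = [max(col) for col in columns]   # worst (largest) value of each chemical

print("Ideal     ", dict(zip(CHEMICALS, ideal)))
print("Anti-Ideal", dict(zip(CHEMICALS, anti_ideal)))
```

For an indicator where more is better, the roles of `min` and `max` would simply swap for that column.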

21 Ideal & Anti-Ideal……
Hypothetical locations: the absolute best and worst States
They set the limits for the others and drive the ranking of the others
Better-placed States?
Closer to Ideal: distance from Ideal is small
AND ALSO
Far from Anti-Ideal: distance from Anti-Ideal is large

22 Concepts of Distance…..
Euclidean Distance…..

             Be    Cd    Ca    Tr    Ch    Cy    L
Ideal        6%    4%    10%   2%    21%   19%   11%
Anti-Ideal   17%   13%   23%   8%    42%   28%   28%

Squared Distance between Ideal & Anti-Ideal = (6-17)^2 + (4-13)^2 + …. + (11-28)^2
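Written out in full, the sum on this slide has seven terms. The short sketch below evaluates it, using the Ideal and Anti-Ideal rows as given; the variable names are illustrative.

```python
import math

ideal = [6, 4, 10, 2, 21, 19, 11]
anti_ideal = [17, 13, 23, 8, 42, 28, 28]

# One squared difference per chemical: 121, 81, 169, 36, 441, 81, 289.
terms = [(u - v) ** 2 for u, v in zip(ideal, anti_ideal)]
squared_distance = sum(terms)
distance = math.sqrt(squared_distance)   # ordinary (unweighted) Euclidean distance

print(terms, squared_distance, round(distance, 2))
```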

23 Computations….
Distance between a Location & ID [Ideal] or NID [Anti-Ideal]

        Be    Cd    Ca    Tr    Ch    Cy    L
I       7%    13%   21%   2%    34%   21%   17%
ID      6%    4%    10%   2%    21%   19%   11%
NID     17%   13%   23%   8%    42%   28%   28%

Sq. Dis. [I vs ID]: (7-6)^2 = 1, (13-4)^2 = 81, ……
Sq. Dis. [I vs NID]: (7-17)^2 = 100, (13-13)^2 = 0, ……

24 Sq. Dist. Comp. vs Ideal
Locations \ Features [Chemicals]

State   1     2     3     4     5     6     7
I       1     81    121   0     169   4     36
II      36    25    64    1     441   81    0
III     121   0     169   25    1     0     144
IV      9     49    49    9     16    16    64
V       64    36    9     36    0     0     196
VI      0     49    81    9     144   4     121
VII     81    25    9     4     289   0     289
VIII    100   49    0     9     144   1     196

25 Sq. Dist. Comp. vs Anti-Ideal
Locations \ Features [Chemicals]

State   1     2     3     4     5     6     7
I       100   0     4     36    64    49    121
II      25    16    25    25    0     0     289
III     0     81    0     1     400   81    25
IV      64    4     36    9     289   25    81
V       9     9     100   0     441   81    9
VI      121   4     16    9     81    49    36
VII     4     16    100   16    16    81    0
VIII    1     4     169   9     81    64    9
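Both tables of squared-distance components follow mechanically from the data matrix: each entry is the squared difference between a State's value and the Ideal (or Anti-Ideal) value for that chemical. A sketch with illustrative names:

```python
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

columns = list(zip(*TRI))
ideal = [min(col) for col in columns]
anti_ideal = [max(col) for col in columns]

# Component (i, j) is the squared gap between State i and the reference row on chemical j.
sq_vs_ideal = [[(x - u) ** 2 for x, u in zip(row, ideal)] for row in TRI]
sq_vs_anti = [[(x - v) ** 2 for x, v in zip(row, anti_ideal)] for row in TRI]

for state, row_id, row_nid in zip(STATES, sq_vs_ideal, sq_vs_anti):
    print(f"{state:5s} vs ID {row_id}  vs NID {row_nid}")
```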

26 Choice of Weights…..
Wts. [TRI]: 2.0 3.5 1.0 4.5 5.0 7.0 2.0
Rel. Wts.: .08 .14 .04 .18 .20 .28 .08
Sum of Squares for each feature over all locations
Be: 7^2 + 12^2 + 17^2 + 9^2 + 14^2 + 6^2 + 15^2 + 16^2 = 1276

27 Computation of Feature-wise Sum of Squares

Feature   Sum of Squares
1 [Be]    1276
2 [Cd]    810
3 [Ca]    2382
4 [Tr]    217
5 [Ch]    8092
6 [Cy]    3678
7 [L]     3818
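The feature-wise sums of squares are taken over the raw percentages in each column of the slide 8 data matrix. The sketch below recomputes them directly from that matrix; the names are illustrative.

```python
CHEMICALS = ["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# For each chemical, add up the squared values over all eight States.
sum_of_squares = [sum(x * x for x in col) for col in zip(*TRI)]

for name, ss in zip(CHEMICALS, sum_of_squares):
    print(f"{name:3s} {ss}")
```

Dividing a squared distance by this quantity removes the unit of measurement, which is why the later composite index is unit-free.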

28 Formation of Composite Indices….
Ingredients: Distances, Weights & Sums of Squares
Composite Index [CI]: 2 components, derived from the Ideal & Anti-Ideal locations
For each location, add over all features: Sq. Distance x Wt. of Feature / Sum of Squares of Feature

29 Two Components…

30 u’s and v’s….min. & max.
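Slides 29 and 30 survive only as titles. Going by the ingredients listed on slide 28 and the State I computation on slide 31, the two components and the composite index can be written as below. This is a reconstruction rather than the slides' own notation: reading $u_j$ as the column minimum (Ideal) and $v_j$ as the column maximum (Anti-Ideal) is an assumption based on the slide 30 title, and $w_j$ denotes the relative weight of feature $j$.

```latex
\begin{aligned}
u_j &= \min_{1 \le i \le K} X_{ij}, \qquad v_j = \max_{1 \le i \le K} X_{ij}, \\
L_2[i, \mathrm{IDR}] &= \Bigl[\, \sum_{j=1}^{N} w_j \, \frac{(X_{ij} - u_j)^2}{\sum_{k=1}^{K} X_{kj}^2} \,\Bigr]^{1/2}, \\
L_2[i, \mathrm{NIDR}] &= \Bigl[\, \sum_{j=1}^{N} w_j \, \frac{(X_{ij} - v_j)^2}{\sum_{k=1}^{K} X_{kj}^2} \,\Bigr]^{1/2}, \\
\mathrm{CI}_i &= \frac{L_2[i, \mathrm{IDR}]}{L_2[i, \mathrm{IDR}] + L_2[i, \mathrm{NIDR}]}.
\end{aligned}
```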

31 Details of Computations…..
State I:
$L_2[I, IDR] = [(7-6)^2 \times 0.08 / 1276 + \dots]^{1/2}$
$L_2[I, NIDR] = [(7-17)^2 \times 0.08 / 1276 + \dots]^{1/2}$
CI = Composite Index = $L_2[I, IDR] / \{L_2[I, IDR] + L_2[I, NIDR]\}$
It is a RATIO between 0 and 1: the smaller the ratio, the better the placement of the State in the over-all comparison across States.
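Putting the pieces together, the whole computation can be sketched end to end. This follows the recipe as described on the slides (weighted squared distances divided by feature-wise sums of squares, then a square root and a ratio); the function name `topsis` and the other identifiers are illustrative, and the raw weights are the ones quoted on the weights slide.

```python
import math

STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]
RAW_WEIGHTS = [2.0, 3.5, 1.0, 4.5, 5.0, 7.0, 2.0]


def topsis(matrix, raw_weights):
    """Return (distance to ideal, distance to anti-ideal, composite index) per row.

    Assumes smaller values are better in every column.
    """
    columns = list(zip(*matrix))
    weights = [w / sum(raw_weights) for w in raw_weights]    # relative weights
    ideal = [min(col) for col in columns]
    anti_ideal = [max(col) for col in columns]
    ss = [sum(x * x for x in col) for col in columns]        # feature-wise sum of squares
    out = []
    for row in matrix:
        d_id = math.sqrt(sum(w * (x - u) ** 2 / s
                             for x, u, w, s in zip(row, ideal, weights, ss)))
        d_nid = math.sqrt(sum(w * (x - v) ** 2 / s
                              for x, v, w, s in zip(row, anti_ideal, weights, ss)))
        out.append((d_id, d_nid, d_id / (d_id + d_nid)))
    return out


results = topsis(TRI, RAW_WEIGHTS)
ci = [r[2] for r in results]

# Smaller CI is better, so rank 1 goes to the smallest composite index.
order = sorted(range(len(STATES)), key=lambda i: ci[i])
rank = [0] * len(STATES)
for position, i in enumerate(order, start=1):
    rank[i] = position

print("State  L2[.,IDR]  L2[.,NIDR]  CI      Rank")
for state, (d_id, d_nid, c), r in zip(STATES, results, rank):
    print(f"{state:5s}  {d_id:.5f}    {d_nid:.5f}     {c:.4f}  {r}")
```

The ratio is undefined only in the degenerate case where every column is constant, so that the Ideal and Anti-Ideal coincide; the sketch does not guard against that.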

32 Computational Details: Ideal
Locations \ Features: Sq. Distance wrt Ideal x Weight / SS of Feature

State   Be        Cd        Ca        Tr        Ch        Cy        L
I       .000063   .014000   .002032   0         .004177   .000305   .000754
II      .002257   .004321   .001075   .000829   .010900   .006166   0
III     .007586   0         .002838   .020737   .000025   0         .003017
IV      .000564   .008469   .000823   .007465   .000395   .001218   .001341
V       .004013   .006222   .000151   .029862   0         0         .004107
VI      0         .008469   .001360   .007465   .003559   .000305   .002535
VII     .005078   .004321   .000151   .003318   .007143   0         .006056
VIII    .006270   .008469   0         .007465   .003559   .000076   .004107

33 Computational Details: Anti-Ideal
Locations \ Features: Sq. Distance wrt Anti-Ideal x Weight / SS of Feature

State   Be        Cd        Ca        Tr        Ch        Cy        L
I       .006270   0         .000067   .029862   .001582   .003730   .002535
II      …

34 Final Ranking Table…

State   L2[., IDR]   L2[., NIDR]   CI   Rank
I       …
II      …
III     …
IV      …
V       …
VI      …
VII     …
VIII    …

35 Choice of Weights….
Internal & external importance of environmental factors….
Use of Shannon's Entropy Measure $\varepsilon$
Define $p_{ij} = X_{ij} / \sum_i X_{ij}$ = proportion
Compute for each item: $\varepsilon(j) = -\sum_i p_{ij} \ln p_{ij} / \ln K$
Use $w(j) = (1 - \varepsilon(j)) / \sum_r (1 - \varepsilon(r))$
Alternatively, use $w(j)$ proportional to $cv^2$ of item $j$, where the coefficient of variation [cv] is computed from the data matrix.
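Both data-driven weighting schemes on this slide can be sketched in a few lines. The function names are illustrative; using the population variance for the coefficient of variation is an assumption, although the normalised weights come out the same with the sample variance because the correction factor is common to all columns.

```python
import math
import statistics

CHEMICALS = ["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"]
TRI = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]
COLUMNS = [list(col) for col in zip(*TRI)]


def entropy_weights(columns):
    """Weights proportional to 1 - normalised Shannon entropy of each column."""
    k = len(columns[0])                       # number of locations, K
    diversity = []
    for col in columns:
        total = sum(col)
        props = [x / total for x in col]      # p_ij: share of location i in item j
        entropy = -sum(p * math.log(p) for p in props if p > 0) / math.log(k)
        diversity.append(1.0 - entropy)
    return [d / sum(diversity) for d in diversity]


def cv2_weights(columns):
    """Weights proportional to the squared coefficient of variation of each column."""
    cv2 = [statistics.pvariance(col) / statistics.fmean(col) ** 2 for col in columns]
    return [c / sum(cv2) for c in cv2]


w_entropy = entropy_weights(COLUMNS)
w_cv2 = cv2_weights(COLUMNS)
for name, we, wc in zip(CHEMICALS, w_entropy, w_cv2):
    print(f"{name:3s} entropy weight = {we:.4f}   cv^2 weight = {wc:.4f}")
```

Both schemes give more weight to chemicals whose values differ most across States, since a column that is nearly constant carries little information for ranking.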

36 Extensions…..
Ranking depends critically on the choice of distance measure & the choice of weights
Distance measures: Squared Distance [$L_2$-norm]; Mean Deviation [$L_1$-norm]

37 Reversal of Roles of Rows & Cols.

38 Pollution Data on 50 US States

39 US Pollution Data [contd.]....

40 US Pollution Data [contd.]….

41 Results of TOPSIS Analysis : Two Sets of Weights [Entropy & CV] & Two Distance Measures [L-1 & L-2]

42 Results …contd…..

43 Results…contd….

48 Questions…..
Q1. Can the indicators be expressed in their original units of measurement, or do we need percentages only?
Ans. Yes, original units will do, since the formulae involve unit-free computations. Also see the original US pollution data at the end.
Q2. What about interdependence among the indicators?

49 Questions….
Ans. [to Q2] The indicators are believed to be seemingly uncorrelated. If there is any functional dependence among them, only the smallest subset of them should be used.
Q3. What about PCA?
Ans. That would not lead to a ranking of the locations. It would also be difficult to interpret the linear combinations of the indicators.

