Misinterpretation of Data and the Importance of Metadata Bernie Gloyn Ontario DLI Training – April 11, 2005.


1 Misinterpretation of Data and the Importance of Metadata Bernie Gloyn Ontario DLI Training – April 11, 2005

2 Outline
Crime rates example from Wendy
Metadata
Some considerations by data types
–Census
–Sample Survey
–Administrative
Comparisons
–Crude vs standardized

3 Crime Rates Example
Ebert & Roeper review of Michael Wilson movie "Michael Moore Hates America"
Ebert doubted claim that Cdn crime rate 2X the USA rate
–Moorelies.com | News: Whoa; Stuart Didn't See That One Coming
Ebert conceded with writer that stats supported claim - figures below
Comparison of STC and US Bureau of Justice Statistics website stats
Crimes per 100,000 population - 2003
                    Canada     USA
All crimes           8,530   4,267
Violent crimes         958     523
Property crimes      4,275   3,744

4 Crime Rates Example
Debunked by Craig from Canada
Simplistic comparison
–Similar category titles on violent and property crimes but different definitions
–Concluded violent crime 2-3 times higher in US, property crimes close
–Bureau of Justice Statistics Crime & Justice Data Online
–Canadian Statistics - Crimes by type of offence
Crimes per 100,000 population - 2002
                                        Canada     USA
Violent crime
  Homicide                                 1.9     5.6
  Robbery                                   85     146
  (comparison of US rape and aggravated assault difficult with Cdn sexual assault and assaults)
Property crime
  B & E (Cdn) - Burglary (US)              879     746
  Theft (Cdn) - Larceny & Theft (US)     2,191   2,446
  Motor vehicle theft                      516     432

5 Crime Rates Example

6 Metadata
STC Policy on Informing Users of Data Quality
–In place since 1978, tightened up in 2000 in response to 1999 AG report
»Looked at 4 surveys: LFS, CPI, MSM & UCRS
–Recognised that all statistics are to some extent estimates
–To be used with awareness of strengths and weaknesses – fitness for use
–Key tool is the Integrated Meta Database, where you see definitions, data sources and methods
»Repository of info on STC surveys and programs

7 Metadata
Can't overemphasize importance of good metadata, finding it and reading it
–Definitions, Data Sources and Methods (recently revamped)
»Questionnaire and reporting guides
»Survey Description
»Data sources and methodology
»Data Accuracy
»Documentation
»Contact us
–Statistics Canada: Canadian Community Health Survey

8 Metadata
–Online Catalogue (OLC)
»Canadian Community Health Survey: public use microdata file: Product main page
–DLI website
»DLI - Canadian Community Health Survey Cycle 1.1
–DLI listserv
»Ask and we will find out from the Division

9 Metadata
With Public Use Microdata Files, the code book is very important
–Gives questions asked and codes used for responses
–Missing values, refusals, don't know and not applicable are often assigned numeric codes
–The numeric codes used are not consistent
–To most software, these numeric codes look like valid responses

10 Metadata
In the 1990 Health Promotion Survey there was a series of questions about alcohol consumption. First they asked if the respondent EVER drank alcohol; if YES, they asked if they drank within the last 12 months; and if YES, they asked for the number of drinks for each day of the past 7 days. The code book showed number of drinks per day as:
81 F4MON 2 0096 0097
HOW MANY DRINKS DID YOU HAVE ON: MONDAY ?
  00     NONE                   4651/ 7334907
  01:40  NUMBER OF DRINKS       1403/ 2585080
  41     MORE THAN 40 DRINKS       1/     106
  98     QUESTION NOT ASKED     7648/10567910
  99     NOT STATED               89/  155377
NOTE: F4MON SUN NOT ASKED IF F4=1 OR F1=2 OR F2=2
82 F4TUE 2 0098 0099
HOW MANY DRINKS DID YOU HAVE ON: TUESDAY ?
  00     NONE                   4608/ 7306101
  01:40  NUMBER OF DRINKS       1447/ 2613991
  98     QUESTION NOT ASKED     7648/10567910
  99     NOT STATED               89/  155377
NOTE: F4MON SUN NOT ASKED IF F4=1 OR F1=2 OR F2=2
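The F4MON entry shows why reserved codes matter: 98 and 99 sit in the same numeric field as real drink counts. A minimal Python sketch, assuming the codes listed in the code book entry and a handful of made-up responses (the `responses` list and both function names are illustrative and not part of the survey file):

```python
# Reserved codes, taken from the F4MON code book entry.
NOT_ASKED = 98
NOT_STATED = 99
MISSING_CODES = {NOT_ASKED, NOT_STATED}


def naive_mean(values):
    """Mean over every value, as software would compute it by default."""
    return sum(values) / len(values)


def mean_excluding_missing(values):
    """Mean over valid answers only; returns None if nothing valid remains."""
    valid = [v for v in values if v not in MISSING_CODES]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical responses for illustration only, not real survey records.
responses = [0, 2, 98, 1, 99, 98, 3, 0]

print("naive mean:", naive_mean(responses))
print("mean of valid answers:", mean_excluding_missing(responses))
```

On these made-up responses the naive mean is 37.625 drinks, while the mean over valid answers is 1.2. Whether respondents coded 98 should instead count as zero drinks depends on the question being asked of the data, and code 41 is a top code, so a mean that includes it understates the heaviest drinking.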

11 Some Considerations by Data Type
Census
–Short form - 9 questions are asked of 100%
–Long form – 20% sample
Sample Survey
–Most data sets – LFS, GSS, NPHS, etc.
Administrative
–GST, Revenue Canada, Vital Stats, school enrollments, provincial health insurance, …

12 Census
High quality but non-sampling errors
–coverage, measurement, non-response, processing errors
Key documents are the Census Handbook, Census Dictionary and Census Technical Reports
Communiqué for revisions
–Population and dwelling count amendments
–Don't change the Census base

13 Census
Conceptual/definition changes over time can be very important
Census family
–Refers to a married couple (with or without children of either or both spouses), …. … A couple living common-law may be of opposite or same sex. Children in a census family include grandchildren living with their grandparent(s) but with no parents present
–census family, 2001 census
Economic family
–Refers to a group of two or more persons who live in the same dwelling and are related to each other by blood, marriage, common-law or adoption.
–economic family, 2001 census

14 Sample Surveys
Estimates
–Estimate of the population characteristics based on a sample from a survey frame
–Bigger sample gives better estimates
Issue of sample size
–30,000 sample
–Want sub population – retirees ~ 3000, males ~ 1400, immigrants ~ 200, BC ~ 40
–Unstable estimates as you break down the sample
–Often forget estimate has a confidence interval
»73% with a CI of 10% is not significantly different from 80%
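The two points on this slide can be sketched in Python. This is a rough illustration only: it assumes "CI 10%" means a margin of plus or minus 10 percentage points, and it uses the textbook normal approximation for a simple random sample, which ignores the design effects of a real survey. The function names are illustrative.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1.0 - p) / n)


def interval(estimate, margin):
    """Lower and upper bounds of estimate +/- margin."""
    return estimate - margin, estimate + margin


# The slide's example: 73% with a margin of 10 points versus 80%.
low, high = interval(0.73, 0.10)
print(f"73% +/- 10 points covers {low:.2f} to {high:.2f}")
print("80% falls inside the interval:", low <= 0.80 <= high)

# Subgroup sizes from the slide: the margin widens as the sample is broken down.
subgroups = [("full sample", 30000), ("retirees", 3000), ("males", 1400),
             ("immigrants", 200), ("BC", 40)]
for label, n in subgroups:
    print(f"{label:12s} n={n:6d} margin={margin_of_error(0.73, n):.3f}")
```

Because 80% lies inside the interval around 73%, the two figures cannot be called different on this evidence, and the margin for the 40-person subgroup is many times wider than for the full sample.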

15 Sample Surveys
Statistical measures of quality
–Coefficient of Variation (CV) gives Standard Deviation as % of Mean (for a survey estimate, the standard error as % of the estimate)
Measure of the fitness for use
–the smaller the CV, the more reliable the estimate
–CVs < or = 15% generally considered reliable for most uses
–CVs > 15% but < 33% are reliable for some purposes with caution
–CVs > 33% are unreliable and not published
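The CV rule of thumb can be written as a small Python sketch. The thresholds are the ones on the slide; the slide does not say where a CV of exactly 33% falls, so placing it in the "unreliable" band here is an assumption, and the function names are illustrative.

```python
def cv_percent(standard_error, estimate):
    """Coefficient of variation: standard error as a percentage of the estimate."""
    return 100.0 * standard_error / estimate


def quality_band(cv):
    """Map a CV (in percent) to the reliability bands listed on the slide."""
    if cv <= 15:
        return "reliable for most uses"
    if cv < 33:
        return "use with caution"
    # Assumption: a CV of exactly 33% is treated as unreliable.
    return "unreliable, not published"


# Hypothetical estimates and standard errors, for illustration only.
for estimate, se in [(50.0, 5.0), (50.0, 10.0), (50.0, 20.0)]:
    cv = cv_percent(se, estimate)
    print(f"estimate={estimate} se={se} CV={cv:.1f}% -> {quality_band(cv)}")
```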

16 Data Quality Symbols

17 Sample Surveys
Sample value weighted up to represent population
–20% sample for census
»Simple weight is 5; more complex weights are adjusted for characteristics, response rates, etc.
–example from Mike
Another Health survey
Analyst confusion on weight and height asked in survey
–Used body weight as the survey weight
–Survey weight was around 400 – … number of obese Cdns!!
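A minimal Python sketch of weighting up, and of the mix-up described on this slide. The records and field names (`survey_weight`, `body_weight_kg`, `obese`) are hypothetical; only the simple census weight of 5 and the survey weight of around 400 come from the slide.

```python
# Simple weight for a 20% sample: each sampled unit stands for 1 / 0.20 = 5 units.
sampling_fraction = 0.20
simple_weight = 1 / sampling_fraction


def weighted_count(records, weight_field, flag_field):
    """Sum the chosen weight over records where the flag is true."""
    return sum(r[weight_field] for r in records if r[flag_field])


# Hypothetical respondents, for illustration only.
records = [
    {"survey_weight": 400, "body_weight_kg": 95, "obese": True},
    {"survey_weight": 400, "body_weight_kg": 70, "obese": False},
    {"survey_weight": 400, "body_weight_kg": 110, "obese": True},
]

correct = weighted_count(records, "survey_weight", "obese")
mistaken = weighted_count(records, "body_weight_kg", "obese")
print("simple census weight:", simple_weight)
print("population estimate using the survey weight:", correct)
print("meaningless total using body weight instead:", mistaken)
```

The two totals differ only because of which column was passed as the weight, which is the kind of error the code book and record layout are meant to prevent.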

18 Sample Surveys
Changes in frame used for the sample
–Annual Survey of Manufactures moved to the Business Register (ref yr. 2000)
»25,000 incorporated firms missing from survey coverage before
–5% (1/3) of 15% increase from 1999 – 2000
–ASM also changed survey coverage
»included 35,000 incorporated firms below $30,000 annual sales
–2% of 15% increase from 1999 – 2000
–Almost half the 15% annual increase from coverage improvements
–Annual Survey of Manufactures (ASM)
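The "almost half" figure follows from the slide's own numbers, as this short Python check shows. It assumes "5% of 15%" and "2% of 15%" mean 5 and 2 percentage points of the 15% increase; the variable names are illustrative.

```python
total_increase = 15.0            # percent increase from 1999 to 2000
business_register_effect = 5.0   # points attributed to the move to the Business Register
small_firm_effect = 2.0          # points attributed to adding firms below $30,000 annual sales

coverage_effect = business_register_effect + small_firm_effect
coverage_share = coverage_effect / total_increase
remaining_growth = total_increase - coverage_effect

print(f"coverage improvements: {coverage_effect} points")
print(f"share of the increase: {coverage_share:.1%}")
print(f"increase left after coverage changes: {remaining_growth} points")
```

Read this way, 7 of the 15 points come from coverage changes rather than from growth in manufacturing itself.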

19 Administrative Data
Original purpose that the data was collected
–Provincial Health counts differ from Census
Definitions used aren't the same
–Success rate higher for students at some universities (mostly in QC)
–Deregister 4 weeks into course, elsewhere is 3 to 4 days
Coverage of the universe (total population)
–not everyone reports income tax
Administrative changes can affect data series

20 Administrative Data
Provide small area estimates
–Normally postal code geography
–Postal code can be problematic
»Highest income neighbourhood example

21 Crude vs Standardised Comparisons between countries

22 Crude vs Standardised

23

24 Another mortality comparison but over time
–1951: 2.83 per 1000 from heart disease
–1993: 1.93
–Improvement from advances: -0.9?
–change due to progress: -2.19
–change due to aging: +1.29
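The point of this slide is that the crude change of -0.9 hides two opposing components. A short Python check of the slide's figures, assuming the dashes before 0.9 and 2.19 are minus signs (the variable names are illustrative):

```python
rate_1951 = 2.83          # heart disease deaths per 1000, from the slide
rate_1993 = 1.93
change_progress = -2.19   # change attributed to progress
change_aging = 1.29       # change attributed to an aging population

# Round to two decimals to avoid floating-point noise in the comparison.
observed_change = round(rate_1993 - rate_1951, 2)
explained_change = round(change_progress + change_aging, 2)

print("observed crude change:", observed_change)
print("progress + aging:", explained_change)
print("components add up:", observed_change == explained_change)
```

The crude rates alone suggest an improvement of 0.9 per 1000, while the decomposition attributes a larger improvement of 2.19 to progress, partly offset by aging.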

25 Thank you!

