The Winter Institute on Statistical Literacy for Librarians Demystifying statistics for the practitioner Anna Bombak, Chuck Humphrey, Larry Laliberte,

Presentation transcript:

The Winter Institute on Statistical Literacy for Librarians Demystifying statistics for the practitioner Anna Bombak, Chuck Humphrey, Larry Laliberte, David Sulz, and Amanda Wakaruk February 18-20, 2015

Outline
- Introductions
- A framework for understanding statistics
- How geography shapes statistics
- Official statistics: national
- Official statistics: international
- Non-official statistics
- Applying what you have learned

Introductions: your backgrounds
- Please introduce yourself
  - Your name
  - Your institutional affiliation
  - Your current occupational activity

Introductions: your backgrounds
- A little over three-fourths are from an academic library setting. The split in earlier Institutes was closer to 50/50.
- The largest group, with 11, is from universities other than the U of A.
- The second largest groups, with 6 each, are SLIS and Government.
[Chart: Academic library setting: Other Universities (11), UAL (4), SLIS (6). Non-academic libraries: Government (6), Public / Special (0).]

Introductions: your backgrounds
- Geographically, 11 of you are from outside Alberta.
- Nine are from four other provinces: six from BC and one each from NS, ON, and SK.
- Sixteen are from the Edmonton region.
- Two are from outside Canada: one from the United States and one from South Africa. Welcome!
[Chart: Alberta (16), Outside Alberta (9), Outside Canada (2).]

Uses of quantitative evidence
- To provide a description
  - This typically entails answering the question about the scale or scope of something observable and its characteristics.
- To make a comparison
  - This usually involves establishing the degree of similarity or dissimilarity among observables.
- To identify a relationship
  - This method looks at the correlation among characteristics of observables, that is, how are things related?

Examples of quantitative evidence
- Description: “Inside the 2014 Forbes billionaires list: facts and figures,” Forbes, March 3, 2014
- Comparison: “New alarm bells over household debt,” The Globe and Mail, Feb 5, 2015
- Relationship: “American gun ownership and suicide rates,” The Economist, Feb 2, 2015

Statistics are ubiquitous
“Statistics are generated today about nearly every activity on the planet. Never before have we had so much statistical information about the world in which we live. Why is this type of information so abundant? For one thing, statistics have become a form of currency in today’s information society. Through computing technology, society has become very proficient in calculating statistics from the vast quantities of data that are collected. As a result, our lives involve daily transactions revolving around some use of statistical information.”
Data Basics, page 1.1

More statistics in electronic formats
- In the past 25 years we have had a decline in the publication of official and non-official statistics in print, while the publication of these statistics in electronic formats has grown exponentially.
- This has only heightened the problem of finding statistics.

Statistics: what are we talking about?
- Statistics and data are related but different

How statistics and data differ [figure: Statistics, Data]

How statistics and data differ [figure]

Microdata [figure]

Microdata record layout [figure]

How statistics and data differ
Statistics
- numeric facts and figures that provide summaries
- derived from data, i.e., already processed
- require definitions and classifications
- presentation-ready
- published
Data
- numeric files created and organized for processing
- require processing to be useful
- require detailed documentation
- not display-ready
- disseminated, not published
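A minimal sketch in Python of the distinction above, assuming a handful of invented respondent records. The list of records plays the role of data (microdata), and the small percentage table computed from it plays the role of a statistic. The field names and values are hypothetical and are not drawn from any real survey.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent (invented values).
microdata = [
    {"id": 1, "sex": "Female", "smoker": "Yes"},
    {"id": 2, "sex": "Male", "smoker": "No"},
    {"id": 3, "sex": "Female", "smoker": "No"},
    {"id": 4, "sex": "Male", "smoker": "Yes"},
    {"id": 5, "sex": "Female", "smoker": "No"},
]


def percent_smokers_by_sex(records):
    """Summarize records into a statistic: percent of smokers within each sex."""
    totals = Counter(r["sex"] for r in records)
    smokers = Counter(r["sex"] for r in records if r["smoker"] == "Yes")
    return {sex: round(100 * smokers[sex] / totals[sex], 1) for sex in sorted(totals)}


# The data need processing; the resulting table is ready to present.
table = percent_smokers_by_sex(microdata)
for sex, pct in table.items():
    print(f"{sex}: {pct}% smokers")
```

The records on their own are not display-ready, while the two-row result could go straight into a table or chart.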

Statistics are presentation ready
- Tables, charts, and graphs are typically used to display statistics. You will find statistics sprinkled in text as part of a narrative describing some phenomenon, but tables and charts are the primary methods of organizing and presenting statistics.

A statistic isn’t real without data
- A ‘real’ statistic requires a data source. If the publisher of a statistic can’t tell you the data source behind it, you should question whether the statistic is ‘real.’ After all, people do make up statistics.
- Notorious example: In an interview on the December 19, 2010 episode of CBS’ 60 Minutes, Meredith Whitney claimed that 50 to 100 “sizable” cities and counties in the U.S. would default on billions of dollars of municipal bonds. Her estimate sparked a mini-panic on the bond market. She refused to release the report behind these predictions on the grounds that her research is proprietary. Bloomberg revealed on February 1, 2011 that she “doesn’t have any numbers to back up her assertions -- she pulled the numbers out of thin air.”

Fabricated statistics
“Fox News guest Steven Emerson says Birmingham is 'totally Muslim'”
He has since admitted that is inaccurate, and for good reason. The stats say Birmingham is about 20 percent Muslim.
Kathie Sanders, Jan 14, 2015

Why fabricated statistics are accepted
Most people around the world are pretty bad when it comes to knowing the numbers behind the news.
Alberto Nardelli and George Arnett, “Today’s key fact: you are probably wrong about almost everything,” The Guardian, Oct 29, 2014

Fabricated data
Diederik Stapel, a Dutch social psychologist, perpetrated an audacious academic fraud by making up studies that told the world what it wanted to hear about human nature.
Yudhijit Bhattacharjee, “The Mind of a Con Man,” New York Times Magazine, April 26, 2013

Erroneous statistics from flawed data
“Reinhart, Rogoff, and the Excel Error That Changed History,” Peter Coy, Bloomberg Businessweek, April 18, 2013

Misinterpretation of statistics
- Some make wrong generalizations from statistics.
- Notorious example: Approximately two years ago, during the Republican Party presidential primaries, Rick Santorum claimed on television and on the campaign trail that “62 percent of kids who enter college with some kind of faith commitment leave without it.” Stephen Colbert suggested that this statistic had to be taken “on faith.” Jonathan Hill reported that “Studies using comparable data from recent cohorts of young people (for example, the National Longitudinal Survey of Youth 1997, the National Longitudinal Study of Adolescent Health, and the National Study of Youth and Religion) have found virtually no overall differences on most measures of identity, practice, and belief between those who [go] to college and those who do not.”

Same statistical concept but different data sources
- A statistical concept may be derived from different data sources and show different results.
- Notorious example: A long-standing debate erupted over a Lancet article published in 2004 that estimated the number of civilian deaths in Iraq in the 18 months following the invasion to be around 98,000. The Iraq Body Count project compiled a database of reported civilian deaths showing between 11,000 and 13,000 deaths in this same period. The UK government embraced statistics from the Iraq Ministry of Health, which reported 3,853 civilian deaths and 15,517 injuries over six months in 2004.

Quality data needed
- A statistic may have been derived from poor quality data and, consequently, may be of limited value. Nevertheless, it remains a ‘real’ statistic.
- The desire is to have quality statistics that are derived from quality data.
- Statistics Canada uses six criteria to define quality statistics, or statistics that are “fit for use”:
  - relevance, accuracy, timeliness, accessibility, interpretability, and coherence

Methods producing data
Observational methods
- Focus is on developing observational instruments to collect data
- Correlation
- Replicate the analysis (same data or similar)
- Statistics summarize observations
Experimental methods
- Focus is on manipulating causal agents to measure change in a response agent
- Causation
- Replicate the experiment
- Statistics summarize experiment results
Computational methods
- Focus is on modeling phenomena through mathematical equations
- Prediction
- Replicate the simulation
- Statistics summarize simulation results

Lifecycle production of data
- The production of data across these three methods happens through a lifecycle process. Understanding the basics of the lifecycle process in which statistics are derived from data can help in the search for statistics.

Lifecycle of survey statistics
1. Program objective
2. Survey unit organized
3. Questionnaire & sample
4. Data collection
5. Data production & release
6. Analysis
7. Findings released
8. Popularizing findings
9. Needs & gaps evaluation

Lifecycle applied to health statistics
Health Information Roadmap Initiative
1. Program objectives
- increased emphasis on health promotion and disease prevention;
- decentralization of accountability and decision-making;
- shift from hospital to community-based services;
- integration of agencies, programs and services; and
- increased efficiency and effectiveness in service delivery

Lifecycle applied to health statistics
Health Information Roadmap Initiative
2. Survey unit organized
3. Questionnaire & sample
4. Data collection
5. Data production & release
6. Analysis
7. Official findings released

Reconstructing statistics
- One way to see the relationship between statistics and the data from which they were derived is to reconstruct statistics that someone else has produced from data that are publicly accessible.

Reconstructing statistics
Health Information Roadmap Initiative
1. Program objective
2. Survey unit organized
3. Questionnaire & sample
4. Data collection
5. Data production & release
6. Analysis
7. Official findings released
8. Popularizing findings
9. Needs & gaps evaluation

Reconstructing statistics
- The statistics that we will reconstruct are reported in “Health Facts from the 1994 National Population Health Survey,” Canadian Social Trends, Spring 1996, pp
- The steps we will follow are:
  - identify the characteristics of the respondents in the article;
  - identify the data source;
  - locate these characteristics in the data documentation;
  - find the original questions used to collect the data;
  - retrieve the data; and
  - run an analysis to reproduce the statistics.

The findings to be reproduced Page 26

Summary of variables identified
- Findings apply to Canadian adults
  - Likely need age of respondents
- Men and women
  - Look for the sex of respondents
- Type of drinkers
  - Look for frequency of drinking or a variable categorizing types of drinkers
- Age
  - Look for actual age or age in categories
- Smokers
  - Look for smoking status

Identify the data source
- Survey title is identified: National Population Health Survey,
- Public-use microdata file is announced
- Page 25 of the article

Locate the variables
- Examine the data documentation for the National Population Health Survey,
  - PDF version is on-line
  - Use TOC and link to “Data Dictionary for Health”
- Identify the variables from their content
  - NOTE: check how missing data were handled
- Trace the variables back to the questionnaire
- Did the sampling method require weighting cases?
  - NOTE: in addition to the other variables, is a weight variable needed to adjust for the sampling method?
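The two NOTE items above can be illustrated with a short sketch in Python. It assumes invented records, a hypothetical `weight` field giving the number of people each respondent represents, and `None` as the code for a missing answer. Actual surveys such as the NPHS document their own variable names, missing-data codes, and weights, so everything named here is an assumption.

```python
# Hypothetical respondents; the weight is the number of people each one represents.
respondents = [
    {"drinker_type": "regular", "weight": 1200.0},
    {"drinker_type": "regular", "weight": 800.0},
    {"drinker_type": "occasional", "weight": 3000.0},
    {"drinker_type": None, "weight": 500.0},  # missing answer, excluded below
]


def percent_distribution(records, var, weight_var=None):
    """Percent distribution of `var`, skipping missing values.

    If `weight_var` is given, each record counts by its weight;
    otherwise every record counts once.
    """
    totals = {}
    for r in records:
        if r[var] is None:
            continue
        w = r[weight_var] if weight_var else 1.0
        totals[r[var]] = totals.get(r[var], 0.0) + w
    grand_total = sum(totals.values())
    return {k: round(100 * v / grand_total, 1) for k, v in totals.items()}


unweighted = percent_distribution(respondents, "drinker_type")
weighted = percent_distribution(respondents, "drinker_type", "weight")
print("Unweighted:", unweighted)
print("Weighted:  ", weighted)
```

With these invented numbers the two distributions differ noticeably, which is why a reproduction of published statistics usually needs to follow the same weighting and missing-data choices as the original analysis.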

Retrieve and analyze the data
- The microdata for the NPHS are available through Statistics Canada without cost. However, a licence regarding the terms of use must be signed. Universities subscribed to the Data Liberation Initiative (DLI) can download the data directly.
- Make use of local data services to retrieve data from the NPHS.

Lessons from the NPHS example
- This example demonstrates the distinction between producing statistics and interpreting statistics that have been published by others.
- This is an important distinction because:
  - Choices are made in creating statistics.
  - Interpreting statistics requires an ability to understand the choices that were made.
- Searching for statistics that others have published can be facilitated by understanding these points.

Attributes of statistics
- Content or subject
  - What was observed?
- Geographic setting
  - Where was it observed?
- Time coverage
  - When was it observed?
- Metric of measurement
  - How was it measured?

Attributes of statistics
Six dimensions or variables in this table. The cells in the table are the estimated number of smokers.
- Geography: Region
- Time: Periods
- Social content: Smokers, Education, Age, Sex

Statistics are about definitions [figure]

Statistics are about definitions
- Statistics are dependent on definitions. You may think of statistics as numbers, but the numbers represent measurements or observations based on specific definitions.
- As just shown, tables are structured around geography, time, and content based on the attributes of the unit of observation. These properties all depend on definitions.

Statistics are about definitions
- Consider the following example from the Canadian Census on the data behind statistics about visible minorities. This table displays the size of the visible minority population in Canada from the 2006 Census.
“Visible Minority Groups (15), Generation Status (4), Age Groups (9) and Sex (3) for the Population 15 Years and Over of Canada, Provinces, Territories, Census Metropolitan Areas and Census Agglomerations, 2006 Census - 20% Sample Data”

Statistics are about definitions
- How is visible minority status identified in the Census? Are aboriginals among the visible minority in Canada? What is the definition of visible minority?


Statistics involve classifications [figure: table with its classifications labelled, e.g., Sex (Total, Male, Female) and Periods]

Statistics involve classifications
Some classifications are based on standards while others are based on convention or practice. For example, Standard Geography classifications.

Statistics involve classifications
- The definitions that shape statistics specify the metric of the data they summarize (for example, Canadian dollars) or the categories used to classify things if a statistic represents counts or frequencies. In this latter case, classification systems are used to identify categories of membership in a concept’s definition.
- Examples of standard classifications include the North American Industry Classification System (NAICS), the National Occupational Classification (NOC-S) and the International Classification of Diseases (ICD).
- Look at these examples and describe the coding systems used.
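As a rough illustration of how a coding system turns records into counts, the Python sketch below uses a made-up two-level classification. The codes and labels are hypothetical and are not actual NAICS, NOC-S, or ICD entries. The only assumption borrowed from such systems is that a longer code nests inside the shorter code it begins with.

```python
from collections import Counter

# Hypothetical two-level classification (NOT actual NAICS, NOC-S, or ICD codes).
CLASSIFICATION = {
    "A": "Goods-producing",
    "A1": "Farming",
    "A2": "Manufacturing",
    "B": "Services-producing",
    "B1": "Education",
}

# Each record in the data carries one detailed code.
coded_records = ["A1", "A2", "A2", "B1", "B1", "B1"]


def counts_at_level(codes, level):
    """Count records after truncating each code to `level` characters."""
    tally = Counter(code[:level] for code in codes)
    return {CLASSIFICATION[c]: n for c, n in sorted(tally.items())}


detailed = counts_at_level(coded_records, 2)  # finest categories
broad = counts_at_level(coded_records, 1)     # rolled up to the top level
print(detailed)
print(broad)
```

The same six records yield different tables depending on the level of the classification chosen, which is one of the choices a reader needs to identify when interpreting a published count.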

A quick review
- To this point, we have established that:
  - Statistics are ‘real’ only if they are derived from data;
  - Statistics are dependent on definitions of the concepts they summarize;
  - Statistics that represent counts of things in the data employ classification systems, which are based either on standards or convention;
  - Statistics are numeric summaries over geography, time, and content; and
  - Statistics are typically organized for display using tables or charts.

Metadata for statistics
- “Tips for Reading a Statistical Table” provides a list of information that a table should provide.
- Metadata for statistical tables
  1. A title for the table containing references to the content, geography, time, and unit of measurement;
  2. A reference in the title or a note about the identity of the individual or organization who produced the table;
  3. A date when the table was published;
  4. Definitions of concepts or sources to classification systems or standards used;
  5. The type of classification system used for the headings of columns and rows;

Metadata for statistics
  6. Notes helpful for interpreting the information in the cells of the table, such as a description of any special steps taken in preparing the statistics;
  7. Information indicating whether the statistics were derived from a raw or a weighted sample (or possibly both);
  8. The sample size if the data come from a sample;
  9. The unit of observation for the data used; and
  10. The agency that produced the data file.
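One way to put the ten-item list to work is as a checklist applied to a table found during a search. The Python sketch below is only an illustration: the field names are shorthand invented here for the ten items, and the example table and agency are made up.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TableMetadata:
    """One field per item in the ten-point list (field names are invented)."""
    title: Optional[str] = None                # 1. content, geography, time, unit
    table_producer: Optional[str] = None       # 2. who produced the table
    date_published: Optional[str] = None       # 3. publication date
    definitions: Optional[str] = None          # 4. concepts, standards used
    classification_type: Optional[str] = None  # 5. row and column headings
    notes: Optional[str] = None                # 6. interpretation notes
    weighting: Optional[str] = None            # 7. raw or weighted sample
    sample_size: Optional[int] = None          # 8. if data come from a sample
    unit_of_observation: Optional[str] = None  # 9. unit of observation
    data_producer: Optional[str] = None        # 10. agency behind the data file


def missing_items(meta):
    """Return the names of checklist items the table does not supply."""
    return [f.name for f in fields(meta) if getattr(meta, f.name) is None]


# Invented example: a table that supplies only four of the ten items.
example = TableMetadata(
    title="Estimated smokers by region, age and sex (invented example)",
    table_producer="Example Statistical Agency",
    sample_size=1000,
    unit_of_observation="person",
)
gaps = missing_items(example)
print(gaps)
```

A long list of gaps does not prove a statistic is unreliable, but it shows which questions to ask of the publisher.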

Looking for statistics
- Who would publish this statistic?
  - What organization would publish this statistic?
  - Is this a statistic that would be published by a public or a private organization?
  - Is this a statistic that would be published as an operational requirement?
- Can you identify a data source for the statistic?
  - What data source would be used to produce this statistic?
  - Who would produce this data source?
  - Would there be a distributor for the data source?

Looking for statistics
- What view of the data would be shown in this statistic?
  - What would be the level of geography?
  - What time period would be shown?
  - What social characteristics would be shown?
  - Why would someone show this view of the data?
- What metadata would describe this statistic?
  - What definitions would describe the geography, time, or social characteristics?
  - What standard classification system would be used for the categories of the statistic?

Search strategies for statistics
- Over the next two days, we will talk about two general search strategies for finding statistics.
  - The “publisher” strategy is to identify an organization that would produce and publish such a statistic. This approach relies on knowledge of statistical producers. For public sector sources, an example of such knowledge is understanding governmental structure and the content for which its agencies are responsible.
  - The “data” strategy is to identify a data source from which the statistics were derived. This approach relies on knowledge of data sources produced by agencies or organizations.

Framework