Published by Elvin Hart. Modified over 9 years ago.
1
Sociology 5811: Lecture 2: Datasets and Simple Descriptive Statistics. Copyright © 2005 by Evan Schofer. Do not copy or distribute without permission.
2
Announcements 1. Lab meets Monday 1:25, Blegen 440 2. Course lecture notes at: –http://www.soc.umn.edu/~schofer –Click on “Soc 5811”, go to “Course Files”
3
From Measurement to Datasets Suppose we: –1. Choose a unit of analysis –2. Choose a measurement strategy –3. Take measurements on relevant cases Result: We end up with sets of measurements on a group of cases Q: What next? A: Data is often organized in a spreadsheet: –Rows contain all measurements on each case –Columns reflect sets of measurements or “variables”
4
Datasets: Example Suppose we measured 5 people regarding views on gun control and gun ownership:
Person   Views on Gun Control   # Guns owned
1        Favor                  0
2        Oppose                 3
3        Favor                  0
4        Favor                  1
5        Oppose                 1
Rows contain all info on each person (a case). Columns contain all measurements on a particular topic (a variable).
5
From Measurement to Datasets Issue: To facilitate data analysis, it is best to enter data as numbers, rather than text –Often called “coding” data Less good option: Use text words “Favor” and “Oppose” in our gun control dataset Better option: Convert “Favor” and “Oppose” to numeric values –Example: 1 = favor, 0 = oppose Advantage: more computation options Disadvantage: Data is harder to interpret by eye
6
Datasets: Recoded Example In this dataset, “Favor” was recoded to 1, “Oppose” to zero.
Person   Views on Gun Control   # Guns owned
1        1                      0
2        0                      3
3        1                      0
4        1                      1
5        0                      1
Note that it is harder to visually determine the meaning of the variable. You have to remember what the numbers mean…
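The recoding step can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure used in class: the names `CODEBOOK`, `views`, and `LABELS` are hypothetical, and the text answers are inferred from the codes in the recoded table (1 = favor, 0 = oppose).

```python
# Hypothetical codebook; the numeric codes follow the slide (1 = favor, 0 = oppose).
CODEBOOK = {"Favor": 1, "Oppose": 0}

# Views of persons 1-5 as text, inferred from the recoded table above.
views = ["Favor", "Oppose", "Favor", "Favor", "Oppose"]

# Replace each text answer with its numeric code.
coded = [CODEBOOK[v] for v in views]

# Invert the codebook so codes can be translated back into labels.
LABELS = {code: label for label, code in CODEBOOK.items()}
decoded = [LABELS[c] for c in coded]

print(coded)
print(decoded)
```

Keeping the codebook next to the data addresses the disadvantage noted on the slide: the numbers are easy to compute with, and the labels can always be recovered.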
7
Review Measurement: The task of gathering information that characterizes or represents a social phenomenon Q: What is “Unit of Analysis”? –Answer: The type of thing about which we are collecting information Q: What are 3 measurement scales? Examples? Nominal Ordinal Interval / Ratio
8
Review: Measurement Problems Problems that arose in the survey given last class: Question 10: What transportation do you generally use to get to class? –Answer: Both “car” and “public transportation” Question 9: How many miles away do you live? –Answer: 4 blocks Question 6 (Liberal or conservative, from 1-10) –Answer: “3 or 4” How many CDs do you own? –Answer: “Over 100”
9
Today’s Class: Describing Information Tools for describing a single variable: List, Frequency lists, charts, histograms Characterization of “Typical” cases –Ex: Mean (“average”), Mode, Median Characterization of Variation –Ex: Min, Max, Variance, etc.
10
Listing Variables Lists: Values of a variable for all cases Looking at the “raw data” Report command in SPSS –Or just look at data in the SPSS data editor Advantages: –Easy –Gives a rich description – you can see every case Disadvantages: –Cumbersome for large datasets –If data involves complex coding, you may not be able to interpret it visually
11
Frequency Lists Frequency Lists: Tables that show how many cases take on a particular value –Also called “frequencies”, “frequency distributions” Examples: –Congressional vote. How many “Yes” vs “No”? –Social class: How many = low, middle, upper? –Age: How many = 1 year old, 2 years, … 100 years? Relevant SPSS Command: Frequencies
12
Example from SPSS Note: Men coded as 1, Women coded as 2
GENDER            Freq.   %       Valid %   Cuml. %
1.00              6       33.3    35.3      35.3
2.00              11      61.1    64.7      100.0
Total             17      94.4    100.0
Missing: System   1       5.6
Total             18      100.0
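To see where the columns of such a table come from, here is a minimal Python sketch that rebuilds the percentages from the counts in the SPSS output (6 men, 11 women, 1 missing). The function `frequency_table` is a hypothetical helper, not an SPSS feature, and the order of cases in `gender` is invented.

```python
from collections import Counter

# Gender codes for 18 cases; None marks the one missing answer.
# Counts come from the SPSS table; the order of cases is made up.
gender = [1] * 6 + [2] * 11 + [None]

# Only non-missing answers count as "valid" cases.
valid = [g for g in gender if g is not None]
counts = Counter(valid)

def frequency_table(counts, n_total):
    """Return rows of (value, freq, percent, valid percent, cumulative percent)."""
    n_valid = sum(counts.values())
    rows = []
    running = 0
    for value in sorted(counts):
        freq = counts[value]
        running += freq
        rows.append((
            value,
            freq,
            round(100 * freq / n_total, 1),    # % of all cases, missing included
            round(100 * freq / n_valid, 1),    # % of valid cases only
            round(100 * running / n_valid, 1), # cumulative % of valid cases
        ))
    return rows

table = frequency_table(counts, len(gender))
for row in table:
    print(row)
```

The "%" column divides by all 18 cases, while "Valid %" and "Cuml. %" divide by the 17 cases with an answer, which is why the two sets of percentages differ.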
13
Frequency Lists Advantages: –Useful for large datasets –Fairly rich description of data – once you get used to reading them… Disadvantages: –Unlike a list, you can’t see which case is which or compare with other variables –Best for nominal and some ordinal variables only –Not useful if all values are unique, such as: rank orderings, many continuous variables
14
Visual Representations: Bar Charts “Bar Chart” –Essentially a visual representation of a frequency list –Height of each bar represents the number of cases –For nominal & some ordinal variables only Again, rank orders and continuous measures don’t work “Pie Chart” –Similar, but divides up a circle to show frequency All accessible within the Frequencies menu –Just click the Chart button –Or, look under the Graphs menu
15
SPSS Bar Chart
16
A Similar Approach: Pie Chart
17
Graphing Continuous Measures Issue: Continuous variables have an infinite number of possible unique values. Cases rarely have exactly the same value Bar chart would have many bars of height 1 What would you do about zeros? Solution: use “grouped data” Sets of similar values must be “grouped” –Lumped together by constant intervals –Note: Information is destroyed in the process Result: A “Histogram” –Height of bar represents number of cases within a given range of values
18
Histogram: Age (5-year interval) This doesn’t mean that 200 cases are exactly 30 years old… Rather, 200 cases fall in the 5-year interval around age 30 (from 27.5 to 32.5)
19
Histograms: Interval Width Previous example: People were grouped by age, within 5-year intervals –Bars represented ages 17.5-22.5, 22.5-27.5 and so on It is also possible to group people within 1 year intervals – or 50 year intervals –Small interval = more bars in the histogram –Wide interval = fewer bars in the histogram WARNING: Histograms look very different depending on how wide you set the intervals
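The grouping behind a histogram can be sketched in Python as follows. The ages are invented for illustration, and `group_counts` is a hypothetical helper; only the interval boundaries (17.5, 22.5, 27.5, …) come from the slides.

```python
from collections import Counter

def group_counts(values, width, start):
    """Count how many values fall in each interval [lower, lower + width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)  # which interval the value falls in
        lower = start + index * width      # lower edge of that interval
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Invented ages, for illustration only.
ages = [18, 19, 21, 23, 24, 24, 26, 29, 30, 31, 31, 32, 35, 41, 58]

five_year = group_counts(ages, width=5, start=17.5)
twenty_year = group_counts(ages, width=20, start=17.5)

print(five_year)    # bar heights for 5-year intervals
print(twenty_year)  # bar heights for 20-year intervals
```

With 5-year intervals the cluster of cases around age 30 is visible; with 20-year intervals almost every case lands in the first bar. The data are identical in both runs, which is the point of the warning above.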
20
Histogram: Age (1-year interval)
21
Histogram: Age (20-year interval)
22
Histograms: Interval Width Changing the number of “bars” in the histogram alters the appearance of the graph Wide intervals/few bars result in greater simplification of data Suggestion –1. Try different intervals In SPSS, go to “interactive histogram” –2. Don’t over-interpret a crude histogram Another example: National Wealth –Unit of analysis = country –Variable = GDP per capita, a measure of wealth
23
Histogram: Wide Intervals National Wealth 1990
24
Histogram: Narrow Intervals National Wealth 1990
25
Histograms Advantages: 1. Useful even for continuous measures 2. Preserves information on the distribution of the variable Both peaks and zeros are apparent Disadvantages: 1. Interval width can be a problem Too wide results in loss of information Too narrow results in too many bars – unreadable.
26
Interpreting Histograms: Age Try to interpret: What is this sample like?
27
Interpreting Histograms: Age Try to interpret this histogram:
28
Interpreting Histograms: Age Try to interpret this histogram:
29
Measures of “Central Tendency” Often, it is important to assess the “typical” values of a variable Examples: –We may wish to know how much money the typical family earns –We may wish to know the age of the typical person in our dataset Solution: Conduct calculations to determine what values are “typical” However, this isn’t as easy as it sounds –Consider some examples…
30
What is the “Center”? National Wealth 1990
31
What is the “Center”?
32
The “Mode” The Mode = the value representing the largest number of cases -- called the “Modal” value Useful for Nominal, Ordinal variables Only useful for Continuous variables if you have grouped data into a histogram Otherwise, all values may very likely be unique Issue: Mode is not very helpful (even misleading) in certain circumstances Ex: If there are many peaks, or a single unusual one Ex: If the variable is distributed quite evenly.
33
Mode: Example Here, the mode is 2 (which corresponds to “female”)
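As a rough check, the mode can be found with Python's standard `statistics` module. The gender counts (6 coded 1, 11 coded 2) come from the earlier SPSS table; the list `exact_ages` is invented to show what happens when every value is unique.

```python
from statistics import multimode

# Valid gender codes from the SPSS example: 6 men (1) and 11 women (2).
gender = [1] * 6 + [2] * 11

# multimode returns a list of the most frequent value(s).
gender_modes = multimode(gender)
print(gender_modes)

# Invented ungrouped ages: every value occurs exactly once.
exact_ages = [18.2, 19.7, 23.1, 30.4]

# With all-unique values, every value ties for "most frequent".
age_modes = multimode(exact_ages)
print(age_modes)
```

The second result lists every case, which is why the mode is only useful for continuous variables after they have been grouped.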
34
Mode: Example Here, the mode is 30 (though it might be different if the histogram had a different interval width)
35
Mode: Example In this case, the mode (45) is not helpful
36
Median The Median = the value of the “middle case” An equal number of cases fall above and below it Can be used for ordinal, continuous variables Advantages: 1. Not influenced by unusual peaks 2. Useful even in very even distributions Disadvantages: 1. Not useful for data spread in two distinct “clumps.”
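A small Python sketch of the median, using invented ages. It also illustrates the "two clumps" disadvantage. Note one assumption: when the number of cases is even, Python's `statistics.median` averages the two middle values, a common convention that the slide does not discuss.

```python
from statistics import median

# Invented ages; with 7 cases, the middle (4th) case after sorting is the median.
ages = [58, 19, 42, 23, 90, 35, 47]
middle_age = median(ages)
print(middle_age)

# Two distinct clumps: young cases and old cases, nothing in between.
two_clumps = [20, 21, 22, 70, 71, 72]

# With an even number of cases, the two middle values (22 and 70) are averaged.
clump_median = median(two_clumps)
print(clump_median)
```

In the clumped data the median falls in the gap between the groups, so it describes no actual case well.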
37
Median Example The median case is 42 years old. Half are older, half are younger!
38
Mean – “Average” The most well-known way of assessing the “middle” Calculated by adding values of all cases, then dividing by the total number of cases Advantages: Useful for continuous measures Not overly influenced by any single peak Disadvantages: Can be influenced by extreme values.
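The sensitivity of the mean to extreme values can be seen in a short Python sketch. The incomes (in thousands) are invented for illustration.

```python
from statistics import mean, median

# Invented family incomes, in thousands.
incomes = [30, 35, 40, 45, 50]

# The same families plus one extremely high income.
with_extreme = incomes + [1000]

mean_before = mean(incomes)
mean_after = mean(with_extreme)
median_before = median(incomes)
median_after = median(with_extreme)

print(mean_before, mean_after)      # the mean shifts a great deal
print(median_before, median_after)  # the median barely moves
```

One extreme case pulls the mean far above what any typical family earns, while the median changes only slightly.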
39
Calculating the Mean: Variables Each column of a dataset is considered a variable. We’ll refer to a column generically as “Y”.
Person   # Guns owned (the variable “Y”)
1        0
2        3
3        0
4        1
5        1
Note: The total number of cases in the dataset is referred to as “N”. Here, N=5.
40
Equation of Mean: Notation Each case can be identified by a subscript. Y_i represents the “ith” case of variable Y; i goes from 1 to N. Y_1 = value of Y for the first case in the spreadsheet, Y_2 = value for the second case, etc. Y_N = value for the last case.
Person   # Guns owned (Y)
1        Y_1 = 0
2        Y_2 = 3
3        Y_3 = 0
4        Y_4 = 1
5        Y_5 = 1
41
Calculating the Mean Equation:
\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}
1. Mean of variable Y is represented by Y with a line on top, called “Y-bar” 2. Equals sign means “is calculated by the following…” 3. N refers to the total number of cases for which there is data Summation (Σ) – will be explained next…
42
Equation of Mean: Summation Sigma (Σ): Summation –Indicates that you should add up a series of numbers The thing on the right is the item to be added repeatedly The things on top and bottom tell you how many times to add up Y-sub-i… AND what numbers to substitute for i.
43
Equation of Mean: Summation 1. Start with bottom: i = 1. –The first number to add is Y-sub-1 2. Then, allow i to increase by 1 –The second number to add is Y-sub-2 (i = 2), then Y-sub-3 (i = 3) 3. Keep adding numbers until i = N –In this case N=5, so stop at 5
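The three summation steps map directly onto a loop. Here is a minimal Python sketch using the gun-ownership values from the earlier slides (0, 3, 0, 1, 1); the variable names are illustrative.

```python
# Y_1 ... Y_5: number of guns owned by each of the 5 people.
guns = [0, 3, 0, 1, 1]
N = len(guns)

total = 0
# i runs from 1 up to and including N, as in the summation notation.
for i in range(1, N + 1):
    # Python lists start at position 0, so Y_i is stored at guns[i - 1].
    total += guns[i - 1]

# The mean is the sum divided by the number of cases.
y_bar = total / N

print(total)
print(y_bar)
```

The loop adds Y-sub-1 through Y-sub-5 and stops at i = N, exactly as the steps describe.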
44
Equation for the Mean: Example Variable: Number of CDs… How many CDs does a person own?
Case   Num CDs   Notation
1      20        Y_1
2      40        Y_2
3      0         Y_3
4      70        Y_4
45
Equation of the Mean: Example
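The equation shown on this slide is not preserved in the transcript. Using the four CD counts from the previous slide (20, 40, 0, 70), the calculation would presumably run as follows:

```latex
\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}
        = \frac{Y_1 + Y_2 + Y_3 + Y_4}{4}
        = \frac{20 + 40 + 0 + 70}{4}
        = \frac{130}{4}
        = 32.5
```

So the mean number of CDs owned is 32.5, even though no person in the dataset owns that number.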