MEASURES OF CENTRALITY
Last lecture summary
Which graphs did we meet?
scatter plot (bodový graf)
bar chart (sloupcový graf)
histogram
pie chart (koláčový graf)
How do they work, and what are their advantages and disadvantages?
SDA women – histogram of heights 2014 n = 48 or N = 48 bin size = 3.8
Distributions
negatively skewed (skewed to the left), e.g., life expectancy
symmetric, e.g., body height
positively skewed (skewed to the right), e.g., income
STATISTICS IS BEAUTIFUL new stuff
Life expectancy data Watch TED talk by Hans Rosling, Gapminder Foundation: tats_you_ve_ever_seen.html
STATISTICS IS DEEP
Simpson’s paradox: UC Berkeley. Though the data are fake, the paradox is the same. – Introduction to statistics
Male
          Applied  Admitted  Rate [%]
MAJOR A
MAJOR B
Female
          Applied  Admitted  Rate [%]
MAJOR A   100      80
MAJOR B
Gender bias
What do you think, is there a gender bias? Who do you think is favored, male or female?
Male
          Applied  Admitted  Rate [%]
MAJOR A
MAJOR B   100      10
Female
          Applied  Admitted  Rate [%]
MAJOR A   100      80
MAJOR B
Gender bias
Male
          Applied  Admitted  Rate [%]
MAJOR A
MAJOR B   100      10
Both
Female
          Applied  Admitted  Rate [%]
MAJOR A   100      80
MAJOR B
Both
Gender bias
          Male      Female
          Rate [%]  Rate [%]
MAJOR A   50        80
MAJOR B   10        20
Both      46        26
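The rates above can be reproduced with a short sketch. The rates and the counts 100/10 (male, Major B) and 100/80 (female, Major A) appear on the slides; the counts 900/450 (male, Major A) and 900/180 (female, Major B) are not in the extracted text and are an assumption, chosen because they are the values consistent with the stated rates of 50, 46, 20 and 26.

```python
# Applicant and admission counts as (applied, admitted) per gender and major.
# 100/10 and 100/80 are from the slides; 900/450 and 900/180 are ASSUMED,
# inferred from the per-major and pooled rates shown on the slides.
admissions = {
    "male": {"A": (900, 450), "B": (100, 10)},
    "female": {"A": (100, 80), "B": (900, 180)},
}


def rate(applied, admitted):
    """Admission rate in percent."""
    return 100 * admitted / applied


def summarize(by_major):
    """Return per-major rates plus the pooled rate under the key 'Both'."""
    rates = {major: rate(*counts) for major, counts in by_major.items()}
    total_applied = sum(applied for applied, _ in by_major.values())
    total_admitted = sum(admitted for _, admitted in by_major.values())
    rates["Both"] = rate(total_applied, total_admitted)
    return rates


summary = {gender: summarize(by_major) for gender, by_major in admissions.items()}
for gender, rates in summary.items():
    print(gender, rates)
```

Women have the higher rate within each major, yet the lower pooled rate, because under these counts most women applied to the major that admits few applicants.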
Statistics is ambiguous
This example illustrates how ambiguous statistics can be. How you choose to present your data can strongly affect what people believe to be the case.
“I never believe in statistics I didn’t doctor myself.” “Nikdy nevěřím statistice, kterou si sám nezfalšuji.”
Who said that? The quote is commonly attributed to Winston Churchill.
What is statistics?
Statistics – the science of collecting, organizing, summarizing, analyzing and interpreting data
Goal – use imperfect information (our data) to infer facts, make predictions, and make decisions
Descriptive statistics – describing and summarizing data with numbers or pictures
Inferential statistics – making conclusions or decisions based on data
Variables
variable – a value or characteristic that can vary from individual to individual; example: favorite color, age
How are variables classified?
quantitative variable – numerical values, often with units of measurement, that answer a "how much" or "how many" question; example: age, annual income, number of children
continuous (spojitá proměnná), example: height, weight
discrete (diskrétní proměnná), example: number of children
continuous variables can be discretized
Variables
categorical (qualitative) variables – categories that have no particular order; example: favorite color, gender, nationality
ordinal – not numerical, but the values have a natural order; example: temperature low/medium/high
Variables
variable (proměnná)
  quantitative (kvantitativní)
    continuous (spojitá)
    discrete (diskrétní)
  categorical (kategorická)
    ordinal (ordinální)
Choosing a profession (figure; panels: Chemistry, Geography) – Statistics
Choosing a profession We made an interval estimate. But ideally we want one number that describes the entire dataset. This allows us to quickly summarize all our data.
Choosing a profession
1. The value at which frequency is highest.
2. The value at which frequency is lowest.
3. The value in the middle.
4. The biggest value on the x-axis.
5. The mean.
Three big M’s
The value at which frequency is highest is called the mode, i.e., the most common value is the mode.
The value in the middle of the distribution is called the median.
The mean is the arithmetic average ("mean" and "average" are synonyms).
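As a minimal sketch, the three M's can be computed with Python's standard `statistics` module. The salary values below are hypothetical and are not the data from the slides.

```python
import statistics

# Hypothetical salaries (in thousands); NOT the data from the slides.
salaries = [40, 45, 45, 50, 55, 60, 125]

mode = statistics.mode(salaries)      # most common value
median = statistics.median(salaries)  # middle value of the sorted data
mean = statistics.mean(salaries)      # sum of the values divided by their count

print("mode:", mode, "median:", median, "mean:", mean)
```

In this sample the single large value pulls the mean above the median, while the mode sits at the value that occurs twice.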
Quick quiz: What is the mode in our data?
Mode in negatively skewed distribution
Mode in uniform distribution
Multimodal distribution
Mode in categorical data
More of mode
True or False?
1. The mode can be used to describe any type of data we have, whether it’s numerical or categorical.
2. All scores in the dataset affect the mode.
3. If we take a lot of samples from the same population, the mode will be the same in each sample.
4. There is an equation for the mode.
The mode changes as you change the bin size. Because 3. is not true, we can’t use the mode to learn something about our population. The mode depends on how you present the data.
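The dependence of the mode on the bin size can be illustrated with a small sketch. The helper `modal_bin` and the height values are hypothetical, introduced only for this illustration.

```python
from collections import Counter


def modal_bin(values, bin_size):
    """Return (left_edge, right_edge) of the most populated bin.

    Bins are [k * bin_size, (k + 1) * bin_size). Hypothetical helper.
    """
    counts = Counter(int(v // bin_size) for v in values)
    index, _ = max(counts.items(), key=lambda item: item[1])
    return index * bin_size, (index + 1) * bin_size


# Hypothetical heights in cm; NOT the data from the slides.
heights = [158, 161, 162, 163, 164, 171, 172, 174, 176, 177, 179]

narrow = modal_bin(heights, 5)   # histogram with 5 cm bins
wide = modal_bin(heights, 10)    # histogram with 10 cm bins
print("modal bin, 5 cm bins:", narrow)
print("modal bin, 10 cm bins:", wide)
```

With 5 cm bins the tallest bar covers 160 to 165 cm; with 10 cm bins it covers 170 to 180 cm, although the data are the same.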
Life expectancy data – Statistics: Making Sense of Data
Minimum: Sierra Leone, minimum = 47.8
Maximum: Japan, maximum = 83.4
Life expectancy data, all countries
Life expectancy data: Egypt (half larger, half smaller)
Life expectancy data: Minimum = 47.8, Maximum = 83.4, Median =
Q1: Sao Tomé & Príncipe, 50 (¼ way); 1st quartile =
Q1: ¾ larger, ¼ smaller; 1st quartile =
Q3: Netherlands Antilles, 148 (¾ way); 3rd quartile = 76.7
Q3: 3rd quartile = 76.7; ¾ smaller, ¼ larger
Life expectancy data: Minimum = 47.8; Maximum = 83.4; Median = ; 1st quartile = ; 3rd quartile = 76.7
Box Plot
Box plot (figure labels: minimum, 1st quartile, median, 3rd quartile, maximum)
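Modified box plot
IQR – interquartile range, the distance between the 1st and 3rd quartiles (Q3 − Q1)
Whiskers reach at most 1.5 × IQR beyond the quartiles; points farther out are drawn individually as outliers.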
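A minimal sketch of the 1.5 × IQR rule follows. The scores are hypothetical, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`, which is one of several quartile conventions and an assumption here.

```python
import statistics

# Hypothetical scores with one unusually large value; NOT from the slides.
scores = [68, 69, 74, 76, 79, 87, 88, 90, 140]

# Quartiles by the "inclusive" method (an assumed convention).
q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond them is drawn as an individual outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme values that are still inside the fences.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("outliers:", outliers, "whiskers:", whisker_low, whisker_high)
```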
Quartiles, median – how to do it?
79, 68, 88, 69, 90, 74, 87, 93, 76
Find min, max, median, Q1, Q3 in these data. Then, draw the box plot.
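One way to work this exercise is sketched below. It assumes the "median of each half" convention for quartiles (with an odd number of values, the median itself is left out of both halves); the function name `five_number_summary` is hypothetical. Interpolation-based conventions, common in statistical software, can give slightly different quartiles for the same data (74 and 88 here).

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Hypothetical helper; expects at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )


scores = [79, 68, 88, 69, 90, 74, 87, 93, 76]
summary = five_number_summary(scores)
print(summary)
```

Sorted, the data are 68, 69, 74, 76, 79, 87, 88, 90, 93, so the median is 79, the lower half is 68 to 76 and the upper half is 87 to 93.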
Another example Min. 1st Qu. Median 3rd Qu. Max , 93, 68, 84, 90, 74
Percentiles (figure; axis label: age [years])
3rd M – Mean
Robust statistic
Salaries of 25 players of the New York Red Bulls (an American soccer team)
median =
mean =
The mean is not a robust statistic. The median is a robust statistic.
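A small sketch of what robustness means in practice: adding one extreme value moves the mean a lot and the median only slightly. The salaries below are hypothetical and are not the team's data.

```python
import statistics

# Hypothetical salaries in thousands of dollars; NOT the team's data.
salaries = [35, 40, 45, 50, 55]
with_star = salaries + [5000]  # add one very highly paid player

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_star)
median_after = statistics.median(with_star)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```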
Trimmed mean
10% trimmed mean … eliminate the upper and lower 10% of the data
The trimmed mean is more robust.
median =
mean =
10% trimmed mean =
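The trimming step can be sketched as follows. The function `trimmed_mean` and the salary values are hypothetical; the number of values dropped at each end is rounded down, which is an assumption, since the slides do not say how to handle fractional counts.

```python
import statistics


def trimmed_mean(values, proportion=0.10):
    """Mean after dropping the lowest and highest `proportion` of the values.

    Hypothetical helper; the count dropped at each end is rounded down.
    """
    data = sorted(values)
    k = int(len(data) * proportion)  # number of values dropped at each end
    trimmed = data[k:len(data) - k]
    return statistics.mean(trimmed)


# Hypothetical salaries in thousands of dollars; NOT the team's data.
salaries = [30, 35, 40, 45, 50, 55, 60, 65, 70, 5000]

plain_mean = statistics.mean(salaries)
middle = statistics.median(salaries)
trimmed = trimmed_mean(salaries, 0.10)
print("mean:", plain_mean, "median:", middle, "10% trimmed mean:", trimmed)
```

Here trimming removes the extreme salary together with the smallest one, so the trimmed mean lands close to the median instead of the plain mean.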