Lecture 1: Introduction to Data Ethan Fosse, Ph.D.

Stat E-100 Lecture 1: Introduction to Data Ethan Fosse, Ph.D.

Teaching Staff
Instructor: Ethan Fosse, Ph.D.
Head Teaching Assistant: Mark Ouchida, A.L.M., M.Ed.
Teaching Assistants: Regan Bernhard, Rohit Goyal, Kela Roberts
See course website for contact information

Course Textbook
OpenIntro Statistics, 3rd Edition
Available free online at: p
Also may be purchased on Amazon for about $10
My philosophy: Textbooks should be high quality but inexpensive

Assigned TA and Sections
You will be assigned a teaching assistant (TA) next week
Your TA is your first point of contact for questions about the problem sets, key ideas, and statistical concepts
We will offer several sections that will cover the course material in more depth
You may attend any (or all) sections

Course Grade for Undergraduates
Regularly assigned problem sets that will count for 30% of your grade
A midterm exam that will count for 30% of your grade
A final exam that will count for 40% of your grade
See syllabus for the course grade for graduate students (an additional final project is required)

Problem Sets
Problem sets will be posted online
They will be based on problems in the OpenIntro Statistics textbook
There are ten problem sets throughout the semester
First problem set is due on September 17th and will be posted online next week

Midterm and Final Exams
The midterm exam is given on October 22nd and the final exam on December 17th
Both exams are given entirely online through the course website
Each exam will last 2 hours and can be taken at any point during a 24-hour time period

Midterm and Final Exams (cont.)
Each exam consists of three sections: first, a set of multiple choice questions; second, a set of numerical answer questions; finally, several long answer questions
Both exams are entirely “open book” in that you can use any textbook, notes, problem sets, or lecture slides to help you answer the questions

Statistical Programming Languages
Excel: popular among Wall Street types and business school grads (has limited programming functionality)
MATLAB: often used by neuroscientists and engineers
SPSS: widely used by psychologists but has limited functionality
SAS: popular among biologists and some biostatisticians
Python: popular among computer scientists; somewhat limited statistics functionality but improving
Stata: popular among economists and financial analysts; also used by biostatisticians
R: used by statisticians, biostatisticians, and cool people; often used with RStudio

R is Extremely Popular!

Downloading and Installing RStudio
Where to download RStudio: nload/
How are R and RStudio related?
R is the underlying programming language we use, while RStudio provides a graphical front-end that has a number of useful features, including syntax highlighting and windowing
For this course we use RStudio with R, but there are other graphical front-ends that some people use with R

Supplementary Texts for R
For learning R, we recommend the Five College Guide to R and Start Teaching with R
Both are posted on the course website
R will be taught in the course sections, based on the material covered in these texts, as well as during lecture

Key R Commands for Each Lecture
For each lecture we will post a handout of the key R commands used
These commands will be reviewed during the sections
Most of the commands will be based on the R package “mosaic”

Using the Mosaic Package
To install the mosaic package: install.packages("mosaic")
To load the mosaic package when you start your R session: library("mosaic")

Tentative Course Map Descriptive Statistics Introduction to Data
Categorical Data Numerical Data Descriptive Statistics Probability Tables and Relative Risk Correlation Analysis Simple Linear Regression Basics of Sampling Sampling Distribution Tests for Means Tests for Proportions Tests for Contingency Tables and ANOVA Inferences for Correlation and Regression Course Review

Myths About Statistics

Four Misconceptions About Statistics
There are several, but I will begin with four:
1. Statistics is mathematics
2. Statistics is dry and dull (and definitely not “sexy”)
3. Statistics is irrelevant to the real world
4. Statistics is not really that useful except for research scientists
These are completely incorrect. Let’s review these!

Myth 1: Statistics is Mathematics

Reminder: Statistics is Not Mathematics!
Mathematics is better understood as a tool statisticians use, but it’s not the only skill or tool
Statisticians often use mathematics because it can clarify what’s going on, but the biggest issues and problems are often conceptual
We use mathematics in statistics, but we also use computer programming, graphic design, substantive expertise, and creativity, just to name a few other things!

Why No Precocious Statisticians?
The statistician M.G. Kendall on The Future of Statistics (1968): “…although musicians are often precocious, poets never are. One can draw the same kind of distinction between mathematicians, who are usually precocious, and statisticians who, as statisticians, are not. There is a certain apprenticeship in handling real-life situations to be served before an individual is mature enough to tackle important statistical problems.”

Myth 2: Statistics is Dry or Dull

Data Everywhere
(1) Exponential growth and availability of all sorts of data (“Internet of Things”)
(2) Huge gap between exponential growth of data and number of people who can analyze these data
(3) Need for analysts who can “tell stories” or “give meaning”
This is why statistics is not equivalent to math
Bottom line: we have more data than people to analyze them!

The World Needs You!
The Amount of Data is Growing Exponentially
At no other time have we more urgently needed careful data analysts and thinkers such as yourself to find those hidden facts and patterns that will change our world!

Statistics as “Data Science”

What is “Data Science”?
“Data science” is a relatively recent term, coined in the 1970s
1997: the statistician C.F. Jeff Wu elaborated on the phrase in a talk titled “Statistics = Data Science?”
He argued that statisticians should rebrand themselves as “data scientists,” since they spend most of their time manipulating and experimenting with data
Until recently, data scientists have tended to focus more on cleaning datasets and programming
My own view: “data science” = “applied statistics”

Data Science Venn Diagram
Statistical problems and concepts!
Stata and R are programming languages
Stat 102 = “Data Science”!
Upcoming regression project will help us delve deeper into the middle of this Venn Diagram
From medicine, biology, or your own experience! “From what I know about the world, smoking is related to cancer.”

To Recap: Statistics as “Data Science”
Statistics has been around for a relatively long time
“Data Science” is a new buzzword reflecting the huge increase in data over the past several decades
No clear-cut definition
My own view: “data science” can be understood as “applied statistics”
Not just theory, but practical application is important!

Myth 3: Statistics is Irrelevant

Statistics is Everywhere
Statistics is used in every field today
With statistics we can gain new knowledge about everything from healthy eating to the fundamental particles of the Universe
Statistics is also foundational to constructing and building things, from skyscrapers to airplanes to websites
It is also in extremely high demand

Example 1: Best Jobs of 2014

Example 2: High Demand for Data Analysts

I picked these examples from the news from just one week. Articles like these appear all the time. Statistics is in very high demand!

Myth 4: Statistics is Only for Scientists

Interpreting Test Results

Importance of Statistical Literacy

Many Other Fields Use Statistics!
The applications are endless: sociology, anthropology (including cultural anthropology), political science, psychology, and so forth! We will cover these applications through examples over the semester.

Four Misconceptions About Statistics Revisited
Returning to the four misconceptions:
1. Statistics is creativity, programming, communication, personal experience, graphic design, and mathematics
2. Statistics is “sexy” and “data science”
3. Statistics is not only relevant, but essential
4. Statistics is incredibly useful not only for scientists but also for medical doctors, anthropologists, historians (and others)

Overview Lecture 1: Introduction to Data

Course Map Descriptive Statistics Introduction to Data

Overview of Week 1: Introduction to Data
Part 1: Types of Data
Part 2: Samples and Populations
Part 3: Basics of Study Design

Part 1 Types of Data

What is Statistics?
Statistics is the study of how best to collect, analyze, and draw conclusions from data
Modern statistics as a field developed only in the early 20th century, but the foundations lie much earlier
Statistics can be viewed in the context of the general process of scientific investigation

Process of Investigation
1. Identify a question or problem
2. Collect relevant data on the topic
3. Analyze the data
4. Make a conclusion
Example: Will aspirin help stop my headache?

Everyday Examples
Does working out in the morning make me more energized?
Will eating a watermelon relieve my thirst?
If I study one hour a day, can I ace the final exam?
Can I lose weight if I just avoid cream with my coffee in the morning?
We do this all the time!

Goal of Statistics
1. Identify a question or problem
2. Collect relevant data on the topic
3. Analyze the data
4. Make a conclusion
Statistics is focused on making steps 2 to 4 more rigorous, reproducible, and reliable

Substantive Focus for this Course
1. Identify a question or problem
2. Collect relevant data on the topic
3. Analyze the data
4. Make a conclusion
We will (in general) focus on questions or problems related to the social sciences and humanities

What are Data?
Data are a set of observations
These observations can be collected in various ways, such as through field notes, surveys, and experiments
Often (but not always) in the social sciences and humanities, the focus is on collecting data on people

Data or Datum?
The singular form is “datum,” so we say “that datum is very low”
“Data” is the plural form, but it may be used with a singular or plural verb
Both are grammatically correct:
“The data were analyzed after they were collected.”
“The data was analyzed after it was collected.”

Real-World Examples
Do kids think sports are important for being popular?
Were men more likely to die on the Titanic than women?
Could we have predicted who won the presidential election in the United States?
We will explore these kinds of questions in this course!

Obama vs. McCain in 2008 Source: Wikipedia

2008 Election Results Source: Wikipedia

Back in October 2008
Suppose you were a political analyst in October 2008 who wanted to predict the election
You’ve had an assistant collect data on the characteristics of U.S. states
How did your assistant organize the data so you could analyze it?

Organizing Data
Data are usually organized into a dataset (or dataframe)
The rows of the dataset are called cases (or observational units)
The columns of the dataset are called variables
The variables are characteristics or traits of the cases

Dataset Organization
Columns: Variables (characteristics of U.S. states)
Rows: Cases (U.S. states)
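The row-and-column layout described above can be sketched in R. This is only an illustration, not the lecture's dataset: it assumes R's built-in `state.name`, `state.region`, and `state.x77` objects, whose figures date from the 1970s and whose region labels use "North Central" rather than "Midwest". The name `states` is made up for this example.

```r
# Build a small dataset: one row per case (U.S. state), one column per variable.
# The values come from R's built-in 1970s state data, NOT the lecture's dataset.
states <- data.frame(
  State      = state.name,                        # name of the U.S. state
  Region     = state.region,                      # geographic region
  Population = state.x77[, "Population"] / 1000,  # thousands -> millions
  HighSchool = state.x77[, "HS Grad"],            # percent high-school graduates
  row.names  = NULL                               # use plain row numbers 1, 2, 3, ...
)

head(states, 3)   # first three cases
dim(states)       # number of rows (cases) and columns (variables)
names(states)     # the variable names
```

Note the unit conversion for `Population`: the built-in data are stored in thousands, so dividing by 1000 gives millions, matching the units used in the lecture.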

Keeping Track of Cases
Each row is a different case (U.S. state)
We sometimes index the rows using the letter 𝑖
What is the case in row 𝑖=5?
What is the case in row 𝑖=49?

How Many Cases?
The total number of cases in a dataset is labeled by 𝑛
Our dataset has 𝑛=50 cases
We typically index the rows of a dataset from 𝑖=1 to 𝑛
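A short sketch of row indexing in R, assuming the cases are the 50 states in alphabetical order, as in R's built-in `state.name` vector (the lecture's dataset may be ordered differently):

```r
# The built-in vector state.name lists the 50 U.S. states alphabetically.
n <- length(state.name)   # total number of cases
n

state.name[5]             # the case with index i = 5
state.name[49]            # the case with index i = 49

# Loop over the first few cases, from i = 1 upward
for (i in 1:3) {
  cat("i =", i, ":", state.name[i], "\n")
}
```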

Many Different Kinds of Cases
Cases (that is, the rows of the dataset) need not be U.S. states
Other examples:
Respondents who answer a survey
Subjects or respondents involved in an experiment
Records of transactions or objects
Different historical events

Each column is a different variable or characteristic of the cases:
State: Name of the U.S. state
Region: Geographic region of U.S. state
Population: Population (in millions)
HighSchool: Percent who graduated High School

Be Careful of the Units of Measurement
Example: What is the population of Alabama?
About 4.53 million people, not 4.53 people!

Many Types of Variables
In general, variables can be considered either numerical or categorical
All Variables
Numerical: Continuous or Discrete
Categorical: Nominal or Ordinal

Numerical Variables
Numerical variables have values that are numbers
For numerical variables, it usually makes sense to add, subtract, or take averages with those values
Examples: Age of respondent, miles per gallon in a car, ranking of colleges

Continuous vs. Discrete
Any number can be a value for continuous variables
Examples: Height in inches, percent of children living in poverty, number of miles moved away from childhood home
Discrete variables can only have values that are non-negative counting numbers (0, 1, 2, 3, and so on)
Examples: Number of votes for a politician, rating of a restaurant on a 5-point scale, number of arrivals at an airport

Continuous or Discrete?
Annual household income of a respondent in thousands of U.S. dollars: Continuous ($10, $101.56, $17.3)
Rating from 1 to 4 stars of a movie: Discrete (1, 2, 3, 4 only)
Number of murders last year in a neighborhood: Discrete (0, 1, 2, 3, 4, 5, and so on)
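One way to see the continuous versus discrete distinction in R is through storage types. This is a rough sketch: the income values are the ones on the slide, while the star ratings and murder counts are made-up illustrative values, and R's double/integer split is only an approximation of the statistical distinction.

```r
# Continuous: any number is possible, so R stores the values as doubles
income <- c(10, 101.56, 17.3)   # thousands of U.S. dollars (values from the slide)

# Discrete: only whole-number values; the L suffix marks integers in R
stars   <- c(1L, 4L, 3L, 2L)    # hypothetical movie ratings, 1 to 4 stars
murders <- c(0L, 2L, 5L)        # hypothetical yearly counts for three neighborhoods

is.double(income)    # TRUE
is.integer(stars)    # TRUE
is.integer(murders)  # TRUE

# Averages make sense for all numerical variables, continuous or discrete
mean(income)
mean(murders)
```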

Categorical Variables
Categorical variables have values that are different categories (or qualities)
Sometimes a categorical variable is called a factor and the different categories are called levels
Examples: Gender of respondent, approval or disapproval of a social policy, regular smoker or not

Nominal vs. Ordinal
Nominal variables have no inherent order to their categories
Examples: Religious identification, political party affiliation, names of countries visited
Ordinal variables usually have an expected ordering to their categories
Examples: Highest educational degree attained, level of approval of a country’s economic policy (from “Strongly Disapprove” to “Strongly Approve”)
Not always clear-cut!
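In R, categorical variables are usually stored as factors, and an ordinal variable can be stored as an ordered factor. The sketch below uses made-up responses; the two middle approval levels ("Disapprove" and "Approve") are an assumption, since the slide names only the endpoints of the scale.

```r
# Nominal: categories with no inherent order
party <- factor(c("Democrat", "Republican", "Independent", "Democrat"))

# Ordinal: categories with an expected ordering, given explicitly via levels
approval <- factor(
  c("Approve", "Strongly Disapprove", "Strongly Approve"),
  levels  = c("Strongly Disapprove", "Disapprove", "Approve", "Strongly Approve"),
  ordered = TRUE
)

levels(party)               # the categories (levels) of the nominal factor
is.ordered(party)           # FALSE: no ordering among parties
is.ordered(approval)        # TRUE
approval[1] < approval[3]   # TRUE: "Approve" ranks below "Strongly Approve"
table(party)                # counts per category
```

Comparisons such as `<` are only meaningful for the ordered factor; R returns `NA` with a warning if they are attempted on an unordered one.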

Numerical vs. Categorical
For this course we will focus mainly on distinguishing between numerical and categorical variables
All Variables
Numerical: Continuous or Discrete
Categorical: Nominal or Ordinal

Our Dataset Revisited
Columns: Numerical and Categorical Variables (characteristics of U.S. states)
Rows: Cases (U.S. states)

What Type of Variable?
State: Name of the U.S. state
Categorical (nominal)
Categories (or levels) are “Alabama,” “Alaska,” “Arizona,” … “Wyoming”

What Type of Variable?
Region: Geographic region of U.S. state
Categorical (nominal)
Categories (or levels) are “Northeast,” “Midwest,” “West,” “South”

What Type of Variable?
HighSchool: Percent who graduated High School
Numerical (continuous)
Values are numbers such as 82.4, 90.2, 91.9, and so on

What Type of Variable?
Population: Population (in millions)
Numerical (continuous)
Values are numbers such as 4.53 (Alabama), and so on

Summary
Columns: Categorical (Nominal) and Numerical (Continuous) variables
Rows: Cases (U.S. states)

Things to Know
Data are a set of observations about the world around us
To prepare for data analysis, we organize our data into a dataset
The rows of a dataset are the cases and the columns are variables
Variables may be numerical or categorical

Part 2 Samples and Populations

Research Questions
What is the average number of words spoken daily among 10-year-old schoolchildren in New York City?
Do citizens of the United States approve or disapprove of legalizing marijuana use?
Does meditating for 15 minutes each day improve happiness among college students in Beijing?

Defining a Population
Each research question refers to a target population
A population is the entire collection of cases about which information is desired
For example, if we want to know the percentage of Buddhists in Japan, the population is all people in Japan

Name the Population
Research Question: What is the average number of words spoken daily among 10-year-old schoolchildren in New York City?
Target population: All 10-year-old children who attend school in New York City

Name the Population
Research Question: Do citizens of the United States approve or disapprove of legalizing marijuana use?
Target population: All citizens of the United States

Name the Population
Research Question: Does meditating for 15 minutes each day improve happiness among college students in Beijing?
Target population: All college students in Beijing

Conducting a Census
Collecting data on the entire population is called conducting a census
Censuses have been conducted throughout history for purposes of empire-building:
Taxing citizens
Aiding reproduction of the population
Recruiting able-bodied men for war

Example: Census of Turin, Italy
The French were attacking Turin, Italy in 1705
In part to recruit men for war, city officials surveyed all inhabitants
Variables included name, age, gender, birthplace, and weapons kept in the household

Turin, Italy 1705 Census
Columns: Numerical and Categorical Variables (characteristics of residents)
Rows: Cases (Residents of Turin, Italy)

Problems with a Census
Often very difficult and expensive to conduct
Hard-to-reach groups, such as undocumented migrants and the homeless, can further increase the cost
Ethical concerns about requiring everyone to participate in the census
May constitute a violation of one’s privacy

Solution: Collect a Sample
Rather than conducting a census, we can collect a sample, which is a subset of the population.
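Drawing a sample as a subset of a population can be sketched with base R's `sample()` function. The population here is hypothetical: 10,000 made-up case IDs, with a sample of 100 drawn at random without replacement.

```r
set.seed(1)                       # make the random draw reproducible

population <- 1:10000             # hypothetical population: 10,000 case IDs
N <- length(population)           # population size

smpl <- sample(population, size = 100)   # random sample without replacement
n <- length(smpl)                        # sample size

n / N                             # fraction of the population that was sampled
all(smpl %in% population)         # TRUE: every sampled case belongs to the population
anyDuplicated(smpl) == 0          # TRUE: no case was drawn twice
```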

We Take Samples All the Time
Suppose you’re at a giant buffet with hundreds of different entrees
Most people will try out some subset of the entrees before making a conclusion about what to eat
People also sample every day when buying things, listening to music, and making decisions

From Samples to Populations
Because samples are taken so frequently, we often distinguish between descriptive and inferential statistics
Descriptive statistics is organizing and presenting data from a sample or population
Inferential statistics is making conclusions about a population based only on data from a sample

Descriptive Statistics
Presenting data
For example, tables and graphs
Summarizing data
For example, average height of respondents in a sample
For example, percentage of men in a country based on Census data
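A minimal sketch of presenting and summarizing data in base R, again assuming R's built-in 1970s state data rather than the lecture's dataset. The course handouts may use mosaic commands instead; only base R functions are used here.

```r
# Presenting data: a table of how many states fall in each region
region_counts <- table(state.region)
region_counts

# The same table expressed as percentages of all 50 states
region_pct <- round(100 * prop.table(region_counts), 1)
region_pct

# Summarizing data: average percent of high-school graduates across states
avg_hs <- mean(state.x77[, "HS Grad"])
avg_hs
```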

Inferential Statistics
Estimating
For example, using the average height from a sample to estimate the average height in a population
Hypothesis Testing
For example, using a sample of data to test the claim that the average height in a population is 72 inches
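Both ideas can be sketched with base R's `t.test()`. The heights are simulated, not real data: the sample size of 50, the mean of 69 inches, and the standard deviation of 3 inches are arbitrary assumptions made for illustration.

```r
set.seed(42)

# Simulated sample of 50 heights in inches (hypothetical, not real data)
heights <- rnorm(50, mean = 69, sd = 3)

# Estimating: the sample average is our estimate of the population average
x_bar <- mean(heights)
x_bar

# Hypothesis testing: test the claim that the population average is 72 inches
result <- t.test(heights, mu = 72)
result$conf.int   # interval estimate for the population average
result$p.value    # a small p-value is evidence against the claim of 72 inches
```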

Sample Statistics and Population Parameters
A statistic is a numerical summary based on a sample
For example, the average weight in a sample of hospital patients in London
A parameter is a numerical summary of a population
For example, the average weight of all hospital patients in London

Sample Size versus Population Size
It’s often useful to keep track of whether your data are based on a sample from a population or a census of a population
If your dataset is a sample of a population, the number of cases is labeled 𝑛
If your dataset consists of the entire population, then the number of cases is labeled 𝑁

A Note on Notation
Sample statistics are (usually but not always) expressed in Latin letters
For example, the average from a sample is referred to as x̄ (pronounced “x-bar”)
Population parameters are (usually but not always) expressed in Greek letters
For example, the average from a population is referred to as 𝜇 (pronounced “mu”)
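The notation can be made concrete with a simulated population. Everything here is hypothetical: a population of N = 5000 patient weights, generated at random, from which a sample of n = 100 is drawn.

```r
set.seed(2024)

# Hypothetical population: weights (kg) of N = 5000 patients, simulated
N <- 5000
weights <- rnorm(N, mean = 75, sd = 12)

mu <- mean(weights)            # parameter: average over the whole population

# Draw a sample of n = 100 patients and compute the matching statistic
n <- 100
smpl <- sample(weights, size = n)
x_bar <- mean(smpl)            # statistic: average over the sample only

c(mu = mu, x_bar = x_bar)      # x_bar estimates mu, but the two rarely coincide
```

In practice we usually observe only x̄; the point of inferential statistics is to say something about 𝜇 without measuring all 𝑁 cases.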

Research Questions Revisited
Research Question: What is the average number of words spoken daily among 10-year-old schoolchildren in New York City?
“The news had a segment on how young teenagers spend most of their time text messaging rather than talking, so the average number of words spoken each day can’t be more than 500.”

Questions Revisited
Research Question: Do citizens of the United States approve or disapprove of legalizing marijuana use?
“I met an elderly couple from Oklahoma who really dislike marijuana, so clearly it’s widely disapproved of by people in the United States.”

Questions Revisited
Research Question: Does meditating for 15 minutes each day improve happiness among college students in Beijing?
“I started meditating for 30 minutes each day and I’m happier, so college students who meditate must be happier, too.”

Anecdotal Evidence
These are all examples of anecdotal evidence, based on a limited sample size that might not be representative of the population
Anecdotal evidence is typically composed of unusual cases that we recall based on their striking characteristics
Instead of relying on anecdotes, we should sample appropriately from the population or conduct a census

Things to Know
A population is the entire collection of cases about which information is desired
A sample is a subset of a population
We often distinguish between descriptive and inferential statistics
Anecdotal evidence is highly unreliable

Part 3 Basics of Study Design

“To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.” – R.A. Fisher

Process of Investigation
1. Identify a question or problem
2. Collect relevant data on the topic
3. Analyze the data
4. Make a conclusion
We collect data by designing a statistical study!

Types of Statistical Studies
Observational: Cross-sectional or Longitudinal (or Panel)
Experimental
Observational studies entail observation and measurement, but no intervention
Experimental studies entail the application of some intervention

Observational Studies
We typically cannot prove cause and effect with observational studies
A survey is a typical observational study
In a cross-sectional study, data are collected at one point in time on a set of individuals
In a longitudinal (or panel) study, data are collected over time on the same individuals

Example: Obesity Epidemic
Tracking the same people over time until 2003, social scientists used survey data to examine the spread of obesity through social networks
The authors conclude: “Our study suggests that obesity may spread in social networks in a quantifiable and discernable pattern.”
This is a longitudinal observational study

Be Careful with Interpreting Results
In an observational study, there can always be confounding (or lurking) variables affecting the results
This means that observational studies can almost never show causation
It is easier to adjust for confounding variables in an experiment

Confounding Variables
A confounding variable is a variable not included in the study design that has an effect on the variables studied
For example, if you just study sunscreen and cancer, you might conclude that sunscreen is causing cancer (when in fact you’re omitting the level of sun exposure)
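The sunscreen example can be illustrated with a small simulation. This is a sketch under invented assumptions: sun exposure raises both sunscreen use and cancer risk, and sunscreen itself has no effect on cancer. All coefficients and the sample size are made up for illustration.

```r
set.seed(7)
n <- 5000

# Confounder: sun exposure (standardized units, simulated)
sun <- rnorm(n)

# More sun exposure -> more likely to use sunscreen
sunscreen <- rbinom(n, 1, plogis(1.5 * sun))

# More sun exposure -> more likely to get cancer; sunscreen has NO effect here
cancer <- rbinom(n, 1, plogis(-2 + 1.0 * sun))

# Naive comparison: cancer rate among non-users ("0") and users ("1")
rates <- tapply(cancer, sunscreen, mean)
rates

# Logistic regressions without and with the confounder
naive    <- glm(cancer ~ sunscreen, family = binomial)
adjusted <- glm(cancer ~ sunscreen + sun, family = binomial)

naive_coef    <- unname(coef(naive)["sunscreen"])     # clearly positive
adjusted_coef <- unname(coef(adjusted)["sunscreen"])  # close to zero
c(naive = naive_coef, adjusted = adjusted_coef)
```

Ignoring sun exposure makes sunscreen look harmful; once the confounder is included, the apparent effect largely disappears. In a real observational study the confounder may not have been measured at all, which is why the adjustment is not always possible.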

Experimental Studies
Experimental studies entail the application of some intervention (or treatment)
Often used in psychology and medicine
Individuals or subjects are randomly assigned to treatment and control groups
Researchers try to remove the known effects of confounders by controlling the environmental conditions
Accordingly called randomized controlled trials
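Random assignment to treatment and control groups can be sketched in base R. The 20 subject labels are hypothetical, and the equal 10/10 split is one common design choice rather than a requirement.

```r
set.seed(123)

# Hypothetical subjects enrolled in an experiment
subjects <- paste0("subject_", 1:20)

# Shuffle 10 "treatment" and 10 "control" labels, so chance alone decides groups
group <- sample(rep(c("treatment", "control"), each = 10))

assignment <- data.frame(subject = subjects, group = group)
head(assignment)
table(assignment$group)   # 10 subjects in each group
```

Because chance alone decides who receives the treatment, confounders (known or unknown) tend to be balanced across the two groups.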

Example: Neighborhoods and Obesity
Women and children living in poor neighborhoods were randomly assigned to have an opportunity to live in wealthier neighborhoods
Conclusion: “The opportunity to move from a neighborhood with a high level of poverty to one with a lower level of poverty was associated with modest but potentially important reductions in the prevalence of extreme obesity and diabetes.”

Disadvantages of Experiments
They can be unethical if conducted on vulnerable populations or if the treatment is known to be harmful
It can be difficult to monitor subjects to ensure that they are doing what they are told
Results of experiments that use animals do not necessarily generalize to humans
They can take many years, even decades, to complete

Things to Know
Observational studies entail observation and measurement, but no intervention
Experimental studies entail the application of some intervention
Observational studies can be cross-sectional or longitudinal
It can be difficult to say anything about causation in observational studies because of confounding variables

Course Map Descriptive Statistics Introduction to Data

End of Lecture 1: Introduction to Data
Stat E-100, Harvard University Ethan Fosse, Ph.D.
