Choose n out of these k objects For example:  Choose your three favorites out of these ten photographs  Of these fifty apps, which ten would you download.

Presentation transcript:

Choose n out of these k objects For example:  Choose your three favorites out of these ten photographs  Of these fifty apps, which ten would you download to your phone?  Which two of these seven movies would you want to watch?

Could we prove that there is dependence within each person’s choices? For example, do people have a certain “taste” in sushi rolls? Objectives:  We wanted to prove that each person does not choose randomly: some items are chosen together more often than they would be otherwise.  In particular, we wanted to find which items are similar to one another. If a person chooses a given object, which other objects are they also more likely to choose?

 SUSHI Preference Data Set: a survey of 5000 people who were asked to rank ten different types of sushi rolls from best to worst  The ten rolls: shrimp (0), sea eel (1), tuna (2), squid (3), sea urchin (4), salmon (5), egg (6), fatty tuna (7), tuna roll (8), cucumber (9)  We looked only at each respondent’s first three choices and ignored the order in which they were listed. (This way, the data fit our “choose n out of k” format.)

The following is a matrix of how often each pair of rolls appeared together in someone’s top three: [10 × 10 matrix of pair counts, with the most popular pairs and the least popular pair highlighted]
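A minimal sketch of how such a pair-count matrix could be tallied in R. The function name `pair_counts` and the input layout (a matrix `top3` with one row per respondent and three roll labels from 0 to 9) are assumptions for illustration, and the toy rows below are not survey data.

```r
# Count how often each pair of rolls appears together in a top-three list.
# `top3` is a hypothetical matrix: one row per respondent, three roll labels (0-9).
pair_counts <- function(top3, k = 10) {
  lab <- as.character(0:(k - 1))
  counts <- matrix(0, nrow = k, ncol = k, dimnames = list(lab, lab))
  for (r in seq_len(nrow(top3))) {
    picks <- top3[r, ] + 1              # shift labels 0..9 to R's 1-based indices
    pairs <- combn(picks, 2)            # the three unordered pairs in this row
    for (j in seq_len(ncol(pairs))) {
      a <- pairs[1, j]
      b <- pairs[2, j]
      counts[a, b] <- counts[a, b] + 1
      counts[b, a] <- counts[b, a] + 1  # keep the matrix symmetric
    }
  }
  counts
}

# Toy input for illustration only (not the survey data)
toy <- rbind(c(2, 7, 4), c(2, 7, 8), c(6, 9, 0))
m <- pair_counts(toy)
m["2", "7"]   # rolls 2 and 7 appear together in two of the three toy rows
```

Each respondent contributes three unordered pairs, so the entries above the diagonal sum to three times the number of respondents.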

Doesn’t this answer our questions?  The most popular pairings were (2,7) and (4,7), so those who like roll #7 were more likely to choose roll #2 or #4.  The least popular pairing was (5,9): only 21 respondents listed them as two of their top three! They must be very dissimilar.

That ignores the fact that some rolls were just more popular overall. It makes sense that (2,7) and (4,7) were chosen together so often since 2, 4, and 7 were popular overall. The reverse is true for 5 and 9. There’s no clear proof that these pairings tell us anything about people’s taste – they may just reflect each roll’s popularity.

We needed to generate a matrix of how often each pair of rolls would be expected to appear together. We could then compare the actual results to the expected results. To generate this matrix, we decided to run a simulation.

 Each respondent needs to randomly choose three rolls  The rolls must be chosen without replacement – each respondent needs to choose three different rolls  Each roll’s overall popularity must be held fixed

 Simply choose three rolls out of ten without replacement, using sample(0:9,3,replace=FALSE,prob=c(P1,P2,…)) in R  Imagine that a number line between 0 and 3 is split up into 10 parts, where the size of each part is proportional to the frequency of the corresponding roll.  A random number between 0 and 3 is then generated, corresponding to one of the rolls. For example, if 1.4 were generated, then roll #4 would be chosen.

 A new number line is then drawn, leaving out whichever roll was chosen the first time, while proportionally increasing the size of each remaining part. For example, this would be the new number line if #4 were chosen:  Once again, a number between 0 and 3 would be chosen, corresponding to the second roll chosen.  This same process would be repeated to choose the third roll.

 We have to redraw the number line after the first choice. As a result, the probabilities for the second and third choices are not the same as the overall probabilities.  The overall distribution of choices from the simulation is not equal to the overall distribution of choices from the actual survey. How can we fix this? We somehow need to keep the overall probabilities constant for each choice, while still not allowing repeats. [Chart: actual vs. simulated frequency of each roll]
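The mismatch can be checked with a short simulation. This is only a sketch: the popularity shares in `p` are made-up values, not the survey's frequencies, and the number of simulated respondents is arbitrary.

```r
set.seed(1)
# Hypothetical popularity shares for rolls 0..9 (sum to 1); NOT the survey's values
p <- c(0.05, 0.08, 0.15, 0.05, 0.14, 0.12, 0.06, 0.22, 0.08, 0.05)
target <- 3 * p   # desired chance that each roll lands in someone's top three

n_sim <- 20000    # number of simulated respondents (arbitrary)
draws <- replicate(n_sim, sample(0:9, 3, replace = FALSE, prob = p))

# Share of simulated respondents whose top three includes each roll
simulated <- tabulate(as.vector(draws) + 1, nbins = 10) / n_sim

round(rbind(target, simulated), 3)
```

With shares like these, the most popular roll should come out noticeably below its target and the least popular rolls somewhat above theirs, which is the drift described above.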

Hartley and Rao (1962) describe an approach to solve this problem: 1. Randomize the order of the rolls. This was accomplished by calling sample(0:9) in R. 2. Split the number line between 0 and 3 into 10 parts, where the size of each part is still proportional to the frequency of the corresponding roll, but using the new order. For example, when the new order of the rolls is [3,7,5,9,1,2,4,0,8,6], we use the following number line:

3. A random number between 0 and 1, d, is chosen. 4. The three rolls selected are the ones corresponding to d, d+1, and d+2. In the following example d = 0.95, meaning that rolls 5, 2, and 6 (the rolls corresponding to 0.95, 1.95, and 2.95) are chosen.
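The four steps above can be sketched in R as follows. The function name `systematic_pps` and the shares in `p` are assumptions for illustration. The sketch also assumes that no roll's segment is longer than 1, so that d, d+1, and d+2 can never land on the same roll.

```r
# Randomized systematic sampling with probability proportional to size.
# `incl` holds hypothetical inclusion probabilities (sum = n, each <= 1).
systematic_pps <- function(incl, labels = seq_along(incl) - 1) {
  n <- round(sum(incl))
  ord <- sample(length(incl))             # step 1: randomize the order
  upper <- cumsum(incl[ord])              # step 2: right endpoints on the 0..n line
  d <- runif(1)                           # step 3: random start between 0 and 1
  points <- d + 0:(n - 1)                 # step 4: d, d+1, ..., d+n-1
  idx <- findInterval(points, upper) + 1  # segment that contains each point
  idx <- pmin(idx, length(incl))          # guard against rounding at the right edge
  labels[ord[idx]]
}

set.seed(1)
# Hypothetical popularity shares (sum to 1); NOT the survey's values
p <- c(0.05, 0.08, 0.15, 0.05, 0.14, 0.12, 0.06, 0.22, 0.08, 0.05)
incl <- 3 * p

picks <- replicate(20000, systematic_pps(incl))
simulated <- tabulate(as.vector(picks) + 1, nbins = 10) / 20000
round(rbind(target = incl, simulated), 3)
```

Because d is uniform and the segment for each roll has length equal to its target, each roll is hit with exactly that probability regardless of where it falls in the shuffled order.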

Our simulation shows that each roll is chosen with the same frequency using this technique as in the actual survey. [Chart: actual frequency vs. each simulation technique]

Using this second method, we found our matrix of expected results. The fact that our expectations were so different from the actual data implies that people don’t make their choices independently. [10 × 10 matrix of expected pair counts]

[10 × 10 matrix of residuals] Remember how 2 and 7 initially seemed to be the most similar pair? It still looks like they are similar, but there are many other pairings which are much more similar. For example, 6 and 9 were chosen together only 66 times, yet they have a larger residual!
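A rough sketch of how expected pair counts and residuals might be computed. Several things here are assumptions rather than the study's method: the shares in `p` are made up, the "observed" matrix is a toy built from the expected counts, and the residual is taken to be Pearson-style, (observed − expected) / √expected, since the slides do not give a formula.

```r
set.seed(1)
k <- 10
# Hypothetical popularity shares (sum to 1); NOT the survey's values
p <- c(0.05, 0.08, 0.15, 0.05, 0.14, 0.12, 0.06, 0.22, 0.08, 0.05)
incl <- 3 * p          # chance each roll appears in a top three
n_resp <- 5000         # number of respondents in the survey
n_sim <- 20000         # number of simulated respondents (arbitrary)

# One simulated top three (1-based roll indices) via randomized systematic sampling
draw_three <- function(incl) {
  ord <- sample(length(incl))
  idx <- findInterval(runif(1) + 0:2, cumsum(incl[ord])) + 1
  ord[pmin(idx, length(incl))]
}

# Tally simulated pairs, then rescale to the survey's number of respondents
expected <- matrix(0, k, k)
for (s in seq_len(n_sim)) {
  pr <- t(combn(draw_three(incl), 2))   # 3 x 2 matrix of (row, column) pairs
  expected[pr] <- expected[pr] + 1
  expected[pr[, 2:1]] <- expected[pr[, 2:1]] + 1
}
expected <- expected * n_resp / n_sim

# Assumed residual: Pearson-style, (observed - expected) / sqrt(expected)
pearson_resid <- function(observed, expected) {
  r <- (observed - expected) / sqrt(expected)
  diag(r) <- NA        # a roll is never paired with itself
  r
}

# Toy "observed" matrix: the expected counts with one pair (egg, cucumber) inflated
observed <- expected
observed[7, 10] <- observed[10, 7] <- expected[7, 10] + 50
r <- pearson_resid(observed, expected)
which(r == max(r, na.rm = TRUE), arr.ind = TRUE)
```

Scaling by √expected is what lets a rarely chosen pair stand out: the same surplus of 50 counts for more when few co-occurrences were expected in the first place.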

0 - shrimp 1 - sea eel 2 - tuna 3 - squid 4 - sea urchin 5 - salmon 6 - egg 7 - fatty tuna 8 - tuna roll 9 - cucumber

To further support these results, we re-ran the analysis by looking at each respondent’s top five choices. These were the results of the new multidimensional scaling: The fact that this plot is so similar to our prior one (see previous slide) suggests that our results were not merely an artifact of our arbitrary decision to look at the top three choices, and that any values of n and k (where n<k) should work.

The groupings made by the MDS make sense when we look back at what each type of roll was.

Look at the clusters it formed:  6 and 9: egg and cucumber, the two non-fish choices  2, 7, and 8: all three are different types of tuna rolls Since those clusters make sense on their own, and were confirmed by our statistical analysis, we could also trust the other clusters we formed:  4 and 5: sea urchin and salmon  0, 1, and 3: shrimp, sea eel, and squid
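The multidimensional scaling step could look roughly like this in R, using `cmdscale` from the base `stats` package. The residual matrix below is filled with random toy values, and the rule for turning residuals into distances (largest residual minus each residual) is an assumption, since the slides do not say how the dissimilarities were defined.

```r
set.seed(1)
k <- 10
lab <- as.character(0:(k - 1))

# Hypothetical symmetric residual matrix (random toy values, not the study's results)
r <- matrix(rnorm(k * k), k, k, dimnames = list(lab, lab))
r <- (r + t(r)) / 2
diag(r) <- NA

# Assumed conversion: a larger residual (chosen together more often than expected)
# means a smaller distance between the two rolls
d <- max(r, na.rm = TRUE) - r
diag(d) <- 0

# Classical multidimensional scaling into two dimensions
coords <- cmdscale(as.dist(d), k = 2)
round(coords, 2)
```

Calling `plot(coords, type = "n")` followed by `text(coords, labels = rownames(coords))` would draw the rolls by label, so that pairs with large residuals sit close together.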

 In our study, we looked at associations in choice data using simulations.  The simulation was done by sampling without replacement yet still proportional to size.  We showed that people did not make their choices randomly.  MDS and clustering based on the identified associations revealed the specifics of people’s taste.  This general approach can be readily applied to other choice data.