Chapter 15: Exploratory data analysis: graphical summaries CIS 3033.


15.1 Example: the Old Faithful data Statistics: the collection, analysis, and interpretation of data. The set of observations is called a dataset. Assumption: the randomness in a dataset roughly follows a probability model. From data to model (the reverse of simulation). It is often necessary to condense the data so that its general characteristics are easy to grasp visually.

15.1 Example: the Old Faithful data

The durations (in seconds) of 272 eruptions of the Old Faithful geyser were collected. The variety in the lengths of the eruptions indicates that randomness is involved, but what can be said about the distribution? The mean of the data is … Putting the elements in order shows that they all lie in [96, 306], with 240 as the median. Such numerical summaries are covered in detail in the next chapter.

15.2 Histograms Graphical summary: group similar data and show their distribution visually.

15.2 Histograms A version of the histogram: the total area of the bars is equal to 1, so the histogram can be seen as an approximation of the density function. Steps:
1. Divide the range of the data into bins (intervals), which usually (though not necessarily) have the same width.
2. The height of the histogram on a bin is (the number of elements in the bin) / [(the number of all elements) × (the width of the bin)].
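The height rule above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function name `histogram_heights`, the toy data, and the convention that each bin is open on the left and closed on the right are assumptions made for the example.

```python
def histogram_heights(data, edges):
    """Heights of an area-1 histogram on the bins (edges[i], edges[i+1]]."""
    n = len(data)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Count the elements that fall in this bin.
        count = sum(1 for x in data if left < x <= right)
        # Height = count / (n * bin width), so each bar has area count / n.
        heights.append(count / (n * (right - left)))
    return heights


# Hypothetical toy data, not the Old Faithful measurements.
data = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6]
edges = [1.0, 2.0, 3.0, 5.0]  # bins of unequal width are allowed

heights = histogram_heights(data, edges)

# Sum of (height * width) over all bins; this should be 1.
area = sum(h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:]))
```

Dividing by the bin width is what makes bars of different widths comparable: a wide bin with many elements does not look taller than a narrow bin with the same density of points.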

15.2 Histograms Let $r$ be a reference point smaller than the minimum of the dataset, and $b$ the bin width. The bins are then $B_i = (r + (i-1)b,\; r + ib]$ for $i = 1, 2, \ldots, m$. As rules of thumb, we may let $m = 1 + 3.3\log_{10}(n)$ or $b = 3.49\, s\, n^{-1/3}$, where $s$ is the sample standard deviation.
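As a rough sketch of how the bins could be built from the bin-width rule $b = 3.49\, s\, n^{-1/3}$: the helper names `bin_width` and `bin_edges`, the toy data, and the choice of reference point are assumptions for illustration only, and the number of bins is simply taken as the smallest one that covers the maximum of the data.

```python
import math
import statistics


def bin_width(data):
    """Bin width b = 3.49 * s * n^(-1/3), with s the sample standard deviation."""
    s = statistics.stdev(data)  # divides by n - 1
    return 3.49 * s * len(data) ** (-1 / 3)


def bin_edges(data, r, b):
    """Edges r, r + b, ..., r + m*b of the bins (r + (i-1)b, r + ib]."""
    # Smallest m such that the last bin still contains max(data).
    m = math.ceil((max(data) - r) / b)
    return [r + i * b for i in range(m + 1)]


# Hypothetical toy data, not the Old Faithful measurements.
data = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6]
r = 1.0  # reference point, chosen below min(data)

b = bin_width(data)
edges = bin_edges(data, r, b)
```

Because the bins are closed on the right, choosing $r$ strictly below the minimum guarantees that the smallest element lands in the first bin.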

15.3 Kernel density estimates Idea: “put a pile of sand” around each data element, so as to contribute to its neighborhood continuously.

15.3 Kernel density estimates The plot is constructed by choosing a kernel K and a bandwidth h. The kernel reflects the shape of the "piles of sand", whereas the bandwidth determines how wide the piles of sand will be. A kernel K typically satisfies the following conditions: (K1) K is a probability density function; (K2) K is symmetric around zero, i.e., K(u) = K(−u); (K3) K(u) = 0 for |u| > 1. Roughly, histograms can be seen as formed with uniform kernels on bins.
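To make conditions (K1) to (K3) concrete, here is a small numerical check for one common kernel. The Epanechnikov kernel $K(u) = \tfrac{3}{4}(1 - u^2)$ on $[-1, 1]$ is picked here purely as an example and is not named on the slide; the helper `integrate` is a plain midpoint rule written for this sketch.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# (K1) nonnegative with total area 1 (checked numerically).
total = integrate(epanechnikov, -1.0, 1.0)

# (K2) symmetric around zero.
symmetric = all(epanechnikov(u) == epanechnikov(-u) for u in (0.1, 0.5, 0.9))

# (K3) zero outside [-1, 1].
vanishes = epanechnikov(1.5) == 0.0 and epanechnikov(-2.0) == 0.0
```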

15.3 Kernel density estimates

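Three steps to construct a kernel density estimate:
1. Scale the kernel $K$ by the bandwidth $h$: $u \mapsto \frac{1}{h} K\left(\frac{u}{h}\right)$.
2. Place a scaled kernel around each element $x_i$ of the dataset: $t \mapsto \frac{1}{h} K\left(\frac{t - x_i}{h}\right)$.
3. Take the normalized sum: $f_{n,h}(t) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{t - x_i}{h}\right)$.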
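The construction can be sketched in Python, assuming the three steps are the usual ones: scale the kernel by $h$, centre a copy at every data element, and average the copies. The Epanechnikov kernel, the function name `kernel_density_estimate`, the toy data, and the bandwidth value are all assumptions made for this illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel: a density, symmetric, zero outside [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_density_estimate(data, h, kernel=epanechnikov):
    """Return the function t -> (1 / (n h)) * sum_i K((t - x_i) / h)."""
    n = len(data)

    def estimate(t):
        # (1/h) K((t - x)/h) is the "pile of sand" centred at x;
        # averaging the n piles keeps the total area equal to 1.
        return sum(kernel((t - x) / h) for x in data) / (n * h)

    return estimate


# Hypothetical toy data and bandwidth, chosen only for illustration.
data = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6]
h = 1.0
f = kernel_density_estimate(data, h)

# Midpoint-rule check that the estimate integrates to about 1.
lo, hi, steps = min(data) - h, max(data) + h, 20000
width = (hi - lo) / steps
area = sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width
```

Since this kernel vanishes outside $[-1, 1]$, the estimate is exactly zero at any point farther than $h$ from every data element.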

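15.3 Kernel density estimates Choice of the bandwidth: a bandwidth that is too small produces an estimate with too many spurious bumps, while one that is too large smooths away real features of the data. A good choice: $h = 1.06\, s\, n^{-1/5}$, where $s$ is the sample standard deviation.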
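The bandwidth rule $h = 1.06\, s\, n^{-1/5}$ is a one-liner in Python. The function name `reference_bandwidth` and the toy data are assumptions for this sketch.

```python
import statistics


def reference_bandwidth(data):
    """Bandwidth h = 1.06 * s * n^(-1/5), with s the sample standard deviation."""
    s = statistics.stdev(data)  # divides by n - 1
    return 1.06 * s * len(data) ** (-1 / 5)


# Hypothetical toy data, not the Old Faithful measurements.
data = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6]
h = reference_bandwidth(data)

# Rescaling the data (for example, seconds to minutes) rescales h the same way.
h_scaled = reference_bandwidth([x / 60 for x in data])
```

The factor $n^{-1/5}$ shrinks slowly, so the bandwidth narrows only gradually as more observations are added.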

15.3 Kernel density estimates The choice of the kernel is less important, since different kernels tend to produce similar results. When a symmetric kernel is inappropriate, for example near a boundary of the range of possible values, a boundary kernel can be used.

15.4 The empirical distribution function The empirical cumulative distribution function of the data: $F_n(x) = \frac{\text{number of elements in the dataset} \le x}{n}$. For example, if the data is …, then …
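A minimal sketch of the empirical distribution function, taking the standard definition $F_n(x) = (\text{number of elements} \le x)/n$. The function name `ecdf` and the five-element dataset are hypothetical and chosen only to make the step function easy to follow.

```python
import bisect


def ecdf(data):
    """Return F_n, where F_n(x) = (number of elements <= x) / n."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x):
        # bisect_right gives the number of sorted elements that are <= x.
        return bisect.bisect_right(ordered, x) / n

    return F


# Hypothetical toy dataset.
F = ecdf([4, 3, 9, 1, 7])

values = [F(0), F(1), F(3.5), F(9), F(100)]
```

The function jumps by $1/n$ at each distinct data element (by $k/n$ if a value occurs $k$ times) and is constant in between.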

15.4 The empirical distribution function

15.5 Scatterplot In the case of two variables $x$ and $y$, the dataset consists of pairs of observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. In a scatterplot, each pair is shown as a point.

15.5 Scatterplot