CIS 2033. Based on the textbook: F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. A Modern Introduction to Probability and Statistics: Understanding Why and How.


CIS 2033. Based on the textbook: F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. A Modern Introduction to Probability and Statistics: Understanding Why and How. Instructor: Dr. Longin Jan Latecki

Chapter 15 Exploratory data analysis: graphical summaries

• The set of observations is called a dataset.
• By exploring the dataset we can gain insight into what probability model suits the phenomenon.
• To graphically represent univariate datasets, consisting of repeated measurements of one particular quantity, we discuss the classical histogram, the more recently introduced kernel density estimates, and the empirical distribution function.
• To represent a bivariate dataset, which consists of repeated measurements of two quantities, we use the scatterplot.

15.2 Histograms: The term histogram appears to have been used first by Karl Pearson.

Histogram construction and pdf

Denote a generic (univariate) dataset of size $n$ by $x_1, x_2, \ldots, x_n$. First we divide the range of the data into intervals. These intervals are called bins and are denoted by $B_1, B_2, \ldots, B_m$. The length of an interval $B_i$ is denoted by $|B_i|$ and is called the bin width.

We want the area under the histogram on each bin $B_i$ to reflect the number of elements in $B_i$. Since the total area 1 under the histogram then corresponds to the total number of elements $n$ in the dataset, the area under the histogram on a bin $B_i$ is equal to the proportion of elements in $B_i$:

$$\frac{\text{number of } x_j \text{ in } B_i}{n}.$$

The height of the histogram on bin $B_i$ must therefore be equal to

$$\frac{\text{number of } x_j \text{ in } B_i}{n\,|B_i|}.$$

As we know from Ch. 13.4, the histogram approximates the pdf $f$. In particular, for a bin centered at a point $a$, $B_a = (a-h, a+h]$, we have

$$f(a) \approx \frac{\text{number of } x_j \text{ in } (a-h, a+h]}{2hn}.$$
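The construction above can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_heights`, the sample data, and the bin edges are made up for the example and do not come from the slides.

```python
def histogram_heights(data, edges):
    """Height of the histogram on each bin (edges[i], edges[i+1]].

    The height is (number of elements in the bin) / (n * bin width),
    so that the area on each bin equals the proportion of elements in it.
    """
    n = len(data)
    heights = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for x in data if left < x <= right)
        heights.append(count / (n * (right - left)))
    return heights


# Hypothetical dataset and bin edges, chosen only for illustration.
data = [0.5, 1.2, 1.7, 2.4, 2.9, 3.1, 3.3, 4.8]
edges = [0, 2, 4, 6]

heights = histogram_heights(data, edges)

# The areas on the bins add up to 1 when the bins cover the whole dataset.
total_area = sum(h * (r - l) for h, l, r in zip(heights, edges, edges[1:]))
print(heights, total_area)
```

Note that the bins are half-open on the left, matching the $(a-h, a+h]$ convention used for $B_a$.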

The function g (blue) is a mixture of two Gaussians. We draw 200 samples from it, which are shown as blue dots. We use the samples to generate the histogram (yellow) and its kernel density estimate f (red). The Matlab script is twoGaussKernelDensity1.m. In Matlab: binwidth=0.5; bincenters=[0.5:binwidth:9.5]; hx=hist(x,bincenters)/(200*binwidth); Dividing the bin counts by 200*binwidth (the sample size times the bin width) turns them into histogram heights.

Choice of the bin width

Consider a histogram with bins of equal width. In that case the bins are of the form

$$B_i = (r + (i-1)b,\; r + ib] \quad \text{for } i = 1, 2, \ldots, m,$$

where $r$ is some reference point smaller than the minimum of the dataset and $b$ denotes the bin width. Mathematical research has provided some guidelines for a data-based choice of $b$ or $m$:

$$m = 1 + 3.3 \log_{10}(n) \quad \text{or} \quad b = 3.49\, s\, n^{-1/3},$$

where $s$ is the sample standard deviation.
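As a rough sketch of how a data-based bin width might be computed, the code below assumes that the guideline meant here is the commonly quoted pair of rules $b = 3.49\, s\, n^{-1/3}$ (Scott's normal reference rule) and $m = 1 + 3.3 \log_{10}(n)$ (a Sturges-type rule). The function names are hypothetical.

```python
import math
import statistics


def scott_bin_width(data):
    """Bin width b = 3.49 * s * n^(-1/3), with s the sample standard deviation."""
    s = statistics.stdev(data)  # sample std (divides by n - 1)
    return 3.49 * s * len(data) ** (-1 / 3)


def sturges_bin_count(n):
    """Number of bins m = 1 + 3.3 * log10(n); round before using it."""
    return 1 + 3.3 * math.log10(n)


def equal_width_bins(r, b, m):
    """Bins (r + (i-1)b, r + ib] for i = 1..m, returned as (left, right) pairs."""
    return [(r + (i - 1) * b, r + i * b) for i in range(1, m + 1)]


# Hypothetical data, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8]
b = scott_bin_width(data)
m = sturges_bin_count(len(data))
bins = equal_width_bins(0, 0.5, 3)
print(b, m, bins)
```

Both rules are only starting points; in practice one would still look at the resulting histogram and adjust the bin width if it looks too ragged or too coarse.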

15.3 Kernel density estimates

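A kernel $K$ is a function $K: \mathbb{R} \to \mathbb{R}$, and a kernel $K$ typically satisfies the following conditions:

(K1) $K$ is a probability density, i.e., $K(u) \ge 0$ and $\int_{-\infty}^{\infty} K(u)\,du = 1$;
(K2) $K$ is symmetric around zero, i.e., $K(u) = K(-u)$;
(K3) $K(u) = 0$ for $|u| > 1$.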

Examples of kernel construction
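The slide's own examples are not reproduced here. As an illustrative sketch, the code below defines three kernels that are often used (Epanechnikov, triweight, and normal) and checks numerically that each integrates to 1. The choice of kernels and the helper `integrate` are assumptions made for this example.

```python
import math


def epanechnikov(u):
    """K(u) = 3/4 (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def triweight(u):
    """K(u) = 35/32 (1 - u^2)^3 on [-1, 1], zero elsewhere."""
    return 35 / 32 * (1 - u * u) ** 3 if abs(u) <= 1 else 0.0


def normal_kernel(u):
    """Standard normal density; symmetric, but not zero outside [-1, 1]."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


areas = {
    "epanechnikov": integrate(epanechnikov, -1, 1),
    "triweight": integrate(triweight, -1, 1),
    "normal": integrate(normal_kernel, -8, 8),
}
print(areas)
```

The normal kernel is a probability density and is symmetric around zero, but it does not vanish outside $[-1, 1]$, which is one reason the conditions are described as typical rather than required.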

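Scaling the kernel K

Scale the kernel $K$ into the function

$$t \mapsto \frac{1}{h} K\!\left(\frac{t}{h}\right),$$

where $h > 0$ is the bandwidth. Then put a scaled kernel around each element $x_i$ in the dataset:

$$t \mapsto \frac{1}{h} K\!\left(\frac{t - x_i}{h}\right).$$

The kernel density estimate is the sum of the scaled kernels divided by $n$:

$$f_{n,h}(t) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{t - x_i}{h}\right).$$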
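A minimal sketch of a kernel density estimate, assuming the usual definition $f_{n,h}(t) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{t - x_i}{h}\right)$ with bandwidth $h$ and, for concreteness, the Epanechnikov kernel. The function names and the small dataset are hypothetical.

```python
def epanechnikov(u):
    """K(u) = 3/4 (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def kde(t, data, h, kernel=epanechnikov):
    """Kernel density estimate at t: average of kernels scaled by the bandwidth h
    and centered at the data points."""
    n = len(data)
    return sum(kernel((t - x) / h) for x in data) / (n * h)


# Hypothetical dataset, for illustration only.
data = [1.0, 2.0, 4.0]

# With h = 1 only the kernel centered at x = 2 contributes at t = 2.
value = kde(2.0, data, h=1.0)
print(value)

# Evaluate the estimate on a grid for two bandwidths to compare their shapes.
grid = [i / 10 for i in range(0, 51)]
narrow = [kde(t, data, h=0.2) for t in grid]  # spiky: one bump per data point
wide = [kde(t, data, h=3.0) for t in grid]    # very smooth: detail is lost
```

Because each scaled kernel integrates to 1, their average does too, so the estimate is itself a probability density.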

[Figure with two panels: "The bandwidth is too small" and "The bandwidth is too big".]

The function g (blue) is a mixture of two Gaussians. We draw 200 samples from it, which are shown as blue dots. We use the samples to generate the histogram (yellow) and its kernel density estimate f (red). The Matlab script is twoGaussKernelDensity1.m.

15.4 The empirical distribution function

Another way to graphically represent a dataset is to plot the data in a cumulative manner. This can be done by using the empirical cumulative distribution function $F_n$, defined by

$$F_n(x) = \frac{\text{number of elements in the dataset} \le x}{n}.$$
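A short sketch of the empirical distribution function, using the standard definition $F_n(t) = (\text{number of elements} \le t)/n$. The helper name `ecdf` and the data are made up for the example.

```python
import bisect


def ecdf(data):
    """Return the empirical distribution function F_n of the dataset."""
    xs = sorted(data)
    n = len(xs)

    def F(t):
        # bisect_right counts how many sorted elements are <= t.
        return bisect.bisect_right(xs, t) / n

    return F


# Hypothetical dataset, for illustration only.
F = ecdf([3, 1, 2, 2])
print(F(0), F(1), F(2), F(2.5), F(3))
```

The function is a step function: it jumps by $1/n$ at each data point (by $k/n$ where a value occurs $k$ times) and is constant in between.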

Empirical distribution function, continued

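Example (By: Wanwisa Smith)

Given is the following information about a histogram, compute the value of the empirical distribution function at the point $t = 7$. The bins are $(0, 2]$, $(2, 4]$, $(4, 7]$, $(7, 11]$, and one further bin $B_5$; write $H_1, \ldots, H_5$ for the heights of the histogram on these bins.

Because the areas on the bins sum to 1,

$$(2 - 0)\,H_1 + (4 - 2)\,H_2 + (7 - 4)\,H_3 + (11 - 7)\,H_4 + |B_5|\,H_5 = 1,$$

there are no data points outside the listed bins. Hence $F_n(7)$ is the total area on the bins to the left of 7:

$$F_n(7) = (2 - 0)\,H_1 + (4 - 2)\,H_2 + (7 - 4)\,H_3.$$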
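To make this kind of computation concrete, the sketch below works through it with hypothetical numbers. The heights and the last bin $(11, 15]$ are assumptions chosen so that the bin areas sum to 1; they are not taken from the slide.

```python
# Hypothetical histogram: bins (a, b] and the height on each bin.
bins = [(0, 2), (2, 4), (4, 7), (7, 11), (11, 15)]
heights = [0.245, 0.130, 0.050, 0.020, 0.005]  # assumed values, not from the slide


def ecdf_at_bin_edge(t, bins, heights):
    """F_n(t) for t equal to a right bin edge: total area of the bins up to t.

    Only valid when the bins contain every data point, i.e. the areas sum to 1.
    """
    return sum((b - a) * h for (a, b), h in zip(bins, heights) if b <= t)


total_area = ecdf_at_bin_edge(15, bins, heights)  # should be 1 (up to rounding)
value_at_7 = ecdf_at_bin_edge(7, bins, heights)   # area on (0,2], (2,4], (4,7]
print(total_area, value_at_7)
```

With these assumed heights the first three bins contribute 0.49, 0.26, and 0.15, so about 90% of the data points are at most 7.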

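Relation between histogram and empirical cdf (By: Wanwisa Smith)

Given is a histogram and the empirical distribution function $F_n$ of the same dataset. Show that the height of the histogram on a bin $(a, b]$ is equal to

$$\frac{F_n(b) - F_n(a)}{b - a}.$$

The height of the histogram on a bin $B_i = (a, b]$ is

$$\frac{\text{number of } x_j \text{ in } (a, b]}{n\,(b - a)}.$$

The number of $x_j$ in $(a, b]$ equals the number of $x_j \le b$ minus the number of $x_j \le a$. Hence

$$\frac{\text{number of } x_j \text{ in } (a, b]}{n\,(b - a)} = \frac{(\text{number of } x_j \le b) - (\text{number of } x_j \le a)}{n\,(b - a)} = \frac{F_n(b) - F_n(a)}{b - a}.$$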
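The identity can also be checked numerically. The sketch below compares the two ways of computing the height on a bin for a small made-up dataset; the function names are hypothetical.

```python
import bisect


def ecdf_value(data, t):
    """F_n(t): proportion of elements of the dataset that are <= t."""
    return bisect.bisect_right(sorted(data), t) / len(data)


def height_from_counts(data, a, b):
    """Histogram height on (a, b]: count / (n * bin width)."""
    count = sum(1 for x in data if a < x <= b)
    return count / (len(data) * (b - a))


def height_from_ecdf(data, a, b):
    """Histogram height on (a, b] expressed through F_n."""
    return (ecdf_value(data, b) - ecdf_value(data, a)) / (b - a)


# Hypothetical dataset, for illustration only.
data = [0.5, 1.2, 1.7, 2.4, 2.9, 3.1, 3.3, 4.8]
by_counts = height_from_counts(data, 2, 4)
by_ecdf = height_from_ecdf(data, 2, 4)
print(by_counts, by_ecdf)
```

The two values agree because $F_n(b) - F_n(a)$ is exactly the proportion of data points falling in $(a, b]$.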

15.5 Scatterplot

In some situations we might want to investigate the relationship between two or more variables. In the case of two variables $x$ and $y$, the dataset consists of pairs of observations:

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).$$

We call such a dataset a bivariate dataset, in contrast to a univariate dataset. The plot of the points $(x_i, y_i)$ for $i = 1, 2, \ldots, n$ is called a scatterplot.