STATISTICS Exploratory Data Analysis and Probability

STATISTICS Exploratory Data Analysis and Probability Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National Taiwan University Laboratory for Remote Sensing Hydrology and Spatial Modeling, Dept. of Bioenvironmental Systems Engineering, NTU

What is “statistics”?
Statistics is a science of “reasoning” from data: a body of principles and methods for extracting useful information from data, for assessing the reliability of that information, for measuring and managing risk, and for making decisions in the face of uncertainty.

The major difference between statistics and mathematics is that statistics always needs “observed” data, while mathematics does not. An important feature of statistical methods is the “uncertainty” involved in analysis.

Statistics is the discipline concerned with the study of variability, with the study of uncertainty and with the study of decision-making in the face of uncertainty. As these are issues that are crucial throughout the sciences and engineering, statistics is an inherently interdisciplinary science.

Stochastic Modeling & Simulation
- Building probability models for real-world phenomena. No matter how sophisticated a model is, it only represents our understanding of the complicated natural systems.
- Generating a large number of possible realizations.
- Making decisions or assessing risks based on simulation results.
- Conducted by computers.

Exploratory Data Analysis
Features of data distributions:
- Histograms
- Center: mean, median
- Spread: variance, standard deviation, range
- Shape: skewness, kurtosis
- Order statistics and sample quantiles
- Clusters
- Extreme observations: outliers

Histogram: frequencies and relative frequencies
A sample data set X:
104.838935 265.018615 205.279506 146.938446 12.577133
22.371870 129.538575 37.587841 231.608794 60.397366
24.762863 275.440477 70.721022 100.717110 33.918756
82.708815 149.905426 113.442704 131.144892 9.539663
82.535199 150.761192 134.931864 174.200632 130.360126
115.387515 102.460651 16.480639 9.961515 53.449806
64.158533 133.663194 139.201204 112.180103 105.368124
72.895810 107.569047 81.266071 101.351639 16.652365
85.553281 96.920012 34.202372 45.472935 149.996985
102.347372 19.277535 134.484317 121.101643 10.382787

[Figure: frequency histogram of X]

[Figure: relative frequency histogram of X]
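The class frequencies and relative frequencies behind the two histograms can be tabulated along the following lines. This is a minimal sketch: the vector x is typed in from the listing of X, and the class breaks are R's defaults, which is an assumption because the slides do not state the bin width.

```r
# Sample data set X (50 values), typed in from the listing of X
x <- c(104.838935, 265.018615, 205.279506, 146.938446, 12.577133,
       22.371870, 129.538575, 37.587841, 231.608794, 60.397366,
       24.762863, 275.440477, 70.721022, 100.717110, 33.918756,
       82.708815, 149.905426, 113.442704, 131.144892, 9.539663,
       82.535199, 150.761192, 134.931864, 174.200632, 130.360126,
       115.387515, 102.460651, 16.480639, 9.961515, 53.449806,
       64.158533, 133.663194, 139.201204, 112.180103, 105.368124,
       72.895810, 107.569047, 81.266071, 101.351639, 16.652365,
       85.553281, 96.920012, 34.202372, 45.472935, 149.996985,
       102.347372, 19.277535, 134.484317, 121.101643, 10.382787)

# Compute the histogram without drawing it (default breaks: an assumption)
h <- hist(x, plot = FALSE)

freq     <- h$counts               # number of observations in each class
rel_freq <- h$counts / length(x)   # proportion of observations in each class

# One row per class: class limits, frequency, relative frequency
hist_table <- data.frame(lower = head(h$breaks, -1),
                         upper = tail(h$breaks, -1),
                         freq = freq, rel_freq = rel_freq)
print(hist_table)
```

Calling plot(h) draws the frequency histogram; the relative frequency histogram has the same shape with the vertical axis divided by the sample size.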

Measures of center
- Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Sample median: the middle value of the ordered sample (the average of the two middle values when n is even)
For the sample data set X, the sample mean is 98.26067.

One desirable property of the sample median is that it is resistant to extreme observations, in the sense that its value depends only on the values of the middle observations and is quite unaffected by the actual values of the outer observations in the ordered list. The same cannot be said for the sample mean. Any significant change in the magnitude of an observation results in a corresponding change in the value of the mean. Hence, the sample mean is said to be sensitive to extreme observations.
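The contrast between the resistant median and the sensitive mean can be checked on a toy sample. The two vectors below are hypothetical values chosen only for illustration; they are not taken from the data set X.

```r
# Hypothetical small sample, chosen only for illustration
y <- c(1, 2, 3, 4, 5)

# The same sample with the largest value replaced by an extreme observation
y_extreme <- c(1, 2, 3, 4, 500)

mean(y);         median(y)           # both equal 3
mean(y_extreme); median(y_extreme)   # the mean jumps to 102, the median stays at 3
```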

Measures of spread
- Sample variance and sample standard deviation
- Range: the difference between the largest and smallest values
For the sample data set X:
Sample variance = 4039.931
Sample standard deviation = 63.56045
Range = 265.9008 (275.440477 - 9.539663)
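The three measures of spread can be computed with base R as sketched below. The vector x is typed in from the listing of X; note that var and sd in R use the divisor n - 1, and it is an assumption that the quoted values were obtained this way.

```r
# Sample data set X (50 values), typed in from the listing of X
x <- c(104.838935, 265.018615, 205.279506, 146.938446, 12.577133,
       22.371870, 129.538575, 37.587841, 231.608794, 60.397366,
       24.762863, 275.440477, 70.721022, 100.717110, 33.918756,
       82.708815, 149.905426, 113.442704, 131.144892, 9.539663,
       82.535199, 150.761192, 134.931864, 174.200632, 130.360126,
       115.387515, 102.460651, 16.480639, 9.961515, 53.449806,
       64.158533, 133.663194, 139.201204, 112.180103, 105.368124,
       72.895810, 107.569047, 81.266071, 101.351639, 16.652365,
       85.553281, 96.920012, 34.202372, 45.472935, 149.996985,
       102.347372, 19.277535, 134.484317, 121.101643, 10.382787)

s2 <- var(x)            # sample variance, divisor n - 1
s  <- sd(x)             # sample standard deviation, the square root of s2
r  <- diff(range(x))    # range: largest value minus smallest value

c(variance = s2, std_dev = s, range = r)
```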

Measures of shape
- Sample skewness
- Sample kurtosis
For the sample data set X, the sample excess kurtosis is 0.533141. R reports 3.533141, because the kurtosis function of the moments package does not subtract 3. The moments package must be installed in R in order to calculate the sample skewness and kurtosis.
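The relation between the two kurtosis values can be illustrated without installing any package. The sketch below uses the moment-based formulas with divisor n, which is what the moments package is understood to compute (an assumption); the function names sample_skewness and sample_kurtosis and the vector y are hypothetical.

```r
# Moment-based shape measures with divisor n
# (assumed to match moments::skewness and moments::kurtosis)
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

sample_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2     # raw kurtosis: equals 3 for a normal distribution
}

y <- c(1, 2, 3, 4, 5)         # hypothetical symmetric sample

sample_skewness(y)            # 0 for a symmetric sample
sample_kurtosis(y)            # raw kurtosis, 1.7
sample_kurtosis(y) - 3        # excess kurtosis, -1.3
```

Subtracting 3 converts the raw kurtosis into the excess kurtosis, which is the same conversion that links the two values quoted for X.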

Order statistics
- Sample quantiles
- Linear interpolation
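A sample quantile obtained by linear interpolation between adjacent order statistics can be sketched as follows. The interpolation rule shown is the one used by R's default quantile type (type 7); whether the slides use exactly this rule is an assumption, and interp_quantile and y are hypothetical names.

```r
y     <- c(7, 1, 4, 2)     # hypothetical sample
y_ord <- sort(y)           # order statistics x(1) <= ... <= x(n): 1 2 4 7

# Quantile by linear interpolation between adjacent order statistics
interp_quantile <- function(x, p) {
  xs <- sort(x)
  n  <- length(xs)
  h  <- (n - 1) * p + 1    # fractional position in the ordered sample
  lo <- floor(h)
  hi <- ceiling(h)
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

interp_quantile(y, 0.25)        # 1.75: three quarters of the way from 1 to 2
quantile(y, 0.25, type = 7)     # R's built-in result for the same rule
```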

Box-and-whisker plot (or boxplot)
A box-and-whisker plot includes two major parts: the box and the whiskers. In R's boxplot function, the parameter range determines how far the whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range (IQR) from the box. A value of zero causes the whiskers to extend to the data extremes. Points that fall beyond the whiskers are marked as outliers.
Hinges and the five-number summary
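The effect of the range parameter can be inspected numerically with boxplot.stats, whose coef argument receives the value of range when boxplot is called. The vector y is a hypothetical sample with one extreme value, chosen only for illustration.

```r
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # hypothetical sample with one extreme value

# Whiskers limited to 1.5 times the box length (the default)
b15 <- boxplot.stats(y, coef = 1.5)
b15$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker: 1 3 5.5 8 9
b15$out     # points beyond the whiskers: 100

# coef = 0: whiskers extend to the data extremes, no point is flagged
b0 <- boxplot.stats(y, coef = 0)
b0$stats    # 1 3 5.5 8 100
b0$out      # empty
```

Here the box runs from 3 to 8, so the upper whisker may reach at most 8 + 1.5 × 5 = 15.5; the largest observation within that limit is 9, and 100 is marked as an outlier.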

In R, a boxplot is essentially a graphical representation determined by the five-number summary (5NS). The hinges of the 5NS are not obtained by “linear interpolation”.
The summary function in R yields a list of six numbers: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum.

Determining the lower and upper hinges
The lower hinge is the median of the lower half of the data, and the upper hinge is the median of the upper half of the data.
When the number of data points, say n, is even, there are n/2 data points in each of the lower and upper halves.
When n is odd, there are (n+1)/2 data points in each half, because the median is counted as a data point in both the lower and upper halves.
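The hinge rule can be written out directly and compared with R's built-in summaries. This is a sketch: hinges is a hypothetical helper that follows the rule stated above, and the two small vectors are illustrative values only.

```r
# Lower and upper hinges: medians of the lower and upper halves of the ordered data
hinges <- function(x) {
  xs <- sort(x)
  n  <- length(xs)
  k  <- if (n %% 2 == 0) n / 2 else (n + 1) / 2   # number of points in each half
  c(lower = median(xs[1:k]), upper = median(xs[(n - k + 1):n]))
}

y_even <- c(1, 2, 3, 4)      # hypothetical sample, n even
y_odd  <- c(1, 2, 3, 4, 5)   # hypothetical sample, n odd

hinges(y_even)     # 1.5 and 3.5
hinges(y_odd)      # 2 and 4
fivenum(y_even)    # five-number summary: 1 1.5 2.5 3.5 4
quantile(y_even)   # quartiles by linear interpolation: 1 1.75 2.5 3.25 4
summary(y_even)    # six numbers; its quartiles come from quantile, not from the hinges
```

For the even-sized sample the hinges (1.5, 3.5) differ from the interpolated quartiles (1.75, 3.25), which is why the box drawn by boxplot need not match the quartiles printed by summary.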

[Figure: box-and-whisker plot of X]

[Figure: seasonal variation of average monthly rainfall in CDZ, Myanmar. Boxplots are based on the average monthly rainfall at 54 rainfall stations.]

R - Practices
Sample data (Sample_Data_1.csv): the 50 values of the sample data set X listed above.

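# Read the file as a plain numeric vector
x = scan("Sample_Data_1.csv", sep = ",")
mode(x); class(x); length(x)
x
# ----------------------------------------
# Read the same file as a data frame
x1 = read.table("Sample_Data_1.csv", sep = ",", header = FALSE)
mode(x1); class(x1); length(x1)
x1
x1[6, 3]
x1[26]    # a single index selects a column of a data frame; this fails unless x1 has at least 26 columns
# ---------------------------------------
# Rebuild the vector as matrices
x2 = matrix(x, ncol = 5)       # filled column by column
mode(x2); class(x2); length(x2)
x2
x2[6, 3]
x2[26]                         # a single index counts matrix elements column by column
x3 = t(matrix(x, nrow = 5))    # equivalent to filling row by row
mode(x3); class(x3); length(x3)
x3
x3[6, 3]
x3[26]
# -----------------------------------------
x1[[3]][6]                     # select sub-objects from a list
x1[[3]][3:8]
x1[3]; length(x1[3])           # single brackets return a one-column data frame
x1[[3]]; length(x1[[3]])       # double brackets return the column itself as a vector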

zzz = list(x1, x3)    # a list holding the data frame x1 and the matrix x3
mode(zzz); class(zzz); length(zzz)
mode(zzz[1]); class(zzz[1]); length(zzz[1])                      # single brackets: a sub-list
mode(zzz[[1]]); class(zzz[[1]]); length(zzz[[1]])                # double brackets: the data frame itself
mode(zzz[[1]][1]); class(zzz[[1]][1]); length(zzz[[1]][1])
zzz[[1]][1]
mode(zzz[[1]][[1]]); class(zzz[[1]][[1]]); length(zzz[[1]][[1]])
zzz[[1]][[1]]
#
mode(zzz[2]); class(zzz[2]); length(zzz[2])
mode(zzz[[2]]); class(zzz[[2]]); length(zzz[[2]])                # the matrix itself
mode(zzz[[2]][1]); class(zzz[[2]][1]); length(zzz[[2]][1])
zzz[[2]][1]

Matrices, Data Frame, and List

Object       Mode      Class
Matrix       numeric   matrix
Data frame   list      data frame
List         list      list or data frame

- A matrix is a set of numerical values arranged in matrix format. A matrix is a vector with a dimension vector.
- A data frame is a particular kind of list: a collection of vectors of equal length. These vectors can be numeric, logical, or character.
- Although a data frame is a list, its elements can be accessed in a way similar to a matrix.
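The differences among the three object types can be seen on small objects built in place, without reading any file. The objects m, df and l below are hypothetical examples.

```r
m  <- matrix(1:6, ncol = 2)                     # a 3 x 2 matrix
df <- data.frame(a = 1:3,
                 b = c(TRUE, FALSE, TRUE),
                 c = c("x", "y", "z"))          # columns of different types, equal length
l  <- list(m, df)                               # a list can hold objects of any kind

mode(m);  class(m);  length(m)     # mode "numeric"; length counts all 6 elements
mode(df); class(df); length(df)    # mode "list"; class "data.frame"; length counts the 3 columns
mode(l);  class(l);  length(l)     # mode "list"; class "list"; 2 components

df[2, 1]       # matrix-style access to a data frame
df[[1]][2]     # list-style access to the same value
```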

Random Experiment and Sample Space
An experiment that can be repeated under the same (or uniform) conditions, but whose outcome cannot be predicted in advance, even when the same experiment has been performed many times, is called a random experiment.

Examples of random experiments
- Tossing a coin.
- Rolling a die.
- The selection of a numbered ball (1-50) from an urn (selection with replacement).
- Occurrences of earthquakes: the time interval between the occurrences of two consecutive higher-than-scale-6 earthquakes.
- Occurrences of typhoons: the amount of rainfall produced by typhoons in one year (yearly typhoon rainfall).

The following items are always associated with a random experiment:
- Sample space. The set of all possible outcomes, denoted by $\Omega$.
- Outcomes. Elements of the sample space, denoted by $\omega$. These are also referred to as sample points or realizations.
- Events. An event is a subset of $\Omega$ for which the probability is defined. Events are denoted by capital Latin letters (e.g., A, B, C).

Definition of Probability
- Classical probability
- Frequency probability
- Probability model

Classical (or a priori) probability
If a random experiment can result in n mutually exclusive and equally likely outcomes, and if $n_A$ of these outcomes have an attribute A, then the probability of A is the fraction $n_A/n$.

Example 1. Compute the probability of getting two heads if a fair coin is tossed twice. (Answer: 1/4)
Example 2. Compute the probability that a card drawn from an ordinary well-shuffled deck will be an ace or a spade. (Answer: 16/52)
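Both answers follow from counting equally likely outcomes, which can be done by brute-force enumeration. The sketch below builds each sample space explicitly; the object names and the labels used for ranks and suits are hypothetical.

```r
# Example 1: two tosses of a fair coin, four equally likely outcomes
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                      stringsAsFactors = FALSE)
n_A <- sum(tosses$first == "H" & tosses$second == "H")   # outcomes with two heads
p_two_heads <- n_A / nrow(tosses)                        # 1/4

# Example 2: one card drawn from a 52-card deck
deck <- expand.grid(rank = c("A", 2:10, "J", "Q", "K"),
                    suit = c("spade", "heart", "diamond", "club"),
                    stringsAsFactors = FALSE)
n_B <- sum(deck$rank == "A" | deck$suit == "spade")      # 4 aces + 13 spades - 1 ace of spades
p_ace_or_spade <- n_B / nrow(deck)                       # 16/52

c(p_two_heads, p_ace_or_spade)
```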

Remarks
- The probabilities determined by the classical definition are called “a priori” probabilities since they can be derived purely by deductive reasoning.
- The “equally likely” assumption requires the experiment to be carried out in such a way that the assumption is realistic, for example by using a balanced coin, a die that is not loaded, a well-shuffled deck of cards, random sampling, and so forth. This assumption also requires that the sample space is appropriately defined.

Troublesome limitations in the classical definition of probability:
- The number of possible outcomes may be infinite.
- The possible outcomes may not be equally likely.

Relative frequency (or a posteriori) probability
We observe outcomes of a random experiment which is repeated many times. We postulate a number p which is the probability of an event, and approximate p by the relative frequency f with which the repeated observations satisfy the event.

Suppose a random experiment is repeated n times under uniform conditions, and event A occurs $n_A$ times. Then the relative frequency with which A occurs is $f_n(A) = n_A/n$. If the limit of $f_n(A)$ as n approaches infinity exists, then one can assign the probability of A by
$P(A) = \lim_{n \to \infty} f_n(A)$.

This method requires the existence of the limit of the relative frequencies. This property is known as statistical regularity. This property will be satisfied if the trials are independent and are performed under uniform conditions.

Example 3
A fair coin was tossed 100 times, with 54 occurrences of heads. The probability of heads for each toss is estimated by the relative frequency 54/100 = 0.54.
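The way a relative frequency settles down as the number of tosses grows can be illustrated by simulation. This is a sketch under stated assumptions: the tosses are generated by R's pseudo-random number generator with probability 0.5 of heads, and the seed and the number of tosses are arbitrary choices.

```r
set.seed(1)                                 # arbitrary seed, for reproducibility only
n     <- 10000                              # number of simulated tosses (arbitrary)
heads <- rbinom(n, size = 1, prob = 0.5)    # 1 = heads, 0 = tails

f_n <- cumsum(heads) / seq_len(n)           # relative frequency of heads after each toss

# Relative frequency after 10, 100, 1000 and 10000 tosses
f_n[c(10, 100, 1000, 10000)]
```

With few tosses the relative frequency can sit noticeably away from 0.5, as the value 0.54 in Example 3 does; with many tosses it typically stays close to 0.5.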

The chain of probability definition
Random experiment → Sample space → Event space → Probability space

Probability Model
Each outcome can be thought of as a sample point, or an element, in the sample space.

Event and event space
An event is a subset of the sample space. The class of all events associated with a given random experiment is defined to be the event space.
An event will always be a subset of the sample space, but for sufficiently large sample spaces not all subsets will be events. Thus the class of all subsets of the sample space will not necessarily correspond to the event space.
If the sample space consists of only a finite number of points, then the corresponding event space will be the class of all subsets of the sample space.

$\emptyset$ (the empty set) and $\Omega$ (the sure event) are both subsets of $\Omega$.
An event A is said to occur if the experiment at hand results in an outcome that belongs to A.
An event space is usually denoted by a script Latin letter such as $\mathcal{A}$ and $\mathcal{B}$.
Two events A and B are said to be mutually exclusive if and only if $A \cap B = \emptyset$.
Events $A_1, A_2, \ldots$ are mutually exclusive if and only if $A_i \cap A_j = \emptyset$ for every $i \neq j$.

Event space and algebra of events
Let $\mathcal{A}$ denote an event space. The following properties are called the Boolean algebra, or algebra of events:
(i) $\Omega \in \mathcal{A}$;
(ii) if $A \in \mathcal{A}$, then $A^c \in \mathcal{A}$;
(iii) if $A_1 \in \mathcal{A}$ and $A_2 \in \mathcal{A}$, then $A_1 \cup A_2 \in \mathcal{A}$.

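Probability function
Let $\Omega$ denote the sample space and $\mathcal{A}$ denote an algebra of events for some random experiment. Then, a probability function $P[\cdot]$ is a set function with domain $\mathcal{A}$ (an algebra of events) and counterdomain the interval [0, 1] which satisfies the following axioms:
(i) $P[A] \geq 0$ for every $A \in \mathcal{A}$;
(ii) $P[\Omega] = 1$;
(iii) if $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathcal{A}$ and $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$, then $P[\bigcup_{i=1}^{\infty} A_i] = \sum_{i=1}^{\infty} P[A_i]$.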

Probability is a mapping (function) of sets to numbers. Probability is not a mapping of the sample space to numbers.
The expression $P[\omega]$ for an outcome $\omega \in \Omega$ is not defined. However, for a singleton event $\{\omega\}$, $P[\{\omega\}]$ is defined.

Probability space
A probability space is the triplet $(\Omega, \mathcal{A}, P[\cdot])$, where $\Omega$ is a sample space, $\mathcal{A}$ is an event space, and $P[\cdot]$ is a probability function with domain $\mathcal{A}$.
A probability space constitutes a complete probabilistic description of a random experiment. The sample space $\Omega$ defines all of the possible outcomes, the event space $\mathcal{A}$ defines all possible things that could be observed as a result of an experiment, and the probability P defines the degree of belief or evidential support associated with the experiment.

Finite Sample Space
A random experiment can result in a finite number of possible outcomes. A sample space with only a finite number of elements (points) is called a finite sample space.
- Finite sample space with equally likely points: a simple sample space
- Finite sample space without equally likely points
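For a small simple sample space, the event space (all subsets) and an equally-likely probability function can be built explicitly. The sketch below uses a hypothetical three-point sample space; the names omega, events and P are illustrative, not part of the course material.

```r
omega <- c(1, 2, 3)     # hypothetical simple sample space with three points

# Enumerate every subset through all TRUE/FALSE membership patterns
patterns <- expand.grid(rep(list(c(FALSE, TRUE)), length(omega)))
events   <- lapply(seq_len(nrow(patterns)),
                   function(i) omega[unlist(patterns[i, ])])
length(events)          # 2^3 = 8 events, including the empty set and omega itself

# Equally likely points: P[A] = (number of points in A) / (number of points in omega)
P <- function(A) length(A) / length(omega)

A <- c(1)
B <- c(2, 3)                     # A and B are mutually exclusive
P(omega)                         # 1
P(union(A, B)); P(A) + P(B)      # additivity for mutually exclusive events
```

When the points are not equally likely, P would instead sum point probabilities that are specified separately for each outcome.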

Conditional probability

Bayes’ theorem

Multiplication rule

Independent events
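For reference, the standard textbook statements of these four results are sketched below. The notation ($P[\cdot]$, events $A$, $B$, and a partition $B_1, \ldots, B_n$ of the sample space) is an assumption chosen to match the earlier slides; the formulas shown on the original slides are not reproduced in this transcript.

```latex
% Conditional probability of A given B, defined when P[B] > 0
P[A \mid B] = \frac{P[A \cap B]}{P[B]}

% Bayes' theorem, for mutually exclusive events B_1, ..., B_n whose union is \Omega,
% with P[B_j] > 0 for every j and P[A] > 0
P[B_k \mid A] = \frac{P[A \mid B_k] \, P[B_k]}{\sum_{j=1}^{n} P[A \mid B_j] \, P[B_j]}

% Multiplication rule, provided the conditioning events have positive probability
P[A_1 \cap A_2 \cap \cdots \cap A_n]
  = P[A_1] \, P[A_2 \mid A_1] \cdots P[A_n \mid A_1 \cap \cdots \cap A_{n-1}]

% Independence of two events A and B
P[A \cap B] = P[A] \, P[B]
```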

The property of independence of two events A and B and the property that A and B are mutually exclusive are distinct, though related, properties. Here P(AB) denotes the probability of the intersection $A \cap B$. If A and B are mutually exclusive events, then $A \cap B = \emptyset$ and therefore P(AB) = 0, whereas if A and B are independent events, then P(AB) = P(A)P(B). Events A and B are both mutually exclusive and independent only if P(AB) = P(A)P(B) = 0, that is, only if at least one of A or B has zero probability.

But if A and B are mutually exclusive events and both have nonzero probabilities then it is impossible for them to be independent events. Likewise, if A and B are independent events and both have nonzero probabilities then it is impossible for them to be mutually exclusive.
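The distinction can be checked on a deck of cards, where every card is equally likely to be drawn. This is an illustrative sketch: the events (ace, spade, king) and the helper P are hypothetical choices, with P computing the proportion of cards belonging to an event.

```r
# One card drawn from a 52-card deck, all cards equally likely
deck <- expand.grid(rank = c("A", 2:10, "J", "Q", "K"),
                    suit = c("spade", "heart", "diamond", "club"),
                    stringsAsFactors = FALSE)

P <- function(event) mean(event)    # proportion of cards that belong to the event

ace   <- deck$rank == "A"
spade <- deck$suit == "spade"
king  <- deck$rank == "K"

# Ace and spade: independent, but not mutually exclusive
P(ace & spade); P(ace) * P(spade)   # both equal 1/52

# Ace and king: mutually exclusive with nonzero probabilities, hence not independent
P(ace & king)                       # 0
P(ace) * P(king)                    # positive, so it cannot equal P(ace & king)

# Conditional probability of an ace given a spade equals the unconditional probability
P(ace & spade) / P(spade); P(ace)
```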