Methods of Exploratory Data Analysis GG 313 Fall 2003 8/25/05.

CRUISE Save October (Monday-Thurs) For a STUDENT CRUISE on the R/V Kilo Moana

Scatter Plots
We did an example Tuesday with the tide data; let's look at another. These data were taken at Scripps Pier (La Jolla, CA) in December. They are sampled once per second, and the units are cm.

We'll do a very simple MATLAB plot, since there are too many data points for Excel to handle. The data are on my computer in the file df07301.txt, as a single column of numbers. The MATLAB commands are:

    load('df07301.txt')   % creates the variable df07301
    plot(df07301)

These data show the tidal components as the long-period oscillation, the normal ocean waves as the thickening of the blue line, and the signal from the Sumatra earthquake tsunami as the larger thickening.
What does this plot tell us about our data?
- It's clean: no wild points.
- If we're after the tsunami signal, we've got it.
- If we want to see it better, we need to do some analysis.
There are 86400 seconds in a day and the data are sampled once per second, so this plot is (number of samples)/86400 = 5.78 days long.

Just for fun, let's try one more technique before we leave this data set. The tidal signal is noise for us, so let's subtract it from the data. To do this we first apply a FILTER to the data to isolate the tidal signal:

    windowsize = 3600;   % that's 1 hour of 1-second samples
    lpout = filter(ones(1,windowsize)/windowsize, 1, df07301);

This running mean does a pretty good job of SMOOTHING the data, isolating the tidal signal from the waves, both tsunami and wind waves. Now we subtract the filtered data from the original data:

    Hipass = df07301 - lpout;
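One detail worth checking when subtracting a running mean: filter() is causal, so its output lags the input by about half the window, and the delayed tide no longer cancels cleanly. The sketch below uses made-up synthetic data (not the real df07301 record) and compares the causal running mean with a centered one built with conv; the centered version and all variable names are assumptions added for illustration.

```matlab
% Synthetic stand-in for the pier record (NOT the real df07301 data).
t = (0:86399)';                       % one day of 1-second samples
tide = 50*sin(2*pi*t/44700);          % slow "tide", ~12.4 h period, cm
waves = 5*sin(2*pi*t/10);             % fast "waves", 10 s period, cm
x = tide + waves;

% Causal running mean, as in the text: current sample plus the previous n-1.
n = 3600;
lp_causal = filter(ones(1,n)/n, 1, x);
hp_causal = x - lp_causal;            % delayed tide does not cancel well

% Centered running mean (odd length), an alternative assumed here.
m = 3601;
lp_centered = conv(x, ones(m,1)/m, 'same');
hp_centered = x - lp_centered;        % tide cancels to within about 1 cm

% Compare away from the ends, where both averages use a full window.
idx = 3601:(numel(x) - 3600);
leak_causal = max(abs(hp_causal(idx) - waves(idx)));
leak_centered = max(abs(hp_centered(idx) - waves(idx)));
```

With these synthetic numbers, leak_causal comes out larger than the wave amplitude itself, while leak_centered is a small fraction of it. Whether the lag matters for the real record depends on what you want from the residual.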

What does BAD data look like?
Be extremely careful before discounting the validity of data. Some of the most important theories have come from data that looked wrong. Early El Niño data were rejected by a computer program because they were so far from normal! Recognizing bad data takes experience. Determining the origin of suspect data is particularly important, to be sure that rejection is justified. Here are some examples:

The anomalous data don't look bad, but they are too deep by 750 m. This is a key number: sound travels at about 1500 m/s in water, and an echo has to travel down to the ocean floor and back, so each second of round-trip travel time corresponds to 750 m of depth. That is, sound returning to the ship one second after it was generated implies a water depth of 750 m. We often generate sound pulses once per second, so it's easy to make a 750 m mistake.
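The arithmetic behind the 750 m figure can be sketched in a few lines. The travel times below are hypothetical values chosen for illustration; only the 1500 m/s sound speed and the one-second ping interval come from the text.

```matlab
c = 1500;                  % nominal sound speed in seawater, m/s
ping_interval = 1;         % seconds between sound pulses
twt = [0.8 1.8 2.8];       % hypothetical two-way travel times, s

depth = c*twt/2;           % halve: the sound goes down and comes back

% Matching an echo to the wrong ping shifts the travel time by one
% ping interval, which shifts the depth by:
depth_error = c*ping_interval/2;
```

Here each successive entry of depth differs from the previous one by exactly depth_error, which is why a record that is off by one ping looks perfectly reasonable apart from the constant offset.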

A good example of how anomalous data can lead to discovery was presented by Lord Rayleigh. In an investigation of the density of nitrogen, he collected the data shown below. These data don't look much different from each other, but take a closer look:

The scatter in these data certainly does not look like what would be expected from sampling a single population, and the distribution of the data does not look like it could be caused by measurement error. In fact, the higher weights come from nitrogen extracted from air, and the lower weights come from nitrogen produced from chemical compounds. Rayleigh used these data to show that another element was present in air: argon.

Box and whisker plot
Another important plot for preliminary analysis is the box and whisker plot. At least five statistical values are plotted (typically the minimum, lower quartile, median, upper quartile, and maximum) to get a quick look at some basic statistics of data samples. The box shows the region containing the middle half of the data, and the line inside the box shows the middle, or median, value.
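As a rough sketch, the five values behind a box and whisker plot can be computed with nothing more than sort and median. The sample below is made up, and the quartile rule used here (the median of each half, leaving out the middle value when the count is odd) is only one of several conventions, so other software may give slightly different quartiles.

```matlab
x = [7 1 3 9 5 4 8 2 6];            % hypothetical sample
s = sort(x);
n = numel(s);

lo_half = s(1:floor(n/2));          % values below the median position
hi_half = s(ceil(n/2)+1:end);       % values above the median position

% [minimum, lower quartile, median, upper quartile, maximum]
five = [s(1), median(lo_half), median(s), median(hi_half), s(end)];
```

The box spans five(2) to five(4), the line inside it sits at five(3), and the whiskers reach toward five(1) and five(5).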

Box and whisker plots are most informative when comparing different samples. The plots above show what might be expected from samples of experiments with a Poisson distribution.

The box and whisker plot for Lord Rayleigh’s data looks like this: This plot tells us that the distribution of these data is weird, and we should look closely at it to see what’s going on.

Histograms
Histograms are used to plot the frequency of occurrence of particular events. For example, the time between large earthquakes in the Aleutian subduction zone. (These are not real data.) We will see more of these plots as we study the basics of statistics and probability.
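Counting how many events fall in each bin is all a histogram needs. A minimal sketch follows, with made-up recurrence intervals and an arbitrary 10-year bin width (both are assumptions for illustration, not values from the lecture).

```matlab
dt = [12 35 7 18 22 15 31 9 27 14];   % made-up intervals between events, years
binwidth = 10;                         % assumed bin width, years

idx = floor(dt/binwidth) + 1;          % bin 1 is [0,10), bin 2 is [10,20), ...
counts = accumarray(idx(:), 1)';       % number of events in each bin
centers = ((1:numel(counts)) - 0.5)*binwidth;

% bar(centers, counts)                 % uncomment to draw the histogram
```

The choice of bin width changes the look of the plot considerably, so it is worth trying a few values before reading much into the shape.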

Smoothing
As we saw earlier, filtering, which we will discuss in detail later, can go a long way to separate signals from noise. Different filters have different functions and characteristics. Functions in exploratory analysis include removal of occasional bad data points, removal of high-frequency noise, and removal of trends. Removal of occasional bad points is best done by a median filter. This filter looks at three consecutive points at a time, replacing the middle point by the median of the three. This is a very effective filter for removing noise spikes in data, as long as the spikes are separated by more than 1 point.

In the Scripps Pier data, I’ve replaced occasional data points by zeroes. This is a common problem in real data. Application of the median filter adds a small amount of noise while removing the spikes.

After applying the median filter, the data are (nearly) back to normal. The MATLAB function for this operation is medfilt1(x,3), where x is the data and 3 is the number of points in the window. (medfilt1 is part of the Signal Processing Toolbox.)
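If the toolbox function is not available, a three-point median filter is easy to write by hand. The sketch below uses a short made-up record with two zero dropouts; leaving the two end points unchanged is an assumption of this sketch, and toolbox functions may treat the ends differently.

```matlab
x = [10 11 0 12 13 12 0 11 10];    % made-up record with two dropouts (zeros)
y = x;                              % end points are left unchanged
for i = 2:numel(x)-1
    y(i) = median(x(i-1:i+1));      % middle value of each 3-point window
end
```

Both zeros are gone from y, but some good points next to them have changed slightly as well, which is the small amount of added noise mentioned above.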

Smoothing can also involve the removal of high-frequency noise, as we did to isolate the tidal signal in the Scripps Pier data. A Hanning filter can do this by marching through the data 3 points at a time, weighting the middle point higher than the ones on each side:

    y(i) = 0.25*x(i-1) + 0.5*x(i) + 0.25*x(i+1)

Note: Wessel's notes are not quite correct for this equation. This filter works poorly for spikes in the data, spreading the spikes out rather than removing them.
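To see the spreading effect on a spike, here is a minimal sketch assuming the usual three-point Hanning weights of 1/4, 1/2, 1/4 and a made-up data vector with a single spike.

```matlab
w = [1 2 1]/4;                  % assumed weights: 1/4, 1/2, 1/4
x = [10 10 10 30 10 10 10];     % made-up data with one spike at point 4

% 'same' keeps the output the same length as x; the two end points are
% computed as if the data were padded with zeros, so ignore them.
y = conv(x, w, 'same');
```

The spike of height 30 at point 4 becomes a lower, wider bump covering points 3 to 5 in y. A three-point median filter applied to the same x would remove it entirely.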

Residual plots
Often data can be divided into parts: a smooth trend and a higher-frequency signal. Your signal could be either the trend or the residual after the trend is removed. To remove a linear trend from data, you could pick two points to define the line, (x1, y1) and (x2, y2). The linear trend is then:

    y = y1 + (y2 - y1)/(x2 - x1) * (x - x1)

This is the equation of a straight line, which can be subtracted from the data.
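A small sketch of two-point detrending, using made-up data and taking the first and last samples as the two defining points (that choice is an assumption; any two representative points could be used).

```matlab
x = 0:5;
y = [1.0 3.2 4.9 7.1 9.0 11.1];     % made-up data with a rising trend

x1 = x(1);   y1 = y(1);              % first chosen point
x2 = x(end); y2 = y(end);            % second chosen point

trend = y1 + (y2 - y1)/(x2 - x1)*(x - x1);   % straight line through both
resid = y - trend;                   % what is left after removing the line
```

By construction the residual is zero at the two chosen points, so the result depends on which points are picked. A least-squares line, for example from polyfit(x, y, 1), is a common alternative that uses all of the data.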

The trend need not be linear, and other functions can be tried to remove a trend, such as √y, log(y), y^2, etc. - whatever fits. Often, a good understanding of why the trend is there can aid in its removal. Let's run a MATLAB program Dr. Wessel wrote, gg313_EDA.m, to display some of the topics we've been discussing.