Exploring EDA 1 iCSC2015,Vince Croft, NIKHEF -Nijmegen Exploring EDA, Clustering and Data Preprocessing Lecture 1 Exploring EDA Vincent Croft NIKHEF -

Slides:



Advertisements
Similar presentations
Lecture 2 Summarizing the Sample. WARNING: Today’s lecture may bore some of you… It’s (sort of) not my fault…I’m required to teach you about what we’re.
Advertisements

Probability Distributions CSLU 2850.Lo1 Spring 2008 Cameron McInally Fordham University May contain work from the Creative Commons.
Statistics 100 Lecture Set 7. Chapters 13 and 14 in this lecture set Please read these, you are responsible for all material Will be doing chapters
Random Sampling and Data Description
Theoretical Probability Distributions We have talked about the idea of frequency distributions as a way to see what is happening with our data. We have.
Math 20: Foundations FM20.6 Demonstrate an understanding of normal distribution, including standard deviation and z-scores. FM20.7 Demonstrate understanding.
What is Statistical Modeling
Statistics [0,I/2] The Essential Mathematics. Two Forms of Statistics Descriptive Statistics What is physically happening within the data? Inferential.
BSc/HND IETM Week 9/10 - Some Probability Distributions.
Evaluating Hypotheses
Statistics Intro Univariate Analysis Central Tendency Dispersion.
Those who don’t know statistics are condemned to reinvent it… David Freedman.
Intro to Statistics for the Behavioral Sciences PSYC 1900 Lecture 4: The Normal Distribution and Z-Scores.
Business Statistics: A First Course, 5e © 2009 Prentice-Hall, Inc. Chap 6-1 Chapter 6 The Normal Distribution Business Statistics: A First Course 5 th.
Measures of Central Tendency
1 Psych 5500/6500 Statistics and Parameters Fall, 2008.
1. Homework #2 2. Inferential Statistics 3. Review for Exam.
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 12 Describing Data.
Introduction to Statistics Measures of Central Tendency.
Describing distributions with numbers
Chapter 1 Descriptive Analysis. Statistics – Making sense out of data. Gives verifiable evidence to support the answer to a question. 4 Major Parts 1.Collecting.
Quantitative Skills: Data Analysis and Graphing.
Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.
Tutor: Prof. A. Taleb-Bendiab Contact: Telephone: +44 (0) CMPDLLM002 Research Methods Lecture 9: Quantitative.
Taking Raw Data Towards Analysis 1 iCSC2015, Vince Croft, NIKHEF Exploring EDA, Clustering and Data Preprocessing Lecture 2 Taking Raw Data Towards Analysis.
Class Meeting #11 Data Analysis. Types of Statistics Descriptive Statistics used to describe things, frequently groups of people.  Central Tendency 
Covariance and correlation
6.1 What is Statistics? Definition: Statistics – science of collecting, analyzing, and interpreting data in such a way that the conclusions can be objectively.
Chapter 3: Central Tendency. Central Tendency In general terms, central tendency is a statistical measure that determines a single value that accurately.
Statistics. Intro to statistics Presentations More on who to do qualitative analysis Tututorial time.
Statistics & Biology Shelly’s Super Happy Fun Times February 7, 2012 Will Herrick.
Topics: Statistics & Experimental Design The Human Visual System Color Science Light Sources: Radiometry/Photometry Geometric Optics Tone-transfer Function.
Quantitative Skills 1: Graphing
Variability The goal for variability is to obtain a measure of how spread out the scores are in a distribution. A measure of variability usually accompanies.
NOTES The Normal Distribution. In earlier courses, you have explored data in the following ways: By plotting data (histogram, stemplot, bar graph, etc.)
Are You Smarter Than a 5 th Grader?. 1,000,000 5th Grade Topic 15th Grade Topic 24th Grade Topic 34th Grade Topic 43rd Grade Topic 53rd Grade Topic 62nd.
1 Statistical Distribution Fitting Dr. Jason Merrick.
Applied Quantitative Analysis and Practices LECTURE#11 By Dr. Osman Sadiq Paracha.
TYPES OF STATISTICAL METHODS USED IN PSYCHOLOGY Statistics.
QUANTITATIVE RESEARCH AND BASIC STATISTICS. TODAYS AGENDA Progress, challenges and support needed Response to TAP Check-in, Warm-up responses and TAP.
Measures of central tendency are statistics that express the most typical or average scores in a distribution These measures are: The Mode The Median.
NORMAL DISTRIBUTION AND ITS APPL ICATION. INTRODUCTION Statistically, a population is the set of all possible values of a variable. Random selection of.
Chapter 11 Univariate Data Analysis; Descriptive Statistics These are summary measurements of a single variable. I.Averages or measures of central tendency.
Probability Course web page: vision.cis.udel.edu/cv March 19, 2003  Lecture 15.
Chapter 4: Variability. Variability Provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.
Engineering Statistics KANCHALA SUDTACHAT. Statistics  Deals with  Collection  Presentation  Analysis and use of data to make decision  Solve problems.
Statistical Analysis Topic – Math skills requirements.
CY1B2 Statistics1 (ii) Poisson distribution The Poisson distribution resembles the binomial distribution if the probability of an accident is very small.
UNIT #1 CHAPTERS BY JEREMY GREEN, ADAM PAQUETTEY, AND MATT STAUB.
1 ECE310 – Lecture 22 Random Signal Analysis 04/25/01.
Data Analysis, Presentation, and Statistics
Statistics Josée L. Jarry, Ph.D., C.Psych. Introduction to Psychology Department of Psychology University of Toronto June 9, 2003.
Exploratory data analysis, descriptive measures and sampling or, “How to explore numbers in tables and charts”
CHAPTER 11 Mean and Standard Deviation. BOX AND WHISKER PLOTS  Worksheet on Interpreting and making a box and whisker plot in the calculator.
Describing Data Week 1 The W’s (Where do the Numbers come from?) Who: Who was measured? By Whom: Who did the measuring What: What was measured? Where:
AP Statistics. Chapter 1 Think – Where are you going, and why? Show – Calculate and display. Tell – What have you learned? Without this step, you’re never.
Exploratory Data Analysis
INTRODUCTION TO STATISTICS
MATH-138 Elementary Statistics
Review 1. Describing variables.
Basic Statistics Overview
Description of Data (Summary and Variability measures)
Introduction to Summary Statistics
AP Exam Review Chapters 1-10
Introduction to Statistics
MEDIATED MOOCS Introduction to descriptive statistics
Welcome!.
Lecture 1: Descriptive Statistics and Exploratory
Descriptive Statistics
Advanced Algebra Unit 1 Vocabulary
Presentation transcript:

Exploring EDA 1 iCSC2015,Vince Croft, NIKHEF -Nijmegen Exploring EDA, Clustering and Data Preprocessing Lecture 1 Exploring EDA Vincent Croft NIKHEF - Nijmegen Inverted CERN School of Computing, February 2015

Exploring EDA 2 iCSC2015,Vince Croft, NIKHEF -Nijmegen A picture tells a thousand words.  Before writing language or even words; people conveyed ideas with pictures.  Pictures Represent a summary of our interpretation of our world.  What are some methods we can use to convey the maximum possible understanding from our data without loss of information?  First we must understand our data

Exploring EDA 3 iCSC2015,Vince Croft, NIKHEF -Nijmegen Probability vs. Statistics  Not the same thing…  Probability teaches us how to win big money in casinos.  Statistics shows that people don’t win big money in casinos.  Statistics is how we learn from past experiences  Exploratory Data Analysis is concerned with how to best learn from what data we have.

Exploring EDA 4 iCSC2015,Vince Croft, NIKHEF -Nijmegen Summary of things to come.  Visualization Basics.  What does data look like?  Understanding variables and distributions.  Manipulating Data.  Range, outliers, binning.  Transformations.  Adding Variables  Extracting hidden information  Correlation, Covariance, Dependence  Intro to MVA  Adding more variables, more information, and a gateway to lecture 2

Exploring EDA 5 iCSC2015,Vince Croft, NIKHEF -Nijmegen This Lecture is Brought to you by the letter R  R is a free open source programming language for statistics and data visualisation.  Simpler to learn then other languages such as python but more versatile then point and click programs such as SPSS  Many lectures and tutorials on the subject of EDA use examples given in R

Exploring EDA 6 iCSC2015,Vince Croft, NIKHEF -Nijmegen Worked Examples  All examples will be available online  If you are not here in person or want to see the examples presented for yourself please see the support documentation on my institute web page

Exploring EDA 7 iCSC2015,Vince Croft, NIKHEF -Nijmegen Other Resources  Coursera  “Exploratory Data Analysis” by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD  Udacity  “Data Analysis with R” by Facebook  Udacity  Intro to Hadoop and MapReduce by cloudera  Methods of Multivariate Analysis  Alvin C Rencher

Exploring EDA 8 iCSC2015,Vince Croft, NIKHEF -Nijmegen What does data look like?  Everyone believes data  No-one believes numbers  Images must reflect the data in the way that conveys the desired message.  You can sell most ideas with the power of a pie chart…  …But you can’t find Higgs with one.

Exploring EDA 9 iCSC2015,Vince Croft, NIKHEF -Nijmegen Types of plots - Pie Chart  Shows proportions of groupings relative to a whole

Exploring EDA 10 iCSC2015,Vince Croft, NIKHEF -Nijmegen Types of plots – Histogram  Histogram  Shows Frequency of occurrence  Easy to see proportion  Easy to interpret (with some practice)  Used to estimate the probability density of a continuous variable (advanced)

Exploring EDA 11 iCSC2015,Vince Croft, NIKHEF -Nijmegen Types of plots – Scatter Plot  Shows Relationship of 2 variables.  We shall return to these later.

Exploring EDA 12 iCSC2015,Vince Croft, NIKHEF -Nijmegen Types of plots – Box Plot  Shows Spread of variables  Useful for comparisons  More commonly used for Probabilistic interpretation of data.

Exploring EDA 13 iCSC2015,Vince Croft, NIKHEF -Nijmegen Types of plots – Heat Map  Heat maps show the level of a single variable varies across a 2D plane  Useful for recognising interesting points in the plane  Often very intuitive

Exploring EDA 14 iCSC2015,Vince Croft, NIKHEF -Nijmegen Information contained in a plot  Comparing Mean, Median and Mode  Some plots represent more information than others.  A bar graph can only compare single values  A histogram represents a sample of an underlying probability density distribution  The mean value is the most probable next value given the values given...  Useful for predictions.

Exploring EDA 15 iCSC2015,Vince Croft, NIKHEF -Nijmegen Variance  Though a pie chart can often be very good at conveying a summary of data, it says little of the distribution.  The variance gives a measure of how accurately summaries of the data such as the mean represent the actual data.  The variance of a histogram is seen in the spread of points.

Exploring EDA 16 iCSC2015,Vince Croft, NIKHEF -Nijmegen Variance Continued  Each variable, each distribution and each set of measurements has a variance.  The variance is a description of how stable that variable is.  E.g. if a variable is erratic and all measurements seem unrelated to each other it has a large variance.  The variance is related to how accurately we can predict the value of a variable

Exploring EDA 17 iCSC2015,Vince Croft, NIKHEF -Nijmegen Common Distributions - Gauss  Also known as ‘Normal’ distribution or Bell curve.  One of the most commonly seen distributions in nature.  Mean=Median=Mode

Exploring EDA 18 iCSC2015,Vince Croft, NIKHEF -Nijmegen Common Distributions - Exponential  Commonly seen in lifetimes.  Represents the time between two independent and random events.  Memoryless  A good model for many things from radioactive decay to requests for documents on a web server.

Exploring EDA 19 iCSC2015,Vince Croft, NIKHEF -Nijmegen Displaying your data  Range – focus on interesting features  Binning – What represents the data best?  Error in measurement  More bins then possible values?

Exploring EDA 20 iCSC2015,Vince Croft, NIKHEF -Nijmegen Noise – Bias – Sampling Error  If a graph looks noisy, most likely you have too many bins for the data you’re plotting  Using too few bins increases likelihood of introducing a bias (plot doesn’t represent the true distribution)  Variance is a measure of how well we can predict a value. If we hide this feature of the data by increasing bin size then we risk loosing information.  Noise or Variance? It’s sometimes a tough decision!

Exploring EDA 21 iCSC2015,Vince Croft, NIKHEF -Nijmegen Transformations  Division – Binning  You can scale one axis or change the binning.  You can divide all values by another set of values…  Log Scale  y – focus on interesting features that happen in tails of the distribution  Others  Square Root  1/x

Exploring EDA 22 iCSC2015,Vince Croft, NIKHEF -Nijmegen Return to the Scatter Plot  Shows Relationship of 2 variables.

Exploring EDA 23 iCSC2015,Vince Croft, NIKHEF -Nijmegen Extracting Information  Finding the gradient of the distribution  Looks like husbands are generally older than their wives?  Lets generate some distribution using the standard creepiness rule…

Exploring EDA 24 iCSC2015,Vince Croft, NIKHEF -Nijmegen Marginal Distribution  Let the whole distribution fall onto one axis  Used to obtain 1 Dimensional Properties from multidimensional distributions.  Found by summing all the variables in a table along either rows or columns.

Exploring EDA 25 iCSC2015,Vince Croft, NIKHEF -Nijmegen 2D Transformations  If we think of variables as measurements taken from a certain position, transformations can be used to see measurements from a different perspective.  Useful information can be extracted from the transformed distribution.  Transformed variables might have some physical meaning or demonstrate some interesting feature.

Exploring EDA 26 iCSC2015,Vince Croft, NIKHEF -Nijmegen Correlation and Covariance  Correlation and covariance is the degree to which we expect one variable to behave given the action of another.  e.g. taller people usually weigh more. Therefore human height and weight co-vary and are correlated  Both Correlation and Covariance describe the deviation of variables away from the mean  Covariance depends on the scale of the measurement. Has units!  Correlation is a standardised covariance such that it can be measured between -1 and 1 without units.

Exploring EDA 27 iCSC2015,Vince Croft, NIKHEF -Nijmegen Covariance and Dependence  When looking at more than one variable almost invariably we are interested in seeing their relationship.  Variables can be related to an underlying property (such as the angle between the vectors)  Or can be directly dependent on each other  Covariance assesses relationship between the variance of two variables. cov(X,Y)=cov(X)  cov(Y) if independent.  Covariance has units!

Exploring EDA 28 iCSC2015,Vince Croft, NIKHEF -Nijmegen Correlation and Causation  Correlation between 2 variables intuitively implies that the two distributions are linked.  Correlation is the departure of 2 or more variables from independence.  Correlation implies shared information.  The two variables don’t necessarily cause each other nor that both are caused by a mutual cause or it could just be coincidence. See for interesting correlationshttp://

Exploring EDA 29 iCSC2015,Vince Croft, NIKHEF -Nijmegen Characterising 2D Data  2 Variables such as X and Y can be considered as two vectors of measurements.  These vectors can be mapped to the x and y axis of a scatter plot.  The means and variances of each can be extracted from the marginal distributions of this plot  The correlation between these plots can be understood as the cosine of the angle between the vectors X and Y

Exploring EDA 30 iCSC2015,Vince Croft, NIKHEF -Nijmegen Multivariate Analysis.  Every thing that applies to 2 variables applies to N variables.  The Histogram that became a scatter plot now becomes a heat map in 3 dimensions.  We can use transformations to reduce the number of dimensions  People don’t understand MVA in more than 3D so understanding data manipulation becomes very important.