IPlant G-to-P test case for visualization: Model parameter estimation for QTL analysis Prepared by Jeff White out of Feb 16-17 working group meeting in.


iPlant G-to-P test case for visualization: Model parameter estimation for QTL analysis
Prepared by Jeff White, USDA ARS, ALARC, out of the Feb 16-17 working group meeting in Kansas City
Basic data set is Maize NAM lines
– 27 populations x ~200 lines
– 11 environments (6 sites x 2 years, except Puerto Rico)
Parameters estimated for CSM-CERES-Maize
– P1: Was used as a surrogate for earliness per se, but is actually the duration of the juvenile phase
– P2: Determines the degree of delay for daylengths longer than the critical short daylength
Phenotypic data were provided by the Maize NAM project on the understanding that they would not be redistributed or published until their phenology paper comes out. Thus, a SAS program is available, but we'd need to check with Ed Buckler and Jim Holland before I provide that data file.

About the graphs
These are all pretty simple Y vs X type graphs.
A key feature that is not shown is the ability to drill down by identifying points or clouds of points (e.g., a single line or location).
The first series of plots are all observed vs simulated.
The second series deals with the coefficients and prediction error (as RMSE).

Observed vs predicted: locations
[Plot labels: "1:1"; "What are these?"]
1. I added the 1:1 line.
2. Ideally it might have the linear regression, or even regressions by location.
3. For any point or cluster, one would want the population, line, & year. For example, the chain of blue points…
4. This graph may be misleading because many points are overlain. One option would be to start with a density plot.
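The regression suggested in point 2 could be sketched as below. This is a minimal illustration, not the SAS code behind the slides: it assumes observed values go on the Y axis and simulated values on the X axis, `fit_line` is a hypothetical helper, and the days-to-anthesis numbers are invented. A slope near 1 and an intercept near 0 would mean the fitted line lies close to the 1:1 line.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two or more paired values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented days-to-anthesis values, for illustration only.
simulated = [60.0, 65.0, 70.0, 75.0, 80.0]
observed = [62.0, 66.0, 70.0, 74.0, 78.0]

slope, intercept = fit_line(simulated, observed)
print(f"observed = {intercept:.2f} + {slope:.2f} * simulated")
```

Running the same function once per location would give the "regressions by location" mentioned in the same point.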

Observed vs predicted: environments
[Plot label: "1:1"]
1. Here locations are subdivided by year.
2. Note the poor handling of the legend by SAS Gplot.
3. For any point or cluster, one would want the population, line, & year. For example, the chain of blue points…
Curious difference between Aurora in 2006 and [...]. Why?

Observed vs predicted: NY by two years
[Plot labels: "1:1"; "What are these?"]
1. Just looking at NY datasets.
2. Scale could have been resized, although conserving the scale helps in comparisons across plots.
3. Again, for any point or cluster, one would want the population, line, & year. For example, the chain of blue points…
4. This graph may be misleading because many points are overlain. One could use open and closed symbols, or allow toggling by year to highlight points.

Observed vs predicted: populations
[Plot labels: "1:1"; "What are these?"]
1. I added the 1:1 line.
2. It looks like there is a major problem with population 26. Could it be that the trusty Dr. White forgot to calibrate this population? Or that GenCalc failed to converge…?
3. Again, there are interesting chains. The circled one is NY, [...].
4. This graph is very ugly because many points are overlain. Could we toggle populations on and off with check boxes?

Deviations of simulated - observed: populations
[Plot labels: "1:1"; "What are these?"]
1. Deviation plots provide a different perspective.
2. Again, it looks like there is a major problem with population 26, but the wider spread of points allows one to see possible problems with populations 5 and 27.
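One way to turn the deviation plot into a quick screen for problem populations is to average simulated minus observed within each population. The sketch below is illustrative only: the record layout, the helper name, the 5-day threshold, and all values are assumptions, not taken from the NAM data.

```python
from collections import defaultdict


def mean_deviation_by_population(records):
    """Average (simulated - observed) for each population.

    records: iterable of (population, observed, simulated) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for population, observed, simulated in records:
        sums[population] += simulated - observed
        counts[population] += 1
    return {pop: sums[pop] / counts[pop] for pop in sums}


# Invented values, for illustration only.
records = [
    (1, 70.0, 71.0),
    (1, 72.0, 71.0),
    (26, 70.0, 82.0),
    (26, 74.0, 84.0),
]

bias = mean_deviation_by_population(records)
# Hypothetical threshold: flag populations whose mean deviation exceeds 5 days.
flagged = sorted(pop for pop, dev in bias.items() if abs(dev) > 5.0)
print(bias, flagged)
```

A mean deviation hides scatter, so a population can pass this screen and still fit poorly; the plot remains the better diagnostic.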

From here onward, there is a change in datasets. The next series is based on the two fitted model coefficients and associated data.
Model parameters
– P1 = length of the juvenile phase
– P2 = photoperiod sensitivity
Associated data
– RMSE = root mean square error of prediction
– No. observ = number of observations that the optimization program used. The maximum possible number is 11.
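For reference, RMSE and the number of observations for a single line could be computed as sketched below. This assumes the usual definition (square root of the mean squared difference between simulated and observed values) and that missing observations are skipped. The optimization program may define or weight it differently, and the values are invented.

```python
import math


def rmse(observed, simulated):
    """Return (RMSE, number of usable pairs), skipping missing values."""
    pairs = [
        (o, s)
        for o, s in zip(observed, simulated)
        if o is not None and s is not None
    ]
    if not pairs:
        raise ValueError("no paired observations")
    mean_sq = sum((s - o) ** 2 for o, s in pairs) / len(pairs)
    return math.sqrt(mean_sq), len(pairs)


# Invented days to anthesis for one line; None marks a missing environment.
observed = [70.0, 72.0, None, 75.0, 80.0]
simulated = [73.0, 68.0, 71.0, 75.0, 80.0]

error, n_obs = rmse(observed, simulated)
print(error, n_obs)  # 2.5 4
```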

P1 vs P2 - Populations
1. Note suspicious clumping of values.
2. No clear trends of P1 in relation to P2, which is generally good.
3. It would be nice to highlight individual populations.
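The clumping could be checked numerically by counting how often each fitted value repeats. The sketch below is only an illustration: the P1 values are invented, and rounding to two decimals is an arbitrary choice.

```python
from collections import Counter


def value_counts(values, decimals=2):
    """Count occurrences of each value after rounding."""
    return Counter(round(v, decimals) for v in values)


# Invented P1 values, for illustration only.
p1_values = [200.0, 200.0, 200.0, 215.5, 230.0, 230.0, 241.3]

counts = value_counts(p1_values)
clumps = {value: n for value, n in counts.items() if n > 1}
share = sum(clumps.values()) / len(p1_values)
print(clumps, round(share, 2))
```

A large share of lines on a few repeated values would be consistent with the optimizer stopping at starting values or at steps of its search grid, though that interpretation is speculative.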

RMSE vs P1: populations
1. Again note suspicious clumping of P1 values.
2. Slight trend of increasing RMSE with P1, which makes sense.
3. It would be nice to highlight individual populations.

RMSE vs P2: populations
1. Again note suspicious clumping of values.
2. No trend of increasing RMSE with P2, which makes sense.
3. It would be nice to highlight individual populations.

RMSE vs number of observations: populations
[Plot label: "Why is this point so far off?"]
1. Suggests goodness of fit declines with fewer than 7 observations but perhaps gets slightly better as observations increase.
2. It would be nice to highlight individual populations or pull up underlying data of individual points.

RMSE vs slope of observed vs simulated for each line: populations
1. Suspicious clumping of values.
2. It would be nice to highlight individual populations.

Slope of observed vs simulated for each line: first three populations
1. Clumping problem remains.
2. Limiting to three populations shows interesting differences.
3. But again, there may be problems with data points that sit on top of each other.
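The per-line slopes plotted in these two slides could be obtained roughly as follows. This sketch assumes each line's slope comes from regressing observed on simulated values across its environments. The record layout, helper names, minimum of three points, and data are all hypothetical.

```python
from collections import defaultdict


def slope(x, y):
    """Least-squares slope of y on x, or None if x has no spread."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return None
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def slopes_by_line(records, min_points=3):
    """records: iterable of (line_id, observed, simulated) tuples."""
    grouped = defaultdict(list)
    for line_id, observed, simulated in records:
        grouped[line_id].append((simulated, observed))
    result = {}
    for line_id, pairs in grouped.items():
        if len(pairs) < min_points:
            continue  # too few environments for a meaningful slope
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
        result[line_id] = slope(xs, ys)
    return result


# Invented values: (line, observed, simulated).
records = [
    ("A", 60.0, 60.0), ("A", 70.0, 70.0), ("A", 80.0, 80.0),
    ("B", 65.0, 60.0), ("B", 70.0, 70.0), ("B", 75.0, 80.0),
    ("C", 70.0, 68.0), ("C", 72.0, 74.0),
]
line_slopes = slopes_by_line(records)
print(line_slopes)
```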

Array for viewing very large sets of observed or simulated phenotypes – see next page
Vertical axis is populations (1 to 27)
– Within each population the rows are ordered by location and year:
  01 Aurora, NY 2006 Summer NY
  02 Aurora, NY 2007 Summer NY
  03 Clayton, NC 2006 Summer NC
  04 Clayton, NC 2007 Summer NC
  05 Columbia, MO 2006 Summer MO
  06 Columbia, MO 2007 Summer MO
  07 Urbana, IL 2006 Summer IL
  08 Urbana, IL 2007 Summer IL
  09 Homestead, FL 2006 Winter FL
  10 Homestead, FL 2007 Winter FL
  11 Puerto Rico 2006 Winter PR
The horizontal axis is ordered by mean time to anthesis across all environments (location x year).
Each symbol is a binned value of observed days to anthesis. White spaces indicate missing values.
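The layout described above can be mocked up in text form for a single population. This is a rough sketch under stated assumptions: the bin edges, symbols, environment labels, line names, and values are all invented, and the original plot was produced in SAS rather than with this code.

```python
def bin_symbol(days, edges, symbols):
    """Map days to anthesis to a symbol; a blank marks a missing value."""
    if days is None:
        return " "
    for edge, symbol in zip(edges, symbols):
        if days < edge:
            return symbol
    return symbols[-1]


def render_array(data, environments, edges, symbols):
    """data: {line_id: {environment: days or None}}.

    Columns are lines ordered by mean days to anthesis across
    environments; each row is one environment.
    """
    def mean_days(line_id):
        values = [v for v in data[line_id].values() if v is not None]
        return sum(values) / len(values) if values else float("inf")

    ordered = sorted(data, key=mean_days)
    rows = [
        "".join(bin_symbol(data[line].get(env), edges, symbols) for line in ordered)
        for env in environments
    ]
    return ordered, rows


# Invented example: three lines, two environments.
data = {
    "late": {"NY2006": 85.0, "NY2007": 82.0},
    "early": {"NY2006": 65.0, "NY2007": None},
    "mid": {"NY2006": 75.0, "NY2007": 72.0},
}
edges = [70.0, 80.0]        # hypothetical bin boundaries in days
symbols = [".", "o", "#"]   # one more symbol than edges

ordered, rows = render_array(data, ["NY2006", "NY2007"], edges, symbols)
for env, row in zip(["NY2006", "NY2007"], rows):
    print(env, "|" + row + "|")
```

Stacking one such block per population, in population order, would approximate the full array.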


Concluding remarks
The basic principle in the first two sets of examples is Y vs X with the ability to drill down to subsets of data, especially to identify specific populations, locations or lines (factors that describe the data).
The third example is more speculative and its real value is unclear. The objective is to provide a quick overview of large arrays of data such as the maize NAM observed anthesis data. The patterns would be clearer with better scaling and color coding of the "bins". GIS software is much better at this than SAS.