Multivariate environmental characterization of samples


Multivariate environmental characterization of samples
Landscape genomics, Physalia courses, November 6-10, 2017, Berlin
Dr Stéphane Joost – Oliver Selmoni (MSc)
Laboratory of Geographic Information Systems (LASIG)
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
ESF CONGENOMICS RNP Spring School - March 2014 – Stéphane Joost, Kevin Leempoel, Sylvie Stucki

Maximising environmental contrast
Landscape genomics works best when environmental contrast is maximised between sampling locations.
This is a first rational reason to identify optimal sampling locations and to develop adapted methods.
There is another reason, another constraint: the recent availability of whole genome sequence (WGS) data requires reconsidering sampling strategies in landscape genomics for economic reasons.
While ten years ago we had many individuals and few genetic markers, we now face the opposite situation: the high cost of WGS limits the number of samples that can be sequenced.
Molecular resolution is excellent, but it is achieved at the expense of spatial representativeness and statistical robustness (we must remove sampling points).
Therefore, starting from a standard sampling design, it becomes compulsory to apply sub-sampling strategies that retain most of the environmental variation.

Moroccan example (Nextgen)
To study local adaptation of goat and sheep breeds in Morocco, we used a sampling design based on a regular grid across the territory.
In each cell of this grid, 3 individuals were sampled in 3 different farms.
The final subset selected for sequencing then had to meet two criteria in order to ensure a regular cover of both the physical (x, y) and the environmental (z) space:
- using a stratified sampling technique over a range of climatic variables previously filtered by a PCA
- minimising a clustering index in order to ensure spatial spread
In Morocco, this sub-sampling procedure, based on a hierarchical clustering, resulted in two datasets: 162 goats selected out of 1,283, and 162 sheep out of 1,412, based on variables such as temperature, pluviometry and solar radiation.
By maximising the environmental information collected, we were able to select the individuals that are the most relevant for studying adaptation.

Approach
Individuals must be continuously distributed along environmental gradients so as to sample the space of climatic parameters regularly.
Taking only climatic extremes into account does not provide adequate information to calibrate generalized linear models and is not useful for population genetics.
We use PCA to transform K correlated environmental variables into K uncorrelated factors, the principal components.
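As an illustration, here is a minimal Python sketch of this transformation step. It is not the original analysis code: the matrix `X` and its dimensions are hypothetical placeholders for the climate data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: farms x environmental variables matrix (hypothetical example data)
rng = np.random.default_rng(42)
X = rng.normal(size=(432, 117))          # e.g. 432 farms, 117 climate variables

# Standardize so that variables measured in different units are comparable
X_std = StandardScaler().fit_transform(X)

# PCA turns the K correlated variables into K uncorrelated components
pca = PCA()
scores = pca.fit_transform(X_std)        # farms projected onto the components
print(pca.explained_variance_ratio_[:5]) # variance explained by the first axes
```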

The role of PCA
The first axis represents the direction along which the points vary the most.
Each following axis explains, in turn, the maximum of the remaining variance while being orthogonal to the previous ones.
The components are therefore ordered according to the variance they explain.

The role of PCA
The PCA on environmental variables provides a synthetic picture of the climate.
It allows us to choose samples "as different as possible": the first principal axes explain most of the climatic variability while limiting random noise.
The Euclidean distance in this subspace makes it possible to define an ecological distance between the farms.
This limits the weight given to highly correlated factors when selecting the individuals to sample.
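Continuing the sketch above, the ecological distance could be computed as follows; the `scores` array comes from the previous sketch, and `k`, the number of retained components, is set here to the value reported later in the slides.

```python
from scipy.spatial.distance import pdist, squareform

k = 7                                     # number of retained components (see later slide)
eco_coords = scores[:, :k]                # farms in the reduced climatic space

# Pairwise Euclidean distances in the PCA subspace = ecological distances
eco_dist = squareform(pdist(eco_coords, metric="euclidean"))
print(eco_dist.shape)                     # (n_farms, n_farms)
```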

Hierarchical Ascendant Classification (HAC)
Choosing a limited number of points that represent habitat diversity involves grouping farms according to their similarities (based on ecological distances) and selecting one farm per group.
We used a HAC (agglomerative hierarchical clustering) with the Ward criterion.
It is an iterative method in which the two closest elements (points and/or groups) are merged at each step.
Ward's algorithm chooses which elements to merge by minimising the increase in within-class inertia.
This criterion tends to merge close classes containing few elements.
HAC is deterministic and the partitions generated at each iteration are nested, which is why the process is often represented as a tree (dendrogram).
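A minimal sketch of this step with SciPy's Ward linkage, reusing `eco_coords` from the previous sketch; the number of classes used for the cut is the one discussed in the final slides.

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Ward linkage on the farms' coordinates in the PCA subspace
Z = linkage(eco_coords, method="ward")

# Cut the tree into a chosen number of classes, e.g. 164 (see the last slides)
labels = fcluster(Z, t=164, criterion="maxclust")
print(len(set(labels)))                   # number of classes obtained
```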

Constrained draw (selection of points)
The grid-based sampling strategy provides a regular density of points over the territory.
Randomly choosing one farm per class may nevertheless introduce a spatial bias.
This is why the spatial representativeness of the selected samples is quantified with a distribution index (D).
We define D as the sum of the distances between each selected point and its nearest neighbour within the selection.
The higher D, the more distant the points are from each other and the better their spatial distribution.
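A possible implementation of this index, assuming the selected points are given as an array of (x, y) coordinates; the function name and example data are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def distribution_index(xy):
    """Sum of nearest-neighbour distances among the selected points (index D)."""
    tree = cKDTree(xy)
    dist, _ = tree.query(xy, k=2)         # k=2: the first hit is the point itself
    return dist[:, 1].sum()               # second column = nearest other point

# Example: D for 164 points drawn at random in a unit square
xy = np.random.default_rng(0).random((164, 2))
print(distribution_index(xy))
```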

Sheep in Morocco - PCA
432 farms with sheep, 117 environmental variables.
We kept the first 7 components (96% of the variance).
The space generated by these axes reliably represents the climatic conditions.
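A short sketch of how such a cut-off can be derived from the PCA fitted in the earlier sketch; the 96% threshold comes from the slide, the rest is illustrative.

```python
import numpy as np

# Cumulative proportion of variance explained by the ordered components
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the desired coverage (here 96%)
k = int(np.searchsorted(cum_var, 0.96)) + 1
print(k, cum_var[k - 1])
```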

Sheep in Morocco - HAC
The Euclidean distance in this space makes it possible to define an environmental distance between the farms with a view to aggregating them.
Some farms lie in the same climate cell and were grouped for the analysis, which leaves 229 distinct climatic conditions for sheep.
These groups of farms represent distinct points in the climatic space and served as the basic units for the HAC.
The following map shows the classification for 10 classes.
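One way to carry out this grouping, sketched with pandas on hypothetical data; the `cell_id` and `pc1`–`pc7` column names are ours, not from the original study.

```python
import numpy as np
import pandas as pd

# Hypothetical example: one row per farm, with its climate cell id and its
# coordinates on the retained principal components
rng = np.random.default_rng(1)
pc_cols = [f"pc{i}" for i in range(1, 8)]
farms = pd.DataFrame(rng.normal(size=(432, 7)), columns=pc_cols)
farms["cell_id"] = rng.integers(0, 229, size=432)

# Farms sharing a climate cell collapse into a single climatic condition
climate_units = farms.groupby("cell_id")[pc_cols].mean()
print(len(climate_units))                 # number of distinct climatic conditions
```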

HAC in 164 classes
As the budget allows WGS for 164 individuals, we apply a HAC to obtain 164 classes.
Each class contains 1-3 farms, i.e. 1-18 individuals.
Finally, we performed 50 random draws of one individual per class and retained the draw with the highest spatial distribution index (D).
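Putting the pieces together, a sketch of the constrained draw, under the assumption that `labels` assigns each candidate individual to one of the 164 classes and that `distribution_index` is the function sketched earlier; the individuals' x, y coordinates are simulated here.

```python
import numpy as np

rng = np.random.default_rng(2024)
coords_xy = rng.random((len(labels), 2))      # simulated x, y positions, one per individual

best_D, best_selection = -np.inf, None
for _ in range(50):                           # 50 random draws of one individual per class
    selection = np.array([rng.choice(np.flatnonzero(labels == c))
                          for c in np.unique(labels)])
    D = distribution_index(coords_xy[selection])
    if D > best_D:                            # keep the draw with the best spatial spread
        best_D, best_selection = D, selection

print(best_D, len(best_selection))            # 164 individuals retained for sequencing
```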

Exercise
Characterizing sampling points with multivariate environmental data