A B S T R A C T The study presents the application of selected chemometric techniques to the pollution monitoring dataset, namely, cluster analysis,

Slides:



Advertisements
Similar presentations
Aaker, Kumar, Day Seventh Edition Instructor’s Presentation Slides
Advertisements

PCA for analysis of complex multivariate data. Interpretation of large data tables by PCA In industry, research and finance the amount of data is often.
Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
Mutidimensional Data Analysis Growth of big databases requires important data processing.  Need for having methods allowing to extract this information.
An Introduction to Multivariate Analysis
INDICATOR EVALUATION An indicator of appropriate fertilization practices must fulfill some criteria (SAFE) : No Discriminating power time and space >
Dimension reduction (1)
Lecture 7: Principal component analysis (PCA)
1 Multivariate Statistics ESM 206, 5/17/05. 2 WHAT IS MULTIVARIATE STATISTICS? A collection of techniques to help us understand patterns in and make predictions.
Multivariate Methods Pattern Recognition and Hypothesis Testing.
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 14 Using Multivariate Design and Analysis.
Common Factor Analysis “World View” of PC vs. CF Choosing between PC and CF PAF -- most common kind of CF Communality & Communality Estimation Common Factor.
Factor Analysis Research Methods and Statistics. Learning Outcomes At the end of this lecture and with additional reading you will be able to Describe.
Factor Analysis There are two main types of factor analysis:
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
19-1 Chapter Nineteen MULTIVARIATE ANALYSIS: An Overview.
10/17/071 Read: Ch. 15, GSF Comparing Ecological Communities Part Two: Ordination.
PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Prof.Dr.Cevdet Demir
1 Pertemuan 21 Matakuliah: I0214 / Statistika Multivariat Tahun: 2005 Versi: V1 / R1 Analisis Struktur Peubah Ganda (I): Analisis Komponen Utama.
Tables, Figures, and Equations
Principal Component Analysis. Philosophy of PCA Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data.
Factor Analysis Psy 524 Ainsworth.
The Tutorial of Principal Component Analysis, Hierarchical Clustering, and Multidimensional Scaling Wenshan Wang.
Summarized by Soo-Jin Kim
Chapter 2 Dimensionality Reduction. Linear Methods
Principal Components Analysis BMTRY 726 3/27/14. Uses Goal: Explain the variability of a set of variables using a “small” set of linear combinations of.
CHAPTER 26 Discriminant Analysis From: McCune, B. & J. B. Grace Analysis of Ecological Communities. MjM Software Design, Gleneden Beach, Oregon.
ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles 組員:李祥豪 謝紹陽 江建霖.
Machine Learning Applications in Biological Classification of River Water Quality Saso Dzeroski, Jasna Grobovic and William J. Walley 조 동 연.
Data Reduction. 1.Overview 2.The Curse of Dimensionality 3.Data Sampling 4.Binning and Reduction of Cardinality.
Measurement Models: Exploratory and Confirmatory Factor Analysis James G. Anderson, Ph.D. Purdue University.
Available at Chapter 13 Multivariate Analysis BCB 702: Biostatistics
Descriptive Statistics vs. Factor Analysis Descriptive statistics will inform on the prevalence of a phenomenon, among a given population, captured by.
Marketing Research Aaker, Kumar, Day and Leone Tenth Edition Instructor’s Presentation Slides 1.
PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION Richard Brereton
Principal Components Analysis. Principal Components Analysis (PCA) A multivariate technique with the central aim of reducing the dimensionality of a multivariate.
Carlos H. R. Lima - Depto. of Civil and Environmental Engineering, University of Brasilia. Brazil. Upmanu Lall - Water Center, Columbia.
Mining Anomalies Using Traffic Feature Distributions Anukool Lakhina Mark Crovella Christophe Diot in ACM SIGCOMM 2005 Presented by: Sailesh Kumar.
Lecture 12 Factor Analysis.
Multivariate Analysis and Data Reduction. Multivariate Analysis Multivariate analysis tries to find patterns and relationships among multiple dependent.
Module III Multivariate Analysis Techniques- Framework, Factor Analysis, Cluster Analysis and Conjoint Analysis Research Report.
Applied Quantitative Analysis and Practices
PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Richard Brereton
Principle Component Analysis and its use in MA clustering Lecture 12.
Principal Component Analysis (PCA)
Multidimensional Scaling and Correspondence Analysis © 2007 Prentice Hall21-1.
Principal Component Analysis Zelin Jia Shengbin Lin 10/20/2015.
Advanced Statistics Factor Analysis, I. Introduction Factor analysis is a statistical technique about the relation between: (a)observed variables (X i.
Feature Extraction 主講人:虞台文. Content Principal Component Analysis (PCA) PCA Calculation — for Fewer-Sample Case Factor Analysis Fisher’s Linear Discriminant.
Tom.h.wilson Department of Geology and Geography West Virginia University Morgantown, WV.
Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L11.1 Lecture 11: Canonical correlation analysis (CANCOR)
Feature Extraction 主講人:虞台文.
Principal Components Analysis ( PCA)
Chapter 14 EXPLORATORY FACTOR ANALYSIS. Exploratory Factor Analysis  Statistical technique for dealing with multiple variables  Many variables are reduced.
McGraw-Hill/Irwin © 2003 The McGraw-Hill Companies, Inc.,All Rights Reserved. Part Four ANALYSIS AND PRESENTATION OF DATA.
Basic statistical concepts Variance Covariance Correlation and covariance Standardisation.
Methods of multivariate analysis Ing. Jozef Palkovič, PhD.
Principal Component Analysis (PCA)
Exploratory Factor Analysis
Unsupervised Learning
Mini-Revision Since week 5 we have learned about hypothesis testing:
JMP Discovery Summit 2016 Janet Alvarado
Factor analysis Advanced Quantitative Research Methods
Principal Component Analysis (PCA)
Multidimensional Scaling and Correspondence Analysis
Quality Control at a Local Brewery
X.1 Principal component analysis
Marios Mattheakis and Pavlos Protopapas
Unsupervised Learning
Presentation transcript:

A B S T R A C T The study presents the application of selected chemometric techniques to the pollution monitoring dataset, namely, cluster analysis, principal component analysis, discriminant analysis and factor analysis. Chemometric analysis confirmed the classification of water purity of the Brda river made by the Inspection of Environmental Protection but the results showed more differentiation between monitored locations.

introduction The classification of the river water quality is generally based on the comparison of measured values of a particular pollution parameter’s concentration with the limit values as defined in the adequate legal instrument. The quality standards for rivers are defined on the basis of the water use criteria. Moreover, 90% of the values must meet the appropriate level to attribute the water to one of the five classes.

There are two most common methods of unsupervised pattern recognition, Namely, cluster analysis (CA) and principal component analysis (PCA) with factor analysis (FA). Usually, CA is carried out to reveal specific links between Sampling points, while PCA and multiple regression analyses are used to identify the ecological aspects of pollutants on environmental systems. CA were used for site similarity analysis, whereas for the identification of sources of pollution, PCA followed by absolute principal Component scores were applied.

Brda river and its basin Brda is one of the western tributaries of the Vistula river. It flows through the Kaszuby and Kujawy regions in the Northern part of Poland. The total length of the Brda river is 238 km. The Brda catchment covers an area of 4693km2. Agriculture, farming and aquaculture are quite intensive, with many cattle, pig, poultry, sheep, and fish farms. This is likely to cause water quality problems with respect to organic waste and nutrients. Industries are mainly found in and around Bydgoszcz, varying from food production, metals and ceramics to chemical industries (pharmaceutical, organic– chemical, etc.).

The dataset and statistical procedures (a) Dataset: The dataset covers the period from January 1994 to December 2002 and contains the values of selected pollution indicators for 20 monitoring locations. Locations of the monitoring points and selected pollution indicators are presented on Fig. 1 and Table 1.

(b) Cluster analysis: CA is an exploratory data analysis tool for solving classification problems. Its objective is to sort cases (monitoring pints) into groups, or clusters. The descriptor variables (pollution indicators) were block standardized by range (autoscaling) to avoid any effects of scale of units on the distance measurements by applying the equation: The similarities–dissimilarities were quantified through Euclidean distance measurements; the distance between monitoring point locations, i and j, is given as Normalized Euclidean distances and the Ward’s method were used to obtain dendrograms

(c) Principal component analysis and factor analysis: On the basis of the dataset, new orthogonal variables (factors) as the linear combination of original parameters are calculated. Owing to this, all information about the objects gathered in the original multidimensional dataset can be performed in the reduced space and explained by a reduced set of calculated factors called principal components (PCs). Identified PCs (e.g. by eigenvalue-one criterion) account for the maximum explainable variance of all original property parameters in a descending order.

Factoe analysis: FA is a useful tool for extracting latent information, such as not directly observable relationships between variables. The original data matrix is decomposed into the product of a matrix of factor loadings and a matrix of factor scores plus a residual matrix. In general, by applying the eigenvalue-one criterion, the number of extracted factors is less than the number of measured features. So the dimensionality of the original data space can be decreased by means of FA.

(d) Discriminant analysis: The principle of DA consists of the separation of a priori given classes of objects. The variance–covariance between the classes is maximized and the variance–covariance within the classes is minimized under simultaneous consideration of all analyzed features.

Results and discussion On the basis of scree-plot for the PCA, up to 99% of the original dataset variability is now gathered in the first six new variables (components). Owing to this, all information about pollution in 20 monitoring locations gathered in the original 12 variables can be performed in the reduced space and explained by a set of calculated variables (PCs). According to the eigenvalue-one criterion, only the PCs with eigenvalues greater than one are considered as important ones. This criterion is based on the fact that the average eigenvalues of the autoscaled data is just one. The scree-plot shows that although the eigenvalues of PC3 and PC4 are slightly below one, the explained variances are quite high (7.57% and 7.09%) (Fig. 3).

The correlation matrix, presented in the Table 3, also confirms the high interdependence between particular variables. The high correlations obtained between some variables can easily be rationalized for example, between N-NO3 and N-tot, or P-tot and P-PO4. But there are also some less obvious correlations namely between N-NH4 and P-PO4. The redundancy of information suggests applying the FA in order to reduce the dimensionality of dataset as it was done in PCA. On the basis of the scree-plot (Fig. 3) the maximum information is gathered in the first four factors (95%). However only first two are significant and exceed eigenvalue of 1.

By the representation of the factor scores versus monitoring location the pollution sources in the river system may be identified. In Fig. 6 high scores correspond to high influence of the factor on the sampling site. Factors 1 and 2 are crucial for the condition of the monitoring locations 19 and 20 which are described as highly polluted by municipal discharge Factor 3 identifies the point 15 that is located on the estuary of the Brda’s tributary Kotomierzyca. The Kotomierzyca river is highly influenced by agriculture On the contrary the influence of factor 4 (pH) on the grouping pattern may be described as similarly low to points 2, 3, 4, 5 (cluster II, see Fig. 1) and 15.

Conclusions Brda is an important river in the whole river system in Poland. Its basin links two major river systems Odra and Vistula (through the Bydgoski channel). However, some differences between the monitoring locations are not recorded by the classification model used by the Inspection during the years By applying FA it was possible to identify main pollution sources in heavily polluted locations: 15, 19 and 20. It was found that water in the monitoring point no. 15 is mainly polluted by agriculture (high loads of N-tot, N-NO3), while the points 19 and 20, localized within the borders of the Bydgoszcz city, are influenced by the loads of the municipal pollution sources.

Thanks for your kind attention