Unsupervised Pattern Recognition for the Interpretation of Ecological Data by Mark A. O’Connor Centre for Intelligent Environmental Systems School of Computing.


Unsupervised Pattern Recognition for the Interpretation of Ecological Data by Mark A. O’Connor Centre for Intelligent Environmental Systems School of Computing Staffordshire University

Outline
– Background
– River pollution & biological monitoring
– Pattern recognition
– Self-organising maps
– MIR-Max
– RPDS (River Pollution Diagnostic System)
– Conclusion

Background
– Work on the use of artificial intelligence (AI) techniques was started in 1989 by W. J. Walley and H. A. Hawkes.
– Biological monitoring of river quality has been widely used for many years.
– Current techniques are based on subjective score systems (e.g. BMWP) and simplistic formulae that use only a fraction of the available data.
– Current systems (e.g. RIVPACS) rely on ‘reference states’, so a set of ‘unpolluted’ sites has to be identified first.

RIVPACS reference sites

Aims
– To produce a system for both classification and diagnosis of river quality
– To make full use of all the available data
– Not founded on subjective human evaluations (e.g. BMWP scores)
– No subjective selection of ‘reference sites’: a holistic view of ‘clean’ and ‘dirty’ water biology

River pollution – ‘biomonitoring’
– Chemical assessments alone do not fully reflect the environmental quality of a river.
– Organisms living in the river constitute a fundamental part of the river ecosystem.
– ‘Benthic macroinvertebrates’ are used because they are:
  – Abundant
  – Easy to collect and identify
  – Sufficiently diverse in species
  – Confined to a particular part of the river

Interpretation of data
Experts use two complementary processes when interpreting biological data:
– ‘Plausible reasoning’, based on scientific knowledge of the ecological system
– ‘Pattern recognition’, based on experience of past cases
Data from a site are interpreted ‘holistically’, rather than by applying specific ‘if … then …’ rules.

Pattern recognition
– In AI terms, ‘pattern recognition’ attempts to classify or cluster sets of objects into groups using a specified set of features.
– Example: in optical character recognition the ‘objects’ are letters, the ‘features’ are the percentage of each grid square that is shaded, and the output ‘groups’ correspond to ‘a’, ‘b’, ‘c’, etc.

PR system for river quality
– The ‘objects’ are the river sites.
– The ‘features’ are the abundance levels of 76 selected creatures, together with information such as width, depth, discharge and composition of the river bed.
– The ‘output groups’ correspond to varying river quality types or classes.

Self-organising maps (SOMs)
– An output lattice or ‘map’ of ‘nodes’ represents the clusters; each node is associated with a ‘prototype’ set of features.
– Training is ‘unsupervised’.
– New input data are classified according to which prototype they best match.
– The map is arranged so that nearby nodes represent similar patterns.
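The classification step of a trained SOM (assigning a new input to the node whose prototype it best matches) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `best_matching_node`, the toy 2x2 map and the use of Euclidean distance as the match measure are all assumptions, since the slides do not say which measure was used.

```python
import math


def best_matching_node(sample, prototypes):
    """Return the map coordinates of the node whose prototype is closest to sample.

    prototypes maps (row, col) -> list of feature values.
    Euclidean distance is an assumed match measure.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(prototypes, key=lambda node: distance(sample, prototypes[node]))


# Hypothetical 2x2 map with three features per prototype.
prototypes = {
    (0, 0): [0.0, 0.0, 0.0],
    (0, 1): [1.0, 0.0, 0.0],
    (1, 0): [0.0, 1.0, 1.0],
    (1, 1): [1.0, 1.0, 1.0],
}
node = best_matching_node([0.9, 0.1, 0.2], prototypes)
```

For a real river-site map the feature vector would hold the taxon abundance levels and environmental variables, and the lattice would be 20x20 rather than 2x2.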

River site SOM
– 20x20 output maps produced using SOM: groups/cies/somview/somview.htm
– Nodes are represented by points, referenced by the axes.
– Contours were produced using the Statistica maths package.
– Example feature map: Heptageniidae (mayfly), which generally indicates good water quality because it is sensitive to pollution.

Comparison of feature maps
– Unionidae (Swan Mussels) live only in gently flowing rivers, so the feature maps of river slope and of the occurrence of Unionidae are seen to be inversely related.

Measurement of SOM quality
Two aspects:
– How well the data are classified (e.g. are very similar examples allocated to the same node/bin/neuron?)
– How well the output nodes are ordered (e.g. do nodes that are close together in output space contain examples that are similar?)

Classification
– The mathematical theory of information was introduced by C. Shannon (1949).
– The ‘mutual information’ between two variables (X and Y, say) quantifies the amount of ‘information’ about X that is gained by a knowledge of Y.
– A ‘good’ classification should maximise the M.I. between inputs (i.e. taxonomic and environmental data) and outputs (i.e. allocated nodes).

Ordering
– A good ordering across the output ‘map’ is also needed, i.e. a preservation of the neighbourhood relations in the input space.
– Ordering can be measured using the correlation (r) between distances in data space (given some ‘distance’ or ‘dissimilarity’ measure between input feature sets) and Euclidean distances on the output map.
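The ordering measure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `ordering_correlation`, `prototypes` and `positions` are hypothetical, r is taken to be the Pearson correlation over all pairs of classes, and Euclidean distance is used in data space (the slides leave the data-space dissimilarity measure open).

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # Guard against constant distance lists, where r is undefined.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def ordering_correlation(prototypes, positions):
    """Correlation r between pairwise data-space and map-space distances.

    prototypes: class -> feature vector; positions: class -> (x, y) on the map.
    """
    pairs = list(combinations(sorted(prototypes), 2))
    data_d = [euclidean(prototypes[a], prototypes[b]) for a, b in pairs]
    map_d = [euclidean(positions[a], positions[b]) for a, b in pairs]
    return pearson(data_d, map_d)


# Hypothetical classes whose prototypes lie along a line in data space.
prototypes = {"A": [0.0], "B": [1.0], "C": [2.0], "D": [3.0]}
well_ordered = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (3, 0)}
badly_ordered = {"A": (0, 0), "B": (2, 0), "C": (1, 0), "D": (3, 0)}
r_good = ordering_correlation(prototypes, well_ordered)
r_bad = ordering_correlation(prototypes, badly_ordered)
```

A layout that preserves the neighbourhood relations gives r close to 1, while swapping two neighbouring classes lowers it.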

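MIR-Max
Mutual Information and Regression Maximisation
The M.I. between the set of $n$ output classes $C$ and an input feature $X_j$, which can take any of $s$ possible values, is given by:

$$I(C; X_j) = \sum_{i=1}^{n} \sum_{k=1}^{s} P(C_i)\, P(x_{jk} \mid C_i)\, \log \frac{P(x_{jk} \mid C_i)}{P(x_{jk})}$$

where
– $P(x_{jk} \mid C_i)$ = probability of finding attribute $X_j$ in its $k$-th state in class $C_i$
– $P(C_i)$ = prior probability of class $C_i$
– $P(x_{jk})$ = prior probability of finding attribute $X_j$ in its $k$-th state.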
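As a rough illustration, the mutual information between a set of class labels and one discrete feature can be estimated from counts using the three probabilities defined above. The function name `mutual_information` and the toy data are hypothetical, and the base-2 logarithm (giving a result in bits) is an assumption, since the slide does not state the base.

```python
import math
from collections import Counter


def mutual_information(classes, states):
    """Estimate the M.I. (in bits) between class labels and one discrete feature.

    classes[i] is the class of sample i; states[i] is the feature state of sample i.
    """
    n = len(classes)
    class_counts = Counter(classes)
    state_counts = Counter(states)
    joint_counts = Counter(zip(classes, states))

    mi = 0.0
    for (c, s), joint in joint_counts.items():
        p_class = class_counts[c] / n            # prior probability of the class
        p_state = state_counts[s] / n            # prior probability of the state
        p_state_given_class = joint / class_counts[c]
        mi += p_class * p_state_given_class * math.log2(p_state_given_class / p_state)
    return mi


# Feature perfectly predicted by the class: one bit of information.
mi_informative = mutual_information([0, 0, 1, 1], ["low", "low", "high", "high"])
# Feature independent of the class: no information.
mi_independent = mutual_information([0, 0, 1, 1], ["low", "high", "low", "high"])
```

Combinations of class and state that never occur contribute nothing to the sum, so iterating over the observed joint counts is sufficient.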

MIR-Max clustering
The ‘clustering’ aim is to optimise the M.I. between the output groupings and the input variables (averaged over all of the variables).
– Start from a sub-optimal clustering by randomly allocating the input samples to the output classes.
– Choose a sample and assess the effect of transferring it from its current class (the ‘departure’ class) to another class (the ‘arrival’ class).
– Make the transfer if it produces an increase in M.I.
– Continue the procedure until a stopping criterion is satisfied.
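The transfer procedure above can be sketched as a simple greedy loop. This is a minimal sketch, not the MIR-Max implementation: the function names, the random choice of sample and arrival class, the fixed number of trials used as the stopping criterion and the base-2 logarithm are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(classes, states):
    """M.I. (bits) between class labels and one discrete feature."""
    n = len(classes)
    class_counts = Counter(classes)
    state_counts = Counter(states)
    mi = 0.0
    for (c, s), joint in Counter(zip(classes, states)).items():
        p_s_given_c = joint / class_counts[c]
        mi += (class_counts[c] / n) * p_s_given_c * math.log2(
            p_s_given_c / (state_counts[s] / n))
    return mi


def average_mi(labels, samples):
    """M.I. between the class labels and each feature, averaged over features."""
    n_features = len(samples[0])
    columns = [[sample[j] for sample in samples] for j in range(n_features)]
    return sum(mutual_information(labels, col) for col in columns) / n_features


def greedy_mi_clustering(samples, n_classes, n_trials=500, seed=0):
    rng = random.Random(seed)
    # Start from a random (sub-optimal) allocation of samples to classes.
    labels = [rng.randrange(n_classes) for _ in samples]
    history = [average_mi(labels, samples)]
    for _ in range(n_trials):  # assumed stopping criterion: fixed trial budget
        i = rng.randrange(len(samples))
        departure, arrival = labels[i], rng.randrange(n_classes)
        if arrival == departure:
            continue
        labels[i] = arrival
        candidate = average_mi(labels, samples)
        if candidate > history[-1] + 1e-12:
            history.append(candidate)   # keep the transfer
        else:
            labels[i] = departure       # undo the transfer
    return labels, history


# Hypothetical samples: two discrete features (e.g. abundance levels) per site.
samples = [(0, 0), (0, 1), (0, 0), (2, 3), (2, 3), (3, 3), (0, 0), (2, 2)]
labels, history = greedy_mi_clustering(samples, n_classes=2)
```

Recomputing the average M.I. from scratch after every trial transfer is wasteful; a practical version would update the class and state counts incrementally.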

MIR-Max ordering
The ‘ordering’ aim is to optimise the representation of the output classes in a 2D output space.
– Start from a random ordering of the output classes in an output space made up of a number of discrete locations.
– Select two output locations and assess the effect of exchanging their contents.
– If the exchange increases the correlation r between distances in data space and distances in output space, make the swap.
– Continue the procedure until a stopping criterion is satisfied.
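The swap procedure above can be sketched in the same greedy style. Again this is an illustration under assumptions rather than the MIR-Max code: the names `layout_correlation` and `order_by_swaps` are hypothetical, each class is summarised by a prototype vector, Euclidean distance is used in data space, pairs of locations are picked at random, and a fixed number of trials stands in for the stopping criterion.

```python
import math
import random
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def layout_correlation(prototypes, positions):
    """r between data-space distances and output-space distances over class pairs."""
    pairs = list(combinations(sorted(prototypes), 2))
    data_d = [euclidean(prototypes[a], prototypes[b]) for a, b in pairs]
    map_d = [euclidean(positions[a], positions[b]) for a, b in pairs]
    return pearson(data_d, map_d)


def order_by_swaps(prototypes, locations, n_trials=500, seed=0):
    rng = random.Random(seed)
    # Random initial ordering; None marks an empty location.
    slots = list(prototypes) + [None] * (len(locations) - len(prototypes))
    rng.shuffle(slots)

    def score():
        positions = {c: locations[i] for i, c in enumerate(slots) if c is not None}
        return layout_correlation(prototypes, positions)

    best = score()
    for _ in range(n_trials):  # assumed stopping criterion: fixed trial budget
        a, b = rng.sample(range(len(slots)), 2)
        slots[a], slots[b] = slots[b], slots[a]      # trial exchange
        candidate = score()
        if candidate > best:
            best = candidate                          # keep the swap
        else:
            slots[a], slots[b] = slots[b], slots[a]  # undo the swap
    positions = {c: locations[i] for i, c in enumerate(slots) if c is not None}
    return positions, best


# Hypothetical class prototypes and a 3x2 grid of output locations.
prototypes = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 1.0], "D": [3.0, 3.0]}
locations = [(x, y) for x in range(3) for y in range(2)]
positions, r = order_by_swaps(prototypes, locations)
```

Because only improving swaps are accepted, r never decreases, but the loop can still settle in a local optimum.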

MIR-Max results
– Initial testing found that MIR-Max outperformed SOM with respect to ‘clustering’ (as measured by average mutual information).
– MIR-Max is specifically designed to maximise this measure; results show (on average) an 18% improvement over SOM.
– MIR-Max maps were also better ‘ordered’ overall than those produced by SOM: ‘global’ ordering was better, but ‘local’ ordering was worse.

RPDS
River Pollution Diagnostic System
– Developed for use by the British Environment Agency
– Based on a MIR-Max clustering/classification of spring and autumn samples from over 6000 sites across England and Wales
– ‘New’ samples are classified by RPDS; the classifications help biologists to determine possible causes of pollution at the site.

RPDS – feature maps

RPDS – cluster reports

RPDS – cluster ‘templates’

RPDS – sample input

RPDS – classification

Conclusion
– MIR-Max provides a means of organising and visualising complex high-dimensional data.
– It can provide a powerful tool for environmental monitoring/classification and diagnosis.
– Find out more about AI and the environment from our website at: