1
Unsupervised Pattern Recognition for the Interpretation of Ecological Data
by Mark A. O’Connor
Centre for Intelligent Environmental Systems, School of Computing, Staffordshire University
2
Outline
– Background
– River pollution & biological monitoring
– Pattern recognition
– Self-organising maps
– MIR-Max
– RPDS (River Pollution Diagnostic System)
– Conclusion
3
Background
– Work on the use of artificial intelligence (AI) techniques was started in 1989 by W. J. Walley and H. A. Hawkes
– Biological monitoring of river quality has been widely used for many years
– Current techniques are based on subjective score systems (e.g. BMWP) and simplistic formulae that use only a fraction of the available data
– Current systems (e.g. RIVPACS) rely on ‘reference states’, so a set of ‘unpolluted’ sites has to be identified
4
RIVPACS reference sites
5
Aims
– To produce a system for both classification and diagnosis of river quality
– Make full use of all the available data
– Not founded on subjective human evaluations (e.g. BMWP scores)
– No subjective selection of ‘reference sites’: a holistic view of ‘clean’ and ‘dirty’ water biology
6
River pollution – ‘biomonitoring’
– Chemical assessments alone do not fully reflect the environmental quality of a river
– Organisms living in the river constitute a fundamental part of the river ecosystem
– ‘Benthic macroinvertebrates’ used:
  – Abundant
  – Easy to collect and identify
  – Sufficient range of diverse species
  – Confined to a particular part of the river
7
Interpretation of data
– Experts use two complementary processes when interpreting biological data:
  – ‘Plausible reasoning’ based on scientific knowledge of the ecological system
  – ‘Pattern recognition’ based on experience of past cases
– Data from a site are interpreted ‘holistically’, rather than using e.g. specific ‘if … then …’ rules
8
Pattern recognition
– ‘Pattern recognition’ in AI terms attempts to classify or cluster sets of objects into groups using a specified set of features
– e.g. in optical character recognition the ‘objects’ are letters, the ‘features’ are the % of each square that is shaded, and the output ‘groups’ correspond to ‘a’, ‘b’, ‘c’, etc.
9
PR system for river quality
– For river quality, the ‘objects’ are the river sites
– The ‘features’ are the abundance levels of 76 selected creatures, together with information such as width, depth, discharge and composition of the river bed
– The ‘output groups’ correspond to varying river quality types or classes
10
Self-organising maps (SOMs)
– An output lattice or ‘map’ of ‘nodes’ represents the clusters; each node is associated with a ‘prototype’ set of features
– Training is ‘unsupervised’
– New input data are classified according to which prototype they best match
– Arranged so that nearby nodes on the output map represent similar patterns
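The ideas on this slide can be sketched in a few lines of plain Python. This is a minimal, generic SOM and not the code used to produce the river site maps; the function names, the linear decay of the learning rate, the Gaussian neighbourhood and all parameter defaults are assumptions made for illustration.

```python
import math
import random


def best_matching_node(prototypes, sample):
    """Return the (row, col) node whose prototype is closest to the sample."""
    return min(prototypes, key=lambda node: math.dist(prototypes[node], sample))


def train_som(samples, rows, cols, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Unsupervised training: no class labels are used, only the samples."""
    rng = random.Random(seed)
    dim = len(samples[0])
    radius0 = radius0 or max(rows, cols) / 2
    # One prototype feature vector per node of the output lattice.
    prototypes = {(r, c): [rng.random() for _ in range(dim)]
                  for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays
        radius = max(radius0 * (1 - frac), 0.5)  # neighbourhood shrinks
        for sample in rng.sample(samples, len(samples)):
            winner = best_matching_node(prototypes, sample)
            for node, proto in prototypes.items():
                # Nodes near the winner on the map are pulled along with it,
                # which is what makes nearby nodes represent similar patterns.
                grid_dist = math.dist(node, winner)
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for i in range(dim):
                    proto[i] += lr * influence * (sample[i] - proto[i])
    return prototypes


# Hypothetical two-feature data, scaled to the range 0..1.
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(data, rows=3, cols=3)
print(best_matching_node(som, [0.05, 0.05]))  # node allocated to a new sample
```

Classifying a new sample is just the `best_matching_node` lookup; the training loop is what arranges the prototypes so that neighbouring nodes hold similar patterns.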
11
River site SOM
– 20x20 output maps produced using SOM
– http://www.soc.staffs.ac.uk/research/groups/cies/somview/somview.htm
– Nodes represented by points, referenced by axes
– Contours produced using Statistica maths package
– Heptageniidae (mayfly) generally indicates good water quality: sensitive to pollution
12
Comparison of feature maps
– Unionidae (Swan Mussels) only live in gently flowing rivers, so the feature maps of river slope and of the occurrence of Unionidae are seen to be inversely related
13
Measurement of SOM quality
– 2 aspects:
  – How well the data is classified (e.g. are very similar examples allocated to the same node/bin/neuron?)
  – How well the output nodes are ordered (e.g. do nodes that are close together in output space contain examples that are similar?)
14
Classification
– Mathematical theory of information introduced by C. Shannon (1949)
– ‘Mutual information’ between two variables (X and Y, say) quantifies the amount of ‘information’ about X that is gained by a knowledge of Y
– A ‘good’ classification should maximise the M.I. between inputs (i.e. taxonomic and environmental data) and outputs (i.e. allocated nodes)
15
Ordering
– Also need to ensure a good ordering across the output ‘map’ (a preservation of the neighbourhood relations in the input space)
– Ordering can be measured using the correlation (r) between distances in data space (given some ‘distance’ or ‘dissimilarity’ measure between input feature sets) and Euclidean distances on the output map
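As a rough illustration of this ordering measure, the sketch below computes the Pearson correlation r over all pairs of output classes. Using Euclidean distance as the default data-space ‘dissimilarity’ is an assumption (the slide leaves that measure open), and the names `ordering_correlation`, `features` and `positions` are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ordering_correlation(features, positions, dissimilarity=math.dist):
    """features[c]: feature vector of class c; positions[c]: (row, col) on the map."""
    data_d, map_d = [], []
    for a, b in combinations(sorted(features), 2):
        data_d.append(dissimilarity(features[a], features[b]))  # data space
        map_d.append(math.dist(positions[a], positions[b]))     # output map
    return pearson(data_d, map_d)


# Four hypothetical classes lying along a line in data space.
features = {0: [0.0], 1: [1.0], 2: [2.0], 3: [3.0]}
well_ordered = {0: (0, 0), 1: (0, 1), 2: (0, 2), 3: (0, 3)}
scrambled = {0: (0, 0), 1: (0, 3), 2: (0, 1), 3: (0, 2)}
print(ordering_correlation(features, well_ordered))
print(ordering_correlation(features, scrambled))
```

A map that preserves the neighbourhood relations of the input space gives r close to 1; a scrambled layout gives a much lower value.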
16
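MIR-Max
Mutual Information and Regression Maximisation
– The M.I. between a set of n output classes C and an input feature $X_j$, which can take any of s possible values, is given by:

$$I(C; X_j) = \sum_{i=1}^{n} \sum_{k=1}^{s} P(C_i)\, P(x_{jk} \mid C_i)\, \log \frac{P(x_{jk} \mid C_i)}{P(x_{jk})}$$

where
$P(x_{jk} \mid C_i)$ = probability of finding attribute $X_j$ in its k-th state in class $C_i$
$P(C_i)$ = prior probability of class $C_i$
$P(x_{jk})$ = prior probability of finding attribute $X_j$ in its k-th state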
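A minimal sketch of how the M.I. between the output classes and one discrete input feature could be estimated from counts. The three probabilities are taken as simple relative frequencies, and the use of base-2 logarithms (giving a result in bits) is an assumption; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(classes, states):
    """M.I. (in bits) between class labels and the states of one feature.

    classes[m] is the class of sample m, states[m] the state of the
    feature in that sample.
    """
    n = len(classes)
    class_counts = Counter(classes)
    state_counts = Counter(states)
    joint_counts = Counter(zip(classes, states))
    mi = 0.0
    for (c, k), n_ck in joint_counts.items():
        p_c = class_counts[c] / n            # prior probability of the class
        p_k = state_counts[k] / n            # prior probability of the state
        p_k_given_c = n_ck / class_counts[c] # probability of the state in the class
        mi += p_c * p_k_given_c * math.log2(p_k_given_c / p_k)
    return mi


# The feature state is fully determined by the class: 1 bit of information.
print(mutual_information([0, 0, 1, 1], ["absent", "absent", "present", "present"]))
# The feature state is independent of the class: no information.
print(mutual_information([0, 0, 1, 1], ["absent", "present", "absent", "present"]))
```

Combinations of class and state that never occur contribute nothing to the sum, so iterating over the observed joint counts is sufficient.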
17
MIR-Max clustering
– ‘Clustering’ aim is to optimise the M.I. between the output groupings and the input variables (averaged over all of the variables)
– Start from a sub-optimal clustering, randomly allocating the input samples to the output classes
– Choose a sample and assess the effect of transferring it from its current class (the ‘departure’ class) to another class (the ‘arrival’ class)
– Make the transfer if it produces an increase in M.I.
– Continue procedure until a stopping criterion is satisfied
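The transfer procedure on this slide can be sketched as a simple greedy search. This is an illustration under stated assumptions, not the MIR-Max implementation: the order in which samples and arrival classes are tried, the stopping criterion (a full pass with no accepted transfer, or `max_passes`) and the recomputation of the M.I. from scratch after every trial transfer are all choices made here for brevity.

```python
import math
import random
from collections import Counter


def average_mi(assignment, samples):
    """Mean over features of the M.I. (bits) between classes and feature states."""
    n, n_features = len(samples), len(samples[0])
    class_counts = Counter(assignment)
    total = 0.0
    for j in range(n_features):
        column = [s[j] for s in samples]
        state_counts = Counter(column)
        joint = Counter(zip(assignment, column))
        for (c, k), n_ck in joint.items():
            # P(class) * P(state | class) equals the joint frequency n_ck / n.
            ratio = n_ck * n / (class_counts[c] * state_counts[k])
            total += (n_ck / n) * math.log2(ratio)
    return total / n_features


def greedy_mi_clustering(samples, n_classes, max_passes=20, seed=0):
    rng = random.Random(seed)
    # Sub-optimal starting point: random allocation of samples to classes.
    assignment = [rng.randrange(n_classes) for _ in samples]
    best = average_mi(assignment, samples)
    for _ in range(max_passes):
        improved = False
        for idx in rng.sample(range(len(samples)), len(samples)):
            departure = assignment[idx]
            for arrival in range(n_classes):
                if arrival == departure:
                    continue
                assignment[idx] = arrival            # trial transfer
                score = average_mi(assignment, samples)
                if score > best + 1e-12:
                    best, departure, improved = score, arrival, True
                else:
                    assignment[idx] = departure      # no gain: undo
        if not improved:
            break  # assumed stopping criterion
    return assignment, best


# Hypothetical samples with two discrete features each.
samples = [(0, 0)] * 4 + [(1, 1)] * 4
assignment, score = greedy_mi_clustering(samples, n_classes=2)
print(assignment, score)
```

Because a transfer is only kept when the average M.I. rises, the score never decreases, but a greedy search of this kind can stop at a local optimum.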
18
MIR-Max ordering
– ‘Ordering’ aim is to optimise the representation of the output classes in a 2d output space
– Start from a random ordering of the output classes in an output space made up of a number of discrete locations
– Select 2 output locations and assess the effect of exchanging their contents
– If this results in an increase in the correlation r between distances in data space and distances in output space, make the swap
– Continue procedure until a stopping criterion is satisfied
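The swap procedure can be sketched in the same spirit. Again this is only an illustration: the rectangular grid of locations, the random choice of the two locations, the fixed number of trial swaps used as the stopping criterion, and the precomputed matrix `data_dist` of class-to-class dissimilarities are all assumptions, and the grid is assumed to have at least as many locations as there are classes.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation undefined; treat as no ordering
    return cov / math.sqrt(vx * vy)


def correlation(contents, data_dist, grid):
    """r between data-space distances and map distances for the current layout."""
    where = {c: grid[loc] for loc, c in enumerate(contents) if c is not None}
    pairs = list(combinations(sorted(where), 2))
    d_data = [data_dist[a][b] for a, b in pairs]
    d_map = [math.dist(where[a], where[b]) for a, b in pairs]
    return pearson(d_data, d_map)


def order_by_swaps(data_dist, rows, cols, n_trials=2000, seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    n_classes = len(data_dist)
    # contents[loc] is the class held at a location, or None if it is empty.
    contents = list(range(n_classes)) + [None] * (len(grid) - n_classes)
    rng.shuffle(contents)                       # random starting order
    best = correlation(contents, data_dist, grid)
    for _ in range(n_trials):                   # assumed stopping criterion
        a, b = rng.sample(range(len(grid)), 2)
        contents[a], contents[b] = contents[b], contents[a]      # trial swap
        r = correlation(contents, data_dist, grid)
        if r > best:
            best = r                            # keep the swap
        else:
            contents[a], contents[b] = contents[b], contents[a]  # undo
    layout = {c: grid[loc] for loc, c in enumerate(contents) if c is not None}
    return layout, best


# Four hypothetical classes whose dissimilarities lie along a line.
dist = [[abs(i - j) for j in range(4)] for i in range(4)]
layout, r = order_by_swaps(dist, rows=2, cols=2, n_trials=200)
print(layout, r)
```

As with the clustering sketch, only improving swaps are kept, so r never decreases during the search.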
19
MIR-Max results
– Initial testing found that MIR-Max outperformed SOM with respect to ‘clustering’ (as measured by average mutual information)
– MIR-Max is specifically designed to maximise this measure; results show (on average) an 18% improvement over SOM
– MIR-Max maps were also better ‘ordered’ overall than those produced by SOM; ‘global’ ordering was better, but ‘local’ ordering was worse
20
RPDS
River Pollution Diagnostic System
– Developed for use by the British Environment Agency
– Based on a MIR-Max clustering/classification of spring and autumn samples from over 6000 sites across England and Wales
– ‘New’ samples are classified by RPDS; the classifications help biologists to determine possible causes of pollution at the site
21
RPDS - feature maps
22
RPDS – cluster reports
23
RPDS – cluster ‘templates’
24
RPDS – sample input
25
RPDS - classification
26
Conclusion
– MIR-Max provides a means of organising and visualising complex high-dimensional data
– Can provide a powerful tool for environmental monitoring/classification and diagnosis
– Find out more about AI and the environment from our website at: http://www.soc.staffs.ac.uk/research/groups/cies/
– mo3@staffs.ac.uk