Biological Data Mining A comparison of Neural Network and Symbolic Techniques
Grantholder Professor Martyn Ford Centre for Molecular Design University of Portsmouth
Collaborators Dr Anthony Browne School of Computing, Information Systems and Mathematics, London Guildhall University. Professor Philip Picton School of Technology and Design, University College Northampton. Dr David Whitley Centre for Molecular Design, University of Portsmouth.
Objectives The project aims: – to develop & validate techniques for extracting explicit information from bioinformatic data –to express this information as logical rules and decision trees –to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics
Extracting information Artificial neural networks (ANNs) can be used to identify the non-linear relationships that underlie bioinformatic data, but... –trained ANNs do not lead to a concise and explicit model –specifying the underlying structure is therefore difficult –as a result, ANNs are often regarded as ‘black boxes’
Data Mining and Neural Networks Standard data mining algorithms exist (such as ID3 or C5) so why use an ANN? It would be advantageous if the rules extracted: –Give a better fit to the data with the same number of rules (i.e. explain the data more accurately); –Give the same fit to the data with less rules (i.e. explain the data more comprehensibly); or –Give both a better fit to the data and use less rules (i.e. explain the data more comprehensibly and more accurately).
Extracting Decision Trees The TREPAN procedure (Craven,1996) –extracts decision trees from ANNs –performs better than the symbolic learning algorithms ID3 and C5 –the current implementation is restricted to a particular network architecture, but –the underlying algorithm is independent of network architecture
Trepan Builds a decision tree representing the function the ANN has learnt by recursively partitioning the input space. Draws query instances by taking into account the distribution of instances in the problem domain. For real-valued features uses kernel density estimates to generate a model of the underlying data that is used to select instances for presentation to the network.
Trepan Builds the decision tree in a best-first manner: –as each node is added the fidelity of the decision tree to the ANN is maximised –this is done by examining the significance of the distributions at consecutive levels of the tree (Kolmogorov-Smirnoff test for real valued features, chi-squared for discrete ones) Allows the user to control the size of the final tree by selecting appropriate stopping criteria.
Aims Implement the TREPAN algorithm in a portable format, independent of network architecture. Extend the algorithm to enable the extraction of regression trees. Provide a Bayesian formulation for the decision tree extraction algorithm. Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).
Aims Apply the extracted decision trees –to searches of bioinformatic databases protein databases genomic databases –to searches of cheminformatic databases chemical libraries natural product databases –to investigate ligand/receptor binding –to quantify molecular similarity/diversity –to identify new leads and optimise properties
Case study: ligand interaction with GPCRs 28 GPCRs a number of putative interaction sites 3 principal properties of amino acids (AAs) MLR results for 2 ligands