Biological Data Mining A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/
People
Centre for Molecular Design, University of Portsmouth
– Professor Martyn Ford
– Dr David Whitley
– Dr Shuang Cang (Mar - Sept 2000)
– Dr Abul Azad (Jan 2001 - )
Dr Antony Browne, London Guildhall University
Professor Philip Picton, University College Northampton
1. Objectives
The project aims:
– to develop and validate techniques for extracting explicit information from bioinformatic data
– to express this information as logical rules and decision trees
– to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics
2. Methods for Extracting Information
Artificial Neural Networks
– good predictive accuracy
– hard to decipher
– often regarded as ‘black boxes’
Decision Trees
– symbolic rules are easier to interpret
– more likely to reveal relationships in the data
– allow the behaviour of individual cases to be explained
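To make the contrast concrete, the following scikit-learn sketch fits both kinds of model on synthetic data. The dataset and every parameter choice here are illustrative assumptions, not results from the project: the network usually predicts well but exposes no rules, while the tree prints explicit splitting conditions.

```python
# A minimal sketch (illustrative data and parameters) of the black-box vs
# symbolic-rules contrast, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# The network typically scores well but offers no explicit rules to inspect.
net = MLPClassifier(hidden_layer_sizes=(9,), max_iter=2000, random_state=0).fit(X_train, y_train)
print("network accuracy:", net.score(X_test, y_test))

# The tree is often less accurate, but its splits can be read as rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("tree accuracy:", tree.score(X_test, y_test))
print(export_text(tree))
```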
3. Extracting Decision Trees
The Trepan procedure (Craven, 1996) extracts decision trees from a neural network and a set of training cases by recursively partitioning the input space. The decision tree is built in a best-first manner, expanding the tree at the nodes where there is the greatest potential for increasing the fidelity of the tree to the network.
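The sketch below illustrates this best-first strategy. It is a paraphrase under stated assumptions, not Craven's implementation: nodes wait in a priority queue ordered by reach × (1 − fidelity), and for brevity the m-of-n beam search is replaced by a simple threshold split on one feature.

```python
# A minimal sketch of Trepan-style best-first expansion. The split rule is a
# stand-in (single-feature median threshold) rather than the real m-of-n search.
import heapq
import numpy as np

def fidelity(node_X, oracle, label):
    """Fraction of the network's predictions at this node matched by one leaf label."""
    return np.mean(oracle(node_X) == label)

def extract_tree(oracle, X, max_internal=5):
    """oracle: the trained network's predict function; X: the training inputs."""
    root = {"X": X, "children": None}
    queue, counter = [], 0

    def push(node):
        nonlocal counter
        y = oracle(node["X"])
        node["label"] = int(np.bincount(y).argmax())   # majority network prediction at the node
        # Priority = reach * (1 - fidelity): how much fidelity could still be gained here.
        score = (len(node["X"]) / len(X)) * (1 - fidelity(node["X"], oracle, node["label"]))
        heapq.heappush(queue, (-score, counter, node))
        counter += 1

    push(root)
    internal = 0
    while queue and internal < max_internal:
        _, _, node = heapq.heappop(queue)
        if len(node["X"]) < 2:
            continue
        # Stand-in for the m-of-n beam search: a single threshold split on feature 0.
        thr = np.median(node["X"][:, 0])
        left = node["X"][node["X"][:, 0] <= thr]
        right = node["X"][node["X"][:, 0] > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        node["children"] = [{"X": left, "children": None}, {"X": right, "children": None}]
        internal += 1
        for child in node["children"]:
            push(child)
    return root
```

Any trained classifier's predict method (for example the MLPClassifier above) can play the role of the oracle.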
4. Splitting Tests
The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x₁, ¬x₂, x₃}, where the xᵢ are Boolean conditions.
Start with a set of candidate tests:
– binary tests on each value for nominal features
– binary tests on thresholds for real-valued features
Use a beam search with a beam width of two.
Initialize the beam with the candidate test that maximizes the information gain.
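The next sketch shows how such a test can be evaluated: an m-of-n expression fires when at least m of its n literals hold, and its quality is measured by the information gain on the network's labels. All names and the toy data are assumptions for illustration.

```python
# A minimal sketch of evaluating an m-of-n splitting test by information gain.
import numpy as np

def mofn_satisfied(conditions, m):
    """conditions: list of Boolean arrays, one per literal; returns the test outcome per case."""
    return np.sum(np.stack(conditions), axis=0) >= m

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(y, test_outcome):
    """Reduction in entropy of the network's labels y produced by the split."""
    gain = entropy(y)
    for branch in (test_outcome, ~test_outcome):
        if branch.any():
            gain -= branch.mean() * entropy(y[branch])
    return gain

# Toy example: the test 2-of-{x1, not x2, x3} on Boolean features X.
X = np.random.rand(100, 3) > 0.5
y = (X[:, 0] & X[:, 2]).astype(int)           # placeholder standing in for network labels
conditions = [X[:, 0], ~X[:, 1], X[:, 2]]
print(info_gain(y, mofn_satisfied(conditions, m=2)))
```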
5. Splitting Tests (II)
To each m-of-n test in the beam and each candidate test, apply two operators:
– m-of-(n+1), e.g. 2-of-{x₁, x₂} => 2-of-{x₁, x₂, x₃}
– (m+1)-of-(n+1), e.g. 2-of-{x₁, x₂} => 3-of-{x₁, x₂, x₃}
Admit new tests to the beam if they increase the information gain and are significantly different (chi-squared test) from existing tests.
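One round of this beam search might look like the sketch below, which reuses mofn_satisfied and info_gain from the previous sketch. It is an assumption-laden paraphrase, not Craven's code; in particular, a simple outcome-agreement threshold stands in for the slide's chi-squared significance test.

```python
# A minimal sketch of one beam-search round over m-of-n tests (beam width 2).
# The chi-squared "significantly different" check is replaced here by a plain
# agreement threshold between test outcomes.
import numpy as np

def extend_beam(beam, candidates, y, beam_width=2, max_agreement=0.95):
    """beam: list of (m, literals) tests; candidates: Boolean arrays; y: network labels."""
    scored = []
    for m, lits in beam:
        base_gain = info_gain(y, mofn_satisfied(lits, m))
        kept_outcomes = [mofn_satisfied(bl, bm) for bm, bl in beam]
        for cand in candidates:
            for new_m in (m, m + 1):              # m-of-(n+1) and (m+1)-of-(n+1)
                new_lits = lits + [cand]
                outcome = mofn_satisfied(new_lits, new_m)
                gain = info_gain(y, outcome)
                if gain <= base_gain:
                    continue
                # Reject extensions that behave (almost) identically to a kept test.
                if any(np.mean(outcome == ko) > max_agreement for ko in kept_outcomes):
                    continue
                scored.append((gain, new_m, new_lits))
    scored.sort(key=lambda t: -t[0])
    return [(m, lits) for _, m, lits in scored[:beam_width]]
```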
6. Example: Substance P Binding to NK1 Receptors
Substance P is a neuropeptide with the amino acid sequence H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH₂.
Wang et al. (1993) used the multipin technique to synthesize the 512 = 2⁹ stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions, and measured their binding potencies to central NK1 receptors.
The objective was to identify the positions at which stereochemistry affects binding strength.
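A natural encoding for such data, assumed here for illustration, is one bit per variable position (0 = natural L-form, 1 = D-form), which enumerates exactly the 512 stereoisomers; the measured binding potencies come from the original study and are not reproduced on the slide.

```python
# A minimal sketch of the assumed binary stereochemistry encoding.
from itertools import product

# One bit per variable position: 0 = L (natural), 1 = D. Position numbering is
# illustrative; see Wang et al. (1993) for which residues were varied.
stereoisomers = list(product([0, 1], repeat=9))
print(len(stereoisomers))    # 512
print(stereoisomers[0])      # (0, ..., 0): the all-L parent peptide
```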
7. Application of Trepan
A series of networks with a 9:9:1 architecture (9 inputs, 9 hidden units, 1 output) was trained using 90% of the data as the training set. For each network a decision tree was grown using Trepan.
The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins (1999).
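A setup of this shape could be reproduced roughly as below. This is a sketch under assumptions: the output is treated as a binary high/low binding label and replaced by a random placeholder, since the measured potencies are not given on the slides, and all solver settings are illustrative.

```python
# A minimal sketch of a 9:9:1 network trained on 90% of the cases (scikit-learn).
import numpy as np
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X = np.array(list(product([0, 1], repeat=9)))     # the 512 stereoisomer encodings
y = np.random.randint(0, 2, size=len(X))          # placeholder for high/low binding labels
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(9,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("held-out accuracy:", net.score(X_test, y_test))
# net.predict can then serve as the oracle queried by Trepan when growing the tree.
```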
8. A Typical Trepan Tree
9. Future Work
Complete the implementation of the Trepan algorithm:
– model the distribution of the input data and generate from this a set of query instances that are classified using the network and used as additional training cases during extraction of the tree.
Extend the algorithm to enable the extraction of regression trees.
Provide a Bayesian formulation for the decision tree extraction algorithm.
Compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).
Apply Trepan to ligand-receptor binding problems.
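The query-instance idea could be sketched as follows, under a strong simplifying assumption: each binary feature is modelled only by its empirical marginal at the node (correlations between features are ignored), and the synthetic cases are labelled by the network.

```python
# A minimal sketch of generating network-labelled query instances from a
# per-feature marginal model of the cases reaching a node.
import numpy as np

def query_instances(node_X, oracle, n_queries=100, seed=0):
    """node_X: binary cases reaching the node; oracle: the trained network's predict."""
    rng = np.random.default_rng(seed)
    p = node_X.mean(axis=0)                            # per-feature empirical P(x_j = 1)
    queries = (rng.random((n_queries, node_X.shape[1])) < p).astype(int)
    return queries, oracle(queries)                    # extra training cases for tree extraction
```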