Biological Data Mining: A Comparison of Neural Network and Symbolic Techniques

http://www.cmd.port.ac.uk/biomine/

1. Objectives

The project aims:
– to develop and validate techniques for extracting explicit information from bioinformatic data
– to express this information as logical rules and decision trees
– to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

2. Extracting Information

Artificial neural networks can be trained to reproduce the non-linear relationships underlying bioinformatic data with good predictive accuracy
– but it is often hard to comprehend those relationships from the internal structure of the network
– with the result that networks are often regarded as ‘black boxes’.

Decision trees using symbolic rules are easier to interpret
– leading to a greater likelihood of understanding the relationships in the data
– allowing the behaviour of individual cases to be explained.

3. Extracting Decision Trees

The Trepan procedure (Craven, 1996) extracts a decision tree from a trained neural network and a set of training cases by recursively partitioning the input space. The tree is built in a best-first manner: at each step it is expanded at the node with the greatest potential for increasing the fidelity of the tree to the network, i.e. the agreement between the tree's predictions and the network's.
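The best-first ordering can be illustrated with a priority queue. The sketch below is not the project's implementation: it assumes each node carries an estimated reach (fraction of instances arriving at the node) and an estimated local fidelity, and scores a node as reach × (1 − fidelity), which is one common way of expressing "potential for increasing fidelity". The Node class and the example numbers are hypothetical, and the children are pre-built here, whereas a real extractor would create them at expansion time.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    name: str
    reach: float      # assumed: estimated fraction of instances reaching the node
    fidelity: float   # assumed: estimated agreement with the network at the node
    children: list = field(default_factory=list)


def priority(node):
    # Large when many instances reach the node and the tree still
    # disagrees with the network there.
    return node.reach * (1.0 - node.fidelity)


def best_first_order(root, max_expansions):
    """Return node names in the order a best-first search would expand them."""
    tie = count()  # tie-breaker so Node objects are never compared directly
    heap = [(-priority(root), next(tie), root)]
    order = []
    while heap and len(order) < max_expansions:
        _, _, node = heapq.heappop(heap)
        order.append(node.name)
        for child in node.children:
            heapq.heappush(heap, (-priority(child), next(tie), child))
    return order


# Hypothetical tree with made-up reach and fidelity estimates.
example_root = Node("root", 1.0, 0.6, [
    Node("a", 0.7, 0.9),
    Node("b", 0.3, 0.5, [Node("b1", 0.2, 0.95)]),
])
print(best_first_order(example_root, 4))
```

Negating the priority turns Python's min-heap into a max-heap, so the node with the largest potential gain is always expanded next.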

4. Splitting Tests

The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions; such a test is satisfied when at least m of its n conditions hold.

Start with a set of candidate tests
– binary tests on each value for nominal features
– binary tests on thresholds for real-valued features

Use a beam search with a beam width of two. Initialize the beam with the candidate test that maximizes the information gain.
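A minimal sketch of how an m-of-n test and its information gain might be evaluated is given here. The function names, the tuple encoding of instances, and the tiny data set are assumptions made for illustration; in Trepan the class labels would come from the network rather than from the observed data.

```python
import math


def m_of_n(m, conditions, x):
    """True if at least m of the Boolean conditions hold for instance x."""
    return sum(1 for cond in conditions if cond(x)) >= m


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def information_gain(test, X, y):
    """Entropy reduction obtained by splitting (X, y) on a Boolean test."""
    passed = [label for x, label in zip(X, y) if test(x)]
    failed = [label for x, label in zip(X, y) if not test(x)]
    n = len(y)
    return (entropy(y)
            - len(passed) / n * entropy(passed)
            - len(failed) / n * entropy(failed))


# The test 2-of-{x1, not x2, x3} on instances encoded as (x1, x2, x3).
conditions = [lambda x: x[0] == 1, lambda x: x[1] == 0, lambda x: x[2] == 1]
print(m_of_n(2, conditions, (1, 0, 0)))  # x1 and not-x2 hold
print(m_of_n(2, conditions, (0, 1, 1)))  # only x3 holds
```

Because an m-of-n test is itself a Boolean function of the instance, it can be passed to information_gain in the same way as a single-feature test, e.g. `information_gain(lambda x: m_of_n(2, conditions, x), X, y)`.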

5. Splitting Tests (II)

To each m-of-n test in the beam and each candidate test, apply two operators:
– m-of-n+1, e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3}
– m+1-of-n+1, e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3}

Admit new tests to the beam if they increase the information gain and are significantly different (chi-squared test) from existing tests.
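The two operators can be written down directly if a test is represented as a pair (m, set of literals). This representation and the function names are assumptions for the sketch; the admission step (information gain and chi-squared comparison) is deliberately left out.

```python
def m_of_n_plus_1(test, literal):
    """Add a literal to the set, keeping the threshold m unchanged."""
    m, literals = test
    return (m, literals | {literal})


def m_plus_1_of_n_plus_1(test, literal):
    """Add a literal to the set and raise the threshold m by one."""
    m, literals = test
    return (m + 1, literals | {literal})


def expand(test, candidate_literals):
    """Apply both operators to a test for every literal not already in it."""
    _, literals = test
    new_tests = []
    for literal in candidate_literals:
        if literal in literals:
            continue  # adding an existing literal would not change n
        new_tests.append(m_of_n_plus_1(test, literal))
        new_tests.append(m_plus_1_of_n_plus_1(test, literal))
    return new_tests


start = (2, frozenset({"x1", "x2"}))
for m, literals in expand(start, ["x1", "x3"]):
    print(f"{m}-of-{sorted(literals)}")
```

Using a frozenset makes tests hashable, so duplicates generated along different search paths can be detected cheaply before they are scored.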

6. Example: Substance P Binding to NK1 Receptors

Substance P is a neuropeptide with the sequence:
H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2

Wang et al. (1993) used the multipin technique to synthesize the 512 = 2^9 stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions.

The aim was to measure binding potencies to NK1 receptors and to identify the positions at which stereochemistry affects binding strength.
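The count of 512 follows from two choices (L or D) at each of 9 positions. As a small illustration, the full set can be enumerated as 9-bit vectors, which is also a natural input encoding for a network with 9 inputs. The 0 = L, 1 = D coding is an assumption of this sketch, not something stated in the source.

```python
from itertools import product

N_POSITIONS = 9  # number of positions at which L- is replaced by D-amino acids


def stereoisomers(n_positions=N_POSITIONS):
    """All L/D assignments as 0/1 tuples (assumed coding: 0 = L, 1 = D)."""
    return list(product((0, 1), repeat=n_positions))


isomers = stereoisomers()
print(len(isomers))   # 2**9 = 512
print(isomers[0])     # the all-L parent peptide under the assumed coding
```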

7. Application of Trepan

A series of networks with 9:9:1 architectures (9 inputs, 9 hidden units, 1 output) was trained using 90% of the data as a training set. For each network a decision tree was grown using Trepan. The trees showed high fidelity with the networks on the remaining 10% test set.
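The evaluation described here can be sketched as a 90/10 split followed by a fidelity calculation, where fidelity is the fraction of test cases on which the tree and the network give the same class. The helper names, the random seed and the toy predictions are assumptions; the source does not say how the split was drawn.

```python
import random
from collections import Counter


def split_90_10(cases, seed=0):
    """Shuffle the cases and return (training set, test set) in a 90:10 ratio."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = round(0.9 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fidelity(tree_predictions, network_predictions):
    """Fraction of cases on which the tree agrees with the network."""
    agree = sum(t == n for t, n in zip(tree_predictions, network_predictions))
    return agree / len(network_predictions)


def confusion_counts(predicted, reference):
    """Counts of (reference class, predicted class) pairs."""
    return Counter(zip(reference, predicted))


train, test = split_90_10(range(512))
print(len(train), len(test))                 # 461 51
print(fidelity([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Calling confusion_counts with the network's predictions as the reference gives a tree-versus-network matrix; calling it with the measured classes as the reference gives a tree-versus-observed matrix.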

8. Results

Binding activity was determined by five positions, viz.
– H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2

The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins
– Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & Jørgensen).

9. A Typical Trepan Tree

10. Test set confusion matrix: tree versus network

11. Test set confusion matrix: tree versus observed

12. Future Work

Complete the implementation of the Trepan algorithm.
– model the distribution of the input data and generate a set of query instances to be classified by the network and used as additional training cases during tree extraction.

Extend the algorithm to enable the extraction of regression trees.

Provide a Bayesian formulation for the decision tree extraction algorithm.
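The query-generation step listed under the first item might look roughly like the following. This is a deliberately simplified sketch: it assumes Boolean inputs, models each feature independently by its empirical frequency, and treats the trained network as a hypothetical `oracle` callable. A fuller version would also restrict sampling to the region of input space that reaches a given tree node.

```python
import random


def marginal_frequencies(X):
    """Empirical frequency of value 1 for each Boolean feature."""
    n = len(X)
    return [sum(x[j] for x in X) / n for j in range(len(X[0]))]


def generate_queries(X, oracle, n_queries, seed=0):
    """Sample instances from the independent marginals and label them with the oracle."""
    rng = random.Random(seed)
    freqs = marginal_frequencies(X)
    queries = [
        tuple(int(rng.random() < p) for p in freqs)
        for _ in range(n_queries)
    ]
    return [(q, oracle(q)) for q in queries]


# Hypothetical stand-in for a trained network: class 1 whenever the first input is set.
def toy_oracle(x):
    return x[0]


training_inputs = [(1, 0), (1, 1), (1, 0), (1, 1)]
extra_cases = generate_queries(training_inputs, toy_oracle, n_queries=5)
print(extra_cases)
```

The labelled pairs can be appended to the real training cases when choosing splits, so that tests deep in the tree are scored on more than the handful of original cases that reach them.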

13. Future Applications

Apply Trepan to ligand-receptor binding problems.
– compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).

14. References

Wang J-X et al. (1993) Study of stereo-requirements of substance P binding to NK1 receptors using analogues with systematic D-amino acid replacements. Bioorganic & Medicinal Chemistry Letters, 3, 451-456.

Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & Jørgensen).

Grantholder
Professor Martyn Ford, Centre for Molecular Design, University of Portsmouth. martyn.ford@port.ac.uk

Research Fellows
Dr Shuang Cang (Mar - Sept 2000)
Dr Abul Azad (Jan 2001 - )

Collaborators
Dr Antony Browne, School of Computing, Information Systems and Mathematics, London Guildhall University. abrowne@lgu.ac.uk
Professor Philip Picton, School of Technology and Design, University College Northampton. phil.picton@northampton.ac.uk
Dr David Whitley, Centre for Molecular Design, University of Portsmouth. david.whitley@port.ac.uk
