Rule Extraction From Trained Neural Networks Brian Hudson University of Portsmouth, UK.

1 Rule Extraction From Trained Neural Networks Brian Hudson University of Portsmouth, UK

2 Artificial Neural Networks. Advantages: high accuracy; robust; noisy data. Disadvantages: lack of comprehensibility.

3 Trepan. A method for extracting a decision tree from an artificial neural network (Craven, 1996). The tree is built by expanding nodes in a best-first manner, producing an unbalanced tree. The splitting tests at the nodes are m-of-n tests, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions. The network is used as an oracle to answer queries during the learning process.
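An m-of-n test is satisfied when at least m of its n Boolean conditions hold. The following is a minimal sketch of that evaluation for the slide's example 2-of-{x1, ¬x2, x3}; the function name `m_of_n` and the dictionary representation of an example are illustrative assumptions, not part of Trepan itself.

```python
from typing import Callable, Dict, Sequence

Example = Dict[str, bool]
Condition = Callable[[Example], bool]


def m_of_n(m: int, conditions: Sequence[Condition], example: Example) -> bool:
    """Return True when at least m of the n Boolean conditions hold."""
    satisfied = sum(1 for cond in conditions if cond(example))
    return satisfied >= m


# The slide's example test: 2-of-{x1, not x2, x3}.
conditions = [
    lambda e: e["x1"],
    lambda e: not e["x2"],
    lambda e: e["x3"],
]

# x1 and x3 hold, "not x2" does not: two of three conditions are satisfied.
example = {"x1": True, "x2": True, "x3": True}
result = m_of_n(2, conditions, example)
```

With m = 1 the test behaves like a disjunction of its conditions, and with m = n like a conjunction, which is why a single m-of-n node can stand in for several ordinary tree splits.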

4 Splitting Tests. Start with a set of candidate tests: binary tests on each value for nominal features, and binary tests on thresholds for real-valued features. Find the optimal splitting test by a beam search, initializing the beam with the candidate test that maximizes the information gain.
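The beam is seeded with the candidate test that has the highest information gain. A minimal sketch of that selection step follows; the toy examples, the feature names `base` and `score`, and the candidate tests are made up for illustration, and the beam search itself is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(examples, labels, test):
    """Entropy reduction obtained by splitting the examples with a binary test."""
    yes = [y for x, y in zip(examples, labels) if test(x)]
    no = [y for x, y in zip(examples, labels) if not test(x)]
    n = len(labels)
    remainder = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - remainder


# Hypothetical data: one nominal feature and one real-valued feature.
examples = [
    {"base": "A", "score": 0.9},
    {"base": "A", "score": 0.4},
    {"base": "G", "score": 0.2},
    {"base": "C", "score": 0.1},
]
labels = ["pos", "pos", "neg", "neg"]

# Candidate tests: one per nominal value, one per threshold.
candidates = {
    "base=A": lambda x: x["base"] == "A",
    "base=G": lambda x: x["base"] == "G",
    "score>0.5": lambda x: x["score"] > 0.5,
}

# The candidate with the highest gain would initialize the beam.
best = max(candidates,
           key=lambda name: information_gain(examples, labels, candidates[name]))
```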

5 Splitting Tests. To each m-of-n test in the beam and each candidate test, apply two operators:
m-of-(n+1), e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3}
(m+1)-of-(n+1), e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3}
Admit new tests to the beam if they increase the information gain and differ significantly (chi-squared) from existing tests.
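The two operators can be sketched as a single expansion step that adds one candidate condition to an existing test. The tuple representation `(m, literals)` and the function name `expand` are assumptions made for this sketch; the information-gain and chi-squared admission checks are deliberately left out.

```python
from typing import FrozenSet, List, Tuple

# An m-of-n test is represented as (m, set of condition names).
MofN = Tuple[int, FrozenSet[str]]


def expand(test: MofN, candidate: str) -> List[MofN]:
    """Apply both operators to a test, using one new candidate condition.

    Returns the m-of-(n+1) and (m+1)-of-(n+1) variants, or an empty list
    if the candidate is already part of the test.
    """
    m, literals = test
    if candidate in literals:
        return []
    grown = literals | {candidate}
    return [(m, grown), (m + 1, grown)]


# The slide's example: start from 2-of-{x1, x2} and add x3.
base = (2, frozenset({"x1", "x2"}))
new_tests = expand(base, "x3")
# new_tests holds 2-of-{x1, x2, x3} and 3-of-{x1, x2, x3}.
```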

6 Data Modelling. The amount of training data reaching each node decreases with the depth of the tree. TREPAN creates new training cases by sampling the distributions of the training data: empirical distributions for nominal inputs and kernel density estimates for continuous inputs. The oracle (i.e. the neural network) is then applied to the new training cases to assign output values.
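The sampling step can be sketched as follows. Several things here are assumptions for illustration only: a Gaussian kernel with a fixed bandwidth, features sampled independently of one another, the feature names `base` and `score`, and `toy_oracle`, which stands in for the trained network.

```python
import random


def sample_nominal(values, rng):
    """Draw from the empirical distribution: pick an observed value at random."""
    return rng.choice(values)


def sample_continuous(values, bandwidth, rng):
    """Draw from a Gaussian kernel density estimate.

    Pick an observed value as the kernel centre, then add Gaussian noise
    whose standard deviation is the bandwidth.
    """
    return rng.gauss(rng.choice(values), bandwidth)


def generate_cases(training, n_new, oracle, bandwidth, rng):
    """Create n_new synthetic cases and label each one with the oracle."""
    nominal = [row["base"] for row in training]
    continuous = [row["score"] for row in training]
    cases = []
    for _ in range(n_new):
        x = {
            "base": sample_nominal(nominal, rng),
            "score": sample_continuous(continuous, bandwidth, rng),
        }
        cases.append((x, oracle(x)))
    return cases


def toy_oracle(x):
    """Hypothetical stand-in for the trained neural network."""
    return "positive" if x["base"] == "A" and x["score"] > 0.5 else "negative"


training = [
    {"base": "A", "score": 0.9},
    {"base": "A", "score": 0.7},
    {"base": "G", "score": 0.2},
]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
cases = generate_cases(training, 5, toy_oracle, bandwidth=0.1, rng=rng)
```

Because the labels come from the oracle rather than from the original data, the tree can be grown on as many cases as needed at each node, which addresses the shrinking sample size at deeper nodes.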

7 Application to Bioinformatics Prediction of Splice Junction sites in Eukaryotic DNA

8 Splice Junction Sites

9 Consensus Sequences
Donor
Position: -3   -2  -1 | +1  +2  +3   +4  +5  +6
Base:     C/G  A   G  | G   T   A/G  A   G   T
Acceptor
Position: -12  -11  -10  -9   -8   -7   -6   -5   -4   -3   -2  -1 | +1
Base:     C/T  C/T  C/T  C/T  C/T  C/T  C/T  C/T  C/T  C/T  A   G  | G

10 EBI Dataset. Clean dataset generated at EBI (Thanaraj, 1999).
Donors: training set 567 positive, 943 negative; test set 229 positive, 373 negative.
Acceptors: training set 637 positive, 468 negative; test set 273 positive, 213 negative.

11 Results

12 TREPAN Donor Tree. Root test: 3 of {-2=A, -1=G, +3=A, +4=A, +5=G}. Branches: Yes, No. Leaves: Positive 869:74; Negative 43:533. Donor consensus: C/G A G | G T A/G A G T.
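The root test of the donor tree can be applied to a sequence window as in the sketch below. The 9-base window layout (positions -3 to -1 followed by +1 to +6, with no position 0) is an assumption taken from the donor consensus numbering, and the names `position_index` and `donor_root_test` are hypothetical. The function only reports whether the test is satisfied; it does not assign a class.

```python
# Conditions of the root test: position -> required base.
DONOR_TEST = {-2: "A", -1: "G", 3: "A", 4: "A", 5: "G"}


def position_index(pos: int) -> int:
    """Map a donor-site position (-3..-1, +1..+6) to an index in a 9-base window."""
    if pos == 0 or not -3 <= pos <= 6:
        raise ValueError(f"position out of range: {pos}")
    return pos + 3 if pos < 0 else pos + 2


def donor_root_test(window: str, m: int = 3) -> bool:
    """Return True when at least m of the five conditions match the window."""
    if len(window) != 9:
        raise ValueError("expected a 9-base window covering -3..+6")
    hits = sum(window[position_index(p)] == base
               for p, base in DONOR_TEST.items())
    return hits >= m


# A window matching the consensus satisfies all five conditions.
consensus_like = donor_root_test("CAGGTAAGT")
# A window with only -2=A and -1=G matching satisfies two, so the test fails.
weak_match = donor_root_test("CAGGTCCCC")
```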

13 C5 Donor Tree (extract)
p5=G
  p3=C or p3=T => NEGATIVE
  p3=A
    p2=G => POSITIVE
    p2=A
      p4=A or p4=G => POSITIVE
      p4=C or p4=T => NEGATIVE
    p2=C
      p4=A => POSITIVE
      else => NEGATIVE
    p2=T
      p6=A or p6=G => NEGATIVE
      p6=C or p6=T => POSITIVE
  p3=G
    p4=T => NEGATIVE
    p4=C
      p6=T => POSITIVE
      else => NEGATIVE

14 Trepan Acceptor Tree 1 of {-3=G, -5=G} NEGATIVE {-3=A} NEGATIVE POSITIVE NEGATIVE 2 of {+1!=G, -5=G} C/T … C/T A G | G

15 Application to Chemoinformatics 1. Learning general rules 2. Conformational Analysis 3. QSAR dataset

16 Oprea Dataset. 137 diverse compounds. Classification: 62 leads, 75 drugs. 14 descriptors (from Cerius-2), including MW, MR, AlogP, Ndonor, Nacceptor, Nrotbond and the number of Lipinski violations. Reference: T.I. Oprea, A.M. Davis, S.J. Teague & P.D. Leeson, “Is there a difference between Leads & Drugs? A Historical Perspective”, J. Chem. Inf. & Comput. Sci., 41, 1308-1315, (2001).

17 C5 tree
MW <= 380 [ Mode: lead ]
  Rule of 5 Violations = 0 [ Mode: lead ]
    Hbond acceptor <= 2 => lead
    Hbond acceptor > 2 [ Mode: drug ] => drug
  Rule of 5 Violations > 0 [ Mode: lead ] => lead
MW > 380 [ Mode: drug ] => drug

18 Trepan Oprea Tree 1 of { MW<296, MR<85 } Lead 52:3 Unclassified 12:49 MW<454 Drug 1:20

19 Conformational Analysis. 300 conformations from a 5 ns MD simulation of rosiglitazone, classified by length of the long axis into Extended (distance > 10 Å) and Folded (distance < 10 Å). 8 torsion angles. In-house data.

20 Rosiglitazone Agonist of PPAR gamma Nuclear Receptor Regulates HDL/LDL and triglycerides Active ingredient of Avandia for Type II Diabetes

21 Distances

22 C5 tree T5 <= 269 [ Mode: extended ] T5 <= 52 [ Mode: extended ] T7 extended T7 > 185 [ Mode: folded ] T6 folded T6 > 75 [ Mode: extended ] T5 <= 41 [ Mode: folded ] T8 folded T8 > 249 [ Mode: extended ] => extended T5 > 41 [ Mode: extended ] => extended T5 > 52 [ Mode: extended ] T6 <= 73 [ Mode: extended ] T8 <= 242 [ Mode: extended ] T5 <= 7 [ Mode: extended ] T8 extended T8 > 22 [ Mode: folded ] => folded T5 > 7 [ Mode: extended ] => extended T8 > 242 [ Mode: extended ] => extended T6 > 73 [ Mode: extended ] => extended T5 > 269 [ Mode: folded ] => folded

23 Trepan Conformation Tree T5 < 180 Extended 133:0 Unclassified 2:5 2 of { T7 172} Folded 0:161

24 Ferreira Dataset. A “typical” QSAR dataset: 48 HIV-1 protease inhibitors. Activity as pIC50: Low, pIC50 < 8.0; High, pIC50 > 8.0. 14 descriptors (mostly topological). Reference: R. Kiralj and M.M.C. Ferreira, “A-priori Molecular Descriptors in QSAR: a case of HIV-1 protease inhibitors I. The Chemometric Approach”, J. Mol. Graph. & Modell. 21, 435-448, (2003).

25 Original Results. PLS model. Activity determined by X9, X11, X10, X13. R² = 0.91, Q² = 0.85, Ncomps = 3.

26 C5 tree
X11 <= 2.5 [ Mode: low ]
  X13 <= 16.7 => low
  X13 > 16.7 [ Mode: high ] => high
X11 > 2.5 [ Mode: high ] => high

27 Trepan Ferreira Tree 1 of { X13<16.1, X9<3.4 } High 1:24 X1<552 Low 17:1 Low 4:1 High 0:1 X6<0.04

28 Accuracy

29 Conclusions Reasonable Accuracy Comprehensible Rules

30 Acknowledgements David Whitley. Tony Browne. Martyn Ford. BBSRC grant reference BIO/12005.

