Rule Extraction from Trained Neural Networks
  Brian Hudson, Centre for Molecular Design, University of Portsmouth
Artificial Neural Networks
  Advantages: high accuracy; robust; tolerant of noisy data
  Disadvantages: lack of comprehensibility
Rule Extraction
  Rule extraction from trained neural networks
  High fidelity to the original network
  TREPAN features: best-first tree growing; sampling of query instances; M-of-N rules
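The M-of-N rules mentioned above can be illustrated with a minimal sketch. An M-of-N test is satisfied when at least M of its N conditions hold. The function name `m_of_n` and the example conditions are illustrative only and are not taken from TREPAN itself.

```python
from typing import Sequence


def m_of_n(m: int, conditions: Sequence[bool]) -> bool:
    """Return True when at least m of the given conditions are true."""
    if not 0 <= m <= len(conditions):
        raise ValueError("m must lie between 0 and the number of conditions")
    return sum(bool(c) for c in conditions) >= m


# Hypothetical example: a "2 of {a, b, c}" test on three boolean features.
a, b, c = True, False, True
print(m_of_n(2, [a, b, c]))
```

A single M-of-N node can stand in for many ordinary single-attribute splits, which is one reason trees built from such tests tend to be compact.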
Bioinformatics applications
  Black-box solutions: neural networks; hidden Markov models
  Good test for the TREPAN methodology
Gene Splicing
  Well-known bioinformatics problem
  For details & links see
The “answer” is known
  Donor sequence:    C/G A G | G T A/G A G T
  Acceptor sequence: C/T C/T C/T C/T C/T C/T C/T C/T C/T C/T A G | G
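As a rough illustration of how the consensus patterns above can be read, the sketch below parses a pattern into per-position sets of allowed bases and counts how many positions of a candidate window agree. It assumes that `|` marks the splice site, that `/` separates alternative bases, and that the window is aligned with the pattern. The function names are hypothetical.

```python
DONOR_CONSENSUS = "C/G A G | G T A/G A G T"
ACCEPTOR_CONSENSUS = "C/T C/T C/T C/T C/T C/T C/T C/T C/T C/T A G | G"


def parse_consensus(pattern: str) -> tuple[list[set[str]], list[set[str]]]:
    """Split a pattern at '|' into upstream and downstream allowed-base sets."""
    upstream, downstream = pattern.split("|")

    def to_sets(side: str) -> list[set[str]]:
        return [set(token.split("/")) for token in side.split()]

    return to_sets(upstream), to_sets(downstream)


def consensus_matches(window: str, pattern: str) -> int:
    """Count the positions at which the window agrees with the consensus."""
    upstream, downstream = parse_consensus(pattern)
    allowed = upstream + downstream
    if len(window) != len(allowed):
        raise ValueError(f"expected a window of {len(allowed)} bases")
    return sum(base in ok for base, ok in zip(window.upper(), allowed))


# A window that agrees with the donor consensus at every position.
print(consensus_matches("CAGGTAAGT", DONOR_CONSENSUS))
```

Real sites rarely match the consensus at every position, which is why a threshold-style test is a natural fit for this problem.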
EBI clean dataset
  Tidied-up dataset generated at EBI

  Site        Set        Real   Unreal
  Donors      training    567      943
  Donors      test        229      373
  Acceptors   training    637      468
  Acceptors   test        273      213
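To put accuracy figures on these sets in context, the short sketch below computes the size of each split and the accuracy of always predicting the majority class. This baseline is not part of the original slides. It is simple arithmetic on the counts above, and the names `SPLITS` and `majority_baseline` are illustrative.

```python
# (real, unreal) counts for each split, copied from the dataset summary.
SPLITS = {
    ("donors", "training"): (567, 943),
    ("donors", "test"): (229, 373),
    ("acceptors", "training"): (637, 468),
    ("acceptors", "test"): (273, 213),
}


def majority_baseline(real: int, unreal: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    return max(real, unreal) / (real + unreal)


for (site, split), (real, unreal) in SPLITS.items():
    total = real + unreal
    baseline = majority_baseline(real, unreal)
    print(f"{site:9s} {split:8s} n={total:4d} majority baseline={baseline:.1%}")
```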
Summary of results
TREPAN tree for donors
  Network: 28x10x1
  Training: 92.25%
  Testing: 90.7%
  Consensus: C/G A G | G T A/G A G T
  Root test: 3 of {p-2=A, p-1=G, p+3=A, p+4=A, p+5=G}
  Leaves: REAL 869/74; UNREAL 43/533
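The root test of the donor tree can be sketched as a small function. The alignment is an assumption: a 9-base window covering positions p-3 to p-1 followed by p+1 to p+6, with no position 0, matching the layout of the consensus sequence. The slide does not say which branch leads to which leaf, so the function only reports whether the 3-of-5 test is satisfied. All names are illustrative.

```python
# Conditions of the root test, keyed by position relative to the splice site.
DONOR_RULE = {-2: "A", -1: "G", 3: "A", 4: "A", 5: "G"}

# Assumed window layout: three bases before the site, six after it.
OFFSETS = [-3, -2, -1, 1, 2, 3, 4, 5, 6]


def donor_rule_hits(window: str) -> int:
    """Count how many of the five rule conditions the window satisfies."""
    if len(window) != len(OFFSETS):
        raise ValueError(f"expected a window of {len(OFFSETS)} bases")
    bases = dict(zip(OFFSETS, window.upper()))
    return sum(bases[pos] == base for pos, base in DONOR_RULE.items())


def donor_rule_satisfied(window: str, m: int = 3) -> bool:
    """True when at least m of the five conditions hold (3 by default)."""
    return donor_rule_hits(window) >= m


for candidate in ("CAGGTAAGT", "CAGGTACCC", "CAGGTCCCC"):
    print(candidate, donor_rule_hits(candidate), donor_rule_satisfied(candidate))
```

All five conditions agree with the donor consensus, so the test can be read as requiring a partial match to the consensus at the variable positions.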
C5 tree for donors (part)
  p5=G
    p3=C or p3=T => FALSE
    p3=A
      p2=G => REAL
      p2=A
        p4=A or p4=G => REAL
        p4=C or p4=T => FALSE
      p2=C
        p4=A => REAL
        else => FALSE
      p2=T
        p6=A or p6=G => FALSE
        p6=C or p6=T => REAL
    p3=G
      p4=T => FALSE
      p4=C
        p6=T => REAL
        else => FALSE
      p4=A
        p2=C or p2=G or p2=T => REAL
        p2=A
          p-3=T => FALSE
          else => REAL
      p4=G
        p2=A or p2=C or p2=T => FALSE
        p2=G
          p1=A or p1=C => REAL
          p1=G or p1=T => FALSE
TREPAN tree for acceptors
  Network: 40x13x1
  Training: 80.2%
  Testing: 80.9%
  Node tests: 1 of {p-3=G, p-5=G}; {p-3=A}; 2 of {p+1!=G, p-5=G}
  Leaves: UNREAL 26/190; UNREAL 25/95; REAL 571/153; UNREAL 13/32
Conclusions
  Reasonable prediction rate
  ‘Explains’ predictions of the ANN
  Comprehensible rules: more suited to bioinformatics?
Acknowledgements
  BBSRC/EPSRC
  Dave Whitley (CMD)
  Tony Browne (LGU)
  Martyn Ford (CMD)