Automated Theory Formation in Bioinformatics Simon Colton Computational Bioinformatics Lab Imperial College, London.


1 Automated Theory Formation in Bioinformatics Simon Colton Computational Bioinformatics Lab Imperial College, London

2 Predictive Toxicology
Drug companies lose a lot of money developing drugs which are toxic. Machine learning problem: given positive (+) and negative (-) examples, why are the positives toxic (active)? Machine learning approaches: ILP, neural nets, linear regression, CART. Two more today: automated theory formation and template search. Important: scientists want both predictive accuracy and scientific knowledge.

3 Automated Theory Formation (ATF) Questions
Given some background information (concepts, hypotheses (axioms)) and some objects of interest (numbers, molecules, etc.), find something interesting. Interesting things could be concepts, examples, hypotheses or explanations.

4 ATF Overview
Scientific theories contain (at least): concepts (salt, acid, base); hypotheses (acid + base => salt + water); explanations (transfer of electrons, dissolving). So ATF should do (at least): concept formation, conjecture making, and hypothesis proving and disproving. It also needs to measure interestingness, present results, etc.

5 HR Theory Formation System
Developed in maths, but designed to be a general-purpose system. Concept-based theory formation: HR tries to make a concept, makes a conjecture when it can't make a concept, and tries to explain conjectures. Measures of interestingness direct a heuristic search.

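6 Concept Formation in HR
10 general production rules take in old concepts and produce new concepts. Example using the split, size, negate and compose rules:
[a,b] : b|a  --size-->  [a,n] : n = |{b : b|a}|  --split-->  [a] : 2 = |{b : b|a}|
[a,b] : b|a  --split-->  [a] : 2|a  --negate-->  [a] : not 2|a
[a] : 2 = |{b : b|a}|  with  [a] : not 2|a  --compose-->  [a] : 2 = |{b : b|a}| & not 2|a (odd prime numbers)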
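The construction of the odd primes on this slide can be sketched in Python. This is only an illustration of the four production rules named above, not HR's implementation: the function names, the list-of-tuples representation of concepts and the range 1 to 30 are all assumptions made for the example.

```python
# Illustrative sketch only: concepts are stored as lists of example tuples.
# NUMBERS, and every function name below, are assumptions for this example.
NUMBERS = range(1, 31)  # objects of interest: the integers 1..30

def divisor_pairs():
    """Background concept [a,b] : b|a."""
    return [(a, b) for a in NUMBERS for b in range(1, a + 1) if a % b == 0]

def size(pairs):
    """Size rule: [a,b] becomes [a,n] with n = |{b : (a,b) is an example}|."""
    counts = {}
    for a, _ in pairs:
        counts[a] = counts.get(a, 0) + 1
    return sorted(counts.items())

def split(pairs, value):
    """Split rule: fix the second column to a value, leaving [a]."""
    return [a for a, b in pairs if b == value]

def negate(examples):
    """Negate rule: objects of interest that do not satisfy the concept."""
    satisfied = set(examples)
    return [a for a in NUMBERS if a not in satisfied]

def compose(left, right):
    """Compose rule: conjunction of two concepts over the same objects."""
    right_set = set(right)
    return [a for a in left if a in right_set]

divisors = divisor_pairs()          # [a,b] : b|a
tau = size(divisors)                # [a,n] : n = |{b : b|a}|
primes = split(tau, 2)              # [a] : 2 = |{b : b|a}|
evens = split(divisors, 2)          # [a] : 2|a
odds = negate(evens)                # [a] : not 2|a
odd_primes = compose(primes, odds)  # [a] : 2 = |{b : b|a}| & not 2|a
print(odd_primes)
```

Each rule takes old concepts and returns a new one, so the same four functions can be chained in other orders to reach other concepts.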

7 Conjecture Making
Empirical checks are performed after each attempt to invent a new concept. If the concept has no examples, HR makes a non-existence conjecture. If the concept has the same examples as a previous concept, HR makes an equivalence conjecture. If another concept subsumes the concept, HR makes an implication conjecture.
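The three empirical checks can be sketched as follows. This is a minimal illustration under stated assumptions, not HR's code: the theory is assumed to be a dictionary from concept names to sets of examples, and `check_new_concept` and the conjecture tuples are hypothetical names. Following the earlier slide, the sketch does not add a concept to the theory when a non-existence or equivalence conjecture is made.

```python
# Illustrative sketch only. The dict-of-sets theory and all names here are
# assumptions for the example, not HR's data structures.
def check_new_concept(name, examples, theory):
    """Compare a candidate concept's examples against the existing theory.

    Returns a list of (kind, new_concept, old_concept) conjectures.
    """
    examples = frozenset(examples)

    # No examples at all: conjecture that nothing satisfies the definition.
    if not examples:
        return [("non-existence", name, None)]

    # Same examples as an existing concept: conjecture equivalence and
    # do not keep the new concept.
    for old_name, old_examples in theory.items():
        if examples == frozenset(old_examples):
            return [("equivalence", name, old_name)]

    # Examples strictly contained in an existing concept's examples:
    # conjecture that the new concept implies the old one.
    conjectures = [
        ("implication", name, old_name)
        for old_name, old_examples in theory.items()
        if examples < frozenset(old_examples)
    ]

    theory[name] = examples  # the concept is new, so it joins the theory
    return conjectures

demo_theory = {"odd": {1, 3, 5, 7, 9}, "prime": {2, 3, 5, 7}}
print(check_new_concept("odd_prime", {3, 5, 7}, demo_theory))
```

Because the checks are purely empirical, every conjecture produced this way is only known to hold on the supplied examples.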

8 Conjecture Extraction
Suppose HR makes the equivalence conjecture P(a) & Q(a) <=> R(a) & S(a). It extracts:
P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)
R(a) & S(a) => P(a), R(a) & S(a) => Q(a)
It then tries to extract prime implicates such as P(a) => R(a), Q(a) => R(a), etc. (these require proving, though). Important: this yields Horn clauses, which can be expressed in Prolog.
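The extraction step can be sketched in Python. The representation (literals as strings, a clause as a body tuple plus a head) and the function names are assumptions for illustration. The sketch only enumerates the candidate clauses with smaller bodies; deciding which of them actually hold is the proving step mentioned on the slide and is not attempted here.

```python
from itertools import combinations

# Illustrative sketch only: literals are plain strings and a Horn clause is
# a (body, head) pair. All names are assumptions for this example.
def extract_implications(lhs, rhs):
    """Split an equivalence lhs <=> rhs into one Horn clause per head literal."""
    clauses = []
    for body, heads in ((lhs, rhs), (rhs, lhs)):
        for head in heads:
            clauses.append((tuple(body), head))
    return clauses

def candidate_implicates(clauses):
    """List clauses with strictly smaller bodies; each still needs proving."""
    candidates = []
    for body, head in clauses:
        for k in range(1, len(body)):
            for sub_body in combinations(body, k):
                candidates.append((sub_body, head))
    return candidates

def to_prolog(body, head):
    """Render a (body, head) clause in Prolog syntax."""
    return f"{head} :- {', '.join(body)}."

clauses = extract_implications(["p(A)", "q(A)"], ["r(A)", "s(A)"])
for body, head in clauses:
    print(to_prolog(body, head))
```

For the slide's example this gives four clauses, and `candidate_implicates` proposes eight single-literal bodies such as `r(A) :- p(A).` as possible prime implicates.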

9 Greatest Hits (in Maths)
Pre-processing constraint problems; learning properties of residue classes; inventing integer sequences; puzzle generation; adding to the TPTP library; setting mathematical tutorial questions. See the Springer book for details.

10 Long term aim in Bioinformatics
Develop an ATF system working in biological domains. The biologist provides little background information, in a format they are happy with. The program provides results that are intelligent, interesting, not too numerous, and contain very little rubbish.

11 Some short term aims in Bioinformatics
HR can work with biological data: it takes input similar to Progol's. Use HR to solve ML problems, and see how bad an idea that is. Use theory formation to improve ML: integrate HR and Progol somehow. Push the envelope: give biologists more information.

12 Approach to ML Tasks
Give HR the same input as Progol, get it to form a theory, then look at the theory and extract concepts which look similar to the target concept. This is not a goal-based approach, and it is a bad idea (slow). Implemented a reactive search instead, which is much faster.

13 Mutagenesis(42) Data
Mutagenesis is related to carcinogenesis. 42 drugs are supplied with atom and bond details: atom type, number and charge; bond type (1-8). 13 are mutagenic (active) and 29 are not active. Progol learned this concept (88% accurate):
active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E)
[Figure: the learned substructure, with atom labels c,21, ? and ? and bond labels 1 and 2.]

14 HR’s Results
Using reactive search, four production rules (PRs) and 30K steps, HR learned these concepts:
active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)
active(A) :- bond(A,B,C,D), atm(B,E,21), atm(C,F,38)
active(A) :- \+ (bond(A,B,C,D), atm(B,F,21), bond(A,C,D,E))
These are also 88% accurate. But Progol’s answer is “better” because it has higher information content (fewer ?s), and biologists sometimes want more information.
[Figure: HR's substructure, with atom labels ?,21, ? and ? and bond labels 1 and ?.]

15 But...
HR also made these equivalence conjectures and extracted them (plus 100 more) for us:
atm(B,X1,21) <=> atm(B,c,21)
atm(B,X1,38) <=> atm(B,n,38)
bond(A,B,C,X1) & atm(C,X2,38) <=> bond(A,B,C,1) & atm(C,X3,38)
bond(A,X1,B,X2) & atm(B,X3,38) <=> bond(A,B,X4,2), atm(B,X5,38)
We used these to rewrite HR’s answer. This was done by hand, but we hope to automate it.

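16 Giving us this answer:
[Figure: the rewritten HR substructure; recoverable labels: n,38, ? and bond types 1 and 2.]
Remember that Progol’s answer was:
[Figure: Progol's substructure, with atom labels c,21, ? and ? and bond labels 1 and 2.]
So, we filled in one of the blanks!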

17 Are we making a meal of this?
Yes, possibly, for the mutagenesis data. I was worried about the difficulty of this problem. In the last fortnight: “template search”, a 200-line Prolog program which can be distributed over multiple processors, can be easily understood by biologists, and gets these results...

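18 Template search - Results
More specific substructure found (88% accurate on 42, 88% cross validation):
[Figure: substructure with atom labels c,21, n,38 and o,40 and bond labels 1 and 2.]
More general substructure found (also 88% accurate):
[Figure: substructure with atom label c,21 and bond label 1.]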

19 Template Search - Assumptions
Connected substructures are interesting answers: Progol’s answers are all substructures. More specific substructures are OK: biologists may even want lots of information (don’t forget that they want to do science). Each learned concept will be true of at least one active (positive) molecule.

20 Template Search - Overview
The user specifies a template for substructures and how general the solution can become (IC limit). Example: 3 ?s allowed in the template shown.
[Figure: substructure template with unspecified (?) atom and bond labels.]
The method follows Mitchell's FIND-S routine (very simple). The algorithm starts with the first positive and extracts all the substructures (in the template). It then takes the next positive and, for each substructure, finds the least general generalisation, so that the new substructure is true of both positives. It does not over-generalise (IC limit).
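The overview above can be sketched in Python. This is a rough illustration, not the 200-line Prolog program from the talk. It assumes that each positive molecule has already been reduced to the list of its substructures that fit the template, that a substructure is a fixed-length tuple of labels, and that `None` stands for a `?`. It also assumes that a substructure is kept unchanged when every generalisation would exceed the limit on `?`s; the slides do not say how that case is handled. The toy data is invented and is not taken from the mutagenesis dataset.

```python
# Illustrative sketch only. Representation, names and toy data are all
# assumptions for this example; None plays the role of "?" in the slides.
def lgg(s, t):
    """Least general generalisation: keep a label only where both agree."""
    return tuple(x if x == y else None for x, y in zip(s, t))

def wildcards(s):
    """Number of ?s in a substructure."""
    return sum(1 for x in s if x is None)

def template_search(positives, max_wildcards):
    """FIND-S style generalisation over the positives, one at a time."""
    # Start from every substructure of the first positive.
    hypotheses = set(positives[0])
    for molecule in positives[1:]:
        next_hypotheses = set()
        for h in hypotheses:
            # Generalise h against each substructure of the next positive,
            # discarding results with too many ?s (the IC limit).
            gens = {lgg(h, s) for s in molecule}
            gens = {g for g in gens if wildcards(g) <= max_wildcards}
            # Assumption: if nothing survives the limit, keep h as it is.
            next_hypotheses |= gens if gens else {h}
        hypotheses = next_hypotheses
    return hypotheses

# Invented toy positives: (element, type, element, type) atom pairs.
positive_1 = [("c", 21, "n", 38), ("c", 22, "h", 3)]
positive_2 = [("c", 21, "o", 40), ("c", 27, "h", 3)]
print(template_search([positive_1, positive_2], max_wildcards=2))
```

With a limit of two `?`s, the toy run keeps `("c", 21, None, None)` and `("c", None, "h", 3)` and rejects the over-general `("c", None, None, None)`.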

21 Using the results
The procedure finds many results, ranging from specific to general, so the user must be advised on usage: take the most specific best answer; take the most general best answer; take a disjunction of all best answers; or take a more intelligent disjunction. Cross-validation results are required to tell the user the predictive accuracy.
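The usage options can be illustrated with a small Python sketch that scores a single substructure or a disjunction of substructures on labelled molecules. The representation (tuples with `None` for `?`), the function names and the toy data are assumptions made for the example; the data is invented and does not reproduce the figures in the talk. Cross-validation is not shown.

```python
# Illustrative sketch only. Names, representation and toy data are
# assumptions for this example; None plays the role of "?".
def matches(hypothesis, substructure):
    """True if every non-? label in the hypothesis agrees with the substructure."""
    return all(h is None or h == s for h, s in zip(hypothesis, substructure))

def covers(hypothesis, molecule):
    """True if the hypothesis matches at least one substructure of the molecule."""
    return any(matches(hypothesis, s) for s in molecule)

def accuracy(hypotheses, labelled):
    """Accuracy of predicting 'active' when any hypothesis covers the molecule."""
    correct = sum(
        1
        for molecule, active in labelled
        if any(covers(h, molecule) for h in hypotheses) == active
    )
    return correct / len(labelled)

def wildcards(hypothesis):
    """Number of ?s in a hypothesis."""
    return sum(1 for x in hypothesis if x is None)

def most_specific(hypotheses):
    """Hypothesis with the fewest ?s."""
    return min(hypotheses, key=wildcards)

def most_general(hypotheses):
    """Hypothesis with the most ?s."""
    return max(hypotheses, key=wildcards)

# Invented toy data: (list of substructures, is_active).
data = [
    ([("c", 21, "n", 38)], True),
    ([("c", 22, "h", 3)], True),
    ([("c", 27, "h", 3)], False),
    ([("o", 40, "h", 3)], False),
]
h1 = ("c", 21, None, None)
h2 = ("c", 22, None, None)
print(accuracy([h1], data), accuracy([h1, h2], data))
```

In this toy case each substructure alone covers only one of the two actives, while their disjunction covers both without covering an inactive, which is the motivation for the disjunction options on the slide.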

22 More Results
Three-in-a-row template (2 minutes): 6 answers with 88% over the 42 examples:
[c,21,_,_,o,_,_,_]
[c,21,_,_,_,_,1,_]
[c,21,_,_,_,_,1,2]
[c,21,_,_,o,_,1,2]
[_,_,c,21,n,38,7,1]
[c,21,n,38,o,40,1,2]
Take most general/specific: 88% 1-fold cross-val. Take disjunction of all: 88% cross-validation. Take more intelligent disjunction (95% accurate on 42, 80% cross validation):
[Figure: the substructures in the disjunction; recoverable labels include c,21, n,38 and o,40 with bonds 1 and 2, and atoms c,?, c,22, c,195 and h,3 with the values -0.132 and 0.145.]

23 Conclusions & Future Work
Automated theory formation may be useful to bioinformatics: use HR’s theory to improve Progol’s results, possibly by pre-processing Progol’s input or by post-processing the learned concept. Template search: maybe a good idea? It is simple and pushes the envelope, with nice results for the Mutagenesis(42) dataset. Distribute the process: Processor per Positive (PPP).

