Discovering Substructures in Chemical Toxicity Domain Masters Project Defense by Ravindra Nath Chittimoori Committee: DR. Lawrence B. Holder, DR. Diane.


1 Discovering Substructures in Chemical Toxicity Domain
Master's Project Defense by Ravindra Nath Chittimoori
Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson
Department of Computer Science and Engineering, University of Texas at Arlington

2 Outline
- Chemical Toxicity Database
- Motivation and Goal
- Knowledge Discovery in Databases (KDD)
- SUBDUE Knowledge Discovery System
- Experiments with Unsupervised SUBDUE
- Experiments with Supervised SUBDUE
- Discussion of Results
- Conclusions
- Future Work

3 Chemical Toxicity Database
- Carcinogenesis Prediction Problem
- Toxicology Evaluation Challenge
- Domain:

  Compounds           +     -   Total
  Training set       162   136    298
  Experimental set    27    25     69

4 Motivation and Goal
- Ever-increasing number of chemical compounds
- Analysis is needed to obtain the structure-activity relationships of a compound
- Determine SUBDUE's applicability to the chemical toxicity domain

5 Knowledge Discovery in Databases (KDD)
- Process of identifying valid, novel, potentially useful and understandable patterns in data
- Goal of Knowledge Discovery:
  -- Verification
  -- Discovery
- Data mining methods
- Model Representation, Evaluation and Search

6 Steps in KDD
- Identify the goal of the process
- Collect, create and prepare the dataset
- Select the data mining method
- Select the data mining algorithm
- Transform the data
- Execute the algorithm
- Interpret/evaluate the discovered patterns
- Consolidate the knowledge discovered

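7 SUBDUE Knowledge Discovery System
- SUBDUE discovers patterns [substructures] in structural data sets
  -- Vertices: objects or attributes
  -- Edges: relationships
[Figure: example substructure with two "object" vertices, "shape" edges to "triangle" and "square", and an "on" edge; the caption notes 4 instances of the substructure]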

8 SUBDUE - Input Representation
- Each atom is represented as a vertex with directed edges to the name, type and partial charge of the atom
- Bonds are represented as undirected edges
- Each group is represented as a vertex having a string label specifying the group name, with directed edges to all participating atom vertices
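
The representation on this slide can be sketched in a few lines of Python. This is only an illustration, not SUBDUE's actual input file format: the `Graph` class and the helpers `add_atom`, `add_bond` and `add_group` are hypothetical names, the edge labels `n`, `t`, `p` and `gr` follow the legend used on the example slides, and the atom values are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical container: labeled vertices plus labeled edges."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (src, dst, label, directed)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    # One vertex per atom, with directed edges to its name (n),
    # type (t) and partial charge (p).
    atom = g.add_vertex("atom")
    for edge_label, value in (("n", name), ("t", atom_type), ("p", charge)):
        g.add_edge(atom, g.add_vertex(str(value)), edge_label)
    return atom


def add_bond(g, a, b, bond_type):
    # Bonds are undirected edges between two atom vertices.
    g.add_edge(a, b, str(bond_type), directed=False)


def add_group(g, group_name, atoms):
    # One vertex per group, with directed edges (gr) to its atoms.
    group = g.add_vertex(group_name)
    for atom in atoms:
        g.add_edge(group, atom, "gr")
    return group


# Illustrative values only: two carbon atoms joined by a bond of type 1,
# both participating in a methyl group.
g = Graph()
c1 = add_atom(g, "c", 10, 0.062)
c2 = add_atom(g, "c", 10, 0.063)
add_bond(g, c1, c2, 1)
methyl = add_group(g, "methyl", [c1, c2])
```

Storing attribute values as their own vertices, rather than as fields on the atom, is what lets a substructure discovery system treat "atom with type 10" as a pattern it can find and count.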

9 SUBDUE - Input Representation
- Representation used in Unsupervised SUBDUE
  -- A vertex having a string label specifying the alert, with directed edges to all the atoms in the compound
- Representation used in Supervised SUBDUE
  -- A vertex for all the compounds, with the string label "compound"
  -- The compound vertex has directed edges to all the vertices representing the activity of an alert on a compound

10 Unsupervised SUBDUE Input Representation Example
[Figure: example input graph; vertex labels Ames, Methyl, Atom, C, 10, 0.062, 0.063; edge labels n, t, p, gr, po, 1]
Legend: n - Name, t - Type, p - Partial charge, po - Positive, gr - Group

11 Supervised SUBDUE Input Representation Example
[Figure: example input graph; vertex labels Com, Ames, Positive, Methyl, Atom, C, 10, 0.062, 0.063; edge labels n, t, p, gr, contains, 1]
Legend: n - Name, t - Type, p - Partial charge, gr - Group, Com - Compound

12 SUBDUE - Model Evaluation
- Minimum Description Length Principle
  -- Best theory to describe any graph
  -- Minimize I(S) + I(G|S), where I(S) is the description length of substructure S and I(G|S) is the description length of graph G after compression with S
- Graph Compression
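
To make the objective concrete, here is a toy sketch of scoring candidate substructures by compression. It is not SUBDUE's actual encoding: as a stand-in for the description length I(.), it simply counts vertices plus edges, and it assumes non-overlapping instances that are each replaced by a single vertex. The function names and the numbers are hypothetical. Because the size of the original graph is constant, maximizing the ratio below is equivalent to minimizing size(S) + size(G|S).

```python
def size(num_vertices, num_edges):
    # Stand-in for description length: vertices plus edges.
    return num_vertices + num_edges


def compressed_size(graph, sub, instances):
    # Each instance of the substructure collapses to one vertex, so it
    # removes (sub_vertices - 1) vertices and all of its internal edges.
    gv, ge = graph
    sv, se = sub
    return size(gv - instances * (sv - 1), ge - instances * se)


def compression_value(graph, sub, instances):
    # Higher is better: size(G) / (size(S) + size(G|S)).
    return size(*graph) / (size(*sub) + compressed_size(graph, sub, instances))


# Hypothetical graph with 20 vertices and 24 edges, and two candidates
# given as ((vertices, edges), number of instances).
graph = (20, 24)
candidates = {"S1": ((4, 3), 4), "S2": ((2, 1), 5)}

values = {
    name: compression_value(graph, sub, count)
    for name, (sub, count) in candidates.items()
}
best = max(values, key=values.get)
```

In this toy setting the larger substructure wins even though it occurs less often, because each of its instances removes more of the graph.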

13 Other Important Concepts of SUBDUE
- Inexact Graph Match Approach
- Concept Learning
- Predefined Substructures

14 Unsupervised SUBDUE - Methodology
- Training set further divided
- 3 approaches to determine carcinogenicity of compounds in experimental set
  -- Apply SUBDUE individually to the compounds
  -- Inclusion of pre-defined substructures
  -- Check for matching of substructure in the compound to be classified
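
The third approach, checking whether a discovered substructure occurs in the compound to be classified, can be illustrated with a brute-force exact matcher. This is a simplification and not SUBDUE's own matcher, which also supports inexact graph matching. The function name `contains`, the directed-edges-only representation, and the tiny example graphs are all assumptions for illustration.

```python
from itertools import permutations


def contains(graph_vertices, graph_edges, sub_vertices, sub_edges):
    """Return True if the substructure occurs in the graph.

    Vertices are dicts of id -> label; edges are sets of directed
    (src, dst, label) triples. Exhaustive search, so only suitable
    for very small graphs.
    """
    sub_ids = list(sub_vertices)
    for image in permutations(list(graph_vertices), len(sub_ids)):
        mapping = dict(zip(sub_ids, image))
        # Every mapped vertex must carry the same label.
        if any(sub_vertices[s] != graph_vertices[mapping[s]] for s in sub_ids):
            continue
        # Every substructure edge must exist between the mapped vertices.
        if all((mapping[a], mapping[b], lab) in graph_edges
               for a, b, lab in sub_edges):
            return True
    return False


# Hypothetical compound: one atom named "c" with type 10, one atom named "br".
compound_vertices = {1: "atom", 2: "c", 3: "10", 4: "atom", 5: "br"}
compound_edges = {(1, 2, "n"), (1, 3, "t"), (4, 5, "n")}

# Hypothetical discovered substructure: an atom named "br".
sub_vertices = {"a": "atom", "b": "br"}
sub_edges = {("a", "b", "n")}

found = contains(compound_vertices, compound_edges, sub_vertices, sub_edges)
```

The search is factorial in the worst case, which is acceptable for a sketch but is one reason real systems rely on bounded or heuristic matching.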

15 Unsupervised SUBDUE - Results
[Figure: discovered substructure; vertex labels atom, c, 10, 0.062, atom, br, 0.057; edge labels n, t, p, 1, 3]
- Third approach used to classify compounds in experimental set
- Accuracy Level -> 0.322
- Cyanate & ether groups are also discovered to be indicators of carcinogenic activity

16 Supervised SUBDUE - Methodology
- Create set of indicators of carcinogenic activity
- Create set of indicators of noncarcinogenic activity
- Calculate value of substructures discovered in carcinogenic and noncarcinogenic set
- Select a set of substructures to be used in classifying compounds in experimental set

17 Supervised SUBDUE - Methodology
- Check for the existence of these substructures in the compound to be classified
- Calculate the Carcinogenic Activity Value of the compound
- Calculate the NonCarcinogenic Activity Value of the compound
- Determine the activity of the compound
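
The slides do not say how the two activity values are computed, so the following is only one plausible reading: sum the values of the selected substructures found in the compound, separately for each set, and pick the larger total. The scoring rule, the function names, and every numeric value below are assumptions; the group names are borrowed from the results slide purely as labels.

```python
def activity_value(present, indicators):
    # Sum the values of the indicator substructures found in the compound.
    return sum(value for name, value in indicators.items() if name in present)


def classify(present, carcinogenic, noncarcinogenic):
    # Compare the two totals; ties are left undetermined.
    pos = activity_value(present, carcinogenic)
    neg = activity_value(present, noncarcinogenic)
    if pos > neg:
        return "+"
    if neg > pos:
        return "-"
    return "undetermined"


# Hypothetical substructure values for each indicator set.
carcinogenic = {"amino": 1.3, "methyl": 1.1, "ether": 1.2}
noncarcinogenic = {"methoxy": 1.2, "nitro": 1.1}

# Substructures found in three hypothetical compounds.
prediction_a = classify({"amino", "nitro"}, carcinogenic, noncarcinogenic)
prediction_b = classify({"methoxy", "nitro"}, carcinogenic, noncarcinogenic)
prediction_c = classify(set(), carcinogenic, noncarcinogenic)
```

Returning "undetermined" on a tie keeps the sketch honest about cases where neither set of indicators applies, rather than silently defaulting to one class.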

18 Supervised SUBDUE - Results
- A set of 12 substructures discovered by SUBDUE was used to classify compounds in the experimental set
- The 6 substructures from the carcinogenic set include substructures that form part of groups such as amino, di10, methyl, ether and halide10, and a substructure indicating that the compound tested positive on Ames, Salmonella, etc.
- The 6 substructures from the noncarcinogenic set include substructures that form part of groups such as methoxy, Ar_Halide, di64, nitro and alkyl_halide, and a substructure indicating that the compound tested negative on Ames, Salmonella, etc.

19 Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: substructure with labels Compound, Ames, Salmonella, Salmonella_n, positive]

20 Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: substructure with vertex labels Atom, Cl, C, 93, 10, -0.024, -0.123, Halide10; edge labels n, t, p, gr]
Legend: n - Name, t - Type, p - Partial charge, gr - Group

21 Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: substructure with labels Compound, Ames, Salmonella, Cytogen_ca, negative]

22 Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: substructure with vertex labels Atom, Cl, C, 93, 10, 0.477, -0.124, A-H; edge labels n, t, p, gr]
Legend: n - Name, t - Type, p - Partial charge, gr - Group, A-H - Alkyl Halide

23 Supervised SUBDUE - Results
- PTE-1 Results:

  Compounds               +     -   Total
  PTE-1                  20    19     39
  Correct Prediction     12     6     18
  Incorrect Prediction    8    13     21

- Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)

24 Supervised SUBDUE - Results
- PTE-2 Results:

  Compounds               +     -   Total
  PTE-2                   7     6     13 *
  Correct Prediction      4     3      7
  Incorrect Prediction    3     3      6
  * Number of compounds whose activity is known

- Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
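
The accuracy figures on the two results slides follow from the prediction counts as correct / (correct + incorrect). The short check below recomputes them, reading the PTE-1 incorrect counts as 8 (+) and 13 (-); the variable and function names are illustrative.

```python
def accuracy(correct, incorrect):
    # Fraction of compounds whose activity was predicted correctly.
    return correct / (correct + incorrect)


# (correct, incorrect) prediction counts per class, from the result tables.
results = {
    "PTE-1": {"+": (12, 8), "-": (6, 13)},
    "PTE-2": {"+": (4, 3), "-": (3, 3)},
}


def summarize(per_class):
    # Per-class accuracy plus the accuracy over all compounds.
    summary = {cls: accuracy(c, i) for cls, (c, i) in per_class.items()}
    total_correct = sum(c for c, _ in per_class.values())
    total_incorrect = sum(i for _, i in per_class.values())
    summary["total"] = accuracy(total_correct, total_incorrect)
    return summary


summaries = {name: summarize(per_class) for name, per_class in results.items()}

for name, summary in summaries.items():
    print(name, {cls: round(value, 3) for cls, value in summary.items()})
```

Reporting per-class accuracy alongside the total matters here because the two classes behave very differently on PTE-1.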

25 Results - Discussion
- Unsupervised SUBDUE successful in discovering lead indicators of carcinogenic activity
- Supervised SUBDUE also successful in discovering lead indicators of carcinogenic activity
- ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62)
- Ashby, TOPKAT are other toxicity prediction methods

26 Conclusions
- Consistent with results obtained by logic-based systems like PROGOL
- Prefer to use Concept Learner when positive and negative examples of target concept available
- SUBDUE is capable of discovering lead indicators of carcinogenic/noncarcinogenic activity in chemical toxicity domain.

27 Future Work
- PTE-3 Evaluation Challenge
- Trimmed Data Sets (Partial Charge)
- Newer Version of Concept Learning SUBDUE being developed

28 Reference http://cygnus.uta.edu/subdue

