1
Discovering Substructures in Chemical Toxicity Domain
Master's Project Defense by Ravindra Nath Chittimoori
Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson
Department of Computer Science and Engineering
University of Texas at Arlington
2
Outline
• Chemical Toxicity Database
• Motivation and Goal
• Knowledge Discovery in Databases (KDD)
• SUBDUE Knowledge Discovery System
• Experiments with Unsupervised SUBDUE
• Experiments with Supervised SUBDUE
• Discussion of Results
• Conclusions
• Future Work
3
Chemical Toxicity Database
• Carcinogenesis Prediction Problem
• Toxicology Evaluation Challenge
• Domain:

Compounds           +     -   Total
Training set      162   136     298
Experimental set   27    25      69
4
Motivation and Goal
• Ever-increasing number of chemical compounds
• Analysis is needed to obtain the structure-activity relationships of a compound
• Determine SUBDUE's applicability to the chemical toxicity domain
5
Knowledge Discovery in Databases (KDD)
• Process of identifying valid, novel, potentially useful and understandable patterns in data
• Goal of Knowledge Discovery:
  -- Verification
  -- Discovery
• Data mining methods
• Model Representation, Evaluation and Search
6
Steps in KDD
• Identify the goal of the process
• Collect, create and prepare the dataset
• Select the data mining method
• Select the data mining algorithm
• Transform the data
• Execute the algorithm
• Interpret/evaluate the discovered patterns
• Consolidate the knowledge discovered
7
SUBDUE Knowledge Discovery System
• SUBDUE discovers patterns (substructures) in structural data sets
• Vertices: objects or attributes
• Edges: relationships
[Figure: example input graph with 4 instances of a substructure. Labels shown: object, triangle, object, square; edge labels: shape, on]
8
SUBDUE - Input Representation
• Each atom is represented as a vertex with directed edges to the name, type and the partial charge of the atom
• Bonds are represented as undirected edges
• Each group is represented as a vertex having a string label specifying the group name, with directed edges to all participating atom vertices
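The representation above can be sketched as a small labeled graph. This is only an illustration, not SUBDUE's actual input format: the `Graph` class, the `add_atom` helper and the `"bond"` edge label are hypothetical names, and the atom values are borrowed from the example slide purely as sample data.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    # Hypothetical container: vertex id -> label, plus a list of labeled edges.
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, target, label, directed)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    """Add an atom vertex with directed edges to its name, type and partial charge."""
    atom = g.add_vertex("atom")
    g.add_edge(atom, g.add_vertex(name), "n")            # n - name
    g.add_edge(atom, g.add_vertex(str(atom_type)), "t")  # t - type
    g.add_edge(atom, g.add_vertex(str(charge)), "p")     # p - partial charge
    return atom


g = Graph()
c1 = add_atom(g, "c", 10, 0.062)
c2 = add_atom(g, "c", 10, 0.063)
g.add_edge(c1, c2, "bond", directed=False)  # bonds are undirected edges
methyl = g.add_vertex("methyl")             # group vertex labeled with the group name
for atom in (c1, c2):
    g.add_edge(methyl, atom, "gr")          # directed edge to each participating atom
```

Keeping attribute values as separate vertices, rather than as fields of the atom, is what allows a discovered substructure to include or omit individual attributes.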
9
SUBDUE - Input Representation
• Representation used in Unsupervised SUBDUE
  -- A vertex having a string label specifying the alert, with directed edges to all the atoms in the compound
• Representation used in Supervised SUBDUE
  -- A vertex for all the compounds with string label "compound"
  -- The compound vertex has directed edges to all the vertices representing the activity of an alert on a compound
10
Unsupervised SUBDUE Input Representation Example
[Figure: example graph. Labels shown: Ames, Methyl, Atom, C, 10, 0.062, 0.063, 1; edge labels: n, t, p, gr, po]
n - Name, t - Type, p - Partial charge, po - Positive, gr - group
11
Supervised SUBDUE Input Representation Example
[Figure: example graph. Labels shown: Com, contains, Ames, Positive, Methyl, Atom, C, 10, 0.062, 0.063, 1; edge labels: n, t, p, gr]
n - Name, t - Type, p - Partial charge, gr - group, Com - Compound
12
SUBDUE - Model Evaluation
• Minimum Description Length Principle
  -- Best theory to describe any graph
  -- Minimize I(S) + I(G|S), where I(S) is the number of bits needed to encode substructure S and I(G|S) is the number of bits needed to encode graph G given S
• Graph Compression
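To make the quantity I(S) + I(G|S) concrete, here is a toy calculation. It is a sketch under a strong simplifying assumption: description length is approximated by the number of vertices plus edges rather than by a real bit-level encoding, and each instance of S is assumed to be non-overlapping and replaced by a single vertex. The function names and the example graph sizes are hypothetical.

```python
def size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def compressed_size(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Size of G after each non-overlapping instance of S collapses to one vertex."""
    v = g_vertices - instances * (s_vertices - 1)  # each instance keeps one vertex
    e = g_edges - instances * s_edges              # edges inside instances vanish
    return size(v, e)


def mdl_value(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Toy I(S) + I(G|S); lower values mean better compression."""
    return size(s_vertices, s_edges) + compressed_size(
        g_vertices, g_edges, s_vertices, s_edges, instances
    )


# Hypothetical graph: 20 vertices and 24 edges, containing 4 instances of a
# substructure with 4 vertices and 3 edges.
print(size(20, 24))                # 44 (uncompressed)
print(mdl_value(20, 24, 4, 3, 4))  # 27 (substructure plus compressed graph)
```

Under this measure a substructure is worth keeping only when the second number is smaller than the first, which is the intuition behind graph compression as an evaluation criterion.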
13
Other Important Concepts of SUBDUE
• Inexact Graph Match Approach
• Concept Learning
• Predefined Substructures
14
Unsupervised SUBDUE - Methodology
• Training set further divided
• 3 approaches to determine carcinogenicity of compounds in experimental set
  -- Apply SUBDUE individually to the compounds
  -- Inclusion of pre-defined substructures
  -- Check for matching of substructure in the compound to be classified
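The third approach, checking whether a substructure occurs in a compound, can be illustrated with a brute-force exact match. This is a minimal sketch, not SUBDUE's matcher: it does exact matching only (no inexact graph match), treats all edges as directed, and its running time grows very quickly with graph size. The `contains` function and the tiny example graphs are hypothetical.

```python
from itertools import permutations


def contains(compound, pattern):
    """Return True if `pattern` occurs in `compound`.

    Each graph is a pair (labels, edges): `labels` maps vertex id -> label and
    `edges` is a set of (source, target, edge_label) triples.
    """
    c_labels, c_edges = compound
    p_labels, p_edges = pattern
    p_ids = list(p_labels)
    # Try every assignment of pattern vertices to distinct compound vertices.
    for candidate in permutations(c_labels, len(p_ids)):
        mapping = dict(zip(p_ids, candidate))
        if any(p_labels[p] != c_labels[c] for p, c in mapping.items()):
            continue  # a vertex label does not match
        if all((mapping[s], mapping[t], lab) in c_edges for s, t, lab in p_edges):
            return True  # every pattern edge is present in the compound
    return False


# Hypothetical compound: a carbon atom bonded to a bromine atom.
compound = (
    {1: "atom", 2: "c", 3: "atom", 4: "br"},
    {(1, 2, "n"), (3, 4, "n"), (1, 3, "bond")},
)
bromine = ({1: "atom", 2: "br"}, {(1, 2, "n")})
chlorine = ({1: "atom", 2: "cl"}, {(1, 2, "n")})

print(contains(compound, bromine))   # True
print(contains(compound, chlorine))  # False
```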
15
Unsupervised SUBDUE - Results
[Figure: discovered substructures. Labels shown: atom, c, 10, 0.062; atom, br, 0.057; 1, 3; edge labels: n, t, p]
• Third approach used to classify compounds in experimental set
• Accuracy level: 0.322
• Cyanate and ether groups are also discovered to be indicators of carcinogenic activity
16
Supervised SUBDUE - Methodology
• Create set of indicators of carcinogenic activity
• Create set of indicators of noncarcinogenic activity
• Calculate value of substructures discovered in carcinogenic and noncarcinogenic set
• Select a set of substructures to be used in classifying compounds in experimental set
17
Supervised SUBDUE - Methodology
• Check for the existence of these substructures in the compound to be classified
• Calculate the Carcinogenic Activity Value of the compound
• Calculate the NonCarcinogenic Activity Value of the compound
• Determine the activity of the compound
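The classification steps above could look roughly like the following. The slides do not say how the two activity values are computed, so this sketch assumes each value is the sum of the values of the matching substructures, and that ties go to the negative class. The `classify` function, the numeric values and the example compounds are all hypothetical; only the group names come from the slides.

```python
def classify(found, carcinogenic_subs, noncarcinogenic_subs):
    """Return (label, carcinogenic_value, noncarcinogenic_value).

    found: set of substructure names present in the compound.
    *_subs: dict mapping substructure name -> value assigned during training.
    Assumption: an activity value is the sum of the values of matching
    substructures; the compound is labeled "+" only if the carcinogenic
    value is strictly larger.
    """
    car = sum(v for name, v in carcinogenic_subs.items() if name in found)
    non = sum(v for name, v in noncarcinogenic_subs.items() if name in found)
    label = "+" if car > non else "-"
    return label, car, non


# Hypothetical substructure values.
carcinogenic = {"amino": 3, "ether": 2, "halide10": 1}
noncarcinogenic = {"methoxy": 2, "nitro": 2}

print(classify({"amino", "nitro"}, carcinogenic, noncarcinogenic))
# ('+', 3, 2)
print(classify({"methoxy", "nitro", "halide10"}, carcinogenic, noncarcinogenic))
# ('-', 1, 4)
```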
18
Supervised SUBDUE - Results
• A set of 12 substructures discovered by SUBDUE is used to classify compounds in the experimental set
• The 6 substructures from the carcinogenic set include substructures that form part of groups like amino, di10, methyl, ether and halide10, and a substructure indicating that the compound tests positive on Ames, Salmonella, etc.
• The 6 substructures from the noncarcinogenic set include substructures that form part of groups like methoxy, Ar_Halide, di64, nitro and alkyl_halide, and a substructure indicating that the compound tests negative on Ames, Salmonella, etc.
19
Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: substructure. Labels shown: Compound, Ames, Salmonella, Salmonella_n, positive]
20
Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: substructure. Labels shown: Atom, Cl, 93, -0.024, C, 10, -0.123, Halide10; edge labels: n, t, p, gr]
n - Name, t - Type, p - Partial charge, gr - group
21
Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: substructure. Labels shown: Compound, Ames, Salmonella, Cytogen_ca, negative]
22
Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: substructure. Labels shown: Atom, Cl, 93, 0.477, Atom, C, 10, -0.124, A-H; edge labels: n, t, p, gr]
n - Name, t - Type, p - Partial charge, gr - group, A-H - Alkyl Halide
23
Supervised SUBDUE - Results
• PTE-1 Results:

Compounds               +     -   Total
PTE-1                  20    19      39
Correct Prediction     12     6      18
Incorrect Prediction    8    13      21

• Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)
24
Supervised SUBDUE - Results
• PTE-2 Results:

Compounds               +     -   Total
PTE-2                   7     6      13 *
Correct Prediction      4     3       7
Incorrect Prediction    3     3       6

* : number of compounds whose activity is known
• Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
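The accuracy figures for PTE-1 and PTE-2 follow directly from the counts of correct predictions per class. The short check below recomputes them; the `results` layout and function names are illustrative choices, and the counts are the ones reported in the two results tables.

```python
def accuracy(correct, total):
    """Fraction of correct predictions, rounded to three decimals."""
    return round(correct / total, 3)


# (correct predictions, number of compounds) per class, from the results tables.
results = {
    "PTE-1": {"+": (12, 20), "-": (6, 19)},
    "PTE-2": {"+": (4, 7), "-": (3, 6)},
}


def summarize(per_class):
    """Per-class and overall accuracy for one evaluation set."""
    summary = {label: accuracy(c, n) for label, (c, n) in per_class.items()}
    total_correct = sum(c for c, _ in per_class.values())
    total_count = sum(n for _, n in per_class.values())
    summary["total"] = accuracy(total_correct, total_count)
    return summary


print(summarize(results["PTE-1"]))  # {'+': 0.6, '-': 0.316, 'total': 0.462}
print(summarize(results["PTE-2"]))  # {'+': 0.571, '-': 0.5, 'total': 0.538}
```

These values agree with the slide figures to within rounding in the third decimal place.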
25
Results - Discussion
• Unsupervised SUBDUE successful in discovering lead indicators of carcinogenic activity
• Supervised SUBDUE also successful in discovering lead indicators of carcinogenic activity
• ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62)
• Ashby, TOPKAT are other toxicity prediction methods
26
Conclusions
• Results are consistent with those obtained by logic-based systems like PROGOL
• The Concept Learner is preferred when positive and negative examples of the target concept are available
• SUBDUE is capable of discovering lead indicators of carcinogenic/noncarcinogenic activity in the chemical toxicity domain
27
Future Work
• PTE-3 Evaluation Challenge
• Trimmed Data Sets (Partial Charge)
• Newer Version of Concept Learning SUBDUE being developed
28
Reference http://cygnus.uta.edu/subdue