FLAIRS '99
Applying the SUBDUE Substructure Discovery System to the Chemical Toxicity Domain
Ravindra N. Chittimoori, Diane J. Cook, Lawrence B. Holder
Department of Computer Science and Engineering, University of Texas at Arlington

Motivation and Goal
• Ever-increasing number of chemical compounds in use today (~100,000).
• Need to identify relationships between the molecular structure and the toxicity of a chemical compound.
• Apply knowledge discovery to U.S. National Toxicology Program (NTP) data to identify such relationships.

Knowledge Discovery in SUBDUE
• Structural discovery system
• Graph-based input representation
• Beam search through substructure (subgraph) space
• Graph compression heuristic based on minimum description length
• Inexact, polynomial graph match
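
The compression heuristic and beam search listed above can be illustrated with a minimal sketch. It does not reproduce SUBDUE's minimum description length encoding. As a loudly simplified assumption, it measures a graph by its vertex count plus edge count and scores a substructure by how much the graph shrinks when each instance is collapsed to a single vertex. The function names, the example graph sizes, and the candidate names are all hypothetical.

```python
def size(num_vertices, num_edges):
    """Crude graph size: vertices plus edges (an assumed stand-in for description length)."""
    return num_vertices + num_edges


def compression_value(graph, substructure, instances):
    """Score a substructure by how well it compresses the graph.

    graph, substructure: (num_vertices, num_edges) pairs.
    instances: number of non-overlapping occurrences of the substructure.
    Values above 1.0 mean the substructure plus the compressed graph
    is smaller than the original graph.
    """
    gv, ge = graph
    sv, se = substructure
    # Each instance collapses to one vertex; its internal edges disappear.
    compressed_vertices = gv - instances * (sv - 1)
    compressed_edges = ge - instances * se
    compressed = size(compressed_vertices, compressed_edges)
    return size(gv, ge) / (size(sv, se) + compressed)


def keep_beam(candidates, beam_width):
    """Keep the beam_width best (name, value) candidates, best first."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]


# Hypothetical numbers: a 12-vertex, 14-edge graph containing two
# instances of a 4-vertex, 3-edge substructure.
value = compression_value(graph=(12, 14), substructure=(4, 3), instances=2)
beam = keep_beam([("A", value), ("B", 0.9), ("C", 1.1)], beam_width=2)
```

In a beam search, only the candidates kept by `keep_beam` would be extended in the next round, which bounds the number of substructures examined at each step.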

SUBDUE Example
[Figure: three panels labeled Input Database, Substructure S1 (graph form), and Compressed Database. Labels recovered from the figure: T1, T2, T3, T4; S1, S2, S3, S4; R1; C1; object, triangle, square, on, shape.]

Chemical Toxicity Domain
• Database of 367 chemicals
• Levels of evidence assigned by NTP
  - CE: clear evidence of cancerous activity
  - SE: some evidence
  - E: equivocal evidence
  - NE: no evidence

Predictive Toxicology Evaluation
• Predictive Toxicology Evaluation (PTE) challenge
• PTE-2 ended November
• PTE-3 scheduled for July 2000

Chemical Toxicity Data
• Atoms (name, type, partial charge)
• Bonds (type)
• Chemical groups
  - Alcohol, amine, amino, benzene, ester, ether, ketone, methanol, methyl, nitro, phenol and sulfide

Chemical Toxicity Data
• Carcinogenicity-related tests
  - Ames
  - Chromex
  - Chromaberr
  - Drosophila
  - Mouse-Lymph
  - Salmonella Assay

Chemical Compound Representation

Input Representation
• Sample atomic structure: a C atom and an H atom joined by a bond of type 1
• SUBDUE graph input:
    v 1 atom
    v 2 C
    v 3 atom
    v 4 H
    d 1 2 name
    d 3 4 name
    u 1 3 1
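
To make the graph input on this slide concrete, here is a minimal parsing sketch. It assumes, based only on the sample shown, that a `v` line declares a vertex (id, label), a `d` line declares a directed edge (source, target, label), and a `u` line declares an undirected edge. The function name `parse_subdue_graph` and the returned data layout are hypothetical and are not part of SUBDUE.

```python
def parse_subdue_graph(text):
    """Parse 'v', 'd', and 'u' lines into a vertex dict and an edge list.

    Returns (vertices, edges) where vertices maps id -> label and each
    edge is (source, target, label, directed).
    """
    vertices = {}
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "v":
            vertices[int(parts[1])] = parts[2]
        elif kind in ("d", "u"):
            edges.append((int(parts[1]), int(parts[2]), parts[3], kind == "d"))
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return vertices, edges


# The sample from the slide: a C atom and an H atom joined by a bond of type 1.
SAMPLE = """\
v 1 atom
v 2 C
v 3 atom
v 4 H
d 1 2 name
d 3 4 name
u 1 3 1
"""

vertices, edges = parse_subdue_graph(SAMPLE)
```

Each atom is a vertex labeled `atom` whose element is attached through a directed `name` edge, and the bond is an undirected edge between the two `atom` vertices labeled with the bond type.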

Methodology
• Training set further divided into learning and testing sets
• Find the best substructures in the learning-set positives that are not prevalent in the negatives
• Find occurrences of each substructure in the testing set
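
The steps above can be sketched as follows. This is not SUBDUE's graph match. As a loud simplification, each compound is reduced to a set of (atom, bond type, atom) triples and a substructure "occurs" when its triples are a subset of the compound's. The scoring rule (positive rate minus negative rate), the function names, and the toy compounds are all hypothetical.

```python
def contains(compound, substructure):
    """Assumed containment test: subset of labeled bonds, not true subgraph matching."""
    return substructure <= compound["edges"]


def coverage(compounds, substructure):
    """Return ((pos_hits, pos_total), (neg_hits, neg_total))."""
    counts = {True: [0, 0], False: [0, 0]}
    for compound in compounds:
        bucket = counts[compound["positive"]]
        bucket[1] += 1
        if contains(compound, substructure):
            bucket[0] += 1
    return tuple(counts[True]), tuple(counts[False])


def discrimination(compounds, substructure):
    """Hypothetical score: fraction of positives covered minus fraction of negatives."""
    (ph, pn), (nh, nn) = coverage(compounds, substructure)
    pos_rate = ph / pn if pn else 0.0
    neg_rate = nh / nn if nn else 0.0
    return pos_rate - neg_rate


# Toy data, invented for illustration only.
learning = [
    {"positive": True, "edges": frozenset({("c", "1", "br"), ("c", "1", "h")})},
    {"positive": True, "edges": frozenset({("c", "1", "br")})},
    {"positive": False, "edges": frozenset({("c", "1", "h")})},
]
testing = [
    {"positive": True, "edges": frozenset({("c", "1", "br"), ("c", "2", "o")})},
    {"positive": False, "edges": frozenset({("c", "1", "h")})},
]
candidates = [frozenset({("c", "1", "br")}), frozenset({("c", "1", "h")})]

# Pick the substructure on the learning set, then count its occurrences in testing.
best = max(candidates, key=lambda s: discrimination(learning, s))
test_positive, test_negative = coverage(testing, best)
```

Selecting on the learning set and counting on the testing set keeps the reported occurrence counts separate from the data used to choose the substructure.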

Results
[Figure: discovered substructure. Labels recovered from the figure: atom, 10, c, n, tp, atom, br, n, tp.]
• Learning set: 268
  - Positive compounds: 134/143
  - Negative compounds: 24/125
• Testing set: 30
  - Positive compounds: 15/19
  - Negative compounds: 4/11

Results
[Figure: discovered substructure. Labels recovered from the figure: atom, 10, c, n, tp, atom, 1, h, n, tp, 0.34, atom, 32, n, n, tp, atom, h, n, tp.]
• Learning set: 268
  - Positive compounds: 60/143
  - Negative compounds: 0/125
• Testing set: 30
  - Positive compounds: 8/19
  - Negative compounds: 0/11
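
The fractions on the two Results slides are easier to compare as percentages. The short sketch below does that arithmetic, assuming each fraction counts the compounds of that class in which the substructure occurs out of all compounds of that class. The keys `first` and `second` are hypothetical labels for the substructures on the first and second Results slides.

```python
# (hits, total) pairs copied from the two Results slides.
RESULTS = {
    "first": {
        "learning": {"positive": (134, 143), "negative": (24, 125)},
        "testing": {"positive": (15, 19), "negative": (4, 11)},
    },
    "second": {
        "learning": {"positive": (60, 143), "negative": (0, 125)},
        "testing": {"positive": (8, 19), "negative": (0, 11)},
    },
}


def percent(hits, total):
    """Share of compounds in which the substructure occurs, in percent."""
    return round(100.0 * hits / total, 1)


rates = {
    name: {
        split: {cls: percent(*fraction) for cls, fraction in classes.items()}
        for split, classes in splits.items()
    }
    for name, splits in RESULTS.items()
}

for name, splits in rates.items():
    for split, classes in splits.items():
        print(name, split, classes)
```

Under that reading, the first substructure covers most positive compounds but also some negatives, while the second covers fewer positives and no negatives in either set.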

Discussion
• Consistent with results obtained by the ILP system PROGOL (Srinivasan et al., ILP-97).
• Groups discovered by SUBDUE (e.g., amino) are unique substructures found only in compounds that test positive for carcinogenicity.

Conclusion
• SUBDUE has the ability to discover interesting patterns (substructures) that might be helpful in predicting carcinogenicity.
• SUBDUE is suitable for knowledge discovery in the chemical toxicity domain.

Future Research
• Apply concept-learning SUBDUE to the chemical toxicity database
  - Find substructures that compress the positive graph but not the negative graph
• Incorporate more domain knowledge
• PTE-3 challenge (July 1999)