Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds Mukund Deshpande, Michihiro Kuramochi, George Karypis University of Minnesota,

Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, George Karypis. University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis, MN 55455. Technical Report #03-016. Teacher: Dr. Ynag. Student: Gun-Ren Wang.

Outline 1. Introduction 2. Frequent Subgraph Based Classification Framework 3. Feature Generation 4. Feature Selection 5. Conclusion

Introduction Any new drug should not only produce the desired response to the disease, but should do so with minimal side effects. Evaluating a large set of candidate compounds using high-throughput screening (HTS) can be prohibitively expensive, and not all biological assays can be converted to a high-throughput format. This motivates studying which part of a chemical compound leads to the desirable behavior.

Frequent Subgraph Based Classification Framework

Feature Generation In our classification algorithm we find the frequently occurring subgraphs using the FSG algorithm. Topological sub-structures capture the connectivity of atoms in the chemical compound but they ignore the 3D shape of the sub-structures.

Adjacency-list representation
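
The figure for this slide did not survive in the transcript. As a minimal sketch of what an adjacency-list representation of a compound might look like, the Python below stores atoms as labeled vertices and bonds as labeled undirected edges. The class name `LabeledGraph`, the bond labels, and the formaldehyde example are illustrative assumptions, not taken from the report.

```python
class LabeledGraph:
    """Undirected graph with labeled vertices (atoms) and labeled edges (bonds)."""

    def __init__(self):
        self.vertex_labels = {}  # vertex id -> atom label, e.g. "C"
        self.adj = {}            # vertex id -> list of (neighbor id, bond label)

    def add_vertex(self, vid, label):
        self.vertex_labels[vid] = label
        self.adj[vid] = []

    def add_edge(self, u, v, bond):
        # An undirected edge is stored in the lists of both endpoints.
        self.adj[u].append((v, bond))
        self.adj[v].append((u, bond))

    def edges(self):
        """Yield each undirected edge once as (u, v, bond) with u < v."""
        for u, neighbors in self.adj.items():
            for v, bond in neighbors:
                if u < v:
                    yield (u, v, bond)


def build_formaldehyde():
    """Hypothetical example molecule: formaldehyde, H2C=O."""
    g = LabeledGraph()
    for vid, atom in enumerate(["C", "O", "H", "H"]):
        g.add_vertex(vid, atom)
    g.add_edge(0, 1, "double")
    g.add_edge(0, 2, "single")
    g.add_edge(0, 3, "single")
    return g
```

Each vertex keeps only its own neighbors, so the storage grows with the number of bonds rather than with the square of the number of atoms, which suits sparse molecular graphs.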

Canonical Labeling
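
The slide body is missing here. A canonical label is a code that is identical for any two isomorphic graphs, so that duplicate candidate subgraphs can be detected by comparing codes. The sketch below is a naive brute-force version, assumed for illustration only: it tries every vertex ordering, writes out the vertex labels followed by the upper triangle of the adjacency matrix, and keeps the lexicographically smallest code. The function name, the `"0"` marker for "no bond", and the choice of the minimum are assumptions rather than the report's exact scheme.

```python
from itertools import permutations


def canonical_label(vertex_labels, edges):
    """Return a code that is the same for all isomorphic labeled graphs.

    vertex_labels: list of atom labels, indexed by vertex id.
    edges: dict mapping (u, v) to a bond label for each undirected edge.
    """
    n = len(vertex_labels)

    def bond(u, v):
        # "0" marks the absence of a bond (an assumed convention).
        return edges.get((u, v)) or edges.get((v, u)) or "0"

    best = None
    for perm in permutations(range(n)):
        # Vertex labels in this ordering, then the upper-triangular entries.
        code = [vertex_labels[p] for p in perm]
        for i in range(n):
            for j in range(i + 1, n):
                code.append(bond(perm[i], perm[j]))
        code = tuple(code)
        if best is None or code < best:
            best = code
    return best
```

This enumerates all n! orderings, so it is only usable for very small graphs; it is meant to show the idea, not an efficient method.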

Candidate Joining

Candidate Generation(1)

Candidate Generation(2)

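Feature Selection For example, we have two ruleitems that have the same condset {(A, 1), (B, 1)} but different class labels. Assume the support count of the condset is 3 and |D| = 10. The ruleitem with the higher confidence gives the rule (A, 1), (B, 1) → (class, 1) [supt = 20%, confd = 66.7%], that is, a rule support count of 2 (2/10 = 20%, 2/3 = 66.7%). From the two ruleitems we produce only this one PR (possible rule).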
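
The arithmetic in this example can be checked with a short sketch. The function and field names below are hypothetical, and the count of 1 for the second ruleitem is an assumption (the 3 cases matching the condset minus the 2 that belong to class 1, with only two classes present).

```python
def support_and_confidence(condsup_count, rulesup_count, num_cases):
    """Support = rule count / |D|; confidence = rule count / condset count."""
    support = rulesup_count / num_cases
    confidence = rulesup_count / condsup_count
    return support, confidence


def pick_possible_rule(ruleitems, num_cases):
    """Among ruleitems sharing one condset, keep the most confident one.

    Ties are broken by list order here, which is an assumed simplification.
    """
    best = max(ruleitems, key=lambda r: r["rulesup_count"] / r["condsup_count"])
    support, confidence = support_and_confidence(
        best["condsup_count"], best["rulesup_count"], num_cases
    )
    return {
        "condset": best["condset"],
        "label": best["label"],
        "support": support,
        "confidence": confidence,
    }


condset = frozenset({("A", 1), ("B", 1)})
ruleitems = [
    {"condset": condset, "label": 1, "condsup_count": 3, "rulesup_count": 2},
    # Assumed second ruleitem: the remaining 1 of the 3 matching cases.
    {"condset": condset, "label": 2, "condsup_count": 3, "rulesup_count": 1},
]
possible_rule = pick_possible_rule(ruleitems, num_cases=10)
```

With these inputs the selected rule predicts class 1 with a support of 2/10 and a confidence of 2/3, matching the figures on the slide.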

The CBA-RG algorithm

Building a Classifier Definition: Given two rules, ri and rj, ri ≻ rj (also called ri precedes rj, or ri has a higher precedence than rj) if 1. the confidence of ri is greater than that of rj, or 2. their confidences are the same, but the support of ri is greater than that of rj, or 3. both the confidences and supports of ri and rj are the same, but ri is generated earlier than rj.
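
The three-part precedence definition maps directly onto a sort key. The sketch below assumes a hypothetical `Rule` record whose `order` field records when the rule was generated (smaller means earlier); none of these names come from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    condset: frozenset
    label: str
    support: float
    confidence: float
    order: int  # generation order; smaller means generated earlier


def precedence_key(rule):
    # Higher confidence first, then higher support, then earlier generation.
    return (-rule.confidence, -rule.support, rule.order)


def sort_by_precedence(rules):
    """Return the rules from highest to lowest precedence."""
    return sorted(rules, key=precedence_key)


example_rules = [
    Rule(frozenset({"x"}), "pos", support=0.2, confidence=0.8, order=0),
    Rule(frozenset({"y"}), "neg", support=0.3, confidence=0.9, order=1),
    Rule(frozenset({"z"}), "pos", support=0.3, confidence=0.8, order=2),
    Rule(frozenset({"w"}), "neg", support=0.3, confidence=0.8, order=3),
]
ranked = sort_by_precedence(example_rules)
```

Negating confidence and support lets a single ascending sort express all three conditions, with generation order as the final tie-breaker.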

A naïve algorithm for CBA-CB: M1
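
The pseudocode for this slide is missing from the transcript. The sketch below follows the usual description of a naive database-coverage classifier builder: visit rules in precedence order, keep a rule if it correctly classifies at least one remaining case, remove the cases it covers, record a default class and the total error count, and finally cut the list at the first rule with the lowest total error. The data layout (dicts with `cond`, `label`, `conf`, `sup`, `order`), the fallback default when no cases remain, and the function names are assumptions.

```python
from collections import Counter


def majority_label(cases):
    """Most common class label among (items, label) cases (non-empty list)."""
    return Counter(label for _, label in cases).most_common(1)[0][0]


def build_classifier_m1(rules, cases):
    """Return (selected rules, default class) from rules and training cases."""
    ordered = sorted(rules, key=lambda r: (-r["conf"], -r["sup"], r["order"]))
    remaining = list(cases)
    fallback = majority_label(cases)  # assumed default once all cases are covered
    snapshots = []                    # (rule, default class, total errors)
    rule_errors = 0
    for rule in ordered:
        if not remaining:
            break
        covered = [c for c in remaining if rule["cond"] <= c[0]]
        # Keep the rule only if it classifies some covered case correctly.
        if not any(label == rule["label"] for _, label in covered):
            continue
        rule_errors += sum(1 for _, label in covered if label != rule["label"])
        remaining = [c for c in remaining if not rule["cond"] <= c[0]]
        default = majority_label(remaining) if remaining else fallback
        default_errors = sum(1 for _, label in remaining if label != default)
        snapshots.append((rule, default, rule_errors + default_errors))
    if not snapshots:
        return [], fallback
    # Cut at the first rule that reaches the lowest total error.
    lowest = min(total for _, _, total in snapshots)
    cut = next(i for i, s in enumerate(snapshots) if s[2] == lowest)
    return [s[0] for s in snapshots[: cut + 1]], snapshots[cut][1]


def classify(classifier, items):
    """Apply the first matching rule, or the default class if none matches."""
    rules, default = classifier
    for rule in rules:
        if rule["cond"] <= items:
            return rule["label"]
    return default
```

In this setting a case's `items` would be the set of frequent sub-structures present in a compound, and a rule's `cond` the sub-structures it requires.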

Experimental Methodology & Metrics Table 1: The characteristics of the various datasets. N is the number of compounds in the database. N̄_A and N̄_B are the average numbers of atoms and bonds in each compound. L̄_A and L̄_B are the average numbers of atom types and bond types in each dataset. max N_A/min N_A and max N_B/min N_B are the maximum/minimum numbers of atoms and bonds over all the compounds in each dataset.

Varying Minimum Support

Conclusion In this paper we presented a highly effective algorithm for classifying chemical compounds, based on frequent substructure discovery, that can scale to large datasets.
