Classification in Complex Systems
Why we should look at the paper:
“CAEP: Classification by Aggregating Emerging Patterns”, G. Dong, X. Zhang, L. Wong, and J. Li
What are Common Problems in Classification?
- Many variables
- Graphs that relate tuples
  - Protein-protein interactions (KDD-cup 02)
  - Citations (KDD-cup 03)
- Anything that violates standard table format
Many Variables
- Solution:
  - Naïve Bayes way of multiplying probabilities
  - Other additive models
- Problems:
  - Many factors
  - May be correlated
  - Noise
- … but it gets worse
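To make the "multiplying probabilities" point concrete, here is a minimal sketch of a categorical Naïve Bayes classifier. The function names, the toy rows, and the use of Laplace smoothing are illustrative assumptions, not something taken from the slides or the paper. Logs are summed instead of multiplying raw probabilities, which is the same model written additively.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-column value frequencies per class."""
    class_counts = Counter(labels)
    # value_counts[c][i][v] = number of rows of class c with value v in column i
    value_counts = defaultdict(lambda: defaultdict(Counter))
    values_per_column = defaultdict(set)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
            values_per_column[i].add(v)
    return class_counts, value_counts, values_per_column


def log_posterior(row, c, model):
    """log P(c) + sum_i log P(row[i] | c), up to a constant shared by all classes."""
    class_counts, value_counts, values_per_column = model
    total = sum(class_counts.values())
    score = math.log(class_counts[c] / total)
    for i, v in enumerate(row):
        # Laplace smoothing (an assumption of this sketch) avoids log(0)
        numerator = value_counts[c][i][v] + 1
        denominator = class_counts[c] + len(values_per_column[i])
        score += math.log(numerator / denominator)
    return score


def classify(row, model):
    """Return the class with the highest (log) posterior."""
    return max(model[0], key=lambda c: log_posterior(row, c, model))


# Hypothetical toy data: two categorical attributes, classes P and N
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["N", "N", "P", "P"]
model = train_naive_bayes(rows, labels)
prediction_a = classify(("sunny", "hot"), model)
prediction_b = classify(("rainy", "cool"), model)
```

Every attribute contributes one factor, which is exactly why many correlated or noisy attributes become a problem: each one is counted as if it were independent evidence.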
Graphs
- 2 kinds of attributes
  - Attributes within nodes
  - Attributes of neighbor and more distant nodes
- How do neighbor attributes count?
  - Take disjunction? “At least one neighbor that has a particular property”
  - Probably preferable: use links or, more generally, paths as basis
  - Integration into classification???
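The disjunction idea ("at least one neighbor has the property") and its extension to more distant nodes can be sketched as follows. The adjacency-dict representation, the function names, and the toy graph are assumptions made for illustration only.

```python
from collections import deque


def any_neighbor_has(graph, attrs, node, prop):
    """Disjunction over direct neighbors: True if at least one has prop."""
    return any(prop in attrs[nb] for nb in graph[node])


def nodes_within(graph, start, max_hops):
    """Breadth-first search: all nodes reachable in at most max_hops links."""
    seen = {start}
    reached = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                reached.add(nb)
                frontier.append((nb, dist + 1))
    return reached


def any_node_within_has(graph, attrs, node, prop, max_hops):
    """Disjunction over all nodes within max_hops links of node."""
    return any(prop in attrs[n] for n in nodes_within(graph, node, max_hops))


# Hypothetical chain a - b - c - d; only c carries the property
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
attrs = {"a": set(), "b": set(), "c": {"kinase"}, "d": set()}
direct = any_neighbor_has(graph, attrs, "a", "kinase")
two_hops = any_node_within_has(graph, attrs, "a", "kinase", 2)
```

The disjunction throws away how many neighbors have the property and along which path they were reached, which is one reason a link- or path-based representation may be preferable.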
Idea
- Get away from strict set of n attributes
- If an attribute or combination of attributes is “interesting”, use it
- Combining rules?
  - I would have guessed as in Naïve Bayes
  - CAEP adds probabilities!?
What is “interesting”?
- CAEP paper claims “growth rate”
  - Support of a rule increases significantly from one class label to another
  - Note: Only increase, not decrease!
- What does that mean?
  - For pattern $e$ and classes $P$ and $N$:
    $\mathrm{growth\_rate}_{P \to N}(e) = \mathrm{supp}_N(e) / \mathrm{supp}_P(e)$
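A small sketch of the growth-rate definition, with each class given as a list of item sets. The function names and the toy data are assumptions; the handling of zero support (0 when both supports are 0, infinity when only the support in P is 0) follows the usual convention for emerging patterns and is stated here as an assumption rather than quoted from the slides.

```python
import math


def support(pattern, transactions):
    """Fraction of transactions that contain every item of pattern."""
    return sum(pattern <= t for t in transactions) / len(transactions)


def growth_rate(pattern, class_p, class_n):
    """Growth rate of pattern from class P to class N: supp_N / supp_P."""
    supp_p = support(pattern, class_p)
    supp_n = support(pattern, class_n)
    if supp_p == 0:
        # Convention assumed here: 0 if the pattern occurs nowhere, else infinity
        return 0.0 if supp_n == 0 else math.inf
    return supp_n / supp_p


# Hypothetical toy data
class_p = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
class_n = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b"}]

gr_ab = growth_rate({"a", "b"}, class_p, class_n)  # 0.5 / 0.25
gr_c = growth_rate({"c"}, class_p, class_n)        # 0.25 / 0.5
```

In this toy data {a, b} grows from P to N, while {c} shrinks. Under the "only increase" rule, {c} is not an emerging pattern for N, although it would be one in the opposite direction (from N to P).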
2 Things Worth Investigating
- Is “interestingness” measure related to information gain?
  - Under certain assumptions: Yes
- Can the “score” be justified?
  - Sum of P(C)!?
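To see what "Sum of P(C)" refers to, here is a sketch of the aggregate score as I read it from the CAEP paper: each emerging pattern contained in the instance contributes growth_rate / (growth_rate + 1), weighted by its support in the target class. The exact form should be checked against the paper; the function name, the tuple layout, and the numbers are assumptions of this sketch.

```python
import math


def caep_score(instance, emerging_patterns):
    """Sum the contributions of all emerging patterns contained in instance.

    emerging_patterns: iterable of (pattern, growth_rate, support_in_class).
    """
    score = 0.0
    for pattern, growth, supp in emerging_patterns:
        if pattern <= instance:
            # growth / (growth + 1) tends to 1 as growth tends to infinity
            weight = 1.0 if math.isinf(growth) else growth / (growth + 1.0)
            score += weight * supp
    return score


# Hypothetical emerging patterns of class N: (pattern, growth rate, supp_N)
eps_n = [
    (frozenset({"a", "b"}), 2.0, 0.5),
    (frozenset({"d"}), 3.0, 0.25),
]
score_n = caep_score({"a", "b", "c"}, eps_n)  # only {a, b} matches
```

Since growth / (growth + 1) equals supp_N / (supp_N + supp_P), the weight can be read as an estimate of P(N | e) when both classes are equally large. The score is therefore a support-weighted sum of such probabilities rather than a product, which is what the question mark on this slide is about.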
Other Issues
- Normalization
  - Emerging patterns only consider increase in support => different number of relevant patterns
- How to mine for EPs
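For the mining question, the simplest thing that could work is brute-force enumeration of small item sets. This is only an illustrative baseline under assumed names and thresholds. It is not the mining method used in the paper, and it does not scale beyond toy data.

```python
from itertools import combinations


def support(pattern, transactions):
    """Fraction of transactions that contain every item of pattern."""
    return sum(pattern <= t for t in transactions) / len(transactions)


def mine_eps_brute_force(from_class, to_class, min_growth, min_support, max_size=2):
    """Enumerate item sets up to max_size and keep those whose support
    grows by at least min_growth from from_class to to_class."""
    items = sorted(set().union(*from_class, *to_class))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            supp_to = support(pattern, to_class)
            if supp_to == 0 or supp_to < min_support:
                continue  # too rare in the target class to be useful
            supp_from = support(pattern, from_class)
            growth = float("inf") if supp_from == 0 else supp_to / supp_from
            if growth >= min_growth:
                found.append((pattern, growth, supp_to))
    return found


# Hypothetical toy data
class_p = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
class_n = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b"}]

eps_for_n = mine_eps_brute_force(class_p, class_n, min_growth=2.0, min_support=0.5)
eps_for_p = mine_eps_brute_force(class_n, class_p, min_growth=2.0, min_support=0.5)
```

With these thresholds the two classes end up with different sets of patterns, and in general with different numbers of them, which is the reason raw scores need some form of normalization before they are compared across classes.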
Conclusions
- Idea very valuable
  - Classification split into ARM (association rule mining) step and rule combination
- Justification of details?
  - Not great
  - Should be possible to do it right – with poorer accuracy ;-)