On the futility of attempts to formalize clustering within conventional formal frameworks Lev Goldfarb ETS group Faculty of Computer Science UNB Fredericton, Canada
2 About the talk
To approach the foundations of clustering, one must rely on an adequate concept of class, which I claim is completely lacking. This most fundamental issue for our area (understood broadly to include machine learning) has been systematically neglected, which puts any progress in the area in question. So, what is a class? In particular: can the concept of class be adequately addressed within conventional math/CS formalisms? As the title of the talk suggests, the answer is “no”. (For a radically new representational formalism, ETS, not considered here, see the references in the abstract of this talk.)
3 What is a numeric representation?
Classical measurement is a systematic method for representing objects by numbers. (Classes of objects do not enter into consideration at all.) Restricting ourselves to natural numbers:
[Diagram: physical objects are mapped by the object representation map (via a fixed measurable property) to natural numbers (Peano representation).]
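The numeric representation above can be sketched in a few lines. This is only an illustration under my own assumptions: the names `Nat`, `succ`, `measure` and `to_int` are hypothetical, and "measurement" is reduced to counting how many unit steps an object spans. Note that no notion of class appears anywhere in the construction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Nat:
    """A Peano-style natural number: zero, or the successor of another Nat."""
    pred: Optional["Nat"] = None  # None marks zero


ZERO = Nat()


def succ(n: Nat) -> Nat:
    """The single basic operation: take the successor."""
    return Nat(pred=n)


def measure(unit_count: int) -> Nat:
    """Hypothetical object representation map: an object that spans
    `unit_count` units of a fixed measurable property is represented
    by applying the successor operation that many times to zero."""
    n = ZERO
    for _ in range(unit_count):
        n = succ(n)
    return n


def to_int(n: Nat) -> int:
    """Read a Peano numeral back as a built-in int (for display only)."""
    count = 0
    while n.pred is not None:
        count += 1
        n = n.pred
    return count


rod = measure(3)
print(to_int(rod))  # 3
```

The object is represented entirely by how it was built from the one basic operation, which is the sense in which the basic operation is taken seriously here.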
4 What is a representational formalism?
[Diagram: physical objects and their classes are mapped into the representational formalism by an object representation map and a class representation map; the two mappings are coupled.]
I. The two representation mappings should not be decoupled.
II. We should postulate that all classes have “inductive generative structure”.
5 What is a representational formalism?
I. Formal implications of the tight link between the two mappings (for objects and classes):
a) for the interpretation of basic operations in the chosen formalism: we should treat them as object operations and hence take them seriously, in contrast to present practice in applied mathematics;
b) for the general structure of class representation: a class representation must be expressed via the basic operations. In modern mathematics, this is a standard structural requirement (ignored in ML)!
Lev Goldfarb, NIPS 2005, Clustering
6 What is a representational formalism?
II. Refinement of the general structure of class representation [I b)]. Relying on our understanding of the structure of classes in nature, the above “inductive generative structure” of classes should mean that a class representation must be
a) of generative form: it must incorporate the mechanism by means of which the members of the class are constructed via the basic operations (also a standard structural requirement in mathematics), and
b) inductive: it must be effectively and reliably learnable from a very small training set.
7 Inadequacy of formal grammars
Grammars do not offer an “inductive” class representation [II b)]. The main reason: a string over a finite alphabet does not carry within itself enough representational information to link it “effectively and reliably” with the corresponding grammar, i.e. to identify the class to which it belongs (see also the next slide). Thus, the overall deficiency of formal grammars is twofold:
- poor object representation
- class representation is “disconnected” from object representation, e.g. nonterminals are not derivable from the object representation.
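A small sketch of the point that a string alone does not identify its grammar. The two regular expressions below are my own hypothetical stand-ins for two regular grammars (the talk does not give specific grammars), and `accepts` is an illustrative helper. Both accept the string abaca, yet they describe different classes.

```python
import re

# Two hypothetical "class descriptions", written as regular expressions
# (stand-ins for regular grammars; not taken from the talk).
GRAMMAR_A = re.compile(r"(a[bc])*a")  # a's separated by a single b or c
GRAMMAR_B = re.compile(r"[abc]*")     # any string over {a, b, c}


def accepts(grammar: "re.Pattern[str]", s: str) -> bool:
    """True if the whole string s is generated by the given pattern."""
    return grammar.fullmatch(s) is not None


# The same string belongs to both classes ...
print(accepts(GRAMMAR_A, "abaca"), accepts(GRAMMAR_B, "abaca"))  # True True

# ... although the classes differ, as another string shows.
print(accepts(GRAMMAR_A, "abba"), accepts(GRAMMAR_B, "abba"))    # False True
```

Nothing in the five symbols of abaca points to one description rather than the other, and the grouping used by `GRAMMAR_A` is not recoverable from the string itself.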
8 Inadequacy of the string representation: there are better choices
[Figure (label: “contexts”): two of the possible formative histories for the string abaca.]
[Figure: an ETS representation (has nothing to do with a tree; captures the temporal sequence of insertions).]
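To make the idea of distinct formative histories concrete, here is a minimal sketch. It is not the ETS formalism and does not reproduce the histories drawn on the slide: the two insertion sequences and the helper `replay` are hypothetical, chosen only to show that different temporal sequences of insertions end in the same string.

```python
from typing import List, Tuple

# A history is a temporal sequence of insertions: (position, symbol).
History = List[Tuple[int, str]]


def replay(history: History) -> str:
    """Apply the insertions in order, starting from the empty string."""
    s = ""
    for position, symbol in history:
        s = s[:position] + symbol + s[position:]
    return s


# Hypothetical history 1: the string grows left to right.
HISTORY_LEFT_TO_RIGHT: History = [
    (0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "a"),
]

# Hypothetical history 2: first "aaa", then b and c are inserted inside.
HISTORY_INSIDE_OUT: History = [
    (0, "a"), (1, "a"), (2, "a"), (1, "b"), (3, "c"),
]

print(replay(HISTORY_LEFT_TO_RIGHT))  # abaca
print(replay(HISTORY_INSIDE_OUT))     # abaca
```

The final string is identical in both cases, so the string alone cannot tell us which history produced it; a representation that records the sequence of insertions keeps that information.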
9 The vector space as a representational formalism
Vector space representation: the basic operations are {+, ·}. From the above, the only candidate for a class/class representation is the affine subspace.
When modeling various phenomena in science, classes have not yet become the focus of attention, hence it is up to us to address these new scientific representational issues. However, the overwhelming practice amounts to “take the vectors and run”, i.e. do what you want with them.
10 Inadequacy of the vector space formalism
Obviously, it lacks a generative [II a)] class representation. Why? The absence of “sufficient” representational structure results in:
- operations {+, ·} being too “simple”
- linear generativity producing only very “regular” classes.
To compensate, a class description had to be brought in from outside the algebraic formalism proper (which again violates the standard “structural” wisdom of mathematics). The resulting class description:
- is structurally and representationally “alien” and “meaningless” (there is no tight link between an object and its class representation)
- includes non-class vectors that satisfy the class description.
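Both points can be illustrated with a toy 2D sketch. Everything here is an assumption made for illustration: the two "training" vectors, the half-space used as an externally imposed class description (the talk does not name a specific one), and the helper names. The first part shows that generating members with {+, ·} alone yields only points on one line; the second shows that a half-space description is satisfied by a vector unrelated to the training objects.

```python
from typing import Tuple

Vec = Tuple[float, float]


def affine_combination(p: Vec, q: Vec, t: float) -> Vec:
    """Generate a vector from p and q using only {+, ·}: (1 - t)·p + t·q."""
    return tuple((1 - t) * a + t * b for a, b in zip(p, q))


def on_affine_line(x: Vec, p: Vec, q: Vec, tol: float = 1e-9) -> bool:
    """In 2D, x lies on the line through p and q iff the cross product
    of (q - p) and (x - p) is zero."""
    d = (q[0] - p[0], q[1] - p[1])
    e = (x[0] - p[0], x[1] - p[1])
    return abs(d[0] * e[1] - d[1] * e[0]) <= tol


def satisfies_half_space(x: Vec, w: Vec, b: float) -> bool:
    """Hypothetical external class description: w·x + b >= 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b >= 0


# Two hypothetical training vectors of one class.
p, q = (1, 2), (2, 3)

# Linear generativity: every generated vector lies on the same line.
generated = affine_combination(p, q, 1000)
print(generated, on_affine_line(generated, p, q))  # (1001, 1002) True

# External description: satisfied by the training vectors, but also by
# a vector that was never generated from them.
w, b = (1, 1), -2
outlier = (1000000, -5)
print(satisfies_half_space(p, w, b), satisfies_half_space(q, w, b))  # True True
print(satisfies_half_space(outlier, w, b), on_affine_line(outlier, p, q))  # True False
```

The half-space test never consults how a vector was constructed, which is one way to read the remark that such a description has no tight link to the object representation.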
11 Inadequacy of the vector space formalism
Unfortunately, the prevailing trend in machine learning is to assume that clever distance measures or kernels should “solve the problem”. However, these have to be crafted manually, and, more importantly:
- they cannot rectify the inadequacy of a vector as an object representation
- again, they are being brought in from “outside” the representational (algebraic) formalism.
12 Inadequacy of the vector space formalism
Thus, ML practice reinforces the scientifically counterproductive view that classes are our creation rather than something existing in nature (because the class representation is not “related” to the object representation). On the other hand, once we develop a formalism in which the concept of class follows the “structural” mathematical wisdom, we will be able to offer the sciences a formal language of inestimable value, i.e. something that mathematics has traditionally provided.
13 Conclusion
No adequate class representation, hence no foundation for clustering.
The golden age of classification (and “clustering”) is still ahead of us, though its arrival depends on the development of the “right” representational formalism.