Text Mining Application Programming Chapter 9 Text Categorization Manu Konchady, 2006
Definition: A taxonomy is a classification of organisms into groups based on similarities in structure or origin.
Assignment of documents to categories
Categorization Problem: The problem of categorization can be described as the classification of documents into multiple categories. The n categories are predefined with specific keywords that differentiate each category from the others. The process of identifying these keywords is called feature extraction.
Documents are assigned to one or more categories based on the degree of similarity with a category description. A classifier uses a similarity measure to evaluate documents against categories to find the closest category.
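As an illustration of the classifier described above, the following is a minimal sketch that scores a document against each category description and returns the closest one. It assumes cosine similarity over raw word counts as the similarity measure, which the slides do not specify; the function names and the two category descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def closest_category(document, category_descriptions):
    """Return (name, score) of the description most similar to the document."""
    doc_vector = Counter(tokenize(document))
    scores = {
        name: cosine_similarity(doc_vector, Counter(tokenize(description)))
        for name, description in category_descriptions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical category descriptions, invented for illustration only.
CATEGORIES = {
    "sports": "game team score player season coach",
    "finance": "stock market price bank investor shares",
}

category, score = closest_category(
    "The bank raised its price target after the stock rallied", CATEGORIES
)
print(category, round(score, 3))
```

Here the document shares three keywords with the finance description and none with the sports description, so finance is selected. A production classifier would normally weight terms (for example with TF-IDF) rather than use raw counts.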
Several questions remain unanswered: How many categories are sufficient for the collection? What is the maximum size for a category? Are categories organized in a flat or hierarchical structure? Should documents be assigned to one or more categories?
In a dynamic collection, it is difficult to predict the contents of all documents that will be added to the collection. If we have too few categories or the description of a category is very general, then the size of a category can be excessive. When categories are too specific, retrieval is harder: without knowledge of the specific keywords, it takes more time to find the right category. For a large set of categories, it makes sense to organize categories in a hierarchy.
The decision to assign a document to a category is usually made based on a measure of similarity with other documents or a set of features of the category. When the similarity measure exceeds a threshold, a document is included in the category. The threshold is one of the control parameters used to create loosely or tightly focused categories.
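The effect of the threshold can be sketched as follows. The example assumes the similarity scores for one document have already been computed; the scores, category names, and the function `assign_categories` are made up for illustration.

```python
def assign_categories(similarities, threshold):
    """Return, sorted by name, every category whose score exceeds the threshold."""
    return sorted(
        name for name, score in similarities.items() if score > threshold
    )


# Hypothetical similarity scores for a single document (values are invented).
scores = {"sports": 0.12, "finance": 0.58, "politics": 0.41}

loose = assign_categories(scores, threshold=0.3)  # low threshold: looser categories
tight = assign_categories(scores, threshold=0.5)  # high threshold: tighter categories

print(loose)
print(tight)
```

With the lower threshold the document lands in two categories; with the higher threshold it lands in only one. A threshold above every score leaves the document unassigned, which is one signal that the category set may need adjusting.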
For a dynamic collection, it is difficult to predict beforehand the level of specificity at which a category becomes neither too large nor too small. Categories are therefore periodically adjusted to match the current state of the document collection.
Filter Email Spam: unsolicited mail, junk mail. The first methods to filter spam were simply lists of words that frequently occurred in spam: free, money, click, sex, and so on. Problem: ?
Filter spam using a list of rules: Is the email from someone@spam.com? Does the body of the message contain the word money? Does the subject text contain the word free?
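The three rules above could be coded along these lines. This is only a sketch: it assumes an email is represented as a dictionary with `sender`, `subject`, and `body` keys, and that a message is flagged when any single rule matches. Neither assumption comes from the slides.

```python
import re


def from_spam_sender(email):
    """Rule 1: is the email from someone@spam.com?"""
    return email["sender"].lower() == "someone@spam.com"


def body_mentions_money(email):
    """Rule 2: does the body contain the whole word 'money'?"""
    return re.search(r"\bmoney\b", email["body"], re.IGNORECASE) is not None


def subject_mentions_free(email):
    """Rule 3: does the subject contain the whole word 'free'?"""
    return re.search(r"\bfree\b", email["subject"], re.IGNORECASE) is not None


RULES = [from_spam_sender, body_mentions_money, subject_mentions_free]


def is_spam(email, rules=RULES):
    """Flag the email if any rule fires (an assumed combination policy)."""
    return any(rule(email) for rule in rules)


# Hypothetical messages for illustration.
offer = {"sender": "deals@example.com", "subject": "FREE tickets", "body": "Click here"}
lunch = {"sender": "friend@example.com", "subject": "Lunch", "body": "See you at noon"}

print(is_spam(offer), is_spam(lunch))
```

Every new spam pattern requires another function in `RULES`, and a legitimate message that happens to mention money is flagged as well.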
One of the problems with rule-based systems is that new rules must be devised to handle dynamic data.
Email classification process
Features of Spam: source domain of email; number of non-alphanumeric characters in email text; location of word features; number of email recipients.
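A minimal sketch of extracting these four features from a message follows. Several details are assumptions rather than anything stated in the slides: "location of word features" is interpreted as whether a word occurs in the subject or in the body, whitespace is not counted as a non-alphanumeric character, and the function name, argument layout, and default word list are hypothetical.

```python
import re


def extract_features(sender, recipients, subject, body, words=("free", "money")):
    """Build a feature dictionary for one email message."""
    features = {
        # Source domain: everything after the last '@' in the sender address.
        "source_domain": sender.rsplit("@", 1)[-1].lower(),
        # Punctuation and symbols in the body; whitespace is excluded (assumption).
        "non_alphanumeric_count": sum(
            1 for ch in body if not ch.isalnum() and not ch.isspace()
        ),
        "recipient_count": len(recipients),
    }
    subject_tokens = set(re.findall(r"[a-z]+", subject.lower()))
    body_tokens = set(re.findall(r"[a-z]+", body.lower()))
    # Location of word features: record where each word occurs (assumed reading).
    for word in words:
        features[f"{word}_in_subject"] = word in subject_tokens
        features[f"{word}_in_body"] = word in body_tokens
    return features


# Hypothetical message for illustration.
features = extract_features(
    sender="someone@spam.com",
    recipients=["a@example.com", "b@example.com"],
    subject="Free offer!!!",
    body="Make money now!!! $$$",
)
print(features)
```

A feature dictionary of this kind can be fed to a trainable classifier, so the decision no longer depends on a fixed, hand-written rule list.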
Requirements for a spam detector: A good classifier for spam should have the following characteristics. It should be customizable. It must adapt to changes in the environment. The process of training should be easy.