Published by David Økland. Modified over 5 years ago.
1
Text Mining Application Programming Chapter 9 Text Categorization
Manu Konchady, 2006
3
Definition: A taxonomy is a classification of organisms into groups based on similarities in structure or origin.
4
Assignment of documents to categories
5
Categorization Problem
The problem of categorization can be described as the classification of documents into multiple categories. The n categories are predefined with specific keywords that differentiate each category from the others. The process of identifying these keywords is called feature extraction.
6
Documents are assigned to one or more categories based on the degree of similarity with a category description. A classifier uses a similarity measure to evaluate documents against categories to find the closest category.
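The idea of scoring a document against category descriptions can be illustrated with a small sketch. It assumes cosine similarity over raw word counts as the similarity measure, which is only one possible choice; the slide does not name a specific measure. The category descriptions and the function names are invented for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_category(document, category_descriptions):
    """Return (name, score) for the best-matching category description."""
    doc_vec = term_vector(document)
    scores = {
        name: cosine_similarity(doc_vec, term_vector(description))
        for name, description in category_descriptions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical category descriptions, invented for this example.
CATEGORIES = {
    "sports": "football match team score goal league player",
    "finance": "stock market bank interest shares investor price",
}

name, score = closest_category(
    "The team scored a late goal in the league match", CATEGORIES
)
print(name, round(score, 3))
```

Raw counts keep the sketch short; a weighting scheme such as TF-IDF would be a natural substitute if common words start to dominate the scores.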
7
Several questions unanswered
How many categories are sufficient for the collection?
What is the maximum size for a category?
Are categories organized in a flat or hierarchical organization?
Should documents be assigned to one or more categories?
8
In a dynamic collection, it is difficult to predict the contents of all documents that will be added to the collection. If we have too few categories or the description of a category is very general, then the size of a category can be excessive. When categories are too specific, retrieval is harder: without knowledge of the specific keywords, it takes more time to find the right category. For a large set of categories, it makes sense to organize categories in a hierarchy.
9
The decision to assign a document to a category is usually based on a measure of similarity with other documents or with a set of features of the category. When the similarity measure exceeds a threshold, the document is included in the category. The threshold is one of the control parameters used to create loosely or tightly focused categories.
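A minimal sketch of threshold-based assignment follows. It assumes a very simple similarity measure, the fraction of a category's keywords found in the document, and hypothetical keyword sets; both are illustrative choices, not taken from the slides. Lowering the threshold gives looser categories, raising it gives tighter ones.

```python
import re


def keyword_overlap(document, keywords):
    """Fraction of a category's keywords that occur in the document."""
    if not keywords:
        return 0.0
    words = set(re.findall(r"[a-z]+", document.lower()))
    return len(words & keywords) / len(keywords)


def assign_categories(document, categories, threshold):
    """Return every category whose similarity exceeds the threshold."""
    return sorted(
        name
        for name, keywords in categories.items()
        if keyword_overlap(document, keywords) > threshold
    )


# Hypothetical keyword sets, invented for this example.
CATEGORIES = {
    "sports": {"team", "match", "goal", "league"},
    "finance": {"market", "shares", "bank", "price"},
}

doc = "The league sold shares in its top team as the market price rose"

# Similarities here are 0.5 for sports and 0.75 for finance.
loose = assign_categories(doc, CATEGORIES, threshold=0.4)
tight = assign_categories(doc, CATEGORIES, threshold=0.6)
print(loose, tight)
```

With the lower threshold the document lands in both categories; with the higher one it stays only in the category it matches most strongly.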
10
Striking a balance in the specificity of a category, so that it becomes neither too large nor too small, is difficult to do beforehand for a dynamic collection. Categories are therefore periodically adjusted to match the current state of the document collection.
11
Filter Spam
Unsolicited mail, junk mail
The first method to filter spam was simply a list of words that frequently occurred in spam: free, money, click, sex, and so on.
Problem: ?
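A word-list filter of this kind can be sketched in a few lines. The list below contains only the four words named on the slide; the function names are hypothetical.

```python
import re

# The example words named on the slide; a real list would be longer.
SPAM_WORDS = {"free", "money", "click", "sex"}


def spam_word_hits(message):
    """Return the listed spam words that occur in the message, sorted."""
    words = re.findall(r"[a-z]+", message.lower())
    return sorted(set(words) & SPAM_WORDS)


def is_spam(message):
    """Flag a message as spam if it contains any listed word."""
    return bool(spam_word_hits(message))


print(is_spam("Click here for FREE money!"))
print(is_spam("Are you free for lunch on Friday?"))
```

Both messages are flagged, including the ordinary lunch invitation, which shows one weakness such a list can have: it ignores the context in which a word appears.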
12
Filter spam using a list of rules
Is the from
Does the body of the message contain the word money?
Check subject text for the word free.
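The two complete rules on this slide can be expressed as a small rule list; the sender rule is cut off in the source, so it is left out here. The `Email` and `Rule` types and all function names are assumptions made for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Email:
    subject: str
    body: str


@dataclass
class Rule:
    name: str
    matches: Callable[[Email], bool]


def contains_word(text, word):
    """True if `word` occurs in `text` as a whole word, ignoring case."""
    return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None


# Each rule is a named predicate over a message.
RULES = [
    Rule("body contains 'money'", lambda m: contains_word(m.body, "money")),
    Rule("subject contains 'free'", lambda m: contains_word(m.subject, "free")),
]


def fired_rules(message, rules=RULES):
    """Return the names of all rules that match the message."""
    return [rule.name for rule in rules if rule.matches(message)]


msg = Email(subject="Free offer", body="Send money now")
print(fired_rules(msg))
```

Keeping the rules in a list makes it easy to add new ones, but each new spam pattern still needs a new hand-written rule.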
13
One of the problems with rule-based systems is that new rules must be devised to handle dynamic data.
14
Email classification process
15
Features of Spam
Source domain of email
Number of non-alphanumeric characters in text
Location of word features
Number of recipients
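The features listed above could be extracted roughly as follows. This is a sketch under several assumptions: whitespace is not counted as a non-alphanumeric character, "location of word features" is read as whether a given word appears in the subject or in the body, and the function and key names are invented for the example.

```python
import re


def extract_features(sender, recipients, subject, body, word="free"):
    """Build a feature dictionary for one email message."""
    # Source domain: the part of the sender address after the last '@'.
    domain = sender.rsplit("@", 1)[-1].lower() if "@" in sender else ""

    # Non-alphanumeric characters in the body, ignoring whitespace (assumption).
    non_alnum = sum(1 for ch in body if not ch.isalnum() and not ch.isspace())

    # Whole-word, case-insensitive match for the chosen word.
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)

    return {
        "source_domain": domain,
        "non_alphanumeric_count": non_alnum,
        "word_in_subject": pattern.search(subject) is not None,
        "word_in_body": pattern.search(body) is not None,
        "recipient_count": len(recipients),
    }


features = extract_features(
    sender="promo@deals.example",
    recipients=["a@x.example", "b@x.example"],
    subject="FREE prize!!!",
    body="Click now $$$",
)
print(features)
```

A dictionary like this can then be handed to whatever classifier is used downstream.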
16
Requirements for a spam detector
A good classifier for spam should have the following characteristics:
It should be customizable.
The classifier must adapt to changes in the environment.
The process of training should be easy.