Text Categorization
Rong Jin
Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem
[Figure: documents pass through a categorization system and are assigned to categories such as Sports, Business, Education, and Science]
Yahoo Shopping Categories
Spam Filtering
- Two categories: spam or ham
- Automatically decide the category for each incoming email
Text Categorization in IR
Many search engine functions are based on text categorization (TC):
- Language identification (English vs. French, etc.)
- Detecting spam pages (spam vs. non-spam)
- Detecting sexually explicit content (sexually explicit vs. not)
- Sentiment detection: positive or negative review
- Vertical search: restrict search to a "vertical" such as "related to health" (relevant to vertical vs. not)
Text Categorization (TC)
Given:
- A fixed set of categories C = {c1, c2, ..., cJ}. The categories are human-defined for the needs of an application (e.g., spam vs. non-spam).
- A set of labeled documents (i.e., training data)
Predict:
- The categories for new documents (i.e., test documents)
Text Categorization (TC)
[Figure with two panels: Given, Prediction]
K Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
- Keep all training examples
- Find the k examples that are most similar to the new document (the "nearest neighbor" documents)
- Assign the category that is most common among these nearest neighbor documents (the neighbors vote for the category)
[Figure: examples with k=1 and k=4]
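The three steps above can be sketched as follows. This is a minimal illustration, not the method used in the lecture: the bag-of-words representation, the cosine similarity measure, the function names, and the toy documents are all assumptions made for the example. Ties in the vote are broken in favor of the category seen first among the neighbors.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(train, new_doc, k):
    """train is a list of (text, category) pairs, all of which are kept.

    Returns the most common category among the k most similar documents.
    """
    query = bag_of_words(new_doc)
    scored = [(cosine(query, bag_of_words(text)), category)
              for text, category in train]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(category for _, category in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set.
train = [
    ("the team won the match", "Sports"),
    ("the striker scored a goal in the match", "Sports"),
    ("stocks fell as the market closed", "Business"),
    ("the company reported strong quarterly profit", "Business"),
]

print(knn_classify(train, "a late goal decided the match", k=3))  # Sports
```

Every prediction scans the whole training set, so the cost grows linearly with the number of training documents.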
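K-Nearest Neighbor Classifier
Implementation issue
- Searching for the nearest neighbors can be time-consuming when the number of training documents is large
- Improve efficiency by using a text search engine: index the training documents together with their class labels, then submit the test document as a query
[Figure: the test document is sent to a search engine over an index of the training documents and their class labels; the retrieved documents D1 (C1), D113 (C2), and D1001 (C2) vote, giving C2]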
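The idea of using a search engine to find neighbors can be illustrated with a small in-memory inverted index. This is only a sketch under stated assumptions: a real search engine would use a proper ranking function, whereas here documents are ranked by the number of distinct query terms they contain. The function names and the toy emails are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(train):
    """Map each term to the ids of the training documents that contain it."""
    index = defaultdict(set)
    for doc_id, (text, _category) in enumerate(train):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def search(index, query_text, top_n):
    """Rank documents by how many distinct query terms they contain.

    Only documents sharing at least one term with the query are touched,
    so the full training set is never scanned.
    """
    overlap = Counter()
    for term in set(query_text.lower().split()):
        for doc_id in index.get(term, ()):
            overlap[doc_id] += 1
    return [doc_id for doc_id, _ in overlap.most_common(top_n)]


def knn_with_index(train, index, test_doc, k):
    """Vote over the class labels of the top-k retrieved training documents."""
    hits = search(index, test_doc, k)
    if not hits:
        return None  # no training document shares a term with the test document
    votes = Counter(train[doc_id][1] for doc_id in hits)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set.
train = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
index = build_index(train)

print(knn_with_index(train, index, "buy cheap pills", k=3))  # spam
```

The index is built once, and each query then visits only the posting sets of its own terms.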
K-Nearest Neighbor Classifier
Large K
- Small variance: prediction is less sensitive to the given set of training documents
- Large bias: prediction is less sensitive to the document content
[Figure: examples with k=1 and k=4]
K-Nearest Neighbor Classifier
Small K
- Large variance: prediction is sensitive to the given set of training documents
- Small bias: prediction is sensitive to the document content
[Figure: examples with k=1 and k=4]
K-Nearest Neighbor Classifier
Cross validation to determine K
- Split the labeled documents into a training set (80%) and a validation set (20%)
- For each K in a given range:
  - Predict the categories of the documents in the validation set using the documents in the training set
  - Compute the classification error (i.e., the percentage of documents in the validation set that are misclassified)
- Choose the K with the smallest classification error
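The selection procedure above might be sketched as follows. The similarity measure (Jaccard overlap of term sets), the function names, the fixed random seed, and the toy documents are assumptions for illustration. When several values of K reach the same error, the first one in the candidate list is chosen.

```python
import random
from collections import Counter


def similarity(a, b):
    """Jaccard overlap between the term sets of two documents (assumed measure)."""
    terms_a, terms_b = set(a.lower().split()), set(b.lower().split())
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def knn_predict(train, doc, k):
    """Majority category among the k training documents most similar to doc."""
    ranked = sorted(train, key=lambda ex: similarity(doc, ex[0]), reverse=True)
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


def choose_k(labeled, candidate_ks, train_fraction=0.8, seed=0):
    """Return (best_k, errors); errors maps each K to its validation error rate."""
    docs = list(labeled)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    train, validation = docs[:cut], docs[cut:]
    if not train or not validation:
        raise ValueError("need enough labeled documents for both sets")
    errors = {}
    for k in candidate_ks:
        wrong = sum(1 for text, category in validation
                    if knn_predict(train, text, k) != category)
        errors[k] = wrong / len(validation)
    best_k = min(candidate_ks, key=lambda k: errors[k])
    return best_k, errors


# Hypothetical toy labeled set: 10 documents, so 8 for training and 2 for validation.
labeled = [
    ("the team won the match", "Sports"),
    ("the striker scored a goal", "Sports"),
    ("the coach praised the team", "Sports"),
    ("a late goal won the cup", "Sports"),
    ("the match ended in a draw", "Sports"),
    ("stocks fell as the market closed", "Business"),
    ("the company reported strong profit", "Business"),
    ("the market rallied on profit news", "Business"),
    ("investors sold stocks in the company", "Business"),
    ("the bank raised interest rates", "Business"),
]

best_k, errors = choose_k(labeled, candidate_ks=[1, 3, 5])
print(best_k, errors)
```

With so few validation documents the error estimates are very coarse; the example only shows the mechanics of the split, the loop over K, and the final choice.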
Cross Validation for K
[Figure: the labeled documents are split into a training set (80%) and a validation set (20%); the training set is used to predict the validation set]
- K=1, error = 10
- K=2, error = 5
- K=3, error = 2
- Choose K = 3