1
Text Categorization
Rong Jin
2
Text Categorization
Pre-given categories and labeled document examples (categories may form a hierarchy).
Classify new documents.
A standard supervised learning problem.
(Diagram: documents flow through a categorization system into categories such as Sports, Business, Education, and Science.)
3
Yahoo Shopping Categories
4
Spam Filtering
Two categories: spam or ham.
Automatically decide the category for each incoming email.
5
Text Categorization in IR
Many search engine functions are based on TC:
Language identification (English vs. French, etc.)
Detecting spam pages (spam vs. non-spam)
Detecting sexually explicit content (sexually explicit vs. not)
Sentiment detection: positive or negative review
Vertical search – restricting search to a “vertical” like “related to health” (relevant to the vertical vs. not)
7
Text Categorization (TC)
Given:
A fixed set of categories C = {c1, c2, ..., cJ}. The categories are human-defined for the needs of an application (e.g., spam vs. non-spam).
A set of labeled documents (i.e., training data).
Predict the categories for new documents (i.e., test documents).
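As a concrete illustration, here is a minimal sketch of this setup in Python; the category names and toy documents are hypothetical, not from the slides:

```python
# Hypothetical TC setup: a fixed set of categories C = {c1, ..., cJ}
# and labeled training documents.
categories = ["spam", "ham"]

# Training data: (document text, category) pairs.
training_docs = [
    ("cheap pills buy now", "spam"),
    ("limited offer click here", "spam"),
    ("meeting rescheduled to noon", "ham"),
    ("draft of the quarterly report", "ham"),
]

# Test documents: texts whose categories must be predicted.
test_docs = ["free offer for cheap pills", "notes from the noon meeting"]
```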
8
Text Categorization (TC)
(Diagram: labeled training documents are given; the system predicts categories for new documents.)
9
K-Nearest Neighbor Classifier
10
K-Nearest Neighbor Classifier
Keep all training examples.
Find the k examples that are most similar to the new document (the “nearest neighbor” documents).
Assign the category that is most common among these nearest neighbors (the neighbors vote for the category).
(Diagram: example classifications with k=1 and k=4.)
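A minimal sketch of this voting procedure in Python, assuming documents are represented as bag-of-words count vectors compared with cosine similarity (the slides do not fix a representation, so this choice is an assumption):

```python
from collections import Counter
import math

def bag_of_words(text):
    """Represent a document as a word-count vector (an assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(test_text, training_docs, k):
    """Vote among the k training documents most similar to the test document."""
    test_vec = bag_of_words(test_text)
    # Rank all training docs by similarity to the test doc.
    ranked = sorted(training_docs,
                    key=lambda dc: cosine(test_vec, bag_of_words(dc[0])),
                    reverse=True)
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With the toy data above, knn_classify("free offer for cheap pills", training_docs, k=3) returns "spam": the two spam documents share terms with the query and outvote the third neighbor.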
11
K-Nearest Neighbor Classifier
Implementation issue: searching for the nearest neighbors can be time-consuming when the number of training documents is large.
Improve efficiency with a text search engine: index the training documents together with their class labels, issue the test document as a query, and classify by the labels of the top-ranked results.
(Diagram: the test document is run as a query against an index of the labeled training documents; the search engine returns, e.g., D1 (C1), D113 (C2), D1001 (C2), so the predicted class is C2.)
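A minimal sketch of this shortcut, using a toy inverted index in place of a real search engine; it reuses the hypothetical bag_of_words and cosine helpers from the previous sketch, and only documents sharing at least one term with the test document are scored:

```python
from collections import Counter, defaultdict

def build_index(training_docs):
    """Inverted index: term -> list of doc ids containing that term."""
    index = defaultdict(list)
    for doc_id, (text, _label) in enumerate(training_docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return index

def knn_via_index(test_text, training_docs, index, k):
    """Score only the candidate docs retrieved from the index, then vote."""
    test_vec = bag_of_words(test_text)
    # Candidates: training docs sharing at least one term with the query.
    candidates = {doc_id for term in test_vec for doc_id in index.get(term, [])}
    ranked = sorted(candidates,
                    key=lambda i: cosine(test_vec, bag_of_words(training_docs[i][0])),
                    reverse=True)
    votes = Counter(training_docs[i][1] for i in ranked[:k])
    return votes.most_common(1)[0][0] if votes else None
```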
12
K-Nearest Neighbor Classifier
Large K
Small variance: the prediction is less sensitive to the particular set of training documents.
Large bias: the prediction is less sensitive to the content of the test document.
(Diagram: the same k=1 vs. k=4 example.)
13
K-Nearest Neighbor Classifier
Small K
Large variance: the prediction is sensitive to the particular set of training documents.
Small bias: the prediction is sensitive to the content of the test document.
(Diagram: the same k=1 vs. k=4 example.)
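A toy demonstration of this trade-off on hypothetical one-dimensional data: with k equal to the whole training set the classifier always returns the majority class, whatever the input (large bias), while with k = 1 the output is determined by a single training point (large variance):

```python
from collections import Counter

def knn_1d(x, train, k):
    """1-D k-NN: vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pl: abs(pl[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (10.0, "B")]

# k = len(train): every query gets the majority label "A" (high bias).
print(knn_1d(10.5, train, k=4))   # -> "A", even right next to the "B" point

# k = 1: the prediction tracks the single nearest point (high variance);
# moving or relabeling one training point can flip the output.
print(knn_1d(10.5, train, k=1))   # -> "B"
```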
14
K-Nearest Neighbor Classifier
Cross validation to determine K:
Split the labeled documents into a training set (80%) and a validation set (20%).
For each K in a given range:
Predict the categories of the documents in the validation set using the documents in the training set.
Compute the classification error (i.e., the percentage of validation documents that are misclassified).
Choose the K with the smallest classification error.
15
Cross Validation for K
Example: K=1, error = 10; K=2, error = 5; K=3, error = 2. Choose K = 3.
(Diagram: the labeled documents are split into a training set (80%) and a validation set (20%); categories of the validation documents are predicted from the training set.)
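A minimal sketch of this selection loop in Python, reusing the hypothetical knn_classify helper from above; the 80/20 split comes from the slides, while the shuffling, the candidate range for K, and the helper names are assumptions:

```python
import random

def choose_k(labeled_docs, k_range, seed=0):
    """Pick the K with the lowest error on a held-out 20% validation split."""
    docs = labeled_docs[:]
    random.Random(seed).shuffle(docs)
    split = int(0.8 * len(docs))                 # 80% train / 20% validation
    train, validation = docs[:split], docs[split:]

    best_k, best_error = None, float("inf")
    for k in k_range:
        errors = sum(1 for text, label in validation
                     if knn_classify(text, train, k) != label)
        error_rate = errors / len(validation)   # fraction misclassified
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k

# e.g., choose_k(training_docs, k_range=range(1, 6))
```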