1
Text Categorization
Rong Jin
2
Text Categorization
Pre-given categories and labeled document examples (categories may form a hierarchy).
Classify new documents.
A standard supervised learning problem.
(Diagram: documents flow through a categorization system into categories such as Sports, Business, Education, and Science.)
3
Yahoo Shopping Categories
4
Spam Filtering
Two categories: spam or ham.
Automatically decide the category for each incoming email.
5
Text Categorization in IR
Many search engine functions are based on TC:
Language identification (English vs. French, etc.)
Detecting spam pages (spam vs. non-spam)
Detecting sexually explicit content (sexually explicit vs. not)
Sentiment detection: positive or negative review
Vertical search – restricting search to a “vertical” like “related to health” (relevant to the vertical vs. not)
7
Text Categorization (TC)
Given:
A fixed set of categories C = {c1, c2, ..., cJ}. The categories are human-defined for the needs of an application (e.g., spam vs. non-spam).
A set of labeled documents (i.e., training data).
Predict the categories for new documents (i.e., test documents).
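As a concrete illustration, here is a minimal sketch of this setup in Python; the category names and toy documents are hypothetical, not from the slides:

```python
# Hypothetical TC setup: a fixed set of categories C = {c1, ..., cJ}
# and labeled training documents.
categories = ["spam", "ham"]

# Training data: (document text, category) pairs.
training_docs = [
    ("cheap pills buy now", "spam"),
    ("limited offer click here", "spam"),
    ("meeting rescheduled to noon", "ham"),
    ("draft of the quarterly report", "ham"),
]

# Test documents: texts whose categories must be predicted.
test_docs = ["free offer for cheap pills", "notes from the noon meeting"]
```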
8
Text Categorization (TC)
(Diagram: labeled training documents are given; the system predicts categories for new documents.)
9
K-Nearest Neighbor Classifier
10
K-Nearest Neighbor Classifier
Keep all training examples.
Find the k examples that are most similar to the new document (the “nearest neighbor” documents).
Assign the category that is most common among these nearest neighbors (the neighbors vote for the category).
(Diagram: example classifications with k=1 and k=4.)
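A minimal sketch of this voting procedure in Python, assuming documents are represented as bag-of-words count vectors compared with cosine similarity (the slides do not fix a representation, so this choice is an assumption):

```python
from collections import Counter
import math

def bag_of_words(text):
    """Represent a document as a word-count vector (an assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(test_text, training_docs, k):
    """Vote among the k training documents most similar to the test document."""
    test_vec = bag_of_words(test_text)
    # Rank all training docs by similarity to the test doc.
    ranked = sorted(training_docs,
                    key=lambda dc: cosine(test_vec, bag_of_words(dc[0])),
                    reverse=True)
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With the toy data above, knn_classify("free offer for cheap pills", training_docs, k=3) returns "spam": the two spam documents share terms with the query and outvote the third neighbor.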
11
K-Nearest Neighbor Classifier
Implementation issue: searching for the nearest neighbors can be time-consuming when the number of training documents is large.
Improve efficiency with a text search engine: index the training documents together with their class labels, issue the test document as a query, and classify by the labels of the top-ranked results.
(Diagram: the test document is run as a query against an index of the labeled training documents; the search engine returns, e.g., D1 (C1), D113 (C2), D1001 (C2), so the predicted class is C2.)
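A minimal sketch of this shortcut, using a toy inverted index in place of a real search engine; it reuses the hypothetical bag_of_words and cosine helpers from the previous sketch, and only documents sharing at least one term with the test document are scored:

```python
from collections import Counter, defaultdict

def build_index(training_docs):
    """Inverted index: term -> list of doc ids containing that term."""
    index = defaultdict(list)
    for doc_id, (text, _label) in enumerate(training_docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return index

def knn_via_index(test_text, training_docs, index, k):
    """Score only the candidate docs retrieved from the index, then vote."""
    test_vec = bag_of_words(test_text)
    # Candidates: training docs sharing at least one term with the query.
    candidates = {doc_id for term in test_vec for doc_id in index.get(term, [])}
    ranked = sorted(candidates,
                    key=lambda i: cosine(test_vec, bag_of_words(training_docs[i][0])),
                    reverse=True)
    votes = Counter(training_docs[i][1] for i in ranked[:k])
    return votes.most_common(1)[0][0] if votes else None
```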
12
K-Nearest Neighbor Classifier
Large K
Small variance: the prediction is less sensitive to the particular set of training documents.
Large bias: the prediction is less sensitive to the content of the test document.
(Diagram: the same k=1 vs. k=4 example.)
13
K-Nearest Neighbor Classifier
Small K
Large variance: the prediction is sensitive to the particular set of training documents.
Small bias: the prediction is sensitive to the content of the test document.
(Diagram: the same k=1 vs. k=4 example.)
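A toy demonstration of this trade-off on hypothetical one-dimensional data: with k equal to the whole training set the classifier always returns the majority class, whatever the input (large bias), while with k = 1 the output is determined by a single training point (large variance):

```python
from collections import Counter

def knn_1d(x, train, k):
    """1-D k-NN: vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pl: abs(pl[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (10.0, "B")]

# k = len(train): every query gets the majority label "A" (high bias).
print(knn_1d(10.5, train, k=4))   # -> "A", even right next to the "B" point

# k = 1: the prediction tracks the single nearest point (high variance);
# moving or relabeling one training point can flip the output.
print(knn_1d(10.5, train, k=1))   # -> "B"
```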
14
K-Nearest Neighbor Classifier
Cross validation to determine K:
Split the labeled documents into a training set (80%) and a validation set (20%).
For each K in a given range:
Predict the categories of the documents in the validation set using the documents in the training set.
Compute the classification error (i.e., the percentage of validation documents that are misclassified).
Choose the K with the smallest classification error.
15
Cross Validation for K
Example: K=1, error = 10; K=2, error = 5; K=3, error = 2. Choose K = 3.
(Diagram: the labeled documents are split into a training set (80%) and a validation set (20%); categories of the validation documents are predicted from the training set.)
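A minimal sketch of this selection loop in Python, reusing the hypothetical knn_classify helper from above; the 80/20 split comes from the slides, while the shuffling, the candidate range for K, and the helper names are assumptions:

```python
import random

def choose_k(labeled_docs, k_range, seed=0):
    """Pick the K with the lowest error on a held-out 20% validation split."""
    docs = labeled_docs[:]
    random.Random(seed).shuffle(docs)
    split = int(0.8 * len(docs))                 # 80% train / 20% validation
    train, validation = docs[:split], docs[split:]

    best_k, best_error = None, float("inf")
    for k in k_range:
        errors = sum(1 for text, label in validation
                     if knn_classify(text, train, k) != label)
        error_rate = errors / len(validation)   # fraction misclassified
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k

# e.g., choose_k(training_docs, k_range=range(1, 6))
```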