Text Categorization Rong Jin.

Text Categorization
- Categories and labeled example documents are given in advance (categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem
[Figure: documents flow through a categorization system into categories such as Sports, Business, Education, and Science]

Yahoo Shopping Categories

Spam Filtering
- Two categories: spam or ham
- Automatically decide the category for each incoming email

Text Categorization in IR
Many search engine functions are based on TC:
- Language identification (English vs. French, etc.)
- Detecting spam pages (spam vs. non-spam)
- Detecting sexually explicit content (explicit vs. not)
- Sentiment detection: positive or negative review
- Vertical search: restrict search to a "vertical" such as health (relevant to the vertical vs. not)

Text Categorization (TC)
Given:
- A fixed set of categories C = {c1, c2, ..., cJ}; the categories are human-defined for the needs of an application (e.g., spam vs. non-spam)
- A set of labeled documents (i.e., training data)
Predict the categories for new documents (i.e., test documents).

Text Categorization (TC)
[Figure: labeled training documents (Given) and test documents with predicted categories (Prediction)]

K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
- Keep all training examples
- Find the k examples that are most similar to the new document (its "nearest neighbor" documents)
- Assign the category that is most common among these nearest neighbors (neighbors vote for the category)
[Figure: the same test point classified with k=1 and with k=4]
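The voting scheme above can be sketched in a few lines of plain Python. The toy documents, the similarity choice (cosine over raw word counts), and the function names are illustrative assumptions, not from the slides:

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector: term -> count (lowercased whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(train, test_doc, k):
    """train: list of (text, label) pairs. Rank all training documents by
    similarity to test_doc and return the majority label of the top k."""
    test_vec = bow(test_doc)
    ranked = sorted(train, key=lambda d: cosine(bow(d[0]), test_vec), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    ("the team won the match", "sports"),
    ("the market fell on weak earnings", "business"),
    ("fans cheered as the team scored", "sports"),
    ("earnings growth lifted the market", "business"),
]
print(knn_classify(train, "earnings report moved the market", k=3))  # business
```

Note that all training examples are kept and scanned at prediction time, which is exactly the efficiency issue the next slide addresses.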

K-Nearest Neighbor Classifier: Implementation Issue
- Searching for the nearest neighbors can be time-consuming when the number of training documents is large
- Improve efficiency with a text search engine: index the training documents, issue the test document as a query, and let the top-ranked results vote
[Figure: a test doc is queried against an index of training docs with class labels; the top results D1 (C1), D113 (C2), D1001 (C2) vote for C2]
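A minimal version of this search-engine trick is an inverted index: only documents sharing at least one term with the query are scored, instead of scanning the whole collection. The doc ids and labels mirror the slide's figure; the toy texts and the shared-term scoring are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Labeled training documents (hypothetical toy data; ids/labels as in the figure).
train = {
    "D1": ("the team won the match", "C1"),
    "D113": ("stocks fell on weak earnings", "C2"),
    "D1001": ("the market rallied on strong earnings", "C2"),
}

# Build an inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, (text, _) in train.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search_and_vote(query, k):
    """Score only documents sharing a term with the query, rank them by the
    number of shared terms, and let the top k vote on the class."""
    scores = Counter()
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    top = [doc_id for doc_id, _ in scores.most_common(k)]
    votes = Counter(train[d][1] for d in top)
    return votes.most_common(1)[0][0]

print(search_and_vote("weak earnings moved the market", k=2))  # C2
```

A production system would rank with a real retrieval score (e.g., tf-idf) rather than raw term overlap, but the control flow is the same.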

K-Nearest Neighbor Classifier: Large k
- Small variance: the prediction is less sensitive to the particular set of training documents
- Large bias: the prediction is less sensitive to the document content

K-Nearest Neighbor Classifier: Small k
- Large variance: the prediction is sensitive to the particular set of training documents
- Small bias: the prediction is sensitive to the document content

K-Nearest Neighbor Classifier: Choosing k by Cross-Validation
- Split the labeled documents into a training set (80%) and a validation set (20%)
- For each k in a given range:
  - Predict the categories for the validation documents using the training documents
  - Compute the classification error (the percentage of validation documents that are misclassified)
- Choose the k with the smallest classification error
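The procedure above can be sketched end to end. The toy data and the word-overlap similarity are illustrative assumptions; one training document is deliberately mislabeled so that k=1 overfits to it (high variance) while a large k over-smooths (high bias), letting the validation error pick a middle value:

```python
from collections import Counter

def knn_predict(train, text, k):
    """Majority vote among the k training docs sharing the most words with text."""
    words = set(text.lower().split())
    ranked = sorted(train, key=lambda d: len(words & set(d[0].lower().split())),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

# Labeled documents, already split 80% / 20% (hypothetical toy data).
# The last training document is "noisy": sports wording, business label.
train_set = [
    ("the team won the match", "sports"),
    ("fans cheered as the team scored", "sports"),
    ("the striker scored a goal", "sports"),
    ("stocks fell on weak earnings", "business"),
    ("the market rallied on strong earnings", "business"),
    ("the team won the final match easily", "business"),  # noisy label
]
val_set = [
    ("the team won the final match", "sports"),
    ("shares fell on strong earnings", "business"),
]

# Validation error for each candidate k.
errors = {}
for k in (1, 3, 5):
    wrong = sum(1 for text, label in val_set
                if knn_predict(train_set, text, k) != label)
    errors[k] = wrong / len(val_set)

best_k = min(errors, key=errors.get)
print(errors, "-> choose k =", best_k)  # {1: 0.5, 3: 0.0, 5: 0.5} -> choose k = 3
```

Here k=1 copies the noisy neighbor's label and k=5 drags in too many unrelated documents, so k=3 wins, matching the bias/variance discussion on the previous slides.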

Cross-Validation for k (example)
- k=1: error = 10
- k=2: error = 5
- k=3: error = 2
- Choose k = 3
[Figure: labeled documents split 80% training / 20% validation; the training set is used to predict the validation set]