Automatic Text Classification through Machine Learning David W. Miller Semantic Web Spring 2002 Department of Computer Science University of Georgia www.cs.uga.edu/~miller/SemWeb.

Presentation transcript:

2 Query to General-Purpose Search Engine: +camp +basketball “north carolina” “two weeks”
Source: Automatic Text Classification through Machine Learning, McCallum et al.

3 Domain-Specific Search Engine
Source: Automatic Text Classification through Machine Learning, McCallum et al.

6 Domain-Specific Search Engine Advantages
- High precision.
- Powerful searches on domain-specific features.
  – by location, time, price, institution.
- Domain-specific presentation interfaces:
  – Topic hierarchies.
  – Specific fields shown in clear format.
  – Links for special relationships.
Source: Automatic Text Classification through Machine Learning, McCallum et al.

7 Domain-Specific Search Engine Disadvantages
- Much human effort to build and maintain!
  – e.g. Yahoo has hired many people to build their hierarchy, and maintain “Full Coverage”, etc.
Source: Automatic Text Classification through Machine Learning, McCallum et al.

8 Tough Tasks
- Find pages that belong in the search engine.
- Find specific fields (price, location, etc).
- Organize the content for browsing.
Source: Automatic Text Classification through Machine Learning, McCallum et al.

9 Machine Learning to the Rescue!
- Find pages that belong in the search engine.
  – Efficient spidering by reinforcement learning.
- Find specific fields (price, location, etc).
  – Information extraction with hidden Markov models.
- Organize the content for browsing.
  – Populate a topic hierarchy by document classification.
Source: Automatic Text Classification through Machine Learning, McCallum et al.

10 Building Text Classifiers
- Manual approach
  – Interactive query refinement
  – Expert system methodologies
- Supervised learning
  1. “Expert” labels example texts with classes
  2. Machine learning algorithm produces rule that tends to agree with expert classifications
Source: Machine Learning for Text Classification, David D. Lewis, AT&T Labs

11 Advantages of Using Machine Learning to Build Classifiers
- Requires no linguistic or computer skills
- Competitive with manual rule-writing
- Forces good practices
  – Looking at data
  – Estimating accuracy
- Can be combined with manual engineering
  – ML research pays too little attention to this
Source: Machine Learning for Text Classification, David D. Lewis, AT&T Labs

12 Main Processes for a Machine-Learning System
[Figure: process boxes (prepare training samples, feature selection, text representation, model induction) connected by flows labeled selected features, feature vectors, and profiles/rules.]
Source: Supervised Machine-Learning Based Text Categorization, Ng Hong I

13 Preparation of Training Texts
- Essential for a supervised machine learning text categorization system
- Decide on the set of categories
- A set of positive training texts is prepared for each of the categories
- Assign subject code(s) to each of the training texts
  – More than one subject code may be assigned to one training text
Source: Supervised Machine-Learning Based Text Categorization, Ng Hong I

14 Demonstration System: Cora
- Find pages that belong in the search engine.
  – Spider CS departments for research papers.
- Find specific fields (price, location, etc).
  – Extract titles, authors, abstracts, institutions, etc. from paper headers and references.
- Organize the content for browsing.
  – Populate a hand-built topic hierarchy by using text classification.
Source: Automatic Text Classification through Machine Learning, McCallum et al.

17 See also CiteSeer [Bollacker, Lawrence & Giles ‘98] Automatic Text Classification through Machine Learning, McCallum, et. al.

19 Automatic Text Classification via Statistical Methods
Text categorization is the problem of assigning predefined categories to free-text documents.
A popular approach is statistical learning methods:
- Bayes Method
- Rocchio Method (most popular)
- Decision Trees
- K-Nearest Neighbor Classification
- Support Vector Machines (fairly new concept)

20 A Probabilistic Generative Model
Define a probabilistic generative model for documents with classes.
Bayes:
Example document: “Reinforcement Learning: a Survey. This paper surveys the field of reinforcement learning from a computer science perspective.”
“Bag-of-words” representation (count, word):
35  a
 1  block
12  computer
 4  field
 1  leg
 7  machine
44  of
 3  paper
 2  perspective
 1  rate
 5  reinforcement
 9  science
 2  survey
56  the
11  this
 1  underrated
…
Source: Automatic Text Classification through Machine Learning, McCallum et al.
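A bag-of-words representation can be sketched in a few lines of Python. This is only an illustration: the tokenizer (lowercased runs of letters) and the function name `bag_of_words` are assumptions, not something the slides specify.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each word to its count, ignoring word order and letter case."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

abstract = ("This paper surveys the field of reinforcement learning "
            "from a computer science perspective.")
counts = bag_of_words(abstract)

# Print the words in alphabetical order with their counts.
for word in sorted(counts):
    print(counts[word], word)
```

The counts on the slide are far larger than those this one sentence yields, which suggests they were taken over a longer text.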

21 Bayes Method
Pick the most probable class, given the evidence:
- $c_j$: a class (like “Planning”)
- $d$: a document (like “language intelligence proof...”)
Bayes Rule: $P(c_j \mid d) = \frac{P(c_j)\,P(d \mid c_j)}{P(d)}$
$P(c_j \mid d)$ is the probability that category $c_j$ should be assigned to document $d$.
Source: Automatic Text Classification through Machine Learning, McCallum et al.

22 Bayes Rule
- $P(c_j \mid d)$: probability that document $d$ belongs to category $c_j$
- $P(d)$: probability that a randomly picked document has the same attributes as $d$
- $P(c_j)$: probability that a randomly picked document belongs to this category
- $P(d \mid c_j)$: probability that category $c_j$ contains document $d$ (the likelihood of $d$ given $c_j$)

23 Bayes Method
- Generates conditional probabilities of particular words occurring in a document, given that it belongs to a particular category.
- A larger vocabulary generates better probability estimates.
- Each category is given a threshold p by which it judges whether a document is worthy of falling into that classification.
- Documents may fall into one category, more than one, or none at all.
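The Bayes method above can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model with add-one (Laplace) smoothing, which the slides do not spell out. The training texts, category names, and function names are hypothetical. It picks the single most probable class and does not show the per-category threshold p.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (list_of_words, category) pairs."""
    doc_counts = Counter()              # number of training documents per category
    word_counts = defaultdict(Counter)  # word counts per category
    vocab = set()
    for words, cat in labeled_docs:
        doc_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def log_posteriors(words, model):
    """Unnormalized log P(c | d) for every category c (P(d) is the same for all c)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for cat in doc_counts:
        score = math.log(doc_counts[cat] / total_docs)  # log P(c)
        total_words = sum(word_counts[cat].values())
        for w in words:
            if w not in vocab:
                continue  # assumption: words never seen in training are ignored
            # log P(w | c) with add-one smoothing
            score += math.log((word_counts[cat][w] + 1) / (total_words + len(vocab)))
        scores[cat] = score
    return scores

def classify(words, model):
    """Pick the most probable class given the evidence."""
    scores = log_posteriors(words, model)
    return max(scores, key=scores.get)

# Hypothetical toy training set.
training = [
    (["planning", "search", "plan"], "Planning"),
    (["plan", "goal", "search"], "Planning"),
    (["language", "grammar", "parse"], "NLP"),
    (["language", "parse", "syntax"], "NLP"),
]
model = train_naive_bayes(training)
print(classify(["language", "parse"], model))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.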

24 Rocchio Method
- Each document D is represented as a vector within a given vector space V.
- Documents with similar content have similar vectors.
- Each dimension of the vector space represents a word selected via a feature selection process.

25 Rocchio Method
- Values of $d^{(i)}$ for a document $d$ are calculated as a combination of the statistics TF(w,d) and DF(w).
- TF(w,d) (Term Frequency) is the number of times word w occurs in a document d.
- DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once.

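26 Rocchio Method
- The inverse document frequency is calculated as $IDF(w) = \log\left(\frac{|D|}{DF(w)}\right)$, where $|D|$ is the total number of documents.
- The value $d^{(i)}$ of feature $w_i$ for a document $d$ is calculated as the product $d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$.
- $d^{(i)}$ is called the weight of the word $w_i$ in the document $d$.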
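The term weighting above can be sketched as follows. The formula images did not survive in this transcript, so the sketch assumes the common definition IDF(w) = log(|D| / DF(w)) with the natural logarithm. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one weight vector per document)."""
    n_docs = len(docs)
    # DF(w): number of documents in which w occurs at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF(w, d): occurrences of w in this document
        # Weight of each word: TF(w, d) * IDF(w), with IDF(w) = log(|D| / DF(w)).
        vectors.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors

docs = [
    ["the", "camp", "camp"],
    ["the", "basketball"],
    ["the", "camp"],
]
vocab, vectors = tfidf_vectors(docs)
for doc_vector in vectors:
    print(dict(zip(vocab, doc_vector)))
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and it gets zero weight everywhere. This matches the heuristic that words spread across many documents are poor indexing terms.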

27 Rocchio Method
- Based on word weight heuristics, the word $w_i$ is an important indexing term for a document d if it occurs frequently in that document.
- However, words that occur frequently in many documents spanning many categories are rated as less important.

28 Decision Tree Learning Algorithm
- Probabilistic methods have been criticized because they are not easily interpreted by humans. Decision trees do not share this problem.
- Decision trees fall into the category of symbolic (non-numeric) algorithms.

29 Decision Trees
- Internal nodes are labeled by terms.
- Branches (departing from a node) are labeled by tests on the weight that the term has in a test document.
- Leaves are labeled by categories.

30 Decision Tree Example

31 Decision Tree
- The classifier categorizes a test document d by recursively testing the weights that the terms labeling the internal nodes have in d, until a leaf node is reached.
- The label of the leaf node is then assigned to the document.
- Most decision trees are binary trees.
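The classification walk described above can be sketched with a small hand-built binary tree. The tree, its terms, thresholds, and category names are all hypothetical, and the dictionary layout is just one convenient encoding. Learning the tree from data is not shown.

```python
def classify(tree, weights):
    """Walk from the root to a leaf, testing term weights in the document.

    tree:    either a category name (leaf) or a dict with keys
             "term", "threshold", "high", "low" (internal node).
    weights: mapping from term to its weight in the test document.
    """
    node = tree
    while isinstance(node, dict):
        # A term missing from the document is treated as having weight 0.
        weight = weights.get(node["term"], 0.0)
        node = node["high"] if weight > node["threshold"] else node["low"]
    return node  # the leaf label is the assigned category

# Hypothetical tree: internal nodes test terms, leaves are categories.
tree = {
    "term": "basketball",
    "threshold": 0.1,
    "high": {
        "term": "camp",
        "threshold": 0.05,
        "high": "Sports camps",
        "low": "Sports",
    },
    "low": "Other",
}

print(classify(tree, {"basketball": 0.4, "camp": 0.2}))
```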

32 Decision Tree
- Fully grown trees tend to have decision rules that are overly specific to the training data and therefore categorize new documents poorly.
  – Therefore, growing and pruning methods for decision trees are normally a standard part of classification packages.

33 K-Nearest Neighbor
Features
  – All instances correspond to points in an n-dimensional Euclidean space
  – Classification is delayed until a new instance arrives
  – Classification is done by comparing feature vectors of the different points
  – Target function may be discrete or real-valued
Source: K-Nearest Neighbor Learning, Dipanjan Chakraborty

34 1-Nearest Neighbor K-Nearest Neighbor Learning, Dipanjan Chakraborty

35 K-Nearest Neighbor
- An arbitrary instance is represented by $(a_1(x), a_2(x), a_3(x), \ldots, a_n(x))$
  – $a_i(x)$ denotes features
- Euclidean distance between two instances: $d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$
- Find the k nearest neighbors whose distance from your test case falls within a threshold p.
- If x of those k nearest neighbors are in category $c_i$, then assign the test case to $c_i$; otherwise it is unmatched.
Source: K-Nearest Neighbor Learning, Dipanjan Chakraborty
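The procedure above can be sketched as follows. The parameter names `max_dist` (the threshold p) and `min_votes` (the x neighbors required) are hypothetical, as are the toy examples. Ties between categories are resolved arbitrarily here, which the slides do not address.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(query, examples, k, max_dist, min_votes):
    """examples: list of (feature_vector, category) pairs.

    Returns a category, or None when the test case is unmatched.
    """
    # Keep the k nearest labeled examples.
    neighbors = sorted(
        ((euclidean(query, vec), cat) for vec, cat in examples),
        key=lambda pair: pair[0],
    )[:k]
    # Only neighbors within the distance threshold may vote.
    votes = Counter(cat for dist, cat in neighbors if dist <= max_dist)
    if not votes:
        return None
    category, count = votes.most_common(1)[0]
    return category if count >= min_votes else None

# Hypothetical labeled points in a 2-dimensional feature space.
examples = [
    ((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
    ((5, 5), "B"), ((5, 6), "B"),
]
print(knn_classify((0.2, 0.2), examples, k=3, max_dist=2.0, min_votes=2))
```

Because classification is delayed until a query arrives, all the work happens inside `knn_classify`. There is no separate training step.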

36 Support Vector Machines
- Based on the Structural Risk Minimization principle from computational learning theory
  – Find a hypothesis h for which we can guarantee the lowest true error
- The true error of h is the probability that h will make an error on an unseen and randomly selected test example

37 Evaluating Learning Algorithms and Software
- How effective/accurate is classification?
- Compatibility with operational environment
- Resource usage
- Persistence
- Areas where learning algorithms need improvement
Source: Machine Learning for Text Classification, David D. Lewis, AT&T Labs

38 Effectiveness: Contingency Table Machine Learning for Text Classification, David D. Lewis, AT&T Labs

39 Effectiveness Measures
- recall = a/(a+c)
- precision = a/(a+b)
- accuracy = (a+d)/(a+b+c+d)
- utility = any weighted average of a, b, c, d
- F-measure = 2a/(2a+b+c)
- others
Source: Machine Learning for Text Classification, David D. Lewis, AT&T Labs
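These measures are easy to compute once the four cells of the contingency table are known. The table image is missing from this transcript, so the sketch below assumes the usual layout, which is consistent with the recall and precision formulas: a = true positives, b = false positives, c = false negatives, d = true negatives. Accuracy is taken as the fraction of correct decisions, (a+d)/(a+b+c+d). The function name and the counts are hypothetical.

```python
def effectiveness(a, b, c, d):
    """Effectiveness measures from contingency-table counts.

    Assumed cell meanings: a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "recall": a / (a + c),
        "precision": a / (a + b),
        "accuracy": (a + d) / (a + b + c + d),
        "f_measure": 2 * a / (2 * a + b + c),
    }

# Hypothetical counts for one category.
measures = effectiveness(a=40, b=10, c=20, d=30)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty denominators. For example, a category with no positive examples (a + c = 0) would need special handling.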

40 Effectiveness: How to Predict
- Theoretical guarantees rarely useful
- Test system on manually classified data
  – Representativeness of sample important
  – Will data vary over time?
  – Effectiveness varies widely across classes and data sets
- Interindexer agreement an upper bound?
Source: Machine Learning for Text Classification, David D. Lewis, AT&T Labs

41 Effectiveness: How to Improve
- More training data
- Better training data
- Better text representation
  – Usual IR tricks (term weighting, etc.)
  – Manually construct good predictor features, e.g. % capitalized letters for spam filtering
- Hand off hard cases to human being
Source: Machine Learning for Text Classification, David D. Lewis, AT&T Labs

42 Conclusions
- The performance of a classifier depends strongly on the choice of data used for evaluation.
- A dense category space becomes problematic for unique categorization, because many documents share characteristics.

43 Credits
*This Presentation is Partially Based on Those of Others Listed Below*
- Supervised Machine Learning Based Text Categorization
- Machine Learning for Text Classification
- Automatically Building Internet Portals using Machine Learning
- Web Search Machine Learning
- K-Nearest Neighbor Learning
Full Presentations can be found at:

44 Resources
- Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification
- A Probabilistic Analysis of the Rocchio Alg. w/ TFIDF for Text Categorization
- Text Categorization w/ Support Vector Machines
- Learning to Extract Symbolic Knowledge from the WWW
- An Evaluation of Statistical Approaches to Text Categorization
- A Comparison of Two Learning Algorithms for Text Categorization
- Machine Learning in Automated Text Categorization
Full List of Resources can be found at: