101035 中文信息处理 Chinese NLP Lecture 13

应用——文本分类(1) Text Classification (1)
文本分类概况(Overview)
文本分类的用途(Applications)
文本的表示(Text representation)
文本特征选择(Feature selection)

文本分类概况 Overview Definition: Text classification, or text categorization, is the process of assigning a text to one or more given classes or categories. In this definition, a text can be a news report, technical paper, email, patent, webpage, book chapter, or a part of one; it can range from a single character or word to an entire book.

Classification System(分类系统) Text classification is mainly concerned with content-based classification. Some well-known classification systems include the Thomson Reuters Business Classification (TRBC) and the Chinese Library Classification (CLC, 中图分类). In many domains, the classification system is manually crafted, for example:
Politics, sports, economy, entertainment, …
Spam, ham
Sensitive, insensitive
Positive, neutral, negative
…

Types of Classification
Two classes (binary), one label
Multiple classes, one label
Multiple classes, multiple labels

Supervised Learning Approach(有监督学习)
Training documents (labeled) → learning machine (an algorithm) → trained machine
Unseen (test, query) document → trained machine → labeled document
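A minimal sketch of this pipeline in Python, assuming scikit-learn is available; the tiny spam/ham corpus and the choice of a naive Bayes learner are illustrative assumptions, not part of the lecture.

```python
# Supervised text classification: labeled training docs -> trained machine ->
# label for an unseen document. The toy corpus below is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["win a free prize now", "free cash prize claim now",
              "meeting agenda for monday", "please review the attached report"]
train_labels = ["spam", "spam", "ham", "ham"]

# Learning machine (an algorithm): bag-of-words features + naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)            # training -> trained machine

# Unseen (test, query) document -> labeled document.
print(model.predict(["claim your free prize"]))  # e.g. ['spam']
```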

Mathematical Definition of Text Classification(数学定义) Mathematically, text classification is a mapping of unclassified text to the given classes. The mapping can be one-to-one or one-to-many. Each pair (di, cj) ∈ D × C, where di is a document in the document set D and cj is a class in the class set C, is assigned a Boolean value: True means the document belongs to cj, and False means it does not. The classification model is thus a function f: D × C → {True, False}.

In-Class Exercise
To automatically decide whether an English word is spelled correctly or not is a _____________ classification problem.
A) one-class, one-label
B) one-class, two-label
C) two-class, one-label
D) two-class, two-label

文本分类的用途 Applications
Spam Filtering(垃圾邮件过滤)
Genre Recognition(文体识别)

Authorship Identification(作者身份识别)
Webpage Categorization(网页分类)
Sentiment Analysis(情感分析)

文本的表示 Text Representation Before being applied to a learning algorithm, a target text must be properly represented. Features capture the most important information in the text; choosing N features defines the N dimensions of the text's vector representation.

Text Features
Characters: applicable to Chinese text(字)
Words: for Chinese, available after word segmentation is done; many text classification applications use only word features, which is called the BOW (Bag-of-Words) model
N-grams: n-grams generalize words (a word is a unigram); the character bigrams of 中国人民 are (中国, 国人, 人民); large n-grams cause data sparseness problems (see the sketch below)
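A minimal sketch of character n-gram extraction; the helper name char_ngrams is an assumption for illustration, not something defined in the lecture.

```python
# Overlapping character n-grams for Chinese text; char_ngrams is an
# illustrative helper name.
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of the input string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("中国人民", 2))  # ['中国', '国人', '人民']
print(char_ngrams("中国人民", 3))  # ['中国人', '国人民']
```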

Text Features
POS: rarely used alone
Punctuations and Symbols: some of them (e.g. "!", the emoticon ":-)") are effective for special text such as tweets
Syntactic Patterns: usable after syntactic parsing is done; a pattern (feature) looks like "NP VP PP"
Semantic Patterns: usable after semantic analysis (e.g. SRL) is done; a pattern (feature) looks like "Agent Target Patient Instrument"

Vector Space Model(向量空间模型) The Vector Space Model (VSM) is based on statistics and vector algebra. A document is represented as a vector of features (e.g. words). Each dimension corresponds to a feature. If there are n features, a document is an n-dimensional vector. If a feature occurs in the document, its value in the vector is non-zero; this value is known as the weight of the term, and it can be binary, a count, or real-valued.

Binary Weights
Doc 1: Computers have brought the world to our fingertips. We will try to understand at a basic level the science – old and new – underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies.
Doc 2: An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science.

Features:   engineering  knowledge  science
Doc 1:           0           1         1
Doc 2:           1           0         1

The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model.

Term Frequency (TF) Weights
Doc 1: Computers have brought the world to our fingertips. We will try to understand at a basic level the science – old and new – underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies.
Doc 2: An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science.

Features:   engineering  knowledge  science
Doc 1:           0           1         1
Doc 2:           1           0         2

The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model.
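A minimal sketch of binary and tf weighting over this three-word feature set, assuming naive lowercase tokenization; the helper names are illustrative.

```python
# Binary vs. term-frequency (tf) weights for a fixed feature set.
import re

FEATURES = ["engineering", "knowledge", "science"]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def tf_vector(text: str) -> list[int]:
    tokens = tokenize(text)
    return [tokens.count(f) for f in FEATURES]

def binary_vector(text: str) -> list[int]:
    tokens = set(tokenize(text))
    return [int(f in tokens) for f in FEATURES]

doc2 = ("An introduction to computer science in the context of scientific, "
        "engineering, and commercial applications. The goal of the course is "
        "to teach basic principles and practical issues, while at the same time "
        "preparing students to use computers effectively for applications in "
        "computer science.")

print(binary_vector(doc2))  # [1, 0, 1]
print(tf_vector(doc2))      # [1, 0, 2]
```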

Term Weighting Schemes
The raw tf is usually normalized by some quantity related to document length, to prevent a bias towards longer documents. A usual way of normalization is Euclidean normalization. If d = (d1, d2, …, dn) is the vector representation of a document d in an n-dimensional vector space, the Euclidean length of d is defined to be
||d||2 = sqrt(d1^2 + d2^2 + … + dn^2)
Then the normalized vector is d' = (d1/||d||2, d2/||d||2, …, dn/||d||2).

tf values:
Features:   Doc 1  Doc 2  Doc 3
engineering   0      1      2
knowledge     1      0      0
science       1      2      4
Length       √2     √5     √20

Euclidean normalized tf values:
Features:   Doc 1  Doc 2  Doc 3
engineering   0    0.447  0.447
knowledge   0.707    0      0
science     0.707  0.894  0.894
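A minimal sketch of Euclidean (L2) normalization that reproduces the numbers in the table above.

```python
# Euclidean (L2) normalization of the tf vectors from the table above.
import math

def l2_normalize(vec: list[float]) -> list[float]:
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else vec

docs = {"Doc 1": [0, 1, 1], "Doc 2": [1, 0, 2], "Doc 3": [2, 0, 4]}
for name, vec in docs.items():
    print(name, [round(x, 3) for x in l2_normalize(vec)])
# Doc 1 [0.0, 0.707, 0.707]
# Doc 2 [0.447, 0.0, 0.894]
# Doc 3 [0.447, 0.0, 0.894]  <- proportional tf vectors normalize to the same vector
```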

Term Weighting Schemes
The inverse document frequency (idf) is a measure of the general importance of a term t in the document collection. The idf weight of term t is defined as
idf_t = log(N / df_t)
where N is the total number of documents in the collection and the document frequency df_t is the number of documents in the collection that contain t. The tf.idf weight of a term is the product of its tf weight and its idf weight. It is one of the best-known weighting schemes and is widely used in NLP applications.
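A minimal sketch of idf and tf.idf over a tiny collection consisting of Doc 1 and Doc 2 from the earlier slides; the choice of log base 10 is an assumption here.

```python
# idf and tf.idf weights over a two-document collection (Doc 1, Doc 2).
import math

features = ["engineering", "knowledge", "science"]
tf = {"Doc 1": [0, 1, 1], "Doc 2": [1, 0, 2]}

N = len(tf)                                        # documents in the collection
df = [sum(1 for vec in tf.values() if vec[i] > 0)  # document frequency of each term
      for i in range(len(features))]
idf = [math.log10(N / d) if d else 0.0 for d in df]
print([round(w, 3) for w in idf])                  # [0.301, 0.301, 0.0]

# tf.idf = tf weight * idf weight
tfidf = {doc: [round(t * w, 3) for t, w in zip(vec, idf)] for doc, vec in tf.items()}
print(tfidf["Doc 2"])                              # [0.301, 0.0, 0.0]
```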

In-Class Exercise
The following table lists the tf values of 3 documents as well as the idf values of the 3 words. Compute the vectors for the 3 documents using the tf.idf weighting scheme.

Features:   engineering  knowledge  science
IDF:           0.477       0.447       0
Doc 1:           0           1          1
Doc 2:           1           0          2
Doc 3:           0           0          2

文本特征选择 Feature Selection
Motivation: a document collection is represented as a feature matrix X = {xij} with m rows (document vectors xi) and n columns (features), together with labels y = {yj} and model weights w learned over the features. The number of features n is usually large, so we need to select only a subset of all the features.

Information Gain (IG, 信息增益)
For feature t, IG measures the information gained about the classes by knowing whether or not a document contains t:
IG(t) = −Σi P(ci) log P(ci) + P(t) Σi P(ci|t) log P(ci|t) + P(~t) Σi P(ci|~t) log P(ci|~t), with the sums running over i = 1, …, m
P(ci): probability of documents of class ci
P(t): probability of documents with feature t
P(~t): probability of documents without feature t
P(ci|t): probability of documents of class ci given that they have feature t
P(ci|~t): probability of documents of class ci given that they do not have feature t
m: number of classes

Information Gain
The probabilities are estimated using MLE (Maximum Likelihood Estimation, 最大似然估计), e.g.
P(ci|t) = Count(documents of ci with feature t) / Count(documents with feature t)
One advantage of IG is that it also considers the contribution of a feature not occurring in the text. IG performs poorly if the class distribution and feature distribution are very unbalanced.
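A minimal sketch of IG computed from per-class document counts, following the formula above; the counts in the example call are invented.

```python
# Information gain of a feature t from per-class document counts (MLE estimates).
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(with_t, without_t):
    """with_t[i] / without_t[i]: number of documents of class i that do /
    do not contain feature t."""
    n_t, n_not_t = sum(with_t), sum(without_t)
    n = n_t + n_not_t
    h_prior = entropy([(a + b) / n for a, b in zip(with_t, without_t)])
    h_t = entropy([a / n_t for a in with_t]) if n_t else 0.0
    h_not_t = entropy([b / n_not_t for b in without_t]) if n_not_t else 0.0
    return h_prior - (n_t / n) * h_t - (n_not_t / n) * h_not_t

# t appears in 40 sports and 5 politics documents, and is absent
# from 10 sports and 45 politics documents (invented counts):
print(round(information_gain([40, 5], [10, 45]), 3))  # 0.397
```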

Mutual Information (MI, 互信息)
MI measures the correlation between feature t and class c. With the contingency counts
        c    ~c
  t     A     B
 ~t     C     D
(A: documents of class c containing t; B: documents of other classes containing t; C: documents of class c without t; D: documents of other classes without t), MI is defined as
MI(t, c) = log [ P(t, c) / (P(t) P(c)) ]
or, estimated from the counts,
MI(t, c) ≈ log [ A × N / ((A + C) × (A + B)) ]
where N = A + B + C + D.

Mutual Information
MI is a widely used measure in statistical language models. For multiple classes, we often take either the maximum or the average MI:
MI_max(t) = max(1≤i≤m) MI(t, ci)
MI_avg(t) = Σ(i=1..m) P(ci) MI(t, ci)
MI is not very effective for low-frequency features.
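A minimal sketch of the count-based MI estimate; the A/B/C/D numbers are invented, and the natural log is an arbitrary choice (the base only rescales the scores).

```python
# Count-based mutual information estimate MI(t, c) ~ log(A*N / ((A+C)*(A+B))).
# Assumes A > 0; the counts below are invented.
import math

def mutual_information(A: int, B: int, C: int, D: int) -> float:
    N = A + B + C + D
    return math.log((A * N) / ((A + C) * (A + B)))

# t occurs in 49 documents of class c and 27 documents of other classes;
# 141 documents of class c and 774 documents of other classes lack t.
print(round(mutual_information(49, 27, 141, 774), 3))  # 1.213
```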

Chi Square (χ2, Chi方统计)
χ2 measures the correlation between feature t and class c, using the same A/B/C/D contingency counts as MI. It is defined as
χ2(t, c) = N (AD − CB)^2 / [(A + C)(B + D)(A + B)(C + D)]
where N = A + B + C + D.

Chi Square
For multiple classes, we often take either the maximum or the average χ2:
χ2_max(t) = max(1≤i≤m) χ2(t, ci)
χ2_avg(t) = Σ(i=1..m) P(ci) χ2(t, ci)
Unlike MI, χ2 is a normalized statistic. Like MI, χ2 is not very effective for low-frequency features.
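A minimal sketch of the χ2 statistic from the same contingency counts, reusing the invented numbers from the MI example above.

```python
# Chi-square statistic from the same A/B/C/D contingency counts as above.
def chi_square(A: int, B: int, C: int, D: int) -> float:
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

print(round(chi_square(49, 27, 141, 774), 1))  # 109.0, a strong t-c association
```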

Summary
Using IG, MI, or χ2, we can keep the features whose score is above a threshold (an absolute value), or keep a given proportion of the top-scoring features (e.g. 10%). Using the selected features often results in lower computational cost and similar, or even better, classification performance. Experiments are needed to decide which measure is best for a target problem.
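A minimal sketch of both selection strategies over a dictionary of feature scores; the features and score values are invented.

```python
# Select features by absolute threshold or by keeping a top proportion.
scores = {"足球": 0.82, "比赛": 0.61, "股票": 0.05, "今天": 0.01}  # invented scores

def select_by_threshold(scores: dict, threshold: float) -> list[str]:
    return [f for f, s in scores.items() if s >= threshold]

def select_by_proportion(scores: dict, proportion: float) -> list[str]:
    k = max(1, int(len(scores) * proportion))
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(select_by_threshold(scores, 0.5))    # ['足球', '比赛']
print(select_by_proportion(scores, 0.25))  # ['足球']
```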

Wrap-Up
文本分类概况 (Overview): Definitions, Classification Systems, Classification Types
文本分类的用途 (Applications)
文本的表示 (Text Representation): Text Features, Vector Space Model
文本特征选择 (Feature Selection): Information Gain, Mutual Information, Chi Square