
1 Chapter Ⅳ. Categorization
February 15, 2007 / Artificial Intelligence Laboratory / Seungmi Song
Text: THE TEXT MINING HANDBOOK, pages 64-81

2 Preview
• Ⅳ.1 Applications of text categorization
  - Ⅳ.1.1 Indexing of Texts Using Controlled Vocabulary
  - Ⅳ.1.2 Document Sorting and Text Filtering
  - Ⅳ.1.3 Hierarchical Web Page Categorization
• Ⅳ.2 Definition of the problem
  - Ⅳ.2.1 Single-Label vs. Multilabel Categorization
  - Ⅳ.2.2 Document-Pivoted vs. Category-Pivoted Categorization
  - Ⅳ.2.3 Hard vs. Soft Categorization
• Ⅳ.3 Document representation
  - Ⅳ.3.1 Feature selection
  - Ⅳ.3.2 Dimensionality reduction by feature extraction
• Ⅳ.4 Knowledge engineering approach to TC

3 Ⅳ.1 Applications of text categorization
• Text categorization (TC)
  - Classifying a given data instance into a prespecified set of categories
  - Given a set of categories (subjects, topics) and a collection of text documents: the process of finding the correct topic for each document
• Two main approaches to text categorization
  - The knowledge engineering approach
  - The machine learning approach → most of the recent work on categorization
• The knowledge acquisition bottleneck
  - The huge amount of highly skilled labor and expert knowledge required to create and maintain the knowledge-encoding rules
• Three common TC applications
  - Text indexing
  - Document sorting and text filtering
  - Web page categorization

4 Ⅳ.1.1 Indexing of Texts Using Controlled Vocabulary
• Controlled vocabulary
  - The key terms all belong to a finite set
  - Ex) the NASA aerospace thesaurus, the MeSH thesaurus for medicine
• Text indexing
  - The task of assigning keywords from a controlled vocabulary to text documents
  - keywords → categories ⇒ text indexing → text categorization
• Typically
  - Each document gets at least one and not more than k keywords
  - Done either fully automatically or semiautomatically (a minimal sketch follows below)
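A minimal sketch (not from the book; the thesaurus terms, scores, and threshold are invented) of the "at least one, at most k keywords" constraint:

```python
# Minimal sketch (invented thesaurus and relevance scores): assign between
# 1 and k keywords from a controlled vocabulary, by descending score.
def index_document(scores: dict, k: int = 3, threshold: float = 0.5) -> list:
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = [term for term in ranked if scores[term] >= threshold][:k]
    return chosen or ranked[:1]    # never return fewer than one keyword

scores = {"aerodynamics": 0.9, "propulsion": 0.6, "avionics": 0.4}
print(index_document(scores))      # ['aerodynamics', 'propulsion']
```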

5 Ⅳ.1.2 Document Sorting and Text Filtering (1/2)
• Document sorting: sorting the given collection of documents into several “bins”
  - Ex) in a newspaper, the classified ads: Personal / Car Sale / Real Estate / and so on
  - Ex) e-mail: Complaints / Deals / Job applications / others
• Document sorting feature
  - Each document belongs to exactly one category
• Text filtering
  - Document sorting with only two bins: “relevant” and “irrelevant”
  - Ex) an e-mail client should filter away spam

6 Ⅳ.1.2 Document Sorting and Text Filtering (2/2)
• For many filtering tasks, recall errors are much more costly than precision errors
  - Ex) recall error: an important letter is considered spam → it is missing from the “good documents” category
  - Ex) precision error: some spam still passes through → the “good documents” category contains some extra letters
• Recall errors: a category is missing some documents that should have been assigned to it
• Precision errors: a category includes documents that should not belong to it
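To make the two error types concrete, here is a minimal sketch (not from the book; the labels are invented) that counts them for the “good documents” bin of a spam filter:

```python
# Minimal sketch (invented labels): recall and precision of one category,
# computed from parallel lists of true and predicted labels.
def recall_precision(truth, predicted, category="good"):
    tp = sum(t == category == p for t, p in zip(truth, predicted))
    fn = sum(t == category != p for t, p in zip(truth, predicted))  # recall errors
    fp = sum(t != category == p for t, p in zip(truth, predicted))  # precision errors
    return tp / (tp + fn), tp / (tp + fp)

truth     = ["good", "good", "spam", "good", "spam"]
predicted = ["good", "spam", "spam", "good", "good"]
r, p = recall_precision(truth, predicted)
print(f"recall={r:.2f} (one recall error), precision={p:.2f} (one precision error)")
```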

7 Ⅳ.1.3 Hierarchical Web Page Categorization
• The automatic classification of Web pages under the hierarchical catalogues posted by popular Internet portals
• Such catalogues are very useful
  - For direct browsing
  - For restricting a query-based search to pages belonging to a particular topic
• A constraint on the number of documents belonging to a particular category → prevents categories from becoming excessively large
  - Whenever the number of documents in a category exceeds k, it should be split into two or more subcategories
  - → the categorization system must support adding new categories and deleting obsolete ones (a toy sketch of the split step follows below)
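A toy sketch (not from the book; real catalogues would split by topic, not by halves) of the split-when-too-large rule:

```python
# Toy sketch (invented data): when a catalogue category holds more than k
# documents, split it into subcategories; the halving here is a placeholder
# for a real topical split.
def maybe_split(name, docs, k=2):
    if len(docs) <= k:
        return {name: docs}
    mid = len(docs) // 2
    return {name + "/sub1": docs[:mid], name + "/sub2": docs[mid:]}

print(maybe_split("Science", ["d1", "d2", "d3", "d4"], k=2))
# {'Science/sub1': ['d1', 'd2'], 'Science/sub2': ['d3', 'd4']}
```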

8 Ⅳ.2 Definition of the problem
• Definition of text categorization
  - D: the set of all possible documents
  - C: the set of predefined categories
  - F(d, c) = 1 if the document d belongs to the category c, and 0 otherwise
• The approximating function is called a classifier
• Goal: build a classifier that produces results as “close” as possible to the true category assignment function F
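As a toy illustration (not from the book; the categories and keyword lists are invented), a classifier is just a computable approximation of F:

```python
# Toy sketch (invented categories): a keyword classifier approximating the
# target function F(d, c) in {0, 1}.
CATEGORIES = {"sports": {"game", "score"}, "finance": {"stock", "market"}}

def classify(document: str, category: str) -> int:
    words = set(document.lower().split())
    return int(bool(words & CATEGORIES[category]))   # approximates F(d, c)

print(classify("The stock market rallied today", "finance"))  # 1
print(classify("The stock market rallied today", "sports"))   # 0
```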

9 Ⅳ.2.1 Single-Label vs. Multilabel Categorization
• Depends on the properties of F
• In multilabel categorization
  - The categories overlap: a document may belong to any number of categories
  - |C|: the number of categories
  - |C| binary classifiers provide the decisions to assign a document to the different categories (see the sketch below)
• In single-label categorization
  - Each document belongs to exactly one category
  - Ex) binary categorization: the number of categories is two; the simplest, most common case, most often used for the demonstration of categorization techniques
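A minimal sketch of the |C|-binary-classifiers construction (assumes scikit-learn is installed; the documents and labels are invented):

```python
# Minimal sketch (invented data; assumes scikit-learn): multilabel TC via
# one binary classifier per category (one-vs-rest).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["stocks fell on weak earnings",
        "the team won the final game",
        "the club's stock rose after the big game"]
labels = [["finance"], ["sports"], ["finance", "sports"]]  # overlapping

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                 # one binary column per category
X = TfidfVectorizer().fit_transform(docs)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)  # |C| classifiers
print(mlb.inverse_transform(clf.predict(X)))  # predicted category sets
```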

10 Ⅳ.2.2 Document-Pivoted vs. Category-Pivoted Categorization
• Document-pivoted categorization
  - Given a document, the classifier finds all categories to which the document belongs
• Category-pivoted categorization
  - Given a category, find all documents that should be filed under it
• The difference is significant when not all documents or not all categories are immediately available
  - Ex) the documents come in one by one → document-pivoted categorization
  - Ex) the category set is not fixed and the documents need to be reclassified → category-pivoted categorization
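A toy sketch (invented data) showing that the two pivots just iterate the same membership test in different orders:

```python
# Toy sketch (invented data): document-pivoted vs. category-pivoted use of
# the same yes/no membership test.
KEYWORDS = {"finance": {"stock"}, "sports": {"game"}}

def belongs(doc: str, cat: str) -> bool:
    return bool(set(doc.lower().split()) & KEYWORDS[cat])

docs = ["a stock game", "stock tips", "the big game"]
cats = ["finance", "sports"]

doc_pivoted = {d: [c for c in cats if belongs(d, c)] for d in docs}  # per document
cat_pivoted = {c: [d for d in docs if belongs(d, c)] for c in cats}  # per category
print(doc_pivoted)
print(cat_pivoted)
```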

11 Ⅳ.2.3 Hard vs. Soft Categorization
• Hard categorization
  - A fully automated categorization system makes a binary decision on each document-category pair
  - The level of performance may be insufficient for some applications
• Soft (or ranking) categorization
  - A semiautomated approach in which the decision to assign a document to a category is made by a human, for whom the TC system provides a list of categories ranked by the system's estimated appropriateness of the category for the document
• Categorization status value (CSV)
  - Many classifiers actually have the whole segment [0, 1] as their range: they produce a real value between zero and one for each document-category pair
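A minimal sketch (invented CSV scores) of soft categorization: rank categories by CSV and leave the assignment decision to a human:

```python
# Minimal sketch (invented scores): soft (ranking) categorization orders the
# categories for one document by its categorization status value (CSV).
csv_scores = {"finance": 0.91, "sports": 0.40, "politics": 0.08}

for category, csv in sorted(csv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: CSV = {csv:.2f}")
# A hard categorizer would threshold instead, e.g. assign iff CSV >= 0.5.
```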

12 Ⅳ.3 Document representation (1/2)
• A preprocessing step
  - The documents are converted into a more manageable representation
• Feature vector
  - Each document is represented by a feature vector: a point in feature space given as a sequence of features and their weights
• Bag-of-words model
  - Uses all words in a document as the features
  - The dimension of the feature space = the number of different words in all of the documents
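A minimal sketch (invented corpus) of bag-of-words vectors with the simplest, binary weighting (the weighting schemes themselves are the topic of the next slide):

```python
# Minimal sketch (invented corpus): binary bag-of-words vectors; the feature
# space has one dimension per distinct word in the whole collection.
corpus = ["the cat sat", "the dog sat", "the dog barked"]

vocabulary = sorted({w for doc in corpus for w in doc.split()})

def to_vector(doc):
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocabulary]  # binary weights

print(vocabulary)                    # dimension = number of distinct words
for doc in corpus:
    print(doc, "->", to_vector(doc))
```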

13 Ⅳ.3 Document representation (2/2)
• The methods of giving weights to the features may vary
  - The simplest is binary: 1 if the corresponding word is present in the document, 0 otherwise
  - More complex weighting schemes take into account the frequencies of the word in the document, the category, and the whole collection
• The TF-IDF scheme
  - weight(w, d) = TermFreq(w, d) · log(N / DocFreq(w))
  - TermFreq(w, d): the frequency of the word w in the document d
  - N: the total number of documents
  - DocFreq(w): the number of documents containing the word w
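The weighting computed directly from those definitions, as a minimal sketch (invented corpus):

```python
# Minimal sketch (invented corpus): TF-IDF weight of a word in a document,
# computed from the definitions on the slide.
import math

corpus = ["the cat sat", "the dog sat", "the dog barked"]
N = len(corpus)                                   # total number of documents

def doc_freq(w):
    return sum(1 for doc in corpus if w in doc.split())

def tfidf(w, doc):
    return doc.split().count(w) * math.log(N / doc_freq(w))

print(tfidf("the", corpus[0]))   # 0.0 -- occurs in every document
print(tfidf("cat", corpus[0]))   # ~1.10 -- rare and present here
```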

14 Ⅳ.3.1 Feature selection (1/3)
• The number of different words is large
  - The dimension of the bag-of-words feature space is large
  - The document representation vectors are sparse
  - Most of those words are irrelevant to the categorization task
• Feature selection
  - Removes the irrelevant words
• Most TC systems
  - At least remove the stop words
  - Many perform much more aggressive filtering, removing 90 to 99 percent of all features

15 Ⅳ.3.1 Feature selection (2/3)
• In order to perform the filtering, a measure of the relevance of each feature needs to be defined
• The simplest measure: the document frequency DocFreq(w)
• Experimental evidence
  - Using only the top 10 percent of the most frequent words does not reduce the performance of classifiers
  - This seems to contradict the well-known “law” of IR that terms with low-to-medium document frequency are the most informative
  - → There is no contradiction, because the large majority of all words have a very low document frequency, and the top 10 percent do contain all the low-to-medium-frequency words
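A minimal sketch (invented, pre-tokenized corpus) of filtering by document frequency, the simplest relevance measure above:

```python
# Minimal sketch (invented corpus): keep only the top 10 percent of features
# ranked by document frequency, then drop the rest from every document.
from collections import Counter

corpus = [["stock", "market", "up"], ["stock", "down"],
          ["game", "score", "up"], ["game", "market"]]

doc_freq = Counter(w for doc in corpus for w in set(doc))
ranked = [w for w, _ in doc_freq.most_common()]
keep = max(1, len(ranked) // 10)         # top 10 percent, at least one feature
selected = set(ranked[:keep])
print(selected)                          # the retained feature set

filtered = [[w for w in doc if w in selected] for doc in corpus]
print(filtered)
```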

16 Ⅳ.3.1 Feature selection (3/3)
• More sophisticated measures of feature relevance take into account the relations between features and the categories
  - Ex1) information gain: measures the number of bits of information obtained for the prediction of categories by knowing the presence or absence of the feature f in a document
  - Ex2) chi-square: measures the maximal strength of dependence between the feature and the categories
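A minimal sketch of the chi-square measure for one feature/category pair (the counts are invented; the 2x2 formulation follows common IR usage, not necessarily the book's exact notation):

```python
# Minimal sketch (invented counts): chi-square dependence between a feature f
# and a category c, from a 2x2 contingency table.
def chi_square(A, B, C, D):
    """A: f present, doc in c     B: f present, doc not in c
       C: f absent,  doc in c     D: f absent,  doc not in c"""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# e.g. 'stock' vs. the 'finance' category: strong dependence in this toy table
print(chi_square(A=40, B=5, C=10, D=45))   # ~49.5
```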

17 Ⅳ.3.2 Dimensionality reduction by feature extraction
• Another way of reducing the number of dimensions is to create a new, much smaller set of synthetic features from the original feature set
• In effect
  - This amounts to creating a transformation from the original feature space to another space of much lower dimension
• The rationale for using synthetic features rather than naturally occurring words is that, owing to polysemy, homonymy, and synonymy, the words may not be the optimal features
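One common instance of feature extraction is latent semantic indexing via truncated SVD; the slide describes the idea in general, so treat this as an illustrative choice (assumes scikit-learn; the corpus is invented):

```python
# Minimal sketch (invented corpus; assumes scikit-learn): a handful of
# synthetic features obtained by truncated SVD over the word-feature space.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks and markets", "market prices rose",
        "the game score", "final game results"]

X = TfidfVectorizer().fit_transform(docs)            # original word features
Z = TruncatedSVD(n_components=2).fit_transform(X)    # 2 synthetic features
print(X.shape, "->", Z.shape)                        # e.g. (4, 11) -> (4, 2)
```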

18 Ⅳ.4 Knowledge engineering approach to TC (1/2)
• The knowledge engineering approach to TC is focused around manual development of classification rules
• The CONSTRUE system
  - A notable example of the knowledge engineering approach to TC
  - A typical CONSTRUE rule is a boolean combination of word-occurrence tests; the rule shown on the original slide did not survive the transcript, but the commonly cited example is reconstructed below
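A reconstruction of the widely cited CONSTRUE example rule for the Reuters “wheat” category (the exact wording is quoted from secondary sources, so treat it as an approximation), expressed here as a Python predicate:

```python
# Reconstruction (from commonly quoted secondary sources, not the lost slide):
# the CONSTRUE 'wheat' rule as a predicate over the set of words in a document.
def wheat_rule(words):
    return (("wheat" in words and "farm" in words) or
            ("wheat" in words and "commodity" in words) or
            ("bushels" in words and "export" in words) or
            ("wheat" in words and "tonnes" in words) or
            ("wheat" in words and "winter" in words and "soft" not in words))

doc = "export of winter wheat in tonnes rose sharply"
print("WHEAT" if wheat_rule(set(doc.split())) else "not WHEAT")  # WHEAT
```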

19 Ⅳ.4 Knowledge engineering approach to TC (2/2)
• The system was reported to produce a 90-percent breakeven between precision and recall (on a 723-document test collection)
• It is unclear
  - whether the particular chosen test collection influenced the results
  - whether the system would scale up
  - but such excellent performance has not yet been attained by machine learning systems
• Knowledge acquisition bottleneck
  - Makes the machine learning approach attractive despite possibly somewhat lower-quality results

