1
Implementing Neural Networks for Text Classification: Data Sets
Prerak Sanghvi
Computer Science and Engineering Department
State University of New York at Buffalo
2
Data Set Selection
There are two types of data sets that can be used:
– A collection of documents compiled manually (from the web, etc.) specifically for this project
– An existing data set that has already been worked on by other researchers
3
Advantages of Standard Data Sets
– We do not have to do the work of obtaining the data.
– The distribution of documents in the corpora is even, and the documents are well classified.
– Results can be compared with those of other researchers, which gives a comparative evaluation of the classification algorithm being used.
4
Most popular corpora
The most popular corpora used for text-classification research are:
– Reuters-21578 data set (21,578 newswire articles from Reuters; available as SGML documents, 1000 documents in each file)
– 20-newsgroups data set (20,000 newsgroup postings from 20 newsgroups; available as text files, one document per file)
– WebKB database (web pages from 4 universities class)
5
Reuters-21578 data set
Data is classified into five groups of classes:

Category set   Number of categories   Categories with 1+ occurrences   Categories with 20+ occurrences
EXCHANGES       39                     32                                7
ORG             56                     32                                9
PEOPLE         267                    114                               14
PLACES         175                    147                               60
TOPICS         135                    120                               47
TOTAL          672                    445                              137
6
Reuters-21578 data set
Categories are overlapping and non-exhaustive.
– Overlapping: one document can be classified into more than one category. E.g. a document can be about ‘nasdaq’ (EXCHANGES) and about ‘USA’ (PLACES) in general.
– Non-exhaustive: there are categories into which no documents fall, and there are documents that do not fall into any category.
Categories with 20+ occurrences are too few. An ANN approach would probably not work with so few examples.
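One way to represent overlapping, non-exhaustive categories as network targets is a multi-hot vector, after discarding categories that occur too rarely. The sketch below is only an illustration: the toy documents, the function names, and the small `min_count` value are assumptions, not part of the slides.

```python
from collections import Counter

# Hypothetical toy documents: each one carries zero or more category labels.
docs = [
    {"id": 1, "labels": ["nasdaq", "usa"]},  # overlapping: two categories
    {"id": 2, "labels": ["usa"]},
    {"id": 3, "labels": []},                 # non-exhaustive: no category at all
    {"id": 4, "labels": ["earn", "usa"]},
]

def frequent_categories(docs, min_count):
    """Return, sorted, the categories that occur in at least min_count documents."""
    counts = Counter(label for d in docs for label in set(d["labels"]))
    return sorted(c for c, n in counts.items() if n >= min_count)

def multi_hot(labels, categories):
    """Encode a label list as a 0/1 vector with one position per category."""
    present = set(labels)
    return [1 if c in present else 0 for c in categories]

# The slides use 20 occurrences as the cut-off; 1 keeps every toy category.
categories = frequent_categories(docs, min_count=1)
targets = [multi_hot(d["labels"], categories) for d in docs]
```

A document with no category simply gets an all-zero target, and raising `min_count` shrinks the target vector to the better-populated categories.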
7
Example of a Reuters-21578 document
20-MAR-1987 16:54:10.55
earn
usa
GANTOS INC <GTOS> 4TH QTR JAN 31 NET
GRAND RAPIDS, MICH., March 20 -
Shr 43 cts vs 37 cts
Net 2,276,000 vs 1,674,000
Revs 32.6 mln vs 24.4 mln
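The example above is shown with its SGML markup stripped. The following minimal sketch pulls the date, category labels, title, and body out of one article. The tag names (`DATE`, `TOPICS`, `PLACES`, `D`, `TITLE`, `BODY`) are assumed from the public Reuters-21578 distribution, and the sample markup is a simplified reconstruction around the slide's text, not a verbatim file excerpt.

```python
import html
import re

# Assumed, simplified markup; real files carry extra attributes and tags.
SAMPLE = """<REUTERS>
<DATE>20-MAR-1987 16:54:10.55</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<TEXT>
<TITLE>GANTOS INC &lt;GTOS> 4TH QTR JAN 31 NET</TITLE>
<BODY>Shr 43 cts vs 37 cts
Net 2,276,000 vs 1,674,000
Revs 32.6 mln vs 24.4 mln
</BODY></TEXT>
</REUTERS>"""

def field(doc, tag):
    """Return the unescaped text inside the first <tag>...</tag>, or ''."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.S)
    return html.unescape(match.group(1)).strip() if match else ""

def labels(doc, tag):
    """Return the <D> entries inside <tag>...</tag> as a list (possibly empty)."""
    block = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.S)
    return re.findall(r"<D>(.*?)</D>", block.group(1)) if block else []

def parse_article(doc):
    """Collect the fields a classifier would typically need from one article."""
    return {
        "date": field(doc, "DATE"),
        "topics": labels(doc, "TOPICS"),
        "places": labels(doc, "PLACES"),
        "title": field(doc, "TITLE"),
        "body": field(doc, "BODY"),
    }

record = parse_article(SAMPLE)
```

Regular expressions are enough for a sketch like this; a full loader would also have to split each file into its individual articles.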
8
20-newsgroups data set
– Each document is in a separate text file.
– There are 1000 documents from each newsgroup.
– Each document has only one source newsgroup, so each document falls into only one category.
– The task of classification pertains to determining the source newsgroup of the document.
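Because every document is one file with exactly one source newsgroup, loading the corpus amounts to walking a directory tree and using the folder name as the class label. The sketch below assumes a `root/<newsgroup>/<document file>` layout; the function name and the tiny stand-in corpus are hypothetical and exist only so that the example runs on its own.

```python
import tempfile
from pathlib import Path

def load_corpus(root):
    """Return (texts, labels, class_names) for a root/<newsgroup>/<file> layout."""
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for index, name in enumerate(class_names):
        for path in sorted((root / name).iterdir()):
            if path.is_file():
                # latin-1 maps every byte to a character, so decoding cannot fail
                texts.append(path.read_text(encoding="latin-1"))
                labels.append(index)  # single-label: one class index per document
    return texts, labels, class_names

# Build a tiny stand-in corpus in a temporary directory and load it.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("alt.atheism", "1", "first post"),
               ("sci.space", "2", "second post"),
               ("sci.space", "3", "third post")]
    for group, doc_id, body in samples:
        folder = Path(tmp) / group
        folder.mkdir(exist_ok=True)
        (folder / doc_id).write_text(body, encoding="latin-1")
    texts, labels, class_names = load_corpus(tmp)
```

Each label is a single integer index into `class_names`, which matches the one-category-per-document property of this data set.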
9
20-newsgroups data set
alt.atheism                 rec.sport.hockey
comp.graphics               sci.crypt
comp.os.ms-windows.misc     sci.electronics
comp.sys.ibm.pc.hardware    sci.med
comp.sys.mac.hardware       sci.space
comp.windows.x              soc.religion.christian
misc.forsale                talk.politics.guns
rec.autos                   talk.politics.mideast
rec.motorcycles             talk.politics.misc
rec.sport.baseball          talk.religion.misc
10
Example of a 20-newsgroups document
Newsgroups: alt.atheism
Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uwm.edu!psuvax1!psuvm!smm125
Organization: Penn State University
Date: Fri, 23 Apr 1993 18:54:23 EDT
From:
Message-ID:
Subject: Re: YOU WILL ALL GO TO HELL!!!
References:
Lines: 1

jsn104 is jeremy scott noonan
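A posting like the one above consists of header lines, a blank line, and the body. As a rough sketch, Python's standard `email` module can separate the two; the shortened sample text and the function name `split_posting` are assumptions made for illustration.

```python
from email import message_from_string

# Shortened copy of the posting above: headers, a blank line, then the body.
RAW = """Newsgroups: alt.atheism
Organization: Penn State University
Date: Fri, 23 Apr 1993 18:54:23 EDT
Subject: Re: YOU WILL ALL GO TO HELL!!!
Lines: 1

jsn104 is jeremy scott noonan
"""

def split_posting(raw):
    """Return (source newsgroup, subject, body text) for one posting."""
    message = message_from_string(raw)
    return message["Newsgroups"], message["Subject"], message.get_payload()

newsgroup, subject, body = split_posting(RAW)
```

Since the `Newsgroups` header states the source newsgroup directly, it would normally serve only as the target label and be kept out of the classifier's input.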