Project 1: Text Classification by Neural Networks


1 Project 1: Text Classification by Neural Networks
Ver 1.1

2 Outline
Classification using ANN
Learn and classify text documents
Estimate several statistics on the dataset
(C) 2006, SNU Biointelligence Laboratory

3 Network Structure
[Figure: feed-forward network, with an input layer feeding three output units labeled Class 1, Class 2, and Class 3]
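The diagram sketches a feed-forward network with an input layer and one output unit per class. A minimal forward-pass sketch of such a network (one hidden layer with sigmoid units, softmax over the three class scores; the layer sizes and weights below are hypothetical toy values, not from the slides):

```python
import math

def mlp_forward(x, w_hidden, w_out):
    """Forward pass: input -> sigmoid hidden layer -> softmax over 3 class scores."""
    hidden = [1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(row, x))))
              for row in w_hidden]
    scores = [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w_out]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]     # probabilities for Class 1..3

# Toy example: 2 inputs, 2 hidden units, 3 classes (all weights hypothetical)
probs = mlp_forward([0.5, -1.0],
                    w_hidden=[[0.1, 0.2], [-0.3, 0.4]],
                    w_out=[[0.5, -0.5], [0.2, 0.2], [-0.1, 0.3]])
```

The predicted class is simply the index of the largest probability.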

4 CLASSIC3 Dataset

5 CLASSIC3
Three categories, 3,891 documents in total:
CISI: 1,460 document abstracts on information retrieval from the Institute of Scientific Information
CRAN: 1,398 document abstracts on aeronautics from the Cranfield Institute of Technology
MED: 1,033 biomedical abstracts from MEDLINE

6 Text Representation in Vector Space
[Figure: preprocessing pipeline. A document collection goes through stemming, stop-word elimination, and feature selection to produce a vector space model (VSM) / bag-of-words representation: a term vector per document (d1, d2, d3, ..., dn), assembled into a term-document matrix. Example terms: baseball, specs, graphics, hockey, unix, space.]
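The pipeline in the figure can be sketched end to end. A minimal bag-of-words construction (the toy documents and stop-word list are hypothetical; stemming is omitted for brevity):

```python
# Tokenize, drop stop-words, then build a term-document matrix (terms x documents).
docs = ["the pitcher threw the baseball",
        "unix graphics specs",
        "hockey and baseball scores"]
stop_words = {"the", "and"}  # toy stop-word list

tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})          # the term vector
matrix = [[doc.count(term) for doc in tokenized] for term in vocab]
```

Each column of `matrix` is one document's frequency vector over the vocabulary, i.e. its bag-of-words representation.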

7 Dimensionality Reduction
Feature selection: apply a scoring measure to each individual term (feature), sort the terms by score, and keep the terms with the highest values; the documents, represented in the reduced vector space, are then fed to the ML algorithm.
Term weighting: TF or TF x IDF
TF: term frequency
IDF: inverse document frequency, IDF_i = log(N / n_i)
N: number of documents
n_i: number of documents that contain the i-th word
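The TF x IDF weighting above can be sketched directly from the definitions (IDF_i = log(N / n_i); the input format and example counts are hypothetical):

```python
import math

def tf_idf(term_doc_counts):
    """term_doc_counts: dict mapping term -> list of raw counts per document.
    Returns dict mapping term -> list of TF x IDF weights, IDF = log(N / n_i)."""
    n_docs = len(next(iter(term_doc_counts.values())))
    weights = {}
    for term, counts in term_doc_counts.items():
        n_i = sum(1 for c in counts if c > 0)   # documents containing the term
        idf = math.log(n_docs / n_i) if n_i else 0.0
        weights[term] = [c * idf for c in counts]
    return weights

# Toy counts over 3 documents: a term in every document gets IDF = log(1) = 0
w = tf_idf({"baseball": [1, 0, 2], "unix": [1, 1, 1]})
```

Note how a term appearing in every document is weighted to zero, which is exactly the discriminative effect IDF is meant to provide.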

8 Construction of Document Vectors
Controlled vocabulary:
Stop-words are removed.
Stemming is used.
Words whose document frequency is less than 5 are removed.  Term size: 3,850
A document is represented as a 3,850-dimensional vector whose elements are word frequencies.
Words are sorted by their information gain values.  Top 100 terms are selected  3,830 (examples) x 100 (terms) matrix
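Ranking terms by information gain, as described above, scores each term by how much knowing its presence reduces uncertainty about the class label. A minimal sketch for a binary term-presence feature (the toy labels and presence vector are hypothetical):

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(presence, labels):
    """IG(T) = H(C) - P(t) H(C|t) - P(not t) H(C|not t) for a binary feature."""
    with_t = [c for p, c in zip(presence, labels) if p]
    without_t = [c for p, c in zip(presence, labels) if not p]
    n = len(labels)
    cond = (len(with_t) / n) * entropy(with_t) \
         + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - cond

# Toy data: a term that appears exactly in the MED documents splits the
# classes perfectly, so its information gain equals the full class entropy.
labels = ["CISI", "CISI", "MED", "MED"]
ig = info_gain([0, 0, 1, 1], labels)
```

Computing this score per term, sorting, and keeping the top 100 yields the 3,830 x 100 matrix described on the slide.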

9 Experimental Results

10 Data Setting for the Experiments
Basically, training and test set are given. Training : 2,683 examples Test : 1,147 examples N-fold cross-validation (Optional) Dataset is divided into N subsets. The holdout method is repeated N times. Each time, one of the N subsets is used as the test set and the other (N-1) subsets are put together to form a training set. The average performance across all N trials is computed. (C) 2006, SNU Biointelligence Laboratory
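The N-fold procedure described above can be sketched as a generator of (train, test) partitions (the round-robin fold assignment is one simple choice; shuffling or stratifying first is also common):

```python
def n_fold_splits(examples, n):
    """Yield (train, test) pairs for N-fold cross-validation: each fold is the
    test set exactly once; the remaining N-1 folds form the training set."""
    folds = [examples[i::n] for i in range(n)]   # round-robin fold assignment
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Toy usage: 10 examples, 5 folds -> five 8/2 train/test splits
splits = list(n_fold_splits(list(range(10)), n=5))
```

Averaging the model's test accuracy over the N splits gives the cross-validated performance estimate the slide calls for.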

11 Number of Epochs

12 Number of Hidden Units
Minimum 10 runs for each setting.
[Table: results by number of hidden units; columns: # Hidden Units | Train | Test (Average ± SD, Best, Worst); rows: Setting 1, Setting 2, Setting 3]


14 Other Methods/Parameters
Normalization method for input vectors
Class decision policy
Learning rates
...
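As one example of the first item, input vectors are often scaled to unit length before training. A minimal sketch of L2 normalization (one common choice; the slides do not prescribe a specific method):

```python
import math

def l2_normalize(vec):
    """Scale a document vector to unit Euclidean length; zero vectors pass through."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

v = l2_normalize([3.0, 4.0])  # vector of length 5 -> unit vector
```

Normalizing removes document-length effects, so the network compares term distributions rather than raw counts.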

15 ANN Sources
Source codes
Free software: Weka
NN libraries (C, C++, Java, ...)
MATLAB toolbox
Web sites

16 Submission
Due date: October 12 (Thu)
Both 'hardcopy' and ' '
Used software and running environments
Experimental results with various parameter settings
Analysis and explanation of the results in your own way
FYI, it is not important to achieve the best performance.

