1 Project 1: Machine Learning Using Neural Networks Ver 1.1

2 Outline
Classification using an ANN:
- Learn and classify text documents
- Estimate several statistics on the dataset

3 Network Structure
(Figure: a feed-forward network mapping the input vector to three output units, one per class: Class 1, Class 2, and Class 3.)
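
A minimal sketch of a network of this shape, assuming a single hidden layer with sigmoid units and a softmax output; the layer sizes (100 inputs, 20 hidden units) and the NumPy implementation are illustrative choices, not part of the original project specification.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    """One hidden layer with sigmoid units, softmax output over the 3 classes."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # hidden activations
    return softmax(h @ W2 + b2)                 # class probabilities

# Illustrative sizes: 100 input terms, 20 hidden units, 3 classes (CISI, CRAN, MED).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(100, 20)), np.zeros(20)
W2, b2 = rng.normal(scale=0.1, size=(20, 3)), np.zeros(3)

x = rng.random((1, 100))           # one 100-dimensional document vector
print(forward(x, W1, b1, W2, b2))  # probabilities for Class 1..3
```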

4 CLASSIC3 Dataset

5 CLASSIC3
Three categories, 3,891 documents in total:
- CISI: 1,460 document abstracts on information retrieval from the Institute of Scientific Information
- CRAN: 1,398 document abstracts on aeronautics from the Cranfield Institute of Technology
- MED: 1,033 biomedical abstracts from MEDLINE

6 Text Representation in Vector Space
(Figure: a document collection is turned into term vectors; after stemming, stop-word elimination, and feature selection, documents d1 ... dn over terms such as baseball, specs, graphics, hockey, unix, and space form a term-document matrix. This bag-of-words / vector space model (VSM) representation is the dataset format.)
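
A small bag-of-words sketch of the conversion above, using a hypothetical toy corpus in place of the CLASSIC3 abstracts; stemming is omitted for brevity.

```python
from collections import Counter

# Toy corpus standing in for the document abstracts (hypothetical documents).
docs = [
    "information retrieval of scientific documents",
    "boundary layer flow over a wing",
    "clinical study of enzyme activity",
]
stop_words = {"of", "a", "over", "the"}

# Bag-of-words: tokenize, drop stop words, count term frequencies per document.
bags = [Counter(w for w in d.lower().split() if w not in stop_words) for d in docs]

# Vocabulary (the term axis of the term-document matrix).
vocab = sorted(set(w for bag in bags for w in bag))

# Term-document matrix: rows = terms, columns = documents d1..dn.
matrix = [[bag.get(term, 0) for bag in bags] for term in vocab]
for term, row in zip(vocab, matrix):
    print(f"{term:12s} {row}")
```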

7 Dimensionality Reduction
Feature selection: a scoring measure is applied to each individual term (feature), the terms are sorted by score, and the terms with the highest scores are kept before the reduced term vectors are passed to the ML algorithm.
Term weighting (see the sketch below): TF or TF x IDF, applied to the documents in vector space.
- TF: term frequency
- IDF: inverse document frequency, IDF_i = log(N / n_i), where N is the number of documents and n_i is the number of documents that contain the i-th word
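
A minimal TF x IDF sketch under the definitions above; the toy term-count vectors are hypothetical.

```python
import math
from collections import Counter

# Hypothetical term-count vectors, one Counter per document.
docs = [
    Counter({"retrieval": 3, "index": 1}),
    Counter({"wing": 2, "flow": 2, "index": 1}),
    Counter({"enzyme": 4, "retrieval": 1}),
]
N = len(docs)

# n_i: number of documents that contain each term (document frequency).
df = Counter()
for d in docs:
    df.update(d.keys())

def tfidf(doc):
    """TF x IDF weighting: tf(t, d) * log(N / n_t)."""
    return {t: tf * math.log(N / df[t]) for t, tf in doc.items()}

for d in docs:
    print(tfidf(d))
```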

8 Construction of Document Vectors
Controlled vocabulary:
- Stop words are removed
- Stemming is used
- Words whose document frequency is less than 5 are removed
- Term size: 3,850
A document is represented as a 3,850-dimensional vector whose elements are word frequencies.
- Terms are sorted by their information gain values (see the sketch below)
- The top 100 terms are selected
- Result: a 3,830 (examples) x 100 (terms) matrix
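
A small sketch of scoring a single term by information gain, IG(t) = H(C) - H(C | t), as used for the term ranking above; the six-document example is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum p(c) * log2 p(c) over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term_present, labels):
    """IG(term) = H(C) - H(C | term present/absent).

    term_present: list of booleans, one per document.
    labels:       list of class labels, one per document.
    """
    n = len(labels)
    ig = entropy(labels)
    for value in (True, False):
        subset = [lab for present, lab in zip(term_present, labels) if present == value]
        if subset:
            ig -= (len(subset) / n) * entropy(subset)
    return ig

# Hypothetical example: whether a term appears in each of 6 documents, and their classes.
present = [True, True, False, False, True, False]
classes = ["MED", "MED", "CISI", "CRAN", "MED", "CISI"]
print(information_gain(present, classes))
```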

9 Experimental Results

10 Data Setting for the Experiments
A fixed training/test split is given:
- Training: 2,683 examples
- Test: 1,147 examples
N-fold cross-validation (optional; see the sketch below):
- The dataset is divided into N subsets.
- The holdout method is repeated N times.
- Each time, one of the N subsets is used as the test set and the other (N-1) subsets are put together to form the training set.
- The average performance across all N trials is computed.
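
A minimal sketch of the N-fold splitting described above (no shuffling or stratification, which you may want to add); the training and evaluation steps are left as placeholders.

```python
def n_fold_splits(n_examples, n_folds):
    """Yield (train_indices, test_indices) for each of the n_folds trials."""
    indices = list(range(n_examples))
    fold_size = (n_examples + n_folds - 1) // n_folds  # ceiling division
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, test

# Example: 10-fold cross-validation over 3,830 examples.
accuracies = []
for train_idx, test_idx in n_fold_splits(3830, 10):
    # Train the network on train_idx and evaluate on test_idx (omitted here).
    accuracies.append(0.0)  # placeholder for the measured accuracy of this fold
print(sum(accuracies) / len(accuracies))  # average performance across the N trials
```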

11 Number of Epochs

12 Number of Hidden Units
Minimum 10 runs for each setting; report the results in a table like the one below (a sketch for computing the statistics follows).

# Hidden Units | Train Average ± SD | Train Best | Train Worst | Test Average ± SD | Test Best | Test Worst
Setting 1      |                    |            |             |                   |           |
Setting 2      |                    |            |             |                   |           |
Setting 3      |                    |            |             |                   |           |
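
A short sketch for filling the table above: average ± SD, best, and worst over at least 10 runs. The accuracy values here are placeholders for illustration, not measured results.

```python
import statistics

# Placeholder test-set accuracies from 10 runs of one hidden-unit setting.
runs = [0.94, 0.95, 0.93, 0.96, 0.94, 0.95, 0.92, 0.95, 0.94, 0.96]

avg = statistics.mean(runs)
sd = statistics.stdev(runs)  # sample standard deviation
best, worst = max(runs), min(runs)
print(f"Average ± SD: {avg:.3f} ± {sd:.3f}  Best: {best:.3f}  Worst: {worst:.3f}")
```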

13

14 Other Methods/Parameters
- Normalization method for input vectors (see the sketch below)
- Class decision policy
- Learning rates
- ...
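
Two small illustrative sketches for the first two items: L2 normalization of an input vector and a simple argmax class decision policy. Both are assumptions about reasonable choices, not prescribed methods.

```python
import math

def l2_normalize(vec):
    """Scale a document vector to unit Euclidean length (one possible normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def predict_class(outputs, class_names=("CISI", "CRAN", "MED")):
    """Simple class decision policy: pick the output unit with the highest activation."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return class_names[best]

print(l2_normalize([3.0, 0.0, 4.0]))     # -> [0.6, 0.0, 0.8]
print(predict_class([0.1, 0.2, 0.7]))    # -> "MED"
```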

15 ANN Sources
Source code:
- Free software
- Weka
- NN libraries (C, C++, Java, ...)
- MATLAB toolbox
Web sites:
- http://www.cs.waikato.ac.nz/~ml/weka/
- http://www.faqs.org/faqs/ai-faq/neural-nets/part5/

16 Submission
Due date: April 18 (Tue)
Submit both a hardcopy and an email, including:
- The software used and the running environment
- Experimental results with various parameter settings
- Analysis and explanation of the results in your own way
- FYI, achieving the best performance is not what matters most

