Published by Ethelbert Mitchell. Modified over 9 years ago.
1
IR Homework #3 By J. H. Wang May 10, 2012
2
Programming Exercise #3: Text Classification
Goal: to classify each document into predefined categories
Input: Reuters-21578 test collection
 – predefined categories
 – labeled documents for training
 – test documents for testing
Output: a classifier for each category
3
Input: Training and Test Sets
Using the Reuters-21578 collection
 – Available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/
 – 21,578 news articles from 1987 (28.0 MB uncompressed)
 – Distributed in 22 files in SGML format (requires preprocessing of SGML tags)
 – File format: http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt
4
Predefined Categories in Reuters-21578
5 category sets
 – Exchanges: 39 categories
 – Orgs: 56 categories
 – People: 267 categories
 – Places: 175 categories
 – Topics: 135 categories
10 largest classes
 – earn, acquisitions (acq), money-fx, grain, crude, trade, interest, ship, wheat, corn
5
Training and Test Sets
Using Reuters-21578 for text classification
 – Modified Lewis (ModLewis) split: 13,625 training documents, 6,188 test documents
 – Modified Apte (ModApte) split: 9,603 training documents, 3,299 test documents
 – Modified Hayes (ModHayes) split: 20,856 training documents, 722 test documents
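The SGML preprocessing and the ModApte split could be sketched as follows. This is a minimal illustration, not a required implementation: it assumes the tag and attribute names described in the collection's readme (`<REUTERS>`, `LEWISSPLIT`, `TOPICS`, `<D>`, `<BODY>`), uses regular expressions instead of a full SGML parser, and runs on a made-up three-document `SAMPLE` string. The function names `parse_reuters` and `modapte_split` are hypothetical.

```python
import re

# Patterns for the Reuters-21578 markup (names assumed from the readme).
REUTERS_RE = re.compile(r"<REUTERS([^>]*)>(.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_reuters(sgml_text):
    """Yield one dict per <REUTERS> element: attributes, topic labels, body."""
    for attr_text, content in REUTERS_RE.findall(sgml_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        topics_match = TOPICS_RE.search(content)
        topics = D_RE.findall(topics_match.group(1)) if topics_match else []
        body_match = BODY_RE.search(content)
        body = body_match.group(1).strip() if body_match else ""
        yield {"attrs": attrs, "topics": topics, "body": body}


def modapte_split(docs):
    """ModApte: keep TOPICS="YES" documents, split on the LEWISSPLIT attribute."""
    train, test = [], []
    for doc in docs:
        attrs = doc["attrs"]
        if attrs.get("TOPICS") != "YES":
            continue
        if attrs.get("LEWISSPLIT") == "TRAIN":
            train.append(doc)
        elif attrs.get("LEWISSPLIT") == "TEST":
            test.append(doc)
    return train, test


# Made-up sample in the style of the collection (not real Reuters data).
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT><BODY>Cocoa output rose.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="YES" LEWISSPLIT="TEST" NEWID="2">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT><BODY>Wheat exports fell.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" NEWID="3">
<TOPICS></TOPICS>
<TEXT><BODY>Unlabeled story.</BODY></TEXT>
</REUTERS>"""

train_docs, test_docs = modapte_split(parse_reuters(SAMPLE))
print(len(train_docs), len(test_docs))
```

On the full collection, the same two functions would be applied to the concatenated contents of the 22 SGML files; the counts should then be checked against the split sizes listed above.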
6
Output: A Classifier
Either your own program(s) or open-source tools
 – Naïve Bayes (NB) classification (Ch.13)
 – Rocchio classification (Ch.14)
 – kNN classification (Ch.14)
 – SVM classification (Ch.15)
 – …
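As one possible starting point for the first option, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing. It assumes documents are already tokenized into lists of terms and that each training document has a single label; the function names `train_nb` and `classify_nb` and the toy data are illustrative only.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns log priors, log likelihoods, vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    priors = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    cond = {}
    for c in class_docs:
        # Add-one (Laplace) smoothing over the whole vocabulary.
        total = sum(term_counts[c].values()) + len(vocab)
        cond[c] = {t: math.log((term_counts[c][t] + 1) / total) for t in vocab}
    return priors, cond, vocab


def classify_nb(model, tokens):
    """Return the class with the highest log posterior; unseen terms are ignored."""
    priors, cond, vocab = model
    scores = {
        c: priors[c] + sum(cond[c][t] for t in tokens if t in vocab)
        for c in priors
    }
    return max(scores, key=scores.get)


# Toy training data (not taken from Reuters-21578).
toy_train = [
    ("Chinese Beijing Chinese".split(), "china"),
    ("Chinese Chinese Shanghai".split(), "china"),
    ("Chinese Macao".split(), "china"),
    ("Tokyo Japan Chinese".split(), "other"),
]
nb_model = train_nb(toy_train)
print(classify_nb(nb_model, "Chinese Chinese Chinese Tokyo Japan".split()))
```

Reuters-21578 documents can carry several topic labels, so for the assignment a separate binary (in class / not in class) model per category is one reasonable way to match the "classifier for each category" output.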
7
Test document of what class? [Figure: documents from the Government, Science, and Arts classes and an unlabeled test document] (Sec. 14.1)
8
Rocchio Classification
Definition of centroid
$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$
 – where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$.
Assign each test document to the category with the closest prototype vector (centroid), based on cosine similarity
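The centroid definition and the nearest-prototype rule can be sketched as below. This is a simplified illustration under stated assumptions: documents are plain strings tokenized by whitespace, vectors are length-normalized term frequencies stored as dicts (a real submission would more likely use tf-idf weights), and the names `tf_vector`, `train_rocchio`, `cosine`, and `classify_rocchio` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Length-normalized term-frequency vector as a dict (toy whitespace tokenizer)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def train_rocchio(labeled_docs):
    """Centroid per class: the average of the vectors of the documents in that class."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for text, label in labeled_docs:
        sizes[label] += 1
        for term, weight in tf_vector(text).items():
            sums[label][term] += weight
    return {c: {t: w / sizes[c] for t, w in vec.items()} for c, vec in sums.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_rocchio(centroids, text):
    """Assign the document to the class whose centroid is most similar."""
    vec = tf_vector(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# Toy training data (invented for illustration).
toy_docs = [
    ("election vote parliament", "government"),
    ("senate vote law", "government"),
    ("physics experiment atom", "science"),
    ("biology cell experiment", "science"),
    ("painting gallery sculpture", "arts"),
    ("music opera gallery", "arts"),
]
centroids = train_rocchio(toy_docs)
print(classify_rocchio(centroids, "new law vote"))
```

Training is a single pass over the labeled documents, and classification costs one similarity computation per class, which is why Rocchio is a convenient baseline before trying kNN or SVM.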
9
Evaluation of Classification Results
Test queries randomly selected from the Reuters-21578 test set
 – Training: efficiency
 – Testing: precision/recall/F-measure
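The testing measures can be computed per category from the gold labels and the classifier's decisions. The sketch below assumes one binary decision per test document for a single category and uses the balanced F-measure (F1); the function name and the example labels are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """gold, predicted: parallel lists of booleans (in category or not) for one category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: 4 documents truly in the category, 3 predicted as in it.
gold = [True, True, True, True, False, False]
predicted = [True, True, False, False, True, False]
p, r, f1 = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Here 2 of the 3 positive predictions are correct (precision 2/3) and 2 of the 4 relevant documents are found (recall 1/2), giving F1 = 4/7.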
10
Optional Functionalities
Feature selection (Sec. 13.5)
 – mutual information
 – chi-square
 – …
User interface
 – For classifying test queries
Visualization of classification results
…
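For the chi-square option, a term can be scored against a category from a 2×2 table of document counts. The sketch below is one way to do this, assuming each document is represented as a set of terms plus a set of labels; `chi_square`, `chi_square_scores`, and the toy documents are hypothetical names and data.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 table of document counts.

    n11: contains term, in class      n10: contains term, not in class
    n01: lacks term, in class         n00: lacks term, not in class
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def chi_square_scores(docs, target):
    """docs: list of (set_of_terms, set_of_labels). Returns {term: score} for one class."""
    vocab = set().union(*(terms for terms, _ in docs))
    scores = {}
    for term in vocab:
        n11 = n10 = n01 = n00 = 0
        for terms, labels in docs:
            has_term, in_class = term in terms, target in labels
            if has_term and in_class:
                n11 += 1
            elif has_term:
                n10 += 1
            elif in_class:
                n01 += 1
            else:
                n00 += 1
        scores[term] = chi_square(n11, n10, n01, n00)
    return scores


# Toy documents (invented for illustration).
toy_docs = [
    ({"wheat", "price"}, {"grain"}),
    ({"wheat", "export"}, {"grain"}),
    ({"oil", "price"}, {"crude"}),
    ({"oil", "barrel"}, {"crude"}),
]
grain_scores = chi_square_scores(toy_docs, "grain")
print(sorted(grain_scores.items(), key=lambda kv: -kv[1]))
```

Note that the statistic is symmetric: a term whose absence signals the class (here "oil" for "grain") scores as high as one whose presence does. Keeping only the top-scoring terms per category is a common way to shrink the vocabulary before training.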
11
Submission
Your submission *should* include
 – The source code (and your executable file)
 – A complete user manual (or a UI) for testing
 – A one-page description that includes the following:
   Major features of your work (e.g., high efficiency, low storage, multiple input formats, huge corpus, …)
   Major difficulties encountered
   Special requirements for the execution environment (e.g., Java Runtime Environment, special compilers, …)
   For team work, the name of each member and the parts each member was responsible for, clearly identified
Due: two weeks (May 24, 2012)
12
Submission Instructions
Programs or homework in electronic files must be submitted directly on the submission site:
 – Submission site: http://140.124.183.39/IR/
   Username: your student ID
   Password: (Please change your default password at your first login)
 – Preparing your submission file: as one single compressed file
   Remember to specify the names and student IDs of your team members in the files and documentation
 – If you cannot successfully submit your work, please contact the TA (@ R1424, Technology Building)
13
Evaluation
Randomly selected test queries will be submitted to your classifier and checked for effectiveness (F-measure)
 – Minimum requirements
   The training and testing phases can be successfully completed
   Effectiveness for the 10 largest classes can be evaluated
Optional features will be considered as a bonus
 – Feature selection, UI, visualization, …
You may be required to give a demo if the TA is unable to run the submitted classifier
14
Any Questions or Comments?