IR Homework #2 By J. H. Wang, May 9, 2014. Programming Exercise #2: Text Classification


1 IR Homework #2 By J. H. Wang May 9, 2014

2 Programming Exercise #2: Text Classification
Goal: to classify each document into predefined categories
Input: Reuters-21578 test collection
– predefined categories
– labeled documents for training
– test documents for testing
Output: a classifier for each category

3 Input: Training and Test Sets
Using Reuters-21578 collection
– Available at: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
21,578 news articles in 1987 (28.0MB uncompressed)
– Distributed in 22 files in SGML format
– preprocessing of SGML tags
File format: http://kdd.ics.uci.edu/databases/reuters21578/README.txt

4 Predefined Categories in Reuters-21578
5 category sets
– Exchanges: 39 categories
– Orgs: 56 categories
– People: 267 categories
– Places: 175 categories
– Topics: 135 categories
In this homework, ONLY the 135 topical categories are considered in classification
10 largest classes
– earn, acquisitions, money-fx, grain, crude, trade, interest, ship, wheat, corn

5 Training and Test Sets
Using Reuters-21578 for text classification
– Modified Lewis (ModLewis) Split: Training: 13,625; Test: 6,188
– Modified Apte (ModApte) Split, used in this homework: Training: 9,603; Test: 3,299
– Modified Hayes (ModHayes) Split: Training: 20,856; Test: 722

6 An Example Reuters Article
Date: 26-FEB-1987 15:01:01.79
Topics: cocoa
Places: el-salvador, usa, uruguay
…
Title: BAHIA COCOA REVIEW
Body: SALVADOR, Feb 26 - Showers continued throughout the week in …
Callouts on the slide: Training set in ModApte split; Topical category; Text content
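
A minimal sketch of how such an article could be pulled out of the SGML files, assuming the markup described in the collection's README.txt (a `<REUTERS>` element carrying `TOPICS`, `LEWISSPLIT`, and `NEWID` attributes, with `<TOPICS>` and `<BODY>` children) and the README's ModApte rule (`TOPICS="YES"` plus `LEWISSPLIT` of `TRAIN` or `TEST`). The function names and the abbreviated `SAMPLE` string are illustrative, not part of the assignment.

```python
import re

# Patterns for the SGML markup described in the collection's README.txt.
DOC_RE = re.compile(r"<REUTERS\b([^>]*)>(.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def modapte_split(attrs):
    """Return 'train', 'test', or None (unused) under the ModApte split."""
    if attrs.get("TOPICS") != "YES":
        return None
    if attrs.get("LEWISSPLIT") == "TRAIN":
        return "train"
    if attrs.get("LEWISSPLIT") == "TEST":
        return "test"
    return None


def parse_reuters(sgml_text):
    """Yield one dict per <REUTERS> article found in sgml_text."""
    for attr_text, inner in DOC_RE.findall(sgml_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        topics_block = TOPICS_RE.search(inner)
        body = BODY_RE.search(inner)
        yield {
            "id": attrs.get("NEWID"),
            "split": modapte_split(attrs),
            # Topics become the class labels; the body becomes the content.
            "topics": D_RE.findall(topics_block.group(1)) if topics_block else [],
            "body": body.group(1).strip() if body else "",
        }


# Hand-written, abbreviated sample modeled on the article above (illustrative only).
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week in ...</BODY>
</TEXT>
</REUTERS>"""

for doc in parse_reuters(SAMPLE):
    print(doc["id"], doc["split"], doc["topics"])
```

Regular expressions are used here because the files are SGML rather than well-formed XML; an SGML-tolerant parser would be an equally reasonable choice. When reading the real files, it may be safest to assume they are not valid UTF-8 and open them with `encoding="latin-1"`.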

7 Output: A Classifier
Either write your own programs or use open source tools to implement any one of the following text classification methods:
– Naïve Bayes (NB) classification (Ch.13)
– Rocchio classification (Ch.14)
– kNN classification (Ch.14)
– SVM classification (Ch.15)
– …

8 Test Document of what class? (Figure: a test document shown with the classes Government, Science, and Arts. Sec. 14.1)

9 Rocchio Classification
Definition of centroid
$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$
– where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$.
Assign test documents to the category with the closest prototype vector based on cosine similarity
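
The centroid definition and the cosine-similarity assignment rule can be sketched as follows. This is a minimal illustration, assuming each document vector is a sparse dict mapping term to weight and that a document with several topics contributes to each of its classes; the function names are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors (dicts of term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train_rocchio(labeled_vectors):
    """labeled_vectors: iterable of (vector, list_of_classes). Returns class -> centroid."""
    by_class = defaultdict(list)
    for vec, classes in labeled_vectors:
        for c in classes:
            by_class[c].append(vec)
    return {c: centroid(vecs) for c, vecs in by_class.items()}


def classify(centroids, vec):
    """Return the class whose centroid is most similar to vec."""
    return max(centroids, key=lambda c: cosine(centroids[c], vec))


# Toy vectors with made-up weights, for illustration only.
training = [
    ({"cocoa": 1.0, "price": 0.5}, ["cocoa"]),
    ({"cocoa": 0.8}, ["cocoa"]),
    ({"wheat": 1.0, "price": 0.4}, ["grain"]),
]
centroids = train_rocchio(training)
print(classify(centroids, {"wheat": 0.9}))
```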

10 Tasks and Evaluations
Your system should be able to complete the following tasks using the ModApte split of the Reuters-21578 dataset
– Training
– Testing
Evaluation of your system
– Training: efficiency
– Testing: precision/recall/F-measure
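
One possible way to compute the testing measures is sketched below. It assumes gold and predicted labels are given as one set of class names per test document (a Reuters article can carry several topics) and reports both per-class and micro-averaged scores; whether to report micro or macro averages is a choice the slides leave open.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 where undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(gold, predicted):
    """gold, predicted: lists of sets of class labels, one set per test document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        for c in p & g:
            tp[c] += 1  # predicted and correct
        for c in p - g:
            fp[c] += 1  # predicted but wrong
        for c in g - p:
            fn[c] += 1  # correct label that was missed
    classes = set(tp) | set(fp) | set(fn)
    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in classes}
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, micro


# Toy labels, for illustration only.
gold = [{"earn"}, {"grain", "wheat"}, {"acq"}]
predicted = [{"earn"}, {"grain"}, {"earn"}]
per_class, micro = evaluate(gold, predicted)
print(per_class["earn"], micro)
```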

11 Example: Rocchio Classification
(Diagram. Training: training docs, HTML parsing, centroid calculation, centroids. Testing: test doc., HTML parsing, cosine similarity against the centroids, class. Evaluation: P, R, F1.)

12 Example Steps in Rocchio Classification
1. Parse the SGML documents in the Reuters-21578 dataset. Extract the text body and topics, and separate the documents into training and test sets.
– Body as content, topics as class
2. For each document, calculate the TF-IDF weights from the text body as a vector.
3. For each topic class, calculate the centroid by summing the vectors of all training documents in that class (dividing by their number gives the mean, which does not change the cosine similarity).
– So you will get 135 centroids, one for each topic class.
4. For each test document, find the most similar centroid using cosine similarity, and assign the document to that centroid's class.
5. Compare the class with the answer (in the topics tag), and evaluate how many test documents are correctly classified.
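
Step 2 can be sketched as follows. The slides do not fix a TF-IDF variant, so this example assumes log-scaled term frequency times log inverse document frequency, with document frequencies taken from the training set only and terms unseen in training dropped; the tokenizer is deliberately crude and all names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(token_lists):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df, n_docs):
    """Sparse vector with weight (1 + log10 tf) * log10(N / df) per term."""
    tf = Counter(tokens)
    return {
        term: (1.0 + math.log10(count)) * math.log10(n_docs / df[term])
        for term, count in tf.items()
        if df.get(term, 0) > 0  # skip terms never seen in training
    }


# Toy training texts, for illustration only.
train_texts = ["cocoa prices rose", "wheat prices fell", "cocoa cocoa exports"]
train_tokens = [tokenize(t) for t in train_texts]
df = document_frequencies(train_tokens)
vec = tfidf_vector(train_tokens[2], df, len(train_tokens))
print(sorted(vec))
```

The resulting dicts can be passed directly to any centroid and cosine-similarity routine that works on sparse term-to-weight mappings.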

13 Optional Functionalities
Feature selection: (Sec. 13.5)
– mutual information
– chi-square
– …
User Interface
– For selecting test documents
Visualization of classification result
…
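
As an illustration of the chi-square option, the sketch below scores each term against one target class from a 2x2 contingency table of document counts and keeps the top k terms. It assumes documents are given as sets of tokens and labels as sets of classes; the function names are hypothetical, and mutual information could be substituted by changing only the scoring function.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 table of document counts.

    n11: term present, in class      n10: term present, not in class
    n01: term absent, in class       n00: term absent, not in class
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def top_k_terms(docs, labels, target, k):
    """Return the k terms with the highest chi-square score for class `target`."""
    in_class, out_class = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, classes in zip(docs, labels):
        if target in classes:
            n_pos += 1
            in_class.update(set(tokens))
        else:
            n_neg += 1
            out_class.update(set(tokens))
    scores = {}
    for term in set(in_class) | set(out_class):
        n11, n10 = in_class[term], out_class[term]
        scores[term] = chi_square(n11, n10, n_pos - n11, n_neg - n10)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, for illustration only.
docs = [{"cocoa", "price"}, {"cocoa", "export"}, {"wheat", "price"}, {"wheat", "crop"}]
labels = [{"cocoa"}, {"cocoa"}, {"grain"}, {"grain"}]
print(top_k_terms(docs, labels, "cocoa", 2))
```

Note that chi-square rewards any strong dependence, so terms that indicate the absence of the class (here "wheat" for the class "cocoa") score as highly as terms that indicate its presence.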

14 Submission
Your submission *should* include
– The source code (and your executable file)
– A complete user manual (or a UI) for testing
– A one-page description that includes the following
  Major features in your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
  Major difficulties encountered
  Special requirements for execution environments (ex: Java Runtime Environment, special compilers, …)
  For team work, the name and responsible part of each individual member should be clearly identified
Due: extended to three weeks (May 30, 2014)

15 Submission Instructions
Programs or homework in electronic files must be submitted directly on the submission site:
– Submission site: http://140.124.183.31/net2ftp
  FTP server: localhost
  User name & password: Your student ID
– Preparing your submission file: as one single compressed file
  Remember to specify the names and student IDs of your team members in the files and documentation
– If you cannot successfully submit your work, please contact the TA (@ R1424, Technology Building)

16 Evaluation
Two options:
– Your system automatically classifies all test documents, and displays classification results and the effectiveness (precision, recall, F-measure, accuracy)
  The preferred option
– Your system can randomly select some test documents (by their IDs), and run your classifier to show the classification result (both your classifier output and the answer)
Minimum requirement
– Training and testing phases can be successfully completed
– Optional features will be considered as bonus
  E.g. feature selection, UI, visualization, …
You might be required to give a demo if the TA is unable to run the submitted classifier

17 Any Questions or Comments?

