IR Homework #2
By J. H. Wang
May 9, 2014

Programming Exercise #2: Text Classification
Goal: to classify each document into predefined categories
Input: Reuters test collection
–predefined categories
–labeled documents for training
–test documents for testing
Output: a classifier for each category

Input: Training and Test Sets
Using the Reuters-21578 collection
–Available at: ml ml
–21,578 news articles from 1987 (28.0 MB uncompressed)
–Distributed in 22 files in SGML format, so the SGML tags must be preprocessed
–File format: EADME.txt EADME.txt
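The SGML preprocessing mentioned above can be sketched with regular expressions. This is only a minimal illustration, assuming the tag names REUTERS, TOPICS, D and BODY and the attributes TOPICS, LEWISSPLIT and NEWID as they are commonly described for this collection; the SAMPLE string is a hypothetical stand-in, not real corpus data, and parse_articles is an illustrative helper name.

```python
import re

# Hypothetical two-article sample in the style of the SGML files (not real data).
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT><TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TEST" NEWID="2">
<TOPICS></TOPICS>
<TEXT><BODY>No topics here.</BODY></TEXT>
</REUTERS>"""

ARTICLE_RE = re.compile(r"<REUTERS([^>]*)>(.*?)</REUTERS>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_articles(sgml):
    """Return one dict per article: tag attributes, topic list, body text."""
    articles = []
    for attrs, inner in ARTICLE_RE.findall(sgml):
        topics_match = TOPICS_RE.search(inner)
        body_match = BODY_RE.search(inner)
        articles.append({
            "attrs": dict(ATTR_RE.findall(attrs)),
            "topics": TOPIC_RE.findall(topics_match.group(1)) if topics_match else [],
            "body": body_match.group(1).strip() if body_match else "",
        })
    return articles


if __name__ == "__main__":
    for article in parse_articles(SAMPLE):
        print(article["attrs"].get("NEWID"), article["topics"], article["body"])
```

Regular expressions are enough for a sketch like this; a real run would also need to handle character entities and articles without a body.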

Predefined Categories in Reuters-21578
5 category sets
–Exchanges: 39 categories
–Orgs: 56 categories
–People: 267 categories
–Places: 175 categories
–Topics: 135 categories
In this homework, ONLY the 135 topical categories are considered in classification
10 largest classes
–Earn, acquisitions, money-fx, grain, crude, trade, interest, ship, wheat, corn

Training and Test Sets
Using Reuters for text classification
–Modified Lewis (ModLewis) Split
  Training: 13,625
  Test: 6,188
–Modified Apte (ModApte) Split: used in this homework
  Training: 9,603
  Test: 3,299
–Modified Hayes (ModHayes) Split
  Training: 20,856
  Test: 722
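One way to select the ModApte split is sketched below. It assumes, following the usual description in the collection's README rather than anything stated on these slides, that ModApte training documents carry LEWISSPLIT="TRAIN" with TOPICS="YES", that test documents carry LEWISSPLIT="TEST" with TOPICS="YES", and that everything else is unused. The article dicts and the modapte_split name are hypothetical.

```python
def modapte_split(articles):
    """Split parsed articles into (train, test) under the assumed ModApte rule."""
    train, test = [], []
    for article in articles:
        attrs = article["attrs"]
        if attrs.get("TOPICS") != "YES":
            continue  # assumed: articles without TOPICS="YES" are unused
        if attrs.get("LEWISSPLIT") == "TRAIN":
            train.append(article)
        elif attrs.get("LEWISSPLIT") == "TEST":
            test.append(article)
    return train, test


if __name__ == "__main__":
    # Hypothetical parsed articles; only the attributes matter here.
    toy = [
        {"attrs": {"TOPICS": "YES", "LEWISSPLIT": "TRAIN", "NEWID": "1"}},
        {"attrs": {"TOPICS": "YES", "LEWISSPLIT": "TEST", "NEWID": "2"}},
        {"attrs": {"TOPICS": "NO", "LEWISSPLIT": "TRAIN", "NEWID": "3"}},
        {"attrs": {"TOPICS": "YES", "LEWISSPLIT": "NOT-USED", "NEWID": "4"}},
    ]
    train, test = modapte_split(toy)
    print(len(train), len(test))
```

If the rule is applied correctly to the full collection, the counts should match the 9,603 training and 3,299 test documents listed above, which makes a convenient sanity check.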

An Example Reuters Article
26-FEB :01:01.79 cocoa el-salvador usa uruguay …
BAHIA COCOA REVIEW
SALVADOR, Feb 26 - Showers continued throughout the week in …
Annotated parts: training set in ModApte split, topical category, text content

Output: A Classifier
Either write your own programs or use open source tools to implement any one of the following text classification methods:
–Naïve Bayes (NB) classification (Ch.13)
–Rocchio classification (Ch.14)
–kNN classification (Ch.14)
–SVM classification (Ch.15)
–…

Test Document of What Class? (Sec. 14.1)
Figure: a test document among the classes Government, Science, and Arts

Rocchio Classification
Definition of centroid
  \vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)
–Where D_c is the set of all documents that belong to class c and \vec{v}(d) is the vector space representation of d.
Assign each test document to the category with the closest prototype vector (centroid), based on cosine similarity
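The centroid definition and the cosine-based assignment can be sketched as follows. This is a minimal illustration that assumes documents are already represented as sparse term-to-weight dicts; the function names and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse vectors (term -> weight dicts)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train_rocchio(labeled_vectors):
    """labeled_vectors: list of (vector, class) pairs -> {class: centroid}."""
    by_class = defaultdict(list)
    for vec, label in labeled_vectors:
        by_class[label].append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}


def classify(centroids, vec):
    """Return the class whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


if __name__ == "__main__":
    training = [
        ({"wheat": 1.0, "grain": 1.0}, "grain"),
        ({"grain": 2.0}, "grain"),
        ({"oil": 1.0, "crude": 1.0}, "crude"),
    ]
    centroids = train_rocchio(training)
    print(classify(centroids, {"wheat": 1.0}))
```

Because cosine similarity ignores vector length, summing the class vectors instead of averaging them yields the same class assignments.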

Tasks and Evaluations
Your system should be able to complete the following tasks using the ModApte split of the Reuters dataset
–Training
–Testing
Evaluation of your system
–Training: efficiency
–Testing: precision/recall/F-measure
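Per-category precision, recall, and F-measure can be computed as in the sketch below. It assumes the classifier outputs one label per test document while the answer may list several topics per document; the prf name and the toy labels are hypothetical.

```python
def prf(predicted, gold, category):
    """Precision, recall, and F1 for one category.

    predicted: list of predicted labels, one per test document
    gold: list of sets of correct topics, aligned with predicted
    """
    tp = fp = fn = 0
    for pred, answers in zip(predicted, gold):
        if pred == category and category in answers:
            tp += 1  # predicted the category, and it is correct
        elif pred == category:
            fp += 1  # predicted the category, but it is wrong
        elif category in answers:
            fn += 1  # missed a document that belongs to the category
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = ["earn", "earn", "grain", "crude"]
    gold = [{"earn"}, {"acq"}, {"grain", "wheat"}, {"earn"}]
    print(prf(predicted, gold, "earn"))
```

Averaging these per-category scores over all categories gives macro-averaged figures; pooling the tp, fp, and fn counts first gives micro-averaged ones.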

Example: Rocchio Classification
Training: training docs → SGML parsing → centroid calculation → centroids
Testing: test doc. → SGML parsing → cosine similarity with the centroids → class
Evaluation: P, R, F1

Example Steps in Rocchio Classification
1. Parse the SGML documents in the Reuters dataset. Extract the text body and topics, and separate the documents into training and test sets.
–Body as content, topics as class
2. For each document, calculate the TF-IDF weights from the text body as a vector.
3. For each topic class, calculate the centroid by summing the vectors of all training documents in that class.
–So you will get 135 centroids, one for each topic class.
4. For each test document, find the most similar centroid using cosine similarity, and assign its class to the document.
5. Compare the assigned class with the answer (in the topics tag), and evaluate how many test documents are correctly classified.
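Step 2 can be sketched as below. The slides do not fix a particular TF-IDF variant, so this assumes logarithmic term frequency, (1 + log10 tf), multiplied by idf = log10(N / df); the tokenizer, the function names, and the toy documents are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vectors, idf) for a list of document strings."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log10(len(docs) / df) for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (1 + math.log10(count)) * idf[term]
            for term, count in term_freq.items()
        })
    return vectors, idf


if __name__ == "__main__":
    docs = ["wheat wheat corn", "corn prices", "oil prices"]
    vectors, idf = tfidf_vectors(docs)
    print(vectors[0])
```

In a full pipeline the idf values would normally be computed on the training documents only and then reused to weight the test documents.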

Optional Functionalities
Feature selection: (Sec. 13.5)
–mutual information
–chi-square
–…
User Interface
–For selecting test documents
Visualization of classification result
…
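Chi-square feature selection can be sketched from a 2x2 table of document counts per term and class. The sketch assumes the common closed form chi^2 = N (N11 N00 - N10 N01)^2 / ((N11+N01)(N11+N10)(N10+N00)(N01+N00)), where N11 counts documents that contain the term and are in the class, N10 contain the term but are not in the class, N01 lack the term but are in the class, and N00 neither. The function names and counts are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term for a class from a 2x2 document-count table."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


def select_features(term_counts, k):
    """Return the k terms with the highest chi-square scores.

    term_counts: {term: (n11, n10, n01, n00)} for one class
    """
    ranked = sorted(term_counts, key=lambda t: chi_square(*term_counts[t]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical counts for one class: "wheat" tracks the class, "the" does not.
    counts = {"wheat": (20, 0, 0, 20), "the": (10, 10, 10, 10)}
    print(select_features(counts, 1))
```

A term that is independent of the class scores 0, so it drops to the bottom of the ranking.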

Submission
Your submission *should* include
–The source code (and your executable file)
–A complete user manual (or a UI) for testing
–A one-page description that includes the following
  Major features in your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
  Major difficulties encountered
  Special requirements for execution environments (ex: Java Runtime Environment, special compilers, …)
  For team work, the names and responsible parts of each individual member should be clearly identified
Due: extended to three weeks (May 30, 2014)

Submission Instructions
Programs or homework in electronic files must be submitted directly on the submission site:
–Submission site:
  FTP server: localhost
  User name & password: your student ID
–Preparing your submission file: as one single compressed file
  Remember to specify the names and student IDs of your team members in the files and documentation
–If you cannot successfully submit your work, please contact the TA (R1424, Technology Building)

Evaluation
Two options:
–Your system automatically classifies all test documents, and displays the classification results and the effectiveness (precision, recall, F-measure, accuracy)
  The preferred option
–Your system can randomly select some test documents (by their IDs), and run your classifier to show the classification result (both your classifier output and the answer)
Minimum requirement
–Training and testing phases can be successfully completed
–Optional features will be considered as bonus
  E.g. feature selection, UI, visualization, …
You might be required to give a demo if the TA is unable to run the submitted classifier

Any Questions or Comments?