Text Classification using SVM-light, DSSI 2008, Jing Jiang



Text Classification
Goal: to classify documents (news articles, emails, Web pages, etc.) into predefined categories
Examples:
– To classify news articles into “business” and “sports”
– To classify Web pages into personal home pages and others
– To classify product reviews into positive reviews and negative reviews
Approach: supervised machine learning
– For each predefined category, we need a set of training documents known to belong to the category.
– From the training documents, we train a classifier.

Overview
Step 1: text pre-processing
– to pre-process text and represent each document as a feature vector
Step 2: training
– to train a classifier using a classification tool (e.g. SNoW, SVM-light)
Step 3: classification
– to apply the classifier to new documents

Pre-processing: tokenization
Goal: to separate text into individual words
Example: “We’re attending a tutorial now.” → we ’re attending a tutorial now
Tool:
– Word Splitter
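
The tokenization step can be sketched in a few lines of C++. This is only an illustration, not the Word Splitter tool named on the slide: `tokenize` is a hypothetical helper that assumes ASCII input, lowercases everything, drops punctuation, and splits a clitic such as "'re" from its host word at the apostrophe.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not the Word Splitter tool): lowercase the text,
// split on whitespace and punctuation, and split clitics at the apostrophe.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    // Store the token built so far, ignoring empty tokens and lone apostrophes.
    auto flush = [&]() {
        if (!current.empty() && current != "'") tokens.push_back(current);
        current.clear();
    };
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (ch == '\'') {
            flush();        // end the host word, e.g. "we"
            current = "'";  // start the clitic, e.g. "'re"
        } else {
            flush();        // whitespace or punctuation ends the token
        }
    }
    flush();
    return tokens;
}
```

Calling `tokenize("We're attending a tutorial now.")` yields the six tokens of the slide's example, lowercased and with the final period dropped.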

Pre-processing: stop word removal (optional)
Goal: to remove common words that are usually not useful for text classification
Example: to remove words such as “a”, “the”, “I”, “he”, “she”, “is”, “are”, etc.
Stop word list:
– http:// utils/stop_words
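
Stop word removal amounts to filtering tokens against a set. The sketch below assumes the tokens are already lowercased and that the stop word list has been loaded into a `std::unordered_set`; `removeStopWords` is a hypothetical name, and the list used in the usage note is just the handful of words from the slide, not the full list it links to.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: keep only the tokens that are not in the stop word set.
std::vector<std::string> removeStopWords(
        const std::vector<std::string>& tokens,
        const std::unordered_set<std::string>& stopWords) {
    std::vector<std::string> kept;
    for (const std::string& token : tokens) {
        if (stopWords.count(token) == 0) kept.push_back(token);
    }
    return kept;
}
```

With the stop words {a, the, i, he, she, is, are}, the token sequence "she is attending a tutorial" is reduced to "attending tutorial".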

Pre-processing: stemming (optional)
Goal: to normalize words derived from the same root
Examples:
– attending → attend
– teacher → teach
Tool:
– Porter stemmer
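
To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer, which applies several ordered rule phases with conditions on the remaining stem; `naiveStem`, its suffix list, and the minimum stem length of 3 are all assumptions chosen only to reproduce the two examples above.

```cpp
#include <string>
#include <vector>

// Naive suffix stripping (NOT the Porter algorithm): remove the first matching
// suffix, provided at least minStem characters remain.
std::string naiveStem(const std::string& word) {
    static const std::vector<std::string> suffixes = {"ing", "er", "ed", "s"};
    const std::size_t minStem = 3;  // assumed lower bound to avoid over-stripping
    for (const std::string& suffix : suffixes) {
        if (word.size() >= suffix.size() + minStem &&
            word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0) {
            return word.substr(0, word.size() - suffix.size());
        }
    }
    return word;  // no suffix matched
}
```

A real system would use an existing Porter stemmer implementation; this sketch mishandles many words that the Porter rules treat correctly.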

Pre-processing: feature extraction
Unigram features: to use each word as a feature
– To use TF (term frequency) as feature value
– To use TF*IDF (term frequency times inverse document frequency) as feature value
– IDF = log (total-number-of-documents / number-of-documents-containing-t)
Bigram features: to use two consecutive words as a feature
Tool:
– Write your own program/script
– Lemur API
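
For the "write your own program" route, the feature values above can be computed as follows. This is a minimal sketch under stated assumptions: documents are already tokenized, the logarithm is natural (the slide does not specify a base), and the names `Doc`, `bigrams`, `termFrequency`, `inverseDocumentFrequency`, and `tfidf` are hypothetical.

```cpp
#include <cmath>
#include <map>
#include <set>
#include <string>
#include <vector>

using Doc = std::vector<std::string>;  // one tokenized document

// Bigram features: each pair of consecutive tokens, joined with '_'.
Doc bigrams(const Doc& tokens) {
    Doc out;
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i)
        out.push_back(tokens[i] + "_" + tokens[i + 1]);
    return out;
}

// TF: how many times each term occurs in one document.
std::map<std::string, int> termFrequency(const Doc& tokens) {
    std::map<std::string, int> tf;
    for (const std::string& t : tokens) ++tf[t];
    return tf;
}

// IDF = log(total number of documents / number of documents containing t).
std::map<std::string, double> inverseDocumentFrequency(const std::vector<Doc>& corpus) {
    std::map<std::string, int> df;  // document frequency per term
    for (const Doc& doc : corpus) {
        std::set<std::string> unique(doc.begin(), doc.end());
        for (const std::string& t : unique) ++df[t];
    }
    std::map<std::string, double> idf;
    const double n = static_cast<double>(corpus.size());
    for (const auto& kv : df) idf[kv.first] = std::log(n / kv.second);
    return idf;
}

// TF*IDF weights for one document; terms missing from the IDF table are skipped.
std::map<std::string, double> tfidf(const Doc& doc,
                                    const std::map<std::string, double>& idf) {
    std::map<std::string, double> weights;
    const std::map<std::string, int> tf = termFrequency(doc);
    for (const auto& kv : tf) {
        auto it = idf.find(kv.first);
        if (it != idf.end()) weights[kv.first] = kv.second * it->second;
    }
    return weights;
}
```

Note that a term occurring in every document gets IDF = log(1) = 0, so its TF*IDF weight vanishes. Bigram features can be weighted the same way by passing the output of `bigrams` through the same functions.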

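Using Lemur to Extract Unigram Features

    // Open an existing Lemur index.
    Index *ind = IndexManager::openIndex("index-file.key");
    int d1 = 1;  // internal ID of the document to inspect (1 is only a placeholder)
    // Iterate over the terms of document d1.
    TermInfoList *tList = ind->termInfoList(d1);
    tList->startIteration();
    while (tList->hasMore()) {
        TermInfo *entry = tList->nextEntry();
        cout << entry->termID() << endl;     // term ID (the feature index)
        cout << entry->termCount() << endl;  // term count (the TF feature value)
    }
    delete tList;
    delete ind;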

SVM (Support Vector Machines)
A learning algorithm for classification
– General for any classification problem (text classification as one example)
Binary classification
Maximizes the margin between the two different classes
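
The slide does not spell out what "maximizing the margin" means formally. As a sketch of the standard textbook formulation (the symbols below are not from the slides): given training pairs $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, a linear SVM looks for a hyperplane $w \cdot x + b = 0$. In the separable case it solves the first problem below, and the resulting margin between the two classes has width $2 / \lVert w \rVert$. The soft-margin variant adds slack variables $\xi_i$ and a trade-off constant $C$.

```latex
% Hard margin (linearly separable data)
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i

% Soft margin (C trades margin width against training errors)
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

Minimizing $\lVert w \rVert$ is equivalent to maximizing the margin $2 / \lVert w \rVert$, which is why the objective is written as a minimization.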

[Figure omitted. Picture from: rial.pdf]

SVM-light
SVM-light: a command-line C program that implements the SVM learning algorithm
– Classification, regression, ranking
– Download at:
– Documentation is on the same page
Two programs:
– svm_learn for training
– svm_classify for classification

SVM-light Examples
Input format:
1 1:0.5 3:1 5: :0.9 3:0.1 4:2
To train a classifier from train.data:
– svm_learn train.data train.model
To classify new documents in test.data:
– svm_classify test.data train.model test.result
Output format:
– Positive score → positive class
– Negative score → negative class
– Absolute value of the score indicates confidence
Command line options:
– -c: a trade-off parameter (use cross-validation to tune)
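
The input and output conventions above can be sketched in C++. The assumptions here go slightly beyond the slide: each input line is a label followed by `feature:value` pairs, feature IDs are integers starting at 1, and the pairs appear in increasing ID order (a `std::map` provides that ordering). The helpers `formatExample`, `predictedClass`, and `confidence` are hypothetical names, and treating a score of exactly 0 as positive is an arbitrary choice.

```cpp
#include <cmath>
#include <map>
#include <sstream>
#include <string>

// Build one input line: "<label> <id>:<value> <id>:<value> ...".
// std::map iterates in increasing key order; zero-valued features are
// omitted because the format is sparse.
std::string formatExample(int label, const std::map<int, double>& features) {
    std::ostringstream line;
    line << label;
    for (const auto& kv : features) {
        if (kv.second == 0.0) continue;
        line << ' ' << kv.first << ':' << kv.second;  // default stream precision
    }
    return line.str();
}

// Interpret one score read from the result file written by svm_classify.
int predictedClass(double score) { return score >= 0.0 ? 1 : -1; }  // 0 -> positive (assumption)
double confidence(double score) { return std::fabs(score); }
```

For example, `formatExample(1, {{3, 1.0}, {1, 0.5}})` returns `1 1:0.5 3:1`. The TF or TF*IDF weights from feature extraction would be the values, after mapping each term to an integer ID.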

More on SVM-light
Kernel
– Use the “-t” option
– Polynomial kernel
– User-defined kernel
Semi-supervised learning (transductive SVM)
– Use “0” as the label for unlabeled examples
– Very slow
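
To illustrate what a polynomial kernel computes, the sketch below evaluates K(a, b) = (s · (a · b) + c)^d on sparse feature vectors. The slide gives no formula, so this form and the parameter names s, c, and d are assumptions based on the usual definition; check the SVM-light documentation for the exact form and the options that set these parameters. `SparseVector`, `dot`, and `polynomialKernel` are hypothetical names.

```cpp
#include <cmath>
#include <map>

using SparseVector = std::map<int, double>;  // feature ID -> value

// Dot product over the feature IDs the two sparse vectors share.
double dot(const SparseVector& a, const SparseVector& b) {
    double sum = 0.0;
    for (const auto& kv : a) {
        auto it = b.find(kv.first);
        if (it != b.end()) sum += kv.second * it->second;
    }
    return sum;
}

// Polynomial kernel (assumed form): K(a, b) = (s * (a . b) + c)^d.
double polynomialKernel(const SparseVector& a, const SparseVector& b,
                        double s, double c, int d) {
    return std::pow(s * dot(a, b) + c, d);
}
```

With s = 1, c = 0, and d = 1 this reduces to the plain dot product, that is, the linear kernel.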