Ceyhun Karbeyaz, Çağrı Toraman, Anıl Türel, Ahmet Yeniçağ. 12.05.2010. Text Categorization for Turkish News.

Text categorization: Classify text into predefined categories. Supervised learning on a labeled corpus. Used in indexing (e.g. libraries), news articles, and spam filtering.

Bilkent News Portal: Gathers news from different news providers, making news more accessible. Features: new event detection and tracking, novelty detection, duplicate elimination, personalization, and news categorization.

News categories: Aktuel, En Çok Okunanlar, Anasayfa, Spor, Politika, Çevre, Tüm Haberler, Son Dakika.

Motivation: News is categorized from RSS feeds, giving 24 good categories and a few bad categories. Bad categories: Anasayfa, EnCokOkunanlar, Gundem, SonDakika, Tum_Haberler, Yazarlar. Good categories: Aktuel, Avrupa_Futbol, BilimTeknoloji, Bilisim, Cevre, DisHaberler, Dunya, Ege, Egitim, Ekonomi, Formula1, Hava_Yol, Ispanya, Italya, KulturSanat, Politika, Saglik, Siyaset, Spor, Televizyon, Turkiye, Yasam, Yazarlar, YurtHaberler.

Data Set: Categories are skewed (not evenly distributed).

Approach: Classifiers: K Nearest Neighbour (kNN) and Support Vector Machines (SVM), found to be the best [1]. Training set: news with good categories. Test set: news with good categories (for evaluation). Evaluation: test with already categorized news.

Methodology: Cleaning noise, preprocessing, indexing, document classification.

Cleaning Noise: News documents coming from different RSS feeds generally contain noise such as advertisements, hypertexts, etc. Noise increases the similarity between documents that contain the same or similar noise. This decreases the performance of systems such as Bilkent News Portal, which rely on similarity between documents.

Cleaning Noise (cont.): Noise such as hypertext is easy to clean by removing the sentences that contain it, because its pattern does not change across news documents coming from different RSS feeds. E.g. hypertexts, which contain links to other documents, are defined as “ ”.

Cleaning Noise (cont.): Each RSS feed attaches specific advertisements to its news documents, so there is no general pattern for all news documents. After a while, even the same RSS feed changes the advertisement it attaches to its news documents.

Cleaning Noise (cont.): Compare two consecutive news documents from the same RSS feed sentence by sentence (each sentence is compared with every sentence of the consecutive news document). Calculate the similarity between each pair of sentences using cosine similarity.
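
The sentence-by-sentence comparison could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, the raw term counts, the 0.9 threshold, and the function names are all assumptions. The idea is that a sentence repeated almost verbatim in the next document from the same feed is likely an attached advertisement.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sentences, using raw term counts as vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shared_sentences(doc_a, doc_b, threshold=0.9):
    """Sentences of doc_a that closely match some sentence of doc_b.

    doc_a and doc_b are lists of sentences from two consecutive documents
    of the same feed. The threshold is a hypothetical value.
    """
    return [s for s in doc_a
            if any(cosine_similarity(s, t) >= threshold for t in doc_b)]


doc_a = ["Galatasaray won the match", "Click here for cheap flights"]
doc_b = ["Inflation rose in April", "Click here for cheap flights"]
noise = shared_sentences(doc_a, doc_b)
```

Here `noise` holds the candidate noise sentences, which could then be dropped from the document before indexing.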

Preprocessing: Stemming: the Zemberek API is used. Stop word list comparison: frequently occurring words are not taken into consideration.
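
A rough sketch of this preprocessing step is given below. Zemberek is a Java library, so its API is not called here; the `stem` argument is a hypothetical stand-in for whatever stemmer is plugged in, and the three-word stop list is only an illustrative assumption. Filtering stop words before stemming is also a choice made for the sketch.

```python
def preprocess(text, stop_words, stem=lambda word: word):
    """Drop stop words from whitespace-separated text, then stem the rest.

    `stem` is a placeholder for a real stemmer (e.g. a wrapper around
    Zemberek); by default it returns each word unchanged.
    """
    tokens = text.split()
    return [stem(t) for t in tokens if t not in stop_words]


# Hypothetical, very small stop word list for illustration only.
STOP_WORDS = {"bu", "bir", "ve"}

terms = preprocess("bu bir spor haberi ve analiz", STOP_WORDS)
```

The sketch does not lowercase the text: Python's `str.lower()` is not locale-aware, so Turkish dotted and dotless I would need separate handling.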

Document Indexing: Creating a vector space model from index terms plus feature selection may be a costly process. Consistency with Bilkent News Portal. Lemur [2] for the document indexing operation. Lemur can only index predefined formats: TREC text.
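
Since Lemur expects a predefined input format, each news document has to be wrapped before indexing. The sketch below assumes the commonly used TREC text layout (`DOC`, `DOCNO`, and `TEXT` tags); the function name and the sample document identifier are hypothetical, and the exact layout should be checked against the Lemur documentation.

```python
def to_trec_text(doc_id, text):
    """Wrap one news document in the assumed TREC text layout."""
    return (
        "<DOC>\n"
        f"<DOCNO> {doc_id} </DOCNO>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


# Hypothetical document identifier and content.
record = to_trec_text("news-0001", "spor haberi analiz")
```

Records for all documents would be concatenated into one or more files and passed to the indexer.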

Document Classification: Two different approaches, to assess which one performs better: K-Nearest Neighbor and Support Vector Machines.

K Nearest Neighbor: Given training data D (categorized news in our case), the goal is to assign a test point X (news with unknown category in our case) the label of its closest neighbors in D. As the distance function for finding the k nearest news, we again used Lemur. Lemur can also retrieve documents according to a score calculation, and it is quite fast at retrieving the k most similar documents.
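
The labeling step could look roughly like the sketch below. It assumes the retrieval engine has already returned the training documents ranked by similarity score, and it uses a simple majority vote over the top k; the slides do not say how the neighbors' labels are combined, so the vote and the function name are assumptions.

```python
from collections import Counter


def knn_label(ranked_neighbors, k):
    """Majority category among the k best-scoring neighbors.

    ranked_neighbors: list of (category, score) pairs, sorted by
    descending score, as a retrieval engine would return them.
    On a tie, the category that appears first in the ranking wins.
    """
    votes = Counter(category for category, _ in ranked_neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical retrieval result for one uncategorized news document.
neighbors = [("Spor", 0.9), ("Ekonomi", 0.8), ("Spor", 0.7), ("Politika", 0.6)]
predicted = knn_label(neighbors, k=3)
```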

Support Vector Machines: Support Vector Machines (SVMs) are applied to various problems. Data lies in a k-dimensional space. Find a hyperplane (i.e. a subset with k-1 dimensions). Several hyperplanes are possible.

Support Vector Machines (cont.): Figure 2. Possible hyperplanes in a sample space. (Figure labels: margin of u2, support vectors.)

Support Vector Machines (cont.): Aim: find a hyperplane that classifies correctly with maximal margin. Only the support vectors are effective. Represent a hyperplane:
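
The formula from this slide is not preserved in the transcript. For orientation, the standard textbook form of a separating hyperplane and its margin is shown below; the symbols (weight vector w, bias b, labels y_i in {+1, -1}) are the usual convention and may differ from the notation the authors used.

```latex
% Hyperplane, correct-classification constraint, and margin width
\mathbf{w} \cdot \mathbf{x} + b = 0,
\qquad
y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \quad \text{for all } i,
\qquad
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
```

Maximizing the margin is then equivalent to minimizing the norm of w subject to the constraint, and the training points that meet the constraint with equality are the support vectors.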

Support Vector Machines (cont.): Figure 3. A sample linear SVM.

Support Vector Machines (cont.): Figure 4. A sample non-linear SVM. Figure 5. Mapping with kernel function.
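
The idea behind mapping with a kernel function can be illustrated with a small numeric check. The slides do not say which kernel was used; the degree-2 polynomial kernel and the explicit feature map below are chosen only as a familiar example. The kernel value computed in the original 2-dimensional space equals the dot product in the mapped 3-dimensional space, so the mapping never has to be computed explicitly.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2 for 2-d inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) into 3-d space."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, 4.0)
in_input_space = poly2_kernel(x, z)
in_feature_space = dot(feature_map(x), feature_map(z))
```

Both values agree up to floating-point rounding, which is what lets an SVM learn a non-linear boundary while still solving a linear problem in the mapped space.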

Conclusion: The necessity of Turkish news categorization is covered. Bilkent News Portal: RSS feeds may lack category information or have unrealistic categories such as Last Minute, Main Page, Agenda, etc. A categorization methodology for Turkish news is proposed. Finding the correct category is done by both kNN (base classifier) and SVM, to evaluate which one performs better. kNN over 100 experiments: 60% success.

References: [1] Yang, Y. and Liu, X. A re-examination of text categorization methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999. [2]

Questions? Thank you for listening…