A Study of Text Categorization
Classifying Programming Newsgroup Discussions using Text Categorization Algorithms
by Lingfeng Mo
© Lingfeng Mo
What is text categorization?
• Definition
  – Classification of documents into a fixed number of predefined categories.
  – Sometimes alternatively referred to as text data mining.
What is Data Mining?
• Many definitions
  – Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks...
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Data Mining Tasks
• Prediction methods
  – Use some variables to predict unknown or future values of other variables.
• Description methods
  – Find human-interpretable patterns that describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Classification: Definition
• Given a collection of records (the training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it.
Classification Example
• Start with 12 German documents and 12 English documents.
• Randomly choose a certain portion as the training set and learn a classifier (model).
• Use the remaining documents as the test set.
Classification Example Result
Why our study?
• Programmers often seek and exchange information online about problems with using a certain library, framework, or API.
• In newsgroup discussions, titles often do not correspond to the content.
• Novices do not always know how to phrase a question precisely.
• By categorizing an ongoing discussion, such techniques could directly point the developers who ask questions to previous discussions of similar problems.
What's this study for?
• Ideal goal: automatically classify discussions into meaningful semantic categories.
• Approach:
  – Collect and save raw data
  – Import and optimize the data
  – Select a certain portion of the data to train a classifier model
  – Classify the data
  – Evaluate the results
The tool we use
• MALLET (MAchine Learning for LanguagE Toolkit) is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
Data Collecting
• Download discussions from a Java programming forum.
• Save each discussion as a text document (.txt) – Article(text, name).
• Manually put similar discussions into the same folder, so that folder names serve as labels.
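A minimal sketch of this saving step (the class name, folder layout, and example values are illustrative assumptions, not part of the study's actual tooling):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch: save one discussion as <dataRoot>/<label>/<name>.txt,
// so that the folder name later serves as the category label.
public class SaveArticle {
    public static void saveArticle(Path dataRoot, String label, String name, String text)
            throws IOException {
        Path labelDir = dataRoot.resolve(label);          // one folder per label
        Files.createDirectories(labelDir);                // create it if missing
        Path file = labelDir.resolve(name + ".txt");      // Article(text, name)
        Files.writeString(file, text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical example values.
        saveArticle(Path.of("data"), "swing-layout",
                "how-to-center-a-jbutton", "I am trying to center a JButton ...");
    }
}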
Import Data
• Input: labeled articles (the label comes from the folder each article is stored in).
• How does it work? MALLET turns each article into an Instance with three main fields:
  – Data: the article text, converted from a char sequence to a token sequence and then to feature vectors
  – Name/Source: the name of the article (drive + folder + article name)
  – Target: the label (the name of the folder the article was stored in)
• Output: a MALLET document (instance list).
Example of Import Data
• Import-file
• Import-dir
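Below is a minimal sketch of the import step written against the MALLET 2.x Java API as I understand it; the pipe classes mirror the char sequence -> token sequence -> feature vector chain above, the directory layout, token regex, and output file name are assumptions, and the command-line import-file/import-dir tools wrap essentially the same pipeline:

import java.io.File;
import java.util.regex.Pattern;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.FeatureSequence2FeatureVector;
import cc.mallet.pipe.Input2CharSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.Target2Label;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceLowercase;
import cc.mallet.pipe.TokenSequenceRemoveStopwords;
import cc.mallet.pipe.iterator.FileIterator;
import cc.mallet.types.InstanceList;

public class ImportDiscussions {
    public static void main(String[] args) throws Exception {
        // One sub-directory per label under data/, e.g. data/swing-layout/*.txt (assumed layout).
        File[] labelDirs = new File("data").listFiles(File::isDirectory);

        // Pipeline: raw file -> char sequence -> token sequence -> feature vector,
        // with the enclosing folder name used as the target label.
        Pipe pipe = new SerialPipes(new Pipe[] {
            new Target2Label(),
            new Input2CharSequence("UTF-8"),
            new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{N}_]*")),
            new TokenSequenceLowercase(),
            new TokenSequenceRemoveStopwords(false, false),
            new TokenSequence2FeatureSequence(),
            new FeatureSequence2FeatureVector()
        });

        InstanceList instances = new InstanceList(pipe);
        instances.addThruPipe(new FileIterator(labelDirs,
                f -> f.getName().endsWith(".txt"),     // only the saved .txt articles
                FileIterator.LAST_DIRECTORY));         // target = last directory name

        // Save for the training step (same role as "mallet import-dir --output data.mallet").
        instances.save(new File("data.mallet"));
    }
}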
Training Classifier
• Input: the instance list produced by the import step.
• Set the training portion to divide the data into a training set and a test set.
• k-fold cross-validation – usually 10 trials.
• Set the trainer – NaiveBayesTrainer.
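A minimal sketch of the training and evaluation step under the same MALLET 2.x API assumptions; the 80/20 portion, the number of trials, and the input file name are placeholders, and repeated random splitting stands in for strict k-fold cross-validation:

import java.io.File;
import java.util.Random;

import cc.mallet.classify.Classifier;
import cc.mallet.classify.NaiveBayesTrainer;
import cc.mallet.classify.Trial;
import cc.mallet.types.InstanceList;

public class TrainClassifier {
    public static void main(String[] args) throws Exception {
        // Load the instance list produced by the import step (file name assumed).
        InstanceList instances = InstanceList.load(new File("data.mallet"));

        double trainAccuracySum = 0, testAccuracySum = 0;
        int trials = 10;
        for (int i = 0; i < trials; i++) {
            // 80% training, 20% test (portion is an assumption).
            InstanceList[] split = instances.split(new Random(i),
                    new double[] { 0.8, 0.2 });
            InstanceList training = split[0];
            InstanceList test = split[1];

            Classifier classifier = new NaiveBayesTrainer().train(training);

            trainAccuracySum += new Trial(classifier, training).getAccuracy();
            testAccuracySum += new Trial(classifier, test).getAccuracy();
        }
        System.out.printf("mean train accuracy = %.3f%n", trainAccuracySum / trials);
        System.out.printf("mean test accuracy  = %.3f%n", testAccuracySum / trials);
    }
}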
Basic Bayes theorem
• A probabilistic framework for solving classification problems
• Conditional probability: P(C|A) = P(A, C) / P(A)
• Bayes theorem: P(C|A) = P(A|C) P(C) / P(A)
Example of Bayes Theorem
• Given:
  – A doctor knows that a cold causes a cough 50% of the time.
  – The prior probability of any patient having a cold is 1/50,000.
  – The prior probability of any patient having a cough is 1/20.
• If a patient has a cough, what is the probability that he/she has a cold?
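A worked solution, using the numbers above:

  P(cold | cough) = P(cough | cold) P(cold) / P(cough)
                  = (0.5 × 1/50,000) / (1/20)
                  = 0.0002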
Example of Training a Classifier
Output of Classification
• Confusion matrix
• Test data accuracy for every trial
• Train data accuracy mean
  – Standard deviation
  – Standard error
• Test data accuracy mean
  – Standard deviation
  – Standard error
Example output of classification (1 of 2)
Example output of classification (2 of 2)
How to improve the accuracy?
• Increase the recognition rate – word splitting
• Unify word tenses – word stemming
• Get rid of noisy data – remove stop words
• Handle overlapping categories – the Top N method
Word stemming
• Change a verb's tense back to its base form
  – Ex. performed -> perform
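A deliberately crude sketch of the idea (a real pipeline would use a proper stemmer such as Porter's; the suffix list here is an illustrative assumption):

// Crude suffix stripping: maps "performed" -> "perform", "buttons" -> "button".
// Only illustrates the idea; not a replacement for a real stemmer.
public class CrudeStemmer {
    private static final String[] SUFFIXES = { "ing", "ed", "es", "s" };

    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("Performed")); // prints: perform
    }
}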
Word Splitting (1 of 3)
• In what cases can we split a word?
  – Punctuation
  – Whitespace
  – Underscores
Word Splitting (2 of 3)
• Examples:
  – Set_Value -> Set Value
  – ImageIcon("myIcon.gif")); -> ImageIcon myIcon gif
  – actionPerformed(ActionEvent e) -> actionPerformed Action Event e
• See any problems?
  – In some cases people write several words, or words and numbers, joined together.
  – Ex. JButton, actionListener, Button1, 2, 3, etc.
Word Splitting (3 of 3)
• What are the special cases? (The rules are sketched in code below.)
  – 1. Starts with one or more capital letters combined with one or more words.
    Ex. JFrame -> J Frame; JJJJJJJButton -> JJJJJJJ Button; JButtonApple -> J Button Apple
  – 2. A lower-case letter or word combined with a word that begins with a capital letter.
    Ex. cButton -> c Button; ccccccccButton -> cccccccc Button; setValue -> set Value; addActionListener -> add Action Listener
  – 3. Several words that all begin with a capital letter joined together.
    Ex. MyFrame -> My Frame; SetActionCommand -> Set Action Command
  – 4. Words combined with numbers.
    Ex. Button1 -> Button 1; 1Button -> 1 Button; Button123 -> Button 123; 123Button -> 123 Button
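The four cases above can be handled with one regular expression over camel-case boundaries, letter/digit boundaries, underscores, and punctuation. A minimal sketch (the class name and the exact regex are mine, not taken from the study's tooling):

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class IdentifierSplitter {
    // Split points: lower->Upper (setValue), UPPER->Upper+lower (JButtonApple),
    // letter<->digit (Button1, 123Button), plus underscores, punctuation, and blanks.
    private static final Pattern SPLIT = Pattern.compile(
            "(?<=[a-z])(?=[A-Z])"
          + "|(?<=[A-Z])(?=[A-Z][a-z])"
          + "|(?<=[A-Za-z])(?=[0-9])"
          + "|(?<=[0-9])(?=[A-Za-z])"
          + "|[_\\W]+");

    public static List<String> split(String token) {
        return Arrays.stream(SPLIT.split(token))
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("Set_Value"));                    // [Set, Value]
        System.out.println(split("ImageIcon(\"myIcon.gif\"));"));  // [Image, Icon, my, Icon, gif]
        System.out.println(split("JButtonApple"));                 // [J, Button, Apple]
        System.out.println(split("addActionListener"));            // [add, Action, Listener]
        System.out.println(split("Button123"));                    // [Button, 123]
    }
}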
Remove Stop Words (1 of 2)
• What are stop words?
  – The most common, short function words, such as the, is, at, which, and, on.
• Any special cases?
Remove Stop Words (2 of 2)
• Extra stop words
  – Programming words, e.g. public, private, class, new, etc.
• A word frequency counter helps.
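A minimal sketch of stop-word filtering with an extra, programming-specific list (both word lists are small illustrative samples, and a plain set-based filter stands in for MALLET's built-in stop-word pipe):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordFilter {
    // Tiny illustrative samples; a real run would load full lists from files.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "is", "at", "which", "and", "on", "a", "to", "of", "i",
            // extra programming stop words, found with a word frequency counter
            "public", "private", "class", "new", "void", "import", "return"));

    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOPWORDS.contains(t.toLowerCase()))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList(
                "the", "button", "is", "new", "and", "public", "action", "listener")));
        // prints: [button, action, listener]
    }
}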
Overlapping Categories
• Each category is treated as an independent label by default.
• How do we handle realistic, overlapping problems?
  – The Top N method
Top N Method
(Comparison diagram: the regular way vs. the Top N method.)
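As I read the slides, the Top N method counts a prediction as correct when the true label appears among the classifier's N highest-ranked labels, rather than only when it is the single best label. A sketch under that reading, using MALLET calls (classify, getLabeling, getLabelAtRank) as I understand the 2.x API; treat them as assumptions:

import cc.mallet.classify.Classifier;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.types.Labeling;

public class TopNAccuracy {
    // Fraction of test instances whose true label is among the classifier's top n labels.
    public static double topNAccuracy(Classifier classifier, InstanceList test, int n) {
        int correct = 0;
        for (Instance instance : test) {
            Labeling labeling = classifier.classify(instance).getLabeling();
            int limit = Math.min(n, labeling.getLabelAlphabet().size());
            for (int rank = 0; rank < limit; rank++) {
                if (labeling.getLabelAtRank(rank).equals(instance.getTarget())) {
                    correct++;
                    break;
                }
            }
        }
        return (double) correct / test.size();
    }
}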
Some of our test results
• Tests are based on 10 different labels and 45 instances in total.
• Let's look at some charts to get an intuitive, visual sense of the results.
Classifying data with the original MALLET
Lowest: 19%  Highest: 32%  Average: 25.9%
After stemming & word splitting
Lowest: 26%  Highest: 44%  Average: 35%
After removing stop words
Lowest: 36%  Highest: 62%  Average: 45%
With the Top N method
Lowest: 54%  Highest: 72%  Average: 63.6%
Any way to improve accuracy further?
• Highlight the key features.
• Use only code data as training data.
Only code data
• Delete all text other than code.
• What is considered code? Code includes not only snippets longer than one line, but also class names, such as JButton and JActionListener, and method calls, such as addActionListener(aListener).
Test result of code-only data
Code only:          Lowest: 24%  Highest: 42%  Average: 34%
Previous (Top N):   Lowest: 54%  Highest: 72%  Average: 63.6%
Why did this happen?
• Code-only data is not enough.
• We cannot remove too much data, especially data that actually contributes to feature selection.
• Is our data set not big enough?
Increase the Data Scale
• What did we do?
  – Increased the total number of instances from 45.
  – Increased the number of labels.
• Data analysis and quality improvement, since categories may overlap.
After the data scale increased
Before (45 instances):  Lowest: 36%     Highest: 62%    Average: 45%
After:                  Lowest: 51.88%  Highest: 67.5%  Average: 59.75%
After the data scale increased, with Top N
Before (45 instances):  Lowest: 54%     Highest: 72%  Average: 63.6%
After:                  Lowest: 66.25%  Highest: –    Average: 72.05%
Why did the accuracy increase?
• The Naive Bayes classifier uses a Gaussian distribution to represent the class-conditional probability of continuous attributes, so we wondered whether the frequency distribution of each word across the articles looks like a normal distribution.
• We counted the frequency of each word in the articles to create a histogram and checked whether the histogram looks like a normal distribution.
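A sketch of the counting step described above: for one chosen word, count its occurrences in each article and bucket those per-article counts into a histogram (the data layout and the probe word are assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeMap;
import java.util.stream.Stream;

public class WordHistogram {
    public static void main(String[] args) throws IOException {
        String word = "button";                              // the word being inspected (assumed)
        TreeMap<Long, Integer> histogram = new TreeMap<>();  // occurrences per article -> #articles

        try (Stream<Path> files = Files.walk(Path.of("data"))) {
            files.filter(p -> p.toString().endsWith(".txt")).forEach(p -> {
                try {
                    String text = Files.readString(p).toLowerCase();
                    long count = Stream.of(text.split("\\W+"))
                                       .filter(word::equals)
                                       .count();
                    histogram.merge(count, 1, Integer::sum);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }

        // Print a crude text histogram: one row per occurrence count.
        histogram.forEach((count, articles) ->
                System.out.println(count + " occurrences: " + "#".repeat(articles)));
    }
}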
Histogram for Word A
Histogram for Word B
Only code data again
Code only:                       Lowest: –%      Highest: –%  Average: –%
Previous (larger data, Top N):   Lowest: 66.25%  Highest: –%  Average: 72.05%
Data without code
Without code:                    Lowest: 67.5%   Highest: –%  Average: –%
Previous (larger data, Top N):   Lowest: 66.25%  Highest: –%  Average: 72.05%
Results Analysis
• Unlike for human readers, code is not the decisive factor for the classifier.
• Based on our prepared data, code makes up only a small part of each instance.
Compared to Maximum Entropy
Maximum Entropy:  Lowest: 57.5%   Highest: 70%    Average: 63.06%
Naive Bayes:      Lowest: 51.88%  Highest: 67.5%  Average: 59.75%
Maximum Entropy with Top N
Maximum Entropy:  Lowest: 69.38%  Highest: –%  Average: 78.63%
Naive Bayes:      Lowest: 66.25%  Highest: –%  Average: 72.05%
Generative and discriminative models
• Generative (joint) models -> P(c, d)
  – Place a probability over both the observed data and the hidden structure.
  – Ex. Naive Bayes
• Discriminative (conditional) models -> P(c | d)
  – Take the data as given and place a probability over the hidden structure given the data.
  – Ex. Maximum Entropy, SVMs
• Let's look at a picture.
Generative and discriminative models
• Bayes net diagrams draw circles for random variables and lines for direct dependencies.
• Some variables are observed; some are hidden.
• Each node (conditional model) is a little classifier based on its incoming arcs.
(Diagram: a joint model with class c and features d1, d2, d3, next to a conditional model over the same variables.)
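For concreteness, the textbook forms of the two kinds of model discussed here (standard formulations, not taken from the slides):

  Naive Bayes (joint):            P(c, d) = P(c) · Π_i P(w_i | c)
  Maximum Entropy (conditional):  P(c | d) = exp(Σ_i λ_i f_i(c, d)) / Σ_c' exp(Σ_i λ_i f_i(c', d))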
SVM-light
• SVM-light is an implementation of Support Vector Machines (SVMs) in C.
• Available for download from the author's website.
• Solves classification and regression problems; also solves ranking problems.
• Efficiently computes leave-one-out estimates of the error rate, the precision, and the recall.
• Supports standard kernel functions and lets you define your own.
Format of the input file
• The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:
  <line>    .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
  <target>  .=. +1 | -1 | 0 | <float>
  <feature> .=. <integer> | "qid"
  <value>   .=. <float>
  <info>    .=. <string>
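For example (the feature indices, values, and trailing comment are hypothetical), a positive training example with three non-zero features would look like:

  +1 101:0.43 2050:0.12 9284:0.2 # how-to-center-a-jbutton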
How it works – linear mapping
(Figure taken from an SVM tutorial PDF.)
Polynomial mapping
(Figure taken from an SVM tutorial PDF.)
Future Work
• Ongoing experiments and result analysis based on SVMs.
• Continue to increase the data scale.
• Further increase the recognition rate with SVMs.
• Compare the accuracy of the generative (Naive Bayes) and discriminative (SVM) models on our results, since we believe the main factor determining accuracy is not the tool itself but selecting the tool that best matches the data model underlying a given text categorization problem.
References
• Andrew Kachites McCallum. "MALLET: A Machine Learning for Language Toolkit."
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. 2005.
• Lecture 7 of the Artificial Intelligence | Natural Language Processing course at Stanford. Instructor: Christopher D. Manning.
• K. Nigam, J. Lafferty, and A. McCallum. Using Maximum Entropy for Text Classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.
• T. Joachims. Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, 1999.
• T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. University of Dortmund, LS VIII, 1997.