Literary Style Classification with Deep Linguistic Features
Hyung Jin Kim, Minjong Chung, Wonhong Lee
Introduction
People in the same professional area share a similar literary style. Is there any way to classify them automatically?
Basic idea: 3 different professions on Twitter
- Entertainer (Britney Spears): “I-I-I Wanna Go-o-o by the amazing”
- Politician (Barack Obama): “Instead of subsidizing yesterday’s energy, let’s invest in tomorrow’s.”
- IT Guru (Guy Kawasaki): “Exporting 20,439 photos to try to merge two computers. God help me.”
Approach
We concentrated on extracting as many features as possible from each sentence, and then selecting the most effective features among them.
Classifiers (a setup sketch follows below):
- Support Vector Machine (SVM)
- Naïve Bayes (NB)
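The slides contain no code, so the following is a rough illustration only: the two classifiers over TF-IDF-weighted features, sketched with scikit-learn (the Implementation slide suggests the original system was built on Java NLP tools, and the tiny corpus below is just the three tweets from the Introduction slide).

# Illustrative sketch only: SVM and Naïve Bayes over TF-IDF features,
# using scikit-learn rather than the authors' original (Java-based) stack.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny stand-in corpus: one tweet per profession, from the Introduction slide.
tweets = [
    "I-I-I Wanna Go-o-o by the amazing",
    "Instead of subsidizing yesterday's energy, let's invest in tomorrow's.",
    "Exporting 20,439 photos to try to merge two computers. God help me.",
]
labels = ["entertainer", "politician", "it_guru"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC())     # SVM on TF-IDF weights
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())  # Naïve Bayes counterpart

svm.fit(tweets, labels)
nb.fit(tweets, labels)
print(svm.predict(["Let's invest in tomorrow's energy."]))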
Features
Basic Features
- Binary value for each word => large dimension of feature space
- Word stemming & stopword removal
- TF-IDF weighting
Syntactic Features
- POS tags
- Using specific classes of POS tag (e.g. only nouns, only verbs, etc.)
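As a concrete (hypothetical) sketch of these two feature families — the slides name the Stanford POS Tagger, but NLTK stands in here to keep the example self-contained:

# Sketch of the basic and syntactic feature extractors described above.
# One-time setup: download NLTK's tokenizer, stopword, and tagger models,
# e.g. nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger").
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def basic_features(sentence):
    # Binary bag-of-words after stemming and stopword removal.
    tokens = nltk.word_tokenize(sentence.lower())
    return {stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops}

def noun_features(sentence):
    # Keep only one POS class (here: nouns), per the syntactic-features idea.
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return {word.lower() for word, tag in tagged if tag.startswith("NN")}

print(noun_features("Exporting 20,439 photos to try to merge two computers."))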
Features
Manual Features
- Performance is limited when using only automatically collected features
- So we also added features selected manually, by human intelligence!
Type of feature              Example
Punctuation marks            “awesome!!!!!”
Capitalization               “EVERYONE”
Dates or years               “Mar 3,1833”
Numbers or rates             “10% growth”
Emoticons                    “cool :)”
Retweet (Twitter-specific)   “RT @mjipeo”
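The tabulated manual features lend themselves to simple pattern matching; a hypothetical sketch (the regular expressions below are ours, not the authors'):

# Hypothetical regex detectors for the manual features tabulated above.
import re

MANUAL_FEATURES = {
    "punctuation_run": re.compile(r"[!?]{2,}"),                        # "awesome!!!!!"
    "capitalization": re.compile(r"\b[A-Z]{2,}\b"),                    # "EVERYONE"
    "date_or_year": re.compile(r"\b[A-Z][a-z]{2} \d{1,2},\s?\d{4}\b"), # "Mar 3,1833"
    "number_or_rate": re.compile(r"\d+%|\b\d+(?:,\d{3})*\b"),          # "10% growth"
    "emoticon": re.compile(r"[:;]-?[)(DP]"),                           # "cool :)"
    "retweet": re.compile(r"\bRT @\w+"),                               # "RT @mjipeo"
}

def manual_features(tweet):
    # Return the set of manual features that fire on a tweet.
    return {name for name, pattern in MANUAL_FEATURES.items() if pattern.search(tweet)}

print(manual_features("RT @mjipeo 10% growth is awesome!!!!! :)"))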
Feature Selection
TF-IDF (Term Frequency – Inverse Document Frequency)
- Commonly used as an importance measure of a term within a document
Information Gain
- Measures how much knowing a word (or feature) reduces the uncertainty of the labels
Chi-square
- Measures the dependency between a feature and the class
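scikit-learn ships selectors for two of these criteria; the sketch below assumes information gain can be approximated by mutual information (TF-IDF-based selection, the third criterion, has no off-the-shelf selector and is omitted here). The toy documents are ours.

# Sketch of chi-square and information-gain feature selection with
# scikit-learn; mutual_info_classif stands in for information gain.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

docs = [
    "let's invest in tomorrow's energy",        # politician-style (toy data)
    "exporting photos to merge two computers",  # IT-style (toy data)
    "subsidizing yesterday's energy",           # politician-style (toy data)
]
labels = [1, 0, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)
names = vec.get_feature_names_out()

chi2_top = SelectKBest(chi2, k=3).fit(X, labels)               # feature/class dependency
ig_top = SelectKBest(mutual_info_classif, k=3).fit(X, labels)  # label-uncertainty reduction
print(names[chi2_top.get_support()], names[ig_top.get_support()])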
Implementation
Caching System
- Processing takes a considerable amount of time
- We devised our own caching system to improve productivity
NLP Libraries
- Stanford POS Tagger
- JWI, the WordNet package developed at MIT, for basic dictionary operations
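The slides do not describe the cache's design; one plausible sketch is a decorator that memoizes expensive tagger calls to disk, so repeated runs skip re-tagging (CACHE_DIR and disk_cached are our invention, not the authors' system).

# Hypothetical on-disk cache for slow NLP calls (e.g. POS tagging),
# illustrating the caching idea; not the authors' actual design.
import hashlib
import os
import pickle

CACHE_DIR = ".nlp_cache"  # hypothetical cache location
os.makedirs(CACHE_DIR, exist_ok=True)

def disk_cached(fn):
    # Memoize fn(text) to a pickle file keyed by a hash of the text.
    def wrapper(text):
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)  # cache hit: skip the slow call
        result = fn(text)
        with open(path, "wb") as f:
            pickle.dump(result, f)     # cache miss: compute, then store
        return result
    return wrapper

# Usage: wrap a slow tagger once, e.g. tag = disk_cached(nltk.pos_tag),
# and later runs over the same sentences read straight from disk.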
Result
Performance of various feature extractors on SVM:
- SVM with TF-IDF selection: 60%
- SVM with IG selection: 58%
- SVM with chi-square selection: 52%
Result
Performance of manual feature extractors on SVM:
- SVM with manual selection: 62%
Result
Performance of classifiers (Naïve Bayes vs. SVM):
- NB without feature selection: 84%
- Random guess (benchmark): 33%
Conclusion
- The Naïve Bayes classifier works surprisingly well without any feature engineering, because its independence assumption is appropriate here
- For SVM, selecting proper features is critical to avoid overfitting
- Using only noun features works better than using the general (all-POS) feature set