Literary Style Classification with Deep Linguistic Features
Hyung Jin Kim, Minjong Chung, Wonhong Lee



Introduction
People in the same professional area share a similar literary style. Is there any way to classify them automatically?
Basic idea: three different professions on Twitter.
- Entertainer (Britney Spears): "I-I-I Wanna Go-o-o by the amazing"
- Politician (Barack Obama): "Instead of subsidizing yesterday's energy, let's invest in tomorrow's."
- IT Guru (Guy Kawasaki): "Exporting 20,439 photos to try to merge two computers. God help me."

Approach
We concentrated on extracting as many features as possible from each sentence, then selecting the most effective features among them.
Classifiers:
- Support Vector Machine (SVM)
- Naïve Bayes (NB)
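
The Naïve Bayes side of this approach can be sketched as a from-scratch multinomial model with add-one smoothing. This is a minimal illustration under that standard formulation; the slides do not show the actual implementation, and the class name and toy data below are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over word lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = Counter(labels)            # class -> doc count
        self.word_counts = defaultdict(Counter)   # class -> word counts
        self.vocab = set()
        for words, y in zip(docs, labels):
            self.word_counts[y].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        best, best_lp = None, -math.inf
        n = sum(self.classes.values())
        v = len(self.vocab)
        for y, cnt in self.classes.items():
            # log prior + smoothed log likelihood of each word
            lp = math.log(cnt / n)
            total = sum(self.word_counts[y].values())
            for w in words:
                lp += math.log((self.word_counts[y][w] + 1) / (total + v))
            if lp > best_lp:
                best, best_lp = y, lp
        return best

# Toy training data in the spirit of the Twitter examples
nb = NaiveBayes().fit(
    [["invest", "energy"], ["wanna", "go"]],
    ["politician", "entertainer"])
```

Usage: `nb.predict(["invest"])` picks the class whose smoothed word distribution best explains the input.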

Features
Basic Features
- Binary value for each word => large dimension of feature space
- Word stemming & stopword removal
- TF-IDF weighting
Syntactic Features
- POS tags
- Using specific classes of POS tags (e.g. only nouns or only verbs)

Features
Manual Features
- Limited performance when using only automatically collected features
- Manually selected features by human intelligence!

Type of feature     | Example
--------------------|------------------
Punctuation marks   | "awesome!!!!!"
Capitalization      | "EVERYONE"
Dates or years      | "Mar 3, 1833"
Numbers or rates    | "10% growth"
Emoticons           | "cool :)"
Retweet (Twitter    |
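
Manual features of this kind are typically implemented as small regex detectors over the raw tweet. The patterns and feature names below are hypothetical illustrations of the listed feature types, not the authors' actual extractors.

```python
import re

def manual_features(tweet):
    """Binary indicators for the hand-picked feature types on the slide."""
    return {
        "repeated_punctuation": bool(re.search(r"[!?]{2,}", tweet)),
        "all_caps_word": bool(re.search(r"\b[A-Z]{2,}\b", tweet)),
        "date_or_year": bool(re.search(r"\b(18|19|20)\d{2}\b", tweet)),
        "number_or_rate": bool(re.search(r"\b\d+(\.\d+)?%?", tweet)),
        "emoticon": bool(re.search(r"[:;][-']?[)(DPp]", tweet)),
        "retweet": tweet.lstrip().startswith("RT "),
    }

feats = manual_features("RT awesome!!!!! EVERYONE saw 10% growth :)")
```

Each indicator becomes one extra dimension alongside the bag-of-words features.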

Feature Selection
TF-IDF (Term Frequency – Inverse Document Frequency)
- Usually used as an importance measure of a term in a document
Information Gain
- Measures how much knowing a word (or feature) reduces the uncertainty of the labels
Chi-square
- Measures the dependency between a feature and the class
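
Taking information gain as one example of these selection criteria: for a binary feature it is H(Y) minus the label entropy remaining once the feature's value is known. A minimal sketch, with hypothetical toy data:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(has_feature, labels):
    """H(Y) - H(Y | feature present/absent) for one binary feature."""
    h = entropy(labels)
    for v in (True, False):
        subset = [y for x, y in zip(has_feature, labels) if x == v]
        if subset:
            h -= len(subset) / len(labels) * entropy(subset)
    return h

# Toy case: the feature perfectly separates the two classes,
# so it removes the full one bit of label uncertainty.
x = [True, True, False, False]
y = ["politician", "politician", "entertainer", "entertainer"]
ig = information_gain(x, y)
```

Ranking all candidate features by this score and keeping the top ones is the usual way such a criterion is used for selection.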

Implementation
Caching System
- Processing takes a considerable amount of time
- Devised our own caching system to improve productivity
NLP Libraries
- Stanford POS Tagger
- JWI WordNet package (developed by MIT) for basic dictionary operations
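
One simple way to realize such a caching system is a decorator that stores results on disk keyed by a hash of the input, so repeated runs skip the slow tagging step. This is a minimal sketch of the idea, not the authors' actual system; the `pos_tag` stub below is a hypothetical stand-in for a slow tagger call.

```python
import functools
import hashlib
import os
import pickle
import tempfile

def disk_cache(func):
    """Cache func(text) results on disk, keyed by a hash of text."""
    cache_dir = tempfile.mkdtemp(prefix="nlp_cache_")

    @functools.wraps(func)
    def wrapper(text):
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):               # cache hit: skip the work
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(text)
        with open(path, "wb") as f:            # cache miss: compute, store
            pickle.dump(result, f)
        return result

    return wrapper

calls = []

@disk_cache
def pos_tag(text):
    # Stand-in for an expensive call such as running an external tagger.
    calls.append(text)
    return [(w, "NN") for w in text.split()]

first = pos_tag("hello world")
second = pos_tag("hello world")  # served from the cache
```

The second call returns the pickled result without re-running the tagger, which is where the productivity gain comes from.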

Result
Performance of various feature extractors on SVM:
- SVM with TF-IDF selection: 60%
- SVM with IG selection: 58%
- SVM with Chi-square selection: 52%

Result
Performance of manual feature extraction on SVM:
- SVM with manual selection: 62%

Result
Performance of classifiers (Naïve Bayes vs. SVM):
- NB without feature selection: 84%
- Random guess (benchmark): 33%

Conclusion
- The Naïve Bayes classifier works surprisingly well without any feature engineering, due to an appropriate independence assumption
- For SVM, selecting proper features is critical to avoid overfitting
- Using only noun features works better than using general features