1
Author Identification for LiveJournal
Alyssa Liang
2
The problem
LiveJournal – a blogging website
Given a document (an entry), identify the author
Hierarchical classification
  first classify by gender
  then classify author based on gender
Document
  Male: Male 1, Male 2
  Female: Female 1, Female 2, Female 3
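The two-stage scheme on this slide can be sketched roughly as follows. This is only an illustration: `classify_author`, `toy_gender`, and `toy_authors` are hypothetical names, and the toy rules stand in for whatever trained classifiers were actually used.

```python
from typing import Callable, Dict

# A classifier here is any function mapping a document to a label (assumption).
Classifier = Callable[[str], str]


def classify_author(document: str,
                    gender_clf: Classifier,
                    author_clfs: Dict[str, Classifier]) -> str:
    """Predict gender first, then pick the author with that gender's classifier."""
    gender = gender_clf(document)
    return author_clfs[gender](document)


# Toy stand-ins for trained models, purely illustrative.
def toy_gender(doc: str) -> str:
    return "female" if "!" in doc else "male"


toy_authors: Dict[str, Classifier] = {
    "male": lambda doc: "Male 1" if len(doc) < 20 else "Male 2",
    "female": lambda doc: "Female 1" if len(doc) < 20 else "Female 2",
}
```

Note that a gender error in the first stage cannot be recovered in the second, since the author is chosen only from the predicted gender's set.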
3
Features
Unigrams & Bigrams
Average sentence and word length
Number of words and distinct words
Number of sentences in paragraph
Number of UPPERCASE characters
Number of words not in the dictionary
Number of words with length <= 4
Number of characters in italics, bold, and strikethrough
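A minimal sketch of how several of these features might be computed from plain text. The tokenization rules, the function name `extract_features`, and the `dictionary` argument are assumptions, not the presenter's implementation. The per-paragraph sentence count and the formatting-based counts (italics, bold, strikethrough) are left out because they depend on the entry's markup.

```python
import re
from collections import Counter


def extract_features(text, dictionary=frozenset()):
    """Return (numeric features, unigram counts, bigram counts) for one entry."""
    # Assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lower = [w.lower() for w in words]

    feats = {
        "num_words": len(words),
        "num_distinct_words": len(set(lower)),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Sentence length measured in words (assumption).
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "num_uppercase_chars": sum(c.isupper() for c in text),
        "num_short_words": sum(len(w) <= 4 for w in words),
        "num_out_of_dictionary": sum(w not in dictionary for w in lower),
    }
    unigrams = Counter(lower)
    bigrams = Counter(zip(lower, lower[1:]))
    return feats, unigrams, bigrams
```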
4
The 3 Classifiers
Naïve Bayes – generative model
  Assumes features in a document are independent of each other
  Implemented the multivariate Bernoulli model
  Represents only whether a feature appeared in the document, not the number of times it appears
Decision Trees
  An internal node is a test of a feature, and each branch from the node represents a value the feature can take
  A leaf node represents a classification
  Build a smallish tree from the training data using minimum average entropy
Maximum Entropy – conditional model
  “model all that is known and assume nothing about that which is unknown”
  Tries to find the most uniform model that satisfies the constraints, i.e. maximize the entropy
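Two of the ideas on this slide can be sketched in a few lines: a multivariate Bernoulli Naïve Bayes classifier over binary "feature present" indicators, and the average entropy of a split on one binary feature, which a decision tree builder would try to minimize. All function names are hypothetical, and the add-one smoothing is an assumption the slide does not mention.

```python
import math
from collections import Counter, defaultdict


def train_bernoulli_nb(docs, labels, vocab):
    """docs: list of sets of features present in each document."""
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(d)
    priors = {y: len(ds) / len(docs) for y, ds in by_class.items()}
    # P(feature present | class), with add-one smoothing (assumption).
    cond = {
        y: {f: (sum(f in d for d in ds) + 1) / (len(ds) + 2) for f in vocab}
        for y, ds in by_class.items()
    }
    return priors, cond


def predict_bernoulli_nb(doc, priors, cond):
    """Score every class; absent features contribute log(1 - p)."""
    best, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)
        for f, p in cond[y].items():
            score += math.log(p if f in doc else 1.0 - p)
        if score > best_score:
            best, best_score = y, score
    return best


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def average_entropy(docs, labels, feature):
    """Size-weighted entropy of the two branches of a present/absent test."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    parts = [p for p in (present, absent) if p]
    return sum(len(p) / len(labels) * entropy(p) for p in parts)
```

A tree builder in this style would test the feature with the lowest `average_entropy` at each node and recurse on the two branches.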
5
Results
Hierarchical classification has no benefits
Need to improve gender classification – could use different features
Hierarchical Feature Reduction (on gender classification)
  took 512 most important features and reran maxent training; then took 256 most important features, etc.
  Proved to be very stable
  Best features consisted mostly of bigrams (many of which contained punctuation)
  Also chose features where there was a large difference between male and female (number of distinct words, UPPERCASE letters, short words)
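The halving procedure described above (keep the top 512 features, retrain, keep the top 256, and so on) might look roughly like this. The `rank` callback is a hypothetical stand-in for retraining the maxent model on the current feature set and returning an importance score per feature; how importance was actually measured is not stated on the slide.

```python
def halve_features(features, rank, start=512, stop=1):
    """Repeatedly rank the surviving features and keep the top k, halving k.

    rank(features) -> dict mapping each feature to an importance score.
    Returns a list of (k, surviving_features) pairs, one per round.
    """
    history = []
    current = list(features)
    k = start
    while k >= max(stop, 1) and current:
        scores = rank(current)  # e.g. retrain on `current` and score each feature
        current = sorted(current, key=lambda f: scores[f], reverse=True)[:k]
        history.append((k, list(current)))
        k //= 2
    return history
```

Comparing the surviving sets across rounds is one way to check the kind of stability the slide reports: if the ranking is stable, each round's survivors are simply the top half of the previous round's.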