Text Classification
Seminar: Social Media Mining, UC3M, May 2017
Lecturer: Carlos Castillo http://chato.cl/
Sources: CS124 slides by Dan Jurafsky; slides by Muhammad Atif Qureshi & Arjuman Younus – 2017
Facebook study (comments and timeline posts). Burke, Moira, Lada A. Adamic, and Karyn Marciniak. "Families on Facebook." In ICWSM, 2013. Featured in a blogpost by M. Burke.
Example applications: authorship of the disputed “Federalist Papers” in the USA; Gmail smart folders. Mosteller, Frederick, and David L. Wallace. "Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers." Journal of the American Statistical Association 58, no. 302 (1963): 275-309.
Per-document frequency of use of the word “you” in fiction documents, male vs. female authors. “even in formal writing, female writing exhibits greater usage of features identified by previous researchers as ‘involved’ while male writing exhibits greater usage of features which have been identified as ‘informational’.” Argamon, S., Koppel, M., Fine, J. and Shimoni, A.R., 2003. Gender, genre, and writing style in formal written texts. TEXT 23(3), pp.321-346.
Positive or negative review? Given a text, determine if the author is praising or complaining about a monument / landmark http://mashable.com/2015/01/09/one-star-yelp-historical-landmarks/
Academic articles Antagonists and Inhibitors Blood Supply Chemistry Drug Therapy Embryology Epidemiology …
Text classification problems Generic documents → Topics, Keywords, … → Author age, Author gender, … → Language Messages → Folder(s), Priority, Spam?, … Usual approach: supervised learning methods
Learning on text. The most obvious mapping: each document is an input element, and each word is a possible feature. Huge dimensionality (on the order of hundreds of thousands of words) → need sparse representations.
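The word-as-feature mapping above can be sketched with a toy corpus (the documents and the `sparse_bow` helper are invented for illustration): storing only the terms that actually occur keeps each document small even when the vocabulary is huge.

```python
# Sketch: each document -> a sparse bag of words (term -> count).
# Only observed terms are stored, never the full vocabulary.
from collections import Counter

docs = [
    "the spam filter blocked the spam",
    "family photos on the timeline",
]

def sparse_bow(text):
    """Tokenize by whitespace and count term occurrences (sparse representation)."""
    return Counter(text.lower().split())

vectors = [sparse_bow(d) for d in docs]
print(vectors[0]["spam"])            # 2
print(vectors[0].get("photos", 0))   # 0: absent terms cost no memory
```

A `Counter` per document is the simplest sparse representation; production systems typically use sparse matrices instead, but the idea is the same.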
Determining features. Apply the same pre-processing pipeline used in search. Join tokens when needed (e.g., “AK-48”, part numbers, chemical formulas, etc.). May need to emphasize words in titles, abstracts, or section headings. One option: multiply the input dimensionality by the number of existing blocks (“embryo” in the title is treated as completely unrelated to “embryo” in the body). Another option: heuristically increase the weight of title words and section headers. Term frequency is not relevant for short messages.
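The second option above (boosting title words instead of duplicating the feature space per block) can be sketched as follows; the `weighted_bow` helper and the weight of 3 are arbitrary choices for illustration, not a recommended setting.

```python
# Sketch: count body terms, then heuristically boost tokens that appear in the
# title, so "embryo" in a title counts more without creating a separate feature.
from collections import Counter

def weighted_bow(title, body, title_weight=3):
    counts = Counter(body.lower().split())
    for tok in title.lower().split():
        counts[tok] += title_weight  # arbitrary heuristic boost
    return counts

print(weighted_bow("embryo study", "the embryo develops")["embryo"])  # 1 + 3 = 4
```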
Training data is essential. SVMs and random forests are popular choices; with very little training data, Naïve Bayes (however, I would say: just get more training data). The amount of training data will vary during the learning cycle. In practice: with a few hundred examples per class, you already see obvious examples classified correctly; with a few thousand examples per class, less common cases start to be classified correctly.
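For the low-data regime mentioned above, a minimal multinomial Naïve Bayes with Laplace smoothing fits in a few lines; the toy corpus and labels here are invented, and this is a sketch rather than a production classifier.

```python
# Minimal multinomial Naive Bayes with Laplace (add-one) smoothing.
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # class -> term counts
    vocab = set()
    for doc, y in zip(docs, labels):
        toks = doc.lower().split()
        word_counts[y].update(toks)
        vocab.update(toks)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    class_counts, word_counts, vocab = model
    n = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        lp = math.log(class_counts[c] / n)          # log prior
        total = sum(word_counts[c].values())
        for tok in doc.lower().split():
            # add-one smoothing: every term gets a pseudo-count of 1
            lp += math.log((word_counts[c][tok] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

model = train_nb(["cheap pills now", "meeting agenda attached"], ["spam", "ham"])
print(predict_nb(model, "cheap pills"))  # spam
```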
The devil is in the details. Real systems may combine automatic classification with a few carefully hand-crafted rules. Real systems continuously incorporate new examples to maintain and improve performance. Commonly you have unbalanced classes: you need many examples of the minority class, and you can obtain them by keyword filtering, but that biases the training data (which harms generative models).
Evaluating. Evaluation can be done on a hold-out set. If more data keeps becoming available, how do we know our classifier is performing better? Options: cross-validation, or a fixed assignment to test or hold-out (validation) sets.
Cross-validation. Divide the sample into n “folds” (5 in this example). For k = 1 … n: train on all folds except fold k, test on fold k. Average the n runs → result.
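The fold loop above can be sketched without any library; `kfold_indices` is an invented helper that assigns items to folds round-robin, so each item is used for testing exactly once.

```python
# Sketch of k-fold cross-validation index generation (round-robin assignment).
def kfold_indices(n_items, n_folds):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    folds = [list(range(k, n_items, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

for train, test in kfold_indices(10, 5):
    print(test)   # each item appears in exactly one test fold
```

In practice the per-fold metric (accuracy, F1, …) would be computed inside the loop and averaged at the end.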
With unbalanced classes, accuracy becomes meaningless Need to analyze confusion matrix Example: classes are { uk, poultry, …, trade }
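A confusion matrix is just a table of (gold, predicted) counts; the labels below are a toy sample loosely echoing the classes listed above, not real evaluation data.

```python
# Building a confusion matrix from gold and predicted labels.
from collections import Counter

gold = ["uk", "poultry", "trade", "uk", "trade"]
pred = ["uk", "trade",   "trade", "uk", "poultry"]

confusion = Counter(zip(gold, pred))   # (gold, predicted) -> count
print(confusion[("uk", "uk")])          # 2 correct 'uk' items
print(confusion[("trade", "poultry")])  # 1 'trade' item misclassified as 'poultry'
```

The diagonal entries are the correct decisions; everything off-diagonal shows which classes get confused with which, which raw accuracy hides.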
Micro- and Macro-Average. Micro-average: pool the decisions for all items across classes, then compute the metric once. Macro-average: compute the metric for each class separately, then average over classes.
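The difference matters precisely in the unbalanced setting discussed above; the per-class counts below are invented to make it visible: micro-averaging is dominated by the large class, while macro-averaging exposes the weak minority class.

```python
# Micro- vs. macro-averaged precision on a toy, heavily unbalanced task.
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

# per-class (true positives, false positives): a large class and a rare one
counts = {"majority": (90, 10), "minority": (1, 9)}

micro_tp = sum(tp for tp, fp in counts.values())
micro_fp = sum(fp for tp, fp in counts.values())
micro = precision(micro_tp, micro_fp)                             # pooled: 91/110
macro = sum(precision(tp, fp) for tp, fp in counts.values()) / len(counts)

print(round(micro, 3))  # 0.827 -> dominated by the majority class
print(round(macro, 3))  # 0.5   -> the weak minority class drags it down
```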
Micro- and Macro-Average (cont.)