Automated Personality Classification
A. KARTELJ and V. FILIPOVIC, School of Mathematics, University of Belgrade, Serbia
V. MILUTINOVIC, School of Electrical Engineering, University of Belgrade, Serbia
Agenda
- Problem overview
- Classification of the existing solutions
- Presentation of the existing solutions
- Comparison of the solutions
- Work in progress: Bayesian Structure Learning for the APC
- Future work: Video Based APC
- Conclusions
3.10.2012 MULTI 2012
Problem Overview
The Big 5 Model
- Openness to experience (inventive/curious vs. consistent/cautious). Appreciation for art, emotion, adventure, unusual ideas, curiosity, and variety of experience. Openness reflects the degree of intellectual curiosity, creativity, and a preference for novelty and variety. Some disagreement remains about how to interpret the openness factor, which is sometimes called "intellect" rather than openness to experience.
- Conscientiousness (efficient/organized vs. easy-going/careless). A tendency to show self-discipline, act dutifully, and aim for achievement; planned rather than spontaneous behavior; organized and dependable.
- Extraversion (outgoing/energetic vs. solitary/reserved). Energy, positive emotions, surgency, assertiveness, sociability, talkativeness, and the tendency to seek stimulation in the company of others.
- Agreeableness (friendly/compassionate vs. cold/unkind). A tendency to be compassionate and cooperative rather than suspicious and antagonistic towards others.
- Neuroticism (sensitive/nervous vs. secure/confident). The tendency to experience unpleasant emotions easily, such as anger, anxiety, depression, or vulnerability. Neuroticism also refers to the degree of emotional stability and impulse control, and is sometimes referred to by its low pole, "emotional stability".
The Steps in Our Research
1. Survey paper (under review at ACM CSUR)
2. Research paper: A new APC model based on Bayesian structure learning (in progress)
3. Real-purpose application of the APC model from step 2
4. Go to step 3
Elements of APC
- Corpus: essay, weblog, email, news group, Twitter counts...
- Personality measurement: questionnaire (internet and written). We are searching for an alternative!
- Model: stylistic analysis, linguistic features, machine learning techniques
Applications
- Social networks: friend suggestions, dating sites (finding compatible partners)
- YouTube, TripAdvisor, Google, eBay: personality-based recommendations
- Customer targeting, advertisement
- Other usages: police, anti-terrorism, etc.
Mining People’s Characteristics
- Authorship: who is the author of an unsigned piece of text?
- Gender: is the author male or female?
- Mood, emotions: which emotions are conveyed through the text?
- Opinion: mining opinion from text (positive, negative, …)
- Personality
Classification of Solutions
- The C1 criterion separates solutions by type of conversation (1 = self-reflexive, N = continuous)
- The C2 criterion separates solutions by approach (TD = top-down, DD = data-driven, HY = hybrid)
Linguistic Styles: Language Use as an Individual Difference Pennebaker and King [1999]
LIWC and MRC Features

Feature                     Type   Example
Anger words                 LIWC   Hate, kill
Metaphysical issues         LIWC   God, heaven, coffin
Physical state / function   LIWC   Ache, breast, sleep
Inclusive words             LIWC   With, and, include
Social processes            LIWC   Talk, us, friend
Family members              LIWC   Mom, brother, cousin
Past tense verbs            LIWC   Walked, were, had
References to friends       LIWC   Pal, buddy, coworker
Imagery of words            MRC    Low: future, peace – High: table, car
Syllables per word          MRC    Low: a – High: uncompromisingly
Concreteness                MRC    Low: patience, candor – High: ship
Frequency of use            MRC    Low: duly, nudity – High: he, the

The LIWC dictionary is part of the text analysis framework LIWC (Linguistic Inquiry and Word Count), developed by Pennebaker et al. [2001]. LIWC categorizes words into meaningful psychological categories. Coltheart [1981] proposed the MRC, a psycholinguistic database of words categorized by various linguistic features, such as imagery, concreteness, frequency of usage, etc.
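As an illustration of how dictionary-based features of the LIWC kind can be turned into numbers, the sketch below computes, for each category, the share of tokens in a text that belong to it. The lexicon `TOY_LEXICON` and the function `category_rates` are hypothetical names, and the lexicon contains only the example words listed in the table; the real LIWC dictionary is much larger and is not reproduced here.

```python
import re
from collections import Counter

# Toy lexicon built only from the example words in the table above.
# Assumption: this mapping is illustrative and is NOT the real LIWC dictionary.
TOY_LEXICON = {
    "anger": {"hate", "kill"},
    "inclusive": {"with", "and", "include"},
    "social": {"talk", "us", "friend"},
    "family": {"mom", "brother", "cousin"},
}

def category_rates(text, lexicon=TOY_LEXICON):
    """Return, for each category, the fraction of tokens in `text` found in it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in lexicon}
    counts = Counter()
    for token in tokens:
        for name, words in lexicon.items():
            if token in words:
                counts[name] += 1
    return {name: counts[name] / len(tokens) for name in lexicon}

# Eight tokens: two inclusive ("with", "and"), one social ("talk"),
# two family ("mom", "brother"), no anger words.
rates = category_rates("I talk with my mom and my brother")
```

Vectors of such rates, one per text, are the kind of input that the correlation and machine learning approaches surveyed in this talk operate on.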
What Are They Blogging About? Personality, Topic and Motivation in Blogs Gill et al. [2009]
Taking Care of the Linguistic Features of Extraversion Gill and Oberlander [2002]
Personality Based Latent Friendship Mining Wang et al. [2009]
A Comparative Evaluation of Personality Estimation Algorithms for the TWIN Recommender System Roshchina et al. [2011]
Predicting Personality with Social Media Golbeck et al. [2011]
Our Twitter Profiles, Our Selves: Predicting Personality with Twitter Quercia et al. [2011]
Paper Input Corpus Features Algorithm Soft. Cit. I S A R
[Pennebaker and King 1999] text essays LIWC correlations n/a 455 H M
[Mairesse et al. 2007] text, speech LIWC, MRC C4.5, NB, SMO, M5’ Weka 99
[Gill et al. 2009] weblogs (14.8words) linear regression 26
[Yarkoni 2010] weblogs (100K words) 21
[Gill and Oberlander 2002] emails (105 students) bigrams bigram analysis 49 L
[Nowson et al. 2005] weblogs (410K words) word list 48
[Oberlander 2006] weblogs (410K words) N-grams NB, SMO 53
[Wang et al. 2009] text, weblogs (200 pairs) lexical freq., TFIDF logistic regression Minitab 1
[Iacobelli et al. 2011] weblogs (3000) LIWC, bigrams, SVM, SMO, NB..
[Argamon et al. 2005] word list, conj. SMO 38
[Argamon et al. 2007] Weka, ATMan 45
[Mairesse and Walker 2006] text, conv. extracts 96 persons (≈100K words) LIWC, MRC, utterance… RankBoost 22
[Rigby and Hassan 2007] mail. lists (140K emails) C4.5 Weka, SPSS 30
[Roshchina et al. 2011] TripAdvisor reviews LIWC, MRC Linear, M5, SVM 2
[Quercia et al. 2011] meta 335 Twitter users Twitter counts M5’ rules 5
[Golbeck et al. 2011] text, meta 279 FB users 5 classes (161 in total) M5’ rules, Gaussian processes 12
[Celli 2012] 1065 posts 22 ling. features majority-based classification
I: implementation cost; S: scalability; A: availability; R: reliability
Naive Bayes Classifier Naive Bayes, Oberlander [2006]
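To make the naive Bayes idea concrete, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing. It is a generic textbook formulation, not the setup used by Oberlander [2006]; the function names `train_nb` and `predict_nb`, the labels, and the four training texts are invented for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Collect class counts, per-class word counts and the vocabulary
    from a list of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log-posterior, treating words
    as conditionally independent given the class."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # log prior
        for token in tokens:
            if token in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen (word, class) pairs above zero.
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data: labels stand for high/low extraversion.
toy_docs = [
    ("party friends talk fun".split(), "high"),
    ("talk friends weekend party".split(), "high"),
    ("alone reading quiet home".split(), "low"),
    ("quiet home book alone".split(), "low"),
]
model = train_nb(toy_docs)
prediction = predict_nb(model, "party with friends".split())
```

The independence assumption is what distinguishes this model from a general Bayesian network, in which dependencies between features can also be represented.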
Naive Bayes and Bayesian Network
Bayesian Network for the APC
Bayesian Network Structure Learning
- Obtain a corpus (training set T)
- Fit an appropriate network structure to T:
  - ILP formulation + solver (CPLEX, Gurobi…) on smaller instances
  - Metaheuristic on larger instances
- Validate the quality of the metaheuristic approach
- Compare the obtained APC accuracy with other approaches
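The search step can be sketched as score-based structure learning. The example below is an assumption-laden stand-in, not the ILP formulation or the metaheuristic referred to above: it scores candidate graphs with the BIC score over binary variables and uses a greedy local search that only adds edges. The names `family_bic`, `creates_cycle`, and `hill_climb`, and the synthetic data, are hypothetical.

```python
import math
from itertools import product
from collections import Counter

def family_bic(data, child, parents):
    """BIC contribution of one binary node given its parent set."""
    parents = sorted(parents)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marginal = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(n * math.log(n / marginal[cfg]) for (cfg, _), n in joint.items())
    n_params = 2 ** len(parents)  # one free parameter per parent configuration
    return loglik - 0.5 * math.log(len(data)) * n_params

def creates_cycle(parents, u, v):
    """True if adding u -> v closes a directed cycle (v is an ancestor of u)."""
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(data, n_vars):
    """Greedy search: repeatedly add the single edge that improves BIC the most."""
    parents = {v: set() for v in range(n_vars)}
    scores = {v: family_bic(data, v, parents[v]) for v in range(n_vars)}
    while True:
        best_gain, best_edge = 1e-9, None
        for u, v in product(range(n_vars), repeat=2):
            if u == v or u in parents[v] or creates_cycle(parents, u, v):
                continue
            gain = family_bic(data, v, parents[v] | {u}) - scores[v]
            if gain > best_gain:
                best_gain, best_edge = gain, (u, v)
        if best_edge is None:
            return parents  # no edge addition improves the score
        u, v = best_edge
        parents[v].add(u)
        scores[v] = family_bic(data, v, parents[v])

# Synthetic data: variable 1 copies variable 0, variable 2 is independent of both.
data = [(a, a, c) for a in (0, 1) for c in (0, 1)] * 10
learned = hill_climb(data, 3)
edges = {(u, v) for v, ps in learned.items() for u in ps}
```

On this synthetic data the search should keep a single edge between variables 0 and 1 and leave variable 2 unconnected. A fuller local search would also consider edge removals and reversals, and the number of candidate graphs grows quickly with the number of variables, which is why larger instances call for heuristic search rather than exact solvers.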
Other Ideas
- Games with a purpose (GWAP)
- Clustering personality characteristics
Packing everything together: Video Based APC
Conclusions
- Classification of the existing solutions (survey paper)
- Filling the gaps inside the classification tree
- Introducing Bayesian Structure Learning for the APC
- Utilizing metaheuristics in dealing with high dimensionality
- APC potential: social networks, recommender systems, and expert systems
THANK YOU!
Aleksandar Kartelj kartelj@matf.bg.ac.rs
Vladimir Filipovic vladaf@matf.bg.ac.rs
Veljko Milutinovic vm@etf.bg.ac.rs