1
A Straightforward Author Profiling Approach in MapReduce
Suraj Maharjan, Prasha Shrestha, Thamar Solorio and Ragib Hasan
2
Introduction
Author profiling: identifying the age group, gender, native language, personality, and other aspects that constitute the profile of an author by analyzing his/her writings.
3
Applications
Forensics and security
- Threatening e-mails containing spam/malware
- Authorship attribution of old texts: it is not possible to check against every author, so profiling narrows down the list
Marketing
- Find out the demographics/profiles of customers who either love or hate a product
- Find out the target group of customers
4
Dataset

          English                          Spanish
Age       Training  Early Bird   Test      Training  Early Bird   Test
10s          8,600        740      888        1,250        120      144
20s         42,900      3,840    4,608       21,300      1,920    2,304
30s         66,800      6,020    7,224       15,400      1,360    1,632
Total      236,600     21,200   25,440       75,900      6,800    8,160

Counts per age group are per gender (male = female): balanced across gender, imbalanced across age.
Table 1: PAN'13 training, early bird, and test document distribution.
5
Dataset

Language  Training  Early Bird  Test
English   1.8 GB    135 MB      168 MB
Spanish   384 MB    34 MB       40 MB
Total     2.2 GB    169 MB      209 MB
Table 2: PAN'13 data size.
6
Why MapReduce?
- Provides a higher-level abstraction for designing distributed algorithms than MPI or pthreads
- No need to worry about deadlocks, race conditions, or machine failures
7
Methodology
- Preprocessing: sequence file creation, tokenization, DF calculation, filtering
- Features: word n-grams (unigrams, bigrams, trigrams) with TF-IDF weighting
- Classification algorithm: Naïve Bayes
(A sketch of how these jobs chain together follows below.)
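The jobs run back to back, each reading the previous one's output directory. A hypothetical Hadoop driver illustrating the chaining; the class names and paths are illustrative stand-ins, not the authors' code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProfilingPipeline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: tokenization (sequence files of raw text -> n-gram lists)
        Job tokenize = Job.getInstance(conf, "tokenization");
        tokenize.setJarByClass(ProfilingPipeline.class);
        FileInputFormat.addInputPath(tokenize, new Path("data/seqfiles"));
        FileOutputFormat.setOutputPath(tokenize, new Path("data/tokens"));
        // ... set mapper, key/value classes, SequenceFile formats ...
        if (!tokenize.waitForCompletion(true)) System.exit(1);

        // Job 2 (DF calculation) reads data/tokens; Job 3 (filter), Job 4
        // (TF-IDF), and Job 5 (Naive Bayes training) each consume the
        // previous job's output directory in the same way.
    }
}
```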
8
Tokenization Job
Input
- Key: filename (<authorid>_<lang>_<age>_<gender>.xml)
- Value: content
Map
- Lowercase the text and extract 1-, 2-, and 3-grams (Lucene)
Output
- Key: filename
- Value: list of n-gram tokens
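A minimal sketch of such a mapper, assuming (Text, Text) pairs from the sequence files. The slides use Lucene for n-gram extraction, so the plain whitespace split below is a simplifying stand-in, and the tab-joined output is an assumed encoding of the token list:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizationMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text filename, Text content, Context context)
            throws IOException, InterruptedException {
        // Lowercase first; the real job uses a Lucene analyzer and strips
        // XML/HTML tags beforehand.
        String[] words = content.toString().toLowerCase().split("\\s+");
        List<String> ngrams = new ArrayList<>();
        for (int n = 1; n <= 3; n++) {              // uni-, bi-, and trigrams
            for (int i = 0; i + n <= words.length; i++) {
                ngrams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        // Key: filename (<authorid>_<lang>_<age>_<gender>.xml); value: tokens
        context.write(filename, new Text(String.join("\t", ngrams)));
    }
}
```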
9
Tokenization
- Remove XML and HTML tags from the documents (<authorid>_<lang>_<age>_<gender>.xml)
Example of the tokenization job mapping sequence files (filename, content) to (filename, 1,2,3-grams):
F1 "A B C C"  ->  [A, B, C, C, A B, A B C, ...]
F2 "B D E A"  ->  [B, D, E, A, B D E, E A, ...]
F3 "A B D E"  ->  [A, B, D, E, B D, B D E, ...]
F4 "C C C"    ->  [C, C, C, C C, C C C, ...]
F5 "B F G H"  ->  [B, F, G, H, G H, ...]
F6 "A E O U"  ->  [A, E, O, U, E O, A U, ...]
10
DF Calculation Job
Mapper
- Map("<authorid>_<lang>_<age>_<gender>.xml", [a, ab, abc, ...]) -> list(token, 1)
- Similar to word count, but each unique token is emitted only once per document
Reducer
- Reduce(token, list(1, 1, 1, ...)) -> (token, df_count)
Combiner
- Minimizes network data transfer
(See the sketch below.)
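A sketch of this mapper/reducer pair, assuming tab-separated token lists from the tokenization job. The same reducer can serve as the combiner because partial sums compose:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DfJob {
    public static class DfMapper extends Mapper<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Text filename, Text tokens, Context context)
                throws IOException, InterruptedException {
            // Deduplicate within the document: each token counts once per doc.
            Set<String> unique = new HashSet<>(Arrays.asList(tokens.toString().split("\t")));
            for (String token : unique) {
                context.write(new Text(token), ONE);
            }
        }
    }

    // Usable as both combiner and reducer, since sums are associative.
    public static class DfReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int df = 0;
            for (IntWritable c : counts) df += c.get();
            context.write(token, new IntWritable(df));  // (token, df_count)
        }
    }
}
```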
11
DF Calculation Job (diagram): the mapper emits (token, 1) once per document; the framework groups by token, and the reducer sums the counts into DF scores (e.g., from the toy example: A -> 4, C -> 2, E -> 3, F -> 1).
12
Filter Job
- Filters out the least frequent tokens based on DF scores
- Builds a dictionary file that maps each token to a unique integer id
Mapper
- Setup: read in the dictionary file and DF scores
- Map: map("<authorid>_<lang>_<age>_<gender>.xml", token list) -> ("<authorid>_<lang>_<age>_<gender>.xml", filtered token list)
(See the sketch below.)
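A sketch of the filter mapper. The side-file name ("df-scores.txt"), the plain FileReader standing in for real distributed-cache plumbing, and the configuration key are assumptions; the default cutoff of 10 mirrors Table 4's best-performing setting:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<Text, Text, Text, Text> {
    private final Map<String, Integer> df = new HashMap<>();
    private int minDf;

    @Override
    protected void setup(Context context) throws IOException {
        minDf = context.getConfiguration().getInt("filter.min.df", 10);
        // Hypothetical side file with "token<TAB>df" lines; a real job would
        // ship it to each node via the distributed cache.
        try (BufferedReader in = new BufferedReader(new FileReader("df-scores.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                df.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
    }

    @Override
    protected void map(Text filename, Text tokens, Context context)
            throws IOException, InterruptedException {
        List<String> kept = new ArrayList<>();
        for (String token : tokens.toString().split("\t")) {
            if (df.getOrDefault(token, 0) >= minDf) kept.add(token);  // drop rare tokens
        }
        context.write(filename, new Text(String.join("\t", kept)));
    }
}
```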
13
Filter Job (diagram): the map step drops tokens whose DF falls below the cutoff, e.g.:
F1 [A, B, C, C, A B, A B C, ...]  ->  F1 [A, B, C, C, A B, ...]
F2 [B, D, E, A, B D E, E A, ...]  ->  F2 [B, D, E, A, B D E, ...]
F3 [A, B, D, E, B D, B D E, ...]  ->  F3 [A, B, D, E, B D, B D E, ...]
F4 [C, C, C, C C, C C C, ...]     ->  F4 [C, C, C, C C, ...]
F5 [B, F, G, H, G H, ...]         ->  F5 [B, H, ...]
F6 [A, E, O, U, E O, A U, ...]    ->  F6 [A, E, ...]
14
TF-IDF Job
Mapper
- Setup: read in the dictionary and DF score files
- Map("<authorid>_<lang>_<age>_<gender>.xml", filtered token list) -> ("<authorid>_<lang>_<age>_<gender>.xml", VectorWritable)
  - Computes the TF-IDF score for each token
  - Creates a RandomAccessSparseVector (mahout-math)
  - Finally writes the vectors
(See the sketch below.)
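A sketch of the TF-IDF mapper using Mahout's RandomAccessSparseVector and VectorWritable. The dictionary/DF loading in setup() is elided, the configuration keys are made up, and tf * log(N/df) is one common TF-IDF formulation rather than necessarily the paper's exact variant:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TfIdfMapper extends Mapper<Text, Text, Text, VectorWritable> {
    private final Map<String, Integer> dictionary = new HashMap<>();  // token -> id
    private final Map<String, Integer> df = new HashMap<>();          // token -> DF
    private long numDocs;
    private int numFeatures;

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        numDocs = conf.getLong("tfidf.num.docs", 1);
        numFeatures = conf.getInt("tfidf.num.features", 1 << 21);
        // Loading of the dictionary and DF score files (written by earlier
        // jobs) is elided; see the filter mapper for the side-file pattern.
    }

    @Override
    protected void map(Text filename, Text tokens, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens.toString().split("\t")) {
            tf.merge(token, 1, Integer::sum);           // raw term frequency
        }
        Vector vector = new RandomAccessSparseVector(numFeatures);
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer id = dictionary.get(e.getKey());
            if (id == null) continue;                   // token was filtered out
            // tf * log(N / df): one common TF-IDF weighting
            double w = e.getValue() * Math.log((double) numDocs / df.get(e.getKey()));
            vector.set(id, w);
        }
        context.write(filename, new VectorWritable(vector));
    }
}
```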
15
Training
Two classifiers were trained:
- Naïve Bayes (MapReduce), described on the next slide
- Logistic regression (LibLinear):
  - Translate the TF-IDF vectors into a LibLinear feature file (see the sketch below)
  - Use LibLinear to train a logistic regression model
  - LibLinear is also very fast
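A sketch of the vector-to-feature-file translation. LibLinear's sparse text format is one "label index:value ..." line per document with ascending, 1-based feature indices; the class and method names here are hypothetical:

```java
import java.util.Map;
import java.util.TreeMap;

public class LibLinearFormat {
    // Render one training instance as a LibLinear line, e.g. "3 1:0.5 7:1.2".
    static String toLine(int label, Map<Integer, Double> features) {
        StringBuilder sb = new StringBuilder().append(label);
        // TreeMap sorts by feature id: LibLinear requires ascending indices.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            sb.append(' ').append(e.getKey() + 1)       // ids are 1-based
              .append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```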
16
Naïve Bayes MR
Mapper
- Setup: initialize a prior-probability vector
- Map("authorid_age_gender.xml", Vector) -> (id of age_gender class, Vector)
  - Computes partial sums and accumulates per-class document counts in the prior-probability vector
- Cleanup: emit(-1, prior-probability vector)
Reducer
- Vector sum reducer (Mahout)
(See the sketch below.)
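A sketch of the mapper described above. The filename-to-class-id encoding and the in-mapper accumulation of partial sums are assumptions about details the slide leaves out; the reducer would then be Mahout's vector-sum reducer:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.function.Functions;

public class NaiveBayesMapper
        extends Mapper<Text, VectorWritable, IntWritable, VectorWritable> {
    private static final int NUM_CLASSES = 6;       // 3 age groups x 2 genders
    private Vector priors;                          // per-class document counts
    private final Map<Integer, Vector> classSums = new HashMap<>();

    @Override
    protected void setup(Context context) {
        priors = new RandomAccessSparseVector(NUM_CLASSES);
    }

    @Override
    protected void map(Text filename, VectorWritable doc, Context context) {
        int classId = labelOf(filename.toString());
        priors.set(classId, priors.get(classId) + 1);    // count docs per class
        Vector sum = classSums.get(classId);
        if (sum == null) classSums.put(classId, doc.get().clone());
        else sum.assign(doc.get(), Functions.PLUS);      // in-mapper partial sum
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<Integer, Vector> e : classSums.entrySet()) {
            context.write(new IntWritable(e.getKey()), new VectorWritable(e.getValue()));
        }
        context.write(new IntWritable(-1), new VectorWritable(priors));  // priors
    }

    // Hypothetical encoding of "<authorid>_<lang>_<age>_<gender>.xml" -> 0..5.
    private static int labelOf(String name) {
        String[] p = name.replace(".xml", "").split("_");
        String age = p[p.length - 2];
        int ageId = "10s".equals(age) ? 0 : "20s".equals(age) ? 1 : 2;
        return ageId * 2 + ("male".equals(p[p.length - 1]) ? 0 : 1);
    }
}
```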
17
Experiments
- Local Hadoop cluster with 1 master node and 7 slave nodes
- Each node has 16 cores and 12 GB of memory
- Modeled as a 6-class classification problem (3 age groups × 2 genders)
18
Results
Number of features in English: 2,908,894
Number of features in Spanish: 2,806,922

Language  System                   Total (%)  Age (%)  Gender (%)
English   Baseline                   28.40     56.80     50.00
          PAN 2013 English Best      38.94     59.21     64.91
          PAN 2013 Overall Best      38.13     56.90     65.72
          Ours (Test)                42.57     65.62     61.66
          Ours (Early Bird)          43.88     65.80     62.57
Spanish   Baseline                   28.23     56.47     50.00
          PAN 2013 Spanish Best      42.08     64.73     64.30
          PAN 2013 Overall Best      41.58     62.99     65.58
          Ours (Test)                40.32     61.73     64.63
          Ours (Early Bird)          40.62     62.10     64.79
Table 3: Accuracy.
19
Results

Filter at DF  # of features  Accuracy (%)
10            2,908,894      42.57
15            1,885,410      42.37
20            1,403,139      42.26
100             276,902      41.93
200             136,999      37.34
300              90,836      17.23
400              67,938       7.80
500              54,330       5.93
1000             27,151       4.66
Table 4: Accuracy on the test dataset vs. number of features for English.
20
Results

Parameter          Setting                                  Accuracy (%)
N-gram             Word uni-, bi-, and trigrams (TF-IDF)    42.57
                   Character bigrams and trigrams           31.20
                   Word unigrams                            39.99
                   Word bigrams                             41.17
                   Word trigrams                            39.70
                   Stopword unigrams                        29.14
Weighting scheme   TF                                       36.12
Preprocessing      Stopwords filtered and Porter stemming   35.82
Table 5: Accuracy comparison for different parameter settings on the English dataset.
21
Results
Figure 1: Runtimes for English (time in minutes).
22
Results
Figure 2: Runtimes for Spanish (time in minutes).
23
Conclusion and Future Work
- Word n-grams proved to be better features than character n-grams for this task
- Stop words proved to be good features
- TF-IDF performed better than TF alone
- Simple approaches do work
Future work:
- Filter out spam-like documents
- Add more sophisticated features and make them work in MapReduce
24
Thank you.