A Straightforward Author Profiling Approach in MapReduce


1 A Straightforward Author Profiling Approach in MapReduce
Suraj Maharjan, Prasha Shrestha, Thamar Solorio and Ragib Hasan

2 Introduction
Author profiling: identifying the age group, gender, native language, personality, and other aspects that constitute the profile of an author by analyzing his or her writings.

3 Applications
Forensics and security
- Threatening e-mails containing spam or malware
- Authorship attribution of old texts: it is not possible to check against every author, so profiling narrows down the list
Marketing
- Find out the demographics/profiles of customers who either love or hate a product
- Find out the target group of customers

4 Dataset
Table 1: PAN'13 training, early bird, and test document distribution.

                     English                       Spanish
Age    Gender        Training  Early Bird  Test    Training  Early Bird  Test
10s    male/female   8600      740         888     1250      120         144
20s    male/female   42900     3840        4608    21300     1920        2304
30s    male/female   66800     6020        7224    15400     1360        1632
Total                236600    21200       25440   75900     6800        8160

Counts are per gender (the corpus is balanced across gender, so male and female have the same counts); the Total row covers both genders. The data is imbalanced across age.

5 Dataset
Table 2: PAN'13 data size.

Language  Training  Early Bird  Test
English   1.8 GB    135 MB      168 MB
Spanish   384 MB    34 MB       40 MB
Total     2.2 GB    169 MB      209 MB

6 Why MapReduce?
Provides a higher level of abstraction for designing distributed algorithms than MPI or pthreads.
No need to worry about deadlocks, race conditions, or machine failures.

7 Methodology
Preprocessing: sequence file creation, tokenization, DF calculation, feature filtering
Features: word n-grams (unigrams, bigrams, trigrams); weighting scheme: TF-IDF
Classification algorithm: Naïve Bayes

8 Tokenization Job
Input: key = filename (<authorid>_<lang>_<age>_<gender>.xml); value = content
Map: lowercase the text and generate 1-, 2-, and 3-grams (Lucene)
Output: key = filename; value = list of n-gram tokens (a minimal mapper sketch follows)
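
A minimal sketch of such a mapper, assuming the input is already a sequence file of (filename, content) pairs. Plain whitespace splitting stands in for the Lucene analyzer used in the actual system, and the class name and the tab-joined output encoding are illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizeMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text filename, Text content, Context context)
            throws IOException, InterruptedException {
        // Lowercase the document and split on whitespace (the real job uses Lucene).
        String[] words = content.toString().toLowerCase().split("\\s+");
        List<String> ngrams = new ArrayList<>();
        for (int n = 1; n <= 3; n++) {                         // uni-, bi-, and trigrams
            for (int i = 0; i + n <= words.length; i++) {
                ngrams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        // The key stays the filename, so author id, language, age, and gender are preserved.
        context.write(filename, new Text(String.join("\t", ngrams)));
    }
}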

9 Tokenization
Remove XML and HTML tags from the documents (<authorid>_<lang>_<age>_<gender>.xml).
Example: sequence-file contents (filename, content) and the n-gram lists produced by the tokenization map:
F1 "A B C C"  -> [A, B, C, C, A B, A B C, ...]
F2 "B D E A"  -> [B, D, E, A, B D E, E A, ...]
F3 "A B D E"  -> [A, B, D, E, B D, B D E, ...]
F4 "C C C"    -> [C, C, C, C C, C C C, ...]
F5 "B F G H"  -> [B, F, G, H, G H, ...]
F6 "A E O U"  -> [A, E, O, U, E O, A U, ...]

10 DF Calculation Job
Mapper: map("<authorid>_<lang>_<age>_<gender>.xml", [a, ab, abc, ...]) -> list(token, 1); similar to word count, but each unique token is emitted only once per document
Reducer: reduce(token, list(1, 1, 1, ...)) -> (token, df_count)
Combiner: minimizes network data transfer (a sketch of the mapper and reducer follows)
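
A minimal sketch of this job under the same assumptions as the tokenization sketch above (tab-joined n-grams per document); class names are illustrative, and the reducer also serves as the combiner.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DFJob {

    public static class DFMapper extends Mapper<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Text filename, Text tokens, Context context)
                throws IOException, InterruptedException {
            // Deduplicate within the document so each token is counted once per document.
            Set<String> unique = new HashSet<>(Arrays.asList(tokens.toString().split("\t")));
            for (String token : unique) {
                context.write(new Text(token), ONE);
            }
        }
    }

    // Sums the 1s per token; registering it as the combiner as well cuts network traffic.
    public static class DFReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int df = 0;
            for (IntWritable one : ones) {
                df += one.get();
            }
            context.write(token, new IntWritable(df));   // document frequency of the token
        }
    }
}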

11 DF Calculation Job
The intermediate (token, 1) pairs are grouped by token and reduced to DF counts. For the example documents above: A -> 4, B -> 4, C -> 2, D -> 2, E -> 3, F -> 1, ...

12 Filter Job
Filters out the least frequent tokens based on DF scores.
Builds a dictionary file that maps each token to a unique integer id.
Mapper setup: read in the dictionary file and the DF scores
Map: map("<authorid>_<lang>_<age>_<gender>.xml", token list) -> ("<authorid>_<lang>_<age>_<gender>.xml", filtered token list) (a mapper sketch follows)
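
A minimal sketch of the filter mapper. The MIN_DF threshold and the way the DF scores are loaded are assumptions; in the real job setup() would read the dictionary and DF files produced earlier (for example via the distributed cache).

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<Text, Text, Text, Text> {
    private static final int MIN_DF = 10;                  // assumed cut-off; the experiments vary it from 10 to 1000
    private final Map<String, Integer> dfScores = new HashMap<>();

    @Override
    protected void setup(Context context) {
        // Load the token -> DF mappings written by the DF job (omitted here for brevity).
    }

    @Override
    protected void map(Text filename, Text tokens, Context context)
            throws IOException, InterruptedException {
        StringBuilder kept = new StringBuilder();
        for (String token : tokens.toString().split("\t")) {
            Integer df = dfScores.get(token);
            if (df != null && df >= MIN_DF) {              // keep only sufficiently frequent tokens
                if (kept.length() > 0) kept.append('\t');
                kept.append(token);
            }
        }
        context.write(filename, new Text(kept.toString()));
    }
}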

13 Filter Job
Example: n-gram lists before and after filtering on DF counts:
F1 [A, B, C, C, A B, A B C, ...] -> [A, B, C, C, A B, ...]
F2 [B, D, E, A, B D E, E A, ...] -> [B, D, E, A, B D E, ...]
F3 [A, B, D, E, B D, B D E, ...] -> [A, B, D, E, B D, B D E, ...]
F4 [C, C, C, C C, C C C, ...] -> [C, C, C, C C, ...]
F5 [B, F, G, H, G H, ...] -> [B, H, ...]
F6 [A, E, O, U, E O, A U, ...] -> [A, E, ...]

14 TF-IDF Job
Mapper setup: read in the dictionary and DF score files
Map: map("<authorid>_<lang>_<age>_<gender>.xml", filtered token list) -> ("<authorid>_<lang>_<age>_<gender>.xml", VectorWritable)
Computes TF-IDF scores for each token, creates a RandomAccessSparseVector (mahout-math), and finally writes the vectors (a mapper sketch follows)
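
A minimal sketch of the TF-IDF mapper, assuming the dictionary (token -> feature id), the DF scores, and the total document count have been loaded in setup(). The raw tf * log(N/df) weighting shown here is one common variant, not necessarily the exact formula used in the system.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TfIdfMapper extends Mapper<Text, Text, Text, VectorWritable> {
    private final Map<String, Integer> dictionary = new HashMap<>(); // token -> feature id, loaded in setup()
    private final Map<String, Integer> dfScores = new HashMap<>();   // token -> document frequency, loaded in setup()
    private long numDocs = 1;                                        // total training documents, set in setup()

    @Override
    protected void map(Text filename, Text tokens, Context context)
            throws IOException, InterruptedException {
        // Term frequencies within this document.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens.toString().split("\t")) {
            tf.merge(token, 1, Integer::sum);
        }
        Vector vector = new RandomAccessSparseVector(dictionary.size());
        for (Map.Entry<String, Integer> entry : tf.entrySet()) {
            Integer id = dictionary.get(entry.getKey());
            Integer df = dfScores.get(entry.getKey());
            if (id == null || df == null) continue;                  // token was filtered out
            double idf = Math.log((double) numDocs / df);
            vector.set(id, entry.getValue() * idf);                  // tf-idf weight
        }
        context.write(filename, new VectorWritable(vector));
    }
}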

15 Training
Naïve Bayes (MapReduce)
Logistic regression: translate the TF-IDF vectors into a LibLinear feature file and use LibLinear to train; LibLinear is also very fast (a sketch follows)
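
A hedged sketch of handing the TF-IDF vectors to the Java port of LibLinear (de.bwaldvogel.liblinear) directly instead of going through a feature file; the C and epsilon values are illustrative, not the ones used in the paper. The feature-file route the slide mentions uses one line per document of the form "label index:value index:value ...", with feature indices in ascending order.

import de.bwaldvogel.liblinear.Feature;
import de.bwaldvogel.liblinear.Linear;
import de.bwaldvogel.liblinear.Model;
import de.bwaldvogel.liblinear.Parameter;
import de.bwaldvogel.liblinear.Problem;
import de.bwaldvogel.liblinear.SolverType;

public class TrainLogisticRegression {
    // documents: one sparse feature row per training document (built from the TF-IDF vectors);
    // labels: the six age-gender classes encoded as doubles (recent liblinear-java versions use double[]);
    // numFeatures: size of the dictionary.
    static Model train(Feature[][] documents, double[] labels, int numFeatures) {
        Problem problem = new Problem();
        problem.l = documents.length;   // number of training documents
        problem.n = numFeatures;        // dimensionality of the feature space
        problem.x = documents;          // sparse rows, indices sorted ascending
        problem.y = labels;             // class labels
        Parameter param = new Parameter(SolverType.L2R_LR, 1.0, 0.01); // L2-regularized logistic regression
        return Linear.train(problem, param);
    }
}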

16 Naïve Bayes MR
Mapper
Setup: initialize a prior probability vector
Map: compute partial sums and store them in the prior probability vector; map("authorid_age_gender.xml", Vector) -> (id for age_gender class, Vector)
Cleanup: emit(-1, prior probability vector)
Reducer: vector sum reducer (Mahout) (a mapper sketch follows)
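
An illustrative sketch of the training mapper, mirroring the slide: document vectors are emitted under their class id, per-class document counts are accumulated in a prior vector, and that vector is emitted in cleanup() under the reserved key -1 so that Mahout's vector sum reducer can aggregate everything. The filename parsing and the class encoding are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class NaiveBayesMapper extends Mapper<Text, VectorWritable, IntWritable, VectorWritable> {
    private static final int NUM_CLASSES = 6;       // 3 age groups x 2 genders
    private Vector priorCounts;                     // documents seen per class in this mapper

    @Override
    protected void setup(Context context) {
        priorCounts = new DenseVector(NUM_CLASSES); // initialize the prior probability vector
    }

    @Override
    protected void map(Text filename, VectorWritable doc, Context context)
            throws IOException, InterruptedException {
        int label = classId(filename.toString());
        priorCounts.set(label, priorCounts.get(label) + 1);
        context.write(new IntWritable(label), doc); // per-class sums are completed by the reducer
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // The reserved key -1 carries this mapper's contribution to the class priors.
        context.write(new IntWritable(-1), new VectorWritable(priorCounts));
    }

    private static int classId(String filename) {
        // Hypothetical mapping from the <age>_<gender> part of the filename to 0..5.
        String[] parts = filename.replace(".xml", "").split("_");
        String key = parts[parts.length - 2] + "_" + parts[parts.length - 1];
        String[] classes = {"10s_male", "10s_female", "20s_male", "20s_female",
                            "30s_male", "30s_female"};
        for (int i = 0; i < classes.length; i++) {
            if (classes[i].equals(key)) return i;
        }
        throw new IllegalArgumentException("Unrecognized class in filename: " + filename);
    }
}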

17 Experiments
Local Hadoop cluster with 1 master node and 7 slave nodes; each node has 16 cores and 12 GB of memory.
Modeled as a 6-class classification problem.

18 Results
Number of features in English: 2,908,894
Number of features in Spanish: 2,806,922

Table 3: Accuracy.

Language  System                  Total (%)  Age (%)  Gender (%)
English   Baseline                28.4       56.8     50
English   PAN 2013 English Best   38.94      59.21    64.91
English   PAN 2013 Overall Best   38.13      56.9     65.72
English   Ours (Test)             42.57      65.62    61.66
English   Ours (Early Bird)       43.88      65.8     62.57
Spanish   Baseline                28.23      56.47    50
Spanish   PAN 2013 Spanish Best   42.08      64.73    64.3
Spanish   PAN 2013 Overall Best   41.58      62.99    65.58
Spanish   Ours (Test)             40.32      61.73    64.63
Spanish   Ours (Early Bird)       40.62      62.1     64.79

19 Results
Table 4: Accuracy on the test dataset compared with the number of features for English.

Filter at DF  # of features  Accuracy (%)
10            2,908,894      42.57
15            1,885,410      42.37
20            1,403,139      42.26
100           276,902        41.93
200           136,999        37.34
300           90,836         17.23
400           67,938         7.8
500           54,330         5.93
1000          27,151         4.66

20 Results
Table 5: Accuracy comparison for different parameter settings on the English dataset.

Parameter          Setting                                  Accuracy (%)
N-gram             Character bigrams and trigrams           31.2
N-gram             Word uni-, bi-, and trigrams (TF-IDF)    42.57
N-gram             Word unigrams                            39.99
N-gram             Word bigrams                             41.17
N-gram             Word trigrams                            39.7
N-gram             Stopword unigrams                        29.14
Weighting scheme   TF                                       36.12
Preprocessing      Stopwords filtered and Porter stemming   35.82

21 Results
Figure 1: Runtimes (in minutes) for English.

22 Results
Figure 2: Runtimes (in minutes) for Spanish.

23 Conclusion and Future Work
Word n-grams proved to be better features than character n-grams for this task.
Stop words proved to be good features.
TF-IDF is better than TF.
Simple approaches do work.
Future work: filter out spam-like documents; add more sophisticated features and make them work in MapReduce.

24 Thank you.

