A Simple Approach for Author Profiling in MapReduce


A Simple Approach for Author Profiling in MapReduce
Suraj Maharjan, Prasha Shrestha, and Thamar Solorio

Introduction
Task: given an anonymous document, predict the author's
- Age group: [18-24 | 25-34 | 35-49 | 50-64 | 65-plus]
- Gender: [Male | Female]
Provided: training data in English and Spanish
- English: Blog, Reviews, Social Media, and Twitter
- Spanish: Blog, Social Media, and Twitter

Applications
Forensics and security
- Threatening emails; emails containing spam/malware
- Authorship attribution of old texts: it is not possible to check against every author, so profiling narrows down the list
Marketing
- Find out the demographics/profiles of customers who either love or hate a product
- Find out target groups of customers

Motivation
- Started experimenting with PAN'13 data
- PAN'13 dataset: 1.8 GB of training data for English, 384 MB of training data for Spanish
- Explored MapReduce for fast processing of such huge amounts of data

Table 1: Training data distribution.

                     English              Spanish
Category          Files   Size (MB)    Files   Size (MB)
Blog                147       7.6         88       8.3
Reviews            4160      18.3          -         -
Social Media       7746     562.3       1272      51.9
Twitter             306     104.0        178      85.0
Total             12359     692.2       1538     145.2

Methodology
- Preprocessing: sequence file creation, tokenization, DF calculation, feature filtering
- Features: word n-grams (unigrams, bigrams, trigrams); weighting scheme: TF-IDF
- Classification algorithm: logistic regression with L2-norm regularization
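To make the pipeline concrete, here is a minimal driver sketch that runs the four stages back to back. The class names (TokenizationJob, DfJob, FilterJob, TfIdfJob) and the path names are hypothetical stand-ins, not the authors' code; each is assumed to be a Hadoop Tool wrapping one MapReduce job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver chaining the four pipeline stages in order. Each
// stage reads the previous stage's output directory, so they run serially.
public class AuthorProfilingPipeline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String raw = args[0];         // <authorid>_<lang>_<age>_<gender>.xml files
        String tokens = "tokens";     // sequence files of n-gram lists
        String df = "df";             // document-frequency counts
        String filtered = "filtered"; // tokens surviving the DF cut-off
        String vectors = "tfidf";     // TF-IDF feature vectors for the classifier

        int rc = ToolRunner.run(conf, new TokenizationJob(), new String[]{raw, tokens});
        if (rc == 0) rc = ToolRunner.run(conf, new DfJob(), new String[]{tokens, df});
        if (rc == 0) rc = ToolRunner.run(conf, new FilterJob(), new String[]{tokens, df, filtered});
        if (rc == 0) rc = ToolRunner.run(conf, new TfIdfJob(), new String[]{filtered, df, vectors});
        System.exit(rc);
    }
}
```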

Tokenization Job
Input
- Key: filename (<authorid>_<lang>_<age>_<gender>.xml)
- Value: content
Map
- Lowercase
- CMU ARK Tweet Part-of-Speech Tagger
- 1, 2, 3-grams (Lucene)
Output
- Key: filename
- Value: list of n-gram tokens

Tokenization (example)
Preprocessing removes the XML and HTML tags from the <authorid>_<lang>_<age>_<gender>.xml documents and packs (filename, content) pairs into sequence files. The tokenization job's map step then expands each document into its 1, 2, 3-gram list:

F1 "A B C C"  ->  F1 [A, B, C, C, A B, A B C, ...]
F2 "B D E A"  ->  F2 [B, D, E, A, B D E, E A, ...]
F3 "A B D E"  ->  F3 [A, B, D, E, B D, B D E, ...]
F4 "C C C"    ->  F4 [C, C, C, C C, C C C, ...]
F5 "B F G H"  ->  F5 [B, F, G, H, G H, ...]
F6 "A E O U"  ->  F6 [A, E, O, U, E O, A U, ...]
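A minimal sketch of such a tokenization mapper follows. The actual system lowercases, tokenizes with the CMU ARK tagger, and builds the n-grams with Lucene; here plain whitespace splitting stands in for both tools so the example stays self-contained.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenization mapper sketch: key = filename, value = cleaned document text.
// Emits the document's word 1-, 2-, and 3-grams as a tab-separated list.
public class TokenizationMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text filename, Text content, Context context)
            throws IOException, InterruptedException {
        String[] words = content.toString().toLowerCase().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int n = 1; n <= 3; n++) {                    // unigrams, bigrams, trigrams
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder gram = new StringBuilder(words[i]);
                for (int j = 1; j < n; j++) gram.append(' ').append(words[i + j]);
                out.append(gram).append('\t');
            }
        }
        context.write(filename, new Text(out.toString().trim()));
    }
}
```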

DF Calculation Job
Mapper
- map("<authorid>_<lang>_<age>_<gender>.xml", [a, ab, abc, ...]) -> list(token, 1)
- Similar to word count, but each distinct token is emitted only once per document
Reducer
- reduce(token, list(1, 1, 1, ...)) -> (token, df_count)
Combiner
- Minimizes network data transfer

DF Calculation (example)
The (token, 1) pairs emitted by the mappers are grouped by token, and the reduce step sums them into DF counts; for the documents above this yields A: 4, B: 4, C: 2, D: 2, E: 3, F: 1.
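A sketch of the DF job as the slides describe it: the mapper de-duplicates tokens per document with a set, and the summing reducer doubles as the combiner, which is where the network savings come from.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DfJob {
    // Like word count, but a token is emitted at most once per document,
    // so the summed counts are document frequencies rather than term counts.
    public static class DfMapper extends Mapper<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Text filename, Text tokenList, Context context)
                throws IOException, InterruptedException {
            Set<String> seen = new HashSet<>();
            for (String token : tokenList.toString().split("\t")) {
                if (seen.add(token)) {            // first occurrence in this document only
                    context.write(new Text(token), ONE);
                }
            }
        }
    }

    // Summing reducer; registering the same class as the combiner pre-aggregates
    // counts on each node and minimizes the data shuffled over the network.
    public static class DfReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int df = 0;
            for (IntWritable c : counts) df += c.get();
            context.write(token, new IntWritable(df));
        }
    }
}
```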

Filter Job
- Filters out the least frequent tokens based on their DF scores
- Builds a dictionary file mapping each token to a unique integer id
Mapper
- Setup: read in the dictionary file and DF scores
- Map: map("<authorid>_<lang>_<age>_<gender>.xml", token list) -> ("<authorid>_<lang>_<age>_<gender>.xml", filtered token list)

Filter Job (example)
Tokens whose DF score falls below the cut-off are dropped from every document's n-gram list:

F1 [A, B, C, C, A B, A B C, ...]  ->  F1 [A, B, C, C, A B, ...]
F2 [B, D, E, A, B D E, E A, ...]  ->  F2 [B, D, E, A, B D E, ...]
F3 [A, B, D, E, B D, B D E, ...]  ->  F3 [A, B, D, E, B D, B D E, ...]
F4 [C, C, C, C C, C C C, ...]     ->  F4 [C, C, C, C C, ...]
F5 [B, F, G, H, G H, ...]         ->  F5 [B, H, ...]
F6 [A, E, O, U, E O, A U, ...]    ->  F6 [A, E, ...]
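A sketch of the filter mapper, assuming the dictionary of surviving tokens is available to each task as a local file (the file name is a stand-in; in practice it would be shipped via Hadoop's distributed cache):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only filter job: setup() loads the tokens that survived the DF
// cut-off; map() drops every token outside that dictionary.
public class FilterMapper extends Mapper<Text, Text, Text, Text> {
    private final Set<String> dictionary = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "dictionary.txt": one surviving token per line (hypothetical name).
        try (BufferedReader in = new BufferedReader(new FileReader("dictionary.txt"))) {
            String line;
            while ((line = in.readLine()) != null) dictionary.add(line.trim());
        }
    }

    @Override
    protected void map(Text filename, Text tokenList, Context context)
            throws IOException, InterruptedException {
        StringBuilder kept = new StringBuilder();
        for (String token : tokenList.toString().split("\t")) {
            if (dictionary.contains(token)) kept.append(token).append('\t');
        }
        context.write(filename, new Text(kept.toString().trim()));
    }
}
```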

TF-IDF Job
Mapper
- Setup: read in the dictionary and DF score files
- Map: map("<authorid>_<lang>_<age>_<gender>.xml", filtered token list) -> ("<authorid>_<lang>_<age>_<gender>.xml", VectorWritable)
  - Computes the TF-IDF score for each token
  - Creates a RandomAccessSparseVector (mahout-math)
  - Finally writes the vectors out
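A sketch of the TF-IDF mapper with mahout-math, using the standard tf × log(N/df) weighting (the exact IDF variant is an assumption); reading the dictionary, DF scores, and corpus counts in setup() is elided.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// TF-IDF mapper sketch: one sparse vector per author document, with
// index = the token's dictionary id and value = its tf-idf score.
public class TfIdfMapper extends Mapper<Text, Text, Text, VectorWritable> {
    private Map<String, Integer> tokenIds;  // token -> integer id (from dictionary)
    private Map<String, Integer> dfScores;  // token -> document frequency
    private long numDocs;                   // documents in the training corpus
    private int dictSize;                   // vector cardinality

    @Override
    protected void map(Text filename, Text tokenList, Context context)
            throws IOException, InterruptedException {
        // Raw term frequencies for this document.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokenList.toString().split("\t")) {
            tf.merge(token, 1, Integer::sum);
        }
        Vector vector = new RandomAccessSparseVector(dictSize);
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs / dfScores.get(e.getKey()));
            vector.setQuick(tokenIds.get(e.getKey()), e.getValue() * idf);
        }
        context.write(filename, new VectorWritable(vector));
    }
}
```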

Training
Classifiers trained:
- Naïve Bayes (MapReduce)
- Cosine Similarity (MapReduce)
- Weighted Cosine Similarity (MapReduce)
- Logistic Regression (LibLinear)
- SVM (LibLinear)
The final model uses LibLinear's logistic regression (see the sketch below):
- Translate the TF-IDF vectors into a LibLinear feature file
- Use LibLinear to train the logistic regression model
- LibLinear is also very fast
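A sketch of this last step with the LibLinear Java port (de.bwaldvogel.liblinear); assembling the sparse feature rows and labels from the TF-IDF vectors is elided, and C = 1.0 / eps = 0.01 are illustrative defaults, not the authors' settings.

```java
import de.bwaldvogel.liblinear.Feature;
import de.bwaldvogel.liblinear.Linear;
import de.bwaldvogel.liblinear.Model;
import de.bwaldvogel.liblinear.Parameter;
import de.bwaldvogel.liblinear.Problem;
import de.bwaldvogel.liblinear.SolverType;

// Trains L2-regularized logistic regression on sparse TF-IDF rows.
public class TrainLogisticRegression {
    public static Model train(Feature[][] x, double[] y, int numFeatures) {
        Problem problem = new Problem();
        problem.l = y.length;      // number of training documents
        problem.n = numFeatures;   // feature space size (e.g. 7,299,609 for English)
        problem.x = x;             // sparse rows; feature indices must be ascending
        problem.y = y;             // one of the 10 age-gender class labels per row
        // L2R_LR = logistic regression with L2-norm regularization.
        Parameter param = new Parameter(SolverType.L2R_LR, 1.0, 0.01);
        return Linear.train(problem, param);
    }
}
```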

Experiments
- Local Hadoop cluster with 1 master node and 7 slave nodes; each node has 16 cores and 12 GB of memory
- Training data split in a 70:30 ratio into training and development sets
- Modeled as a single 10-class classification problem (5 age groups × 2 genders; one possible encoding is sketched below)
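The slides do not spell out the label encoding, but with 5 age groups and 2 genders one hypothetical mapping onto the 10 class ids, parsed from the training file names, could look like this:

```java
// Hypothetical encoding of the joint 10-class label (5 age groups x 2 genders),
// with age and gender taken from <authorid>_<lang>_<age>_<gender>.xml.
public class Labels {
    private static final String[] AGES =
        {"18-24", "25-34", "35-49", "50-64", "65-plus"};

    // Returns 0..9: even ids for male, odd ids for female, ordered by age group.
    public static int labelFor(String age, String gender) {
        for (int i = 0; i < AGES.length; i++) {
            if (AGES[i].equals(age)) {
                return 2 * i + ("female".equalsIgnoreCase(gender) ? 1 : 0);
            }
        }
        throw new IllegalArgumentException("unknown age group: " + age);
    }
}
```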

Classification Algorithm Experiments

Table 2: Accuracy for word 1, 2, 3-grams on the cross-validation dataset ("-" marks values not present in the source).

                              English (%)                       Spanish (%)
Classification Algorithm      Blogs  Reviews  Soc. M.  Twitter  Blog   Soc. M.  Twitter
Naïve Bayes                   27.50  21.55    20.62    28.89    55.00  20.48    34.78
Cosine Similarity             20.00  23.64    19.72    27.78    35.00  26.33    36.96
Weighted Cosine Similarity    30.00  23.16    19.97    26.67    40.00  22.07    32.61
Logistic Regression           23.08  33.33    25.80    -        -      -        -
SVM                           25.00  22.28    19.80    32.22    -      -        -

Table 3: Accuracy for character 2, 3-grams on the cross-validation dataset.

                              English (%)                       Spanish (%)
Classification Algorithm      Blogs  Reviews  Soc. M.  Twitter  Blog   Soc. M.  Twitter
Naïve Bayes                   25.00  18.99    18.33    24.44    40.00  19.68    23.91
Cosine Similarity             20.00  21.63    17.90    30.00    50.00  21.81    26.09
Weighted Cosine Similarity    21.15  16.78    23.33    28.26    -      -        -
Logistic Regression           22.50  21.71    25.56    35.00    23.67  17.39    -
SVM                           20.83  15.92    23.14    -        -      -        -

Classification Algorithm Experiments
- Separate models: different models for blog, social media, Twitter, and reviews per language
- Single model: one combined model per language

Table 4: Accuracy for single and separate models over all categories.

                              English (%)                Spanish (%)
Classification Algorithm      Separate  Single           Separate  Single
Naïve Bayes                   21.21     20.13            23.53     21.04
Cosine Similarity             19.89     17.34            27.83     27.60
Weighted Cosine Similarity    21.32     18.18            23.98     24.89
Logistic Regression           21.83     21.92            26.92     28.96
SVM                           20.99     20.48            27.37     28.05

Results
- Number of features in English: 7,299,609
- Number of features in Spanish: 1,154,270

Table 5: Accuracy comparison with other systems.

System        Average Accuracy (%)
PAN'14 Best   28.95
Ours          27.60
Baseline      14.04

Results

Table 6: Accuracy (Both / Age / Gender, %) and runtime by category and language on the test datasets ("-" marks values not present in the source).

                            Test set 1                          Test set 2
Language  Category          Both   Age    Gender  Runtime      Both   Age    Gender  Runtime
English   Blog              16.67  25.00  54.17   00:01:50     23.08  38.46  57.69   0:01:56
          Reviews           20.12  28.05  62.80   00:01:46     22.23  33.31  66.87   0:02:13
          Social Media      20.09  36.27  53.32   00:07:18     20.62  36.52  53.82   0:26:31
          Twitter           40.00  43.33  73.33   00:02:01     30.52  44.16  66.88   0:02:31
Spanish   Blog              28.57  42.86  57.14   00:00:35     46.43  -      -       0:00:39
          Social Media      30.33  40.16  68.03   00:01:13     28.45  42.76  64.49   0:03:26
          Twitter           61.54  69.23  88.46   00:00:43     61.11  65.56  -       0:01:10

Conclusion
- Word n-grams proved to be better features than character n-grams for this task
- MapReduce is ideal for feature extraction from large datasets
- Our system works better when there is a large dataset
- Simple approaches do work

Thank you.