Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors K Santosh Aditya Joshi Manish Gupta Vasudeva Varma


Real World Problems: Predicting Latent User Attributes from Text ●Gender? ●Age? ●Personality? ●Native Language? ●Profession?

Why? ●Forensics: Language as evidence. ●Marketing: Recommend products. ●Query Expansion: Suggest queries based on attributes. ●Mapping different social media profiles of a user: Latent attributes can be used as evidence.

Attributes considered Age? Gender?

Previous Approaches ●Explored contextual and stylistic differences between classes. ●Used content-based features (word n-grams) and style-based features (part-of-speech n-grams).

Drawbacks ●Ignored semantic relation between words. ●Could not handle polysemy.

Our Contributions Enhanced the document representation using two new features. ●Wikipedia concepts found in the text ●Parent categories of these Wikipedia concepts

System Overview (pipeline diagram) ●Training: Training Docs → Preprocess → Entity Linking → Category Extraction → Feature Representation → KNN or SVM Model ●Testing: Test Doc → Preprocess → Entity Linking → Category Extraction → Feature Representation → KNN or SVM Model → Top K Documents → Extract Profiles → Age, Gender

Semantic Representation of Documents (1) ●Preprocessing Data o The text from blogs is preprocessed to remove unwanted content. ●Entity Linking o TAGME is used to find Wikipedia concepts in the text. o It uses anchor text found in Wikipedia as spots and the pages linked from them as their possible senses. o This handles the polysemy problem.
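The spot-and-sense idea behind the entity linking step can be sketched in a few lines. This is only a toy illustration, not TAGME itself: the anchor dictionary `ANCHORS`, the context sets in `PAGE_CONTEXT`, and the function `link_entities` are all hypothetical, and the overlap-based scoring is an assumed stand-in for TAGME's actual disambiguation.

```python
# Toy anchor dictionary: surface form (spot) -> candidate Wikipedia pages (senses).
# All entries are hypothetical illustrations, not TAGME data.
ANCHORS = {
    "apple": ["Apple Inc.", "Apple (fruit)"],
    "iphone": ["IPhone"],
    "pie": ["Pie"],
}

# Hypothetical context words associated with each candidate page.
PAGE_CONTEXT = {
    "Apple Inc.": {"iphone", "mac", "company"},
    "Apple (fruit)": {"pie", "tree", "juice"},
    "IPhone": {"apple", "phone"},
    "Pie": {"apple", "baked"},
}


def link_entities(text):
    """Map each known spot to the sense that best overlaps the other tokens."""
    tokens = text.lower().split()
    linked = []
    for tok in tokens:
        candidates = ANCHORS.get(tok)
        if not candidates:
            continue  # token is not a known anchor text
        context = set(tokens) - {tok}
        # Assumed scoring: number of context words shared with the page.
        best = max(candidates, key=lambda page: len(PAGE_CONTEXT[page] & context))
        linked.append(best)
    return linked


tech_post = link_entities("my new apple iphone")
food_post = link_entities("baked an apple pie")
```

The same surface form "apple" resolves to different pages in the two posts, which is the sense in which linking to Wikipedia concepts addresses polysemy.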

Semantic Representation of Documents (2) ●Finding Parent Categories for Wikipedia Concepts o A Wikipedia category network is built from the Wikipedia category corpus. o Parent categories of each Wikipedia concept are extracted up to five levels. o Semantically related words are mapped to the same Wikipedia categories at various levels.
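Extracting parent categories up to five levels amounts to a depth-limited breadth-first walk over the category network. A minimal sketch follows, assuming the network is available as a child-to-parents mapping; the `PARENTS` graph and the function name `parent_categories` are hypothetical and stand in for the real Wikipedia category corpus.

```python
from collections import deque

# Hypothetical child -> parents edges; a real system would load these
# from the Wikipedia category corpus.
PARENTS = {
    "IPhone": ["Smartphones"],
    "Android (operating system)": ["Smartphones"],
    "Smartphones": ["Mobile phones"],
    "Mobile phones": ["Telecommunications equipment"],
    "Telecommunications equipment": ["Electronics"],
    "Electronics": ["Technology"],
    "Technology": ["Main topic classifications"],
}


def parent_categories(concept, max_depth=5):
    """Return {category: level} for all ancestors within max_depth levels."""
    levels = {}
    queue = deque([(concept, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb beyond the depth limit
        for parent in PARENTS.get(node, []):
            if parent not in levels:  # keep the shortest level for each category
                levels[parent] = depth + 1
                queue.append((parent, depth + 1))
    return levels


iphone_cats = parent_categories("IPhone")
android_cats = parent_categories("Android (operating system)")
shared = set(iphone_cats) & set(android_cats)
```

Two different concepts end up sharing categories such as "Smartphones", so documents that mention related but distinct terms overlap in the category features.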

Age and Gender Prediction Two Machine Learning classification models used ●K Nearest Neighbour (KNN). ●Support Vector Machines (SVM).

Dataset ●Datasets used for training and testing are provided by PAN. ●Datasets are available at link

KNN ●Boost factor for each field c is learnt using

KNN ●Figures on the previous slide show that each of the features is important for the prediction task. ●On validation data, the best accuracy was obtained at k=5 for gender prediction and k=7 for age prediction. Hence, these values of k are used for testing.
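The KNN step with per-field boost factors can be sketched as a weighted similarity search followed by a majority vote. The formula used to learn the boost factors is not preserved in the transcript, so the sketch below simply takes the boosts as given; the cosine similarity, the field names, and the functions `cosine` and `knn_predict` are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_predict(test_doc, train_docs, boosts, k=5):
    """Score training docs by boosted per-field similarity, vote over the top k."""
    scored = []
    for fields, label in train_docs:
        score = sum(
            boosts[f] * cosine(test_doc.get(f, {}), fields.get(f, {}))
            for f in boosts
        )
        scored.append((score, label))
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]


# Hypothetical training documents: (fields, age-group label).
train = [
    ({"concepts": {"IPhone": 1}, "categories": {"Smartphones": 1}}, "20s"),
    ({"concepts": {"Android": 1}, "categories": {"Smartphones": 1}}, "20s"),
    ({"concepts": {"Pie": 1}, "categories": {"Desserts": 1}}, "30s"),
    ({"concepts": {"Cake": 1}, "categories": {"Desserts": 1}}, "30s"),
]
test_doc = {"concepts": {"Nexus": 1}, "categories": {"Smartphones": 1}}
prediction = knn_predict(
    test_doc, train, boosts={"concepts": 1.0, "categories": 2.0}, k=3
)
```

The test document shares no concept with any training document, yet the shared category still pulls it toward the right neighbours, which illustrates why the category field is worth boosting.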

SVM ●Along with the Wikipedia concepts and categories found in the text, the following features are also used: o Content-based features: word n-grams up to trigrams. o Style features: POS n-grams up to trigrams.
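The combined feature set fed to the SVM can be sketched as one sparse count vector per document. This is a minimal sketch assuming the text is already tokenised and POS-tagged and that concepts and categories have been extracted upstream; the helper names `ngrams` and `build_features` and the feature-name prefixes are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Count all n-grams of length 1..max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def build_features(words, pos_tags, concepts, categories):
    """Merge content, style, and Wikipedia features into one sparse vector."""
    features = Counter()
    groups = (
        ("w", ngrams(words)),            # content: word n-grams
        ("pos", ngrams(pos_tags)),       # style: POS n-grams
        ("concept", Counter(concepts)),  # Wikipedia concepts
        ("cat", Counter(categories)),    # parent categories
    )
    for prefix, counts in groups:
        for key, value in counts.items():
            features[f"{prefix}={key}"] += value
    return features


# Hypothetical pre-tagged input for a single blog sentence.
vec = build_features(
    words=["i", "love", "my", "phone"],
    pos_tags=["PRP", "VBP", "PRP$", "NN"],
    concepts=["IPhone"],
    categories=["Smartphones"],
)
```

Prefixing each feature name keeps the four groups from colliding, so a word such as "NN" in the text would not be confused with the POS tag of the same spelling.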

Results
Features | Classifier | Gender | Age
Wikipedia semantic | KNN | |
Wikipedia semantic | SVM | |
Word n-grams | SVM | |
POS n-grams | SVM | |
Wikipedia semantic + Word n-grams | SVM | |
Wikipedia semantic + POS n-grams | SVM | |
Wikipedia semantic + Word n-grams + POS n-grams | SVM | |
Meina et al. | Random Forests | |

Conclusion ●The document representation is enhanced using Wikipedia concepts and category information. ●Experimental results show that the proposed approach beats the best approach for a similar task at CLEF.

●Enhancing the entity linking part of the proposed system could further improve the overall accuracy of age and gender prediction.