Published by Kenneth Lane
1
Extracting Interest Tags from Twitter User Biographies
Ying Ding, Jing Jiang
School of Information Systems, Singapore Management University
AIRS 2014, Kuching, Malaysia
2
Social Media and Personal Data
Dec 5, 2014, AIRS 2014
Much personal information is revealed in social media
–Content, links, and ratings reveal personal preferences
All this information is useful to
–Researchers: social science
–Businesses: targeted advertising
3
User Biographies in Twitter
Self-introductions written in free form
Reflect users’ background and interests
4
User Biographies in Twitter
[Figure labels: profession, interests, age]
Around 28% of Singapore Twitter users and 50% of US Twitter users revealed their personal interests in their biographies.
Dong Wei et al. Who am I on Twitter?: A cross-country comparison. WWW 2014
5
Outline
Background
Our task
Syntactic patterns of interest tags
Build training data + gold standard
Method
Experiments
Summary
6
Our task
Automatically extract phrases that describe a user’s personal interests.
–We call them “interest tags”
–A typical information extraction problem
–Automatically build training data based on common syntactic patterns
7
Method
Linear-chain CRF
BIO labels
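To make the BIO scheme concrete, here is a minimal sketch (not the authors' code) that turns a tokenized biography with known interest-tag spans into B/I/O labels, the per-token targets a linear-chain CRF would be trained to predict. The helper name `bio_encode` and the example biography are hypothetical.

```python
from typing import List, Tuple


def bio_encode(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Label each token B (begins a tag), I (inside a tag), or O (outside).

    `spans` holds (start, end) token indices of interest tags, end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


# Hypothetical biography with two interest tags: "rock music" and "football".
tokens = ["I", "love", "rock", "music", "and", "football"]
labels = bio_encode(tokens, [(2, 4), (5, 6)])
print(list(zip(tokens, labels)))
```

A multi-word tag such as "rock music" becomes B followed by I, so adjacent tags stay distinguishable.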
8
Syntactic Patterns of Interest Tags
Based on manual annotation of 500 user biographies.
28.8% of user biographies contain meaningful interest tags.
9
Building Training Data
Seed patterns:
–Play + [NP]
–[NP] + fan
–Interested in + [NP]
Steps:
–Use seed patterns to extract noun phrases and rank them by frequency
–Pick the top 100 ranked noun phrases and use them as positive instances to train the CRF
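The two steps above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the authors' pipeline: the noun phrase [NP] is approximated by a single lowercase word matched with a regular expression (a real system would use a POS tagger or chunker), and the function name `rank_seed_phrases` and the sample biographies are hypothetical.

```python
import re
from collections import Counter

# Assumption: a single word stands in for [NP]; the slides do not say
# how noun phrases were actually identified.
WORD = r"[a-z][a-z'-]*"
SEED_PATTERNS = [
    re.compile(rf"\bplay(?:s|ing)?\s+({WORD})"),   # Play + [NP]
    re.compile(rf"\b({WORD})\s+fan\b"),            # [NP] + fan
    re.compile(rf"\binterested\s+in\s+({WORD})"),  # Interested in + [NP]
]


def rank_seed_phrases(bios, top_k=100):
    """Count phrases captured by the seed patterns; return the top_k by frequency."""
    counts = Counter()
    for bio in bios:
        text = bio.lower()
        for pattern in SEED_PATTERNS:
            counts.update(m.group(1) for m in pattern.finditer(text))
    return counts.most_common(top_k)


# Hypothetical biographies for illustration only.
bios = [
    "I play guitar and football",
    "Huge football fan",
    "Interested in photography",
    "Football fan, play guitar",
]
print(rank_seed_phrases(bios, top_k=2))
```

The top-ranked phrases would then be marked as positive instances wherever they occur in biographies, which is what makes the resulting training data noisy.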
10
Features
Syntactic and dependency features are not used because Twitter text is too noisy to parse reliably
Both lexical and POS tag features are used
To avoid over-fitting, only features extracted from the tokens surrounding each position are used
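One way to read this feature design is sketched below: for each position, take the word and POS tag of the neighbouring tokens and, by default, leave out the token itself so the model cannot simply memorize the seed phrases. That reading, the window size, the function name `context_features`, and the hand-assigned POS tags are all assumptions; the slides do not give these details.

```python
def context_features(tokens, pos_tags, i, window=1, include_current=False):
    """Build lexical and POS features from the tokens around position i."""
    feats = {"bias": 1.0}
    for offset in range(-window, window + 1):
        if offset == 0 and not include_current:
            continue  # assumed reading: skip the token being labeled
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset}]"] = tokens[j].lower()
            feats[f"pos[{offset}]"] = pos_tags[j]
        else:
            feats[f"pad[{offset}]"] = True  # position falls outside the sentence
    return feats


# Hypothetical example; the POS tags are assigned by hand for illustration.
tokens = ["I", "love", "football", "!"]
pos_tags = ["PRP", "VBP", "NN", "."]
feats = context_features(tokens, pos_tags, 2)
print(feats)
```

Feature dictionaries of this shape, one per token, are a common input format for CRF toolkits.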
11
Gold Standard
Two annotators: graduate students
500 randomly sampled user biographies
1190 sentences
–The two annotators disagree on 10 sentences
–High agreement (1180 of 1190 sentences, about 99.2%)
12
Experiment
BL-700: baseline using the 700 most frequent phrases; 700 was chosen because it gave the highest F-score among the cutoffs tried.
Seed: use the seed patterns to recognize interest tags
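For reference, the F-score used to compare methods and to pick the baseline cutoff combines precision and recall. The sketch below assumes exact matching of (biography id, phrase) pairs against the gold standard; the slides do not state the matching criterion, and the function name and sample data are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over sets of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (biography id, phrase) pairs for illustration only.
gold = {(1, "football"), (1, "guitar"), (2, "photography")}
predicted = {(1, "football"), (2, "photography"), (2, "coffee")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Choosing a cutoff such as 700 then amounts to computing this score for each candidate list size and keeping the best one.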
13
Extracted Patterns
Some popular patterns are:
–[Interest tag] + fan/lover/enthusiast
–I love + [interest tag]
–[Interest tag] is/are my life
14
Is it difficult to predict interest tags from users’ tweets?
15
Is it difficult to predict interest tags from users’ tweets?
We also applied TF-IDF ranking, which has been used to extract personalized user tags, to extract user interest tags from tweets.
Interest tags extracted from users’ biographies are not necessarily reflected in their tweets.
They can serve as supplementary information when profiling a user.
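A minimal sketch of TF-IDF ranking over tweets is shown below, treating all of a user's tweets as one document. The slides do not specify the weighting variant, so the plain term-frequency times log inverse-document-frequency form used here is an assumption, as are the function name `tfidf_rank` and the toy data.

```python
import math
from collections import Counter


def tfidf_rank(user_docs, user, top_k=3):
    """Rank one user's terms by TF-IDF.

    `user_docs` maps each user to the token list of all their tweets.
    """
    n_users = len(user_docs)
    doc_freq = Counter()
    for tokens in user_docs.values():
        doc_freq.update(set(tokens))  # count each term once per user
    term_freq = Counter(user_docs[user])
    total = sum(term_freq.values())
    scores = {
        term: (count / total) * math.log(n_users / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


# Hypothetical tokenized tweets for illustration only.
user_docs = {
    "alice": ["coffee", "monday", "coffee", "traffic"],
    "bob": ["monday", "football", "football", "traffic"],
    "carol": ["monday", "exam", "exam", "coffee"],
}
top_terms = tfidf_rank(user_docs, "alice", top_k=2)
print(top_terms)
```

Terms that every user mentions get a score of zero, so the ranking favours words distinctive to one user; those words need not match what the user lists as interests in a biography.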
16
Summary
We studied the problem of extracting interest tags from Twitter user biographies
We automatically built noisy training data based on syntactic patterns
We trained a CRF classifier on the noisy training data and achieved decent performance
Interest tags extracted from Twitter user biographies may not be reflected in users’ tweets
17
Thank you! Questions?