Emerging Topic Detection on Twitter (Cataldi et al., MDMKDD 2010)
Padmini Srinivasan
Computer Science Department
Department of Management Sciences
http://cs.uiowa.edu/~psriniva
padmini-srinivasan@uiowa.edu
Twitter
2 to 3 new users/second!!
175 million users as of today
95 million tweets written each day
300 employees and hiring!
Prompt changed from “What are you doing?” to “What’s happening?”
Daily chatter, URL sharing, news
– Distributed/local reporting; no filters; no editing
– Fastest, lowest-level information service
Not a social network but an information network
Aggregators & others
Plenty:
– Tweetmeme: separates the kinds of media linked to (video, image, …)
– Twitter search
– Twistori: aggregates emotions
– Tweetsentiments.com
– Retail Twitter aggregation
– TweetTabs
– Where, what, when
Finding ‘Emergent’ Topics
What is it?
– A topic that is popular in the current time interval and was not popular in the past
5 steps (see the sketch after this list):
– Represent tweets as vectors of terms (language independent)
– Graph active authors’ social relationships (PageRank)
– Model each term’s life cycle: “novel aging theory”
– Rank terms by their “energy” and select the top few
– Create a navigable topic graph linking emerging terms with co-occurring ones: emerging topics
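A minimal Python sketch of the pipeline, to make the steps concrete. Author authority (step 2) is assumed to be given, e.g. from PageRank over the follower graph (a toy PageRank example appears after the next slide). The tokenizer, the default author weight, the exact energy formula, and the fixed top-k cut-off are illustrative assumptions that approximate the paper’s “aging theory”, not a reproduction of it.

```python
# Simplified, illustrative sketch of the five steps (names and weights are assumptions).
from collections import Counter, defaultdict

def term_vector(text):
    """Step 1: a tweet as a bag of lower-cased terms (language independent)."""
    return Counter(text.lower().split())

def nutrition(tweets, authority):
    """Per-term 'nutrition' for one time interval, weighting each occurrence
    by the author's authority (step 2, PageRank scores, assumed given)."""
    nutr = defaultdict(float)
    for author, text in tweets:
        w = authority.get(author, 0.1)          # assumed default weight
        for term, count in term_vector(text).items():
            nutr[term] += count * w
    return nutr

def energy(history, s=3):
    """Steps 3-4: compare current nutrition with the previous s intervals;
    more recent intervals count more via the 1/(t - x) factor."""
    t = len(history) - 1
    current = history[t]
    return {
        term: sum((current[term] ** 2 - history[x].get(term, 0.0) ** 2) / (t - x)
                  for x in range(max(0, t - s), t))
        for term in current
    }

def emerging_topics(history, current_tweets, s=3, top_k=5):
    """Step 4: keep the top-k terms by energy. Step 5: link each one to the
    terms it co-occurs with in the current interval's tweets."""
    e = energy(history, s)
    top = sorted(e, key=e.get, reverse=True)[:top_k]
    graph = defaultdict(set)
    for _, text in current_tweets:
        terms = set(term_vector(text))
        for term in terms & set(top):
            graph[term] |= terms - {term}
    return top, dict(graph)
```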
Cool points
Nice overview with enough detail about text representation, processing, etc.
Hypothesis: ideas flow from the geographical origin of an event outward
– So find the starting tweets and you find the locale
– Always so? Global events? Disasters?
See PageRank in action (a toy example follows)
– Author network, in particular its strongly connected component (SCC)
– Term networks
Biological metaphor
– Lots of different terms: nutrition, calories, energy…
Reasonable case study
Language / humour in the paper
– “heartquake”
– “dumping factor” (rather than damping factor)
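A toy illustration of the author-authority idea referenced above: PageRank run on a small hypothetical follower graph with networkx. The graph, the edge orientation, and the 0.85 damping factor are conventional choices for illustration, not details taken from the paper.

```python
# Illustrative only: author authority via PageRank on a toy follower graph.
import networkx as nx

# Directed edge u -> v means "u follows v", so authority flows toward v.
follows = [("alice", "bob"), ("carol", "bob"), ("bob", "dave"),
           ("dave", "bob"), ("erin", "dave")]
G = nx.DiGraph(follows)

# alpha=0.85 is the standard damping factor.
authority = nx.pagerank(G, alpha=0.85)

for author, score in sorted(authority.items(), key=lambda kv: -kv[1]):
    print(f"{author}: {score:.3f}")
```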
Some trends
Last Class, Continued: Crawler Evaluation
What are good pages?
Web scale is daunting
User-based crawls are short, but web agents?
Page importance can be assessed by (see the similarity sketch below):
– Presence of query keywords
– Similarity of the page to the query/description
– Similarity to seed pages (held-out sample)
– Using a classifier – not the same one used in the crawler
– Link-based popularity (but within the topic?)
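One of the importance signals above, similarity of a fetched page to the query or topic description, can be sketched with TF-IDF and cosine similarity. The use of scikit-learn, the example URLs, and the text snippets are illustrative assumptions; the lecture does not prescribe a particular library.

```python
# Illustrative only: score fetched pages by cosine similarity between their
# text and the topic description, using TF-IDF (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topic = "emerging topic detection on twitter streams"
pages = {
    "http://example.org/a": "detecting emerging topics in twitter streams with term energy",
    "http://example.org/b": "recipes for sourdough bread and slow fermentation",
}

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([topic] + list(pages.values()))

# Row 0 is the topic description; remaining rows are the fetched pages.
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for url, score in sorted(zip(pages, scores), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {url}")
```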
Summarizing Performance
Precision
– Relevance is Boolean (yes/no): harvest rate = # of good pages / total # of pages crawled
– Relevance is continuous: average relevance over the crawled set
Recall
– Target recall against a held-out set of target pages H: |H ∩ pages crawled| / |H|
Robustness
– Start the same crawler on disjoint seed sets; examine the overlap of the fetched pages
(Each measure is sketched in code below.)
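Set-based sketches of these measures. The function and variable names are assumptions, and since the slide only says "examine overlap", Jaccard overlap is used for robustness as one reasonable choice.

```python
# Illustrative definitions of the crawler evaluation measures above.

def harvest_rate(crawled, relevant):
    """Precision with Boolean relevance: good pages / total pages crawled."""
    return len(crawled & relevant) / len(crawled)

def average_relevance(relevance_scores):
    """Precision with continuous relevance: mean score over the crawled set."""
    return sum(relevance_scores.values()) / len(relevance_scores)

def target_recall(crawled, held_out_targets):
    """Recall against a held-out target set H: |H ∩ crawled| / |H|."""
    return len(held_out_targets & crawled) / len(held_out_targets)

def robustness(crawl_a, crawl_b):
    """Overlap of two crawls started from disjoint seed sets (Jaccard, assumed)."""
    return len(crawl_a & crawl_b) / len(crawl_a | crawl_b)
```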
Sample Performance Graph
Summary
Crawler architecture
Crawler algorithms
Crawler evaluation
Assignment 1 (a reporting sketch follows):
– Run two crawlers for 5000 pages each.
– Start with the same set of seed pages for a topic.
– Look at the overlap and report it over time (robustness).
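A sketch of the assignment’s reporting step: measure the overlap between the two crawls at regular checkpoints as they grow toward 5000 pages. The checkpoint size, the overlap definition (intersection over pages crawled so far), and the toy URL lists are assumptions for illustration.

```python
# Illustrative only: overlap between two crawls, reported over time.

def overlap_over_time(crawl_a, crawl_b, step=500):
    """crawl_a, crawl_b: lists of URLs in the order they were fetched."""
    points = []
    for n in range(step, min(len(crawl_a), len(crawl_b)) + 1, step):
        seen_a, seen_b = set(crawl_a[:n]), set(crawl_b[:n])
        points.append((n, len(seen_a & seen_b) / n))
    return points

if __name__ == "__main__":
    # Toy data standing in for the two crawlers' fetch logs.
    a = [f"http://example.org/{i}" for i in range(2000)]
    b = [f"http://example.org/{i}" for i in range(1000, 3000)]
    for n, frac in overlap_over_time(a, b):
        print(n, f"{frac:.2f}")
```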