Emerging Topic Detection on Twitter (Cataldi et al., MDMKDD 2010)
Padmini Srinivasan
Computer Science Department
Department of Management Sciences
http://cs.uiowa.edu/~psriniva
padmini-srinivasan@uiowa.edu
Twitter
2 to 3 new users/second!!
175 million users as of today
95 million tweets written each day
300 employees and hiring!
Prompt changed from “What are you doing?” to “What’s happening?”
Daily chatter, URL sharing, news
– Distributed/local reporting; no filters; no editing
– Fastest, lowest-level information service
Not a social network but an information network
Aggregators & others
Plenty:
– Tweetmeme: separates the kinds of media linked to (video, image, …)
– Twitter search
– Twistori: aggregates emotions
– Tweetsentiments.com
– Retail Twitter aggregation
– TweetTabs
– Where, what, when
Finding ‘Emergent’ Topics
What is it?
– A topic that is popular in the current time interval and was not popular in the past
5 steps (see the sketch after this list):
– Represent tweets as vectors of terms (language independent)
– Graph active authors’ social relationships (PageRank)
– Model each term’s life cycle: “novel aging theory”
– Rank terms by their “energy” and select the top few
– Create a navigable topic graph linking emerging terms with co-occurring ones: emerging topics
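A minimal Python sketch of the pipeline, to make the steps concrete. Author authority (step 2) is assumed to be given, e.g. from PageRank over the follower graph (a toy PageRank example appears after the next slide). The tokenizer, the default author weight, the exact energy formula, and the fixed top-k cut-off are illustrative assumptions that approximate the paper’s “aging theory”, not a reproduction of it.

```python
# Simplified, illustrative sketch of the five steps (names and weights are assumptions).
from collections import Counter, defaultdict

def term_vector(text):
    """Step 1: a tweet as a bag of lower-cased terms (language independent)."""
    return Counter(text.lower().split())

def nutrition(tweets, authority):
    """Per-term 'nutrition' for one time interval, weighting each occurrence
    by the author's authority (step 2, PageRank scores, assumed given)."""
    nutr = defaultdict(float)
    for author, text in tweets:
        w = authority.get(author, 0.1)          # assumed default weight
        for term, count in term_vector(text).items():
            nutr[term] += count * w
    return nutr

def energy(history, s=3):
    """Steps 3-4: compare current nutrition with the previous s intervals;
    more recent intervals count more via the 1/(t - x) factor."""
    t = len(history) - 1
    current = history[t]
    return {
        term: sum((current[term] ** 2 - history[x].get(term, 0.0) ** 2) / (t - x)
                  for x in range(max(0, t - s), t))
        for term in current
    }

def emerging_topics(history, current_tweets, s=3, top_k=5):
    """Step 4: keep the top-k terms by energy. Step 5: link each one to the
    terms it co-occurs with in the current interval's tweets."""
    e = energy(history, s)
    top = sorted(e, key=e.get, reverse=True)[:top_k]
    graph = defaultdict(set)
    for _, text in current_tweets:
        terms = set(term_vector(text))
        for term in terms & set(top):
            graph[term] |= terms - {term}
    return top, dict(graph)
```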
Cool points
Nice overview with enough detail about text representation, processing, etc.
Hypothesis: ideas flow from the geographical origin of an event outward
– So find the starting tweets and you find the locale
– Always so? Global events? Disasters?
See PageRank in action (a toy example follows)
– Author network, in particular its strongly connected component (SCC)
– Term networks
Biological metaphor
– Lots of different terms: nutrition, calories, energy…
Reasonable case study
Language / humour in the paper
– “heartquake”
– “dumping factor” (rather than damping factor)
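A toy illustration of the author-authority idea referenced above: PageRank run on a small hypothetical follower graph with networkx. The graph, the edge orientation, and the 0.85 damping factor are conventional choices for illustration, not details taken from the paper.

```python
# Illustrative only: author authority via PageRank on a toy follower graph.
import networkx as nx

# Directed edge u -> v means "u follows v", so authority flows toward v.
follows = [("alice", "bob"), ("carol", "bob"), ("bob", "dave"),
           ("dave", "bob"), ("erin", "dave")]
G = nx.DiGraph(follows)

# alpha=0.85 is the standard damping factor.
authority = nx.pagerank(G, alpha=0.85)

for author, score in sorted(authority.items(), key=lambda kv: -kv[1]):
    print(f"{author}: {score:.3f}")
```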
Some trends
Last Class, Continued: Crawler Evaluation
What are good pages?
Web scale is daunting
User-based crawls are short, but web agents?
Page importance can be assessed by (see the similarity sketch below):
– Presence of query keywords
– Similarity of the page to the query/description
– Similarity to seed pages (held-out sample)
– Using a classifier – not the same one used in the crawler
– Link-based popularity (but within the topic?)
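One of the importance signals above, similarity of a fetched page to the query or topic description, can be sketched with TF-IDF and cosine similarity. The use of scikit-learn, the example URLs, and the text snippets are illustrative assumptions; the lecture does not prescribe a particular library.

```python
# Illustrative only: score fetched pages by cosine similarity between their
# text and the topic description, using TF-IDF (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topic = "emerging topic detection on twitter streams"
pages = {
    "http://example.org/a": "detecting emerging topics in twitter streams with term energy",
    "http://example.org/b": "recipes for sourdough bread and slow fermentation",
}

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([topic] + list(pages.values()))

# Row 0 is the topic description; remaining rows are the fetched pages.
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for url, score in sorted(zip(pages, scores), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {url}")
```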
Summarizing Performance
Precision
– Relevance is Boolean (yes/no): harvest rate = # of good pages / total # of pages crawled
– Relevance is continuous: average relevance over the crawled set
Recall
– Target recall against a held-out set of target pages H: |H ∩ pages crawled| / |H|
Robustness
– Start the same crawler on disjoint seed sets; examine the overlap of the fetched pages
(Each measure is sketched in code below.)
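Set-based sketches of these measures. The function and variable names are assumptions, and since the slide only says "examine overlap", Jaccard overlap is used for robustness as one reasonable choice.

```python
# Illustrative definitions of the crawler evaluation measures above.

def harvest_rate(crawled, relevant):
    """Precision with Boolean relevance: good pages / total pages crawled."""
    return len(crawled & relevant) / len(crawled)

def average_relevance(relevance_scores):
    """Precision with continuous relevance: mean score over the crawled set."""
    return sum(relevance_scores.values()) / len(relevance_scores)

def target_recall(crawled, held_out_targets):
    """Recall against a held-out target set H: |H ∩ crawled| / |H|."""
    return len(held_out_targets & crawled) / len(held_out_targets)

def robustness(crawl_a, crawl_b):
    """Overlap of two crawls started from disjoint seed sets (Jaccard, assumed)."""
    return len(crawl_a & crawl_b) / len(crawl_a | crawl_b)
```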
Sample Performance Graph
Summary
Crawler architecture
Crawler algorithms
Crawler evaluation
Assignment 1 (a reporting sketch follows):
– Run two crawlers for 5000 pages each.
– Start with the same set of seed pages for a topic.
– Look at the overlap and report it over time (robustness).
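A sketch of the assignment’s reporting step: measure the overlap between the two crawls at regular checkpoints as they grow toward 5000 pages. The checkpoint size, the overlap definition (intersection over pages crawled so far), and the toy URL lists are assumptions for illustration.

```python
# Illustrative only: overlap between two crawls, reported over time.

def overlap_over_time(crawl_a, crawl_b, step=500):
    """crawl_a, crawl_b: lists of URLs in the order they were fetched."""
    points = []
    for n in range(step, min(len(crawl_a), len(crawl_b)) + 1, step):
        seen_a, seen_b = set(crawl_a[:n]), set(crawl_b[:n])
        points.append((n, len(seen_a & seen_b) / n))
    return points

if __name__ == "__main__":
    # Toy data standing in for the two crawlers' fetch logs.
    a = [f"http://example.org/{i}" for i in range(2000)]
    b = [f"http://example.org/{i}" for i in range(1000, 3000)]
    for n, frac in overlap_over_time(a, b):
        print(n, f"{frac:.2f}")
```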