Blog Data Analysis
S. Muthukrishnan, CS Rutgers & DIMACS
Graham Cormode, DIMACS
Weblog data
  "Weblogs": blanket term for regularly updated on-line journals
  Usually informal, opinionated, candid: more like than web
  Many millions of "blogs" created with free tools and websites
  Published as web pages
RSS
  XML structured document, representation of the blog
  More structured than HTML: indicates title, timestamp, permalink, content etc. of posts
  Enables easy checking of updates to blogs
  But… may not contain whole content
  Not all blogs have an RSS feed available?
  Given a blog, how to find accompanying RSS feed automatically?
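One common answer to the last question is feed autodiscovery: many blog pages advertise their feed in a `<link rel="alternate">` tag in the HTML head. The sketch below assumes that convention and uses only the Python standard library; the names `FeedLinkFinder` and `find_feeds` are illustrative, not taken from the slides. It returns an empty list for pages that do not advertise a feed, which matches the caveat that not every blog has one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. <link rel>), so normalise to "".
        a = {name: (value or "") for name, value in attrs}
        rels = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and a.get("href"):
            # Feed links are often relative; resolve them against the page URL.
            self.feeds.append(urljoin(self.base_url, a["href"]))


def find_feeds(html_text, base_url):
    """Return absolute feed URLs advertised in the given HTML page."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html_text)
    return finder.feeds
```

Fetching the page itself (for example with `urllib.request`) is left out so the sketch stays self-contained. A crawler would probably fall back to guessing host-specific feed paths when nothing is advertised.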
Different blog systems
  Hosted blogs: Blogger / Blogspot (owned by Google), LiveJournal, MySpace (owned by NewsCorp), Xanga?, others…
  Blog management systems: TypePad, WordPress, MovableType
Blogging Ecosystem
  RSS readers: Bloglines, Google Reader / Yahoo blog reader?
  Blog metadata: Blogpulse, Technorati, others…
Collection and Analysis
  Automatically collect blogs; strip formatting, tags, ads etc.
  Output "bag of words" into streaming algorithms for analysis, archival.
  So far: 900,000 blogs, 10GB compressed. Scale to 100s of GBs.
  To do: extract more metadata (time of posting, title, links etc.), per-blog analysis, retroactive analysis...
  Preetham Mysore, Claudio Tancioni
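The strip-and-tokenise step can be sketched as follows. This is a minimal illustration using Python's standard library, not the project's actual pipeline: the class `TextExtractor`, the function `bag_of_words`, and the tokenisation rule (lowercased runs of letters) are all assumptions. Ad removal is not attempted here, since it would need site-specific rules.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text outside script/style blocks is treated as page content.
        if self._skip_depth == 0:
            self.parts.append(data)


def bag_of_words(html_text):
    """Return a Counter mapping each lowercased word to its frequency."""
    extractor = TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    # [^\W\d_]+ matches runs of Unicode letters, so non-English posts tokenise too.
    return Counter(re.findall(r"[^\W\d_]+", text))
```

At the scale mentioned above, holding a full `Counter` per corpus would not be practical. Here it only stands in for the word stream that is handed to the streaming algorithms.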
Early Results
  Began building systems in June
  Extracted most common terms on 1GB using streaming analysis
  New blog stopwords
  Multilingual
  Non-standard word distribution: "love" vs "war"
  Used only ~26KB. 3000:1 compaction
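The slides do not say which streaming algorithm produced the most common terms. As one standard possibility, the sketch below uses the Misra-Gries frequent-items summary, which keeps at most k - 1 counters regardless of stream length. The function name `misra_gries` and the parameter `k` are illustrative choices.

```python
def misra_gries(stream, k):
    """Return candidate frequent items using at most k - 1 counters.

    Every item occurring more than n/k times in a stream of length n is
    guaranteed to be among the returned keys. Each stored count
    underestimates the true count by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    words = "love war love school love music love fun love".split()
    print(misra_gries(words, k=3))
```

Because the summary size depends on k and not on the input size, a small k yields a summary of a few kilobytes even for gigabytes of text. That is the kind of compaction the slide reports. The returned keys are candidates only, so a second pass is needed if exact counts matter.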
Blog Statistics

Top Weblog Hosts
  1. blogspot (418803)
  2. livejournal (342265)
  3. xanga (187021)
  4. diaryland (71649)
  5. persianblog (59645)

Blog Languages
  1. English (70%)
  2. Portuguese (4.5%)
  3. Farsi (3.2%)
  4. Polish (2.8%)
  5. French (1.8%)
  6. Spanish (1.1%)
  7. German (1.0%)
  8. Chinese (0.7%)
  9. Italian (0.5%)
  10. Dutch (0.4%)

Top Blog Nouns: current rank (last month)
  6. (8) love
  29. (37) school
  31. (45) friends
  34. (46) music
  41. (59) fun
  58. (60) god
  65. (89) happy
  79. (47) news
  146. (175) movie
  156. (88) war
  166. (167) money
  168. (142) book
  171. (183) family
  173. (190) car
  186. (211) mom
  234. (118) bush
  253. (172) iraq
Homework
  Write-up of 1 page or more for each item:
  1. Survey different blogging sites, blog formats/templates and blog data collection mechanisms.
  2. Survey the RSS feed mechanism for blog data.
  3. List methods for finding “reverse” links for a given blog: how to find who links to a blog?
  4. How to estimate the number of blogs in the world, and the number of blogs not hosted by well-known blogging sites (LJ, Blogger etc.)?