1
Blog Data Analysis
S. Muthukrishnan, CS Rutgers & DIMACS
Graham Cormode, DIMACS
2
Weblog data
"Weblogs": blanket term for regularly updated on-line journals
Usually informal, opinionated, candid: more like email than web
Many millions of "blogs" created with free tools and websites
Published as web pages
3
RSS
XML structured document, a representation of the blog
More structured than HTML: indicates title, timestamp, permalink, content etc. of posts
Enables easy checking of updates to blogs
But… may not contain whole content
Not all blogs have an RSS feed available?
Given a blog, how to find the accompanying RSS feed automatically?
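One common way to approach the last question is feed autodiscovery: many blog pages advertise their feed with a `<link rel="alternate">` element in the HTML head. The following is a minimal sketch under that assumption, not a method stated on the slide. The names `FeedLinkFinder` and `find_feeds` are hypothetical, and blogs that do not advertise a feed this way would need other heuristics.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise a feed (assumption: RSS or Atom).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values may be None for valueless attributes.
        a = {name: (value or "") for name, value in attrs}
        rel = a.get("rel", "").lower().split()
        mime = a.get("type", "").lower()
        if "alternate" in rel and mime in FEED_TYPES and a.get("href"):
            # Resolve relative feed links against the blog's own URL.
            self.feeds.append(urljoin(self.base_url, a["href"]))


def find_feeds(html, base_url):
    """Return the feed URLs advertised in an already-downloaded blog page."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.feeds


if __name__ == "__main__":
    page = ('<html><head><link rel="alternate" type="application/rss+xml" '
            'href="/index.rdf"></head><body>...</body></html>')
    print(find_feeds(page, "http://example.org/blog/"))
```

The sketch works on HTML that has already been fetched, so it can be tested without network access; fetching the page is left to the caller.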
4
Different blog systems
Hosted blogs: Blogger / Blogspot (owned by Google), LiveJournal, Myspace (owned by NewsCorp), Xanga?, others…
Blog management systems: TypePad, WordPress, MovableType
5
Blogging Ecosystem
RSS readers: Bloglines, Google Reader / Yahoo blog reader?
Blog metadata: Blogcensus.net, Blogpulse, Technorati, others…
7
Collection and Analysis
Automatically collect blogs, strip formatting and tags, ads etc.
Output "bag of words" into streaming algorithms for analysis, archival.
So far: 900,000 blogs, 10GB compressed. Scale to 100s of GBs.
To do: extract more meta-data (time of posting, title, links etc.), per-blog analysis, retroactive analysis...
Preetham Mysore, Claudio Tancioni

Sample blog post shown on the slide:
What I Want For WHAT
Joel Spolsky satisfyingly nails a bunch of ways to improve client-side web app development which the WHAT Working Group should work on. All his suggestions are excellent and well worth looking over, even if some seem to require the same "boiling the ocean" that he doesn't want to hear about. That said, most of his list could probably be done with a good set of Javascript libraries, along the lines of Dean Edwards's IE7, and his #2 (fast REST queries back to the server in JS) is pretty much - well, almost - with us already, looking at combinations of things like XMLHttpRequest and mod_pubsub. But anyway, he ended his piece with a call for more suggestions that he could link to. I've been doing a little bit of browser app development in the last few days, and these are the things that spring most readily to mind:
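The "strip formatting and tags, output a bag of words" step can be illustrated with a short sketch. This is an assumed reconstruction for illustration only; the slides do not show the actual collection code, and the names `TextExtractor` and `bag_of_words` are hypothetical. Ad removal is not attempted here.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps visible text and drops tags plus script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def bag_of_words(html):
    """Return a Counter mapping each lowercased word to its frequency."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    # [^\W\d_]+ matches runs of Unicode letters, so non-English posts work too.
    return Counter(re.findall(r"[^\W\d_]+", text))


if __name__ == "__main__":
    post = "<p>Love <b>and</b> war, love!</p><script>var x = 1;</script>"
    print(bag_of_words(post).most_common(3))
```

Because the result is a plain word-count mapping, the words can also be emitted one at a time into a streaming algorithm instead of being held in memory.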
8
Early Results
Began building systems June 2004.
Extracted most common terms on 1GB using streaming analysis
New blog stopwords
Multilingual
Non-standard word distribution: "love" vs "war"
Used only ~26KB. 3000:1 compaction
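The slide does not say which streaming algorithm was used to extract the most common terms. As one illustration of how frequent terms can be found in small memory, here is a sketch of the Misra-Gries frequent-items algorithm, which keeps at most k - 1 counters regardless of stream length. The function name `misra_gries` and the parameter `k` are choices made for this example.

```python
def misra_gries(stream, k):
    """Track candidate frequent items using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to be among the returned keys. The stored counts are lower
    bounds on the true frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    words = "love war love school love war love".split()
    print(misra_gries(words, k=3))
```

A second pass over the data, or a larger k, is typically needed to turn the returned candidates into exact counts.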
9
Blog Statistics

Top Weblog Hosts
1. blogspot (418803)
2. livejournal (342265)
3. xanga (187021)
4. diaryland (71649)
5. persianblog (59645)

Blog Languages
1. English (70%)
2. Portuguese (4.5%)
3. Farsi (3.2%)
4. Polish (2.8%)
5. French (1.8%)
6. Spanish (1.1%)
7. German (1.0%)
8. Chinese (0.7%)
9. Italian (0.5%)
10. Dutch (0.4%)

Top Blog Nouns
Current rank (last month's rank)
6. (8) love
29. (37) school
31. (45) friends
34. (46) music
41. (59) fun
58. (60) god
65. (89) happy
79. (47) news
146. (175) movie
156. (88) war
166. (167) money
168. (142) book
171. (183) family
173. (190) car
186. (211) mom
234. (118) bush
253. (172) iraq
10
Homework
Write-up of 1 page or more for each item:
Survey different blogging sites, blog formats/templates and blog data collection mechanisms.
Survey the RSS feed mechanism for blog data.
List methods for finding "reverse" links for a given blog: how to find who links to a blog?
How to estimate the number of blogs in the world, and the number of blogs not hosted by well-known blogging sites (LJ, Blogger etc.)?