© 2006 Nielsen BuzzMetrics, A VNU business affiliate
Natalie Glance
Senior Research Scientist
Nielsen BuzzMetrics

Background
- Nielsen BuzzMetrics aggregates consumer opinion expressed in message boards, weblogs, Usenet, and other online discussions
- Parent company behind BlogPulse, a blog search and analytics website

What drives weblog spam?
- Same goal as any other website spam: SEO
- Weblog hosts provide:
  - Free hosting for link farms to promote affiliate sites
  - Free hosting for web pages with sponsored ads
- Types of weblog spam:
  - spam blogs (pollute ping servers)
  - spam comments on legitimate blogs
  - spam trackback pings to legitimate blogs

Collateral damage: blog search result contamination
- Search results for ‘mortgage’:

Collateral damage: trend graphs
- Explain the peaks: are they real or artifacts of spam?

Collateral damage: real-time monitoring
- Spikes in keyword clusters:
  - 2006/07/28 10:39 a.m. {deleted myspace account}
  - 2006/07/27 10:55 a.m. {landis tested yesterday}
  - 2006/08/07 3:22 a.m. {investing debt directory}
  - 2006/08/07 6:54 a.m. {adsense cents makers}
  - 2006/08/07 1:11 p.m. {wwdc keynote}
- Breaking news or spam attack?
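
The slide does not say how spikes in keyword clusters are detected. As a minimal sketch, assuming a cluster is tracked as a count of posts per hour, a spike could be flagged when the current count sits well above the trailing baseline. The function name `is_spike`, the `min_sigma` parameter, and the hourly counts are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def is_spike(history, current, min_sigma=3.0):
    """Flag `current` if it exceeds the baseline mean of `history`
    by at least `min_sigma` population standard deviations.

    `history` is a non-empty list of past counts for one keyword
    cluster (assumed here to be posts per hour).
    """
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat baseline: any increase counts as a spike.
        return current > baseline
    return (current - baseline) / spread >= min_sigma


# Made-up hourly counts for a single keyword cluster.
past_hours = [4, 5, 6, 5, 4, 6]
quiet_hour = is_spike(past_hours, 6)
busy_hour = is_spike(past_hours, 40)
```

A spike flagged this way says nothing about its cause. Deciding whether it is breaking news or a spam attack still requires looking at the posts behind it.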

Spam filtering challenges
- Different analytics, different trade-offs
  - weblog search requirements: high coverage, clean results, minimize false positives
  - trend search: high precision to eliminate spurious artifacts
  - real-time monitoring: high coverage with human oversight
- Different timeframes, different approaches
  - real-time search: highly efficient classification algorithms; automated identification of spam attacks
  - historic search: offline spam identification can use a combination of approaches; sandbox for new weblogs
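
One way to read the trade-offs above is that a single spam score can be acted on differently by each application. The sketch below assumes a classifier that returns a score between 0 and 1 (higher means more likely spam). The `Policy` class, the `decide` function, and the threshold values are hypothetical, and the low threshold for trend search reflects one reading of "high precision", namely that the posts kept for trend graphs should be reliably non-spam.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    threshold: float    # scores at or above this are treated as spam
    human_review: bool  # send flagged posts to a reviewer instead of dropping


# Hypothetical settings; only their relative ordering is meant to be illustrative.
POLICIES = {
    # Weblog search: keep coverage high, avoid flagging legitimate blogs.
    "search": Policy(threshold=0.9, human_review=False),
    # Trend search: filter aggressively so spam does not create false peaks.
    "trend": Policy(threshold=0.5, human_review=False),
    # Real-time monitoring: keep coverage high, let a person check flagged posts.
    "monitoring": Policy(threshold=0.7, human_review=True),
}


def decide(score, application):
    """Return 'keep', 'drop', or 'review' for a post with the given spam score."""
    policy = POLICIES[application]
    if score < policy.threshold:
        return "keep"
    return "review" if policy.human_review else "drop"


# The same borderline post is handled differently per application.
borderline = {app: decide(0.75, app) for app in POLICIES}
```

With these assumed thresholds, a post scoring 0.75 stays in weblog search results, is excluded from trend graphs, and is queued for human review in real-time monitoring.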