Memeta: A Framework for Analytics on the Blogosphere
Pranam Kolari, Tim Finin
Partially supported by NSF awards ITR-IIS and ITR-IDM and by IBM

What is memeta?
Our framework for putting research into real-world use
Features blog identification and splog detection modules
Includes language identification modules for more than 10 languages (provided by James Mayfield)
memeta has been used on an as-needed basis to analyze the blogosphere

1. Welcome to the Splogosphere: 75% of pings are spings (pings from splogs)
Monitored a ping server, weblogs.com, over a period of 3 weeks from 20 Nov 2005 to 11 Dec 2005
Total of 16 million update pings
See figure 1 for the ping distribution of URLs
Pings were first classified by language
Pings from Italian blogs followed a predictable pattern: higher during the day
Pings from English-language blogs followed a similar pattern, though less pronounced than the Italian one
Splogs followed no pattern, and their pings numbered three times those of authentic English blogs (figures 2, 3)

2. Characterizing the Splogosphere
Blogosphere dump for 21 days of July 2005
1.3 million total blogs
Blogs run through the splog detector
Link distribution of blogs vs. splogs plotted on a log-log scale
Predictably, only authentic blogs follow a power law (figures 4, 5)

Continuing Work
Inducing new features for splog detection
Language-independent and adaptive techniques for splog detection
Splog taxonomy and evaluation metrics
Multi-relational local models for splog detection
Tuning memeta to harvest blogs regularly

[Diagram: Blogosphere Analytics. Labels: Blog Directories, Ping Servers, Search Engines, Blog Crawler, Language Identifier, Blog Identifier (98% Accuracy), Splog Detector (87% Accuracy), BLOGS + Heuristics, Language Identifiers, Blog Identification, Spam Blog Detectors, IP Blacklists, Authentic Blogs, Spam Blogs]

Figure captions:
Host distribution of pings at weblogs.com
Nature of pinging URLs at weblogs.com
Ping time-series of Italian blogs over five days
Ping time-series of Italian blogs on a single day
Ping time-series of authentic blogs on a single day
Ping time-series of spam blogs on a single day
Ping time-series of spam blogs over five days
Ping time-series of authentic blogs over five days
Figures 4, 5:
In-degree distribution: only authentic blogs follow a power law
Out-degree distribution: only authentic blogs follow a power law
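The ping time-series comparison in section 1 (a daily rhythm for authentic blogs, none for splogs) can be illustrated with a small sketch. It assumes each ping is available as a UNIX timestamp in seconds and buckets pings by UTC hour. The function names and the peak-to-mean ratio are hypothetical choices made for illustration; they are not taken from memeta.

```python
from collections import Counter
from datetime import datetime, timezone


def pings_per_hour(timestamps):
    """Count pings per hour of day (0-23) from UNIX timestamps in seconds.

    Assumption: hours are taken in UTC; a real analysis would likely
    shift to the bloggers' local time zone first.
    """
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]


def peak_to_mean(hourly):
    """Ratio of the busiest hour to the mean hourly count.

    A value near 1.0 suggests no daily rhythm; larger values suggest
    activity concentrated in particular hours.
    """
    mean = sum(hourly) / len(hourly)
    return max(hourly) / mean if mean > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical pings: three at 09:00-09:59 UTC, one at 21:00-21:59 UTC.
    sample = [9 * 3600, 9 * 3600 + 60, 9 * 3600 + 120, 21 * 3600]
    hourly = pings_per_hour(sample)
    print(hourly)
    print(peak_to_mean(hourly))
```

Running the same two functions separately over pings labeled authentic and pings labeled spam would give one rough number per class to compare against the plotted time series.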
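The log-log link-distribution plots in section 2 rest on a simple property: a power law P(k) proportional to k^(-a) appears as a straight line of slope -a on log-log axes. The sketch below is a rough diagnostic under that assumption. It builds a degree distribution and fits a least-squares line in log-log space. It is not the authors' procedure, the function names are hypothetical, and a least-squares fit on log-log data is only a first check, not a rigorous test for power-law behavior.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Map each positive degree k to the fraction of nodes having that degree."""
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}


def loglog_fit(dist):
    """Fit a least-squares line through (log k, log P(k)).

    Returns (slope, r_squared). A slope near -a with r_squared close to 1
    is consistent with P(k) ~ k^(-a); it does not prove it.
    """
    if len(dist) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    xs = [math.log(k) for k in dist]
    ys = [math.log(p) for p in dist.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    # A perfectly flat distribution has syy == 0; report r_squared as 0.0 there.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared


if __name__ == "__main__":
    # Synthetic in-degrees: the number of nodes with degree k falls off as k^-2.
    degrees = [k for k in range(1, 11) for _ in range(round(1000 / k**2))]
    slope, r_squared = loglog_fit(degree_distribution(degrees))
    print(slope, r_squared)
```

Applied separately to the in-degree and out-degree lists of authentic blogs and of splogs, a markedly lower r_squared for one class would match what the plots show visually.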