Social Streams Blog Crawler
Matthew Hurst, Alexey Maykov
Live Labs, Microsoft

Requirements
  Timeliness
  Coverage
  Scale
  Data Quality

Crawling Options
  Firehose
    – Six Apart
    – WordPress
  $$$
  Feeds
    – Feeds
    – Ping servers

Static List of Feeds
  Simplification
  Distributed Crawl
  DNS Resolution
  Redirections resolution
  Broken URLs resolution
  Robots.txt
  Fits into memory

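With a static list, the robots.txt check can be done once, ahead of the crawl. The sketch below is only an illustration of that idea, not the crawler described in these slides: it uses Python's standard `urllib.robotparser`, and the function name `allowed_feeds`, the user agent string, and the example URLs are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

def allowed_feeds(feed_urls, robots_lines, user_agent="example-blog-crawler"):
    """Keep only the feed URLs that the given robots.txt lines permit.

    robots_lines is the already-downloaded robots.txt of one host, split
    into lines; fetching it is left out so the sketch runs offline.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in feed_urls if parser.can_fetch(user_agent, url)]

# Hypothetical host and rules, for illustration only.
robots = ["User-agent: *", "Disallow: /private/"]
feeds = [
    "http://blog.example.com/feed.xml",
    "http://blog.example.com/private/feed.xml",
]
permitted = allowed_feeds(feeds, robots)
```

Because the list is static, the result can be stored next to each feed and reused by every crawler instead of being rechecked on each fetch.
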
[Diagram: List Management, Crawler 1 through Crawler N, Archive]

List Management System
  Find new Feeds
  Blog vs Non-blog
  Spam
  Assess and Remove
    – Stale feeds
    – Duplicate feeds
    – Low quality/non-blog/spam feeds
    – Assess the size of the list

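As a rough illustration of the "stale" and "duplicate" removal steps, the sketch below prunes a feed list held as a mapping from URL to the time of the last post. The normalization rule (lowercase host, no trailing slash, no fragment) and the 180-day staleness threshold are assumptions chosen for the example; the slides do not say which rules were actually used.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Hypothetical canonical form used to spot duplicate feed URLs."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def prune(feeds, now, max_age=timedelta(days=180)):
    """Drop stale feeds and merge duplicates.

    feeds maps feed URL -> datetime of its last post. The result maps the
    normalized URL -> the most recent last-post time among its duplicates.
    """
    kept = {}
    for url, last_post in feeds.items():
        if now - last_post > max_age:
            continue  # stale: no post within max_age
        key = normalize(url)
        if key not in kept or last_post > kept[key]:
            kept[key] = last_post
    return kept

# Hypothetical feed list, for illustration only.
feeds = {
    "http://Blog.example.com/feed/": datetime(2008, 3, 1),
    "http://blog.example.com/feed": datetime(2008, 3, 5),
    "http://old.example.com/rss": datetime(2006, 1, 1),
}
pruned = prune(feeds, now=datetime(2008, 4, 1))
```

Duplicates that differ in content-equivalent ways (for example, an RSS and an Atom feed of the same blog) would need a comparison of the feed entries themselves, which this sketch does not attempt.
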
Crawler Constraints
  Politeness
  Network/RAM/CPU
  Requirement for the Latency of the Output

Design
  Per IP buckets
  Each bucket has a priority queue of URLs
[Diagram: two per-IP buckets, each holding a queue of URLs (URL1, URL2, URL3, …) served by Connection1 and Connection2]

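The per-IP bucket design can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `BucketScheduler`, its methods, and the default limit of two connections per IP (read off the diagram labels) are all hypothetical. Each bucket is a heap ordered by the time at which a URL becomes due.

```python
import heapq
from collections import defaultdict

class BucketScheduler:
    """One priority queue of URLs per IP, with a per-IP connection cap."""

    def __init__(self, max_connections_per_ip=2):
        self.max_connections_per_ip = max_connections_per_ip
        self.buckets = defaultdict(list)  # ip -> heap of (due_time, url)
        self.active = defaultdict(int)    # ip -> connections currently open

    def add(self, ip, url, due_time):
        heapq.heappush(self.buckets[ip], (due_time, url))

    def next_url(self, now):
        """Return (ip, url) for the earliest due URL on an IP that still
        has a free connection, or None if nothing can be fetched now."""
        best_ip = None
        for ip, heap in self.buckets.items():
            if not heap or self.active[ip] >= self.max_connections_per_ip:
                continue  # empty bucket, or politeness limit reached
            if heap[0][0] > now:
                continue  # head of this bucket is not due yet
            if best_ip is None or heap[0] < self.buckets[best_ip][0]:
                best_ip = ip
        if best_ip is None:
            return None
        _, url = heapq.heappop(self.buckets[best_ip])
        self.active[best_ip] += 1
        return best_ip, url

    def done(self, ip):
        """Release a connection once a fetch on this IP has finished."""
        self.active[ip] -= 1
```

Keying the buckets by IP rather than by hostname means that many blogs hosted on one server share a single politeness limit, which is the point of the design on this slide.
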
Priority Queue
  Expected time of the new post
    – Last ping time
    – Time of the last post plus mean period between posts
  Importance of the feed
  Combination of above

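One way to read the priority signals above is sketched here. The expected time of the next post follows the slide: a ping newer than the last post means new content is expected at the ping time, otherwise the estimate is the last post time plus the mean gap between posts. The way importance is combined with that estimate (a linear shift controlled by `weight`) is purely an assumption for illustration; the slide only says the signals are combined.

```python
def expected_next_post(post_times, last_ping=None):
    """Estimate when the next post appears.

    post_times is a non-empty, ascending list of timestamps in seconds.
    """
    last_post = post_times[-1]
    if last_ping is not None and last_ping > last_post:
        return last_ping  # a ping after the last post signals new content
    if len(post_times) < 2:
        return last_post  # no period can be estimated from a single post
    mean_period = (post_times[-1] - post_times[0]) / (len(post_times) - 1)
    return last_post + mean_period

def priority(expected_time, importance, now, weight=3600.0):
    """Lower value means crawl sooner.

    importance is assumed to lie in [0, 1]; a fully important feed is
    treated as if it were due `weight` seconds earlier (hypothetical rule).
    """
    return (expected_time - now) - weight * importance
```

For example, a feed with posts at 0, 100 and 200 seconds has a mean period of 100 seconds, so without a ping its next post is expected at 300 seconds.
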
Learnings
  Quality feed discovery is hard
  Blog vs non-blog classification is hard
  Can’t have too many connections
  IPs tend to change
  Broken feeds/general feed variety
  Broken feed URLs

QUESTIONS