Social Streams Blog Crawler Matthew Hurst Alexey Maykov Live Labs, Microsoft.


1 Social Streams Blog Crawler Matthew Hurst Alexey Maykov Live Labs, Microsoft

2 Requirements: Timeliness; Coverage; Scale; Data Quality

3 Crawling Options: Firehose – Six Apart – WordPress $$$; Feeds – Feeds – Ping servers

4 Static List of Feeds: Simplification; Distributed Crawl; DNS Resolution; Redirect resolution; Broken URL resolution; Robots.txt; Fits into memory
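
The robots.txt step on this slide can be sketched with Python's standard library. This is only an illustration, not the crawler described in the talk: the function name `is_fetch_allowed`, the user agent `examplebot`, and the sample rules are assumptions, and the robots.txt body is passed in as a string so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt body permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for an example host.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(is_fetch_allowed(SAMPLE_ROBOTS, "examplebot", "http://example.com/feed.xml"))
print(is_fetch_allowed(SAMPLE_ROBOTS, "examplebot", "http://example.com/private/feed.xml"))
```

With a static feed list, a check like this can run once per host while the list is prepared rather than on every fetch.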

5 [Diagram: List Management; Crawler 1 … Crawler N; Archive]

6 List Management System: Find new Feeds; Blog vs Non-blog; Spam; Assess and Remove – Stale feeds – Duplicate feeds – Low quality/non-blog/spam feeds – Assess the size of the list
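
Removing stale and duplicate feeds could look roughly like the sketch below. The slides do not say how staleness or duplication is detected, so everything here is an assumption: the `Feed` record, the `max_age` threshold, and the URL normalization (lowercased scheme and host, trailing slash and fragment dropped) are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass
class Feed:
    url: str
    last_post: float  # time of the most recent post, in Unix seconds


def normalize(url: str) -> str:
    """Canonical form used as the duplicate key (an assumed, simple rule)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def prune(feeds, now, max_age):
    """Drop feeds with no post within max_age seconds, then collapse duplicates."""
    kept = {}
    for feed in feeds:
        if now - feed.last_post > max_age:
            continue  # stale: nothing posted recently
        key = normalize(feed.url)
        # Among duplicates, keep the record with the most recent post.
        if key not in kept or feed.last_post > kept[key].last_post:
            kept[key] = feed
    return list(kept.values())
```

For example, `prune([Feed("http://Example.com/feed/", 100), Feed("http://example.com/feed", 90)], now=110, max_age=50)` keeps only the first record, since both URLs normalize to the same key.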

7 Crawler Constraints: Politeness; Network/RAM/CPU; Requirement for the Latency of the Output

8 Design: Per-IP buckets; Each bucket has a priority queue of URLs [Diagram: buckets for 168.193.56.21 and 212.3.106.1, each holding URL1, URL2, URL3, … and served by Connection1 and Connection2]
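
The per-IP bucket design can be sketched with a heap per IP address. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `IpBuckets`, its methods, and the use of a due time as the priority are hypothetical, and the limit of two connections per IP is taken only from the diagram.

```python
import heapq
from collections import defaultdict


class IpBuckets:
    """One priority queue of URLs per IP, with a cap on open connections per IP."""

    def __init__(self, max_connections_per_ip=2):
        self.max_connections = max_connections_per_ip
        self.queues = defaultdict(list)  # ip -> heap of (due_time, url)
        self.active = defaultdict(int)   # ip -> connections currently in use

    def add(self, ip, url, due_time):
        heapq.heappush(self.queues[ip], (due_time, url))

    def acquire(self, ip, now):
        """Return the most urgent due URL for ip, or None if nothing can start."""
        if self.active[ip] >= self.max_connections:
            return None  # politeness: connection limit reached for this IP
        queue = self.queues[ip]
        if not queue or queue[0][0] > now:
            return None  # nothing queued, or the earliest URL is not due yet
        self.active[ip] += 1
        return heapq.heappop(queue)[1]

    def release(self, ip):
        """Mark one connection to ip as finished."""
        self.active[ip] -= 1
```

Grouping by IP rather than by hostname means that many blogs served from one hosting machine share a single connection budget.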

9 Priority Queue: Expected time of the new post – Last ping time – Time of the last post plus mean period between posts; Importance of the feed; Combination of above
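
The "time of the last post plus mean period between posts" estimate can be written out as a short function. The slide does not say how the ping time and the estimate are combined, so treating a ping newer than the last known post as "due at the ping time" is an assumption, as are the function name and the fallback for feeds with a single known post.

```python
def expected_next_post(post_times, last_ping=None):
    """Estimate when a feed will next have a new post (same time unit as the inputs).

    post_times must contain at least one timestamp.
    """
    times = sorted(post_times)
    last_post = times[-1]
    # Assumption: a ping newer than the last known post means new content already exists.
    if last_ping is not None and last_ping > last_post:
        return float(last_ping)
    # Assumption: with only one known post there is no period to average.
    if len(times) < 2:
        return float(last_post)
    # Mean gap between consecutive posts equals the total span over the number of gaps.
    mean_gap = (times[-1] - times[0]) / (len(times) - 1)
    return last_post + mean_gap


print(expected_next_post([0, 10, 40]))                # 60.0
print(expected_next_post([0, 10, 40], last_ping=45))  # 45.0
```

A value like this could serve as the due time in a per-IP priority queue; folding in feed importance would require a weighting that the slides do not specify.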

10 Learnings: Quality feed discovery is hard; Blog vs non-blog classification is hard; Can't have too many connections; IPs tend to change; Broken feeds/general feed variety; Broken feed URLs

11 QUESTIONS

