Published by Leslie Cunningham. Modified over 9 years ago.
1
Social Streams Blog Crawler
Matthew Hurst, Alexey Maykov
Live Labs, Microsoft
2
Requirements
Timeliness
Coverage
Scale
Data Quality
3
Crawling Options
Firehose
  – Six Apart
  – WordPress
$$$
Feeds
  – Feeds
  – Ping servers
4
Static List of Feeds
Simplification
Distributed crawl
DNS resolution
Redirect resolution
Broken URL resolution
Robots.txt
Fits into memory
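One reading of this slide is that a fixed feed list lets per-host work such as robots.txt checks be done ahead of the crawl rather than inside it. The sketch below illustrates that idea only; it is not the authors' implementation. The function name `allowed_feeds`, the `robots_by_host` mapping of already-downloaded robots.txt bodies, and the user agent string are all assumptions made for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def allowed_feeds(feed_urls, robots_by_host, user_agent="example-blog-crawler"):
    """Filter a static feed list against robots.txt bodies fetched earlier.

    robots_by_host maps a host name to the text of its robots.txt
    (hypothetical input; how it is fetched is out of scope here).
    Hosts without an entry are treated as unrestricted.
    """
    parsers = {}
    for host, text in robots_by_host.items():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        parsers[host] = parser

    kept = []
    for url in feed_urls:
        parser = parsers.get(urlsplit(url).netloc)
        if parser is None or parser.can_fetch(user_agent, url):
            kept.append(url)
    return kept
```

Because the list is static, a filter like this can run once per list refresh, so the crawl loop itself never has to consult robots.txt.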
5
[Diagram: List Management, Crawler 1 … Crawler N, Archive]
6
List Management System
Find new feeds
Blog vs non-blog
Spam
Assess and remove
  – Stale feeds
  – Duplicate feeds
  – Low quality/non-blog/spam feeds
  – Assess the size of the list
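The "assess and remove" step for stale and duplicate feeds can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the talk: the URL normalization rule, the 180-day staleness threshold, and the names `normalize` and `prune` are all hypothetical.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Assumed duplicate key: lowercase scheme and host, no trailing slash, no fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def prune(feeds, now, max_age=timedelta(days=180)):
    """Drop stale feeds and collapse duplicates.

    feeds maps feed URL -> datetime of its most recent post.
    Among duplicates, the URL with the most recent post is kept.
    """
    kept = {}
    for url, last_post in feeds.items():
        if now - last_post > max_age:
            continue  # stale: no post within max_age
        key = normalize(url)
        if key not in kept or last_post > kept[key][1]:
            kept[key] = (url, last_post)
    return sorted(url for url, _ in kept.values())
```

Low-quality, non-blog, and spam detection is left out here, since the slides describe it as a classification problem rather than a simple rule.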
7
Crawler Constraints
Politeness
Network/RAM/CPU
Requirement for the latency of the output
8
Design
Per-IP buckets
Each bucket has a priority queue of URLs
[Diagram: buckets for 168.193.56.21 and 212.3.106.1, each holding URL1, URL2, URL3, … and served by Connection1 and Connection2]
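The per-IP bucket design can be sketched with one heap per IP address and a cap on open connections per IP. The cap of two mirrors the Connection1/Connection2 pair in the diagram. The class name `IpBuckets`, its methods, and the rule of picking the most urgent URL across all IPs that still have a free connection are assumptions for illustration, not the authors' code.

```python
import heapq
from collections import defaultdict


class IpBuckets:
    """One priority queue of URLs per IP, with a per-IP connection cap."""

    def __init__(self, max_connections_per_ip=2):
        self.max_connections = max_connections_per_ip
        self.queues = defaultdict(list)  # ip -> heap of (priority, url)
        self.active = defaultdict(int)   # ip -> connections currently open

    def add(self, ip, url, priority):
        """Queue a URL; a lower priority value means crawl sooner."""
        heapq.heappush(self.queues[ip], (priority, url))

    def next_url(self):
        """Pop the most urgent URL among IPs with a free connection, or None."""
        best_ip = None
        for ip, heap in self.queues.items():
            if heap and self.active[ip] < self.max_connections:
                if best_ip is None or heap[0] < self.queues[best_ip][0]:
                    best_ip = ip
        if best_ip is None:
            return None
        _, url = heapq.heappop(self.queues[best_ip])
        self.active[best_ip] += 1
        return best_ip, url

    def done(self, ip):
        """Release a connection once a fetch from this IP has finished."""
        self.active[ip] -= 1
```

Grouping by IP rather than by host name keeps the crawler polite toward servers that host many blogs behind one address.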
9
Priority Queue
Expected time of the new post
  – Last ping time
  – Time of the last post plus mean period between posts
Importance of the feed
Combination of the above
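The priority signals listed here can be turned into a sortable key. In the sketch below, the expected post time follows the slide (last post plus mean period between posts, or the ping time when a ping is newer than the last known post). How importance is combined with it is not stated in the slides; pulling the due time earlier by a fixed number of seconds per importance point, along with the names `crawl_priority` and `seconds_per_importance`, is purely an assumption.

```python
from datetime import datetime, timedelta


def mean_period(post_times):
    """Mean gap between consecutive posts; needs at least two timestamps."""
    times = sorted(post_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def expected_next_post(post_times, last_ping=None):
    """Last post plus mean period, unless a newer ping signals fresh content."""
    last_post = max(post_times)
    if last_ping is not None and last_ping > last_post:
        return last_ping
    return last_post + mean_period(post_times)


def crawl_priority(post_times, importance, last_ping=None, seconds_per_importance=600):
    """Earlier result means crawl sooner (assumed way of combining the signals)."""
    due = expected_next_post(post_times, last_ping)
    return due - timedelta(seconds=importance * seconds_per_importance)
```

The returned datetime can be used directly as the key in a per-bucket priority queue, since earlier values sort first.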
10
Learnings
Quality feed discovery is hard
Blog vs non-blog classification is hard
Can’t have too many connections
IPs tend to change
Broken feeds/general feed variety
Broken feed URLs
11
QUESTIONS