INFO 344 Web Tools and Development
CK Wang, University of Washington, Spring 2014
Programming Assignment #3
Web Crawler Offline processing Dashboard
Web Crawler
– Google – crawl websites & index them for the search engine
– Amazon – crawl the web to price match against Amazon’s prices
– Aggregate content – Shopping (Nextag), News (finance.google.com)
Offline Processing & Dashboard
Offline/async processing
– Facebook Lookback
– Twitter firehose – sentiment analysis
– YouTube video compression (upload first, compress later)
– Anything that takes > 5 s to load => do it offline!
Dashboard
– An easy way to see the status of offline processing
Final Product
Azure Cloud Service with a Web Role and a Worker Role
Web Role
– dashboard.aspx – status, # of URLs, last 10 crawled, etc.
– admin.asmx – ClearIndex (also stops the current crawl), StartCrawling, GetPageTitle
Worker Role
– Read a URL from the Queue
– Crawl the website
– Store the page title in the Table
– Add URLs found on the page to the Queue
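A minimal sketch of what the Worker Role loop above could look like, assuming the Azure Storage client library (Microsoft.WindowsAzure.Storage) and HtmlAgilityPack; the queue/table names ("urlqueue", "crawledpages"), the CrawledPage entity shape, and the class structure are illustrative choices, not part of the assignment spec:

```csharp
// Sketch of the Worker Role crawl loop (names are illustrative).
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;
using Microsoft.WindowsAzure.Storage.Table;
using HtmlAgilityPack;

public class CrawledPage : TableEntity
{
    public CrawledPage() { }
    public CrawledPage(string url, string title)
    {
        PartitionKey = "crawled";
        RowKey = Uri.EscapeDataString(url);   // RowKey may not contain '/', '\', '#', '?'
        Url = url;
        Title = title;
    }
    public string Url { get; set; }
    public string Title { get; set; }
}

public class CrawlerWorker
{
    public void RunLoop(CloudStorageAccount account)
    {
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("urlqueue");
        CloudTable table = account.CreateCloudTableClient().GetTableReference("crawledpages");
        queue.CreateIfNotExists();
        table.CreateIfNotExists();

        while (true)
        {
            // read a URL from the Queue; back off briefly when the queue is empty
            CloudQueueMessage msg = queue.GetMessage();
            if (msg == null) { System.Threading.Thread.Sleep(1000); continue; }

            string url = msg.AsString;

            // crawl the page
            HtmlDocument doc = new HtmlWeb().Load(url);

            // store the page title in the Table
            HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
            string title = titleNode != null ? titleNode.InnerText : url;
            table.Execute(TableOperation.InsertOrReplace(new CrawledPage(url, title)));

            // add URLs found on the page back onto the Queue
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
                foreach (HtmlNode a in links)
                    queue.AddMessage(new CloudQueueMessage(a.GetAttributeValue("href", "")));

            queue.DeleteMessage(msg);
        }
    }
}
```

You would call something like RunLoop from the Worker Role's Run() method, with the CloudStorageAccount parsed from your storage connection string; duplicate filtering, same-domain checks, and error handling are omitted here and covered by the hints later in the assignment.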
Great User Experience
– Refreshing the dashboard should fetch new data
– The ASMX admin page should return a relevant status such as “Index Cleared” instead of void/an empty string; consider the other cases too (see the sketch below)
– Remove duplicate URLs
– Only crawl websites in the same domain as your seed URL
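As one way to satisfy the “relevant status” point, a minimal admin.asmx sketch; the class name, method bodies, and exact status strings are illustrative:

```csharp
// admin.asmx code-behind sketch: web methods return status strings, not void.
using System.Web.Services;

[WebService(Namespace = "http://tempuri.org/")]
[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
public class Admin : WebService
{
    [WebMethod]
    public string ClearIndex()
    {
        // stop the current crawl and wipe the table/queue here...
        return "Index Cleared";       // caller sees what actually happened
    }

    [WebMethod]
    public string StartCrawling()
    {
        // push the seed robots.txt URLs onto the queue here...
        return "Crawling Started";
    }
}
```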
Start Now! (ok… after PA2)
Deliverables
Due on May 19, 11pm PST. Submit on Canvas.
Please submit the following as a single zip file:
– URL to your Azure instance hosting the dashboard (readme.txt); make sure crawling is complete!
– URL to your GitHub repo (share your GitHub with me & the TA) in readme.txt
– Visual Studio 2013 project & source code
– Screenshot of your Azure dashboard with the Instance running (azure-compute.jpg)
– Write-up explaining how you implemented everything; make sure to address each of the requirements (writeup.txt, ~500 words)
– Extra credit – a short paragraph in extracredits.txt for each extra credit item (how to see/trigger/evaluate/run the feature and how you implemented it)
Hints
– Respect robots.txt (google it, it’s a simple format)
– You only need to crawl pages in the same domain
– Keep a list of already-visited URLs and don’t re-crawl them; store it in a fast-lookup data structure (see the sketch below)
– Think about where to store your stats
– Your code should handle 2+ worker threads, so think about concurrency when updating dashboard stats
– Local hosting/debugging: run Visual Studio as Administrator
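A minimal sketch of shared crawl state for 2+ worker threads, using ConcurrentDictionary for the visited-URL lookup and Interlocked for dashboard counters; the class and member names are illustrative:

```csharp
// Thread-safe crawl state shared by multiple worker threads.
using System.Collections.Concurrent;
using System.Threading;

public class CrawlState
{
    // fast lookup of already-visited URLs; safe for concurrent writers
    private readonly ConcurrentDictionary<string, bool> visited =
        new ConcurrentDictionary<string, bool>();

    private int crawledCount;

    // returns true only the first time a URL is seen, so each URL is crawled once
    public bool TryMarkVisited(string url)
    {
        return visited.TryAdd(url.ToLowerInvariant(), true);
    }

    // dashboard stat updated without a race between worker threads
    public void IncrementCrawled()
    {
        Interlocked.Increment(ref crawledCount);
    }

    public int CrawledCount { get { return crawledCount; } }
}
```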
Sitemaps
Start with the two robots.txt files & their sitemaps (CNN.com and Sports Illustrated).
For the CNN.com sitemap, ignore URLs more than 2 months old; for the Sports Illustrated sitemap, ignore non-NBA-related URLs.
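One possible way to pull the Sitemap entries out of robots.txt and apply the date/keyword filters; this assumes standard sitemap XML with <urlset>/<url>/<loc>/<lastmod> elements, and the helper names, cutoff handling, and keyword matching are illustrative (the real sitemaps may be nested sitemap index files that you would need to follow recursively):

```csharp
// Sketch: read Sitemap lines from robots.txt, then filter sitemap entries
// by <lastmod> date (e.g. CNN: last 2 months) or by a keyword (e.g. "nba").
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public static class SitemapReader
{
    // extract the "Sitemap: <url>" entries from a robots.txt file
    public static string[] GetSitemapUrls(string robotsTxtUrl)
    {
        string robots = new WebClient().DownloadString(robotsTxtUrl);
        return robots
            .Split('\n')
            .Where(l => l.Trim().StartsWith("Sitemap:", StringComparison.OrdinalIgnoreCase))
            .Select(l => l.Substring(l.IndexOf(':') + 1).Trim())
            .ToArray();
    }

    // keep only entries newer than `cutoff` (e.g. 2 months ago) and, if given,
    // containing `keyword` (e.g. "nba") somewhere in the URL
    public static string[] FilterSitemap(string sitemapXmlUrl, DateTime cutoff, string keyword)
    {
        XDocument doc = XDocument.Load(sitemapXmlUrl);
        XNamespace ns = doc.Root.Name.Namespace;
        return doc.Descendants(ns + "url")
            .Where(u =>
            {
                XElement lastmod = u.Element(ns + "lastmod");
                bool recent = lastmod == null || DateTime.Parse(lastmod.Value) >= cutoff;
                string loc = u.Element(ns + "loc").Value;
                bool matches = keyword == null ||
                               loc.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
                return recent && matches;
            })
            .Select(u => u.Element(ns + "loc").Value)
            .ToArray();
    }
}
```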
Extra Credit
– [10 pts] Multi-threaded crawler
– [10 pts] Crawl & index HTML body text (remove HTML tags; see the sketch below)
– [10 pts] Graphical dashboard (shows stats over time)
– [5 pts] Crawl more root domains (imdb, forbes, bbc, espn, Wikipedia; 1 pt per domain)
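For the body-text extra credit, one sketch of stripping HTML tags with HtmlAgilityPack; dropping <script> and <style> first is an assumption about what should count as body text, and the helper name is illustrative:

```csharp
// Sketch: extract visible body text by removing <script>/<style> nodes and
// taking InnerText of the <body> element.
using HtmlAgilityPack;

public static class BodyText
{
    public static string Extract(HtmlDocument doc)
    {
        HtmlNodeCollection junk = doc.DocumentNode.SelectNodes("//script|//style");
        if (junk != null)
            foreach (HtmlNode node in junk)
                node.Remove();

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        string text = body != null ? body.InnerText : doc.DocumentNode.InnerText;
        return HtmlEntity.DeEntitize(text).Trim();
    }
}
```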
Questions?