Brass: A Queueing Manager for Warrick Frank McCown, Amine Benjelloun, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk,


1 Brass: A Queueing Manager for Warrick
Frank McCown, Amine Benjelloun, and Michael L. Nelson
Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
IWAW 2007, Vancouver, BC, June 23, 2007

2 Agenda
- Screen-scraping the web user interface (WUI)
- Search engine APIs
- Comparing search results
- Five-month experiment
- Significant findings and conclusions

3 Image sources
- Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
- Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
- Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg


5 A couple weeks ago I… accidentally deleted my entire database of about 30 articles. After I finished berating myself for being so stupid, I realized that my hosting company would have a backup, so I sent an email asking them to restore the database. Their reply stated that backups were “coming soon”…OUCH! So right after I signed up with a better hosting company I had to figure out a plan B.

6 Crawling the Crawlers

7
- McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
- McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.
- McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
- McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/


10 Cached Image

11 Cached PDF: http://www.fda.gov/cder/about/whatwedo/testtube.pdf
Panels: MSN version, Yahoo version, Google version, canonical

12 Examples of Lost Websites Recovered with Warrick

13 Web Crawler

14 Web-Repository Crawler

15 Limitations
Web crawling:
- Limit hit rate per host
- Websites periodically unavailable
- Portions of website off-limits (robots.txt, passwords)
- Deep web
- Spam
- Duplicate content
- Flash and JavaScript interfaces
- Crawler traps
Web-repo crawling:
- Limit hit rate per repo
- Limited hits per day (API query quotas)
- Repos periodically unavailable
- Flash and JavaScript interfaces
- Can only recover what repos have stored
- Lossy format conversions (thumbnail images, HTMLized PDFs, etc.)
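The "limit hit rate per repo" constraint can be illustrated with a small bookkeeping class. This is only a sketch: the class name, the fixed minimum interval, and the repository names in the usage note are assumptions for illustration, not details of Warrick itself.

```python
class HitRateLimiter:
    """Tracks, per repository, the earliest time the next request may be sent.

    Hypothetical helper: the fixed minimum interval is an assumed policy.
    """

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self.next_allowed = {}  # repo name -> earliest allowed timestamp

    def wait_time(self, repo, now):
        """Seconds the caller should wait before querying `repo` at time `now`."""
        return max(0.0, self.next_allowed.get(repo, now) - now)

    def record_request(self, repo, now):
        """Note that a request was sent at `now`, pushing back the next allowed time."""
        self.next_allowed[repo] = now + self.min_interval
```

A caller would check `wait_time("google", now)` before each query, sleep for that long, and then call `record_request`. Each repository is tracked separately, so a delay for one does not hold up queries to another.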

16 Problems with Warrick
- Requires user to download, install, and run from the command line:
  warrick.pl -d -r -o log.txt -c -wr ia http://foo.org/
- Google API keys are no longer available
- Screen-scrapes Google’s web user interface, which can cause Google to blacklist an IP address

17 Solution: Brass
- Queueing system using ODU nodes, so API query limits can be spread across several machines
- Uses Google API keys that we obtained before they stopped being made available
- Easy-to-use web interface that emails the user when a reconstruction is complete
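One way to picture spreading API query limits across several machines is a dispatcher that sends each batch of queries to the node with the most quota left. The sketch below is illustrative only: the `Node` class, the node names, the quota figures, and the "most remaining quota" policy are assumptions, not a description of how Brass actually schedules work.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One machine with its own API key and daily query allowance (hypothetical)."""
    name: str
    daily_quota: int
    used: int = 0

    def remaining(self) -> int:
        return self.daily_quota - self.used


def pick_node(nodes, queries_needed):
    """Return the node with the most remaining quota that can serve the batch, or None."""
    candidates = [n for n in nodes if n.remaining() >= queries_needed]
    if not candidates:
        return None
    return max(candidates, key=Node.remaining)


def dispatch(nodes, queries_needed):
    """Charge the batch to the chosen node and return its name (None if no node fits)."""
    node = pick_node(nodes, queries_needed)
    if node is None:
        return None
    node.used += queries_needed
    return node.name
```

When `dispatch` returns `None`, every node has too little quota left for the batch, so the job would have to wait until the quotas reset.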

18 Warrick Brown, Captain Jim Brass
http://www.cbs.com/primetime/csi/bios/index.php?cast_member=gary


23 Brass Architecture

24 Job Processing
1. Pending – Waiting to be confirmed
2. Queued – Waiting to be started
3. Processing – Currently being executed
4. Complete – Ready to be picked up
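The four job states above can be modelled as a small state machine. This is a minimal sketch that assumes jobs move strictly forward through the states in the order listed. The slide does not say how failures or cancellations are handled, and the `Job` class and its fields are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"        # waiting to be confirmed
    QUEUED = "queued"          # waiting to be started
    PROCESSING = "processing"  # currently being executed
    COMPLETE = "complete"      # ready to be picked up


# Assumed policy: each state may only move to the next one in the list.
NEXT_STATE = {
    JobState.PENDING: JobState.QUEUED,
    JobState.QUEUED: JobState.PROCESSING,
    JobState.PROCESSING: JobState.COMPLETE,
}


class Job:
    """A reconstruction request for one website (hypothetical fields)."""

    def __init__(self, url, email):
        self.url = url
        self.email = email
        self.state = JobState.PENDING

    def advance(self):
        """Move the job to its next state; a complete job cannot advance."""
        if self.state not in NEXT_STATE:
            raise ValueError("job is already complete")
        self.state = NEXT_STATE[self.state]
        return self.state
```

Keeping the allowed transitions in one table makes it easy to reject out-of-order updates, such as a job jumping from Pending straight to Complete.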

25 Other Warrick Deployments
- GUI interface for client executable
  – Installation difficulties
  – Lack of Google API keys
- Web interface along with client application which makes queries
  – Browser plug-in, Flash, or applet
  – Must manage Google API keys
  – Browser must be left open, with continued Internet access

26 Conclusions
- Warrick interface is almost ready for the public
- Web interface will likely greatly increase Warrick usage
- Collection of usage data will allow us to better understand what kinds of websites the public is interested in recovering

27 Frank McCown, fmccown@cs.odu.edu
And that’s everything there is to know about Brass!
And a lot more than I wanted to know…

