Dailymotion, a special crawl Twice a year we crawl Dailymotion. But the model changes all the time… –The seed list contains more than 13000 URLs in 2011,


1 Dailymotion, a special crawl
Twice a year we crawl Dailymotion. But the model changes all the time…
– The seed list contained more than 13000 URLs in 2011, for example:
  http://www.dailymotion.com/20Minutes
  http://www.dailymotion.com/user/20Minutes/1
  http://www.dailymotion.com/user/20Minutes/2

2 Technical Solutions
1st crawl, August 2007
– The video page source contained a call such as:
  so46c979db49349.addVariable("url", "http%3A%2F%2Fwww.dailymotion.com%2Fget%2F14%2F320x240%2Fflv%2F3208281.flv%3Fkey%3Df131548d430fdc0700d90ecc01a53a4512e0656");
– Decoded, this gives http://www.dailymotion.com/get/14/320x240/flv/3208281.flv?key=f131548d430fdc0700d90ecc01a53a4512e0656, i.e. a video file with an access key
– Beanshell script in the “extract-processors” chain
– Good result: 919 seeds, 11860 video files collected
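The decoding step on this slide can be illustrated in isolation. The following is a minimal sketch, not the script used for the crawl: the class name `FlashVarUrl` and the method `extractVideoUrl` are hypothetical, and the regular expression assumes the page uses exactly the `addVariable("url", "...")` form quoted above.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; names are illustrative and not part of the original crawl setup. */
final class FlashVarUrl {
    // Assumption: the player is configured with addVariable("url", "<percent-encoded URL>").
    private static final Pattern ADD_VARIABLE =
            Pattern.compile("addVariable\\(\"url\",\\s*\"([^\"]+)\"\\)");

    /** Returns the decoded video file URL, or empty if the page has no such call. */
    static Optional<String> extractVideoUrl(String pageSource) {
        Matcher m = ADD_VARIABLE.matcher(pageSource);
        if (!m.find()) {
            return Optional.empty();
        }
        // URLDecoder turns %3A into ':', %2F into '/', %3F into '?', %3D into '='.
        return Optional.of(URLDecoder.decode(m.group(1), StandardCharsets.UTF_8));
    }
}
```

Note that `URLDecoder.decode` throws `IllegalArgumentException` on a malformed escape sequence, so a caller working on real pages would probably want to catch that, as the Beanshell script does with its own catch block.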

3 A Beanshell Script: dailymotion.bsh

    import org.archive.crawler.datamodel.CrawlURI;
    import org.archive.crawler.extractor.Link;
    import org.archive.util.TextUtils;
    import java.net.*;
    import java.util.Collection;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import java.util.regex.Matcher;

    // Outlinks of the form .../video/http%3A%2F%2F... carry the encoded video file URL.
    String trigger = "^(?i)http://www.dailymotion.com/.*?video/(http%3A%2F%2F.*)$";
    String build = "$1";

    process(CrawlURI curi) {
        int size = curi.getOutLinks().size();
        if (size == 0) {
            return;
        }
        // use array copy because implied URIs will be added to outlinks
        Link[] links = curi.getOutLinks().toArray(new Link[size]);
        for (Link outlink : links) {
            Matcher m = TextUtils.getMatcher(trigger, outlink.getDestination());
            if (m.matches()) {
                String implied = m.replaceFirst(build);
                TextUtils.recycleMatcher(m);
                if (implied != null) {
                    try {
                        // Decode the captured URL and add it as a new outlink.
                        implied = URLDecoder.decode(implied, "utf8");
                        curi.createAndAddLink(implied, Link.SPECULATIVE_MISC, Link.SPECULATIVE_HOP);
                    } catch (e) {
                        System.out.println("Dailymotion beanshell processor: ERROR : Probably Bad URI " + e);
                    }
                    // Drop the original, still-encoded outlink.
                    if (curi.getOutLinks().remove(outlink)) {
                        System.out.println("Dailymotion beanshell processor: Outward link " + outlink + " has been removed from " + outlink.getSource());
                    } else {
                        System.out.println("Dailymotion beanshell processor: ERROR: Outward link " + outlink + " has NOT been removed from " + outlink.getSource());
                    }
                }
            }
        }
    }

4 Technical Solutions
2nd crawl, January 2008
– Beanshell script
– Rather good result: 3811 seeds, 62127 video files collected
3rd crawl, September 2008
– Beanshell script
– Poorer result: 9683 seeds, but only 47382 videos found and 30842 HTTP 403 errors
– Problem due to the limited validity of the access key (less than two hours)
4th crawl, February 2009
– Crawled in two steps:
  First step: the video pages, with a “Page + 1 click” harvest template
  Second step: the video files, with a “video” harvest template and a Bash script to generate video file URIs with valid access keys
– Rather good result: 10949 seeds, 73335 video files collected

5 Technical Solutions
How the two jobs solution works
– Extraction of all video page URIs from the first job’s crawl.log
– The second job is configured with “pause-at-finish=true”
– A Bash script is launched on the crawler machine, which:
  Checks the job state via the JMX interface and waits until the job is paused
  Fetches the video page with curl
  Extracts the video file URI
  Feeds this URI to the job via JMX (importUri command)
– 20 crawlers worked in parallel for the 2011 crawl
The big disadvantage: in the Wayback Machine, the video files are no longer accessible via the video pages, because the access keys differ
– But they are available via their URL
– No solution found so far
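The first step of the two jobs solution, pulling video page URIs out of crawl.log, might look like the sketch below. It is an illustration only, not the Bash script used for the crawls. It assumes the usual whitespace-separated Heritrix crawl.log layout, with the fetch status in the second field and the URI in the fourth, and it assumes that video pages are recognisable by `dailymotion.com/video/` in the URI. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; not the script used for the BnF crawls. */
final class CrawlLogVideoPages {
    /**
     * Returns the distinct video page URIs fetched with status 200,
     * in the order they first appear in the log.
     */
    static List<String> videoPageUris(List<String> crawlLogLines) {
        Set<String> uris = new LinkedHashSet<>();
        for (String line : crawlLogLines) {
            // Assumed layout: timestamp, status, size, URI, then further fields.
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue; // blank or truncated line
            }
            String status = fields[1];
            String uri = fields[3];
            if (status.equals("200") && uri.contains("dailymotion.com/video/")) {
                uris.add(uri);
            }
        }
        return new ArrayList<>(uris);
    }
}
```

The resulting list corresponds to the pages that the script then fetches with curl, one by one, to obtain video file URIs with fresh access keys.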

6 Technical Solutions
5th crawl, October 2009
– Two jobs solution
– Rather good result: 5659 seeds, 145761 video files collected
6th crawl, November 2010
– Big surprise: video file URIs were directly in the source code of the video page, so no special solution was needed
– Good result: 8649 seeds, 135599 videos collected
7th crawl, July 2011
– The two jobs solution again
– Poorer result: 13406 seeds, 182538 video files collected
– But a new phenomenon appeared: only 96968 unique video files
A number of video files are still missing. We don’t know why. That’s work for our next crawl…

7 Indicators

Crawl   | Seeds  | Video files total | Video files 200 | Video files 403 | Video files 200 unique | %  | Size     | Solution  | Works in WB
2007-08 | 919    | 11 945            | 11 860          | 85              | 11 795                 | 99 | 206.3 GB | Beanshell | Yes
2008-01 | 3 811  | 62 225            | 62 127          | 98              | 42 045                 | 68 | 1.0 TB   | Beanshell | Yes
2008-09 | 9 683  | 78 224            | 47 382          | 30 842          | 44 563                 | 94 | 567.7 GB | Beanshell | Yes
2009-02 | 10 949 | 73 335            |                 | 0               | 60 494                 | 82 | 1.0 TB   | Two jobs  | No
2009-10 | 5 659  | 146 501           | 145 761         | 740             | 113 493                | 78 | 1.5 TB   | Two jobs  | No
2010-11 | 8 649  | 135 603           | 135 599         | 4               | 133 184                | 98 | 2.5 TB   | Direct    | No
2011-07 | 13 406 | 182 538           |                 | 0               | 96 968                 | 53 | 4.4 TB   | Two jobs  | No
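The % column appears to be the share of unique files among the files fetched with status 200 (or among the total where no separate 200 count is given), rounded to a whole percent. That reading is an assumption, but it reproduces the figures on the slide. A small sketch, with hypothetical names:

```java
/** Hypothetical helper for recomputing the "%" indicator. */
final class CrawlIndicators {
    /** Share of unique files among fetched files, rounded to the nearest whole percent. */
    static long uniquePercent(long uniqueFiles, long fetchedFiles) {
        if (fetchedFiles <= 0) {
            throw new IllegalArgumentException("fetchedFiles must be positive");
        }
        return Math.round(100.0 * uniqueFiles / fetchedFiles);
    }
}
```

For example, `uniquePercent(11795, 11860)` gives 99 for the 2007-08 crawl, and `uniquePercent(96968, 182538)` gives 53 for the 2011-07 crawl.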

8 Examples
http://www.dailymotion.com/user/afp/1
We crawled:
  http://www.dailymotion.com/video/xk1mpz_la-transition-a-commence-a-herat-dans-l-ouest-de-l-afghanistan_news
The video file’s URL in our archives is:
  http://www.dailymotion.com/cdn/H264-512x384/video/xk1mpz.mp4?auth=1315101965-79b2f0e2f64eb356828be0911dbd2058
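The archived URL above differs from the one embedded in the archived video page only in its `auth` value, which is why the two do not match at replay time. The sketch below only illustrates that point: it reduces a video file URL to host and path, dropping the query string that carries the key. The names are hypothetical, and this does not repair replay in the Wayback Machine. It may, at most, help when comparing or counting video files independently of their keys.

```java
import java.net.URI;

/** Hypothetical helper; an illustration, not a fix for Wayback replay. */
final class VideoFileKey {
    /** Host and path of a video file URL, without the query string holding the access key. */
    static String withoutAccessKey(String videoFileUrl) {
        URI uri = URI.create(videoFileUrl);
        return uri.getHost() + uri.getPath();
    }
}
```

Two URLs for the same file, fetched at different times with different keys, then reduce to the same string, here `www.dailymotion.com/cdn/H264-512x384/video/xk1mpz.mp4`.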

9 On the same page, we did not crawl…
  http://www.dailymotion.com/video/xk1g8b_roms-les-associations-denoncent-une-politique-de-stigmatisation_news

10 Which harvest template do you use? How do you manage to crawl Dailymotion?
Today we need to reduce our seed list, so we are testing other harvest templates.
Example: http://www.dailymotion.com/20Minutes
– User pages: dailymotion.com/user/20Minutes/…
– Video pages: dailymotion.com/video/
– 1st solution: path + scope one plus
– 2nd solution: path and page + 1

11 To access the videos in the Wayback
– Today it’s very complicated, because the model changes each year
– The link between the video’s page and the video is broken because of the URL key
– Have you got a solution?

