Dailymotion, a special crawl
Twice a year we crawl Dailymotion. But the site's model changes all the time…
– The seed list contained more than […] URLs in 2011, for example
Technical Solutions
1st crawl, August 2007
– so46c979db49349.addVariable("url", "http%3A%2F%2Fwww.dailymotion.com%2Fget%2F 14%2F320x240%2Fflv%2F flv%3Fkey%3Df d430fdc0700d90ecc01a53a4512e0656");
– 81.flv?key=f131548d430fdc0700d90ecc01a53a4512e 0656, i.e. a video file with an access key
– Beanshell script in the “extract-processors” chain
– Good result: 919 seeds, […] video files collected
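The decoding step behind the first crawl can be illustrated with a minimal Python sketch. It is not the script used for the crawl: the regular expression, the function name and the sample string are assumptions made for illustration, and the sample uses placeholder values rather than real Dailymotion identifiers or keys.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: captures the second argument of addVariable("url", "...").
URL_VAR = re.compile(r'addVariable\(\s*"url"\s*,\s*"([^"]+)"\s*\)')


def extract_video_url(page_source):
    """Return the percent-decoded video file URL found in page_source, or None."""
    match = URL_VAR.search(page_source)
    if match is None:
        return None
    return unquote(match.group(1))


# Made-up sample line; the host, file name and key are placeholders.
sample = (
    'so.addVariable("url", '
    '"http%3A%2F%2Fwww.example.com%2Fget%2F320x240%2Fflv%2F123.flv%3Fkey%3Dabc123");'
)
print(extract_video_url(sample))
```

Note that Python's `unquote` leaves `+` untouched, whereas Java's `URLDecoder.decode` turns it into a space, so the two can differ on URLs containing `+`.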
A Beanshell Script: dailymotion.bsh

    import org.archive.crawler.datamodel.CrawlURI;
    import org.archive.crawler.extractor.Link;
    import org.archive.util.TextUtils;
    import java.net.*;
    import java.util.Collection;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import java.util.regex.Matcher;

    // Pattern selecting the outlinks to rewrite.
    // NOTE: the pattern is truncated on the slide; everything after "^(?i)" is lost,
    // so the script cannot work until the full pattern is restored.
    String trigger = "^(?i)";
    // Replacement: keep only capture group 1 of the trigger pattern.
    String build = "$1";

    process(CrawlURI curi) {
        int size = curi.getOutLinks().size();
        if (size == 0) {
            return;
        }
        // use array copy because implied URIs will be added to outlinks
        Link[] links = curi.getOutLinks().toArray(new Link[size]);
        for (Link outlink : links) {
            Matcher m = TextUtils.getMatcher(trigger, outlink.getDestination());
            if (m.matches()) {
                String implied = m.replaceFirst(build);
                TextUtils.recycleMatcher(m);
                if (implied != null) {
                    try {
                        // the captured URL is percent-encoded: decode it, then queue it
                        implied = URLDecoder.decode(implied, "utf8");
                        curi.createAndAddLink(implied, Link.SPECULATIVE_MISC, Link.SPECULATIVE_HOP);
                    } catch (e) {
                        System.out.println("Dailymotion beanshell processor: ERROR : Probably Bad URI " + e);
                    }
                    // drop the original (encoded) outlink so it is not crawled as-is
                    if (curi.getOutLinks().remove(outlink)) {
                        System.out.println("Dailymotion beanshell processor: Outward link " + outlink + " has been removed from " + outlink.getSource());
                    } else {
                        System.out.println("Dailymotion beanshell processor: ERROR: Outward link " + outlink + " has NOT been removed from " + outlink.getSource());
                    }
                }
            }
        }
    }
Technical Solutions
2nd crawl, January 2008
– Beanshell script
– Rather good result: 3811 seeds, […] video files collected
3rd crawl, September 2008
– Beanshell script
– Poorer result: 9683 seeds, but only […] videos found; HTTP 403 errors
– Problem due to the limited validity of the access key (less than two hours)
4th crawl, February 2009
– Crawled in two steps:
  First step: the video pages, with a “Page + 1 click” harvest template
  Second step: the video files, with a “video” harvest template and a Bash script to generate video file URIs with valid access keys
– Rather good result: […] seeds, […] video files collected
Technical Solutions
How the two-job solution works
– All video page URIs are extracted from the first job’s crawl.log
– The second job is configured with “pause-at-finish=true”
– A Bash script is launched on the crawler machine, which:
  Checks the job state via the JMX interface and waits until the job is paused
  Fetches the video page with curl
  Extracts the video file URI
  Feeds this URI to the job via JMX (importUri command)
– 20 crawlers worked in parallel for the 2011 crawl
The big disadvantage: in the Wayback Machine, the video files are no longer accessible via the video pages, because the access keys differ
– But they are available via their URL
– No solution found so far
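The first step of the two-job solution, pulling the video page URIs out of the first job's crawl.log, could look roughly like the Python sketch below. It assumes the usual Heritrix crawl.log layout of whitespace-separated fields, with the fetch status in the second column and the URI in the fourth. The function name, the `marker` parameter and the sample log lines are made up for illustration. The JMX and curl steps are not covered.

```python
def video_page_uris(crawl_log_lines, marker="dailymotion.com/video/"):
    """Yield each successfully fetched video page URI once, in log order."""
    seen = set()
    for line in crawl_log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        status, uri = fields[1], fields[3]
        if status == "200" and marker in uri and uri not in seen:
            seen.add(uri)
            yield uri


# Made-up log lines in the assumed layout; the video ids are placeholders.
sample_log = [
    "2011-07-01T10:00:00.000Z 200 51234 http://www.dailymotion.com/video/x1_sample L http://www.dailymotion.com/user/demo text/html #001 - - -",
    "2011-07-01T10:00:01.000Z 404 512 http://www.dailymotion.com/video/x2_gone L http://www.dailymotion.com/user/demo text/html #002 - - -",
    "2011-07-01T10:00:02.000Z 200 1024 http://www.dailymotion.com/user/demo - - text/html #003 - - -",
    "2011-07-01T10:00:03.000Z 200 51234 http://www.dailymotion.com/video/x1_sample L http://www.dailymotion.com/user/other text/html #004 - - -",
]
print(list(video_page_uris(sample_log)))
```

The resulting list would then be handed to the script that fetches each page and feeds the fresh video file URI to the paused second job.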
Technical Solutions
5th crawl, October 2009
– Two-job solution
– Rather good result: 5659 seeds, […] video files collected
6th crawl, November 2010
– Big surprise: video file URIs appear directly in the source code of the video page, so no special solution is needed
– Good result: 8649 seeds, […] videos collected
7th crawl, July 2011
– The two-job solution again
– Poorer result: […] seeds, […] video files collected
– But a new phenomenon appeared: only […] unique video files
A number of video files are still missing. We don’t know why. That’s work for our next crawl…
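Indicators
Crawl | Seeds | Video files total | Video files 200 | Video files 403 | Video files 200 unique | % | Size | Solution | Works in WB
… | … | … | … | … | … | … | … GB | Beanshell | Yes
… | … | … | … | … | … | … | … TB | Beanshell | Yes
… | … | … | … | … | … | … | … GB | Beanshell | Yes
… | … | … | … | … | … | … | … TB | Two jobs | No
… | … | … | … | … | … | … | … TB | Two jobs | No
… | … | … | … | … | … | … | … TB | Direct | No
… | … | … | … | … | … | … | … TB | Two jobs | No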
Examples
We crawled:
com/video/xk1mpz_la-transition-a-commence-a-herat-dans-l-ouest-de-l-afghanistan_news
The video file’s URL in our archives is:
com/cdn/H x384/video/xk1mpz.mp4?auth= b2f0e2f64eb356828b e0911dbd2058
We didn’t crawl, on the same page…
denoncent-une-politique-de-stigmatisation_news
Which harvest template do you use?
How do you manage to crawl Dailymotion?
Today we need to reduce our seed list, so we are testing other harvest templates:
Ex.:
– Users’ pages: dailymotion.com/user/20Minutes/…
– Videos’ pages: dailymotion.com/video/
1st solution: path + scope one plus
2nd solution: path and page + 1
To access the videos in the Wayback
– Today it’s very complicated because the model changes each year
– The link between the video’s page and the video is broken because of the URL key
– Have you got a solution?