Video and flash harvesting

Dailymotion, a special crawl
Twice a year we crawl Dailymotion. But the model changes all the time…
– The seed list contained more than 13,000 URLs in 2011, for example



Technical Solutions
1st crawl, August 2007
– so46c979db49349.addVariable("url", "http%3A%2F%2Fwww.dailymotion.com%2Fget%2F 14%2F320x240%2Fflv%2F flv%3Fkey%3Df d430fdc0700d90ecc01a53a4512e0656");
– 81.flv?key=f131548d430fdc0700d90ecc01a53a4512e 0656, i.e. a video file with an access key
– Beanshell script in the “extract-processors” chain
– Good result: 919 seeds, video files collected
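The decoding step behind the first crawl can be illustrated with a small standalone Java sketch. It is not the Beanshell processor used in the crawl: the class name, the regular expression and the example.org URL are all hypothetical, and the sketch assumes only that the page source carries the video file URI as a percent-encoded second argument of an addVariable("url", …) call, as in the fragment above.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class AddVariableUrl {
    // Assumed pattern: captures the quoted value passed as the "url" variable.
    private static final Pattern URL_VARIABLE =
            Pattern.compile("addVariable\\(\"url\",\\s*\"([^\"]+)\"\\)");

    // Returns the percent-decoded video file URI, or empty if the page has no such call.
    static Optional<String> extractVideoUri(String pageSource) {
        Matcher m = URL_VARIABLE.matcher(pageSource);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(URLDecoder.decode(m.group(1), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Made-up sample; real pages carried a dailymotion.com URI with an access key.
        String sample = "player.addVariable(\"url\", "
                + "\"http%3A%2F%2Fwww.example.org%2Fget%2Fclip.flv%3Fkey%3Dabc123\");";
        System.out.println(extractVideoUri(sample).orElse("no match"));
    }
}
```

The sketch uses URLDecoder.decode(String, Charset), which requires Java 10 or later. The decoded URI still contains the access key, so it is only useful while that key is valid.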

A Beanshell Script
dailymotion.bsh

    import org.archive.crawler.datamodel.CrawlURI;
    import org.archive.crawler.extractor.Link;
    import org.archive.util.TextUtils;
    import java.net.*;
    import java.util.Collection;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import java.util.regex.Matcher;

    String trigger = "^(?i)…"; // pattern truncated in the transcript
    String build = "$1";

    process(CrawlURI curi) {
        int size = curi.getOutLinks().size();
        if (size == 0) {
            return;
        }
        // use array copy because implied URIs will be added to outlinks
        Link[] links = curi.getOutLinks().toArray(new Link[size]);
        for (Link outlink : links) {
            Matcher m = TextUtils.getMatcher(trigger, outlink.getDestination());
            if (m.matches()) {
                String implied = m.replaceFirst(build);
                TextUtils.recycleMatcher(m);
                if (implied != null) {
                    try {
                        // the captured video file URI is percent-encoded
                        implied = URLDecoder.decode(implied, "utf8");
                        curi.createAndAddLink(implied, Link.SPECULATIVE_MISC, Link.SPECULATIVE_HOP);
                    } catch (e) {
                        System.out.println("Dailymotion beanshell processor: ERROR : Probably Bad URI " + e);
                    }
                    // drop the original outlink now that the implied URI has been queued
                    if (curi.getOutLinks().remove(outlink)) {
                        System.out.println("Dailymotion beanshell processor: Outward link " + outlink + " has been removed from " + outlink.getSource());
                    } else {
                        System.out.println("Dailymotion beanshell processor: ERROR: Outward link " + outlink + " has NOT been removed from " + outlink.getSource());
                    }
                }
            }
        }
    }

Technical Solutions
2nd crawl, January 2008
– Beanshell script
– Rather good result: 3811 seeds, video files collected
3rd crawl, September 2008
– Beanshell script
– Poorer result: 9683 seeds, but only videos found, HTTP 403 errors
– Problem due to the limited validity of the access key (less than two hours)
4th crawl, February 2009
– Crawled in two steps:
  First step: the video pages, with a “Page + 1 click” harvest template
  Second step: the video files, with a “video” harvest template and a Bash script to generate video file URIs with valid access keys
– Rather good result: seeds, video files collected

Technical Solutions
How the two-jobs solution works
– All video page URIs are extracted from the first job’s crawl.log
– The second job is configured with “pause-at-finish=true”
– A Bash script is launched on the crawler machine, which:
  Checks the job state via the JMX interface and waits until the job is paused
  Fetches the video page with curl
  Extracts the video file URI
  Feeds this URI to the job via JMX (importUri command)
– 20 crawlers worked in parallel for the 2011 crawl
The big disadvantage: in the Wayback Machine, the video files are no longer accessible via the video pages because the access keys differ
– But they are available via their URL
– No solution found so far
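The first step of this workflow, pulling the video page URIs out of the first job's crawl.log, can be sketched as follows. This is an illustration in Java, not the Bash script used for the crawl. It assumes the usual Heritrix crawl.log layout of whitespace-separated fields, with the fetch status code in the second field and the URI in the fourth. The class name, the "dailymotion.com/video/" filter and the sample log lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
class CrawlLogVideoPages {
    // Returns the distinct, successfully fetched video page URIs in log order.
    static List<String> videoPageUris(List<String> crawlLogLines) {
        Set<String> uris = new LinkedHashSet<>();
        for (String line : crawlLogLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue; // blank or malformed line
            }
            String status = fields[1]; // assumed: fetch status code
            String uri = fields[3];    // assumed: fetched URI
            if (status.equals("200") && uri.contains("dailymotion.com/video/")) {
                uris.add(uri);
            }
        }
        return new ArrayList<>(uris);
    }

    public static void main(String[] args) {
        // Made-up log lines in the assumed layout.
        List<String> sample = List.of(
                "2011-07-12T09:00:00.000Z 200 5120 http://www.dailymotion.com/video/x1abc_title L http://www.dailymotion.com/user/demo text/html #001 - - -",
                "2011-07-12T09:00:01.000Z 403 310 http://www.dailymotion.com/video/x2def_other L http://www.dailymotion.com/user/demo text/html #002 - - -");
        videoPageUris(sample).forEach(System.out::println);
    }
}
```

Each URI returned this way would then be fetched again just before the second job needs it, because the access key embedded in the video file URI is valid for less than two hours.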

Technical Solutions
5th crawl, October 2009
– Two-jobs solution
– Rather good result: 5659 seeds, video files collected
6th crawl, November 2010
– Big surprise: video file URIs appear directly in the source code of the video page, so no special solution is needed
– Good result: 8649 seeds, videos collected
7th crawl, July 2011
– The two-jobs solution again
– Poorer result: seeds, video files collected
– But a new phenomenon appeared: only unique video files
A number of video files are still missing. We don’t know why. That’s work for our next crawl…

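Indicators
Crawl | Seeds | Video files total | Video files 200 | Video files 403 | Video files 200 unique | % | Size | Solution | Works in WB
 | | | | | | | GB | Beanshell | Yes
 | | | | | | | TB | Beanshell | Yes
 | | | | | | | GB | Beanshell | Yes
 | | | | | | | TB | Two jobs | No
 | | | | | | | TB | Two jobs | No
 | | | | | | | TB | Direct | No
 | | | | | | | TB | Two jobs | No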

Examples
We crawled: com/video/xk1mpz_la-transition-a-commence-a-herat-dans-l-ouest-de-l-afghanistan_news
The video file’s URL in our archives is: com/cdn/H x384/video/xk1mpz. mp4?auth= b2f0e2f64eb356828b e0911dbd2058

We didn’t crawl, on the same page… denoncent-une-politique-de-stigmatisation_news

Which harvest template do you use?
How do you manage to crawl Dailymotion?
Today we need to reduce our seed list, so we are testing other harvest templates. Examples:
– Users’ pages: dailymotion.com/user/20Minutes/…
– Video pages: dailymotion.com/video/
– 1st solution: path + scope one plus
– 2nd solution: path and page + 1

To access the videos in the Wayback
– Today it’s very complicated because the model changes each year
– The link between the video’s page and the video is broken because of the URL key
– Have you got a solution?