
1 Recent approaches to capture web content, which Heritrix can’t harvest
• Capturing Social Media
• Screen filming of Rich Media
• Project: Event crawl of The Eurovision Song Contest in Copenhagen 2014
• Cooperation with researchers
NAS workshop, Paris 2014 / Sabine Schostag

2 Why focus on social media?
• Nowadays social media are the primary communication platforms during cultural and political events
• Politicians, artists, musicians and even traditional news media such as TV use social media more than traditional web pages
• The entries on social media pages are ephemeral, so we need to capture them at a very high frequency

3 Which social media did we crawl?
• Twitter.com: comments
• Youtube.com: video and comments
• Facebook.com: comments
• Live blogs
Excluded for technical reasons …
• instagram.com: video and image
• tumblr.com: multimedia blog
• flickr.com: images
• vimeo.com: video

4 Which tools did we use?
• Harvesting with NetarchiveSuite using Heritrix 1.4*: weekly, daily and hourly
• “Crontab”-based screen dumping of static URLs to searchable PDFs using PhantomJS
• Manual browsing with LAP (Live Archive Program)
• XML extracts from APIs using our own tools and/or Digitalfootprints.dk
• Harvesting YouTube videos by extracting the video URLs from the “watch-url” pages with our own tool
• Screen recording using CamStudio.org and a Netlab.dk Linux tool wrapping “ffmpeg”
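
The “crontab”-based screen dumping can be sketched roughly as follows. This is only an illustration, not the Netarchive setup: it assumes a PhantomJS script along the lines of the rasterize.js example shipped with PhantomJS (called as script, URL, output file), and the URL, output directory and function names are hypothetical.

```python
import shlex
from urllib.parse import urlparse


def dump_command(url, out_dir, script="rasterize.js"):
    """Build one shell command that renders `url` to a timestamped PDF.

    Assumption: `script` takes (url, output file) like PhantomJS's
    rasterize.js example; the real tool may work differently.
    """
    host = urlparse(url).netloc.replace(".", "_")
    # $(date ...) is expanded by the shell when cron runs the job.
    # Each % is backslash-escaped because cron treats a bare % as a newline.
    stamp = "$(date +\\%Y\\%m\\%dT\\%H\\%M)"
    out_file = f"{shlex.quote(out_dir)}/{host}-{stamp}.pdf"
    return f"phantomjs {shlex.quote(script)} {shlex.quote(url)} {out_file}"


def crontab_line(url, out_dir, every_minutes=60):
    """Return a crontab entry that dumps `url` every `every_minutes` minutes."""
    if every_minutes < 1 or 60 % every_minutes != 0:
        raise ValueError("every_minutes must divide 60 evenly")
    minute_field = "0" if every_minutes == 60 else f"*/{every_minutes}"
    return f"{minute_field} * * * * {dump_command(url, out_dir)}"


if __name__ == "__main__":
    # Hypothetical live blog, dumped every 15 minutes.
    print(crontab_line("https://example.org/liveblog", "/data/dumps", 15))
```

Putting the timestamp in the file name keeps every dump as a separate file, which makes it easier to document afterwards when each capture was taken.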

5 …more about the automated screen filming tool
• developed by a curator/researcher as part of a research project, now implemented as a tool
• allows scheduled capturing
• is well suited to capturing pre-planned streamed content
• is well suited to capturing frequently updated content which refreshes automatically (no mouse clicks)
• is not a replacement for existing collection methods, but a supplement
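
A scheduled screen capture of this kind might look like the sketch below. It is not the Netlab.dk tool itself, whose interface is not described here; it assumes ffmpeg with the x11grab input device on a Linux X11 desktop, and the function names, display, frame size and frame rate are illustrative choices.

```python
import subprocess
import time
from datetime import datetime


def ffmpeg_capture_args(output, seconds, display=":0.0", size="1280x720", fps=25):
    """Build an ffmpeg command line that films an X11 display for `seconds`."""
    return [
        "ffmpeg",
        "-f", "x11grab",         # grab frames from an X11 display
        "-video_size", size,     # size of the captured area
        "-framerate", str(fps),  # frames per second to grab
        "-i", display,           # which display/screen to film
        "-t", str(seconds),      # stop after this many seconds
        output,
    ]


def seconds_until(start, now):
    """Seconds to wait from `now` until `start` (0 if already past)."""
    return max(0.0, (start - now).total_seconds())


def record_at(start, output, seconds):
    """Sleep until `start`, then film the screen for `seconds`."""
    time.sleep(seconds_until(start, datetime.now()))
    subprocess.run(ffmpeg_capture_args(output, seconds), check=True)
```

A call such as `record_at(datetime(2014, 5, 10, 21, 0), "stream.mkv", 3600)` (hypothetical start time and file name) would wait for the pre-planned start and then film one hour, provided a browser showing the stream is already open on that display.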

6 …more about the automated screen filming tool
• The tool enables the user to program every mouse click and every interaction on the web page

7 …some screenshots from the filming tool
ESC 2014 and the European Parliament Elections 2014

8 Lessons learned
• NetarchiveSuite using Heritrix 1.4* can’t harvest JavaScript with AJAX, nor keep up with the high frequency of feeds, e.g. 47,000 tweets/minute.
• You can record the “look and feel” with screen recording and dumping, but producing the files and provenance documentation outside the archive is a HUGE manual effort.
• The LAP tool is not very useful, as it doesn’t support HTTPS (most social media use HTTPS today).
• “Digitalfootprints.dk” can archive almost all XML content for Twitter, which could be harvested afterwards by NetarchiveSuite/Heritrix.
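
To get a feel for the feed frequency mentioned above, the small calculation below multiplies the 47,000 tweets/minute figure by the weekly, daily and hourly harvest intervals. It assumes, purely for illustration, that the peak rate is sustained for the whole interval, so the results are upper bounds rather than measurements; the function name is hypothetical.

```python
def items_between_captures(items_per_minute, capture_interval_minutes):
    """Number of new items published between two consecutive captures."""
    return items_per_minute * capture_interval_minutes


PEAK_TWEETS_PER_MINUTE = 47_000  # figure quoted on the slide

if __name__ == "__main__":
    intervals = [("hourly", 60), ("daily", 24 * 60), ("weekly", 7 * 24 * 60)]
    for label, minutes in intervals:
        count = items_between_captures(PEAK_TWEETS_PER_MINUTE, minutes)
        print(f"{label}: up to {count:,} tweets between captures")
```

Even at the hourly setting, a single capture of a feed page would have to cover millions of entries at that rate, which is one way to see why snapshot harvesting struggles here.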

9 Current issues
• wider access
• better access (free text search)
• inclusion of older net collections
• collection of websites with restricted access
• advanced web content, i.e. with sound/video/live interaction (chat, virtual worlds …)
• electronic communication networks ≠ the web
• long-term preservation
• documentation

10 … and from the technical point of view
• more stable and operational screen recording and dumping tools for huge social media events
• build social media API extract plugins into Heritrix, with better support for WARC linking of e.g. YouTube watch and video download URLs
• build scripting and HTTPS support into the LAP tool
• upgrade NetarchiveSuite to Heritrix 3.* to better support JavaScript with AJAX (using the Umbra plugin) and continuous crawling

11 Epilogue
• For the first time in Netarchive’s history, the whole team met for two days

