Recent approaches to capture web content, which Heritrix can’t harvest
- Capturing Social Media
- Screen filming of Rich Media
- Project: Event crawl of The Eurovision Song Contest in Copenhagen 2014
- Cooperation with researchers
NAS workshop, Paris 2014/Sabine Schostag

Why focus on social media?
- Social media are now the primary communication platforms during cultural and political events.
- Politicians, artists, musicians, and even traditional news media such as TV use social media more than traditional web pages.
- Entries on social media pages are ephemeral, so we need to capture them at a very high frequency.

Which social media did we crawl?
- Twitter.com: comments
- Youtube.com: video and comments
- Facebook.com: comments
- Live blogs
Excluded for technical reasons:
- …
- instagram.com: video and image
- tumblr.com: multimedia blog
- flickr.com: images
- vimeo.com: video

Which tools did we use?
- Harvesting with NetarchiveSuite using Heritrix 1.4*: weekly, daily and hourly
- “Crontab”-based screen dumping of static URLs to searchable PDFs using PhantomJS
- Manual browsing with LAP (Live Archive Program)
- XML extracts from APIs using our own tools and/or Digitalfootprints.dk
- Harvesting YouTube videos by extracting the video URLs from the “watch-url” pages with our own tool
- Screen recording using CamStudio.org and a Netlab.dk Linux tool wrapping “ffmpeg”
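The crontab-based screen dumping could look roughly like the sketch below. It is not the script used in the project: it only builds the command line for `rasterize.js`, the example script shipped with PhantomJS, and the function name `dump_command`, the output directory and the file naming scheme are assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def dump_command(url, out_dir, now=None):
    """Build a PhantomJS command that renders `url` to a timestamped PDF.

    Hypothetical helper: the naming scheme (host + UTC timestamp) is an
    assumption, chosen so that repeated dumps never overwrite each other.
    """
    now = now or datetime.now(timezone.utc)
    host = urlparse(url).netloc.replace(":", "_") or "page"
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    pdf_path = Path(out_dir) / f"{host}-{stamp}.pdf"
    # rasterize.js usage: rasterize.js URL filename [paperformat] [zoom]
    return ["phantomjs", "rasterize.js", url, str(pdf_path), "A4"]


if __name__ == "__main__":
    # A crontab entry such as "0 * * * * python3 dump.py" (hypothetical
    # file name) would run this once per hour.
    print(" ".join(dump_command("https://twitter.com/search?q=esc2014", "dumps")))
```

Passing the returned list to `subprocess.run` would perform the actual dump. Keeping the timestamp in the file name is one simple way to record when each dump was made, which matters for provenance.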

…more about the automated screen filming tool
- developed as part of a research project by a curator/researcher, now implemented as a tool
- allows scheduled capturing
- is well suited to capture pre-planned streamed content
- is well suited to capture frequently updated content which refreshes automatically (no mouse clicks)
- is not a replacement for existing collection methods, but a supplement
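To make the idea of scheduled capturing concrete, here is a minimal sketch of a wrapper around “ffmpeg”. It is not the Netlab.dk tool, whose interface is not described here. It assumes a Linux machine with an X11 display, and the display name, frame size, frame rate and function names are illustrative defaults.

```python
from datetime import datetime


def seconds_until(start, now):
    """Seconds to wait before a planned recording starts (never negative)."""
    return max(0.0, (start - now).total_seconds())


def record_command(out_file, seconds, display=":0.0", size="1280x720", fps=25):
    """Build an ffmpeg command that records the X11 screen for `seconds`."""
    if seconds <= 0:
        raise ValueError("recording length must be positive")
    return [
        "ffmpeg",
        "-f", "x11grab",          # grab frames from an X11 display
        "-video_size", size,      # size of the captured area
        "-framerate", str(fps),
        "-i", display,            # which display/screen to record
        "-t", str(seconds),       # stop after this many seconds
        out_file,
    ]
```

A scheduler (cron, or a script that sleeps for `seconds_until(...)`) would then hand the list to `subprocess.run`. This fits pre-planned streams because both the start time and the length are known in advance.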

…more about the automated screen filming tool
- The tool enables the user to program every mouse click and every interaction on the web page.
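One way to picture a programmed interaction is as a timed list of clicks that is replayed while the screen is being recorded. The sketch below is an assumption about how such a script could be represented, not a description of the actual tool. It uses the `xdotool` command-line utility as a stand-in for the replay mechanism, and the `Click` type and `to_commands` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    """One scripted step: wait, then left-click at screen position (x, y)."""
    wait_s: float
    x: int
    y: int


def to_commands(steps):
    """Translate scripted steps into shell commands, in replay order."""
    commands = []
    for step in steps:
        commands.append(["sleep", str(step.wait_s)])
        # "xdotool mousemove X Y click 1" moves the pointer and left-clicks.
        commands.append(
            ["xdotool", "mousemove", str(step.x), str(step.y), "click", "1"]
        )
    return commands
```

For example, `to_commands([Click(2.0, 640, 360)])` yields a two-second pause followed by a click in the middle of a 1280x720 screen.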

…some screenshots from the filming tool
ESC 2014 and the European Parliament Elections 2014

Lessons learned
- NetarchiveSuite using Heritrix 1.4* can’t harvest JavaScript with AJAX, and can’t keep up with the high frequency of feeds, e.g. 47,000 tweets/minute.
- You can record the “look and feel” with screen recording and dumping, but it is a HUGE amount of manual work, and it produces files and provenance documentation outside the archive.
- The LAP tool is not very useful, as it doesn’t support https (most social media use https today).
- “Digitalfootprints.dk” can archive almost all XML content for Twitter, and this content could be harvested afterwards by NetarchiveSuite/Heritrix.
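A little arithmetic shows why even the hourly harvest is too coarse for a feed of that size. The sketch assumes, purely for illustration, that the quoted peak of 47,000 tweets per minute is sustained for the whole interval, so the figures are upper bounds and not measurements.

```python
TWEETS_PER_MINUTE = 47_000  # peak rate quoted in the lessons learned

# Harvest intervals used in the project, in minutes.
INTERVALS = {"hourly": 60, "daily": 24 * 60, "weekly": 7 * 24 * 60}


def tweets_between_harvests(interval_minutes, rate=TWEETS_PER_MINUTE):
    """Upper bound on tweets posted between two consecutive harvests."""
    return interval_minutes * rate


if __name__ == "__main__":
    for name, minutes in INTERVALS.items():
        print(f"{name:>6}: {tweets_between_harvests(minutes):,} tweets")
    # hourly: 2,820,000 tweets
    #  daily: 67,680,000 tweets
    # weekly: 473,760,000 tweets
```

At peak rate, close to three million entries can appear between two hourly harvests, far more than a single page snapshot can show.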

Current issues
- wider access
- better access (free text search)
- inclusion of older net collections
- collection of websites with restricted access
- advanced web content, i.e. with sound/video/live interaction (chat, virtual worlds …)
- electronic communication networks ≠ the web
- long-term preservation
- documentation

… and from the technical point of view
- more stable and operational screen recording and dumping tools for huge social media events
- build social media API extract plugins into Heritrix, and better support for WARC linking of e.g. YouTube watch and video download URLs
- build scripting and https support into the LAP tool
- upgrade NetarchiveSuite to Heritrix 3.* to better support JavaScript with AJAX (using the Umbra plugin) and continuous crawling

Epilogue
- For the first time in Netarchive’s history, the whole team met for two days.