Web Archiving in the Audiovisual Domain. Julia Vytopil, Nederlands Instituut voor Beeld en Geluid (Netherlands Institute for Sound and Vision)
Our history of web archiving
Purposes of web archiving
What Web archiving is not
Web archiving as a context collection
Current project: selection of sites: broadcaster
Current project: selection of sites
Issues and challenges
Current status
Front end & back end
Web Archiving in the Audiovisual Field. Studiedag webarchivering in Nederland (study day on web archiving in the Netherlands), Hilversum, October 30, 2014. Chloé Martin
Web archiving
What? & Why?
What is a Web archive?
‣ A copy of a website
‣ Recorded by a crawler
‣ At a specific date and time
‣ Looks and feels like a real website
For whom?
‣ Any institution that aims to collect & preserve web/media material for historical, cultural, heritage or legal (compliance) purposes
Why?
‣ Web content is pervasive, dynamic, valuable and ephemeral, and comes in a variety of formats
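The definition above (a copy of a page, recorded at a specific date and time) can be illustrated with a minimal sketch. The `Capture` record and the `closest_capture` helper are hypothetical names chosen for illustration; they are not part of any tool named in these slides, and real archives store far more metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Capture:
    """One archived copy of a URL (hypothetical record layout)."""
    url: str
    captured_at: datetime  # when the crawler recorded the page
    body: bytes            # the content as it was at that moment


def closest_capture(captures, url, when):
    """Return the capture of `url` whose timestamp is nearest to `when`,
    or None if the URL was never captured."""
    candidates = [c for c in captures if c.url == url]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c.captured_at - when))


# Usage: two captures of the same (made-up) page, queried for a date in between.
archive = [
    Capture("http://site.example/", datetime(2014, 1, 1, tzinfo=timezone.utc), b"v1"),
    Capture("http://site.example/", datetime(2014, 10, 30, tzinfo=timezone.utc), b"v2"),
]
hit = closest_capture(archive, "http://site.example/",
                      datetime(2014, 9, 1, tzinfo=timezone.utc))
```

Keying each copy by URL and capture time is what lets an archive hold several versions of the same page side by side.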
How? Collection policy Management tools Quality control Access
Web Archiving Team
Put in place a cross-disciplinary team
‣ Curator / Librarian / Archivist
‣ Information system technician
Train a team
‣ Web archivist / Project Manager
‣ Engineer(s) to design & monitor the whole process (for an in-house solution)
Web archiving requires critical skills and experience, especially from engineers in the case of an in-house solution
Collection policy
Extensive Collection vs Intensive Collection
How to improve Selection Policy
IMR value propositions:
‣ [Topic crawls] Percolable, a tool to discover relevant sources
‣ [Crawl of active sources] Automated refreshment rate
‣ [Large crawls] Smart discovery crawl based on topic or language
How? Collection policy Management tools Quality control Access
Archivethe.net
User Interface
Challenges: Technical issues
‣ Deep & Hidden Web
‣ Webspam and traps
‣ Dynamic websites
‣ Social Web (Twitter, FB, YouTube, Flickr, ...)
‣ Video
Challenges: Video B&G Screenshot
Challenges: Social Media OurTube / Our Tweet screenshot
Quality Assurance
Access
Access & Search
‣ Browsing in the archive
‣ URL
‣ Full Text with Elasticsearch
‣ Full Text + Branding (search, web archive)
‣ Automatic redirection
‣ Automated categorization
‣ Semantic expansion
Extract valuable information from your large corpus for users / researchers
‣ Cleaned text
‣ Keywords to add a cloud
‣ Outlinks to analyze graphs
‣ Structure unstructured data (forums, ...)
‣ Named entities
‣ More coming soon...
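As an illustration of the "outlinks to analyze graphs" item, here is a minimal sketch using only the Python standard library. It is not IMR's implementation. The function names and example URLs are assumptions made for illustration, and "outlink" is taken here to mean an http(s) link pointing to a different host than the archived page.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class OutlinkParser(HTMLParser):
    """Collects the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_outlinks(base_url, html):
    """Return http(s) links that point to a host other than the page's own."""
    parser = OutlinkParser(base_url)
    parser.feed(html)
    source_host = urlparse(base_url).netloc
    outlinks = []
    for link in parser.links:
        parts = urlparse(link)
        if parts.scheme in ("http", "https") and parts.netloc != source_host:
            outlinks.append(link)
    return outlinks


def host_edges(base_url, html):
    """Count host-to-host edges, the raw material for a link graph."""
    source_host = urlparse(base_url).netloc
    return Counter(
        (source_host, urlparse(link).netloc)
        for link in extract_outlinks(base_url, html)
    )


# Usage on a made-up archived page.
page = (
    '<a href="/about">About</a>'
    '<a href="http://other.example/page">x</a>'
    '<a href="http://other.example/b">y</a>'
    '<a href="mailto:a@b.c">mail</a>'
)
links = extract_outlinks("http://site.example/index.html", page)
edges = host_edges("http://site.example/index.html", page)
```

Aggregating edges at host level keeps the graph small enough to inspect; a page-level graph would keep the full URLs instead.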
About IMR
Internet Memory Research
✓ Spin-off of the Internet Memory Foundation, French start-up, founded in 2011
✓ 20+ engineers actively engaged in the Web Archiving and Information Mining field
✓ EU Projects: DOPA, Annomarket, TrendMiner, Rethink Big, ASAP
✓ Large-scale crawler with high performance
✓ Scalable platform based on a distributed architecture and Big Data components (Hadoop, HBase, HDFS,…)
✓ Innovative infrastructure with low consumption
About IMR Any questions? Twitter ArchiveTheNet