Published by Ginger Norman. Modified over 9 years ago.
1
Web Archiving in the Audiovisual Domain. Julia Vytopil, Nederlands Instituut voor Beeld en Geluid (Netherlands Institute for Sound and Vision)
2
Our history of web archiving 2008-2010 2011-2012 2008-2010
3
Purposes of web archiving
4
What Web archiving is not
5
Web archiving as a context collection
7
Current project: selection of sites: broadcaster
8
Current project: selection of sites
9
Issues and challenges
11
Current status
12
Front end & back end
14
jvytopil@beeldengeluid.nl
15
Web Archiving in the audiovisual field. Studiedag webarchivering in Nederland (study day on web archiving in the Netherlands), Hilversum, October 30, 2014. Chloé Martin, chloe@internetmemory.net, http://archivethe.net
16
Web archiving
17
What? & Why?
What is a Web archive?
‣ A copy of a website
‣ Recorded by a crawler
‣ At a specific date and time
‣ Looks and feels like a real website
For whom? Any institution whose aim is to collect & preserve web/media material for historical, cultural, heritage or legal (compliance) purposes
Why? Web content is:
‣ Pervasive
‣ Dynamic
‣ Valuable
‣ Varied in format
‣ Ephemeral
18
How? ‣ Collection policy ‣ Management tools ‣ Quality control ‣ Access
19
Web Archiving Team
Put in place a cross-disciplinary team
‣ Curator / Librarian / Archivist
‣ Information system technician
Train a team
‣ Web archivist / Project Manager
‣ Engineer(s) to design & monitor the whole process (for an in-house solution)
Web archiving requires critical skills and experience, especially from engineers in the case of an in-house solution
20
Collection policy
21
Extensive Collection vs Intensive Collection
22
How to improve Selection Policy
IMR value propositions:
‣ [Topic crawls] Percolable, a tool to discover relevant sources
‣ [Crawl of active sources] Automated refreshment rate
‣ [Large crawls] Smart discovery crawl based on topic or language
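The "automated refreshment rate" idea can be illustrated with a small sketch. This is not IMR's method, which the slides do not describe; it assumes a simple rule where a source is recrawled more often when its content changed since the last capture and less often when it did not. The function names, the halving/doubling rule and the interval bounds are all hypothetical.

```python
import hashlib

# Assumed bounds for the recrawl interval (hypothetical values).
MIN_INTERVAL_HOURS = 1.0
MAX_INTERVAL_HOURS = 24.0 * 30


def content_fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest used to detect whether a page changed."""
    return hashlib.sha256(body).hexdigest()


def next_interval(current_hours: float, previous_fp: str, new_fp: str) -> float:
    """Halve the interval when the page changed, double it when it did not."""
    if previous_fp != new_fp:
        proposed = current_hours / 2
    else:
        proposed = current_hours * 2
    # Keep the result inside the assumed bounds.
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, proposed))


old = content_fingerprint(b"<html>news of Monday</html>")
new = content_fingerprint(b"<html>news of Tuesday</html>")
print(next_interval(24.0, old, new))  # 12.0 (changed, so crawl sooner)
print(next_interval(24.0, old, old))  # 48.0 (unchanged, so crawl later)
```

A byte-level hash treats any change (a rotating ad, a timestamp) as a new version, so a real setup would likely compare cleaned text instead.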
23
How? ‣ Collection policy ‣ Management tools ‣ Quality control ‣ Access
24
Archivethe.net
25
User Interface
26
Challenges: Technical issues
‣ Deep & Hidden Web
‣ Webspam and traps
‣ Dynamic websites
‣ Social Web (Twitter, FB, YouTube, Flickr, ...)
‣ Video
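To make the "traps" item concrete: a crawler trap is a site structure that generates an endless supply of URLs, such as an infinite calendar or a path that keeps repeating. A minimal sketch of a URL-based heuristic follows. The thresholds and the function name are assumptions chosen for illustration, not settings of any particular crawler.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Hypothetical thresholds; real crawlers tune these per collection.
MAX_DEPTH = 12            # maximum number of path segments
MAX_SEGMENT_REPEATS = 3   # maximum times one segment may repeat
MAX_QUERY_PARAMS = 10     # maximum number of query parameters


def looks_like_trap(url: str) -> bool:
    """Flag URLs whose shape suggests an endlessly generated link space."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    if segments and max(Counter(segments).values()) > MAX_SEGMENT_REPEATS:
        return True
    if len(parse_qsl(parts.query, keep_blank_values=True)) > MAX_QUERY_PARAMS:
        return True
    return False


print(looks_like_trap("http://example.org/news/2014/10/30/item.html"))
print(looks_like_trap("http://example.org/a/b/a/b/a/b/a/b/"))
```

Heuristics like this only cover URL shape. Deep-web content behind forms and dynamically generated pages need other techniques.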
27
Challenges: Video (B&G screenshot)
28
Challenges: Social Media (OurTube / Our Tweet screenshot)
29
Quality Assurance
30
Access
31
Access & Search
Browsing in the archive
‣ URL
‣ Full text with Elasticsearch
‣ Full text + branding (search, web archive)
‣ Automatic redirection
‣ Automated categorization
‣ Semantic expansion
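Browsing by URL usually means picking one capture out of several taken at different dates. The sketch below shows one plausible lookup: given a URL and a requested date, return the closest capture. The class and method names are hypothetical and the in-memory dictionary stands in for whatever index a real archive uses. The slides do not say how archivethe.net implements this.

```python
from bisect import bisect_left
from datetime import datetime


class CaptureIndex:
    """Toy in-memory index: URL -> sorted list of capture datetimes."""

    def __init__(self):
        self._captures = {}

    def add(self, url, captured_at):
        """Record a capture, keeping each URL's list sorted by time."""
        times = self._captures.setdefault(url, [])
        times.insert(bisect_left(times, captured_at), captured_at)

    def closest(self, url, wanted):
        """Return the capture time nearest to `wanted`, or None if unknown."""
        times = self._captures.get(url)
        if not times:
            return None
        pos = bisect_left(times, wanted)
        # Only the neighbours on either side of `pos` can be the nearest.
        candidates = times[max(0, pos - 1):pos + 1]
        return min(candidates, key=lambda t: abs(t - wanted))


index = CaptureIndex()
for day in (datetime(2014, 1, 1), datetime(2014, 10, 30), datetime(2014, 6, 1)):
    index.add("http://example.org/", day)

print(index.closest("http://example.org/", datetime(2014, 5, 1)))
```

Keeping the list sorted lets the lookup use binary search, which matters once a URL has many captures.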
32
Extract valuable information
From your large corpus, for Users / Researchers
‣ Cleaned text
‣ Keywords to add Cloud
‣ Outlinks to analyze Graphs
‣ Structure unstructured data (forums, ...)
‣ Named entities
‣ More coming soon...
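Two of the items above, cleaned text and outlinks, can be sketched with the Python standard library alone. This is an illustrative example, not IMR's pipeline. The class name, the function name and the rule that an outlink is any link pointing to a different host are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PageExtractor(HTMLParser):
    """Collect visible text and absolute link targets from one HTML page."""

    SKIPPED = {"script", "style"}  # content of these tags is not visible text

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (cleaned_text, outlinks); outlinks point to other hosts."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    base_host = urlsplit(base_url).netloc
    outlinks = [u for u in parser.links if urlsplit(u).netloc != base_host]
    return " ".join(parser.text_parts), outlinks


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    '<p>Hello <a href="/about">about</a> and '
    '<a href="http://other.example/x">other</a></p>'
    "<script>var a=1;</script></body></html>"
)
text, outlinks = extract(sample, "http://example.org/page")
print(text)
print(outlinks)
```

The outlink pairs (source host, target host) are the edges one would feed into a graph analysis. Keywords and named entities would need further processing of the cleaned text.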
33
About IMR
Internet Memory Research
✓ Spin-off of the Internet Memory Foundation; French start-up founded in 2011
✓ 20+ engineers actively engaged in the Web Archiving and Information Mining field
✓ EU projects: DOPA, Annomarket, TrendMiner, Rethink Big, ASAP
✓ High-performance large-scale crawler
✓ Scalable platform based on a distributed architecture and Big Data components (Hadoop, HBase, HDFS, …)
✓ Innovative infrastructure with low consumption
34
About IMR. Any questions? http://archivethe.net, chloe@internetmemory.net, Twitter: ArchiveTheNet