Published by Carol Newman. Modified over 9 years ago.
1
Building Scalable Web Archives
Florent Carpentier, Leïla Medjkoune
Internet Memory Foundation
IIPC GA, Paris, May 2014
2
Internet Memory Foundation
Internet Memory Foundation (European Archive)
  Established in 2004 in Amsterdam, then in Paris
  Mission: Preserve Web content by building a shared web archiving (WA) platform
  Actions: Dissemination, R&D and partnerships with research groups and cultural institutions
  Open Access Collections: UK National Archives & Parliament, PRONI, CERN, The National Library of Ireland, etc.
Internet Memory Research
  Spin-off of IMF established in June 2011 in Paris
  Mission: Operate large-scale or selective crawls and develop new technologies (processing and extraction)
3
Internet Memory Foundation
Focused crawling:
  Automated crawls through the Archivethe.Net shared platform
  Quality-focused crawls: video capture (YouTube channels), Twitter crawls, complex crawls
Large-scale crawling:
  Distributed software developed in-house
  Scalable crawler: MemoryBot
  Also designed for focused crawls and complex scoping
4
Research projects
Web Archiving and Preservation
  ✓ Living Web Archives (2007-2010)
  ✓ Archives to Community MEMories (2010-2013)
  ✓ SCAlable Preservation Environment (2010-2013)
Web-scale data Archiving and Extraction
  ✓ Living Knowledge (2009-2012)
  ✓ Longitudinal Analytics of Web Archive data (2010-2013)
5
MemoryBot design (1)
  Started in 2010 with the support of the LAWA (Longitudinal Analytics of Web Archive data) project
  URL store designed for large-scale crawls (DRUM)
  Built in Erlang, a language for distributed, fault-tolerant systems
  Distributed (consistent hashing)
  Robust: topology change adaptation, memory usage regulation, process isolation
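The slide names consistent hashing as the distribution mechanism but gives no detail. As a rough illustration only, a hash ring with virtual nodes might look like the sketch below. It is written in Python rather than Erlang, and the class name, the `vnodes` parameter and the choice of SHA-1 are all assumptions, not MemoryBot's actual design.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes  # virtual points per node; 64 is an arbitrary choice
        self._points = []     # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `vnodes` points for the node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        """Drop all points of a node; only keys it owned move elsewhere."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, host: str) -> str:
        """Return the node owning `host`: the first ring point after its hash."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[idx]]
```

The property that matters for topology changes is that removing a node only reassigns the hosts that node owned; hosts assigned to the remaining nodes stay where they are.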
6
MemoryBot design (2)
7
MemoryBot performance
  Good throughput with a slow decrease
  85 resources written per second, slowing to 55 after 4 weeks, on a cluster of nine 8-core servers (32 GiB of RAM)
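To get a feel for what these rates mean in volume, the short calculation below estimates the total number of resources written over the 4 weeks. The slide gives only the start and end rates, so the linear decline used here is an assumption, and the function name is hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def estimated_resources(start_rate, end_rate, weeks):
    """Total resources written if the rate falls linearly from start_rate to end_rate.

    The linear shape is an assumption; the slide reports only the two end points.
    """
    duration = weeks * SECONDS_PER_WEEK
    return (start_rate + end_rate) / 2 * duration


total = estimated_resources(85, 55, 4)
print(f"{total:,.0f} resources over 4 weeks")
```

Under that assumption the average rate is 70 resources per second, which gives roughly 169 million resources for the whole crawl.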
8
MemoryBot counters
10
MemoryBot – quality
  Support for HTTPS, retries on server failure, configurable URL canonicalisation
  Scope: domain suffixes, language, hop sequences, white lists, black lists
  Priorities
  Trap detection (URL pattern identification, duplicate detection within a PLD (pay-level domain))
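The slide lists URL canonicalisation and suffix-based scoping without describing the rules. The sketch below shows one plausible set of rules using only the Python standard library; the specific normalisations (lowercase host, drop default ports and fragments) and the function names are assumptions, not MemoryBot's configuration.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalise(url: str) -> str:
    """Normalise a URL so that trivially different spellings compare equal.

    Assumed rules: lowercase scheme and host, drop default ports, drop the
    fragment, use "/" for an empty path. User info, if any, is discarded.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def in_scope(url: str, allowed_suffixes, blacklist=()) -> bool:
    """Accept a URL whose host matches an allowed domain suffix and no blacklisted one."""
    host = (urlsplit(url).hostname or "").lower()

    def matches(suffix: str) -> bool:
        return host == suffix or host.endswith("." + suffix)

    if any(matches(b) for b in blacklist):
        return False
    return any(matches(s) for s in allowed_suffixes)
```

For example, `canonicalise("HTTP://Example.ORG:80/a?b=1#frag")` yields `http://example.org/a?b=1`, and `in_scope("http://www.nli.ie/page", ["ie"])` is true.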
11
MemoryBot – multi-crawl
  Easier management
  Politeness observed across different crawls
  Better resource utilisation
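Observing politeness across crawls implies that the per-host delay is tracked in one place shared by all crawls, rather than once per crawl. A minimal sketch of that idea follows; the class name, the method name and the 2-second default delay are all hypothetical.

```python
class PolitenessGate:
    """Tracks, per host, the earliest time the next fetch may start.

    One gate instance is meant to be shared by every running crawl, so two
    crawls targeting the same host cannot both hit it within the delay.
    """

    def __init__(self, delay_seconds: float = 2.0):
        self.delay = delay_seconds   # hypothetical default, not from the slides
        self._next_allowed = {}      # host -> earliest permitted fetch time

    def try_acquire(self, host: str, now: float) -> bool:
        """Return True and reserve the host if a fetch may start at time `now`."""
        if now < self._next_allowed.get(host, 0.0):
            return False
        self._next_allowed[host] = now + self.delay
        return True


gate = PolitenessGate(delay_seconds=2.0)
crawl_a_ok = gate.try_acquire("example.org", now=0.0)   # crawl A fetches
crawl_b_ok = gate.try_acquire("example.org", now=1.0)   # crawl B must wait
```

Passing the clock in as `now` keeps the sketch deterministic; a real scheduler would more likely queue the deferred URL than simply refuse it.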
12
IM Infrastructure
Green datacenters
  Through a collaboration with NoRack
  Designed for massive storage (petabytes of data)
  Highly scalable, low consumption
  Reduces storage and processing costs
Repository:
  HDFS (Hadoop Distributed File System): a distributed, fault-tolerant file system
  HBase: a distributed key-value index (temporal archives)
  MapReduce: a distributed execution framework
13
IM Platform (1)
  Data storage: temporal aspect (versions)
  Organised data: fast and easy access to content
  Easy processing distribution (Big Data)
  Several views on the same data: raw, extracted and/or analysed
  Takes care of data replication: no (W)ARC synchronisation required
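The temporal aspect means that each URL maps to several timestamped versions, and a typical query asks for the version in force at a given date. The in-memory sketch below illustrates that access pattern only. It does not use or imitate the HBase API, and the class name, method names and integer timestamp format are assumptions.

```python
import bisect


class TemporalIndex:
    """In-memory stand-in for a versioned key-value index (illustrative only)."""

    def __init__(self):
        # url -> (sorted list of timestamps, payloads in the same order)
        self._versions = {}

    def put(self, url, timestamp, payload):
        """Store one capture of `url`, keeping captures ordered by timestamp."""
        times, payloads = self._versions.setdefault(url, ([], []))
        idx = bisect.bisect_right(times, timestamp)
        times.insert(idx, timestamp)
        payloads.insert(idx, payload)

    def get_as_of(self, url, timestamp):
        """Return (timestamp, payload) of the latest capture at or before `timestamp`."""
        times, payloads = self._versions.get(url, ([], []))
        idx = bisect.bisect_right(times, timestamp)
        if idx == 0:
            return None  # no capture that early
        return times[idx - 1], payloads[idx - 1]


index = TemporalIndex()
index.put("http://example.org/", 20120101, "v1")
index.put("http://example.org/", 20140101, "v2")
index.put("http://example.org/", 20130101, "v1.5")
snapshot = index.get_as_of("http://example.org/", 20130601)
```

Here `snapshot` is the 2013 capture, since it is the most recent one not later than the requested date.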
14
IM Platform (2)
Extensive characterisation and data mining actions:
  – Process and reprocess information at any time depending on needs/requests
  – Extract information such as MIME type, text resources, image metadata, etc.
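As an illustration of this kind of characterisation job, the sketch below counts MIME types in map/reduce style. It runs in a single process with the standard library rather than on Hadoop, and the record layout (a dict with a `content_type` field) is a hypothetical stand-in for whatever the platform stores.

```python
from collections import Counter
from functools import reduce


def map_record(record):
    """Map step: emit (mime_type, 1) for one archived record.

    Parameters such as "; charset=UTF-8" are stripped, and a missing
    Content-Type is reported as "unknown".
    """
    content_type = record.get("content_type", "")
    mime = content_type.split(";", 1)[0].strip().lower() or "unknown"
    return (mime, 1)


def reduce_counts(acc, pair):
    """Reduce step: add one emitted pair into the running totals."""
    mime, n = pair
    acc[mime] += n
    return acc


def count_mime_types(records):
    """Run the map and reduce steps over an iterable of records."""
    return reduce(reduce_counts, map(map_record, records), Counter())


records = [
    {"content_type": "text/html; charset=UTF-8"},
    {"content_type": "TEXT/HTML"},
    {"content_type": "image/png"},
    {},
]
mime_counts = count_mime_types(records)
```

Because the map step is independent per record and the reduce step is a simple sum, the same logic can be rerun over the stored data whenever a new question comes up.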
15
SCAlable Preservation Environment (SCAPE)
QA/Preservation challenges?
  Growing size of web archives
  Ephemeral and heterogeneous content
  Costly tools/actions
Develop scalable quality assurance tools
Enhance existing characterisation tools
16
Visual automated QA: Pagelizer
  Visual and structural comparison tool developed by the UPMC as part of SCAPE
  Trained and enhanced through a collaboration with IMF
  Wrapped by the IMF team for large-scale use within its platform
  Compares two web page snapshots
  Outputs a similarity score
17
Visual automated QA: Pagelizer
  Tested on 13,000 pairs of URLs (Firefox & Opera)
  75% correct assessments
  Whole workflow runs in around 4 seconds per pair:
    2 seconds for the screenshot (depends on the page rendered)
    2 seconds for the comparison
  Processing time already halved since initial tests (MapReduce)
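The per-pair timings translate into total run time as sketched below. The sequential figure follows directly from the slide's numbers. The parallel figure assumes perfectly even distribution with no overhead, and the worker count of 10 is a hypothetical value, not the size of the actual cluster.

```python
import math

PAIRS = 13_000
SECONDS_PER_PAIR = 4  # 2 s screenshot + 2 s comparison, as reported on the slide


def wall_clock_hours(pairs, seconds_per_pair, workers=1):
    """Idealised run time: each worker handles ceil(pairs / workers) pairs in sequence."""
    return math.ceil(pairs / workers) * seconds_per_pair / 3600


sequential_hours = wall_clock_hours(PAIRS, SECONDS_PER_PAIR)            # one worker
parallel_hours = wall_clock_hours(PAIRS, SECONDS_PER_PAIR, workers=10)  # hypothetical
correct_pairs = round(0.75 * PAIRS)  # if the 75% applies to the whole test set
```

Run sequentially, the test set takes about 14.4 hours (52,000 seconds), which is why distributing the workflow matters at crawl scale.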
18
Next steps
Improvements are to be made:
  Performance
  Robustness
  Correctness
New test in progress on a large-scale crawl:
  Results to be disseminated to the community through the SCAPE project and through on-site demos (contact IMF)!
19
Thank you. Any questions?
http://internetmemory.org - http://archivethe.net
florent.carpentier@internetmemory.org
leila.medjkoune@internetmemory.org