Published by Erica Griffith. Modified over 9 years ago.
1
Profiling Web Archive Coverage for Top-Level Domain & Content Language
Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Van de Sompel
International Conference on Theory and Practice of Digital Libraries
September 22-26, 2013, Valletta, Malta
3
Where can you find?
http://www.google.com/
4
Where can you find?
http://www.google.com/
5
Where can you find?
http://www.japantimes.co.jp/
6
Where can you find?
http://www.japantimes.co.jp/
7
Research Question
Problem
We need to profile web archives around the world by these characteristics:
o Age
o Top-level domains
o Languages
o Growth rate
Goal
To optimize query routing for the Memento Aggregator.
To determine the missing parts of the web.
8
Web Archives under this Experiment
Archive                        Full text   URI-lookup
Internet Archive                           √
Library of Congress                        √
Icelandic Web Archive                      √
Library and Archives Canada    √           √
British Library                √           √
UK National Library            √           √
Portuguese Web Archive         √           √
Web Archive of Catalonia       √           √
Croatian Web Archive           √           √
Archive of the Czech Web       √           √
National Taiwan University     √           √
Archive It                     √           √
9
Experiment
1. Sample URIs from different sources.
2. Retrieve the TimeMap from each archive.
3. Analyze the TimeMaps.
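The analysis step can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: it assumes a TimeMap serialized in the Memento link format, and the sample text, the `archive.example` host, and the function names `parse_timemap` and `count_mementos_by_year` are hypothetical.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap excerpt in link format (the archive host is made up).
SAMPLE_TIMEMAP = '''<http://example.com/>; rel="original",
<http://archive.example/web/20050101000000/http://example.com/>; rel="first memento"; datetime="Sat, 01 Jan 2005 00:00:00 GMT",
<http://archive.example/web/20100615120000/http://example.com/>; rel="memento"; datetime="Tue, 15 Jun 2010 12:00:00 GMT",
<http://archive.example/web/20100920080000/http://example.com/>; rel="last memento"; datetime="Mon, 20 Sep 2010 08:00:00 GMT"
'''

# One link entry: <uri> followed by zero or more ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (memento URI, datetime) pairs for every memento entry."""
    entries = []
    for uri, attrs in LINK_RE.findall(text):
        params = dict(ATTR_RE.findall(attrs))
        rels = params.get("rel", "").split()
        # Entries such as rel="first memento" also count as mementos.
        if "memento" in rels and "datetime" in params:
            entries.append((uri, parsedate_to_datetime(params["datetime"])))
    return entries


def count_mementos_by_year(text):
    """Count mementos per capture year, a simple view of archive age."""
    return Counter(dt.year for _, dt in parse_timemap(text))
```

Running `count_mementos_by_year` over the TimeMaps returned by each archive gives a per-archive distribution of capture years, which is one way to look at the age and growth characteristics named in the research question.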
10
URI Sample Sources
Web
1. DMOZ – Random sample
2. DMOZ – TLD: 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs)
3. DMOZ – Languages: 100 URIs for each language (24 languages)
Web Archives
4. Top 1-grams from Bing
5. Top 1,000 query terms from Yahoo in 9 languages
User requests
6. IA Wayback Machine log files
7. Memento Aggregator log files
* We used hostnames only.
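Reducing a sampled URI to its hostname, and reading the top-level domain from it, might look like the sketch below. The helper names `hostname_of` and `tld_of` are hypothetical, and taking the last dot-separated label as the TLD is a simplifying assumption (it would misread a bare IP address).

```python
from urllib.parse import urlsplit


def hostname_of(uri):
    """Return the lowercased hostname of a URI, or None if it has none."""
    return urlsplit(uri).hostname


def tld_of(uri):
    """Return the last label of the hostname (for example 'jp'), or None."""
    host = hostname_of(uri)
    if not host or "." not in host:
        return None
    return host.rsplit(".", 1)[1]
```

For example, `hostname_of("http://www.japantimes.co.jp/")` returns `"www.japantimes.co.jp"` and `tld_of` on the same URI returns `"jp"`.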
11
URI Sample Sources
Web
1. DMOZ – Random sample
o 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
2. DMOZ – TLD: 2% of each TLD from DMOZ or 100 URIs, whichever is greater
o 52 TLDs: com (23,470), de (6,332), org (4,025), uk (3,309), net (2,073), it (1,775), jp (1,379), ru (1,244), fr (1,154), pl (1,062), au (764), ca (642), at (438), edu (390), cz (385), tr (334), info (319), cn (278), us (266), nz (265), es (238), ar (213), no (150), br (149), tw (141), za (118), fi (113)
o 100 URIs each for: ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw
3. DMOZ – Languages: 100 URIs for each language
o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish, Russian, Turkish, Ukrainian
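The per-TLD rule (2% or 100 URIs, whichever is greater) can be written as a small helper. This is a sketch under assumptions: the names `tld_sample_size` and `sample_uris` are hypothetical, rounding to the nearest integer is a guess, and capping the result at the population size is added here so that sampling never asks for more URIs than exist.

```python
import random


def tld_sample_size(population, fraction=0.02, minimum=100):
    """Sample size for one TLD: 2% of its URIs or 100, whichever is greater.

    Assumption: the result is capped at the population size.
    """
    target = max(minimum, round(population * fraction))
    return min(target, population)


def sample_uris(uris, seed=None):
    """Draw a random sample of the rule-determined size from a list of URIs."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(uris, tld_sample_size(len(uris)))
```

With these defaults a TLD with 10,000 URIs yields 200 samples, while one with 3,000 URIs yields the 100-URI minimum.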
12
URI Sample Sources
Web Archive
Query the full-text search interface of the web archives with two sets of query terms.
4. Top 1-grams from Bing
o Most of them are English.
5. Top 1,000 query terms from Yahoo in 9 languages
o We excluded general keywords such as Obama and Facebook.
13
URI Sample Sources
Web Archive
Archives with full-text search
Columns: Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Spanish, Total, Top 1-Gram
AIT 262066351238373321119224342141126173953
BL 16323542350224020682251311940205664303187
CAN 498008046466017711358051413511107
CR 54706697703701741959960015991201
CZ 36317821578169515195771141310127860813360
CAT 2827752496244822802091292164242989964241
PO 912460360330813113536932673177141265004
TW 35717817616515710671981191004354
UK 02698200920492046001903187182613431
14
URI Sample Sources
User requests
Sampling from user requests for archived web material.
6. Sample from IA Wayback Machine log files
o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.
7. Sample from Memento Aggregator log files
o 100 URIs randomly sampled from the LANL Memento Aggregator between 2011 and 2013.
15
General Coverage
16
Web Archive Growth Rate
17
TLD Sample Coverage
18
TLD per archive (TLD Sample)
19
TLD per archive (Fulltext search)
20
TLD across archives
21
Languages distribution per archive
22
Query Routing Evaluation
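One way to picture profile-based query routing is the sketch below, which sends a lookup only to the few archives whose profile scores highest for the URI's top-level domain. It is an illustration under assumptions, not the evaluated system: the `PROFILE` table, the archive names, and the `route` function are hypothetical, and the coverage numbers are made up.

```python
from urllib.parse import urlsplit

# Hypothetical profile: fraction of sampled URIs per TLD found in each archive.
# These numbers are invented for illustration and are not results of the study.
PROFILE = {
    "archive-a": {"com": 0.80, "uk": 0.70, "jp": 0.60},
    "archive-b": {"com": 0.10, "uk": 0.65, "jp": 0.00},
    "archive-c": {"com": 0.05, "uk": 0.00, "jp": 0.30},
    "archive-d": {"com": 0.20, "uk": 0.05, "jp": 0.01},
}


def route(uri, profile=PROFILE, k=3):
    """Return the k archives with the highest profiled coverage for the URI's TLD."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    ranked = sorted(profile, key=lambda name: profile[name].get(tld, 0.0), reverse=True)
    return ranked[:k]
```

With this made-up profile, `route("http://www.japantimes.co.jp/")` selects `archive-a`, `archive-c`, and `archive-d`, so `archive-b` is never queried for `.jp` hosts.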
23
Conclusions
A new automatic technique to profile web archives using their available interfaces.
The Internet Archive provides broad coverage.
National archives have good coverage of their own domains.
The evaluation showed that we can retrieve the full TimeMap in 84% of cases using only 3 archives.