Profiling Web Archive Coverage for Top-Level Domain & Content Language Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Van de Sompel International.

Slides:



Advertisements
Similar presentations
From the UNL hypergraph to GETA's multilevel tree Etienne BLANC GETA, CLIPS-IMAG BP 53, F Grenoble cedex 09
Advertisements

Internationalized Domain Names: Opportunities for the Arab World Baher Esmat ICANN.
U.S. Government Language Requirements U.S. Government Language Requirements 7 September 2000 Everette Jordan Department of Defense
Yemelia International Language Services Translations Translations Translations Interpreting InterpretingInterpreting Multi-lingual IT Presentations Multi-lingual.
“…by 2014, about 34% of all new business software purchases will be consumed via SaaS…” - IDC, June 2010* Used by Over 50% of the Fortune % CIOs.
Adaptxt® Enhanced Keyboards for Smartphones and Tablets: CUSTOM-MADE FOR OEM SUCCESS KeyPoint Technologies February 25, 2013.
© 2008 Cisco Systems, Inc. All rights reserved.Cisco ConfidentialPresentation_ID 1 Translation Strategy and Roadmap CCNA Discovery CCNA Exploration ITE:
USERDEVELOPERADVERTISER.
Linkedin “Your Professional Networking Hub”. What is linkedin Linkedin is a social networking website for professionals. It’s highly homogenous with most.
< Translator Team > 25+ Languages, …and growing!.
7/16/2002JCDL 2002, Ray Larson The “Entry Vocabulary Index” Approach to Multilingual Search Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland University.
Lund University Libraries Head Office Update on International Seminar on Open Access for Developing Countries – Salvador, Bahia – Brazil September 21st-22.
1 Linguistic Resources needed by Nuance Jan Odijk Cocosda/Write Workshop.
The Influence of First Language on Reading and Spelling in English Linda Siegel University of British Columbia Vancouver, CANADA
Advanced Auto attendant v3.0. December 2003 Page 2 New Auto Attendant Features for 3.0 Allow different languages on different dialogs New Language support.
UNLIMITED. SIMULTANEOUS. NO CHECK-OUT. eREFERENCE.
Why do we study English? Form 9, unit 6.
Advanced Google Searching June Liebert Director and Assistant Professor The John Marshall Law School “Do no harm” – the Google mantra.
4th project meeting 27-29/05/2013, Budapest, Hungary FP 7-INFRASTRUCTURES programme agINFRA agINFRA A data infrastructure for agriculture.
1 QUESTEL ORBIT.COM. 2 QUESTEL French company Producer and provider of online and internet services Collection of patents, trademarks, designs, scientific-technical.
IBM Maximo Asset Management © 2007 IBM Corporation Tivoli Technical Exchange Calls Aug 31, Maximo - Multi-Language Capabilities Ritsuko Beuchert.
What is a Web Address?. What’s in a name? The URL (uniform resource locator) is just a technical word that means the address to a web page on the WWW.
Archival HTTP Redirection Retrieval Policies Temporal Web Analytics Workshop 2013, Rio De Janiro Ahmed AlSum, Michael L. Nelson Old Dominion University.
Betsy L. Humphreys Betsy L. Humphreys Associate Director for Library Operations NLM, NIH, HHS NLM, NIH, HHS National Library.
sound-effects-sound-clips-family-feud-download sound-effects-sound-clips-family-feud-download-
1 Translate and Translator Toolkit Universally accessible information through translation Jeff Chin Product Manager Michael Galvez Product Manager.
Foreign Language Enrollment Trends Trendy Languages Dennis Looney Modern Language Association Vanderbilt University Center for Second Language Studies.
Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson Old Dominion University, USA {sainswor, aalsum, hany, mweigle,
DLF Forum Nov OCLC Grid Services Roy Tennant Senior Program Officer OCLC Research EVERY CONNECTION has a starting point.
Rome May World demand trend Agricultural tractors Millions US $ Italian Institute for Foreign Trade.
Copyright © IBM Corp., The Eclipse™ Babel Project Translation Server Kit Lo IBM™ Corporation.
Profiling Web Archives Michael L. Nelson Ahmed AlSum, Michele C. Weigle Herbert Van de Sompel, David Rosenthal IIPC General Assembly Paris, France, May.
European Day of Languages Friday 26 th September 2014.
New RCLayout. Do product layout 3 improvements All products Local databases New functionalities.
Content Mgmt Services eText Overview Digital Delivery Aug 7, 2012.
URL’s Anatomy 1.02 Understand how to validate, authenticate, and legally use information from the Internet.
(Electronic Mail) Most popular use of Internet technology Advantages Disadvantages Setting up an account Your account –User id and password.
5 th EI World Congress - Berlin, July 2007 Use of the Web and Internet Technologies to enhance Teacher Union Work.
Why Study Languages Produced by the Subject Centre for Languages, Linguistics and Area Studies …When Everyone Speaks English?
© 2012 IBM Corporation Introducing IBM Cognos Insight.
What can Parents Do to Help Their Children Learn?.
Thumbnail Summarization Techniques For Web Archives Ahmed AlSum * Stanford University Libraries Stanford CA, USA 1 Michael L. Nelson.
Customization in the PATENTSCOPE search system Cyberworld November 2013 Sandrine Ammann Marketing & communications officer.
Luis Avila Tics. We have to recognize all the operating systems we have nowadays in the different smartphones Blackberry: Bb OS Iphone: iOS Nokia: symbian.
Council on the World Stage John Ellis (King’s College London) Formerly advisor to CERN DGs on relations with Non-Member States ‘Science for Peace’ Scientific.
ICESat/GLAS Status at NSIDC Doug Fowler NSIDC Product Team Lead PoDAG Oct , 2006.
EVALUATING WEB SITES AND SOURCES. Knowledge is Empowerment Today’s objective is to learn how to be critical with each resource you use in your literature.
Special Features: Colour: Blue/ Grey Shock resistant USB 3.0 mobile solid state drive Integrated USB 3.0 cable, no additional cables needed IP55 – dust.
The next 10 years of web globalization John Yunker Byte Level Research.
Languages of Europe Romance, Germanic, and Slavic.
Gravostyle 8 Detailed Launch Presentation
Profiling Web Archives
Who and What Links to the Internet Archive
Profiling Web Archive Coverage for Top-Level Domain & Content Language
Digital Brand Management Services
Sales Presenter Available now
Oracle Supplier Management Solution Product Availability
Retrospective 2017 & Future Plans
ارائه دهنده: مهندس عبداللهی
Digital Asset Management Part 11: Access

Part of Speech Tagging with Neural Architecture Search
COUNTRIES NATIONALITIES LANGUAGES.
Sales Presenter Available now Standard v Slim
Countries and nationalities
Lars Björnshauge, Lund University Libraries
Content Mgmt Services Digital Delivery Update February 2013

Presentation transcript:

Profiling Web Archive Coverage for Top-Level Domain & Content Language Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Van de Sompel International Conference on Theory and Practice of Digital Libraries September 22-26, 2013 Valletta, Malta 1

2

Where can you find? 3

Where can you find? 4

Where can you find? 5

Where can you find? 6

Research Question Problem We need to profile the web archives around the world with these characteristics: o Age o Top-level domains o Languages o Growth rateGoal To optimize the query routing for Memento Aggregator. To determine the missing parts of the web. 7

Web Archives under this Experiment Full textURI-lookup Internet Archive √ Library of Congress √ Icelandic Web Archive √ Library and Archives Canada √ √ British Library √√ UK National Library √√ Portuguese Web Archive √√ Web Archive of Catalonia √√ Croatian Web Archive √√ Archive of the Czech Web √√ National Taiwan University √√ Archive It √√ 8

Experiment Sampling from different sources Retrieve the TimeMap from each archive Analyze the TimeMaps 9

URIs Samples Sources Web 1.DMOZ – Random sample 2.DMOZ – TLD 2% of each TLD from DMOZ (.com,.org,.jp, etc 52 TLD) 3.DMOZ – Languages 100 URIs for each Languages (24 lang.) Web Archives 4.Top 1-Gram from Bing 5.Top 1000 queries term by Yahoo in 9 languages User requests 6.IA Wayback Machine Log files 7.Memento aggregator log files * We used hostnames only 10

URIs Samples Sources Web 1.DMOZ – Random sample o 10,000 URIs randomly sample from DMOZ directory (~5M URIs). 2.DMOZ – TLD 2% for each TLD from DMOZ or 100 URIs which are greater o 52 TLDs ( com 23,470) ( de 6,332), ( org 4,025), ( uk 3,309), ( net 2,073), ( it 1,775), ( jp 1379), ( ru 1244), ( fr 1154), ( pl 1062), ( au 764), ( ca 642), ( at 438), ( edu 390), ( cz 385), ( tr 334), ( info 319), ( cn 278), ( us 266), ( nz 265), ( es 238), ( ar 213), ( no 150), ( br 149), ( tw 141), ( za 118), ( fi 113), ( 100 URIS for [ ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw ]) 3.DMOZ – Languages 100 URIs for each Languages o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish, Russian, Turkish, Ukrainian 11

URIs Samples Sources Web Archive Query the fulltext search interface for the web archives with two set of query terms. 4.Top 1-Gram from Bing o Most of them is English 5.Top 1000 queries term by Yahoo in 9 languages o We excluded the general keywords such as: Obama, Facebook. 12

URIs Samples Sources Web Archive Chinese English French German Italian Japanese Korean Portuguese Spanish Total Top 1 Gram Archive with FullText search AIT BL CAN CR CZ CAT PO TW UK

URIs Samples Sources User requests Sampling from the users requests to the web archived materials 6.Sample from IA Wayback Machine Log files o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, Sample from Memento aggregator log files o 100 URIs randomly sampled from LANL Memento Aggregator between 2011 to

General Coverage 15

Web Archive Growth Rate 16

TLD Sample Coverage 17

TLD per archive (TLD Sample) 18

TLD per archive (Fulltext search) 19

TLD across archives 20

Languages distribution per archive 21

Query Routing Evaluation 22

Conclusions New automatic technique to profile the web archive using the available interface. Internet Archive provide broad coveage. National archives have good coverage for their domains. The evaluation showed that we can retrieve the full TimeMap in 84% using only 3 archives. 23