Profiling Web Archives Michael L. Nelson Ahmed AlSum, Michele C. Weigle Herbert Van de Sompel, David Rosenthal IIPC General Assembly Paris, France, May.

Presentation transcript:

Profiling Web Archives
Michael L. Nelson, Ahmed AlSum, Michele C. Weigle, Herbert Van de Sompel, David Rosenthal
IIPC General Assembly, Paris, France, May 21,

Where's that issue with the Afghan girl?


Prior IIPC Memento Aggregator Project
Ten IIPC archives, led by LANL
Conceived at 2011 IIPC meeting
Results reported at 2012 IIPC meeting
  o Two highlights:

Stop and Rethink…
LANL's processing was informative from a "big data" perspective, but was neither scalable nor sustainable
  o "send us your CDX" == hard for both parties
  o there are lots of URIs in the world
Will only get worse with:
  o more archives…
  o …doing more archiving

Leverage Memento Aggregators
Memento aggregators currently broadcast URI lookups to all known archives
New approach:
  1. Build profiles based on sampling from URI lookups (optionally supplemented with CDX files when available)
  2. Use archive profiles to inform Memento aggregator "query routing" decisions
  3. Share serialized profiles with other IIPC partners
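The routing idea in step 2 can be sketched as follows. This is a minimal illustration and not the aggregator's actual logic: the `PROFILES` values, the `tld_of` and `route` helpers, and the choice to rank archives by the TLD fraction in their own profile are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical profiles: archive code -> {TLD: fraction of that archive's sampled holdings}.
PROFILES = {
    "IA": {"com": 0.55, "org": 0.15, "tw": 0.01, "is": 0.01},
    "TW": {"tw": 0.60, "cn": 0.08, "hk": 0.04},
    "IS": {"is": 0.90, "com": 0.05},
}


def tld_of(uri):
    """Return the last label of the URI's host name, lower-cased ("" if no host)."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def route(uri, profiles, k=2):
    """Return up to k archive codes whose profile mentions the URI's TLD, best first."""
    tld = tld_of(uri)
    scored = [(profile.get(tld, 0.0), code) for code, profile in profiles.items()]
    scored = [(score, code) for score, code in scored if score > 0]
    # Highest score first; ties broken alphabetically so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [code for _, code in scored[:k]]
```

Instead of broadcasting, an aggregator using this sketch would send a lookup for a `.tw` URI only to the archives returned by `route`, and could fall back to a broadcast when the list is empty.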

Profiling Studies
TPDL 2013
  o 12 archives, March 2013; public web archives used, but techniques apply generally
  o sampling only, no CDX access
IJDL 2014 (to appear)
  o 15 archives (+4, -1), October 2013
  o slightly larger sample URI dataset
  o results similar

URI Lookup = Limited Information
GET /aggr/timegate/ HTTP/1.1
Host: mementoproxy.lanl.gov
Accept-Datetime: Sun, 29 May :46:53 GMT
Accept-Language: fr; q=1.0, en; q=0.5
…
  1. Original URI
  2. Memento-Datetime
  3. Preferred URI
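A lookup like the one above can be assembled with the Python standard library. This is a sketch under assumptions: the `http` scheme, the convention of appending the original URI to the `/aggr/timegate/` path, and the helper name `timegate_request` are not confirmed by the slide. The code only builds the request object and sends nothing.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumed base URL: host and path come from the slide, the scheme is a guess.
AGGREGATOR = "http://mementoproxy.lanl.gov/aggr/timegate/"


def timegate_request(original_uri, when, languages=None):
    """Build (but do not send) a TimeGate lookup for original_uri near `when`.

    `when` must be a timezone-aware datetime in UTC.
    `languages` is an optional list of (language tag, q-value) pairs.
    """
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    if languages:
        headers["Accept-Language"] = ", ".join(
            f"{tag}; q={q}" for tag, q in languages
        )
    return Request(AGGREGATOR + original_uri, headers=headers)


# Example with an arbitrary date chosen for illustration.
req = timegate_request(
    "http://example.com/",
    datetime(2013, 3, 1, 12, 0, 0, tzinfo=timezone.utc),
    languages=[("fr", 1.0), ("en", 0.5)],
)
```

The request carries only the original URI, the desired datetime, and a language preference, which is why the aggregator learns so little about each archive from a single lookup.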

Where to find Mementos for …

Research Question
Problem
  Profile public web archives according to the following dimensions:
  o Top-level domains
  o Languages
  o Growth rate
  o Archival date
Motivation
  Determine who is archiving what
  Optimize query routing for a Memento Aggregator

Web Archives in this Experiment

Archive                        Full text   URI-lookup
Internet Archive                           √
Library of Congress                        √
Icelandic Web Archive                      √
Library and Archives Canada    √           √
British Library                √           √
UK National Library            √           √
Portuguese Web Archive         √           √
Web Archive of Catalonia       √           √
Croatian Web Archive           √           √
Archive of the Czech Web       √           √
National Taiwan University     √           √
Archive It                     √           √

Experiment Set Up
Sample URIs from seven different sources
Retrieve the TimeMap for each URI from all archives
  o A TimeMap lists all Mementos for a given URI
  o A Memento is an archived version of a resource
Analyze who has holdings for which URIs
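The last step, working out who holds what, might look like the sketch below. It assumes TimeMaps have already been fetched as link-format text in which each Memento entry has a `rel` value containing the token `memento`. The function names and the input layout are hypothetical, and the regular expression is a simplification, not a full link-format parser.

```python
import re

# Matches the rel="..." parameter of each link entry (simplified).
REL = re.compile(r'rel="([^"]*)"')


def count_mementos(timemap_text):
    """Count entries whose rel value includes the token "memento"."""
    return sum(
        1 for rel in REL.findall(timemap_text) if "memento" in rel.split()
    )


def holdings(timemaps):
    """Map each sampled URI to the sorted list of archives holding it.

    timemaps: {archive code: {original URI: TimeMap text}}
    """
    result = {}
    for archive, by_uri in timemaps.items():
        for uri, text in by_uri.items():
            result.setdefault(uri, [])
            if count_mementos(text) > 0:
                result[uri].append(archive)
    return {uri: sorted(archives) for uri, archives in result.items()}
```

A URI that no archive holds still appears in the result with an empty list, so coverage percentages can be computed against the whole sample.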

Sampling URIs
Web
  1. DMOZ:Random
  2. DMOZ:TLD - 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs)
  3. DMOZ:Languages - URIs for each language (24 languages)
Web Archives Full Text
  4. Top 1-Gram from Bing
  5. Top 1000 query terms from Yahoo in 9 languages
User requests
  6. IA Wayback Machine log files
  7. Memento aggregator log files

Sampling URIs - DMOZ
  1. DMOZ:Random
     o 10,000 URIs randomly sampled from DMOZ directory (~5M URIs).
  2. DMOZ:TLD - 2% for each TLD from DMOZ or 100 URIs, whichever is greater
     o 52 TLDs: (com 23,470), (de 6,332), (org 4,025), (uk 3,309), (net 2,073), (it 1,775), (jp 1,379), (ru 1,244), (fr 1,154), (pl 1,062), (au 764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319), (cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149), (tw 141), (za 118), (fi 113), (100 URIs for [ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw])
  3. DMOZ:Languages - URIs for each language
     o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish, Russian, Turkish, Ukrainian
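The DMOZ:TLD rule ("2% or 100 URIs, whichever is greater") can be written as a small stratified sampler. The sketch below rests on assumptions: rounding to the nearest integer, capping at the population size when a TLD has fewer than 100 URIs, and the function names are illustrative choices that do not come from the slides.

```python
import random


def tld_sample_size(population, rate=0.02, minimum=100):
    """Sample size for one TLD: rate * population or `minimum`, whichever is greater.

    Capped at the population so a small TLD is simply taken whole (an assumption).
    """
    return min(population, max(minimum, round(population * rate)))


def sample_by_tld(uris_by_tld, seed=0):
    """Draw a random sample of URIs per TLD, without replacement.

    uris_by_tld: {TLD: list of URIs}. A fixed seed keeps the draw repeatable.
    """
    rng = random.Random(seed)
    return {
        tld: rng.sample(uris, tld_sample_size(len(uris)))
        for tld, uris in uris_by_tld.items()
    }
```

Under this rule, any TLD with 5,000 or fewer URIs in the directory contributes exactly 100 URIs, which matches the group of TLDs listed above with 100 URIs each.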

Sampling URIs – Web Archives Full Text
Query the full-text search interface of select web archives with two sets of query terms.
  4. Top 1-Gram from Bing
     o Most are English
  5. Top 1000 query terms from Yahoo in 9 languages
     o Excluding general keywords such as: Obama, Facebook.

Sampling URIs – Web Archives Full Text
[Figure: Yahoo and Bing query terms by language (Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Spanish) against the archives with full-text search (AIT, BL, CAN, CR, CZ, CAT, PO, TW, UK)]


Sampling URIs – User Requests
Sampling from user requests for archived web resources
  6. Sample from IA Wayback Machine log files
     o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26,
  7. Sample from Memento Aggregator log files
     o 100 URIs randomly sampled from LANL Memento Aggregator between 2011 to

Archive Coverage per Sample [figure: entire sample]

TLD Coverage across Archives (1) [figure: entire sample]

TLD Coverage across Archives (2) [figure: entire sample]

TLD Distribution per Archive [figure: DMOZ:TLD sample]

TLD Distribution per Archive [figure: Web Archives Full Text sample]

Language Coverage per Archive [figure: DMOZ sample]

Archive Growth Rate [figure: entire sample]

Query Routing Evaluation

Study Results
Introduced sampling to profile web archives using available infrastructure, no privileged access
Coverage:
  o Internet Archive provides broad coverage
  o National archives have good coverage for their domains
  o Surprising coverage by certain archives
Query Routing:
  o In 84% of the cases, all existing Mementos for a TLD can be found by using IA and two additional top archives for a TLD
  o In 55% of the cases, all existing Mementos for a TLD can be found by using the top 3 archives for a TLD, excluding IA
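One way to read the query-routing figures is as a per-TLD check: pick a few top archives, then ask whether every sampled URI that is archived anywhere is held by at least one of them. The sketch below follows that reading, which is an interpretation and not the study's published procedure. The function names, the ranking by number of URIs held, and the input layout are assumptions.

```python
from collections import Counter


def top_archives(holdings_for_tld, k=3, always=()):
    """Pick k archives for a TLD: those in `always` first, then by URIs held.

    holdings_for_tld: {URI: list of archives holding at least one Memento}
    """
    counts = Counter(
        archive for archives in holdings_for_tld.values() for archive in archives
    )
    # Most URIs held first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda archive: (-counts[archive], archive))
    chosen = list(always) + [a for a in ranked if a not in always]
    return chosen[:k]


def fully_covered(holdings_for_tld, chosen):
    """True if every URI archived somewhere is held by a chosen archive."""
    chosen = set(chosen)
    return all(
        chosen & set(archives)
        for archives in holdings_for_tld.values()
        if archives  # URIs with no Mementos anywhere are ignored
    )
```

Running `fully_covered` for each TLD with `always=("IA",)` and then with IA removed from the holdings would give the two kinds of percentage quoted above, under this interpretation.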

Next Steps With the IIPC
Finding the right granularity
  o too fine:
  o too coarse: .fr
  o just right?: bnf.fr, gallica.bnf.fr,
Generating profiles
  o what are desirable / representative sample sets: domains, languages, regions, etc. -- what's missing?
  o local CDX analysis tools (can help with cold start problem)
Profile format
  o community input (yet another metadata format)
  o github (or other tools) for exchange & integration

A Possible Serialization

{"Profile": {
    "Name": "Taiwan Web Archive",
    "URI": "...",
    "TimeGate": "...",
    "Code": "TW",
    "Age": "Tue, 15 Jul :00:00 GMT",
    "TLD": [{"tw":0.6},{"cn":0.08},{"hk":0.04},
            {"eg":0.04},{"gov":0.04},{"my":0.04},
            {"jp":0.04},{"kr":0.02}],
    "Language": [{"zh-TW":0.5},{"zh-CN":0.25},
                 {"id":0.08},{"ar":0.08}],
    "GrowthRate": [
        {"199707":[4,4]},{"200202":[1,1]},
        {"200607":[30,62]},{"200608":[20,80]},
        {"200609":[5,9]},{"200612":[77,129]},... // other values truncated
        {"201308":[7,94]},{"201309":[2,94]}]
}}
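A consumer of such a profile would probably want the TLD and Language lists as plain lookups. The sketch below parses a trimmed copy of the example (only fields whose values are intact) and flattens the lists of single-key objects into dictionaries. The helper names and the output layout are assumptions, and the GrowthRate field is left out because the slide does not explain its value pairs.

```python
import json

# Trimmed copy of the example profile, limited to fields with complete values.
PROFILE_JSON = """
{"Profile": {
  "Name": "Taiwan Web Archive",
  "Code": "TW",
  "TLD": [{"tw": 0.6}, {"cn": 0.08}, {"hk": 0.04}],
  "Language": [{"zh-TW": 0.5}, {"zh-CN": 0.25}, {"id": 0.08}, {"ar": 0.08}]
}}
"""


def flatten(pairs):
    """Merge a list of single-key objects, e.g. [{"tw": 0.6}, ...], into one dict."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged


def load_profile(text):
    """Parse a serialized profile into a small dict keyed for quick lookups."""
    profile = json.loads(text)["Profile"]
    return {
        "code": profile["Code"],
        "tld": flatten(profile.get("TLD", [])),
        "language": flatten(profile.get("Language", [])),
    }


tw = load_profile(PROFILE_JSON)
```

With the lists flattened, a routing component can ask `tw["tld"].get("jp", 0.0)` without scanning the serialized form each time.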

{Light, Dim, Dark} Archives
Work to date has assumed light archives because our focus has been on sampling archives we don't control
Applicable to a continuum of archives:
  o download/fork and run "dark-sample.py"
  o it accesses sample URIs from IIPC github
  o issues URI lookups to local archive
  o write/update your archive profile in IIPC github with machine readable IP restrictions
  o all profiles -- light/dim/dark -- now available to Memento aggregators and other IIPC analysis tools
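The slides do not show "dark-sample.py", so the following is only a guess at the core of such a script: take a list of sample URIs, ask a locally supplied lookup function how many Mementos the archive holds for each, and emit a profile with a TLD distribution shaped like the serialization example. Every name here (`build_profile`, the `lookup` callable, the rounding to two decimals) is hypothetical.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def build_profile(name, code, sample_uris, lookup):
    """Build a minimal profile from local lookups.

    lookup(uri) is supplied by the archive and returns the number of
    Mementos it holds for uri (0 if none). Hypothetical interface.
    """
    held = [uri for uri in sample_uris if lookup(uri) > 0]
    tlds = Counter(
        (urlsplit(uri).hostname or "").rsplit(".", 1)[-1] for uri in held
    )
    total = sum(tlds.values())
    tld_list = (
        [{tld: round(count / total, 2)} for tld, count in tlds.most_common()]
        if total
        else []
    )
    return {"Profile": {"Name": name, "Code": code, "TLD": tld_list}}


# Example with a stand-in for a real archive index.
local_index = {"http://a.example.tw/": 3, "http://b.example.tw/": 1, "http://c.example.cn/": 2}
sample = list(local_index) + ["http://d.example.fr/"]
profile = build_profile("Example Archive", "EX", sample, lambda uri: local_index.get(uri, 0))
serialized = json.dumps(profile)
```

Because only the lookup function touches the archive, a dark or dim archive could run this locally and publish just the resulting profile.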

Profiles = Easy Discovery, Sharing