TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS) SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY COMPUTER SCIENCE WADL 2013.

Presentation transcript:

TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS) SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY COMPUTER SCIENCE WADL 2013 JULY 25 – 26, 2013 INDIANAPOLIS, INDIANA USA

Joint Conference on Digital Libraries (JCDL) 2013
CONTENTS
- Motivation
- Related work
- Preliminary work
- Temporal Spread
- Future work
- Conclusion
7/26/13 Scott G. Ainsworth, Michael L. Nelson

A FABLE FROM WAYBACK

TEMPORAL SPREAD
Figure: composite memento with root at :36:08 and embedded resources at +9 days, +18 days, +7 months, and +2.1 years.

QUESTIONS
- How much temporal spread exists in composite mementos?
- How can temporal spread be minimized?
- What factors contribute, positively or negatively, to spread?
- Does combining multiple archives produce better results?
- Would users with differing goals benefit from different minimization policies and heuristics?
- How can temporal coherence be displayed to users simply?

CONTENTS: Motivation; Related work; Preliminary work; Temporal Spread; Future work; Conclusion

RELATED WORK
Control Crawl
- Data Quality, Future collections
- Spaniol et al. – crawling strategy
- Denev et al. – change rates by MIME type and depth
- Ben Saad et al. – metadata from crawl used to select best results from archive
Our Focus: Existing Data Quality
- Existing collections
- Datetime selection policies

RELATED WORK
Use Patterns
- AlNoamony et al. – Archive Access Patterns
  - Humans vs. Robots
  - Dip, dive, slide, & skim
Identifying Duplicates
- Simple identity – images, other binary formats
  - Direct comparison
  - Hash comparison
- HTML, CSS (text)
  - Shingling, Jaccard distances, etc.
  - SimHash shows the most promise
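The shingling and Jaccard distance approach named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the word-level shingles, and the shingle size `k = 3` are all assumptions chosen for the example.

```python
from __future__ import annotations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.split()
    if len(words) < k:
        # Short texts yield a single shingle holding every word (or nothing).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """Jaccard distance between shingle sets: 0.0 = identical, 1.0 = disjoint."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    # Hypothetical page text, invented for illustration only.
    v1 = "Department of Computer Science news and events"
    v2 = "Department of Computer Science news and alumni"
    print(jaccard_distance(v1, v2))
```

A small distance suggests two text mementos are near-duplicates even when the archive has rewritten parts of the markup, which is why set-based similarity is considered for HTML and CSS instead of direct hash comparison.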

RELATED WORK – MEMENTO
HTTP extension for datetime negotiation
Request:
    GET / HTTP/1.1
    …
    Accept-Datetime: Sat, 10 May :21:00 GMT
    …
Response:
    HTTP/ OK
    …
    Memento-Datetime: Sat, 14 May :36:08 GMT
    …
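The two headers on this slide carry HTTP-date values, so a client can build and read them with the Python standard library. The sketch below is illustrative only: the helper names are hypothetical, no network request is made, and the datetime used is an arbitrary example, not the one from the slide.

```python
from __future__ import annotations

from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict[str, str]:
    """Build the request header asking an archive for a memento near target."""
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware")
    # HTTP-dates are expressed in GMT, so normalize to UTC first.
    value = format_datetime(target.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": value}


def memento_datetime(response_headers: dict[str, str]) -> datetime:
    """Parse the Memento-Datetime response header into an aware datetime."""
    return parsedate_to_datetime(response_headers["Memento-Datetime"])


if __name__ == "__main__":
    # Arbitrary example datetime (assumption, not taken from the slide).
    target = datetime(2010, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    request_headers = accept_datetime_header(target)
    print(request_headers)
    # Pretend the archive answered with a memento captured four days later.
    response_headers = {"Memento-Datetime": "Wed, 06 Jan 2010 03:04:05 GMT"}
    print(memento_datetime(response_headers) - target)
```

The difference between the returned `Memento-Datetime` and the requested `Accept-Datetime` is the quantity the later drift and spread slides measure.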

CONTENTS: Motivation; Related work; Preliminary work (How much of the Web is archived; Temporal Drift); Temporal Spread; Future work; Conclusion

HOW MUCH IS ARCHIVED? (JCDL'11)
Share of URIs    Archived copies
– 90%            At least one archived copy
17 – 49%         2 – 5 copies
1 – 8%           6 – 10 copies
8 – 63%          > 10 copies
Chart legend: Internet Archive, Search Engine, Other

CONTENTS: Motivation; Related work; Preliminary work (How much of the Web is archived; Temporal Drift); Temporal Spread; Future work; Conclusion

TEMPORAL DRIFT
Comparing two policies
- Sliding – target datetime changes
- Sticky – target datetime held steady
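The difference between the two policies can be sketched with a small simulation. Everything here is an assumption made for illustration: the function names, the nearest-memento selection rule, and the capture datetimes are hypothetical and do not come from the study.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def nearest(mementos: list[datetime], target: datetime) -> datetime:
    """Pick the memento whose capture datetime is closest to target."""
    return min(mementos, key=lambda m: abs(m - target))


def walk(pages: list[list[datetime]], start: datetime, sticky: bool) -> list[timedelta]:
    """Follow a chain of pages and return the drift from start at each step.

    sticky=True  -> the target stays at the original datetime.
    sticky=False -> the target slides to the datetime of the last memento shown.
    """
    target = start
    drifts = []
    for mementos in pages:
        chosen = nearest(mementos, target)
        drifts.append(abs(chosen - start))
        if not sticky:
            target = chosen  # sliding policy: next request uses this datetime
    return drifts


if __name__ == "__main__":
    start = datetime(2010, 1, 1)
    day = timedelta(days=1)
    # Hypothetical capture datetimes for two pages visited in sequence.
    pages = [[start + 10 * day], [start + 3 * day, start + 16 * day]]
    print("sliding:", walk(pages, start, sticky=False))
    print("sticky: ", walk(pages, start, sticky=True))
```

In this toy walk the sliding policy moves further from the original datetime on the second step because the first memento shifted the target, while the sticky policy keeps selecting against the original datetime.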

SLIDING TARGET
Figures: three successive steps, with memento datetimes :36:08, :17:52, and :16:10.

TEMPORAL DRIFT
- What we expected: 01:36:08
- What we got: 09:16:10

STICKY TARGET
What if the target is held steady? (Enabled by Memento API)

STICKY TARGET
Figures (MementoFox Extension): three successive steps, with memento datetimes :36:08, :17:52, and :36:08.

DRIFT COMPARISON
Page           Sliding datetime   Sliding drift        Sticky datetime   Sticky drift
CS Home        :36:08             –                    :36:08            –
Science Home   :17:               days                 :17:              days
CS Home        :16:               days (+21.6 days)    :36:08            –
Mean                              32.9 days                              11.0 days

MEDIAN DRIFT BY STEP (JCDL'13)
Chart: median drift (months) by step number, sliding vs. sticky.

CONTENTS: Motivation; Related work; Preliminary work (How much of the Web is archived; Temporal Drift); Temporal Spread; Future work; Conclusion

TEMPORAL SPREAD

COMPOSITE MEMENTO
Figure panels: Presentation, Structure

TEMPORAL SPREAD
Figure: composite memento with root at :36:08 and embedded resources at +9 days, +18 days, +7 months, and +2.1 years.

EMBEDDED RESOURCES
Resource                          Memento-Datetime   Delta
(root)                            01:36:08
mm_menu.js                        :39:12             9.0 d
style.css                         :39:39             9.0 d
gfx-logo-odu-crown.gif            :39:39             9.0 d
ddmenu_ddown.js                   :39:43             9.0 d
university.js                     :39:56             9.0 d
rmenu_1st_about.png               :40:               d
rmenu_bottom_229.gif              :07:               d
shadow-bl.gif                     :55:               d
ecsbdg.jpg                        :56:               d
shadow-br.gif                     :18:               d
gfx-btn-go-dblue.gif              :34:               d
shadow-tr.gif                     :55:               d
header-right1.gif                 :06:               d
spacer.gif                        :23:               d
jimcheng.gif                      :37:               d
jsmith.gif                        :58:               d
rmenu_1st_featured_alumni.png     :21:               d
hmenu_college_...-new.png         :14:25             7.3 mo
rmenu_1st_upcoming_news.png       :15:14             7.3 mo
rmenu_1st_upcoming_events.png     :01:12             7.3 mo
lmenu_1st_resources.png           :47:41             7.5 mo
bullet_blue_triangle.gif          :43:48             7.5 mo
logo-cs.gif                       :54:29             7.5 mo
rmenu_1st_featured_student.png    :36:07             2.1 years
shadow-b.gif                      :35:17             2.1 years
shadow-r.gif                      404 Not Found

Summary
Embedded Resources    26
Mean Delta            125.9 days
Standard Deviation    207.7 days
Spread                2.1 years
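The summary statistics on this slide can be computed from per-resource deltas roughly as follows. This is a sketch under stated assumptions: deltas are signed offsets in days from the root memento, spread is taken as the range of capture datetimes including the root (offset 0), and the sample standard deviation is used. The slide does not say which standard deviation was used, and the input values below are hypothetical, not the slide's data.

```python
from __future__ import annotations

import statistics


def spread_summary(deltas_days: list[float]) -> dict[str, float]:
    """Summarize embedded-resource deltas (signed days relative to the root)."""
    if not deltas_days:
        raise ValueError("at least one embedded resource delta is required")
    magnitudes = [abs(d) for d in deltas_days]
    # The root sits at offset 0, so include it when measuring the range.
    offsets = [0.0, *deltas_days]
    return {
        "count": float(len(deltas_days)),
        "mean_delta": statistics.mean(magnitudes),
        # Sample standard deviation (assumption); needs two or more values.
        "std_dev": statistics.stdev(magnitudes) if len(magnitudes) > 1 else 0.0,
        "spread": max(offsets) - min(offsets),
    }


if __name__ == "__main__":
    # Hypothetical deltas in days, invented for illustration.
    print(spread_summary([9.0, 9.0, 18.0, 210.0]))
```

Because spread is a range, one very old or very new embedded resource dominates it, while the mean delta reflects how far the typical embedded resource sits from the root.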

REPRESENTING SPREAD
Composite memento temporal spread chart. Legend: Root, Embedded, Diff. Domain, Reused

TEMPORAL SPREAD – ODU CS

FIRST EXPERIMENT
- 1,000 URIs from DMOZ (Open Directory)
- Download all timemaps
- Download all composite mementos
- Download all embedded resources
- Single and Multiple Archives
- Four Heuristics

PRELIMINARY RESULTS
Count        Description                         Percent
1,000        Root URI-Rs
910          Root timemaps                       91%
87,847       Root URI-Ms in timemaps
96.5         URI-Ms per Root URI-R
85,570       Root memento downloaded             97%
1,488,420    Embedded URI-Rs
17.4         Embedded URI-Rs per Root memento
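The percentages and per-item averages in this table follow from the raw counts. The short check below reproduces them; the variable and function names are illustrative, and the choice of denominators is inferred from the row labels.

```python
from __future__ import annotations

# Raw counts as reported on the slide.
ROOT_URI_RS = 1_000
ROOT_TIMEMAPS = 910
ROOT_URI_MS = 87_847
ROOT_MEMENTOS_DOWNLOADED = 85_570
EMBEDDED_URI_RS = 1_488_420


def derived_metrics() -> dict[str, float]:
    """Recompute the derived rows of the preliminary results table."""
    return {
        # Share of root URI-Rs that had a timemap.
        "timemap_pct": 100 * ROOT_TIMEMAPS / ROOT_URI_RS,
        # Mean number of URI-Ms per root URI-R that had a timemap.
        "uri_ms_per_root": ROOT_URI_MS / ROOT_TIMEMAPS,
        # Share of listed root mementos that were downloaded.
        "downloaded_pct": 100 * ROOT_MEMENTOS_DOWNLOADED / ROOT_URI_MS,
        # Mean number of embedded URI-Rs per downloaded root memento.
        "embedded_per_root": EMBEDDED_URI_RS / ROOT_MEMENTOS_DOWNLOADED,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.1f}")
```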

SINGLE/MULTI & HEURISTICS
Description                    Minimize Distance,   Minimize Distance,   3-Month Window,
                               Single Archive       Multi-Archive        Multi-Archive
Embedded URI-Rs                1,488,440            1,488,420            1,447,351
Embedded URI-Ms in timemaps    1,169,787            1,186,456            500,541
URI-M/Embedded URI-R
% Complete                     73.8%                75.4%                33.8%
Mean spread
Standard Deviation

TEMPORAL COHERENCE
Patterns illustrated (figure slides):
- 1 Memento, Bracketed Root
- 1 Memento, Root Not Bracketed
- 1 Memento, No Last-Modified
- 1 Memento, Before Root
- 2 Mementos, Root Not Bracketed
- 2 Mementos, Use Content – Similarity
- 2 Mementos, Contents Equal or Equivalent
- 2 Mementos, Contents Not Equal or Equivalent

CURRENT EXPERIMENT
- 4,000 URIs from JCDL'11 "How Much…" paper
- 1 URI/month vice all
- Temporal coherence patterns
- Target WSDM

CURRENT EXPERIMENT

CONTENTS: Motivation; Related work; Preliminary work; Temporal Spread; Future work; Conclusion

FUTURE WORK
Timemaps, Redirection, Missing Mementos
- Timemaps only tell part of the story
  - URI-R redirection (302 from source)
  - URI-M redirection (Archive action)
  - Mementos in timemaps but not accessible
- Policies must consider user needs
  - Leave it missing
  - Show "best" substitute

FUTURE WORK
Similarity & Duplication
- Deltas are currently | root – embedded |
- If bracketing mementos are identical, should delta be zero?
- HTML is usually modified by the archive
  - Can't check for equality
  - Shingling? SimHash?
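One possible reading of the bracketing question is sketched below. It is a hypothetical policy, not a result from the presentation: if the embedded mementos captured just before and just after the root are byte-identical, the resource is assumed unchanged at root time and its delta is treated as zero; otherwise the current | root – embedded | definition applies to the nearer memento. Byte equality only suits unmodified binary content such as images; as the slide notes, archived HTML is usually rewritten and would need a similarity measure instead.

```python
from __future__ import annotations

import hashlib
from datetime import datetime, timedelta

# (capture datetime, raw content bytes) for one embedded memento.
Capture = tuple[datetime, bytes]


def delta(root: datetime, embedded: datetime) -> timedelta:
    """Current definition from the slide: | root - embedded |."""
    return abs(root - embedded)


def bracketed_delta(root: datetime, before: Capture, after: Capture) -> timedelta:
    """Delta for an embedded resource whose mementos bracket the root.

    Hypothetical policy: identical content on both sides implies zero delta.
    """
    (t_before, body_before), (t_after, body_after) = before, after
    if not t_before <= root <= t_after:
        raise ValueError("mementos do not bracket the root datetime")
    same = hashlib.sha256(body_before).digest() == hashlib.sha256(body_after).digest()
    if same:
        return timedelta(0)
    # Content changed somewhere in the interval: fall back to the nearer memento.
    return min(delta(root, t_before), delta(root, t_after))


if __name__ == "__main__":
    # Hypothetical datetimes and content, invented for illustration.
    root = datetime(2010, 1, 31)
    before = (datetime(2010, 1, 1), b"logo-bytes")
    after = (datetime(2010, 3, 2), b"logo-bytes")
    print(bracketed_delta(root, before, after))
```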

FUTURE WORK
Communicating Status

FUTURE WORK
Policies & Heuristics
- Current Spread Heuristics
  - Minimize distance
  - Past only
  - Past preferred
  - Near or within distance
  - Single vs. multi-archive
- Refine to meet user expectations
  - Speed (minimize time)
  - Accuracy (minimize temporal error)
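The heuristics named on this slide can be sketched as memento selection policies. The slide only names them, so the semantics below are assumptions: "minimize distance" picks the nearest capture on either side, "past only" considers captures at or before the target, "past preferred" falls back to future captures only when no past capture exists, and an optional `window` approximates "near or within distance". Function and parameter names are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Optional


def select_memento(
    mementos: list[datetime],
    target: datetime,
    policy: str = "minimize_distance",
    window: Optional[timedelta] = None,
) -> Optional[datetime]:
    """Choose a memento datetime for target under the given policy.

    Returns None when the policy leaves no acceptable candidate.
    """
    past = [m for m in mementos if m <= target]
    if policy == "minimize_distance":
        candidates = list(mementos)
    elif policy == "past_only":
        candidates = past
    elif policy == "past_preferred":
        candidates = past or list(mementos)
    else:
        raise ValueError(f"unknown policy: {policy}")
    if window is not None:
        # Keep only captures within the allowed distance of the target.
        candidates = [m for m in candidates if abs(m - target) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda m: abs(m - target))


if __name__ == "__main__":
    # Hypothetical capture datetimes, invented for illustration.
    target = datetime(2010, 6, 15)
    captures = [datetime(2010, 6, 1), datetime(2010, 6, 20)]
    for policy in ("minimize_distance", "past_only", "past_preferred"):
        print(policy, select_memento(captures, target, policy))
```

Pooling captures from several archives into `mementos` before calling the function is one way to model the single versus multi-archive comparison.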

CONTENTS: Motivation; Related work; Preliminary work; Future work; Conclusion

CONCLUSION
- Extensive research on improving acquisition exists
- Best use of existing collections needs study
- We are looking at
  - Characterizing existing holdings
  - Characterizing temporal coherence
  - Policies that minimize impact of temporal incoherence
  - Visualizations of temporal coherence

MY QUESTIONS
- Figure: Coherent
- Figure: Violation
- What do these mean to users? (figure panels (1), (2), (3), (4))
- What does this mean to users?