Published by Jody Farmer. Modified over 9 years ago.
1
TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS)
SCOTT G. AINSWORTH, MICHAEL L. NELSON
OLD DOMINION UNIVERSITY, COMPUTER SCIENCE
WADL 2013, JULY 25–26, 2013
INDIANAPOLIS, INDIANA, USA
2
Joint Conference on Digital Libraries (JCDL) 2013
CONTENTS
- Motivation
- Related work
- Preliminary work
- Temporal Spread
- Future work
- Conclusion
7/26/13, Scott G. Ainsworth, Michael L. Nelson
3
A FABLE FROM WAYBACK
4
TEMPORAL SPREAD
[Figure: composite memento dated 2005-05-14 01:36:08, with parts labeled +9 days, +18 days, +7 months, and +2.1 years]
5
QUESTIONS
- How much temporal spread exists in composite mementos?
- How can temporal spread be minimized?
- What factors contribute, positively or negatively, to spread?
- Does combining multiple archives produce better results?
- Would users with differing goals benefit from different minimization policies and heuristics?
- How can temporal coherence be displayed to users simply?
6
CONTENTS
- Motivation
- Related work
- Preliminary work
- Temporal Spread
- Future work
- Conclusion
7
RELATED WORK
Control Crawl
- Data Quality, Future collections
- Spaniol et al. – crawling strategy
- Denev et al. – change rates by MIME type and depth
- Ben Saad et al. – metadata from crawl used to select best results from archive
Our Focus: Existing Data Quality
- Existing collections
- Datetime selection policies
8
RELATED WORK
Use Patterns
- AlNoamany et al. – Archive Access Patterns
  - Humans vs. Robots
  - Dip, dive, slide, & skim
Identifying Duplicates
- Simple identity – images, other binary formats
  - Direct comparison
  - Hash comparison
- HTML, CSS (text)
  - Shingling, Jaccard distances, etc.
  - SimHash shows the most promise
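The shingling and Jaccard-distance idea named on this slide can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names, the word-level tokenization, and the shingle size `k = 3` are all assumptions.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """Jaccard distance between two texts' shingle sets (0.0 means identical sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


# Two near-duplicate sentences share two of four distinct shingles.
print(jaccard_distance("the quick brown fox jumps", "the quick brown fox sleeps"))
```

A distance near 0 suggests two text mementos are effectively the same, which matters for HTML and CSS where a byte-for-byte comparison is too strict.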
9
RELATED WORK – MEMENTO*
HTTP extension for datetime negotiation
Request:
    GET /http://www.cs.odu.edu/ HTTP/1.1
    …
    Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
    …
Response:
    HTTP/1.1 200 OK
    …
    Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
    …
* https://datatracker.ietf.org/doc/draft-vandesompel-memento/
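As a rough illustration of the two headers in this exchange, the sketch below builds an `Accept-Datetime` value and parses a `Memento-Datetime` value using only the Python standard library. No request is sent. The helper names are hypothetical, and the weekday in the output is derived from the date by the library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict[str, str]:
    """Build an Accept-Datetime request header (HTTP-date in GMT) for a target datetime."""
    utc_target = target.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_target, usegmt=True)}


def memento_datetime(headers: dict[str, str]) -> datetime:
    """Parse the Memento-Datetime response header into an aware datetime."""
    return parsedate_to_datetime(headers["Memento-Datetime"])


# Datetimes taken from the slide's request/response example.
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
request_headers = accept_datetime_header(target)
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}

# Distance between what was asked for and what the archive returned.
distance = memento_datetime(response_headers) - target
print(request_headers["Accept-Datetime"], "->", distance)
```

The difference between the requested and returned datetimes is the quantity that the later drift and spread slides measure.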
10
CONTENTS
- Motivation
- Related work
- Preliminary work
  - How much of the Web is archived
  - Temporal Drift
- Temporal Spread
- Future work
- Conclusion
11
HOW MUCH IS ARCHIVED? (JCDL’11)
- 35–90%: at least one archived copy
- 17–49%: 2–5 copies
- 1–8%: 6–10 copies
- 8–63%: more than 10 copies
[Figure legend: Internet Archive, Search Engine, Other]
12
CONTENTS
- Motivation
- Related work
- Preliminary work
  - How much of the Web is archived
  - Temporal Drift
- Temporal Spread
- Future work
- Conclusion
13
TEMPORAL DRIFT
Comparing two policies
- Sliding – target datetime changes
- Sticky – target datetime held steady
14
SLIDING TARGET
2005-05-14 01:36:08
15
SLIDING TARGET
2005-04-22 00:17:52
16
SLIDING TARGET
2005-03-31 09:16:10
17
TEMPORAL DRIFT
WHAT WE EXPECTED: 2005-05-14 @ 01:36:08
WHAT WE GOT: 2005-03-31 @ 09:16:10
18
STICKY TARGET
What if the target is held steady? (Enabled by Memento API)
19
STICKY TARGET
MementoFox Extension: 2005-05-14
2005-05-14 01:36:08
20
STICKY TARGET
2005-04-22 00:17:52
21
STICKY TARGET
2005-05-14 01:36:08
22
DRIFT COMPARISON

| Page | Sliding: Datetime | Sliding: Drift | Sticky: Datetime | Sticky: Drift |
|---|---|---|---|---|
| CS Home | 2005-05-14 01:36:08 | – | 2005-05-14 01:36:08 | – |
| Science Home | 2005-04-22 00:17:52 | 22.1 days | 2005-04-22 00:17:52 | 22.1 days |
| CS Home | 2005-03-31 09:16:10 | 43.7 days (+21.6 days) | 2005-05-14 01:36:08 | – |
| Mean | | 32.9 days | | 11.0 days |
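The drift figures in this comparison can be recomputed from the listed datetimes. The sketch below assumes that drift is the absolute distance from the original target (the first CS Home memento) and that each mean is taken over the two steps after the first page; both assumptions reproduce the slide's numbers, but the slide does not state them. The helper name `drift_days` is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86400.0


def drift_days(target: str, observed: list[str]) -> list[float]:
    """Absolute distance, in days, of each observed Memento-Datetime from the target."""
    t = datetime.strptime(target, FMT)
    return [
        abs((datetime.strptime(o, FMT) - t).total_seconds()) / SECONDS_PER_DAY
        for o in observed
    ]


target = "2005-05-14 01:36:08"  # Memento-Datetime of the first page (CS Home)

# Mementos reached on steps 2 and 3 under each policy, from the table.
sliding = ["2005-04-22 00:17:52", "2005-03-31 09:16:10"]
sticky = ["2005-04-22 00:17:52", "2005-05-14 01:36:08"]

sliding_drift = drift_days(target, sliding)
sticky_drift = drift_days(target, sticky)
mean_sliding = sum(sliding_drift) / len(sliding_drift)
mean_sticky = sum(sticky_drift) / len(sticky_drift)

print(f"sliding: {mean_sliding:.1f} days, sticky: {mean_sticky:.1f} days")
```

Under the sticky policy the walk returns to the original CS Home memento, so its third step contributes no drift and the mean falls to roughly a third of the sliding value.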
23
MEDIAN DRIFT BY STEP (JCDL’13)
[Figure: median drift (months) by step number, for the Sliding and Sticky policies]
24
CONTENTS
- Motivation
- Related work
- Preliminary work
  - How much of the Web is archived
  - Temporal Drift
- Temporal Spread
- Future work
- Conclusion
25
TEMPORAL SPREAD
26
COMPOSITE MEMENTO
[Figure panels: PRESENTATION, STRUCTURE]
27
TEMPORAL SPREAD
[Figure: composite memento dated 2005-05-14 01:36:08, with parts labeled +9 days, +18 days, +7 months, and +2.1 years]
28
EMBEDDED RESOURCES

| Resource | Memento-Datetime | Delta |
|---|---|---|
| http://www.cs.odu.edu | 2005-05-14 01:36:08 | |
| mm_menu.js | 2005-05-23 02:39:12 | 9.0 d |
| style.css | 2005-05-23 02:39:39 | 9.0 d |
| gfx-logo-odu-crown.gif | 2005-05-23 02:39:39 | 9.0 d |
| ddmenu_ddown.js | 2005-05-23 02:39:43 | 9.0 d |
| university.js | 2005-05-23 02:39:56 | 9.0 d |
| rmenu_1st_about.png | 2005-06-01 13:40:25 | 18.5 d |
| rmenu_bottom_229.gif | 2005-06-01 14:07:29 | 18.5 d |
| shadow-bl.gif | 2005-06-01 14:55:53 | 18.6 d |
| ecsbdg.jpg | 2005-06-01 14:56:17 | 18.6 d |
| shadow-br.gif | 2005-06-01 15:18:18 | 18.6 d |
| gfx-btn-go-dblue.gif | 2005-06-01 15:34:19 | 18.6 d |
| shadow-tr.gif | 2005-06-01 15:55:57 | 18.6 d |
| header-right1.gif | 2005-06-01 16:06:16 | 18.6 d |
| spacer.gif | 2005-06-01 16:23:10 | 18.6 d |
| jimcheng.gif | 2005-06-01 16:37:39 | 18.6 d |
| jsmith.gif | 2005-06-01 16:58:50 | 18.6 d |
| rmenu_1st_featured_alumni.png | 2005-06-01 21:21:45 | 18.8 d |
| hmenu_college_...-new.png | 2005-12-21 20:14:25 | 7.3 mo |
| rmenu_1st_upcoming_news.png | 2005-12-21 20:15:14 | 7.3 mo |
| rmenu_1st_upcoming_events.png | 2005-12-21 21:01:12 | 7.3 mo |
| lmenu_1st_resources.png | 2005-12-28 17:47:41 | 7.5 mo |
| bullet_blue_triangle.gif | 2005-12-28 19:43:48 | 7.5 mo |
| logo-cs.gif | 2005-12-28 19:54:29 | 7.5 mo |
| rmenu_1st_featured_student.png | 2007-06-12 02:36:07 | 2.1 years |
| shadow-b.gif | 2007-06-21 02:35:17 | 2.1 years |
| shadow-r.gif | 404 Not Found | |

| Summary | Value |
|---|---|
| Embedded Resources | 26 |
| Mean Delta | 125.9 days |
| Standard Deviation | 207.7 days |
| Spread | 2.1 years |
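The summary statistics for this composite memento can be recomputed from the Memento-Datetimes listed on the slide. The sketch below rests on several assumptions that the slide does not state: each delta is the embedded datetime minus the root datetime, the resource that returned 404 is excluded, the standard deviation is the population form, and spread is the span from the earliest to the latest memento in the composite (root included).

```python
from datetime import datetime
from statistics import mean, pstdev

FMT = "%Y-%m-%d %H:%M:%S"
ROOT = datetime.strptime("2005-05-14 01:36:08", FMT)

# Memento-Datetimes of the 25 embedded resources that were found (shadow-r.gif was 404).
EMBEDDED = [
    "2005-05-23 02:39:12", "2005-05-23 02:39:39", "2005-05-23 02:39:39",
    "2005-05-23 02:39:43", "2005-05-23 02:39:56", "2005-06-01 13:40:25",
    "2005-06-01 14:07:29", "2005-06-01 14:55:53", "2005-06-01 14:56:17",
    "2005-06-01 15:18:18", "2005-06-01 15:34:19", "2005-06-01 15:55:57",
    "2005-06-01 16:06:16", "2005-06-01 16:23:10", "2005-06-01 16:37:39",
    "2005-06-01 16:58:50", "2005-06-01 21:21:45", "2005-12-21 20:14:25",
    "2005-12-21 20:15:14", "2005-12-21 21:01:12", "2005-12-28 17:47:41",
    "2005-12-28 19:43:48", "2005-12-28 19:54:29", "2007-06-12 02:36:07",
    "2007-06-21 02:35:17",
]

# Signed delta of each embedded memento from the root, in days.
deltas = [(datetime.strptime(s, FMT) - ROOT).total_seconds() / 86400 for s in EMBEDDED]

mean_delta = mean(deltas)
std_delta = pstdev(deltas)  # population standard deviation (assumption)

# Spread: earliest-to-latest span across root (delta 0) and embedded mementos.
all_offsets = deltas + [0.0]
spread_years = (max(all_offsets) - min(all_offsets)) / 365.25

print(f"mean {mean_delta:.1f} d, std {std_delta:.1f} d, spread {spread_years:.1f} y")
```

With these choices the mean and spread round to the slide's 125.9 days and 2.1 years, and the standard deviation comes out close to the reported 207.7 days. The sample standard deviation would be noticeably larger, which is why the population form is assumed here.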
29
REPRESENTING SPREAD
COMPOSITE MEMENTO TEMPORAL SPREAD CHART
[Figure legend: Root, Embedded, Diff. Domain, Reused]
30
TEMPORAL SPREAD – ODU CS
31
FIRST EXPERIMENT
- 1,000 URIs from DMOZ (Open Directory)
- Download all timemaps
- Download all composite mementos
- Download all embedded resources
- Single and Multiple Archives
- Four Heuristics
32
PRELIMINARY RESULTS

| Count | Description | Percent |
|---|---|---|
| 1,000 | Root URI-Rs | |
| 910 | Root timemaps | 91% |
| 87,847 | Root URI-Ms in timemaps | |
| 96.5 | URI-Ms per Root URI-R | |
| 85,570 | Root mementos downloaded | 97% |
| 1,488,420 | Embedded URI-Rs | |
| 17.4 | Embedded URI-Rs per Root memento | |
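The slide does not say which denominators the derived rows use. The short check below takes the counts from the table and tries the denominators that appear to reproduce the reported figures; the variable names are illustrative only.

```python
# Counts copied from the preliminary results table.
root_uri_rs = 1_000
root_timemaps = 910
root_uri_ms = 87_847
root_mementos_downloaded = 85_570
embedded_uri_rs = 1_488_420

# Share of root URI-Rs that had a timemap.
timemap_pct = 100 * root_timemaps / root_uri_rs

# URI-Ms per root URI-R, counting only URI-Rs that have a timemap (assumption).
uri_ms_per_root = root_uri_ms / root_timemaps

# Share of listed root URI-Ms that were actually downloaded.
downloaded_pct = 100 * root_mementos_downloaded / root_uri_ms

# Embedded URI-Rs per downloaded root memento.
embedded_per_memento = embedded_uri_rs / root_mementos_downloaded

print(f"{timemap_pct:.0f}% {uri_ms_per_root:.1f} {downloaded_pct:.0f}% {embedded_per_memento:.1f}")
```

The per-URI-R average only matches when divided by the 910 URI-Rs with timemaps rather than by all 1,000, which suggests the 90 URI-Rs without timemaps are left out of that row.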
33
SINGLE/MULTI & HEURISTICS

| Description | Minimize Distance, Single Archive | Minimize Distance, Multi-Archive | 3-Month Window, Multi-Archive |
|---|---|---|---|
| Embedded URI-Rs | 1,488,440 | 1,488,420 | 1,447,351 |
| Embedded URI-Ms in timemaps | 1,169,787 | 1,186,456 | 500,541 |
| URI-M/Embedded URI-R | 0.79 | 0.80 | 0.35 |
| % Complete | 73.8% | 75.4% | 33.8% |
| Mean spread | 200.2 | 200.1 | 15.1 |
| Standard Deviation | 219.2 | 219.9 | 14.3 |
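One plausible reading of the two heuristics compared here is sketched below. The function names are hypothetical, and treating the 3-month window as 90 days on either side of the root datetime is an assumption, not something the slide states.

```python
from datetime import datetime, timedelta
from typing import Optional


def minimize_distance(target: datetime, mementos: list[datetime]) -> Optional[datetime]:
    """Pick the memento closest to the target datetime, whether past or future."""
    return min(mementos, key=lambda m: abs(m - target), default=None)


def within_window(
    target: datetime,
    mementos: list[datetime],
    window: timedelta = timedelta(days=90),  # assumed meaning of "3-month window"
) -> Optional[datetime]:
    """Like minimize_distance, but leave the resource missing if the best match is too far away."""
    best = minimize_distance(target, mementos)
    if best is None or abs(best - target) > window:
        return None
    return best


root = datetime(2005, 5, 14, 1, 36, 8)
only_copy = [datetime(2007, 6, 21, 2, 35, 17)]  # e.g. a resource archived only years later

print(minimize_distance(root, only_copy))  # accepted despite the distance
print(within_window(root, only_copy))      # rejected, so the resource stays missing
```

Under this reading the trade-off in the table follows directly: a window rejects distant mementos, which lowers completeness but also bounds the spread of whatever remains.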
34
TEMPORAL COHERENCE
1 Memento, Bracketed Root
35
TEMPORAL COHERENCE
1 Memento, Bracketed Root
36
TEMPORAL COHERENCE
1 Memento, Bracketed Root
37
TEMPORAL COHERENCE
1 Memento, Root Not Bracketed
38
TEMPORAL COHERENCE
1 Memento, Root Not Bracketed
39
TEMPORAL COHERENCE
1 Memento, No Last-Modified
40
TEMPORAL COHERENCE
1 Memento, Before Root
41
TEMPORAL COHERENCE
2 Mementos, Root Not Bracketed
42
TEMPORAL COHERENCE
2 Mementos, Root Not Bracketed
43
TEMPORAL COHERENCE
2 Mementos, Use Content Similarity
44
TEMPORAL COHERENCE
2 Mementos, Contents Equal or Equivalent
45
TEMPORAL COHERENCE
2 Mementos, Contents Not Equal or Equivalent
46
CURRENT EXPERIMENT
- 4,000 URIs from JCDL’11 “How Much…” paper
- 1 URI/month instead of all
- Temporal coherence patterns
- Target WSDM 2013
47
CURRENT EXPERIMENT
48
CONTENTS
- Motivation
- Related work
- Preliminary work
- Temporal Spread
- Future work
- Conclusion
49
FUTURE WORK
Timemaps, Redirection, Missing Mementos
- Timemaps only tell part of the story
  - URI-R redirection (302 from source)
  - URI-M redirection (Archive action)
  - Mementos in timemaps but not accessible
- Policies must consider user needs
  - Leave it missing
  - Show “best” substitute
50
FUTURE WORK
Similarity & Duplication
- Deltas are currently |root – embedded|
- If bracketing mementos are identical, should delta be zero?
- HTML is usually modified by the archive
  - Can’t check for equality
  - Shingling? SimHash?
[Figure: timeline marked –30d, 0, +30d]
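Because archives rewrite HTML, an exact equality check on bracketing mementos is unreliable, and a fingerprint such as SimHash is one candidate for deciding whether they are near-identical. The sketch below is a generic word-level SimHash with a hypothetical zero-delta rule layered on top. The MD5 token hash, the 64-bit width, the 3-bit threshold, and the function names are all assumptions, not the authors' method.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Word-level SimHash: each token votes +1 or -1 on every bit of the fingerprint."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes outnumber the negative.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def effective_delta_days(delta_days: float, before: str, after: str, max_bits: int = 3) -> float:
    """Hypothetical rule: report zero delta when the bracketing mementos are near-identical."""
    if hamming(simhash(before), simhash(after)) <= max_bits:
        return 0.0
    return abs(delta_days)


# Identical bracketing content 30 days either side of the root yields a delta of zero.
print(effective_delta_days(-30.0, "body { color: navy }", "body { color: navy }"))
```

Whether a zero delta is the right answer in that case is the open question the slide raises; the sketch only shows how such a rule could be expressed.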
51
FUTURE WORK
Communicating Status
52
FUTURE WORK
Policies & Heuristics
- Current Spread Heuristics
  - Minimize distance
  - Past only
  - Past preferred
  - Near or within distance
  - Single vs. multi-archive
- Refine to meet user expectations
  - Speed (minimize time)
  - Accuracy (minimize temporal error)
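The "past only" and "past preferred" heuristics are named but not defined on the slide. The sketch below shows one plausible interpretation of each; the function names and the exact tie-breaking behavior are assumptions.

```python
from datetime import datetime
from typing import Optional


def past_only(target: datetime, mementos: list[datetime]) -> Optional[datetime]:
    """Closest memento captured at or before the target; None if there is no such memento."""
    past = [m for m in mementos if m <= target]
    return max(past, default=None)


def past_preferred(target: datetime, mementos: list[datetime]) -> Optional[datetime]:
    """Closest past memento when one exists, otherwise the closest future memento."""
    choice = past_only(target, mementos)
    if choice is not None:
        return choice
    # No past memento, so every candidate is in the future and the earliest is closest.
    return min(mementos, default=None)


target = datetime(2005, 5, 14, 1, 36, 8)
future_only = [datetime(2005, 5, 23, 2, 39, 12), datetime(2005, 6, 1, 13, 40, 25)]

print(past_only(target, future_only))       # nothing acceptable, resource left missing
print(past_preferred(target, future_only))  # falls back to the nearest future memento
```

Read this way, "past only" favors accuracy, since it never shows content that postdates the root, while "past preferred" gives some of that up for completeness.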
53
CONTENTS
- Motivation
- Related work
- Preliminary work
- Future work
- Conclusion
54
CONCLUSION
- Extensive research on improving acquisition exists
- Best use of existing collections needs study
- We are looking at
  - Characterizing existing holdings
  - Characterizing temporal coherence
  - Policies that minimize impact of temporal incoherence
  - Visualizations of temporal coherence
55
MY QUESTIONS
Coherent
56
MY QUESTIONS
Violation
57
MY QUESTIONS
What do these mean to users?
[Figure labels: (1), (2), (3), (4)]
58
MY QUESTIONS
What does this mean to users?