TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS) SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY COMPUTER SCIENCE WADL 2013 JULY 25 – 26, 2013 INDIANAPOLIS, INDIANA USA

Joint Conference on Digital Libraries (JCDL) 2013
7/26/13 Scott G. Ainsworth, Michael L. Nelson

CONTENTS
  Motivation
  Related work
  Preliminary work
  Temporal Spread
  Future work
  Conclusion

A FABLE FROM WAYBACK

TEMPORAL SPREAD
  Figure labels: :36:08, +9 days, +18 days, +7 months, +2.1 years

QUESTIONS
  How much temporal spread exists in composite mementos?
  How can temporal spread be minimized?
  What factors contribute, positively or negatively, to spread?
  Does combining multiple archives produce better results?
  Would users with differing goals benefit from different minimization policies and heuristics?
  How can temporal coherence be displayed to users simply?

RELATED WORK
  Control Crawl Data Quality, Future collections
    Spaniol et al. – crawling strategy
    Denev et al. – change rates by MIME type and depth
    Ben Saad et al. – metadata from crawl used to select best results from archive
  Our Focus: Existing Data Quality
    Existing collections
    Datetime selection policies

RELATED WORK
  Use Patterns
    AlNoamany et al. – Archive Access Patterns
      Humans vs. Robots
      Dip, dive, slide, & skim
  Identifying Duplicates
    Simple identity – images, other binary formats
      Direct comparison
      Hash comparison
    HTML, CSS (text)
      Shingling, Jaccard distances, etc.
      SimHash shows the most promise
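The shingling and Jaccard-distance idea named on the slide above can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the 3-word shingle size, and the two sample strings are hypothetical and are not taken from the talk, and SimHash itself is not shown.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 - |A intersect B| / |A union B| (0.0 means identical sets)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical text extracted from two captures of the same page.
page_v1 = "welcome to the computer science department home page"
page_v2 = "welcome to the computer science department news page"

distance = jaccard_distance(shingles(page_v1), shingles(page_v2))
print(distance)  # 4 shared shingles out of 8 distinct ones
```

A small distance suggests near-duplicate text even when the archive has rewritten parts of the HTML, which is where a byte-level hash comparison would fail.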

RELATED WORK – MEMENTO
  HTTP extension for datetime negotiation
  Request
    GET / HTTP/1.1
    …
    Accept-Datetime: Sat, 10 May :21:00 GMT
    …
  Response
    HTTP/1.1 200 OK
    …
    Memento-Datetime: Sat, 14 May :36:08 GMT
    …
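As a minimal sketch of the request and response headers above, the following builds an Accept-Datetime value and parses a Memento-Datetime value with the Python standard library. No network call is made; the helper names and the dates are hypothetical (they are not the dates from the slide).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict[str, str]:
    """Build the Accept-Datetime request header for an aware target datetime."""
    as_utc = target.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


def memento_datetime(response_headers: dict[str, str]) -> datetime:
    """Parse the Memento-Datetime response header into an aware datetime."""
    return parsedate_to_datetime(response_headers["Memento-Datetime"])


# Hypothetical values for illustration only (not the dates on the slide).
target = datetime(2010, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
request_headers = accept_datetime_header(target)
response_headers = {"Memento-Datetime": "Tue, 05 Jan 2010 12:00:00 GMT"}

# Distance between what was asked for and what the archive returned.
offset = memento_datetime(response_headers) - target
print(request_headers["Accept-Datetime"], "->", offset)
```

The difference between the two header values is the quantity that the drift and spread measurements in the rest of the talk are built on.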

CONTENTS
  Motivation
  Related work
  Preliminary work
    How much of the Web is archived
    Temporal Drift
  Temporal Spread
  Future work
  Conclusion

HOW MUCH IS ARCHIVED?
  up to 90%: at least one archived copy
  17 – 49%: 2 – 5 copies
  1 – 8%: 6 – 10 copies
  8 – 63%: more than 10 copies
  Source: JCDL'11
  Chart legend: Internet Archive, Search Engine, Other

TEMPORAL DRIFT
  Comparing two policies
    Sliding – target datetime changes
    Sticky – target datetime held steady
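The difference between the two policies can be sketched with a small simulation. This assumes each visited page has a list of memento datetimes and that the closest memento to the current target is always chosen; the function names and the sample timemaps are hypothetical, not data from the study.

```python
from datetime import datetime, timedelta


def closest(timemap: list[datetime], target: datetime) -> datetime:
    """Pick the memento datetime nearest to the target."""
    return min(timemap, key=lambda m: abs(m - target))


def walk(timemaps: list[list[datetime]], target: datetime, sticky: bool) -> list[timedelta]:
    """Visit pages in order and return the drift from the original target at each step."""
    drifts = []
    current = target
    for timemap in timemaps:
        chosen = closest(timemap, current)
        drifts.append(abs(chosen - target))
        if not sticky:
            # Sliding policy: the chosen memento becomes the next target.
            current = chosen
    return drifts


# Hypothetical walk: page A, then page B, then back to page A.
target = datetime(2010, 1, 1)
timemaps = [
    [datetime(2010, 1, 1)],
    [datetime(2010, 2, 1)],
    [datetime(2010, 1, 1), datetime(2010, 3, 1)],
]

sliding = [d.days for d in walk(timemaps, target, sticky=False)]
sticky = [d.days for d in walk(timemaps, target, sticky=True)]
print(sliding, sticky)
```

In this toy walk the sliding policy returns to page A at a different memento than it started from, while the sticky policy lands on the original one.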

SLIDING TARGET
  Time labels on three successive slides: :36:08, :17:52, :16:10

TEMPORAL DRIFT
  What we expected: 01:36:08
  What we got: 09:16:10

STICKY TARGET
  What if the target is held steady? (Enabled by Memento API)

STICKY TARGET
  MementoFox Extension
  Time labels on three successive slides: :36:08, :17:52, :36:08

DRIFT COMPARISON
Page | Sliding Datetime | Sliding Drift | Sticky Datetime | Sticky Drift
CS Home | :36:08 | – | :36:08 | –
Science Home | :17: | days | :17: | days
CS Home | :16: | days (+21.6 days) | :36:08 | –
Mean | | 32.9 days | | 11.0 days

MEDIAN DRIFT BY STEP
  [Figure: median drift (months) by step number, sliding vs. sticky. Source: JCDL'13]

TEMPORAL SPREAD

COMPOSITE MEMENTO
  Panels: PRESENTATION, STRUCTURE

TEMPORAL SPREAD
  Figure labels: :36:08, +9 days, +18 days, +7 months, +2.1 years

EMBEDDED RESOURCES
Resource | Memento-Datetime | Delta
 | 01:36:08 |
mm_menu.js | :39:12 | 9.0 d
style.css | :39:39 | 9.0 d
gfx-logo-odu-crown.gif | :39:39 | 9.0 d
ddmenu_ddown.js | :39:43 | 9.0 d
university.js | :39:56 | 9.0 d
rmenu_1st_about.png | :40: | d
rmenu_bottom_229.gif | :07: | d
shadow-bl.gif | :55: | d
ecsbdg.jpg | :56: | d
shadow-br.gif | :18: | d
gfx-btn-go-dblue.gif | :34: | d
shadow-tr.gif | :55: | d
header-right1.gif | :06: | d
spacer.gif | :23: | d
jimcheng.gif | :37: | d
jsmith.gif | :58: | d
rmenu_1st_featured_alumni.png | :21: | d
hmenu_college_...-new.png | :14:25 | 7.3 mo
rmenu_1st_upcoming_news.png | :15:14 | 7.3 mo
rmenu_1st_upcoming_events.png | :01:12 | 7.3 mo
lmenu_1st_resources.png | :47:41 | 7.5 mo
bullet_blue_triangle.gif | :43:48 | 7.5 mo
logo-cs.gif | :54:29 | 7.5 mo
rmenu_1st_featured_student.png | :36:07 | 2.1 years
shadow-b.gif | :35:17 | 2.1 years
shadow-r.gif | 404 Not Found |

Embedded Resources: 26
Mean Delta: 125.9 days
Standard Deviation: 207.7 days
Spread: 2.1 years
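The summary figures above (count, mean delta, standard deviation, spread) could be computed along the following lines. This is a sketch under assumptions: delta is taken as |root – embedded| in days, spread as the gap between the earliest and latest Memento-Datetime in the composite (root included), and the standard deviation as the population form; the slides do not state which definitions were used. The function name and sample datetimes are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev

SECONDS_PER_DAY = 86400


def spread_stats(root: datetime, embedded: list[datetime]) -> dict[str, float]:
    """Summarize how far embedded mementos sit from the root memento, in days."""
    # Assumed definition: delta = |root - embedded|.
    deltas = [abs((m - root).total_seconds()) / SECONDS_PER_DAY for m in embedded]
    # Assumed definition: spread = latest minus earliest datetime, root included.
    everything = [root, *embedded]
    spread = (max(everything) - min(everything)).total_seconds() / SECONDS_PER_DAY
    return {
        "count": len(deltas),
        "mean_delta": mean(deltas),
        "std_delta": pstdev(deltas),  # population standard deviation (assumption)
        "spread": spread,
    }


# Hypothetical composite memento: three embedded resources around the root.
root = datetime(2010, 1, 1, 1, 36, 8)
embedded = [
    root + timedelta(days=9),
    root + timedelta(days=18),
    root - timedelta(days=3),
]

stats = spread_stats(root, embedded)
print(stats)
```

With embedded mementos on both sides of the root, the spread (21 days here) exceeds the largest single delta (18 days), which is why the two are reported separately.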

REPRESENTING SPREAD: COMPOSITE MEMENTO TEMPORAL SPREAD CHART
  Chart legend: Root, Embedded, Diff. Domain, Reused

TEMPORAL SPREAD – ODU CS

FIRST EXPERIMENT
  1,000 URIs from DMOZ (Open Directory)
  Download all timemaps
  Download all composite mementos
  Download all embedded resources
  Single and Multiple Archives
  Four Heuristics

PRELIMINARY RESULTS
Count | Description | Percent
1,000 | Root URI-Rs |
910 | Root timemaps | 91%
87,847 | Root URI-Ms in timemaps |
96.5 | URI-Ms per Root URI-R |
85,570 | Root mementos downloaded | 97%
1,488,420 | Embedded URI-Rs |
17.4 | Embedded URI-Rs per Root memento |

SINGLE/MULTI & HEURISTICS
Description | Minimize Distance, Single Archive | Minimize Distance, Multi-Archive | 3-Month Window, Multi-Archive
Embedded URI-Rs | 1,488,440 | 1,488,420 | 1,447,351
Embedded URI-Ms in timemaps | 1,169,787 | 1,186,456 | 500,541
URI-M/Embedded URI-R | | |
% Complete | 73.8% | 75.4% | 33.8%
Mean spread | | |
Standard Deviation | | |

TEMPORAL COHERENCE
  Cases illustrated:
    1 Memento, Bracketed Root
    1 Memento, Root Not Bracketed
    1 Memento, No Last-Modified
    1 Memento, Before Root
    2 Mementos, Root Not Bracketed
    2 Mementos, Use Content – Similarity
    2 Mementos, Contents Equal or Equivalent
    2 Mementos, Contents Not Equal or Equivalent

CURRENT EXPERIMENT
  4,000 URIs from the JCDL'11 "How Much…" paper
  1 URI/month instead of all
  Temporal coherence patterns
  Target: WSDM

FUTURE WORK
  Timemaps, Redirection, Missing Mementos
    Timemaps only tell part of the story
      URI-R redirection (302 from source)
      URI-M redirection (Archive action)
      Mementos in timemaps but not accessible
    Policies must consider user needs
      Leave it missing
      Show "best" substitute

FUTURE WORK
  Similarity & Duplication
    Deltas are currently |root – embedded|
    If bracketing mementos are identical, should delta be zero?
    HTML is usually modified by the archive
      Can't check for equality
      Shingling? SimHash?
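One possible reading of the bracketing question above is sketched below: if the mementos captured just before and just after the root have identical content, the delta is treated as zero; otherwise it falls back to |root – embedded| for the nearer one. This is an assumption about a policy the slide only poses as a question. The byte-level hash comparison suits binary resources such as images; as the slide notes, archived HTML would need a similarity measure instead. All names and data are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from hashlib import sha256
from typing import Optional

Memento = tuple[datetime, bytes]  # (Memento-Datetime, content bytes)


def bracketing(mementos: list[Memento], root: datetime) -> tuple[Optional[Memento], Optional[Memento]]:
    """Return the mementos just before and at-or-after root (list sorted by datetime)."""
    times = [t for t, _ in mementos]
    i = bisect_left(times, root)
    before = mementos[i - 1] if i > 0 else None
    after = mementos[i] if i < len(mementos) else None
    return before, after


def effective_delta(mementos: list[Memento], root: datetime) -> timedelta:
    """Delta under the assumed policy: zero when identical captures bracket the root."""
    if not mementos:
        raise ValueError("no mementos available for this resource")
    before, after = bracketing(mementos, root)
    if before is not None and after is not None:
        if sha256(before[1]).digest() == sha256(after[1]).digest():
            return timedelta(0)
    candidates = [m for m in (before, after) if m is not None]
    return min(abs(t - root) for t, _ in candidates)


# Hypothetical image captured 30 days before and 30 days after the root memento.
root = datetime(2010, 1, 31)
unchanged = [(root - timedelta(days=30), b"logo-v1"), (root + timedelta(days=30), b"logo-v1")]
changed = [(root - timedelta(days=30), b"logo-v1"), (root + timedelta(days=30), b"logo-v2")]

print(effective_delta(unchanged, root), effective_delta(changed, root))
```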

FUTURE WORK
  Communicating Status

FUTURE WORK
  Policies & Heuristics
    Current Spread Heuristics
      Minimize distance
      Past only
      Past preferred
      Near or within distance
      Single vs. multi-archive
    Refine to meet user expectations
      Speed (minimize time)
      Accuracy (minimize temporal error)
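The selection heuristics listed above might look roughly like this in code. The interpretation of each heuristic name is an assumption made for illustration (the slides give only the names), and the function names and candidate datetimes are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def minimize_distance(candidates: list[datetime], target: datetime) -> Optional[datetime]:
    """Closest memento to the target, in either direction."""
    return min(candidates, key=lambda m: abs(m - target), default=None)


def past_only(candidates: list[datetime], target: datetime) -> Optional[datetime]:
    """Latest memento at or before the target; None if there is none."""
    return max((m for m in candidates if m <= target), default=None)


def past_preferred(candidates: list[datetime], target: datetime) -> Optional[datetime]:
    """Use a past memento when one exists, otherwise fall back to the closest."""
    past = past_only(candidates, target)
    return past if past is not None else minimize_distance(candidates, target)


def within_distance(candidates: list[datetime], target: datetime,
                    window: timedelta) -> Optional[datetime]:
    """Closest memento, but only if it lies within the window around the target."""
    best = minimize_distance(candidates, target)
    if best is not None and abs(best - target) <= window:
        return best
    return None


# Hypothetical embedded-resource mementos around a root captured on 10 Jan 2010.
target = datetime(2010, 1, 10)
candidates = [datetime(2010, 1, 1), datetime(2010, 1, 12), datetime(2010, 6, 1)]

print(minimize_distance(candidates, target))                 # nearest overall
print(past_only(candidates, target))                         # nearest in the past
print(within_distance(candidates, target, timedelta(days=1)))  # nothing close enough
```

Returning None from a heuristic corresponds to the "leave it missing" policy option; a caller could instead substitute the result of a looser heuristic.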

CONCLUSION
  Extensive research on improving acquisition exists
  Best use of existing collections needs study
  We are looking at
    Characterizing existing holdings
    Characterizing temporal coherence
    Policies that minimize impact of temporal incoherence
    Visualizations of temporal coherence

MY QUESTIONS
  Figure labels: Coherent, Violation
  What do these mean to users? (figure items labelled (1) through (4))
  What does this mean to users?