Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson Old Dominion University, USA {sainswor, aalsum, hany, mweigle,

Slides:



Advertisements
Similar presentations
Downloading Textual Hidden-Web Content Through Keyword Queries
Advertisements

1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion.
1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Department of Electrical Engineering, Technion Maxim Gurevich Department of Electrical Engineering,
Towards Building a Collection of Web Archiving Research Articles Brenda Reyes Ayala and Cornelia Caragea University of North Texas
1 SMS at the University of Hong Kong Libraries William Ko, HKU Libraries Dr Frank Tong, ETI, HKU.
Measuring the Web. What? Use, size –Of entire Web, of sites (popularity), of pages –Growth thereof Technologies in use (servers, media types) Properties.
Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003.
Open Annotation Collaboration Rob Sanderson, Herbert Van de Sompel DMSS Meeting, May 14-15, Stanford, CA Robert Sanderson –
Prototypes of pro-active approaches to support the archiving of web references for scholarly communications Richard Wincewicz 1, Peter Burnhill 1 & Herbert.
EVALUATING THE TEMPORAL COHERENCE OF ARCHIVED PAGES SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY IIPC 2015 APRIL 27 – MAY 1, 2015 STANFORD.
distributed web crawlers1 Implementation All following experiments were conducted with 40M web pages downloaded with Stanford’s webBase crawler in Dec.
Efficient Search Engine Measurements Maxim Gurevich Technion Ziv Bar-Yossef Technion and Google.
2015 Hany SalahEldeen Dissertation Defense1 Detecting, Modeling, & Predicting User Temporal Intention in Social Media Hany M. SalahEldeen Doctor of Philosophy.
Search Engines and their Public Interfaces: Which APIs are the Most Synchronized? Frank McCown and Michael L. Nelson Department of Computer Science, Old.
SEO Webinar - With Neil Palmer of IM3.co.uk In Partnership with Huddlebuy How do I improve my website traffic with SEO? Covering: What is SEO? Why is SEO.
Search engine structure Web Crawler Page archive Page Analizer Control Query resolver ? Ranker text Structure auxiliary Indexer.
Web Archiving Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike.
Website Reconstruction using the Web Infrastructure Frank McCown Doctoral Consortium June.
Web Characterization: What Does the Web Look Like?
Memento Update CNI Task Force Meeting, Spring Memento Herbert Van de Sompel Robert Sanderson Michael L. Nelson Giant Leaps.
Archival HTTP Redirection Retrieval Policies Temporal Web Analytics Workshop 2013, Rio De Janiro Ahmed AlSum, Michael L. Nelson Old Dominion University.
Persistent Annotations Deserve New URIs Abdulla Alasaadi Old Dominion University Michael L. Nelson JCDL 2011 Ottawa,
Accessing the Deep Web Bin He IBM Almaden Research Center in San Jose, CA Mitesh Patel Microsoft Corporation Zhen Zhang computer science at the University.
TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS) SCOTT G. AINSWORTH MICHAEL L. NELSON OLD DOMINION UNIVERSITY COMPUTER SCIENCE WADL 2013.
Searching Information. General Steps Identifying Key Words, Synonyms, and Key Phrases Constructing an effective search statement Advance search/boolean.
Visualizing Digital Collections at Archive-It Michele C. Weigle, Michael L. Nelson Web Sciences and Digital Libraries (WS-DL) Lab Department of Computer.
Profiling Web Archives Michael L. Nelson Ahmed AlSum, Michele C. Weigle Herbert Van de Sompel, David Rosenthal IIPC General Assembly Paris, France, May.
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State.
My Website Was Lost, But Now It’s Found Frank McCown CS 110 – Intro to Computer Science April 23, 2007.
Google Pointers to Make You Ogle Connie Ury, B.D. Owens Library.
Profiling Web Archive Coverage for Top-Level Domain & Content Language Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Van de Sompel International.
Client-Side Preservation Techniques for ORE Aggregations Michael L. Nelson & Sudhir Koneru Old Dominion University, Norfolk VA OAI-ORE Specification Roll-Out.
NSDL October 12-15, 2003Eisenhower National Clearinghouse Slide 1 NSDL and the Open Archives Initiative NSDL – OAI – and the Eisenhower National Clearinghouse.
Thumbnail Summarization Techniques For Web Archives Ahmed AlSum * Stanford University Libraries Stanford CA, USA 1 Michael L. Nelson.
Evaluation of the NSDL and Google for Obtaining Pedagogical Resources Frank McCown, Johan Bollen, and Michael L. Nelson Old Dominion University Computer.
Ph.D. Progress Report Frank McCown 4/14/05. Timeline Year 1 : Course work and Diagnostic Exam Year 2 : Course work and Candidacy Exam Year 3 : Write and.
Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007.
Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer.
Lazy Preservation: Reconstructing Websites by Crawling the Crawlers Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen Old Dominion University.
Hiberlink is funded by the Andrew W. Mellon Foundation Investigating Reference Rot in Web-Based Scholarly Communication Martin Klein Los Alamos National.
Hiberlink is funded by the Andrew W. Mellon Foundation The Missing Link Proposal #hiberlink #memento Herbert.
Client-Side Preservation Techniques for ORE Aggregations Michael L. Nelson & Sudhir Koneru Old Dominion University, Norfolk VA OAI-ORE Specification Roll-Out.
Brass: A Queueing Manager for Warrick Frank McCown, Amine Benjelloun, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk,
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
Networked Information Resources Federated search, link server, e-books.
Resurrecting My Revolution Using Social Link Neighborhood in Bringing Context to the Disappearing Web Hany SalahEldeen & Michael Nelson Resurrecting My.
Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool Justin F. Brunelle Michael L. Nelson Lyudmila Balakireva Robert Sanderson.
Profiling Web Archives
When Should I Make Preservation Copies of Myself?
Who and What Links to the Internet Archive
Profiling Web Archive Coverage for Top-Level Domain & Content Language
Michele C. Weigle and Michael L. Nelson
Lazy Preservation, Warrick, and the Web Infrastructure
Information Integration for Digital Libraries
Agreeing to Disagree: Search Engines and Their Public Interfaces
Introduction to Digital Libraries Assignment #1
Web Server Design Week 4 Old Dominion University
Introduction to Digital Libraries Assignment #3
Visualizing Digital Collections at Archive-It
Characterization of Search Engine Caches
Persistent Annotations Deserve New URIs
Introduction to Digital Libraries Assignment #3
Introduction to Digital Libraries Assignment #3
Introduction to Digital Libraries Assignment #3
Introduction to Digital Libraries Assignment #3
Web Server Design Assignment #5 Extra Credit
Introduction to Digital Libraries Assignment #1
Introduction to Digital Libraries Assignment #1
Old Dominion University Computer Science IIPC New Member
Presentation transcript:

Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson Old Dominion University, USA {sainswor, aalsum, hany, mweigle, JCDL 2011, Ottawa, Canada * This work supported in part by the Library of Congress and NSF IIS

Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson Old Dominion University, USA {sainswor, aalsum, hany, mweigle, JCDL 2011, Ottawa, Canada * This work supported in part by the Library of Congress and NSF IIS

+8000 copies How Many Archive Copies?

Research Question How Much of the Web is Archived?

4 Sample sets – 1000 URIs each Experiment Sampling Techniques 5

Indexed by: API API + Web Web 49%41%30%54% Sampling Techniques DMOZ HTTP Status# xx  xx  Others50 4xx135 5xx4 Timeout112 All the available DMOZ history: 100 snapshots made from July 20, 2000 through October 3, 2010 Remove duplicates, non-HTTP, invalid URIS Randomly selected 1000 URIs 6

Sampling Techniques Delicious HTTP Status# xx  xx  Others1 4xx8 5xx3 Timeout3 We used Delicious Recent random URI generator ( ) on Nov. 22, 2010 to get 1000 URIs. 7 Indexed by: API API + Web Web 95%86%88%95%

Sampling Techniques Bitly HTTP Status# xx  xx  Others36 4xx197 5xx6 Timeout30 Bitly URI consists of a 1–6 character, alphanumeric hash value appended to Random hash values were created and dereferenced until 1,000 target URIs were discovered 8 Indexed by: API API + Web Web 21%22%24%30%

Sampling Techniques Search Engine HTTP Status# xx  xx  Others3 4xx16 5xx0 Timeout21 Random Sample from Search Engine technique [Bar-Yossef 2008] The phrase pool was selected from the 5- grams in Google’s N-gram data. A random sample of these queries was used to obtain URIs, of which 1,000 were selected at random. 9 Indexed by: API API + Web Web 55%98%70%73%

Experiment 10 For each sample set, we used Memento Aggregator to get all the possible archived copies (Mementos).For each sample set, we used Memento Aggregator to get all the possible archived copies (Mementos). For each URI, Memento Aggregator responded with TimeMap for this URI.For each URI, Memento Aggregator responded with TimeMap for this URI. Example Example ;rel="first memento";datetime="Sun, 19 Aug :42:33 GMT“, ; rel="memento"; datetime="Sun, 16 Dec :02:48 GMT",

Results 65% 63% 11

We have 3 categories of archives Internet Archive (classic interface)Internet Archive (classic interface) Search engineSearch engine Other archivesOther archives Archive categories UK US 12

Memento Distribution, ordered by the first observation date. Internet Archive Search Engine Others Archives Internet Archive Search Engine Others Archives

Memento Distribution, ordered by the first observation date.

How often are URIs archived?

What affects the archival rate? –URI source (More popular – More archival) –BackLinks (Weak positive relationship) Archive Coverage –Internet Archive (Best) –Search Engine (Good), but only one copy. –The other archives cover only a small fraction with recent copies. Analysis 17

Tell me what is your URI source!! Conclusion How much of the Web is Archived? Including SE cache Excluding SE Cache 90%79% 97%68% 35%16% 88%19% 18 Contact: Ahmed AlSum

Ziv Bar-Yossef and Maxim Gurevich. Random sampling from a search engine’s index. J. ACM, 55(5), Frank McCown and Michael L. Nelson. Characterization of search engine caches. In Proceedings of IS&T Archiving 2007, May Herbert Van de Sompel, Michael Nelson, and Robert Sanderson. HTTP framework for time-based access to resource states — Memento, November vandesompelmemento/. References 19

Contact: Ahmed AlSum