Repository Synchronization Using NNTP and SMTP Michael L. Nelson, Joan A. Smith, Martin Klein Old Dominion University Norfolk VA www.cs.odu.edu/~{mln,jsmit,mklein}

Presentation transcript:

Repository Synchronization Using NNTP and SMTP Michael L. Nelson, Joan A. Smith, Martin Klein Old Dominion University Norfolk VA DLF Spring 2006 Austin TX April 10-12, 2006

Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”

Alternate Models of Preservation
Lazy Preservation
– Let Google, IA et al. preserve your website
Just-In-Time Preservation
– Find a “good enough” replacement web page
Web Server Based Preservation
– Use Apache modules to create archival-ready resources
Shared Infrastructure Preservation
– Push your content to sites that might preserve it

Shared, Existing Infrastructure
Can we (re)use existing installed network infrastructure for preservation purposes?
Who has the Bigger Fortress?

Experiment & Simulation
Inject the contents of an OAI-PMH repository directly into:
– Email (SMTP)
– Usenet News (NNTP)
Instrument existing email, news servers
Use mod_oai to do resource harvesting
– complex object formats (e.g. MPEG-21 DIDL) used to encode the resources as “lumps of XML”
– results are generalizable to any repository system
Analyze testbed, simulate very large collections

Test Repository
Website with 72 files
– HTML, PDF, PNG, JPEG, GIF
– 1KB MB
Used a script to harvest the MPEG-21 DIDLs, and then:
– attach to outbound email mesgs
– post to a moderated newsgroup (repository.odu.test1)

General Architecture

Adding Attachments / Headers
(figure: outgoing mail, incoming mail)

Headers
– OAI-PMH & HTTP headers
– base64 encoded DIDL
– original mesg
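The three parts named on this slide (extra headers, a base64-encoded DIDL, and the original mesg) can be sketched with Python's standard `email` package. This is a minimal sketch, not the instrumented mail server used in the experiment: the function name `attach_didl`, the `X-OAI-PMH-*` header names, and the attachment filename are assumptions made for illustration.

```python
from email.message import EmailMessage


def attach_didl(msg: EmailMessage, didl_xml: bytes,
                identifier: str, datestamp: str) -> EmailMessage:
    """Add OAI-PMH style headers and a DIDL attachment to an outbound message."""
    # Hypothetical header names; the slides do not list the real ones.
    msg["X-OAI-PMH-Identifier"] = identifier
    msg["X-OAI-PMH-Datestamp"] = datestamp
    # Bytes attachments are base64 encoded by default, so the original
    # message body stays readable and the DIDL rides along as a MIME part.
    msg.add_attachment(didl_xml, maintype="application", subtype="xml",
                       filename="resource.didl.xml")
    return msg


# Usage: an ordinary outgoing message picks up one repository item.
original = EmailMessage()
original["From"] = "sender@example.org"
original["To"] = "recipient@example.org"
original["Subject"] = "Lunch on Friday?"
original.set_content("The original mesg is left untouched.")

outbound = attach_didl(original, b"<didl:DIDL>...</didl:DIDL>",
                       "oai:example.org:item-1", "2006-04-10")
```

The recipient's mail client sees a normal message with one extra attachment; a preservation-aware receiver could pull the DIDL back out with `iter_attachments()`.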

SMTP Overhead
– ~1 sec penalty per mesg
– diminishing returns for skipping mesgs

mail.cs.odu.edu
30 days of traffic
– 505,987 mesgs
– 4081 unique hosts
– daily mean: 16,866; std dev: 5147
$P(x) = a\,x^{-b}$; we measured $b \approx 1.6$
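The figures on this slide can be checked and the power law evaluated in a few lines. This is a sketch under a stated assumption: the slide does not say what x measures, so the code treats x as the rank of a destination host (1 = busiest) and picks a so that the probabilities over the 4081 hosts sum to 1.

```python
MESSAGES = 505_987      # mesgs observed over the measurement window
DAYS = 30
UNIQUE_HOSTS = 4081
B = 1.6                 # measured exponent from the slide

# Daily mean follows directly from the totals on the slide.
daily_mean = MESSAGES / DAYS


def power_law(x: int, a: float, b: float = B) -> float:
    """P(x) = a * x**(-b)."""
    return a * x ** (-b)


# Assumption: x is a host rank in 1..UNIQUE_HOSTS, so a is the
# normalizing constant that makes the probabilities sum to 1.
a = 1.0 / sum(rank ** (-B) for rank in range(1, UNIQUE_HOSTS + 1))

# Relative weight of the second-ranked host versus the first: 2**(-1.6).
ratio_second_to_first = power_law(2, a) / power_law(1, a)
```

Under this reading, a handful of hosts receive most of the traffic, so only those hosts would accumulate repository content quickly.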

News

News Posting
– OAI-PMH & HTTP headers
– base64 encoded DIDL

News Overhead

News Policies

Simulation Parameters
Repository
– 100,000 items
– 1MB/item
– 100 daily additions
– 400 daily updates
Time
– 2000 days (5.5 years)
– granularity=1
Email
– follows ODU power law example
News
– servers hold contents for 30 days
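The news-side parameters above are enough for a toy simulation of how much of the repository a news server holds at any time. This is a rough sketch, not the authors' simulator: the posting policy (post each day's additions and updates, plus an optional fixed `baseline_per_day` rotation through the whole repository), the uniform random choice of updated items, and every name in the code are assumptions.

```python
import random
from collections import Counter, deque


def simulate_news_coverage(initial_items=100_000, daily_additions=100,
                           daily_updates=400, days=2000, retention_days=30,
                           baseline_per_day=0, seed=0):
    """Return, for each day, the fraction of repository items on the news server."""
    rng = random.Random(seed)
    n_items = initial_items
    window = deque()   # one set of posted item ids per retained day
    live = Counter()   # item id -> number of retained days containing it
    cursor = 0         # next item in the baseline rotation
    coverage = []
    for _day in range(days):
        # New items are always posted.
        posted = set(range(n_items, n_items + daily_additions))
        n_items += daily_additions
        # Updated items are reposted (assumed uniformly random).
        posted.update(rng.sample(range(n_items), daily_updates))
        # Optional baseline: walk through the repository in order.
        for _ in range(baseline_per_day):
            posted.add(cursor)
            cursor = (cursor + 1) % n_items
        window.append(posted)
        live.update(posted)
        # Expire the day that falls outside the retention period.
        if len(window) > retention_days:
            for item in window.popleft():
                live[item] -= 1
                if live[item] == 0:
                    del live[item]
        coverage.append(len(live) / n_items)
    return coverage
```

With the slide's numbers the repository grows to 100,000 + 2000 × 100 = 300,000 items, while a 30-day retention window holds at most 30 × 500 = 15,000 items from additions and updates alone, so under this assumed policy coverage depends almost entirely on the baseline rotation.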

NNTP Results

Results (Without Memory)

Results (With Memory)

Discussion
We’ve examined the worst case scenario
– large, active repository
– sending contents by-value
Optimizations / Alternatives
– smaller, less dynamic repositories
– sending contents by-reference
– use for repository discovery, not for content interchange: instead of sending “GetRecord” results, send “Identify” results and let interested parties return to your site with proper harvesters
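The difference between the two OAI-PMH requests named above is visible in the request URLs themselves. `Identify` and `GetRecord` are standard OAI-PMH verbs; the base URL, the item identifier, and the `oai_didl` metadata prefix below are placeholders assumed for illustration.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint; substitute a real OAI-PMH base URL.
BASE_URL = "http://repository.example.org/modoai"


def oai_request(base_url: str, verb: str, **args: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return base_url + "?" + urlencode({"verb": verb, **args})


# By-reference: a small response that describes the repository itself.
identify_url = oai_request(BASE_URL, "Identify")

# By-value: one full record (here a DIDL) per request.
get_record_url = oai_request(BASE_URL, "GetRecord",
                             identifier="oai:example.org:item-1",
                             metadataPrefix="oai_didl")
```

Sending only the `Identify` response keeps each message small and leaves the bulk transfer to harvesters that contact the repository directly.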

Summary
Shared, existing infrastructure can be used to push content to unknown preservation partners
– exploiting not just hardware infrastructure, but human communication patterns for resource discovery as well
While not possessing ideal DL/Archival capabilities, these methods are congruent with standard web practices
– Gmail, Google Groups, etc. will always have more disks than you…