SCALABLE OPEN ACCESS Hussein Suleman hussein@cs.uct.ac.za University of Cape Town Department of Computer Science Advanced Information Management Laboratory High Performance Computing Laboratory April 2007
Open Access
What is Open Access? Free online access to electronic resources: research papers, courseware, ETDs, etc.
Why? Lower costs, empower producers, empower consumers, improve visibility, …
How?
  Institutional Repository: online system to manage documents, typically at one institution.
  Open Access Journal: online system to manage publication and dissemination.
  Advocacy, Policy, Procedures, Management, …
  Software Tools: DSpace, EPrints, OJS, …
Very Large Data Collections Source: http://www2.warwick.ac.uk/fac/sci/physics/research/astro/postgraduate/galplane/
Sizing Open Access
UCT-CS Publication Archive: average size of a PDF document is 848 KB.
UCT 2005 Research Report: 4150 research document artefacts.
Estimated output for one year: 4150 documents, 3.15 GB.
And this is only published peer-reviewed research as we know it!
What about theses/dissertations? 1000 documents at 5 MB each, totalling 5 GB.
What about technical and project reports? 1000 documents at 10 MB each, totalling 10 GB.
Courseware? 50 MB per course?
Datasets? Almost infinite!
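The sizing estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 MB = 1000 KB, 1 GB = 1000 MB); the function name `total_gb` is invented for illustration. Multiplying the quoted 848 KB average by 4150 documents gives roughly 3.5 GB, a little above the 3.15 GB on the slide, which may have been derived from different underlying data. Either way the order of magnitude is the same.

```python
# Decimal storage units (an assumption; the slide does not say which it uses).
KB = 1_000
MB = 1_000 * KB
GB = 1_000 * MB


def total_gb(doc_count: int, avg_size_bytes: int) -> float:
    """Return the total size in GB of doc_count documents of a given average size."""
    return doc_count * avg_size_bytes / GB


research = total_gb(4150, 848 * KB)   # published research output for one year
theses = total_gb(1000, 5 * MB)       # theses and dissertations
reports = total_gb(1000, 10 * MB)     # technical and project reports

print(f"research: {research:.2f} GB, theses: {theses:.0f} GB, reports: {reports:.0f} GB")
```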
Repository Software UnScalability
Most systems do not scale well beyond small collections.
[Figure: evaluation charts for EPrints and DSpace. Source: Technical Evaluation of selected Open Source Repository Systems, Catalyst IT]
Service Provision UnScalability
UCT-CS Archive:
  Average of 6.24 user accesses per document per month.
  Average of 18 accesses per document per month.
For 83000 documents: 1.494 million accesses per month, or 34.58 accesses per minute.
What about search/browse and other services?
And this is only published peer-reviewed research as we know it!
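The access-rate figures follow directly from the numbers on the slide. A minimal check, assuming a 30-day month (the variable names are illustrative only):

```python
documents = 83_000
accesses_per_doc_per_month = 18          # total accesses, as quoted on the slide
minutes_per_month = 30 * 24 * 60         # assumes a 30-day month

monthly_accesses = documents * accesses_per_doc_per_month
accesses_per_minute = monthly_accesses / minutes_per_month

print(monthly_accesses)                  # 1494000
print(round(accesses_per_minute, 2))     # 34.58
```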
Some Solutions
Devise completely new algorithms and systems to deal with massive quantities of information.
  Fedora/SRB/etc., Parallel OAI-PMH, Terascale IR systems…
Use computing resources more efficiently, to maximise benefits with minimum cost.
  Efficient Cluster and Grid computing
Make the users’ computer do more work.
  Client-side computation: AJAX
Make the users do all or most of the work.
  Web 2.0
Scalable Repositories
Fedora: digital repository system developed at Cornell, with an API for higher-level services.
Storage Resource Broker: storage abstraction for large-scale stores, developed at the San Diego Supercomputer Center.
Grid-based Storage Systems: systems that use Grid computing to store data in a distributed fashion.
Amazon and Google: third-party providers of storage at a premium.
Parallel Harvesting
Multiple harvesters cycle through harvest and process operations in parallel.
Significant benefit when workload is high.
Parallelism helps even on one machine!
What about parallel data provision?
[Diagram labels: OAI Data Provider; Primary Harvester; drones; Beowulf cluster or multiprocessor]
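The primary-harvester-and-drones idea can be sketched with a thread pool. This is only an illustration, not the system described in the talk: `harvest` and `process` are hypothetical stand-ins that fabricate records locally instead of issuing OAI-PMH requests, and splitting the work by set name is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def harvest(set_name: str) -> list[str]:
    """Hypothetical stand-in for an OAI-PMH harvest of one set (no network access)."""
    return [f"{set_name}:record{i}" for i in range(3)]


def process(records: list[str]) -> list[str]:
    """Hypothetical stand-in for parsing or indexing the harvested records."""
    return [record.upper() for record in records]


def drone(set_name: str) -> list[str]:
    """One drone performs a full harvest-then-process cycle for its share of the work."""
    return process(harvest(set_name))


def primary_harvester(set_names: list[str], drones: int = 4) -> list[str]:
    """Hand each set to a drone and merge the results in submission order."""
    with ThreadPoolExecutor(max_workers=drones) as pool:
        batches = list(pool.map(drone, set_names))
    return [record for batch in batches for record in batch]


harvested = primary_harvester(["cs", "physics", "maths"], drones=2)
print(len(harvested), harvested[0])
```

Threads suit this sketch because real harvesting is dominated by waiting on the data provider, which is one plausible reason parallelism helps even on a single machine.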
High(er) Performance IR for the Rest of Us
Efficient search engine on a small cluster, more likely in developing countries.
Nodes either do querying or indexing and can swap if needed.
Reasonably good performance.
Work is being extended for larger collections and grids. Terascale IR?
[Diagram labels: Job dispatcher; index nodes; query nodes; Beowulf cluster]
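One way a job dispatcher might decide when nodes should swap between querying and indexing is to split the cluster in proportion to pending work. The policy below is a hypothetical sketch for illustration, not the scheduling rule used in the system described; `assign_roles` and its parameters are invented names, and it assumes at least two nodes.

```python
def assign_roles(node_count: int, pending_queries: int, pending_index_jobs: int) -> dict[str, int]:
    """Split nodes between query and index roles in proportion to pending work.

    Always keeps at least one node in each role (assumes node_count >= 2).
    """
    total = pending_queries + pending_index_jobs
    if total == 0:
        query_nodes = node_count // 2          # idle cluster: split evenly
    else:
        query_nodes = round(node_count * pending_queries / total)
    query_nodes = max(1, min(node_count - 1, query_nodes))
    return {"query": query_nodes, "index": node_count - query_nodes}


print(assign_roles(8, pending_queries=90, pending_index_jobs=10))
print(assign_roles(8, pending_queries=10, pending_index_jobs=90))
```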
High-Level Component-Based DL Scalability
Split DL into components and spread across cluster.
Services are Web-distributed.
Make services mobile and create replicates.
Performance improvement and better use of multiple cheap computers.
Moving digital archives to grids can deal with service provision scalability. See DILIGENT…
[Diagram labels: Registry; node1; node2; Resolver; Resolver instances]
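The registry-and-replicas arrangement can be illustrated with a small in-memory registry that hands out service instances in round-robin order. This is a hypothetical sketch: the class, its methods, and the round-robin policy are assumptions made for the example, not the design of the system shown on the slide.

```python
from itertools import cycle


class Registry:
    """Tracks which nodes host a replica of each service and resolves requests to one of them."""

    def __init__(self) -> None:
        self._instances: dict[str, list[str]] = {}
        self._cursors: dict[str, cycle] = {}

    def register(self, service: str, node: str) -> None:
        """Record a new replica of a service and restart the rotation over all its replicas."""
        self._instances.setdefault(service, []).append(node)
        self._cursors[service] = cycle(self._instances[service])

    def resolve(self, service: str) -> str:
        """Return the next node hosting the service, in round-robin order."""
        return next(self._cursors[service])


registry = Registry()
registry.register("resolver", "node1")
registry.register("resolver", "node2")
targets = [registry.resolve("resolver") for _ in range(3)]
print(targets)
```

Spreading requests across replicas in this way is how cheap additional machines translate into more service capacity.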
Web 2.0 – User Contributions Make users provide as much information as possible. Users managing content means less central management. Greater scalability!
Ajax for DL Services AJAX supports applications or services within the browser. Move computation from server to client Greater scalability!
Scalable Preservation XML is often used for Preservation while Databases are used for Access. How do we make XML tools scalable? Can we?
Concluding Thoughts
Open Access and Digital Repositories must treat scalability as being as fundamental as preservation and access.
Realistically, we have not had major scalability problems yet, but we don’t have many Open Access systems either.
Google works because scalability is a primary concern: they intend to index the world’s information.
If we believe in similar ideals (like producing or curating the world’s information), we too must plan for scalability!
That’s all Folks! Direct all comments to: hussein@cs.uct.ac.za