SCALABLE OPEN ACCESS Hussein Suleman

Presentation transcript:

SCALABLE OPEN ACCESS
Hussein Suleman, hussein@cs.uct.ac.za
University of Cape Town, Department of Computer Science
Advanced Information Management Laboratory, High Performance Computing Laboratory
April 2007

Open Access
What is Open Access?
  Free online access to electronic resources: research papers, courseware, ETDs (electronic theses and dissertations), etc.
Why?
  Lower costs, empower producers, empower consumers, improve visibility, …
How?
  Institutional Repository: online system to manage documents, typically at one institution.
  Open Access Journal: online system to manage publication and dissemination.
  Advocacy, Policy, Procedures, Management, …
  Software Tools: DSpace, EPrints, OJS, …

Very Large Data Collections Source: http://www2.warwick.ac.uk/fac/sci/physics/research/astro/postgraduate/galplane/

Sizing Open Access
UCT-CS Publication Archive: average size of a PDF document: 848 KB
UCT 2005 Research Report: 4150 research document artefacts
Estimated output for one year: 4150 documents, 3.15 GB
And this is only published peer-reviewed research as we know it!
What about theses/dissertations? 1000 documents, 5 MB each, totalling 5 GB
What about technical and project reports? 1000 documents, 10 MB each, totalling 10 GB
Courseware? 50 MB per course?
Datasets? Almost infinite!
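
The arithmetic behind these estimates can be reproduced with a few lines of Python. This is only a sketch: it assumes decimal units (1 GB = 1,000,000 KB), which matches the slide's 5 GB and 10 GB totals, and the function name `total_gb` is made up for illustration. Under that assumption the research-paper figure comes out near 3.5 GB, not the slide's 3.15 GB; the slide does not show how its number was derived, so the difference may come from rounding or a different average size.

```python
# Rough yearly storage estimate from the figures on the slide.
# Assumption: decimal units, i.e. 1 MB = 1000 KB and 1 GB = 1,000,000 KB.

def total_gb(documents: int, avg_kb: float) -> float:
    """Total collection size in GB for `documents` items of `avg_kb` KB each."""
    return documents * avg_kb / 1_000_000

papers = total_gb(4150, 848)       # peer-reviewed research output
theses = total_gb(1000, 5_000)     # 5 MB per thesis/dissertation
reports = total_gb(1000, 10_000)   # 10 MB per technical/project report

print(f"papers:  {papers:.2f} GB")
print(f"theses:  {theses:.2f} GB")
print(f"reports: {reports:.2f} GB")
print(f"total:   {papers + theses + reports:.2f} GB per year")
```

Courseware and datasets are left out because the slide gives no document counts for them.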

Repository Software UnScalability
Most systems do not scale well beyond small collections.
[Figure panels: EPrints, DSpace]
Source: Technical Evaluation of selected Open Source Repository Systems, Catalyst IT

Service Provision UnScalability
UCT-CS Archive:
  Average of 6.24 user accesses per document per month
  Average of 18 accesses per document per month
For 83000 documents:
  1.494 million accesses per month
  34.58 accesses per minute
What about search/browse and other services?
And this is only published peer-reviewed research as we know it!
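
The load figures follow directly from the per-document average. A minimal sketch of the calculation, assuming a 30-day month (an assumption, though it is the value that reproduces the slide's 34.58 accesses per minute); the variable names are illustrative only:

```python
# Access load implied by the slide's figures.
documents = 83_000
accesses_per_doc_per_month = 18
minutes_per_month = 30 * 24 * 60   # assumption: 30-day month

accesses_per_month = documents * accesses_per_doc_per_month
accesses_per_minute = accesses_per_month / minutes_per_month

print(f"{accesses_per_month / 1e6:.3f} million accesses per month")
print(f"{accesses_per_minute:.2f} accesses per minute")
```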

Some Solutions
Devise completely new algorithms and systems to deal with massive quantities of information.
  Fedora/SRB/etc., Parallel OAI-PMH, Terascale IR systems, …
Use computing resources more efficiently, to maximise benefits with minimum cost.
  Efficient Cluster and Grid computing
Make the users' computers do more work.
  Client-side computation: AJAX
Make the users do all or most of the work.
  Web 2.0

Scalable Repositories
Fedora: digital repository system developed at Cornell, with an API for higher-level services.
Storage Resource Broker: storage abstraction for large-scale stores, developed at the San Diego Supercomputer Center.
Grid-based Storage Systems: systems that use Grid computing to store data in a distributed fashion.
Amazon and Google: third-party providers of storage at a premium.

Parallel Harvesting
Multiple harvesters cycle through harvest and process operations in parallel.
Significant benefit when workload is high.
Parallelism helps even on one machine!
What about parallel data provision?
[Diagram labels: OAI Data Provider; Primary Harvester; drones; Beowulf cluster or multiprocessor]
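
The harvest-then-process cycle run by several drones can be sketched as below. This is not the system from the talk: the in-memory `FAKE_PROVIDER`, the identifiers, and all function names are hypothetical stand-ins, and a real harvester would issue OAI-PMH requests over HTTP where this sketch does a dictionary lookup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an OAI data provider: identifier -> raw record.
FAKE_PROVIDER = {
    f"oai:example:{i}": f"<record>title {i}</record>" for i in range(20)
}

OPEN_TAG, CLOSE_TAG = "<record>", "</record>"

def harvest(identifiers):
    """Harvest step: fetch the raw records for one batch of identifiers."""
    return [FAKE_PROVIDER[identifier] for identifier in identifiers]

def process(records):
    """Process step: strip the surrounding <record> tags from each record."""
    return [r[len(OPEN_TAG):-len(CLOSE_TAG)] for r in records]

def drone(identifiers):
    """One drone cycles through harvest, then process, for its batch."""
    return process(harvest(identifiers))

def primary_harvester(all_ids, drones=4, batch_size=5):
    """Split the identifiers into batches and hand them to a pool of drones."""
    batches = [all_ids[i:i + batch_size]
               for i in range(0, len(all_ids), batch_size)]
    with ThreadPoolExecutor(max_workers=drones) as pool:
        results = list(pool.map(drone, batches))  # map preserves batch order
    return [title for batch in results for title in batch]

titles = primary_harvester(sorted(FAKE_PROVIDER))
print(len(titles), "records harvested and processed")
```

Threads are used here because harvesting is mostly waiting on the network, which is one plausible reason parallelism can help even on a single machine; on a cluster the drones would be separate processes or nodes.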

High(er) Performance IR for the Rest of Us
Efficient search engine on a small cluster, the kind of hardware more likely to be available in developing countries.
Nodes do either querying or indexing and can swap roles if needed.
Reasonably good performance.
Work is being extended for larger collections and grids. Terascale IR?
[Diagram labels: Job dispatcher; index nodes; query nodes; Beowulf cluster]
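
One way to picture nodes swapping between indexing and querying is a dispatcher that compares the pending work per node for each role. The policy below is purely an assumption for illustration (the slide only says that nodes can swap if needed), and the `Node` class and `rebalance` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    role: str  # either "index" or "query"

def rebalance(nodes, pending_index_jobs, pending_query_jobs):
    """Move one node toward the role with more pending work per node.

    Assumed policy, not taken from the talk. At least one node is always
    kept in each role.
    """
    index_nodes = [n for n in nodes if n.role == "index"]
    query_nodes = [n for n in nodes if n.role == "query"]
    # max(..., 1) avoids division by zero if a role has no nodes.
    index_load = pending_index_jobs / max(len(index_nodes), 1)
    query_load = pending_query_jobs / max(len(query_nodes), 1)
    if query_load > index_load and len(index_nodes) > 1:
        index_nodes[-1].role = "query"
    elif index_load > query_load and len(query_nodes) > 1:
        query_nodes[-1].role = "index"
    return nodes

cluster = [Node("n1", "index"), Node("n2", "index"),
           Node("n3", "query"), Node("n4", "query")]

# Many queries waiting, little indexing work: one index node switches role.
rebalance(cluster, pending_index_jobs=2, pending_query_jobs=40)
print([(n.name, n.role) for n in cluster])
```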

High-Level Component-Based DL Scalability
Split the DL into components and spread them across a cluster.
Services are Web-distributed.
Make services mobile and create replicas.
Performance improvement and better use of multiple cheap computers.
Moving digital archives to grids can address service provision scalability. See DILIGENT…
[Diagram labels: Registry; node1; node2; Resolver; Resolver instances]
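
A registry that records where service replicas run, plus a resolver that picks one per request, might look roughly like this. The classes, the round-robin choice, and the service name `"search"` are all assumptions made for illustration; the talk does not specify how its registry or resolver select a replica.

```python
class Registry:
    """Records which nodes currently host a replica of each service."""

    def __init__(self):
        self._replicas = {}

    def register(self, service, node):
        self._replicas.setdefault(service, []).append(node)

    def replicas(self, service):
        return list(self._replicas.get(service, []))


class Resolver:
    """Resolves a service name to one replica, round-robin (assumed policy)."""

    def __init__(self, registry):
        self._registry = registry
        self._next = {}

    def resolve(self, service):
        nodes = self._registry.replicas(service)
        if not nodes:
            raise LookupError(f"no replica registered for {service!r}")
        i = self._next.get(service, 0)
        self._next[service] = i + 1
        return nodes[i % len(nodes)]


registry = Registry()
registry.register("search", "node1")
registry.register("search", "node2")

resolver = Resolver(registry)
chosen = [resolver.resolve("search") for _ in range(3)]
print(chosen)
```

Spreading requests over replicas is what lets several cheap machines share the load of one service.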

Web 2.0: User Contributions
Make users provide as much information as possible.
Users managing content means less central management.
Greater scalability!

AJAX for DL Services
AJAX supports applications or services within the browser.
Move computation from server to client.
Greater scalability!

Scalable Preservation
XML is often used for preservation, while databases are used for access.
How do we make XML tools scalable? Can we?

Concluding Thoughts
Open Access and Digital Repositories must treat scalability as being as fundamental as preservation and access.
Realistically, we have not had major scalability problems yet, but we don't have many Open Access systems either.
Google works because scalability is a primary concern. They intend to index the world's information.
If we believe in similar ideals (like producing or curating the world's information), we too must plan for scalability!

That's all Folks!
Direct all comments to: hussein@cs.uct.ac.za