Hussein Suleman University of Cape Town Department of Computer Science Advanced Information Management Laboratory High Performance.


SCALABLE OPEN ACCESS
Hussein Suleman
University of Cape Town, Department of Computer Science
Advanced Information Management Laboratory
High Performance Computing Laboratory
April 2007

Open Access
What is Open Access?
- Free online access to electronic resources: research papers, courseware, ETDs, etc.
Why?
- Lower costs, empower producers, empower consumers, improve visibility, …
How?
- Institutional Repository: online system to manage documents, typically at one institution.
- Open Access Journal: online system to manage publication and dissemination.
- Advocacy, Policy, Procedures, Management, …
- Software Tools: DSpace, EPrints, OJS, …

Very Large Data Collections Source:

Sizing Open Access
- UCT-CS Publication Archive: average size of a PDF document is 848K.
- UCT 2005 Research Report: 4150 research document artefacts.
- Estimated output for one year: 4150 documents, 3.15 GB.
- And this is only published peer-reviewed research as we know it!
- What about theses/dissertations? 1000 documents, 5 MB each, totalling 5 GB.
- What about technical and project reports? 1000 documents, 10 MB each, totalling 10 GB.
- Courseware? 50 MB per course?
- Datasets? Almost infinite!
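The sizing estimate can be reproduced with a few lines of arithmetic. This is a rough sketch using the figures quoted on the slide; the unit conventions (whether "848K" means kilobytes of 1000 or 1024 bytes, and how a GB is counted) are assumptions. With these inputs the research-output total comes to roughly 3.4 to 3.5 GB depending on the convention, a little above the 3.15 GB quoted, so the slide may have used slightly different inputs or rounding.

```python
# Figures taken from the slide; the unit conventions below are assumptions.
DOCS_PER_YEAR = 4150   # research document artefacts (UCT 2005 Research Report)
AVG_PDF_KB = 848       # average PDF size in the UCT-CS archive


def total_gb(count, size_kb, kb_per_gb=1024 * 1024):
    """Total size in GB for `count` items of `size_kb` kilobytes each."""
    return count * size_kb / kb_per_gb


# Research output under two common conventions for "GB".
papers_binary = total_gb(DOCS_PER_YEAR, AVG_PDF_KB)                 # 1 GB = 1024 * 1024 KB
papers_decimal = total_gb(DOCS_PER_YEAR, AVG_PDF_KB, 1000 * 1000)   # 1 GB = 10^6 KB

# Theses and reports, using the slide's own MB figures (1 GB = 1000 MB).
theses_gb = 1000 * 5 / 1000
reports_gb = 1000 * 10 / 1000

print(f"papers: {papers_binary:.2f} GB (binary), {papers_decimal:.2f} GB (decimal)")
print(f"theses: {theses_gb:.0f} GB, reports: {reports_gb:.0f} GB")
```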

Repository Software UnScalability
- Most systems do not scale well beyond small collections.
[Charts: DSpace, EPrints]
Source: Technical Evaluation of selected Open Source Repository Systems, Catalyst IT

Service Provision UnScalability
- UCT-CS Archive:
  Average of 6.24 user accesses per document per month
  Average of 18 accesses per document per month
- For […] documents: […] million accesses per month, […] accesses per minute
- What about search/browse and other services?
- And this is only published peer-reviewed research as we know it!
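The step from a per-document access rate to a load on the server is simple multiplication. The sketch below shows the calculation only; the collection size on this slide did not survive in the transcript, so the 4150 documents used here is borrowed from the sizing slide purely as an illustrative assumption, as is the 30-day month.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def access_load(documents, accesses_per_doc_per_month):
    """Return (accesses per month, accesses per minute) for a collection."""
    monthly = documents * accesses_per_doc_per_month
    return monthly, monthly / MINUTES_PER_MONTH


# 18 accesses per document per month is the slide's figure;
# 4150 documents is an assumed collection size, not the slide's.
monthly, per_minute = access_load(4150, 18)
print(f"{monthly} accesses per month, {per_minute:.2f} per minute")
```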

Some Solutions
- Devise completely new algorithms and systems to deal with massive quantities of information.
  Fedora/SRB/etc., Parallel OAI-PMH, Terascale IR systems, …
- Use computing resources more efficiently, to maximise benefits with minimum cost.
  Efficient Cluster and Grid computing
- Make the users' computer do more work.
  Client-side computation: AJAX
- Make the users do all or most of the work.
  Web 2.0

Scalable Repositories
- Fedora
  Digital repository system developed at Cornell, with an API for higher-level services.
- Storage Resource Broker
  Storage abstraction for large-scale stores, developed at the San Diego Supercomputer Center.
- Grid-based Storage Systems
  Systems that use Grid computing to store data in a distributed fashion.
- Amazon and Google
  Third-party providers of storage at a premium.

Parallel Harvesting
- Multiple harvesters cycle through harvest and process operations in parallel.
- Significant benefit when workload is high.
- Parallelism helps even on one machine!
- What about parallel data provision?
[Diagram: OAI Data Provider, Primary Harvester, drones …, on a Beowulf cluster or multiprocessor]
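The primary-harvester-and-drones arrangement can be sketched with a thread pool. This is a minimal illustration, not the system described in the talk: the provider is a hypothetical in-memory dictionary rather than a real OAI-PMH endpoint, and dividing the work by set is an assumption, since the slide does not say how work is partitioned.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an OAI data provider: record lists keyed by set name.
FAKE_PROVIDER = {
    "set-a": ["rec1", "rec2"],
    "set-b": ["rec3"],
    "set-c": ["rec4", "rec5", "rec6"],
}


def harvest(set_name):
    """Simulate harvesting one partition (a real drone would issue HTTP requests)."""
    return FAKE_PROVIDER[set_name]


def process(records):
    """Simulate post-harvest processing, here just normalising identifiers."""
    return [record.upper() for record in records]


def drone(set_name):
    """One drone runs harvest followed by process for its partition."""
    return process(harvest(set_name))


def primary_harvester(set_names, drones=3):
    """Hand partitions to a pool of drones and merge their results."""
    with ThreadPoolExecutor(max_workers=drones) as pool:
        batches = pool.map(drone, set_names)  # map preserves input order
        return [record for batch in batches for record in batch]


harvested = primary_harvester(sorted(FAKE_PROVIDER))
print(harvested)
```

Threads suit this sketch because real harvesting is dominated by waiting on the network; on a cluster the same structure would use processes or remote workers instead.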

High(er) Performance IR for the Rest of Us
- Efficient search engine on a small cluster, more likely in developing countries.
- Nodes either do querying or indexing and can swap if needed.
- Reasonably good performance.
- Work is being extended for larger collections and grids. Terascale IR?
[Diagram: Job dispatcher; index nodes and query nodes …, on a Beowulf cluster]
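The idea of nodes that either index or query, and swap when needed, can be sketched as a dispatcher that tracks each node's role. Everything here is hypothetical: the class name, the node names, and especially the rebalancing rule (move one node towards the job type with more pending work per node) are assumptions for illustration, not the scheduling policy of the system in the talk.

```python
class Dispatcher:
    """Tracks which cluster nodes index and which answer queries."""

    def __init__(self, roles):
        self.roles = dict(roles)  # node name -> "index" or "query"

    def nodes_for(self, kind):
        """Return the sorted names of nodes currently in the given role."""
        return sorted(name for name, role in self.roles.items() if role == kind)

    def rebalance(self, pending_index, pending_query):
        """Swap one node towards the job type with more pending work per node."""
        index_nodes = self.nodes_for("index")
        query_nodes = self.nodes_for("query")
        index_load = pending_index / max(len(index_nodes), 1)
        query_load = pending_query / max(len(query_nodes), 1)
        # Always leave at least one node in each role.
        if index_load > query_load and len(query_nodes) > 1:
            self.roles[query_nodes[-1]] = "index"
        elif query_load > index_load and len(index_nodes) > 1:
            self.roles[index_nodes[-1]] = "query"

    def assign(self, kind, job_number):
        """Pick a node for a job of the given kind, round-robin by job number."""
        nodes = self.nodes_for(kind)
        return nodes[job_number % len(nodes)]


dispatcher = Dispatcher(
    {"node1": "index", "node2": "index", "node3": "query", "node4": "query"}
)
dispatcher.rebalance(pending_index=2, pending_query=40)  # query backlog dominates
query_nodes = dispatcher.nodes_for("query")
print(query_nodes, dispatcher.assign("query", 4))
```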

High-Level Component-Based DL Scalability
- Split DL into components and spread across cluster. Services are Web-distributed.
- Make services mobile and create replicates.
- Performance improvement and better use of multiple cheap computers.
- Moving digital archives to grids can deal with service provision scalability. See DILIGENT…
[Diagram: Registry, Resolver, instances; node1, node2]
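Replicated services need some way for clients to find an instance. A registry that hands out replicas in round-robin order is one simple possibility, sketched below. The class, the service names, and the round-robin policy are all assumptions for illustration; the slide names a registry and resolver but does not describe how they select an instance.

```python
import itertools


class Registry:
    """Maps a service name to the nodes currently running a replica of it."""

    def __init__(self):
        self._replicas = {}  # service name -> list of node names
        self._cursors = {}   # service name -> round-robin iterator

    def register(self, service, node):
        """Record a new replica and restart the rotation for that service."""
        self._replicas.setdefault(service, []).append(node)
        self._cursors[service] = itertools.cycle(list(self._replicas[service]))

    def resolve(self, service):
        """Return the next replica of the service in round-robin order."""
        return next(self._cursors[service])


registry = Registry()
registry.register("search", "node1")
registry.register("search", "node2")   # a replicate of the search service
registry.register("browse", "node2")

picks = [registry.resolve("search") for _ in range(3)]
print(picks)
```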

Web 2.0 – User Contributions
- Make users provide as much information as possible.
- Users managing content means less central management.
  Greater scalability!

Ajax for DL Services
- AJAX supports applications or services within the browser.
- Move computation from server to client.
  Greater scalability!

Scalable Preservation
- XML is often used for Preservation while Databases are used for Access.
- How do we make XML tools scalable?
- Can we?
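One widely used technique that bears on this question is streaming: processing an XML file one element at a time instead of loading the whole tree into memory. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse`; the `records`/`record`/`title` element names are hypothetical, and this is offered as one partial approach rather than the answer the talk had in mind.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical preservation file; a real one could be far too large to load whole.
SAMPLE = b"""<records>
  <record><title>Paper A</title></record>
  <record><title>Paper B</title></record>
  <record><title>Paper C</title></record>
</records>"""


def stream_titles(source):
    """Yield titles one record at a time instead of building the whole tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            yield elem.findtext("title")
            elem.clear()  # drop the record's children once it has been handled


titles = list(stream_titles(io.BytesIO(SAMPLE)))
print(titles)
```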

Concluding Thoughts
- Open Access and Digital Repositories must treat scalability as being as fundamental as preservation and access.
- Realistically, we have not had major scalability problems yet, but we don't have many Open Access systems either.
- Google works because scalability is a primary concern. They intend to index the world's information.
- If we believe in similar ideals (like producing or curating the world's information), we too must plan for scalability!

That’s all Folks! direct all comments to: