Hussein Suleman University of Cape Town Department of Computer Science Advanced Information Management Laboratory High Performance.

Scalable Open Access
Hussein Suleman (hussein@cs.uct.ac.za)
University of Cape Town, Department of Computer Science
Advanced Information Management Laboratory
High Performance Computing Laboratory
April 2007

Open Access
- What is Open Access? Free online access to electronic resources: research papers, courseware, ETDs, etc.
- Why? Lower costs, empower producers, empower consumers, improve visibility, …
- How?
  - Institutional Repository: online system to manage documents, typically at one institution.
  - Open Access Journal: online system to manage publication and dissemination.
  - Advocacy, Policy, Procedures, Management, …
  - Software Tools: DSpace, EPrints, OJS, …

Very Large Data Collections
Source: http://www2.warwick.ac.uk/fac/sci/physics/research/astro/postgraduate/galplane/

Sizing Open Access
- UCT-CS Publication Archive: average size of a PDF document is 848 KB.
- UCT 2005 Research Report: 4150 research document artefacts.
- Estimated output for one year: 4150 documents, 3.15 GB.
- And this is only published peer-reviewed research as we know it!
- What about theses/dissertations? 1000 documents at 5 MB each, totalling 5 GB.
- What about technical and project reports? 1000 documents at 10 MB each, totalling 10 GB.
- Courseware? 50 MB per course?
- Datasets? Almost infinite!
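The sizing estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 GB = 1,000,000 KB) and treats "848K" as 848 KB; the function and variable names are illustrative only. Under these assumptions the peer-reviewed research figure comes out at roughly 3.52 GB rather than the 3.15 GB quoted on the slide, so the slide's number may rest on different inputs or rounding.

```python
KB_PER_GB = 1_000_000  # assumption: decimal units

def total_gb(documents: int, size_kb: int) -> float:
    """Total storage in GB for `documents` files of `size_kb` kilobytes each."""
    return documents * size_kb / KB_PER_GB

# Document counts and sizes are taken from the slide.
estimates = {
    "peer-reviewed research": total_gb(4150, 848),
    "theses/dissertations": total_gb(1000, 5_000),
    "technical/project reports": total_gb(1000, 10_000),
}

for label, gb in estimates.items():
    print(f"{label}: {gb:.2f} GB")
print(f"total (excluding courseware and datasets): {sum(estimates.values()):.2f} GB")
```

Courseware and datasets are left out because the slide gives no document counts for them.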

Repository Software UnScalability
- Most systems do not scale well beyond small collections.
[Charts: DSpace; EPrints. Source: Technical Evaluation of selected Open Source Repository Systems, Catalyst IT]

Service Provision UnScalability
- UCT-CS Archive:
  - Average of 6.24 user accesses per document per month
  - Average of 18 accesses per document per month
- For 83000 documents:
  - 1.494 million accesses per month
  - 34.58 accesses per minute
- What about search/browse and other services?
- And this is only published peer-reviewed research as we know it!
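The load figures above follow from the rate of 18 accesses per document per month. A minimal sketch of the arithmetic, assuming a 30-day month (the slide does not state the month length, but 30 days reproduces its numbers):

```python
DOCUMENTS = 83_000
ACCESSES_PER_DOC_PER_MONTH = 18
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month

accesses_per_month = DOCUMENTS * ACCESSES_PER_DOC_PER_MONTH
accesses_per_minute = accesses_per_month / MINUTES_PER_MONTH

print(f"{accesses_per_month / 1e6:.3f} million accesses per month")
print(f"{accesses_per_minute:.2f} accesses per minute")
```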

Some Solutions
- Devise completely new algorithms and systems to deal with massive quantities of information.
  - Fedora/SRB/etc., Parallel OAI-PMH, Terascale IR systems, …
- Use computing resources more efficiently, to maximise benefits at minimum cost.
  - Efficient Cluster and Grid computing
- Make the users’ computers do more work.
  - Client-side computation: AJAX
- Make the users do all or most of the work.
  - Web 2.0

Scalable Repositories
- Fedora: digital repository system developed at Cornell, with an API for higher-level services.
- Storage Resource Broker: storage abstraction for large-scale stores, developed at the San Diego Supercomputer Center.
- Grid-based Storage Systems: systems that use Grid computing to store data in a distributed fashion.
- Amazon and Google: third-party providers of storage at a premium.

Parallel Harvesting
- Multiple harvesters cycle through harvest and process operations in parallel.
- Significant benefit when workload is high.
- Parallelism helps even on one machine!
- What about parallel data provision?
[Diagram: OAI Data Provider; Primary Harvester; drone, …; Beowulf cluster or multiprocessor]
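The harvest-and-process cycle described above might be sketched as follows. This is not the presenter's implementation: it uses a thread pool on a single machine, the partition names are hypothetical, and `harvest` and `process` are stand-ins that make no real OAI-PMH requests.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of one data provider (for example, sets or date ranges).
PARTITIONS = ["set-a", "set-b", "set-c", "set-d"]

def harvest(partition: str) -> list[str]:
    """Stand-in for an OAI-PMH request; returns fake record identifiers."""
    return [f"{partition}:record-{i}" for i in range(3)]

def process(records: list[str]) -> int:
    """Stand-in for parsing and indexing; returns the number of records handled."""
    return len(records)

def drone(partition: str) -> int:
    """One drone's cycle: harvest a partition, then process what was harvested."""
    return process(harvest(partition))

def primary_harvester(partitions: list[str], drones: int = 2) -> int:
    """Hand partitions to a pool of drones and add up the records they processed."""
    with ThreadPoolExecutor(max_workers=drones) as pool:
        return sum(pool.map(drone, partitions))

total_records = primary_harvester(PARTITIONS)
print(total_records)
```

Threads suit this sketch because real harvesting is dominated by waiting on the network; on a Beowulf cluster the drones would be separate processes on separate nodes instead.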

High(er) Performance IR for the Rest of Us
- Efficient search engine on a small cluster, a setup more likely in developing countries.
- Nodes do either querying or indexing and can swap roles if needed.
- Reasonably good performance.
- Work is being extended to larger collections and grids. Terascale IR?
[Diagram: Job dispatcher; index nodes and query nodes, …; Beowulf cluster]
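The idea that nodes can swap between indexing and querying could look something like the sketch below. The rebalancing policy, the `Node` class, and the backlog counters are all assumptions made for illustration; the slide does not say how the job dispatcher decides when to swap.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    role: str  # "index" or "query"

def rebalance(nodes: list[Node], pending_index: int, pending_query: int) -> None:
    """Move one node toward the role with the larger backlog per node.

    Hypothetical policy: a role never gives up its last node.
    """
    indexers = [n for n in nodes if n.role == "index"]
    queriers = [n for n in nodes if n.role == "query"]
    index_load = pending_index / max(len(indexers), 1)
    query_load = pending_query / max(len(queriers), 1)
    if query_load > index_load and len(indexers) > 1:
        indexers[0].role = "query"
    elif index_load > query_load and len(queriers) > 1:
        queriers[0].role = "index"

cluster = [Node("n1", "index"), Node("n2", "index"), Node("n3", "query")]
rebalance(cluster, pending_index=2, pending_query=40)  # query backlog dominates
roles = [n.role for n in cluster]
print(roles)
```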

High-Level Component-Based DL Scalability
- Split the DL into components and spread them across a cluster. Services are Web-distributed.
- Make services mobile and create replicas.
- Performance improvement and better use of multiple cheap computers.
- Moving digital archives to grids can deal with service provision scalability. See DILIGENT…
[Diagram: Registry; Resolver; service instances on node1 and node2]
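One way to picture a registry that resolves a service name to one of several replicas is sketched below. The `Registry` class and its round-robin policy are illustrative assumptions, not the design used in the work described; only the node names come from the diagram.

```python
class Registry:
    """Tracks which nodes host a replica of each service (hypothetical design)."""

    def __init__(self) -> None:
        self._instances: dict[str, list[str]] = {}
        self._next: dict[str, int] = {}

    def register(self, service: str, node: str) -> None:
        """Record that `node` runs a replica of `service`."""
        self._instances.setdefault(service, []).append(node)
        self._next.setdefault(service, 0)

    def resolve(self, service: str) -> str:
        """Return the next replica in round-robin order (KeyError if unknown)."""
        nodes = self._instances[service]
        node = nodes[self._next[service] % len(nodes)]
        self._next[service] += 1
        return node

registry = Registry()
registry.register("search", "node1")
registry.register("search", "node2")
picks = [registry.resolve("search") for _ in range(4)]
print(picks)
```

Round-robin is the simplest policy that spreads requests across cheap machines; a real resolver might weight replicas by load instead.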

Web 2.0 – User Contributions
- Make users provide as much information as possible.
- Users managing content means less central management.
- Greater scalability!

Ajax for DL Services
- AJAX supports applications or services within the browser.
- Move computation from server to client.
- Greater scalability!

Scalable Preservation
- XML is often used for preservation, while databases are used for access.
- How do we make XML tools scalable?
- Can we?

Concluding Thoughts
- Open Access and Digital Repositories must consider scalability to be as fundamental as preservation and access.
- Realistically, we have not had major scalability problems yet, but we don’t have many Open Access systems either.
- Google works because scalability is a primary concern. They intend to index the world’s information.
- If we believe in similar ideals (like producing or curating the world’s information), we too must plan for scalability!

That’s all Folks!
Direct all comments to: hussein@cs.uct.ac.za
