New Value from the DSpace Foundation and Fedora Commons Michele Kimpton and Sandy Payette Executive Directors DuraSpace
Social and Technical Forces (2000-present) Waves of Repository-Enabled Applications Institutional Repositories Digital Collections Digital Libraries Collaborative Spaces and “Web 2.0” Scholarly and Scientific Infrastructure E-Research Data (archiving, linking, sharing)
Implications for our future work more distributed more collaborative more web-oriented more open more interoperable
Emergence of Infrastructure Source: Understanding Infrastructure: Lessons for New Scientific Infrastructure,
Systems: Integrate components; Central control; Dedicated/specialized gateways; More closed; More preconceived
Networks: Integrate systems; Distributed control; Generic gateways; More open; More reconfigurable
Source: Francine Berman, Got Data? A Guide to Data Preservation in the Information Age, December 2008, pages 53 and 55
History: DSpace and Fedora Two open source repository systems –DSpace: End-user application and repository; turnkey system that is easy to use out of the box –Fedora: Web services (repository and supporting services); flexible, modular, and scalable Enabling technology supporting… –scholarship, science, culture, education –open access –preservation and archiving
DSpace and Fedora Installations Largest share of open repositories worldwide … over 700 institutions tracked in our registries Universities Research Centers Libraries Archives Cultural Heritage Government More…
DSpace Foundation and Fedora Commons 501(c)(3) non-profit organizations Common tools Interoperability New tools and services Web APIs Storage Abstraction Architecture Strategy SWORD Deposit MS Word Plug-In DuraSpace Future Joint Offerings Business Strategy Communication/Outreach Progression of Partnership
Goals of Strategic Partnership Stewardship: – Support and align open source development communities for DSpace and Fedora –Keepers of the cause (durability + access) Innovation: –Think beyond existing platforms –New strategic directions for repositories –New products and services Sustainability: –Devise business models that fit our sector –Services that generate revenue for non-profits
What About the Cloud? An emerging architecture in which data and applications reside in cyberspace, allowing users to access via the internet (Pew Internet 9/08) A style of computing where massively scalable IT-related capabilities are provided “as a service” using Internet technologies to multiple external customers. (Gartner, 6/08).
Types of Cloud Services Software as a Service (SaaS) –e.g., Google Apps Cloud Computing –e.g., Amazon Elastic Compute Cloud (EC2) Cloud Storage –e.g., Amazon Simple Storage Service (S3)
Cloud Services
Vision: Federated Repositories and Cyberinfrastructure DuraSpace Heaven
DuraSpace Proposition Trust and durability in the cloud
What have we learned from our users? Focus Groups Site Visits Forums
Problems Tools and processes unproven Limited IT support Capital expenditures limited Task can be overwhelming (replication, migration, emulation, etc.) Preservation important but difficult to implement
Problems Systems not interoperable Heterogeneous applications/platforms Lack of common standards Inelastic compute capability Barriers to making content more accessible and useful to researchers
Advantages – Cloud Services Flexibility Scalability Pay for use Easy to implement Cost
Public cloud providers drive cost down through scale, location and virtualization technology Large data centers (50k+) can achieve 5 to 7 times cost savings over medium data centers (1,000)
Technology* | Cost, medium DC | Cost, large DC
Network | $95 per Mbit/sec/mo | $13 per Mbit/sec/mo
Storage | $2.20 per Gbyte/mo | $0.40 per Gbyte/mo
Admin | 140 servers/admin | >1000 servers/admin
*Hamilton, J Internet-Scale Service Efficiency (Sept 08)
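The "5 to 7 times" claim can be checked directly against the three rows quoted from Hamilton. The sketch below only does that arithmetic; the dictionary keys and the function name are illustrative and not from the source, and the admin figure for large data centers is treated as exactly 1000 even though the slide gives ">1000", so that ratio is a lower bound.

```python
# Figures as quoted on the slide (Hamilton, Sept 08). Key names are illustrative.
MEDIUM_DC = {"network_per_mbit": 95.00, "storage_per_gb": 2.20, "servers_per_admin": 140}
# The slide gives ">1000" servers/admin; 1000 is used here as a lower bound.
LARGE_DC = {"network_per_mbit": 13.00, "storage_per_gb": 0.40, "servers_per_admin": 1000}


def advantage_ratios(medium, large):
    """Return how many times better the large data center is on each row."""
    return {
        # Lower cost is better, so divide the medium cost by the large cost.
        "network": medium["network_per_mbit"] / large["network_per_mbit"],
        "storage": medium["storage_per_gb"] / large["storage_per_gb"],
        # More servers per admin is better, so divide large by medium.
        "admin": large["servers_per_admin"] / medium["servers_per_admin"],
    }


ratios = advantage_ratios(MEDIUM_DC, LARGE_DC)
for row, ratio in ratios.items():
    print(f"{row}: {ratio:.1f}x")
```

With these inputs the ratios come out at roughly 7.3 for network, 5.5 for storage, and at least 7.1 for administration, which is consistent with the range stated on the slide.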
Issues Security Transparency Data lock-in SLAs Trust
DuraSpace Trusted management of and access to durable digital assets in the cloud DuraSpace Mediating Service
DuraSpace- Notional Architecture
Architectural view
Core services-Preservation based Replicate to multiple storage providers Replicate to multiple geographic areas Be able to manage content and services through web based “Dashboard” Includes integrity checking and monitoring “Pay for use” for services and storage
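To make the replication and integrity-checking services concrete, here is a minimal sketch in Python. It is not DuraSpace code: `InMemoryStore`, `replicate`, and `audit` are hypothetical names, the stores are in-memory stand-ins rather than real cloud providers, and SHA-256 is an assumed choice of checksum.

```python
import hashlib


class InMemoryStore:
    """Hypothetical stand-in for one storage provider (not a real cloud API)."""

    def __init__(self, name, region):
        self.name = name
        self.region = region
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]  # raises KeyError if the copy is missing


def checksum(data):
    """SHA-256 digest used as the integrity value for a piece of content."""
    return hashlib.sha256(data).hexdigest()


def replicate(key, data, stores):
    """Write the same content to every store and return its checksum."""
    for store in stores:
        store.put(key, data)
    return checksum(data)


def audit(key, expected, stores):
    """Return the names of stores whose copy is missing or fails the checksum."""
    failed = []
    for store in stores:
        try:
            ok = checksum(store.get(key)) == expected
        except KeyError:
            ok = False
        if not ok:
            failed.append(store.name)
    return failed


stores = [
    InMemoryStore("provider-a", "us-east"),
    InMemoryStore("provider-b", "eu-west"),
]
digest = replicate("thesis.pdf", b"example content", stores)
print(audit("thesis.pdf", digest, stores))  # both copies are intact here

# Simulate silent corruption at one provider, then audit again.
stores[1].put("thesis.pdf", b"corrupted content")
print(audit("thesis.pdf", digest, stores))
```

A monitoring service in this style would run `audit` on a schedule and repair a failed copy from one that still matches the recorded checksum.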
Technology Services Build and run services on top of content stored in the cloud –Search –Aggregation –Streaming –Migration –Hosting Enable others to build services/apps on top of content
Use Cases: DuraSpace with Cloud Storage Online backup for text, images, datasets, video, audio Preservation-Multiple copies, geographies, administrations Temporary or permanent project storage
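The preservation use case asks for multiple copies spread across geographies and administrations. One way to express that as a checkable rule is sketched below; the `Copy` type, the `meets_policy` function, and the default thresholds are assumptions for illustration, not requirements stated in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One stored copy of an item. Field names are illustrative."""
    provider: str  # organization administering the storage
    region: str    # geographic location of the copy


def meets_policy(copies, min_copies=3, min_regions=2, min_providers=2):
    """Check that copies are numerous enough and spread across regions and providers.

    The default thresholds are assumed values, not figures from the source.
    """
    regions = {c.region for c in copies}
    providers = {c.provider for c in copies}
    return (
        len(copies) >= min_copies
        and len(regions) >= min_regions
        and len(providers) >= min_providers
    )


spread_out = [
    Copy("local-store", "us-east"),
    Copy("provider-a", "us-west"),
    Copy("provider-b", "eu-west"),
]
concentrated = [
    Copy("provider-a", "us-east"),
    Copy("provider-a", "us-east"),
    Copy("provider-a", "us-east"),
]

print(meets_policy(spread_out))
print(meets_policy(concentrated))
```

The second set has three copies but fails the check, because all of them sit with one provider in one region.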
Use cases: DuraSpace with Cloud Compute Streaming service for video JPEG2000 image engine Indexing and other processing heavy jobs Staging area for repository ingest Repositories in cloud Data and text mining over open data Aggregation and web 2.0 tools on open content and collections
DuraSpace software Open source - Apache license Open core Run Your Own: Private clouds, University consortia Extensible: Research partners
Critical success factors Ease of use- simplicity Trusted partner for end user Cost effective Scalable/Flexible Can establish key partnerships with service providers Can build community of developers and users
Timeline Identified initial cloud partners; Identified initial pilot partners; Defined initial requirements; Initial open source release: Q; Begin pilot: Fall 2009; Extensions available for repository platforms: Q; Roll out to Repository community: Q; Launch production service: Q2 2010
Initial capabilities Replication, up to three providers (including local store) Web based “Dashboard” Data integrity checking and monitoring Can push content from DSpace/Fedora repository platform Integrated billing Compute capability A few initial compute services TBD
Listen… Sandy and Michele’s DuraSpace webinar