Trusted Datagrids: Library of Congress Projects with UCSD Ardys Kozbial – UCSD Libraries David Minor - SDSC.

Building Trust in a 3rd Party Repository: A Pilot Project David Minor San Diego Supercomputer Center

How can the LC trust someone they can’t control?

Moving forward in the right direction requires more than fuzzy promises

… it takes a combination of experts and tools. Cyberinfrastructure

Cyberinfrastructure is the collection of... Resources + Glue
– Resources: computers, data storage, networks, scientific instruments, experts, etc.
– Glue: integrating software, systems, and organizations

“Effective cyberinfrastructure for the humanities and social sciences will allow scholars to focus their intellectual and scholarly energies on the issues that engage them, and to be effective users of new media and new technologies, rather than having to invent them.” - ACLS Commission on Cyberinfrastructure for the Humanities & Social Sciences

“The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of Cyberinfrastructure”

SDSC...
– Is one of the original NSF supercomputer centers
– Supports high performance computing systems
– Supports data applications for science, engineering, social sciences, cultural heritage institutions
– Has LARGE data capabilities: 3+ PB disk storage, 25+ PB tape storage

UCSD Libraries 3.5+ million volumes Digital Access Management System (in development) 250,000+ objects 15+ TB Shared collections with UC California Digital Library Digital Preservation Repository eScholarship repository

Partnerships and Collaborations
LC Pilot Project – Building Trust in a 3rd Party Repository
– Using test image collections/web crawls, ingest content to SDSC repository
– Allow access for content audit
– Track usage of content over time
– Deliver content back to LC at end of project
Library of Congress NDIIPP Chronopolis Program
– Build Production Capable Chronopolis Grid (50 TB x 3)
– Further define transmission packaging for archival communities
– Investigate best network transfer models for I2 and TeraGrid networks
California Digital Library (CDL) Mass Transit Program
– Enable UC System Libraries to transfer high-speed mass digitization collections across CENIC/I2
– Develop transmission packaging for CDL content
UCSD Libraries’ Digital Asset Management System
– RDF System with data managed in SRB at SDSC

SDSC DPI Group
Digital Preservation Initiatives Group
– Charged with developing and supporting digital preservation services within the Production Systems Division of SDSC.
Cross-Organizational Group: SDSC Personnel/UCSD Libraries Personnel
– Libraries
– Archives
– Technology
– Information Science

Cyberinfrastructure Trust

For Example:

We worked together to set up high-speed data replication services
– Checksums
– Achieved 200 Mb/s (about 2 TB/day)
– Highly reliable
– Internet2
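The rate-to-volume figure on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 Mb = 10^6 bits, 1 TB = 10^12 bytes) and a link that sustains the rate for a full 24 hours. The function name `daily_volume_tb` is made up for illustration.

```python
def daily_volume_tb(rate_megabits_per_s: float, seconds: int = 86_400) -> float:
    """Convert a sustained rate in megabits/s into decimal terabytes moved per day."""
    bytes_per_second = rate_megabits_per_s * 1_000_000 / 8  # bits -> bytes
    return bytes_per_second * seconds / 1e12                # bytes -> TB


if __name__ == "__main__":
    # 200 Mb/s sustained for one day comes out slightly above 2 TB.
    print(f"{daily_volume_tb(200):.2f} TB/day")
```

Under those assumptions the result lands just over 2 TB per day, consistent with the rounded figure on the slide.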

Network setup involved … LC and SDSC staff working together Configurations on networks and computers Resolving different security environments Network monitoring

Lessons Learned
– Networking is hard!
– Can’t forget it once it’s set up
– It’s not magic - there’s always a reason
– It highlights collaborative nature of work

Trust Elements
– Has a long-term solution been found?
– Have multi-institutional issues been solved?
– Does new infrastructure improve process?
– Is solution useful for other organizations?

SDSC created a robust storage environment for this data: multiple replications … at SDSC … and at geographically diverse locations

(a process with several characteristics)
– Needed to replicate structure exactly
– This had to be done for 5+ replications
– Complex environment had to be transparent
– Data had to be available for manipulation

The Storage Resource Broker provided replication services...

... and extensive monitoring, logging and reporting functions (which led to many conversations)

Logging and monitoring procedures
– Scripts which compared the files within the system with a master list – checked changes on either side … fairly straightforward
But …
– What is the master list and who maintains it?
– Who decides what is a legitimate change?
– Do you want a dark archive or an active remote data center?
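The comparison against a master list can be sketched as follows. This is not the SRB tooling used in the pilot; it is a generic illustration that assumes the master list is a mapping of relative path to SHA-256 digest. The names `scan` and `compare` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root: Path) -> dict[str, str]:
    """Build {relative path: digest} for every regular file under root."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(master: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Report differences on either side of the comparison."""
    shared = master.keys() & observed.keys()
    return {
        "missing": sorted(master.keys() - observed.keys()),     # listed but absent
        "unexpected": sorted(observed.keys() - master.keys()),  # present but unlisted
        "changed": sorted(k for k in shared if master[k] != observed[k]),
    }
```

The code can only report that a file differs. Whether a "changed" entry is corruption or a legitimate update is a policy decision, which is exactly the question the slide raises.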

We tested a new Front-End

… and explored an important issue “Reliability” Versus “Accessibility”

Lessons Learned
– Always keep expectations aligned
– Don’t confuse accessibility and reliability
– Duplication of structure is complicated
– Communication highlights communication

Trust Elements
– Can remote data be accessed?
– Can remote data be retrieved and re-used?
– Can remote data be verified?
– Can ownership be clearly defined?

SDSC and LC explored a new approach to working with web archives
– 50,000 ARC files
– 6 Terabytes of data
– Short processing time
– Parallel indexing and display system
– Looked “default” to the user

Using default tools, our initial indexing rate was 1000 files per day … more than 6 weeks of constant computing to index the entire collection. This was over our time budget.

– We ran 18 parallel indexing instances – reduced processing to a week
– We modified the Wayback source code to create a new access infrastructure
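A back-of-the-envelope estimate shows why parallel instances were needed. The sketch below uses the 50,000-file collection size and the 1000 files/day rate quoted in the slides, and assumes perfectly linear scaling across instances, which real systems rarely achieve. The function name `indexing_days` is illustrative.

```python
def indexing_days(total_files: int, files_per_day: float, instances: int = 1) -> float:
    """Ideal wall-clock days to index a collection, assuming linear scaling."""
    if instances < 1:
        raise ValueError("need at least one indexing instance")
    return total_files / (files_per_day * instances)


if __name__ == "__main__":
    serial = indexing_days(50_000, 1_000)                  # one default instance
    parallel = indexing_days(50_000, 1_000, instances=18)  # 18 instances, ideal case
    print(f"serial:   {serial:.0f} days (~{serial / 7:.1f} weeks)")
    print(f"parallel: {parallel:.1f} days under ideal scaling")
```

The ideal figure is a lower bound. Setup, I/O contention, and uneven file sizes would plausibly account for the gap between it and the roughly one week reported.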

Lessons Learned
– Sometimes you need to start over
– Default setup isn’t always easiest
– Time is a wonderful motivator
– Experts are often interested in your work

Trust Elements
– Can a new organization bring new expertise?
– Are the final results the same?
– Can the results be reached in a better way?
– Can a new organization work with your partners?

Next steps …. Chronopolis!

Chronopolis: A Partnership
Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries. Initial Chronopolis provider sites include:
– SDSC and UCSD Libraries at UC San Diego
– University of Maryland
– National Center for Atmospheric Research (NCAR) in Boulder, CO

Institutions and Roles - UCSD
SDSC
– Storage and networking services
– SRB support
– Transmission Packaging Modules
UCSD Libraries
– Metadata services (PREMIS)
– DIPs (Dissemination Information Packages)
– Other advanced data services as needed

Institutions and Roles - NCAR
National Center for Atmospheric Research
– Archives: Complete copy of all data
– Storage and network support
– Network testing

Institutions and Roles - UMIACS
University of Maryland – Institute for Advanced Computer Studies
– Archives: Complete copy of all data
– Advanced data services: PAWN (Producer – Archive Workflow Network in Support of Digital Preservation); ACE (Auditing Control Environment to Ensure the Long Term Integrity of Digital Archives)
– Other advanced data services as needed

SDSC Chronopolis Program

Chronopolis Vocabulary
– Partners: UCSD Libraries, National Center for Atmospheric Research, and University of Maryland Institute for Advanced Computer Studies all provide grid-enabled storage nodes for Chronopolis services.
– Clients: ICPSR and CDL contribute content to the Chronopolis preservation network.
– SRB: Storage Resource Broker, datagrid software.
– iRODS: integrated Rule Oriented Data System, datagrid software.
– ACE: Audit Control Environment, part of the ADAPT project at UMD.
– PAWN: Producer Archive Workflow Network, part of the ADAPT project at UMD.
– INCA: user-level grid monitoring middleware; executes periodic, automated, user-level testing of Grid software and services.
– BagIt: transfer specification developed by CDL and the Library of Congress.
– GridFTP: parallel transfer technology; moves large collections within a grid wide-area network.

Chronopolis: Inside
Linked by main staging grid where data is verified for integrity, and quarantined for security purposes. Collections are independently pulled into each system. Manifest layer provides added security for database management and data integrity validation.
Benefits
– 3 independently managed copies of the collection
– High availability
– High reliability
[Diagram labels: Chron Clients (CDL, ICPSR), Push, Pull, SDSC Staging Grid, SDSC Core Center Archive, NCAR, UMD, Copy 1, Copy 2, Copy 3, Manifest Management, MCAT DB, Multiple Hash Verifications, Grid Brick Disks, MCAT, HPSS Tape]
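One way to picture "multiple hash verifications" across three independently managed copies is the sketch below. It is an illustration only, not the Chronopolis manifest layer. The choice of SHA-256 plus SHA-512, the majority-vote rule, and the function names are all assumptions.

```python
import hashlib
from collections import Counter

# Assumption: two independent algorithms; the slides do not name the hashes used.
ALGORITHMS = ("sha256", "sha512")


def fingerprint(data: bytes) -> tuple[str, ...]:
    """Digest the same bytes under every configured algorithm."""
    return tuple(hashlib.new(name, data).hexdigest() for name in ALGORITHMS)


def audit(copies: dict[str, bytes]) -> dict[str, bool]:
    """Return, per site, whether its copy agrees with the majority fingerprint."""
    prints = {site: fingerprint(blob) for site, blob in copies.items()}
    majority, votes = Counter(prints.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no majority: cannot tell which copy is intact")
    return {site: fp == majority for site, fp in prints.items()}


if __name__ == "__main__":
    report = audit({"SDSC": b"payload", "NCAR": b"payload", "UMD": b"pay1oad"})
    print(report)  # the UMD copy disagrees with the other two
```

With three copies, a single damaged replica can be identified and repaired from the other two, which is one practical reading of the "3 independently managed copies" benefit.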

SDSC Leveraged Infrastructure Serves Both HPC & Digital Preservation
– Archive: 25 PB capacity, both HPSS & SAM-QFS
– Online disk: ~3 PB total
– HPC parallel file systems, collections, databases, access tools
Adapted from Richard Moore (SDSC)

Chronopolis Demonstration Project
Demonstration collections ingested within Chronopolis:
– National Virtual Observatory (NVO): 3 TB Hyperatlas Images (partial collection)
– Library of Congress PG Image Collection: 600 GB Prokudin-Gorskii Image Collection
– Interuniversity Consortium for Political and Social Research (ICPSR): 2 TB Web Accessible Data
– NCAR Observational Data: 3 TB Observational Re-Analysis Data

NDIIPP Chronopolis Project
– Creating a 3-node federated data grid at SDSC, NCAR and UMD, with up to 50 TB of data from CDL and ICPSR
– Installing and testing a suite of monitoring tools using ACE, PAWN, INCA
– Creating appropriate Transmission Information Packages
– Generating PREMIS definitions for data
– Writing Best Practices documents for clients and partners

Chronopolis Grid Framework
[Diagram labels: CDL Server (UC Berkeley Network), ICPSR Server (ICPSR Network), SDSC Network, NCAR Network, Maryland (UMD) Network, SRB D-Broker, SRB MCAT, Sun SAM-QFS, Apple Xsan, Sun storage, Tape Silos, Chronopolis Data 12-25 TB, Chronopolis Data 12 TB]
Adapted from Bryan Banister (SDSC)

NDIIPP Chronopolis Clients - CDL
California Digital Library
– A part of UCOP, supports the University of California libraries
– Providing up to 25 TB of data from the Web-At-Risk project: five years of political and governmental websites, as ARC files created from web crawls
– Using BagIt transfer structure
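For readers unfamiliar with BagIt, the sketch below writes and verifies a minimal bag: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest. It follows the general layout of the published BagIt specification (RFC 8493) rather than the draft in use at the time of this project. The helper names `make_bag` and `verify_bag` and the sample file name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write payload files under data/ plus bagit.txt and a SHA-256 manifest."""
    lines = []
    for rel_path, content in sorted(payload.items()):
        target = bag_dir / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        lines.append(f"{hashlib.sha256(content).hexdigest()}  data/{rel_path}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_bag(bag_dir: Path) -> list[str]:
    """Return manifest entries whose file is missing or fails its checksum."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file() or hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            failures.append(rel_path)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "crawl-bag"
        make_bag(bag, {"crawl-0001.arc": b"example bytes"})
        print(verify_bag(bag))  # empty list when the bag is intact
```

The receiving site can rerun `verify_bag` after transfer, so integrity checking does not depend on trusting the transport.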

Diagram of CDL Data Transfer
[Diagram labels: CDL Virtual Machine at UCB, Wget, BagIt, Wget files 1-10, File 1 … File n, BagIt Manifest, Possible SRB/BagIt Module, Parallel Wget Xfer, SDSC Network, Chron Staging, Chron Repository, UMIACS, UMIACS Network, NCAR, NCAR Network]
Adapted from Bryan Banister (SDSC)

NDIIPP Chronopolis Clients-ICPSR Inter-University Consortium for Political and Social Research, University of Michigan – of data: Wide variety of types – Already working with SDSC using SRB

Diagram of ICPSR Transfer
[Diagram labels: ICPSR SRB Repository, UMich, EMC SAN, Sput/Srsync Files, Sput tar files, File 1 … File n, Parallel Sput/Srsync Xfer, SDSC Network, Chron SRB MCAT, Chron Staging, Chron Repository, UMIACS, UMIACS Network, NCAR, NCAR Network]
Adapted from Bryan Banister (SDSC)

Ongoing and Future Initiatives
– Migration of Chronopolis from SRB to iRODS
– Develop interoperability with community-based archival systems/standards
– TRAC compliance for SDSC Production Preservation Services/Chronopolis Consortium

Looking for Partnerships
– Repositories interested in moving large digital collections among heterogeneous repository systems.
– Fedora, DSpace or E-Prints sites interested in managed datagrid storage.
– Institutions interested in personnel swaps to conduct TRAC audit assessment compliance.
– Community needs for mass-scale data transmission and storage.

Chronopolis Credits
– SDSC: Fran Berman, Richard Moore, David Minor, Chris Jordan, Jim D’Aoust, Robert McDonald, Don Sutton, Bryan Banister, Phong Dinh, Jay Dombrowski, Emilio Valente
– UCSD Libraries: Brian Schottlaender, Luc Declerck, Ardys Kozbial, Brad Westbrook, Arwen Hutt
– NCAR: Don Middleton, Michael Burek, Linda McGinley
– UMIACS: Joseph JaJa, Mike Smorul, Mike McGann
– Library of Congress: Martha Anderson, Lisa Hoppis
– CACI: Mike Ivey

Chronopolis is...
– a geographically distributed preservation environment that supports long-term management and stewardship of digital collections
– implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure.
– technology forecasting and migration in support of long-term life-cycle management of the dedicated preservation environment.

Chronopolis focuses on...
– Assessment of the needs of potential user communities and development of appropriate service models
– Development of Memoranda of Understanding (MOUs), Service Level Agreements (SLAs), etc. to formalize trust relationships and manage expectations
– Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
– Development of cost and risk models for long-term preservation
– Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure

UCSD Libraries The people of Chronopolis are...

In conclusion … Organizations need ways to validate trust in 3rd parties

SDSC and the Library of Congress explored one way to do this … by working with Cyberinfrastructure … and demonstrating trust.

With a trusted relationship, many journeys become possible