Trusted Datagrids: Library of Congress Projects with UCSD Ardys Kozbial – UCSD Libraries David Minor - SDSC.


1 Trusted Datagrids: Library of Congress Projects with UCSD Ardys Kozbial – UCSD Libraries David Minor - SDSC

2 Building Trust in a 3rd Party Repository: A Pilot Project David Minor San Diego Supercomputer Center

3

4

5

6 How can the LC trust someone they can’t control?

7

8 Moving forward in the right direction requires more than fuzzy promises

9 … it takes a combination of experts and tools. Cyberinfrastructure

10 Cyberinfrastructure is the collection of... Resources (computers, data storage, networks, scientific instruments, experts, etc.) + Glue (integrating software, systems, and organizations)

11 “Effective cyberinfrastructure for the humanities and social sciences will allow scholars to focus their intellectual and scholarly energies on the issues that engage them, and to be effective users of new media and new technologies, rather than having to invent them.” - ACLS Commission on Cyberinfrastructure for the Humanities & Social Sciences

12 “The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of Cyberinfrastructure”

13 SDSC... Is one of the original NSF supercomputer centers Supports high performance computing systems Supports data applications for science, engineering, social sciences, cultural heritage institutions Has LARGE data capabilities 3+ PB Disk Storage 25+ PB Tape Storage

14 UCSD Libraries 3.5+ million volumes Digital Access Management System (in development) 250,000+ objects 15+ TB Shared collections with UC California Digital Library Digital Preservation Repository eScholarship repository

15 Partnerships and Collaborations
LC Pilot Project
– Building Trust in a 3rd Party Repository
– Using test image collections/web crawls, ingest content to SDSC repository
– Allow access for content audit
– Track usage of content over time
– Deliver content back to LC at end of project
Library of Congress NDIIPP Chronopolis Program
– Build Production Capable Chronopolis Grid (50 TB x 3)
– Further define transmission packaging for archival communities
– Investigate best network transfer models for I2 and TeraGrid networks
California Digital Library (CDL) Mass Transit Program
– Enable UC System Libraries to transfer mass digitization collections at high speed across CENIC/I2
– Develop transmission packaging for CDL content
UCSD Libraries’ Digital Asset Management System
– RDF System with data managed in SRB at SDSC

16 SDSC DPI Group Digital Preservation Initiatives Group – Charged with Developing and Supporting Digital Preservation Services within the Production Systems Division of SDSC. – http://dpi.sdsc.edu – Cross-Organizational Group SDSC Personnel/UCSD Libraries Personnel – Libraries – Archives – Technology – Information Science

17 Cyberinfrastructure Trust

18 For Example:

19 We worked together to set up high speed data replication services: Checksums; Achieved 200 Mb/s = 2 TB/day; Highly reliable; Internet2
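The "200 Mb/s = 2 TB/day" figure on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming "Mb" means megabits, decimal (SI) units, and a rate sustained for a full 24 hours; the function name is hypothetical and not from the project.

```python
SECONDS_PER_DAY = 86_400


def megabits_per_second_to_terabytes_per_day(mbps: float) -> float:
    """Convert a sustained rate in megabits/s to decimal terabytes/day."""
    bits_per_day = mbps * 1_000_000 * SECONDS_PER_DAY
    bytes_per_day = bits_per_day / 8  # 8 bits per byte
    return bytes_per_day / 1e12       # 1 TB = 10**12 bytes (assumed SI units)


if __name__ == "__main__":
    rate = megabits_per_second_to_terabytes_per_day(200)
    print(f"200 Mb/s sustained is about {rate:.2f} TB/day")
```

Under those assumptions the result is 2.16 TB/day, which matches the slide's rounded figure.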

20 Network setup involved … LC and SDSC staff working together Configurations on networks and computers Resolving different security environments Network monitoring

21 Lessons Learned Networking is hard! Can’t forget it once it’s set up It’s not magic - there’s always a reason It highlights collaborative nature of work

22 Trust Elements Has a long-term solution been found? Have multi-institutional issues been solved? Does new infrastructure improve process? Is the solution useful for other organizations?

23

24 SDSC created a robust storage environment for this data Multiple replications … … at SDSC … and geographically diverse locations

25 (a process with several characteristics) Needed to replicate structure exactly This had to be done for 5+ replications Complex environment had to be transparent Data had to be available for manipulation

26 The Storage Resource Broker provided replication services...

27 ... and extensive monitoring, logging and reporting functions (which led to many conversations)

28 Logging and monitoring procedures Scripts which compared the files within the system with a master list – checked changes on either side … fairly straightforward But … What is the master list and who maintains it? Who decides what is a legitimate change? Do you want a dark archive or an active remote data center?
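A comparison script of the kind described on this slide might look like the sketch below. It is not the project's actual script: it assumes the master list is a mapping of relative path to SHA-256 digest, and the names `scan` and `compare` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root: Path) -> dict:
    """Build a {relative path: digest} listing for every file under root."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(master: dict, current: dict) -> dict:
    """Report changes on either side: missing, unexpected, and altered files."""
    shared = master.keys() & current.keys()
    return {
        "missing": sorted(master.keys() - current.keys()),
        "unexpected": sorted(current.keys() - master.keys()),
        "changed": sorted(k for k in shared if master[k] != current[k]),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text("alpha")
        (root / "b.txt").write_text("beta")
        master = scan(root)                       # snapshot taken as the master list
        (root / "a.txt").unlink()                 # a file disappears
        (root / "b.txt").write_text("changed")    # a file is modified
        (root / "c.txt").write_text("new")        # a file appears
        print(compare(master, scan(root)))
```

The script only reports differences. Whether a difference is a legitimate change, and who maintains the master list, are the policy questions the slide raises.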

29 We tested a new Front-End

30 … and explored an important issue “Reliability” Versus “Accessibility”

31 Lessons Learned Always keep expectations aligned Don’t confuse accessibility and reliability Duplication of structure is complicated Communication highlights communication

32 Trust Elements Can remote data be accessed? Can remote data be retrieved and re-used? Can remote data be verified? Can ownership be clearly defined?

33 SDSC and LC explored a new approach to working with web archives 50,000 ARC files 6 Terabytes of data Short processing time Parallel indexing and display system Looked “default” to the user

34 Using default tools, our initial indexing rate was 1000 files per day … more than 6 weeks of constant computing to index the entire collection. This was over our time budget.

35 We ran 18 parallel indexing instances – reduced processing to a week We modified the Wayback source code to create a new access infrastructure
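The effect of running instances in parallel can be sketched as follows. This is an illustration only: it assumes ideal linear scaling across instances, and `index_file` is a hypothetical stand-in, not the Wayback indexer.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def days_to_index(total_files: int, files_per_day_per_instance: int, instances: int = 1) -> int:
    """Estimate whole days of indexing, assuming perfect linear scaling."""
    return math.ceil(total_files / (files_per_day_per_instance * instances))


def index_file(name: str) -> tuple:
    """Hypothetical stand-in for indexing one ARC file; returns (name, index size)."""
    return name, len(name)


def index_in_parallel(names: list, workers: int) -> dict:
    """Fan the file list out over a fixed pool of workers and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(index_file, names))


if __name__ == "__main__":
    print("1 instance:  ", days_to_index(50_000, 1_000), "days")
    print("18 instances:", days_to_index(50_000, 1_000, 18), "days")
    print(index_in_parallel([f"crawl-{i:03d}.arc" for i in range(4)], workers=2))
```

With the deck's figures (50,000 ARC files at 1000 files per day), one instance needs 50 days and 18 ideal instances need 3. The slide reports about a week, so real scaling was presumably less than linear.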

36 Lessons Learned Sometimes you need to start over Default setup isn’t always easiest Time is a wonderful motivator Experts are often interested in your work

37 Trust Elements Can a new organization bring new expertise? Are the final results the same? Can the results be reached in a better way? Can a new organization work with your partners?

38 Next steps …. Chronopolis!

39 Chronopolis: A Partnership Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries. Initial Chronopolis provider sites include: SDSC and UCSD Libraries at UC San Diego; University of Maryland; National Center for Atmospheric Research (NCAR) in Boulder, CO

40 Institutions and Roles - UCSD SDSC – Storage and networking services – SRB support – Transmission Packaging Modules UCSD Libraries – Metadata services (PREMIS) – DIPs (Dissemination Information Packages) – Other advanced data services as needed

41 Institutions and Roles - NCAR National Center for Atmospheric Research – Archives: Complete copy of all data – Storage and network support – Network testing

42 Institutions and Roles - UMIACS University of Maryland – Institute for Advanced Computer Studies – Archives: Complete copy of all data – Advanced data services PAWN: Producer – Archive Workflow Network in Support of Digital Preservation ACE: Auditing Control Environment to Ensure the Long Term Integrity of Digital Archives – Other advanced data services as needed

43 SDSC Chronopolis Program

44 Chronopolis Vocabulary
Partners – UCSD Libraries, National Center for Atmospheric Research, University of Maryland Institute for Advanced Computer Studies all provide grid-enabled storage nodes for Chronopolis services.
Clients – ICPSR, CDL – contribute content to the Chronopolis preservation network.
SRB – Storage Resource Broker – datagrid software.
iRODS – integrated Rule Oriented Data System – datagrid software.
ACE – Audit Control Environment – part of the ADAPT project at UMD.
PAWN – Producer Archive Workflow Network – part of the ADAPT project at UMD.
INCA – user-level grid monitoring – executes periodic, automated, user-level testing of Grid software and services – grid middleware.
Bagit – Transfer specification developed by CDL and the Library of Congress.
GridFTP – parallel transfer technology – moves large collections within a grid wide-area network.

45 Chronopolis: Inside Linked by main staging grid where data is verified for integrity, and quarantined for security purposes. Collections are independently pulled into each system. Manifest layer provides added security for database management and data integrity validation. Benefits – 3 independently managed copies of the collection – High availability – High reliability
[Diagram labels: Chron Clients (CDL, ICPSR); Push; Pull; SDSC Staging Grid; SDSC Core Center Archive; NCAR; UMD; Copy 1, Copy 2, Copy 3; Manifest Management; MCAT DB; Multiple Hash Verifications; Grid Brick Disks; MCAT; HPSS Tape]
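The idea of checking several independently held copies against one manifest using more than one hash can be sketched as below. This is an assumption-laden illustration, not the Chronopolis manifest layer: the choice of SHA-1 plus SHA-256, the in-memory "copies", and the function names are all hypothetical.

```python
import hashlib

# Assumed algorithms; the slide says only "Multiple Hash Verifications".
ALGORITHMS = ("sha1", "sha256")


def digests(data: bytes) -> dict:
    """Compute every configured hash for one object."""
    return {alg: hashlib.new(alg, data).hexdigest() for alg in ALGORITHMS}


def failed_files(manifest: dict, copy: dict) -> list:
    """List objects in one copy that are absent or fail any of the hashes."""
    failures = []
    for name, expected in manifest.items():
        data = copy.get(name)
        if data is None or digests(data) != expected:
            failures.append(name)
    return sorted(failures)


if __name__ == "__main__":
    originals = {"a.arc": b"first", "b.arc": b"second"}
    manifest = {name: digests(data) for name, data in originals.items()}
    # Three independently held copies; the third has one corrupted object.
    copies = {
        "copy 1": dict(originals),
        "copy 2": dict(originals),
        "copy 3": {"a.arc": b"first", "b.arc": b"sec0nd"},
    }
    for label, copy in copies.items():
        print(label, failed_files(manifest, copy))
```

Keeping the manifest separate from every copy means a damaged copy can be detected and then repaired from one of the other two.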

46 SDSC Leveraged Infrastructure Serves Both HPC & Digital Preservation Archive 25 PB capacity Both HPSS & SAM-QFS Online disk ~3PB total HPC parallel file systems Collections Databases Access Tools Adapted from Richard Moore (SDSC)

47 Chronopolis Demonstration Project Demonstration Project 2006-2007 – Demonstration Collections Ingested within Chronopolis National Virtual Observatory (NVO) – 3 TB Hyperatlas Images (partial collection) Library of Congress PG Image Collection – 600 GB Prokudin-Gorskii Image Collection Interuniversity Consortium for Political and Social Research (ICPSR) – 2TB Web Accessible Data NCAR Observational Data – 3TB Observational Re-Analysis Data

48 NDIIPP Chronopolis Project Creating a 3-node federated data grid at SDSC, NCAR and UMD – up to 50 TB data from CDL and ICPSR Installing and testing a suite of monitoring tools using ACE, PAWN, INCA Creating Appropriate Transmission Information Packages Generating PREMIS definitions for data Writing Best Practices documents for clients and partners

49 Chronopolis Grid Framework [Diagram labels: Sun 6140 62TB (x2); SRB D-Broker; SRB MCAT; Sun SAM-QFS; Apple Xsan; Tape Silos; CDL Server; ICPSR Server; SDSC Network; NCAR Network; UMD/Maryland Network; ICPSR Network; UC Berkeley Network; Chronopolis Data 12-25TB; Chronopolis Data 12TB] Adapted from Bryan Banister (SDSC)

50 NDIIPP Chronopolis Clients-CDL California Digital Library – A part of UCOP, supports the University of California libraries – Providing up to 25TB of data: Web-At-Risk project Five years of political and governmental websites ARC files created from web crawls Using Bagit Transfer Structure
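For readers unfamiliar with the Bagit transfer structure mentioned here, the sketch below lays out a minimal bag: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest. It follows the layout in the published BagIt specification (version 1.0), which may differ from the draft in use at the time of this project; the function names and sample file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def build_bag(payload: dict) -> dict:
    """Return the bag as {relative path: bytes}: declaration, payload, manifest."""
    files = {
        "bagit.txt": b"BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
    }
    manifest_lines = []
    for name in sorted(payload):
        content = payload[name]
        files[f"data/{name}"] = content
        # Each manifest line is "<checksum> <path relative to the bag root>".
        manifest_lines.append(f"{hashlib.sha256(content).hexdigest()}  data/{name}\n")
    files["manifest-sha256.txt"] = "".join(manifest_lines).encode("utf-8")
    return files


def write_bag(bag_dir: Path, files: dict) -> None:
    """Write the bag to disk, creating the data/ directory as needed."""
    for relative_path, content in files.items():
        target = bag_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)


if __name__ == "__main__":
    bag = build_bag({"crawl-001.arc": b"example payload"})
    with tempfile.TemporaryDirectory() as tmp:
        write_bag(Path(tmp) / "example-bag", bag)
        for path in sorted(Path(tmp).rglob("*")):
            print(path.relative_to(tmp))
```

The receiving side can recompute each payload checksum and compare it with the manifest to confirm the transfer arrived intact.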

51 Diagram of CDL Data Transfer [Diagram labels: CDL Virtual Machine at UCB; Wget; Bagit; Wget files 1-10, 11-20; File 1; File n; Bagit Manifest; Possible SRB/Bagit Module; SDSC Network; Chron Staging; Chron Repository; Parallel Wget Xfer; UMIACS; UMIACS Network; NCAR; NCAR Network] Adapted from Bryan Banister (SDSC)

52 NDIIPP Chronopolis Clients-ICPSR Inter-University Consortium for Political and Social Research, University of Michigan – Providing approximately 12 TB of data: Wide variety of types – Already working with SDSC using SRB

53 Diagram of ICPSR Transfer [Diagram labels: ICPSR SRB Repository; UMich; EMC SAN; File 1; File n; Sput/Srsync Files; Sput tar files; SDSC Network; Chron SRB MCAT; Chron Staging; Chron Repository; Parallel Sput/Srsync Xfer; UMIACS; UMIACS Network; NCAR; NCAR Network] Adapted from Bryan Banister (SDSC)

54 Ongoing and Future Initiatives Migration of Chronopolis from SRB to iRODS Develop Interoperability with Community Based Archival Systems/Standards TRAC compliance for SDSC Production Preservation Services/Chronopolis Consortium

55 Looking for Partnerships Repositories interested in moving large digital collections among heterogeneous repository systems. Fedora, DSpace or E-Prints sites interested in managed datagrid storage. Institutions interested in personnel swaps to conduct TRAC audit assessment compliance. Community Needs for Mass-Scale Data Transmission and Storage.

56 Chronopolis Credits
SDSC – Fran Berman, Richard Moore, David Minor, Chris Jordan, Jim D’Aoust, Robert McDonald, Don Sutton, Bryan Banister, Phong Dinh, Jay Dombrowski, Emilio Valente
UCSD Libraries – Brian Schottlaender, Luc Declerck, Ardys Kozbial, Brad Westbrook, Arwen Hutt
NCAR – Don Middleton, Michael Burek, Linda McGinley
UMIACS – Joseph JaJa, Mike Smorul, Mike McGann
Library of Congress – Martha Anderson, Lisa Hoppis
CACI – Mike Ivey

57 http://chronopolis.sdsc.edu

58

59

60

61 Chronopolis is... a geographically distributed preservation environment that supports long-term management and stewardship of digital collections; implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure; technology forecasting and migration in support of long-term life-cycle management of the dedicated preservation environment.

62 Chronopolis focuses on... Assessment of the needs of potential user communities and development of appropriate service models; Development of Memoranda of Understanding (MOUs), Service Level Agreements (SLAs), etc. to formalize trust relationships and manage expectations; Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.; Development of cost and risk models for long-term preservation; Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure

63 The people of Chronopolis are... UCSD Libraries

64 In conclusion … Organizations need ways to validate trust in 3rd parties

65

66 SDSC and the Library of Congress explored one way to do this … by working with Cyberinfrastructure … and demonstrating trust.

67 With a trusted relationship, many journeys become possible

