Arctos at the University of Alaska Museum Insect Collection. Derek Sikes (1), Gordon Jarrell (2), Dusty McDonald (1). (1) University of Alaska Museum, Fairbanks, AK.

Presentation transcript:

Arctos at the University of Alaska Museum Insect Collection
Derek Sikes (1), Gordon Jarrell (2), Dusty McDonald (1)
(1) University of Alaska Museum, Fairbanks, AK
(2) Museum of Southwestern Biology, NM
Alaska Entomological Society 5th Annual Meeting, Anchorage, AK, Jan 2012

Major repositories using the Arctos database: (43 collections of specimens or observations, 1.4M records)

In partnership with [logo], which is a member of TeraGrid (a nationwide network of 11 supercomputing facilities), which is sponsored by the U.S. National Science Foundation's Office of Cyberinfrastructure.

Arctos: A 15 year history
- MVZ: Hired Stan Blum to develop relational data model (following modeling by Assoc. Systematic Collections).
- MVZ: Hired John Wieczorek to implement model (desktop application) using Sybase and Versata. Partial implementation (e.g., no loans).
- UAM: John W. migrated mammal data to Oracle, set up Versata.
- UAM: Dusty McDonald replaced Versata with ColdFusion, implemented full model (first web-based instance, aka Arctos).
- MSB: 2003 – Joined Arctos at UAM (first multi-hosting instance).
- MVZ and MCZ: Implemented separate instances of Arctos at Berkeley and Harvard (MVZ: first Postgres, then Oracle).
- MVZ: Moved hosting of data to Alaska (Virtual Private Database version).

ARCTOS
- Specimens (objects): body parts, tissues, containers, etc.
- Images, media (stored at TACC)
- Projects, permits, publications
- Accessions, loans, usage
- Labels, as PDF files
- Agents, agent activity

Diagram labels: Arctos Specimen Catalog (label data and more); Projects contribute and/or use specimens; Accessions; Loans, usage; Publications cite specimens; GenBank; Federated portals; BerkeleyMapper; Media in TeraGrid; The rest of Cyberspace; Citations

BerkeleyMapper & Google Maps, with error circles

Breadth of Data in Arctos
- Fish, amphibians, reptiles, mammals, birds and bird eggs/nests, plants, arthropods, fossils, molluscs AND their parasites
- Specimens and observations
- Media (images, audio, video)
- Publications, fieldnotes

Arctos is constantly evolving to incorporate new kinds of data, e.g.:
- Better representation of non-publication documents (fieldnotes, correspondence)
- Cultural collections (art, anthropology...)

Nearly all that is known about an object (or observation) can be included in Arctos.

Linking specimen records to archival documentation…

1) What is the primary user audience? Large / small museum management? Taxonomic research? Is a dedicated IT person / programmer required? Single vs multi-user? (Annual cost?)
2) GBIF - does the database provide data to GBIF?
3) Barcoding - does the database handle batch processing of specimens using barcodes? ('speed / ease of use')
4) Georeferencing - does it conform to the recommended 'best practices' guide published by GBIF?
5) What is the ease / difficulty of web setup?
6) Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data? Is/are the database server(s) protected from disaster (e.g. floods, fires)?
7) Likes / dislikes & pros / cons
ECN Session – Arthropod Collections Databases

1a) What is the primary user audience? Museums / collections data management (also: observations, Federal collections [USFWS], large private collections associated with a public institution).
1b) Is a dedicated IT person / programmer required? Yes, but the IT staff are shared among all participants.
1c) Single vs multi-user? Multi-user without practical limits.
1d) Annual cost? Negotiated per institution based on size and maintenance needs, currently ranging from $1,300 to $27,000.

2) GBIF - does the database provide data to GBIF? Arctos does this automagically every minute.
3) Barcoding - does the database handle batch processing of specimens using barcodes? ('speed / ease of use') Arctos attaches barcodes to parts. This lets you track things like tissues, extractions, slides and pinned bodies of each cataloged specimen separately.
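The idea of attaching barcodes to parts rather than to whole specimens can be sketched as a small data structure. This is only an illustration: the class names, field names, catalog number and barcode values below are hypothetical and are not taken from the Arctos schema.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """One physical piece of a cataloged specimen (hypothetical model)."""
    name: str      # e.g. "pinned body", "tissue", "slide"
    barcode: str   # barcode carried by this part, not by the specimen


@dataclass
class Specimen:
    catalog_number: str
    parts: list = field(default_factory=list)


def index_by_barcode(specimens):
    """Map each part barcode to (catalog number, part name)."""
    index = {}
    for specimen in specimens:
        for part in specimen.parts:
            if part.barcode in index:
                raise ValueError(f"duplicate barcode: {part.barcode}")
            index[part.barcode] = (specimen.catalog_number, part.name)
    return index


# Made-up example: one specimen whose body and tissue are tracked separately.
beetle = Specimen("EX:Ento:1", [Part("pinned body", "BC0001"),
                                Part("tissue", "BC0002")])
index = index_by_barcode([beetle])
print(index["BC0002"])
```

Scanning a barcode then resolves to one part of one specimen, which is what makes it possible to loan a tissue while the pinned body stays in its drawer.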

4) Georeferencing - does it conform to the recommended 'best practices' guide published by GBIF? Arctos fully supports georeferencing "best practices," in part because the authors of that document and of Arctos' spatial data structure are one and the same (John Wieczorek).
5) What is the ease / difficulty of web setup? Acquire password. Enter data. (Arctos is only available via the web.)
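The "error circles" shown on the maps correspond to storing a locality as a point plus an uncertainty radius. A minimal sketch of that idea follows, assuming a record holds a latitude, a longitude and an uncertainty in metres. The dictionary keys and the example coordinates are illustrative assumptions, not Arctos field names.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_error_circle(record, lat, lon):
    """True if (lat, lon) lies inside the record's uncertainty circle."""
    distance = haversine_m(record["lat"], record["lon"], lat, lon)
    return distance <= record["uncertainty_m"]


# Made-up locality with a 5 km uncertainty radius.
record = {"lat": 64.84, "lon": -147.72, "uncertainty_m": 5000}
print(within_error_circle(record, 64.85, -147.72))  # about 1.1 km away
print(within_error_circle(record, 65.84, -147.72))  # about 111 km away
```

Keeping the radius with the point lets a map draw the circle and lets a query ask whether a site could plausibly be the collecting locality, rather than treating the coordinates as exact.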

Preservation of specimens and their associated data for perpetuity. NSF will help us get our data online, but ensuring they stay online forever is a problem that hasn't been solved.

- 33,090 specimens
- 28 institutions / private collections
- 736 images
- 4,516 bibliographic images
- 428 users

DMNS Arachnology Data: In-house -> NSD -> Crash -> KE EMu

Database errors...

Cabinets: antiquated, wooden, damaged = unsafe

Database: home-made, weak security, mine alone, not online = unsafe

Diagram labels: Arctos Specimen Catalog (label data and more); Projects contribute and/or use specimens; Accessions; Loans, usage; Publications cite specimens; GenBank; Federated portals; BerkeleyMapper; Media in TeraGrid; The rest of Cyberspace; Citations

6) Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data? No.
- Data entry technicians enter data into a staging area.
- Data must be vetted before being loaded by someone with more access privileges.
- All non-select transactions are audited. We can (theoretically) roll back to any point in history, or roll any user's updates back to any point in history. We can re-create all actions by all users.

6) Security - Is/are the database server(s) protected from disaster (e.g. floods, fires)? Yes.
- Running a RAID array
- Backups:
  - continuous logs to a remote NAS
  - local drives
  - Texas Advanced Computing Center
  - San Diego Supercomputer Center

"If we lose all the nightly backups (3 tectonic plates), I'm betting nobody will be overly worried about Arctos data. Or breathing." - D. McDonald

7) Likes / dislikes & pros / cons
DISLIKES:
- Learning curve fairly steep -> back to kindergarten
- Can't customize to my heart's content; each change must be voted on & prioritized by other users
- Web access generally slower than I like (we are all more critical of others than ourselves)
- Only available when networked. Field work in remote areas requires special solutions if data are to be accessed.
- User interface is ~ garish, clunky, industrial (but works)

7) Likes / dislikes & pros / cons
LIKES:
- Rock-solid security; the data will outlive me
- Web-published
- Cutting-edge web integration (mapping, GenBank, etc.)
- No responsibility on my part to maintain backups, software updates, etc. Need only a networked computer
- Arctos programmers & designers are biologists / users who really care about doing it right

6) Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data?

There are multiple roles and partitions at various levels. A data entry technician has write access to exactly one table, the bulkloader. Additionally, one VPD limits his access to his own collection, another limits access to his own rows, and yet another prevents him from marking records to load. In short, he can only un-do anything he's done, and then only in a "staging area" separate from "real" data.

A similar model is used throughout Arctos. We control access at the table and row level, and can easily implement finer-grained control if such becomes necessary. Users (theoretically) get only the rights that they need, and have demonstrated an understanding of, to the data they need, all the while having full access to shared data (like agents).

Data like agents and taxonomy - things where character strings rather than data concepts matter to collections - are trigger-protected based on usage. You can't update an agent name after it's been used as an author, for example. This is pretty basic referential integrity, and Arctos is the only thing that has it.

Data and user rules are all handled by the RDBMS, so we can plug in forms written by other people/projects, offer SQL command-line access, webservices, etc., without worrying too much about security or referential integrity. (Specify, for example, cannot safely support such access as all data and access rules live in the application layer.)

All non-select transactions are audited. We can (theoretically) roll back to any point in history, or roll any user's updates back to any point in history. We can re-create all actions by all users.

In addition to ColdFusion's Application Security, we take full advantage of Oracle security - a breach of one just leads to another layer. Oracle handles things like secondary user access and brute-force password crack attempts.

An independent semi-intelligent (and slightly paranoid) security wrapper watches for malicious behavior and blocks IP access if it detects anything anomalous.
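The staging-area model described above can be illustrated with a toy, in-memory sketch. It is not how Arctos implements it: the slides say Arctos enforces these rules in Oracle through roles and VPD policies, whereas the class, method and role names here are hypothetical and exist only to show the shape of the rules (own rows only, own collection only, no self-approval, every write audited).

```python
class StagingArea:
    """Toy stand-in for a bulkloader-style staging table (illustrative only)."""

    def __init__(self):
        self.rows = []     # staged rows awaiting review
        self.catalog = []  # the "real" data
        self.audit = []    # log of every non-select action

    def enter(self, user, collection, data):
        """Data entry: rows land in staging, never directly in the catalog."""
        row = {"entered_by": user, "collection": collection,
               "data": data, "approved": False}
        self.rows.append(row)
        self.audit.append((user, "insert", data))
        return row

    def visible_rows(self, user, collection):
        """A technician sees only their own rows in their own collection."""
        return [r for r in self.rows
                if r["entered_by"] == user and r["collection"] == collection]

    def approve(self, reviewer, role, row):
        """Only a more privileged role may mark a record to load."""
        if role != "curator":
            raise PermissionError("data entry role cannot mark records to load")
        row["approved"] = True
        self.audit.append((reviewer, "approve", row["data"]))

    def load(self):
        """Move approved rows into the catalog; return how many were loaded."""
        ready = [r for r in self.rows if r["approved"]]
        self.catalog.extend(r["data"] for r in ready)
        self.rows = [r for r in self.rows if not r["approved"]]
        return len(ready)
```

In this sketch the worst a technician can do is damage their own unvetted rows, and the audit list records who did what. A production system would enforce the same rules in the database itself, so they hold regardless of which form or client is used.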

6) Security - Is/are the database server(s) protected from disaster (e.g. floods, fires)?

The server is running a RAID array - we can lose a disk or two and not lose any data (or stop working). Rollback logs are continuously written to a remote NAS (Network-Attached Storage) system.

Daily backups are stored on the local drives, on the NAS, and on tape in GVEA's "bunker." (They won't tell us what or where that is, but your electric bill and medical records are in there and it makes the Department of Homeland Security happy.) Daily backups are also copied to the Texas Advanced Computing Center at Austin (one copy on disk and another on tape) and to tape at the San Diego Supercomputer Center. We may have another copy going to massively redundant disk at the National Center for Supercomputing Applications (University of Illinois at Urbana-Champaign) by the time you get to Reno.

We can recover to the point of failure, or at least to within a couple minutes of it, with one copy of the most recent daily backup and one copy of the rollback logs. (Depending on recent activity, we can usually actually recover from a week-or-so old daily + the rollbacks.) We'll lose <24H of data if we lose all the rollbacks - the server and the NAS. Those are in two buildings, both with serious security, separated by about a hundred yards of gravel parking lot. If we lose all the nightly backups (3 tectonic plates), I'm betting nobody will be overly worried about Arctos data. Or breathing.

There are a couple dozen probes per day - I think it's fairly safe to say that Arctos security has been tested. (Actual attacks are now kind of hard to detect due to the aforementioned paranoid IP killer, which generally shuts them off at the first probe, but we used to get one per week or so.) A big DDoS attack would easily take us down, but (1) we're too boring to attract such a thing, and (2) so what? - those things just eat servers, not data.
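The recovery arithmetic above (a daily backup plus continuously written logs) can be made concrete with a short sketch. It is a simplification under stated assumptions: the logs are taken to be complete from the last backup up to their last write, and the function names and timestamps are invented for illustration.

```python
from datetime import datetime


def recovery_point(failure, last_daily_backup, last_log_write=None):
    """Latest moment the data could be restored to.

    With usable logs we can replay forward from the daily backup to the
    last log write; without them we fall back to the backup itself.
    """
    if last_log_write is not None and last_log_write >= last_daily_backup:
        return min(last_log_write, failure)
    return last_daily_backup


def data_lost(failure, last_daily_backup, last_log_write=None):
    """Span of work that cannot be recovered after a failure."""
    return failure - recovery_point(failure, last_daily_backup, last_log_write)


# Made-up timestamps: nightly backup at 02:00, failure at 15:00.
backup = datetime(2012, 1, 20, 2, 0)
failure = datetime(2012, 1, 20, 15, 0)
last_log = datetime(2012, 1, 20, 14, 58)

print(data_lost(failure, backup, last_log))  # logs survive
print(data_lost(failure, backup))            # logs lost too
```

With the logs intact the loss is the gap since the last log write (two minutes in this example). Without them it is the time since the nightly backup, which stays under 24 hours as long as the backups really do run daily.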

6) Security - Is/are the database server(s) protected from disaster (e.g. floods, fires)?

We've lost a few disks over the years, but never lost data or had a server go down due to it. (We've had lots of downtime, just not equipment-related.) Our biggest threat is probably a disgruntled employee with too much access and a long-term plan, but we could probably (with expensive consultant help) even recover from that, and there's no lack of tools to detect such behavior.

That might all be a little overkill - I'd settle for daily backups on 2 major tectonic plates if absolutely necessary – but I certainly think that you have an obligation to do more than install [database X] on some junker computer and maybe buy a tape drive when you take public money to create or curate digital data. [database X] may be free, but supporting it takes a real commitment in hardware, infrastructure, and expertise that most Universities are poorly equipped to make. I don't know of a single large project that hasn't at some point lost digital data.

- Dusty McDonald, Arctos programmer

Lessons Learned
1) Proprietary software is generally a bad idea unless you have guaranteed, sustained budget for staff and upgrades.
2) Back-ups cannot merely be performed/scripted with the assumption that the job is done.
3) Back-ups should NOT be incremental, MUST be stored offsite, and MUST include separate images of operating system and databases.
4) Restoration from bare metal must be fully documented and periodically performed to verify that the process DOES work.
5) Source code must be in a distributed public repository like GitHub.
- D. Shorthouse
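Lessons 2 and 4 amount to "verify, don't assume." One small piece of such a check is sketched below: comparing a restored file against the original by SHA-256 checksum. This is only an illustration with hypothetical file names. A real restoration test would rebuild the whole system from bare metal and then exercise the database, which a checksum alone cannot show.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original, restored):
    """True only if the restored file is byte-for-byte identical."""
    return sha256_of(original) == sha256_of(restored)


# Demonstration with throwaway files standing in for a dump and its restore.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "dump.sql"
    restored = Path(tmp) / "restored.sql"
    original.write_bytes(b"catalog data")
    restored.write_bytes(b"catalog data")
    intact_ok = verify_restore(original, restored)

    restored.write_bytes(b"catalog dat")  # simulate a truncated restore
    damaged_ok = verify_restore(original, restored)

print(intact_ok, damaged_ok)
```

Run on a schedule against an actual restore, a check like this turns "the backup script ran" into "the backup can be read back."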

University of Connecticut Bird Collection data were found... on a single floppy: 2031 records in a flat file

University of Connecticut Bird Collection data were found... and made available on-line

But... Something with the server setup is not stable.