A DISTRIBUTED WORKFLOW DATABASE DESIGNED FOR COREWALL APPLICATIONS
Bill Kamp, Limnological Research Center, University of Minnesota
The Corewall
Overview
The data required for a core interpretation session can be very large: an individual IODP core's data can be in the 10 to 100 gigabyte range. To compound this problem, many users will be interpreting at locations with slow internet connections. Finally, users may be interpreting data from databases that are designed as read-only archives, not to hold the 'works in progress' of investigators. Our goal is to provide a very smart clipboard.
The Data Requirements Demand a Database
–Workflow oriented
–Large throughput
–Internet aware
–Accepts all data types, locally and remotely
–Connects to the Geowall
–Integrates with legacy tools
–And most importantly: transparent, with little or no CWD work by the researcher – automatic, automatic, automatic
Legacy Tools
Core-Log Integration Platform from Lamont-Doherty Earth Observatory (LDEO)
–Splicer: provides interactive depth-shifting of multiple holes of core data to build composite sections
–Sagan: allows the composite sections output by Splicer to be mapped to their true stratigraphic depths, unifying core and log records
Sample Plot
Interfaces
We will provide interfaces that enable the CWD (Computer Workflow Database) to retrieve user-selected data from established databases such as JANUS, LacCore Vault, dbSEABED, and PaleoStrat. We also hope to pull data through emerging portals such as CHRONOS. The result is fast, cached access to multiple data sources.
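As a rough illustration only, one way to structure such interfaces is a small connector abstraction with one implementation per archive. The names below (ArchiveConnector, JanusConnector, fetch) are hypothetical and are not the actual CWD API.

# Sketch of a per-archive interface layer (hypothetical names, not the actual CWD API).
import urllib.request
from abc import ABC, abstractmethod


class ArchiveConnector(ABC):
    """Common interface the CWD uses for JANUS, LacCore Vault, dbSEABED, PaleoStrat, ..."""

    @abstractmethod
    def fetch(self, dataset_id: str) -> bytes:
        """Return the raw bytes of a user-selected data set."""


class JanusConnector(ArchiveConnector):
    """Example connector; real archive-specific query logic would live here."""

    def __init__(self, base_url: str):
        self.base_url = base_url

    def fetch(self, dataset_id: str) -> bytes:
        # Assumes the archive exposes data sets at simple URLs; a real connector
        # would speak that archive's own query protocol behind this same method.
        with urllib.request.urlopen(f"{self.base_url}/{dataset_id}") as resp:
            return resp.read()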
Features
The CWD captures the results of analyses and interpretations. As the workflow is captured, it can be accessed by other collaborators locally or remotely. In a high-bandwidth environment, such as a core lab or a university office, a group of collaborators can track one another's work as they work on the same cores. In a low-bandwidth environment, we cache the data locally upon first access. In a zero-bandwidth environment, the CWD can be copied to a portable mass storage device: all pointers are relative to the location of the CWD.
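The relative-pointer point can be made concrete with a minimal sketch: if every record stores a path relative to the CWD root, the database plus bin cache can be carried on a portable drive and resolved wherever they are mounted. The function and paths below are illustrative, not from the actual implementation.

# Sketch: resolving relative pointers against wherever the CWD currently lives.
from pathlib import Path


def resolve_pointer(cwd_root: str, relative_pointer: str) -> Path:
    """Turn a pointer stored relative to the CWD root into an absolute path."""
    return Path(cwd_root) / relative_pointer


# The same record works on the lab server and on a portable drive:
print(resolve_pointer("/data/cwd", "bin_cache/GLAD4-HVT03-4B-9H-1.BMP"))
print(resolve_pointer("/media/usb0/cwd", "bin_cache/GLAD4-HVT03-4B-9H-1.BMP"))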
Coordinate Systems
We support co-registration across coordinate systems, e.g. wire length, geologic boundary, and/or geologic age, using the standard algorithms from Sagan and Splicer. We intend to take advantage of existing technologies such as the Storage Resource Broker and Metadata Catalog [SRBMDC] to facilitate locating replicated data sets. We will use SESAR identifiers to uniquely and automatically identify the sample, the author, and the experiment when the data is loaded.
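A minimal sketch of co-registration, assuming a table of tie points between two coordinate systems and simple piecewise-linear interpolation between them; this is only an illustration, not the actual Sagan/Splicer algorithms.

# Sketch: piecewise-linear mapping between two depth coordinate systems
# (e.g. wire length -> composite depth) from a list of tie points.
# Illustrative only; Sagan and Splicer implement the real algorithms.
import bisect


def co_register(value: float, ties: list) -> float:
    """Map `value` from the source system to the target system using tie points.

    `ties` is a list of (source, target) pairs sorted by source; values outside
    the tie-point range are extrapolated from the nearest segment.
    """
    sources = [s for s, _ in ties]
    i = bisect.bisect_left(sources, value)
    i = min(max(i, 1), len(ties) - 1)          # pick a bracketing segment
    (s0, t0), (s1, t1) = ties[i - 1], ties[i]
    return t0 + (value - s0) * (t1 - t0) / (s1 - s0)


# Example: wire length 12.5 m mapped through three tie points.
print(co_register(12.5, [(0.0, 0.0), (10.0, 10.4), (20.0, 21.1)]))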
Database Design
The paradigm for the metadata is:
–Author
–Experiment
–Raw Data
–Presentation
Data type is missing from the paradigm: we support all MIME data types.
–XML and text are stored in the database
–All other data is stored in the Bin Cache
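To make the paradigm concrete, here is a small sketch of a metadata record and the MIME-type routing rule described above; the class and field names are illustrative, not the actual CWD schema.

# Sketch of the metadata paradigm and MIME-type routing (illustrative names only).
from dataclasses import dataclass


@dataclass
class MetadataRecord:
    author: str          # Author
    experiment: str      # Experiment
    raw_data_ref: str    # Raw Data: a URL or a bin-cache pointer
    presentation: str    # Presentation: the HTML page that shows the data
    mime_type: str       # any MIME type is accepted


def storage_target(mime_type: str) -> str:
    """XML and text go into the database itself; everything else into the Bin Cache."""
    if mime_type.startswith("text/") or mime_type.endswith("xml"):
        return "database"
    return "bin_cache"


print(storage_target("text/csv"))         # database
print(storage_target("image/bmp"))        # bin_cache
print(storage_target("application/xml"))  # database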
The Data Diagram
Caches
Uploading requires a caching system:
–Upload Cache: accessed directly, or via FTP or HTTP upload
–Archive Cache: all data is stored in raw form in a permanent archive
–Staging: a temporary holding place for data while it is examined and transformed
–Bin Cache: the location of the binary data managed by the database
The complete uploading process, including automatic recognition of the data type, is available as a single script called ForceUpload.
–It is the best option when you have multiple data sets of the same data type.
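The sketch below walks one file through the four caches in the order described above, with a simple automatic type guess. It is only an illustration of the flow under those assumptions, not the actual ForceUpload script.

# Sketch of the upload flow through the caches (not the actual ForceUpload script).
import mimetypes
import shutil
from pathlib import Path


def upload(source_file: str, cwd_root: str = "cwd") -> dict:
    """Move one file: Upload Cache -> Archive Cache (raw copy) -> Staging -> Bin Cache."""
    root = Path(cwd_root)
    for cache in ("upload_cache", "archive_cache", "staging", "bin_cache"):
        (root / cache).mkdir(parents=True, exist_ok=True)

    name = Path(source_file).name
    uploaded = root / "upload_cache" / name
    shutil.copy(source_file, uploaded)                    # arrives directly or via FTP/HTTP
    shutil.copy(uploaded, root / "archive_cache" / name)  # permanent raw copy
    staged = root / "staging" / name                      # examined and transformed here
    shutil.copy(uploaded, staged)
    mime, _ = mimetypes.guess_type(name)                  # automatic type recognition
    final = root / "bin_cache" / name
    shutil.move(str(staged), str(final))                  # binary now managed by the database
    return {"path": str(final.relative_to(root)),
            "mime_type": mime or "application/octet-stream"}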
Data Access
All raw data is available via URLs. The author has the option of refining the automatically generated presentation, i.e. the HTML page that shows the data. Presentations can be dynamically built using database data; tools are provided. If data is not local, it is transferred to the local bin cache and the CWD is updated. If you are not on the internet, you need to bring the database (small) and the bin cache with you.
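As an illustration of an automatically generated presentation, the sketch below builds a bare-bones HTML page from a metadata record. The field names, markup, and URL are hypothetical; the real tools generate richer pages that the author can then refine.

# Sketch: building a default HTML presentation page from database metadata
# (hypothetical field names; the real tools produce pages the author can refine).
import html


def default_presentation(record: dict) -> str:
    rows = "".join(
        f"<tr><th>{html.escape(str(k))}</th><td>{html.escape(str(v))}</td></tr>"
        for k, v in record.items()
    )
    return (
        "<html><body>"
        f"<h1>{html.escape(record.get('experiment', 'Data set'))}</h1>"
        f"<table>{rows}</table>"
        f"<p><a href=\"{html.escape(record['raw_data_url'])}\">raw data</a></p>"
        "</body></html>"
    )


page = default_presentation({
    "author": "Kamp",
    "experiment": "GLAD4-HVT03-4B-9H-1",
    "raw_data_url": "http://example.org/bin_cache/GLAD4-HVT03-4B-9H-1.BMP",
    "mime_type": "image/bmp",
})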
Sample Presentations
–readme.txt.html
–cwilocs.zip.html
–logo.bmp.html
–kamp_1218c_021x_07.jpg.html
–1.7.MOLE-JUAN03-1A.Geotek.and.L-a-b.data.xls.html
–GLAD4-HVT03-4B-9H-1.BMP.html
–GLAD4-HVT03-4C-1H-1.BMP.html
–7.93.GLAD4-HVT03-4B-1H-1.BMP.html
Replication
The database is replicated to multiple sites on the internet automatically via TCP/IP; this is a MySQL feature. The URL of the data is sent to the replicated database. If, upon first access, the data is not local, it is fetched to the bin cache via its URL, and the pointers in the local CWD are updated. Currently we have a parent-child relationship: all data is first uploaded to the main CWD. When we complete the integration of SESAR identifiers, the design will support peer-to-peer relationships.
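A minimal sketch of the first-access behavior at a replicated site: only the URL travels with the replicated row, the binary itself is pulled into the local bin cache on demand, and the local pointer is then filled in. The record structure and function name are assumptions for illustration.

# Sketch: lazy fetch on first access at a replicated site (illustrative only).
import urllib.request
from pathlib import Path


def open_data(record: dict, bin_cache: str = "bin_cache") -> Path:
    """Return a local path for a record, fetching it by URL on first access."""
    if record.get("local_path"):                 # already in this site's bin cache
        return Path(record["local_path"])

    Path(bin_cache).mkdir(exist_ok=True)
    local = Path(bin_cache) / record["url"].rsplit("/", 1)[-1]
    with urllib.request.urlopen(record["url"]) as resp:
        local.write_bytes(resp.read())

    record["local_path"] = str(local)            # update the pointer in the local CWD
    # (in the real system this would be an UPDATE against the MySQL replica)
    return local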
Database Access
–Data is uploaded via a web site
–Data is pulled out of the CWD via Corewall
–Data will automatically cross-load to other databases, such as CHRONOS, when there is a metadata match
–The latter will be enforced via XSLTs
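The XSLT idea can be sketched as follows: when a CWD record's metadata matches what a target database such as CHRONOS expects, a stylesheet rewrites the record into the target's format before cross-loading. The element names, the stylesheet, and the use of the lxml package here are assumptions for illustration.

# Sketch: transforming CWD metadata into a target database's format with an XSLT
# (element names and stylesheet are illustrative; assumes the lxml package).
from lxml import etree

CWD_RECORD = """<record>
  <author>Kamp</author>
  <experiment>GLAD4-HVT03-4B</experiment>
</record>"""

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/record">
    <sample>
      <investigator><xsl:value-of select="author"/></investigator>
      <name><xsl:value-of select="experiment"/></name>
    </sample>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.XML(STYLESHEET.encode()))
result = transform(etree.XML(CWD_RECORD.encode()))
print(etree.tostring(result, pretty_print=True).decode())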
Current State
Test versions are on the web:
–Currently at
–Soon to be at
–Documented at
Currently holds 10 GByte of test data