Big Data Challenges in a Data Center Workflow
Eric A. Kihn, Dan E. Kowal, and John C. Cartwright – NOAA/NESDIS National Geophysical Data Center (NGDC), Boulder, Colorado, USA
January 6, 2015

NGDC Overview
- U.S. national archive for operational space weather data; maintains global ionospheric datasets
- Mirror site for the National Geodetic Survey (NGS) Continuously Operating Reference System (CORS)
- Earth Observations Group (including Nighttime Lights products)
- Divisions: Solar and Terrestrial Physics, Marine Geology and Geophysics, and Information Services
- Data ingest, archive, metadata, and access – our base mission
- Technical expertise in software development, GIS, DBA, web, and networking provides the backbone that enables science and stewardship
- Application areas: coastal resiliency; hazard warnings and mitigation; ocean and coastal charting and mapping; exploration of the U.S. outer continental shelf; geomagnetic applications

Abstract
The mission of NOAA's National Geophysical Data Center (NGDC) is to provide long-term scientific data stewardship for the Nation's geophysical data, ensuring quality, integrity, and accessibility. NGDC provides stewardship, products, and services for geophysical data from our Sun to the Earth, and for Earth's sea floor and solid-earth environment, including Earth observations from space. As part of its mission, NGDC executes preservation workflows that include ingest, quality control, metadata generation, product generation, and development of access methods for diverse data types. Each phase of proper stewardship involves challenges when it comes to Big Data. This poster looks at Big Data as it interacts with, and is supported by, the stewardship workflow of a national data center. We present tools and techniques, and identify the challenges that remain as we continue to march into the Big Data era.

Big Data Challenges/Solutions

Receive Data: The challenge for the data center is two-fold. First comes providing adequate networking/transmission to the data providers. Because the data center is a hub (i.e., taking in data from many sources), it requires a tremendous network resource. NOAA's data centers benefit from N-Wave, a highly scalable, stable, and secure network built using 10 gigabit-per-second Wave Division Multiplexed (WDM) fiber-optic links supplied by partners in the national Research and Education (R&E) network community. Additionally, the data center must provide adequate cache for data sets that arrive intermittently (e.g., ship cruise data). Unlike satellite data, whose flow is regular, these data need to be held before processing and metered into the flow according to downstream capacity (a staging sketch appears below).

Archive Data: In the archive area, the Big Data challenge breaks down into striking a cost/access balance and knowing what to retain. The bulk of the NOAA data centers' storage is provided by the Comprehensive Large Array-data Stewardship System (CLASS). The storage is fully replicated and primarily tape-based, with a spinning-disk cache in front for frequently accessed data. Because tape costs have fallen (6.25 TB per LTO-6 cartridge), the lack of synchronous access is offset by the volume. Typically, demand for archived data falls dramatically after 30 days, and proper use of the front-end cache ensures a proper balance (a tiering sketch also appears below). In addition, NGDC follows a rigorous "what to archive" process (shown below), which assures that only principal data is archived.
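As a rough illustration of the metering idea in the Receive Data paragraph, the sketch below drains a staging cache of intermittently arriving submissions into downstream processing only as capacity allows. This is a minimal sketch under stated assumptions, not NGDC's ingest system: the directory layout, the capacity limit, and the polling loop are hypothetical placeholders.

```python
import shutil
import time
from pathlib import Path

# Hypothetical locations; the actual staging layout is not described on the poster.
STAGING = Path("/cache/staging")        # where intermittent submissions land
PROCESSING = Path("/cache/processing")  # watched by downstream processing jobs,
                                        # which remove a directory when they finish
MAX_INFLIGHT = 4                        # crude stand-in for downstream capacity

def inflight_count() -> int:
    """Number of submissions currently being worked downstream."""
    return sum(1 for p in PROCESSING.iterdir() if p.is_dir())

def meter_submissions(poll_seconds: int = 60) -> None:
    """Hold submissions in the staging cache and release them, oldest first,
    only when downstream capacity is available."""
    while True:
        queued = sorted((p for p in STAGING.iterdir() if p.is_dir()),
                        key=lambda p: p.stat().st_mtime)
        for submission in queued:
            if inflight_count() >= MAX_INFLIGHT:
                break  # downstream is busy; keep holding the rest
            shutil.move(str(submission), str(PROCESSING / submission.name))
        time.sleep(poll_seconds)

if __name__ == "__main__":
    meter_submissions()
```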
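The 30-day demand drop-off described under Archive Data suggests a simple tiering rule: keep recently requested granules on the spinning-disk cache and serve everything else from tape. The sketch below illustrates that rule only; it is not CLASS logic, and the tier names, the fixed 30-day window, and the last-access input are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative 30-day window taken from the demand pattern described above.
CACHE_WINDOW = timedelta(days=30)

def choose_tier(last_access: datetime, now: Optional[datetime] = None) -> str:
    """Return which storage tier a granule should live on, given when it was
    last requested. 'disk' = spinning front-end cache, 'tape' = replicated
    LTO-6 archive copy served asynchronously."""
    now = now or datetime.now(timezone.utc)
    return "disk" if (now - last_access) <= CACHE_WINDOW else "tape"

# Example: a granule ordered 3 days ago stays on the disk cache,
# while one untouched for 90 days is recalled from tape on demand.
recent = datetime.now(timezone.utc) - timedelta(days=3)
stale = datetime.now(timezone.utc) - timedelta(days=90)
print(choose_tier(recent))  # disk
print(choose_tier(stale))   # tape
```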
Process Data: The data center has moved to a virtualized server architecture, which will eventually transition to a cloud-based processing infrastructure. The promise of cloud in Big Data is hard to ignore. In particular, the ability to repurpose and dynamically recruit servers in a private or hybrid approach is essential to handling large volumes of diverse data. A key consideration is to have the archive and processing co-located to minimize network traffic.

Create/Manage Metadata: In addition to extracting metadata from the large volumes and preserving it in a standard form, metadata plays a key part in discovery. Because the bulk of the data is stored asynchronously, metadata choices determine what properties can be searched for discovery. Beyond the typical lat/lon and time, the metadata extracted on ingest must cover such things as data quality, anomalies, and statistics relevant to a designated user community. NGDC allows for custom metadata extraction on ingest as part of its enterprise ingest system (an extraction sketch appears after the workflow activities below).

Provide Access: A key element of access is to provide sub-setting by data properties to minimize archive traffic. This is enabled through good metadata practices (above) and intelligent interfaces, including GIS, fuzzy-logic-based search tools, and the development of tools focused on specific user communities. A well-designed interface and archive extraction can reduce delivered volumes by more than 50% while better serving the community (a subsetting sketch likewise appears below).

Conclusion: Big Data touches the data center workflow at most elements. While the emergence of cloud, high-speed networking, and low-cost storage provides relief in some of those elements, more work remains. The fact that most data collected is never viewed directly by a data manager means that tools such as data mining, neural networks, and expert systems must be brought on-line to assure data quality, accessibility, and utilization as we continue to see exponential increases in volume and in the number of data sets.

Workflow for a National Data Center

Principal Data Center Workflow Activities

Create/Manage Metadata: Metadata, or more recently "documentation," about data is a crucial element in stewarding data, particularly for the long term. It is important to document content, origin, accessibility, and other issues surrounding the archived data in order to preserve information content. The IT infrastructure supporting this activity must be robust and adaptable, as standards and content models for metadata change over time.

Process Data: A large percentage of the work involved in science data stewardship goes into processing data. This work includes product generation, data quality control, and data repackaging, as well as the metadata updates and generation that can come from this step. Taken as a whole, the process-data steps constitute the core of data center operations and business execution. Not surprisingly, a large percentage of the IT architecture is focused on supporting these activities.

Provide Access: The access function provides data to the end user in a format that is accessible and reasonable, and it often provides additional capabilities for browsing and data manipulation. The data centers invest a tremendous amount of IT resources in the access layer because of the need to support multiple user communities, integrate with other systems, and present a professional face to the community. The number of these systems will likely draw down in the out-years as the data center becomes more integrated with existing enterprise efforts and with efforts within designated user communities on the front end.
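Tying back to the Create/Manage Metadata discussion above: the sketch below illustrates the general idea of pulling discovery properties (spatial and temporal bounds plus simple quality statistics) out of a granule at ingest time. It is only an illustration of the approach; the netCDF layout, the variable names, and the use of the netCDF4 and numpy libraries are assumptions, not a depiction of NGDC's enterprise ingest system.

```python
import json

import numpy as np
from netCDF4 import Dataset  # assumed third-party reader for this sketch

def extract_ingest_metadata(granule_path: str) -> dict:
    """Pull basic discovery metadata from a granule at ingest time:
    spatial/temporal bounds plus simple community-relevant statistics.
    Variable names ('lat', 'lon', 'time', 'value') are illustrative."""
    with Dataset(granule_path) as nc:
        lat = nc.variables["lat"][:]
        lon = nc.variables["lon"][:]
        time = nc.variables["time"][:]
        # Fill masked (missing) points with NaN so the statistics below work.
        value = np.ma.filled(nc.variables["value"][:].astype(float), np.nan)

    return {
        "bbox": {
            "west": float(np.min(lon)), "east": float(np.max(lon)),
            "south": float(np.min(lat)), "north": float(np.max(lat)),
        },
        "time_range": [float(np.min(time)), float(np.max(time))],
        # Simple quality indicators relevant to a designated user community:
        "value_mean": float(np.nanmean(value)),
        "value_stddev": float(np.nanstd(value)),
        "missing_fraction": float(np.mean(np.isnan(value))),
    }

if __name__ == "__main__":
    # Hypothetical granule; the resulting record would be preserved in a
    # standard documentation form and indexed for discovery.
    print(json.dumps(extract_ingest_metadata("example_granule.nc"), indent=2))
```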
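For the Provide Access element, the reduction in delivered volume comes from filtering requests against those ingest-time records before anything is pulled from the archive. The sketch below shows one hypothetical way to do that spatial/temporal filtering; the record layout follows the previous sketch with a granule identifier added, and it is not an actual NGDC interface.

```python
from typing import Iterable

def overlaps(record: dict, bbox: dict, time_range: list) -> bool:
    """True if a granule's metadata record intersects the requested
    bounding box and time window."""
    r = record["bbox"]
    spatial = (r["west"] <= bbox["east"] and r["east"] >= bbox["west"] and
               r["south"] <= bbox["north"] and r["north"] >= bbox["south"])
    t_start, t_end = record["time_range"]
    temporal = t_start <= time_range[1] and t_end >= time_range[0]
    return spatial and temporal

def select_granules(catalog: Iterable, bbox: dict, time_range: list) -> list:
    """Resolve an access request to the minimal set of granules to pull
    from the archive, instead of shipping the whole collection."""
    return [rec["granule"] for rec in catalog if overlaps(rec, bbox, time_range)]

# Tiny example catalog; only granules touching the requested box and window
# are retrieved, which is where the large reduction in delivered volume comes from.
catalog = [
    {"granule": "a.nc",
     "bbox": {"west": -110, "east": -100, "south": 35, "north": 45},
     "time_range": [0.0, 10.0]},
    {"granule": "b.nc",
     "bbox": {"west": 10, "east": 20, "south": -5, "north": 5},
     "time_range": [0.0, 10.0]},
]
request_bbox = {"west": -105, "east": -95, "south": 38, "north": 42}
print(select_granules(catalog, request_bbox, [2.0, 8.0]))  # ['a.nc']
```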
Big Data Drivers

(Diagram: the drivers below touch every element of the workflow, from Interact with Provider and Receive Data through Archive Data, Process Data, Create/Manage Metadata, and Provide Access.)

NOAA Big Data Initiative Request For Information (https://www.fbo.gov/index?tabid=c4db11d56506c1aee7155e2b0a4088a4): creative uses, U.S. Open Data Policy, equal access on equal terms, a self-sustaining business model, and data handling and curation.

Public Access to Research Results (PARR) (www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf): U.S. Office of Science and Technology Policy (OSTP) memo issued February 22, 2013, "Increasing Access to the Results of Federally Funded Scientific Research." Requires agencies with $100M in annual research to ensure, to the greatest extent and with the fewest constraints possible, that the direct results of scientific research are made available to, and useful for, the public, industry, and the scientific community.

Big Environmental Data Initiative (BEDI) (https://www.nosc.noaa.gov/EDMC/documents/workshop2013/DLB-0130625_EDM_Virtual_Workshop_Opening_Remarks.pdf): a multi-agency activity coordinated through the U.S. Group on Earth Observations (USGEO) to improve the discoverability, accessibility, and usability of data, focusing on "high value" datasets, e.g., from the OSTP Earth Observations Assessment, the USGCRP National Climate Assessment, and NOAA Observing Systems of Record.

Interact with the Provider: Interaction with the provider covers work like submission agreement development, tracking of new data submissions, and development of Interface Control Documents (ICDs). These interactions tend to be largely manual, as they most often involve person-to-person contact and negotiation. The IT component in this space covers the document management and tracking systems used by the data manager to share, develop, and manage documents with a controlled team.

Receive Data: Few functions of the data center are more important than the receipt of data. This function ensures that data is accepted from the provider, validated, and taken into the NGDC archive system for proper stewardship. Part of the receipt of data is the proper tracking of submission packets and notification of the relevant data managers or data management systems (a validation sketch appears below).

Archive Data: Archiving data covers the act of storage and preservation of the bits representing the data, without regard for content preservation. Typically, this is done using tape, disk, or other media that conform to NARA standards.
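As a companion to the Receive Data and Archive Data activities above, the sketch below shows one common way to validate a submission packet on receipt: verify every file against a provider-supplied checksum manifest and log the result so the responsible data manager or tracking system can be notified. The manifest format, the directory layout, and the logging-based notification are hypothetical; the same checksums could later support fixity checks on the archived bits.

```python
import hashlib
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

def sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a file in streaming fashion."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_submission(packet_dir: Path, manifest_name: str = "manifest.sha256") -> bool:
    """Check every file in a submission packet against a provider-supplied
    manifest of 'digest  relative/path' lines, logging the outcome so the
    relevant data manager or tracking system can be notified."""
    manifest = packet_dir / manifest_name
    ok = True
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        if sha256(packet_dir / rel_path) != expected:
            log.error("checksum mismatch: %s", rel_path)
            ok = False
    log.info("submission %s %s", packet_dir.name,
             "validated" if ok else "FAILED validation")
    return ok

if __name__ == "__main__":
    # Hypothetical packet location; on success the packet would be handed to
    # the archive system, on failure the provider would be contacted.
    validate_submission(Path("/ingest/incoming/cruise_2015_001"))
```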