Big Data Challenges in a Data Center Workflow
Eric A. Kihn, Dan E. Kowal, and John C. Cartwright – NOAA/NESDIS National Geophysical Data Center (NGDC), Boulder, Colorado, USA – January 6, 2015


- U.S. national archive for operational space weather data
- Maintains global ionospheric datasets
- Mirror site for the National Geodetic Survey (NGS) Continuously Operating Reference System (CORS)
- Earth Observations Group (including Nighttime Lights products)
- Solar and Terrestrial Physics Division
- Marine Geology and Geophysics Division
- Data ingest, archive, metadata, and access – our base mission
- Technical expertise: software development, GIS, DBA, web, network
- Provides the backbone that enables science and stewardship
- Information Services Division
- Coastal resiliency
- Hazard warnings and mitigation
- Ocean and coastal charting and mapping
- Exploration of the U.S. outer continental shelf
- Geomagnetic applications

The mission of NOAA's National Geophysical Data Center (NGDC) is to provide long-term scientific data stewardship for the Nation's geophysical data, ensuring quality, integrity, and accessibility. NGDC provides stewardship, products, and services for geophysical data from the Sun to the Earth and from Earth's sea floor and solid-earth environment, including Earth observations from space.

Abstract

As part of its mission, NGDC executes preservation workflows that include ingest, quality control, metadata generation, product generation, and development of access methods for diverse data types. Each phase of proper stewardship involves challenges when it comes to Big Data. This poster looks at Big Data as it interacts with, and is supported by, the stewardship workflow of a national data center. We present tools and techniques and identify the challenges that remain as we continue into the Big Data era.

Big Data Challenges/Solutions

Receive Data: The challenge for the data center is two-fold. First comes providing adequate networking and transmission capacity to the data providers. Because the data center is a hub (i.e., taking in data from many sources), it requires a tremendous network resource. NOAA's data centers benefit from N-Wave, a highly scalable, stable, and secure network built on 10 Gbps Wavelength Division Multiplexed (WDM) fiber-optic links supplied by partners in the national Research and Education (R&E) network community. Additionally, the data center must provide adequate cache for data sets that arrive intermittently (e.g., ship cruise data). Unlike satellite data, whose flow is regular, these data need to be held before processing, depending on downstream capabilities, and metered into the flow.

Archive Data: In the archive area, the Big Data challenge breaks down into striking a cost/access balance and knowing what to retain. The bulk of the NOAA data centers' storage is provided by the Comprehensive Large Array-data Stewardship System (CLASS). The storage is fully replicated and primarily tape-based, with a spinning-disk cache in front for frequently accessed data. Because tape costs have fallen (6.25 TB per LTO-6 tape), the lack of synchronous access is offset by the volume. Typically, demand for archived data falls dramatically after 30 days, and proper use of the front-end cache maintains the balance. In addition, NGDC follows a rigorous "what to archive" process, which ensures that only principal data are archived.
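The disk-cache-in-front-of-tape arrangement and the 30-day drop in demand described above can be illustrated with a small sketch. The Python below is a minimal illustration only, not CLASS or NGDC code; the Granule fields, the evict_candidates helper, and the granule names are hypothetical.

```python
# Minimal sketch (not NGDC/CLASS code) of the disk-cache-in-front-of-tape idea:
# recently requested granules are served from spinning disk, everything else
# from the replicated tape archive. The 30-day threshold mirrors the observed
# drop in demand after ~30 days; all names here are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta

CACHE_RETENTION = timedelta(days=30)  # demand falls off sharply after ~30 days

@dataclass
class Granule:
    granule_id: str
    size_bytes: int
    last_access: datetime   # most recent user request
    on_disk_cache: bool     # True if a copy currently sits on the disk tier

def tier_for(granule: Granule, now: datetime) -> str:
    """Return which tier should serve the next request for this granule."""
    if now - granule.last_access <= CACHE_RETENTION:
        return "disk-cache"   # hot: keep/serve from the spinning-disk tier
    return "tape"             # cold: serve from the replicated tape archive

def evict_candidates(granules, now: datetime):
    """Granules whose cached copies can be dropped to free disk space."""
    return [g for g in granules
            if g.on_disk_cache and tier_for(g, now) == "tape"]

if __name__ == "__main__":
    now = datetime(2015, 1, 6)
    recent = Granule("nightlights_2014_356", 2_500_000_000, now - timedelta(days=3), True)
    stale  = Granule("mgg_cruise_2013_07", 900_000_000, now - timedelta(days=120), True)
    print(tier_for(recent, now))                                            # disk-cache
    print([g.granule_id for g in evict_candidates([recent, stale], now)])   # ['mgg_cruise_2013_07']
```

The eviction step is where the cost/access balance mentioned above is actually enforced: only data still in its high-demand window occupies the expensive disk tier.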
Process Data: The data center has moved to a virtualized server architecture, which will eventually transition to a cloud-based processing infrastructure. The promise of cloud computing for Big Data is hard to ignore. In particular, the ability to repurpose and dynamically recruit servers in a private or hybrid approach is essential to handling large volumes of diverse data. A key consideration is to keep the archive and processing co-located to minimize network traffic.

Create/Manage Metadata: In addition to extracting metadata from the large volumes and preserving it in a standard form, metadata plays a key part in discovery. Because the bulk of the data is stored asynchronously, metadata choices determine what properties can be searched for discovery. Beyond the typical latitude/longitude and time, the metadata extracted on ingest must cover such things as data quality, anomalies, and statistics relevant to a designated user community. NGDC allows for custom metadata extraction on ingest as part of its enterprise ingest system.

Provide Access: A key element of access is to provide sub-setting by data properties to minimize archive traffic. This is enabled through good metadata practices (above) and intelligent interfaces, including GIS, fuzzy-logic-based search tools, and tools developed for specific user communities. A well-designed interface and archive-side extraction can reduce delivered volumes by more than 50% while better serving the community.

Conclusion: Big Data touches the data center workflow at most elements. While the emergence of cloud computing, high-speed networking, and low-cost storage provides relief in some areas, more work remains. The fact that most collected data is never viewed directly by a data manager means that tools such as data mining, neural networks, and expert systems must be brought online to assure data quality, accessibility, and utilization as we continue to see exponential increases in volume and in the number of data sets.

Workflow for a National Data Center

Principal Data Center Workflow Activities

Create/Manage Metadata: Metadata, or more recently "documentation" about data, is a crucial element in stewarding data, particularly for the long term. It is important to document content, origin, accessibility, and other issues surrounding the archived data in order to preserve information content. The IT infrastructure that supports this activity must be robust and adaptable as standards and content models for metadata change over time.

Process Data: A large percentage of the work involved in science data stewardship is processing data. This includes product generation, data quality control, and data repackaging, as well as the metadata updates and generation that can come from this step. Taken as a whole, the process-data steps constitute the core of data center operations and business execution. Not surprisingly, a large percentage of the IT architecture is focused on supporting these activities.

Provide Access: The access function delivers data to the end user in an accessible, reasonable format and often provides additional capabilities for browsing and data manipulation. The data centers invest a tremendous amount of IT resources in the access layer because of the need to support multiple user communities, integrate with other systems, and present a professional face to the community. The number of these systems will likely draw down in the out-years as the data center becomes more integrated with existing enterprise efforts and with efforts within designated user communities on the front end.
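The sub-setting described under Provide Access rests on granule-level metadata captured at ingest (spatial and temporal extent) so that only matching granules leave the archive. Below is a minimal sketch of that idea under assumed structures; GranuleMetadata, subset, and the example catalog are hypothetical and do not represent an actual NGDC interface.

```python
# Minimal sketch (hypothetical names, not an NGDC interface) of metadata-driven
# sub-setting: bounding box and time range recorded at ingest let an access
# service return only the granules that intersect a user's request, instead of
# shipping an entire data set out of the archive.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class GranuleMetadata:
    granule_id: str
    west: float
    east: float
    south: float
    north: float
    start: datetime
    end: datetime
    size_bytes: int

def intersects(g: GranuleMetadata, bbox, t0, t1) -> bool:
    """True if the granule overlaps the requested bounding box and time window."""
    w, s, e, n = bbox
    spatial = g.west <= e and g.east >= w and g.south <= n and g.north >= s
    temporal = g.start <= t1 and g.end >= t0
    return spatial and temporal

def subset(catalog, bbox, t0, t1):
    """Select only the granules a request actually needs."""
    hits = [g for g in catalog if intersects(g, bbox, t0, t1)]
    saved = 1 - sum(g.size_bytes for g in hits) / max(sum(g.size_bytes for g in catalog), 1)
    return hits, saved   # 'saved' = fraction of catalog volume NOT delivered

if __name__ == "__main__":
    catalog = [
        GranuleMetadata("trackline_a", -130, -120, 30, 40,
                        datetime(2014, 6, 1), datetime(2014, 6, 10), 4_000_000_000),
        GranuleMetadata("trackline_b", -80, -70, 20, 30,
                        datetime(2014, 7, 1), datetime(2014, 7, 15), 6_000_000_000),
    ]
    hits, saved = subset(catalog, bbox=(-135, 25, -115, 45),
                         t0=datetime(2014, 6, 1), t1=datetime(2014, 6, 30))
    print([g.granule_id for g in hits], f"{saved:.0%} of volume not delivered")
```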
Big Data Drivers

The Big Data drivers interact with every element of the workflow: Interact with Provider, Receive Data, Archive Data, Process Data, Create/Manage Metadata, and Provide Access.

- NOAA Big Data Initiative Request for Information (a4088a4). Topics raised include creative uses of the data, the U.S. Open Data Policy, equal access on equal terms, a self-sustaining business model, and data handling and curation.
- Public Access to Research Results (PARR) (blic_access_memo_2013.pdf). U.S. Office of Science and Technology Policy (OSTP) memo issued February 22, 2013, "Increasing Access to the Results of Federally Funded Scientific Research." It requires agencies with more than $100M in annual research and development expenditures to ensure, to the greatest extent and with the fewest constraints possible, that the direct results of scientific research are made available to, and useful for, the public, industry, and the scientific community.
- Big Environmental Data Initiative (BEDI) (B _EDM_Virtual_Workshop_Opening_Remarks.pdf). A multi-agency activity coordinated through the U.S. Group on Earth Observations (USGEO) to improve the discoverability, accessibility, and usability of data, focusing on "high value" datasets, e.g. from the OSTP Earth Observations Assessment, the USGCRP National Climate Assessment, and the NOAA Observing Systems of Record.

Principal Data Center Workflow Activities (continued)

Interact with the Provider: Interaction with the provider covers work such as submission agreement development, tracking of new data submissions, and development of Interface Control Documents (ICDs). These interactions are largely manual, as they most often involve person-to-person contact and negotiation. The IT component in this space covers the document management and tracking systems that the data manager uses to share, develop, and manage documents with a controlled team.

Receive Data: Few functions of the data center are more important than the receipt of data. This function ensures that data is accepted from the provider, validated, and taken into the NGDC archive system for proper stewardship. Part of receiving data is proper tracking of the submission packets and notification of the relevant data managers or data management systems.

Archive Data: Archiving data covers the act of storing and preserving the bits that represent the data, without regard for content preservation. Typically this is done using tape, disk, or other media that conform to NARA standards.
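Bit-level preservation of the kind described under Archive Data is commonly backed by fixity information: a checksum recorded for each file when a submission packet is received and re-verified against every stored copy. The sketch below shows that general pattern only; the manifest format, directory paths, and function names are assumptions, not NGDC or CLASS tooling.

```python
# Minimal sketch (hypothetical manifest format, not NGDC/CLASS tooling) of
# bit-level fixity checking: record a checksum for every file in a submission
# packet at ingest, then re-verify copies on any medium (disk, tape staging
# area) to confirm the bits are unchanged.
import hashlib
import json
from pathlib import Path

def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large granules never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(packet_dir: Path) -> dict:
    """Checksum every file in a submission packet at ingest time."""
    return {str(p.relative_to(packet_dir)): sha256(p)
            for p in sorted(packet_dir.rglob("*")) if p.is_file()}

def verify(packet_dir: Path, manifest: dict) -> list:
    """Return the files whose current checksum no longer matches the manifest."""
    return [name for name, recorded in manifest.items()
            if sha256(packet_dir / name) != recorded]

if __name__ == "__main__":
    packet = Path("incoming/cruise_submission_0001")   # hypothetical packet location
    manifest = build_manifest(packet)
    Path("manifest.json").write_text(json.dumps(manifest, indent=2))
    bad = verify(packet, manifest)
    print("fixity OK" if not bad else f"fixity failures: {bad}")
```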