New ways in Big Data Management for NWP
Dr. Dieter Schröder, Dr. Jürgen Seib and Dr. Jochen Dibbern
Deutscher Wetterdienst, Frankfurter Straße 135, D-63067 Offenbach, Germany
CBS TECO 21. – 22. Nov. 2016 Guangzhou, China

Data Volume – How Big is Big?
  Gigabyte      2^(10*3) bytes
  Terabyte      2^(10*4) bytes
  Petabyte      2^(10*5) bytes
  Exabyte       2^(10*6) bytes
  Zettabyte     2^(10*7) bytes
  Yottabyte     2^(10*8) bytes
  Brontobyte*   2^(10*9) bytes
  Gegobyte*     2^(10*10) bytes
* This terminology is still subject to change.

Why is a new management for NWP data needed?
- Increase of remote sensing data
- Higher resolution of NWP models
- Probabilistic vs. deterministic forecasts
- Multi-variable data analysis in now-casting systems

How fast can we read a Petabyte?
Read speed of a storage disc: 100 megabytes per second (MBps)

  Unit  Bytes                      Seconds        Hours     Months
  MB    1,048,576                  0.01
  GB    1,073,741,824              10.24
  TB    1,099,511,627,776          10,485.76      2.91
  PB    1,125,899,906,842,624      10,737,418.24  2,982.62  > 4

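The figures in the table follow from dividing the data volume by the read rate. A minimal sketch of that arithmetic, assuming binary units (1 MB = 2^20 bytes, as the byte counts in the table imply) and a sustained rate of 100 MB per second; the function and constant names are illustrative only:

```python
# Read time at a sustained sequential rate of 100 MB/s (the slide's assumption).
MB = 2 ** 20  # binary megabyte, matching the byte counts in the table
READ_RATE_BYTES_PER_S = 100 * MB


def read_time_seconds(num_bytes: int) -> float:
    """Seconds needed to read num_bytes at the assumed rate."""
    return num_bytes / READ_RATE_BYTES_PER_S


for name, exponent in [("MB", 20), ("GB", 30), ("TB", 40), ("PB", 50)]:
    seconds = read_time_seconds(2 ** exponent)
    hours = seconds / 3600
    days = seconds / 86400
    print(f"{name}: {seconds:,.2f} s = {hours:,.2f} h = {days:,.1f} days")
```

For one petabyte this gives roughly 2,983 hours, about 124 days, which is where the "more than 4 months" entry comes from.
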
File-based Management
[Diagram. Labels: Application; Grib-Management-System; File-Distribution-System; Main memory; Disc storage; /tmp; Input store; Grib files; Grib field store]

Database-oriented Management
[Diagram. Labels: Database; Application; Grib-Management-System; DB-Mirror with Application (three instances); Main memory; Disc storage; Grib files; Grib field store]

Grib Data Management System
Type of Grib data store:
- File-based
- Database-oriented
Smallest access unit:
- Grib file
- Grib field
- Grib value at grid point

Goals
Database-oriented management of Grib data such that:
- The smallest access unit is the Grib value at a grid point.
- The database is no bigger than the Grib store of a file-based Grib-Management-System.
- Inserting Grib data into the database is no slower than inserting it into a file-based Grib-Management-System.
- Requests for spatial-temporal analytics can be formulated with SQL.
- The database can also store vector data (polygons, lines, points, etc.), so that there is a common store for all types of meteorological data.

Snowflake model
[Diagram: the forecast value with the dimensions grid point, runtime, forecast step, level and ensemble member]

GRIB_DATA table
  Coordinate values: Grid point, Runtime, Forecast step, Level, Ensemble member
  Grib values: Value
[Table: sample rows of GRIB_DATA, e.g. runtimes 2016-11-02 and 2016-11-03]
- Coordinate values have to be stored for each Grib value.
- More storage is needed for the coordinates than for the Grib values.

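The storage argument can be made concrete with a rough per-row estimate. The slide does not state the column types, so the widths below are purely hypothetical (4 bytes per coordinate column and 4 bytes per value); the point is only the ratio between coordinates and value:

```python
# Hypothetical column widths in bytes; the slide does not give the actual types.
COORDINATE_COLUMNS = {
    "grid_point": 4,
    "runtime": 4,
    "forecast_step": 4,
    "level": 4,
    "ensemble_member": 4,
}
VALUE_BYTES = 4  # assumed 32-bit Grib value

coordinate_bytes = sum(COORDINATE_COLUMNS.values())
row_bytes = coordinate_bytes + VALUE_BYTES
coordinate_share = coordinate_bytes / row_bytes

print(f"coordinates: {coordinate_bytes} bytes, value: {VALUE_BYTES} bytes")
print(f"share of each row spent on coordinates: {coordinate_share:.0%}")
```

Under these assumed widths, five coordinate columns take five times the space of the value they describe, before any compression is applied.
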
Revised relational data model
[Diagram: revised schema; the recoverable part is a Grid table with the columns Id (integer) and Point (geometry)]
Queries will be slow with classic row-store database systems.

Row storage
[Diagram: tables in a tablespace, stored row by row. Example query: select column1 from tab]

Column storage
[Diagram: tables in a tablespace, stored column by column. Example query: select column1 from tab]

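The difference between the two layouts for a query such as `select column1 from tab` can be illustrated with flat lists standing in for the storage. This is a toy sketch, not a model of any particular database engine; the table contents are made up for illustration:

```python
# Illustrative table with three columns (column1, column2, column3).
rows = [
    (1, "2016-11-02", 2.5),
    (2, "2016-11-02", 3.1),
    (3, "2016-11-02", 1.8),
]
n_rows, n_cols = len(rows), len(rows[0])

# Row store: all values of one row are adjacent.
row_store = [value for row in rows for value in row]

# Column store: all values of one column are adjacent.
column_store = [value for column in zip(*rows) for value in column]

# Positions that must be touched to answer "select column1 from tab".
row_positions = list(range(0, n_rows * n_cols, n_cols))  # strided access
col_positions = list(range(0, n_rows))                   # one contiguous run

column1_from_rows = [row_store[i] for i in row_positions]
column1_from_cols = [column_store[i] for i in col_positions]

print("row store positions:   ", row_positions)
print("column store positions:", col_positions)
```

In the row layout the wanted values are scattered across the whole table, so every row is visited; in the column layout they form a single contiguous block.
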
PoC Hardware Environment
- Fujitsu Server PRIMERGY RX900 S2 with 8 sockets / 80 cores / 160 threads
- Processor: Intel® Xeon® processor E7-8860 @ 2.27 GHz
- Memory: 2 TB of RAM
- 3298.5 GB of SAN storage from a NetApp filer over Emulex Corporation Saturn: LightPulse Fibre Channel Host Adapter (rev 03) @ 8 Gbit

PoC Dataset
All forecast data of the ensemble prediction system COSMO-DE-EPS for one day
- Forecast range: 27 h / 45 h
- Forecast runs at: 00, 03, 06, 09, 12, 15, 18, 21 UTC
- Ensemble members: 20
- Multi-level parameters: 9 on 50 vertical levels
- Single-level parameters: 101
- Mesh size: 2.8 km (421 * 461 grid points)
- 4498 Grib2 files per day, where each file contains 551 Grib fields
- Size: 928 GB

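The dataset figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that every Grib field holds one value per grid point and that "GB" means 2^30 bytes, as on the data-volume slide; neither assumption is stated for this dataset:

```python
GRID_NX, GRID_NY = 421, 461
MULTI_LEVEL_PARAMS, LEVELS = 9, 50
SINGLE_LEVEL_PARAMS = 101
FILES_PER_DAY = 4498
FIELDS_PER_FILE = 551
DATASET_BYTES = 928 * 2 ** 30  # assumption: GB = 2**30 bytes

grid_points = GRID_NX * GRID_NY

# 9 parameters on 50 levels plus 101 single-level parameters.
fields_per_file_check = MULTI_LEVEL_PARAMS * LEVELS + SINGLE_LEVEL_PARAMS

fields_per_day = FILES_PER_DAY * FIELDS_PER_FILE

# Assumption: each field carries one value for every grid point.
values_per_day = fields_per_day * grid_points
bytes_per_value = DATASET_BYTES / values_per_day

print(f"grid points per field: {grid_points:,}")
print(f"fields per file (from parameter counts): {fields_per_file_check}")
print(f"fields per day: {fields_per_day:,}")
print(f"values per day: {values_per_day:,}")
print(f"average packed size: {bytes_per_value:.2f} bytes per value")
```

The parameter counts reproduce the stated 551 fields per file. Under the stated assumptions, one day amounts to roughly 481 billion grid-point values at about two bytes each in the Grib files.
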
Storage overview in SAP HANA
719 GB

Sample query 1
Get the predicted minimum, maximum and average values of the 2m temperature within a given area.

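The SQL for this query is not preserved in the text, so the following is only a sketch of what such a request could look like. It runs against SQLite rather than SAP HANA, uses a hypothetical `grid` table with plain `lon`/`lat` columns instead of a geometry column, and treats the "given area" as a bounding box; a spatial database would use a spatial predicate on the point geometry instead. The column names of `t_2m` follow sample query 2; all data values are made up:

```python
import sqlite3

RUNTIME = "2016-11-02 00:00"       # illustrative forecast run
FORECASTTIME = "2016-11-02 12:00"  # illustrative valid time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table grid (id integer primary key, lon real, lat real);
    create table t_2m (point_id integer, runtime text,
                       forecasttime text, value real);
""")
# Hypothetical grid points: two inside the area, one outside.
conn.executemany("insert into grid values (?, ?, ?)",
                 [(1, 8.5, 50.0), (2, 8.7, 50.1), (3, 13.4, 52.5)])
# Made-up 2m temperatures in kelvin.
conn.executemany("insert into t_2m values (?, ?, ?, ?)",
                 [(1, RUNTIME, FORECASTTIME, 280.0),
                  (2, RUNTIME, FORECASTTIME, 282.0),
                  (3, RUNTIME, FORECASTTIME, 275.0)])

QUERY = """
    select min(t.value), max(t.value), avg(t.value)
    from t_2m t
    join grid g on g.id = t.point_id
    where g.lon between ? and ?
      and g.lat between ? and ?
      and t.runtime = ?
      and t.forecasttime = ?
"""
# Bounding box (lon_min, lon_max, lat_min, lat_max) standing in for the area.
t_min, t_max, t_avg = conn.execute(
    QUERY, (8.0, 9.0, 49.5, 50.5, RUNTIME, FORECASTTIME)).fetchone()
print(t_min, t_max, t_avg)
```

Only the two points inside the box contribute, so the aggregates are computed over 280.0 and 282.0.
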
Sample query 2
Get the probability at each grid point that it will be warmer than a given threshold value. The calculation should be based on the results of two forecast runs.

    select point_id, 100 * count(value) / 40
    from t_2m
    where (runtime = ? or runtime = ?)
      and forecasttime = ?
      and value > ?
    group by point_id
    order by point_id

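To see how this query behaves, it can be run unchanged against a small in-memory SQLite table. This is a sketch under assumptions: SQLite stands in for the actual database, the `member` column and all values are invented, and the divisor 40 is read as 2 forecast runs times 20 ensemble members:

```python
import sqlite3

MEMBERS = 20
RUNS = ["2016-11-02 00:00", "2016-11-02 03:00"]  # illustrative run times
FORECASTTIME = "2016-11-02 12:00"                # illustrative valid time
THRESHOLD = 285.0                                # kelvin, illustrative


def member_value(point_id: int, member: int) -> float:
    """Made-up 2m temperatures: one mixed, one warm and one cold point."""
    if point_id == 1:
        return 280.0 + 0.5 * member  # members 11..19 exceed the threshold
    if point_id == 2:
        return 290.0 + member        # every member exceeds the threshold
    return 270.0                     # no member exceeds the threshold


conn = sqlite3.connect(":memory:")
conn.execute("""create table t_2m (point_id integer, runtime text,
                forecasttime text, member integer, value real)""")
conn.executemany(
    "insert into t_2m values (?, ?, ?, ?, ?)",
    [(p, run, FORECASTTIME, m, member_value(p, m))
     for p in (1, 2, 3) for run in RUNS for m in range(MEMBERS)])

QUERY = """
    select point_id, 100 * count(value) / 40
    from t_2m
    where (runtime = ? or runtime = ?)
      and forecasttime = ?
      and value > ?
    group by point_id
    order by point_id
"""
probabilities = conn.execute(
    QUERY, (*RUNS, FORECASTTIME, THRESHOLD)).fetchall()
print(probabilities)  # [(1, 45), (2, 100)]
```

Two properties of the query show up here. Grid points where no member exceeds the threshold are filtered out by the `where` clause and are therefore missing from the result rather than reported as 0. In SQLite the expression `100 * count(value) / 40` is an integer division; whether the result is truncated in another database depends on that system's arithmetic rules.
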
Performance Comparison
Assumption: reading one Grib field from disk takes 10 ms.

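The 10 ms assumption gives a simple lower bound for the read time of a file-based approach: the number of Grib fields a request touches, multiplied by 10 ms. The sketch below applies this to two cases; that sample query 2 needs one field per ensemble member and forecast run (2 * 20 = 40 fields) is an assumption, and decoding or seek overheads are ignored:

```python
FIELD_READ_SECONDS = 0.010  # the slide's assumption: 10 ms per Grib field


def file_based_read_seconds(num_fields: int) -> float:
    """Lower bound on read time when each field costs one 10 ms read."""
    return num_fields * FIELD_READ_SECONDS


# Assumption: sample query 2 touches one 2m-temperature field
# per ensemble member (20) and forecast run (2).
query2_fields = 2 * 20

# All fields of the PoC dataset: 4498 files with 551 fields each.
all_fields = 4498 * 551

query2_seconds = file_based_read_seconds(query2_fields)
all_fields_hours = file_based_read_seconds(all_fields) / 3600

print(f"sample query 2: {query2_seconds:.1f} s for {query2_fields} fields")
print(f"whole dataset:  {all_fields_hours:.1f} h for {all_fields:,} fields")
```
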
Summary
- Analysis of Grib fields with a database system is feasible.
- The size of the database will not be bigger than the Grib store of a file-based Grib-Management-System.
- Time for database import of Grib data needs further optimisation.
- Requests for spatial-temporal analytics of both vector and raster data can be formulated with SQL.
- Relocation of meteorological analysis functionality into the database.
- Full advantage of database features, e.g. replication, persistence, query optimisation, parallelisation, concurrent access, geo-spatial extensions, etc.