Databases and Global Environmental Change: Information Technology for Sustainable Development. Gilberto Câmara, INPE (Instituto Nacional de Pesquisas Espaciais). Brazilian Academy of Sciences, Annual Meeting, May 2012
How is the Earth’s environment changing, and what are the consequences for human civilization? The fundamental question of our time (source: IGBP)
Global Change Where are changes taking place? How much change is happening? Who is being impacted by the change?
Limits for Models (source: John Barrow, after David Ruelle). Axes: complexity of the phenomenon vs. uncertainty on basic equations. Fields shown: Solar System Dynamics, Meteorology, Chemical Reactions, Hydrological Models, Particle Physics, Quantum Gravity, Living Systems, Global Change, Social and Economic Systems. e-science
Collaborative e-science: Territory (Geography), Money (Economy), Culture (Anthropology), Modelling (IT). Connect expertise from different fields. Make the different conceptions explicit
Deforestation in Amazonia. Amazonia ( km2 = size of Europe). Map legend: up to 10%, 10 – 20%, 20 – 30%, 30 – 40%, 40 – 50%, 50 – 60%, 60 – 70%, 70 – 80%, 80 – 90%, 90 – 100%
Data (we need a lot of it). Deforestation in Brazilian Amazonia ( ) dropped from 27,000 km² to 6,200 km²
Real-time Deforestation Monitoring: daily warnings of newly deforested large areas
How much does it take to survey Amazonia? TB of data, lines of code, 150 man-years of software development, 200 man-years of interpreters
TerraAmazon – open source software for large-scale land change monitoring. Spatial database (PostgreSQL with vectors and images): 5 million polygons, 500 GB of images
Welcome to the Age of Data-intensive Science! Figure: vantage points (terrestrial; airborne; near-space LEO/MEO with commercial satellites and manned spacecraft; far-space L1/HEO/GEO with TDRSS and commercial satellites) and capabilities (deployable and permanent; aircraft/balloon event tracking and campaigns; forecasts and predictions) serving the user community
Weather and climate (source: WMO): 11,000 land stations (3,000 automated); 900 radiosondes; 3,000 aircraft; 6,000 ships; 1,300 buoys; 5 polar and 6 geostationary satellites
ARGOS Data Collection System (16,000 platforms): 650,000 messages processed daily
Argo buoy network
Data chain in Earth System Science (source: NASA)
Data-intensive Science = principles and applications of information technology for handling very large data sets
Conjectures: IT concepts are essential to global change researchers (but most of them don’t know it). Global change challenges will motivate new research in IT (but most of us are not looking there).
Challenges for data-intensive science: Which data is out there? How to organize big data? How to get the data I need? How to model big data? How to access and use big data?
Stage 1 – A scientist’s personal database: local database, user interface, database creation, analysis, database access. The good: data is close to you (or so you think). The bad: no long-term data preservation; no data sharing.
Stage 2 – A scientific lab database: corporate database, user interface, database creation, analysis, database access. The good: long-term data preservation; data sharing inside the lab; reusable corporate software. The bad: substantial costs on data admin; little outside data sharing.
ECMWF Metview – MOPTC June Metview
Metview: field plotting
Stage 3 – A scientific lab database in the cloud: corporate database, user interface, database creation, analysis, database access. The good: long-term data preservation; shared costs on data admin. The bad: rewrite software for cloud processing; outside data sharing still not solved.
Risk Analysis
On-line data feed. DCP: rain total; fixed time and irregular (alert); point data; one file per DCP. Satellite/Radar: 4 km grid; total rain 1 h, total rain 24 h, current (mm/h); binary file. Models: ETA 40, 20, 5 km; ensemble 40 km; total rain 72 h; 72 files; ASCII grid file.
TerraMA2 - Natural Disasters Monitoring and Alert System
Stage 4 – Multidatabase access: several data sources, modelling, data discovery, data access, analysis, remote analysis. The good: long-term data preservation; shared costs on data admin; access to large external databases. The bad: rewrite software for cloud processing; finding data is a major problem.
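The Stage 4 pattern (discover data across several sources, then run the analysis remotely) can be sketched in plain Python as follows. Everything here is a hypothetical stand-in: the `DataSource` and `Dataset` classes, the archive names and the rainfall values are assumptions made for illustration, not part of any system named in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Dataset:
    name: str
    variable: str
    values: List[float]


@dataclass
class DataSource:
    """Hypothetical stand-in for one remote archive."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)

    def discover(self, variable: str) -> List[str]:
        """Data discovery: names of datasets holding the requested variable."""
        return [d.name for d in self.datasets if d.variable == variable]

    def analyse(self, dataset_name: str, func: Callable[[List[float]], float]) -> float:
        """Remote analysis: run func next to the data, return only the result."""
        for d in self.datasets:
            if d.name == dataset_name:
                return func(d.values)
        raise KeyError(dataset_name)


def discover_all(sources: List[DataSource], variable: str) -> List[Tuple[str, str]]:
    """Ask every source in turn; return (source name, dataset name) pairs."""
    return [(s.name, n) for s in sources for n in s.discover(variable)]


# Illustrative data only.
sources = [
    DataSource("archive_a", [Dataset("rain_2011", "rainfall", [1.0, 3.0]),
                             Dataset("temp_2011", "temperature", [20.0])]),
    DataSource("archive_b", [Dataset("rain_2012", "rainfall", [2.0, 4.0, 6.0])]),
]
hits = discover_all(sources, "rainfall")
peak = sources[1].analyse("rain_2012", max)
print(hits, peak)
```

The design point is that `analyse` ships a function to the source and returns a single number, so the raw values never cross the network.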
Data Access Hitting a Wall. Current science practice is based on data download. How do you download a petabyte? You don’t! Move the software to the archive.
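A back-of-the-envelope sketch of why download-based practice hits a wall. The link speeds are illustrative assumptions, not figures from the talk, and the calculation assumes a decimal petabyte (10^15 bytes) moved over a link that stays at full speed with no protocol overhead.

```python
# Rough transfer time for one petabyte over a sustained network link.
# Link speeds are illustrative assumptions, not measurements.

PETABYTE_BYTES = 10**15  # decimal petabyte
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: int, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link running at full speed."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


for label, bps in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"{label}: {transfer_days(PETABYTE_BYTES, bps):,.1f} days")
```

Under these assumptions a sustained 1 Gbit/s link needs roughly three months for a single petabyte, which is the case for moving the software to the archive instead.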
Scientific Data Management in the Coming Decade (Jim Gray, 2005): Next-generation science instruments and simulations will produce peta-scale datasets. Such peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Database systems will be judged by their support of common metadata standards and by their ability to manage and access peta-scale datasets.
Virtual Observatory: if data is online, the internet is the world’s best telescope. Scientific Data Management in the Coming Decade (Jim Gray)
Where are scientific databases going?
From tables to arrays. Relation (table), e.g. columns name, CPF, job title: relational algebra (selection, projection, join); SQL language: SELECT * FROM images WHERE date="today". Scientific data as arrays: array algebra (spatial queries, math operations); AQL language: SELECT Mean(A.B) FROM Array A.
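To make the table-versus-array contrast concrete, here is a minimal sketch using only the Python standard library. The relational half runs a selection against an in-memory SQLite table; the array half is a plain-Python stand-in for the mean of attribute B over array A, not an actual AQL engine. The table columns other than `date`, the sample rows and the array values are assumptions made for illustration.

```python
import sqlite3
from statistics import mean

# Relational view: image metadata in a table, queried by selection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, date TEXT, sensor TEXT)")
conn.executemany(
    "INSERT INTO images (date, sensor) VALUES (?, ?)",
    [("2012-05-01", "A"), ("2012-05-02", "B"), ("2012-05-02", "A")],  # made-up rows
)
rows = conn.execute(
    "SELECT * FROM images WHERE date = ?", ("2012-05-02",)
).fetchall()

# Array view: each cell (i, j) of a 2-D array carries named attributes.
array_a = [
    [{"B": 0.2}, {"B": 0.4}],
    [{"B": 0.6}, {"B": 0.8}],
]


def array_mean(array, attribute):
    """Mean of one attribute over every cell of a 2-D array of records."""
    return mean(cell[attribute] for row in array for cell in row)


mean_b = array_mean(array_a, "B")
print(len(rows), "matching images; mean of B =", round(mean_b, 3))
```

The selection works on rows that have no position, while the array operation depends on a value existing at every cell index, which is why gridded scientific data fits the array model more naturally.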
Communicating concepts is hard Image source: WMO vulnerability? climate change? poverty?
Communicating concepts is hard. We’re bad at representing meaning: deforestation? degradation? disturbance?
When did the Aral Sea reach the tipping point? Communicating change is very hard
Describing events and processes is very hard When did the flood occur?
Conclusions: Earth System Science data management poses a major challenge for the database community. We need new techniques, architectures and data handling methods to deal with scientific data.