
1 Compilation and Design of a Functioning Distributed Database of North American Electric Generating Emissions
Demonstration of a Distributed Emissions Inventory using Web Technologies
Stefan Falke, Center for Air Pollution Impact and Trend Analysis, Washington University in St. Louis
Gregory Stella, Alpine Geophysics, LLC
Terry Keating, US EPA – Office of Air & Radiation

2 Background
Air pollutant emission inventories for the US, Canada, and Mexico are compiled, stored, and disseminated using different methods.
The development of a single comprehensive and accurate emissions inventory is essential for the coordinated reporting, policy development, transport analyses, and socio-economic studies that create an environment for collaboration among international researchers, policy-makers, and the interested public.
In support of this longer term goal, the Commission for Environmental Cooperation (CEC) and the US EPA have initiated a project to develop a prototype web tool for enabling uniform access to distributed emissions data from North American electricity generating power plants.

3 Distributed Data and Management Networks
Advances in information science and technology are driving the trend toward distributed networks and virtual communities for science and management.
► Cyberinfrastructure: NSF’s initiative to apply new IT to building new ways of conducting collaborative research
► Earth Observation Summit: International effort to build comprehensive, coordinated, and sustained Earth observation systems
► Ecoinformatics: EPA’s vision for national and international cooperation in data and technology development
► Integrated Ocean Observing System: International network of ocean related monitoring, assessment, and communication
► Linked Environments for Atmospheric Discovery: Network of high-performance computers and software to gain new insights into weather
► Virtual Observatory: Network for astronomical data sharing and distributed analysis

4 Emissions Community Collaborative Activities
NIF Data standards
► Standard format and submission
► NEI XML schema
Environmental Information Exchange Network
► Network linking EPA, States, and other partners through the Internet and standardized data formats
Facility Registry System
► Standard facility codes and locations
Data Sharing Efforts
► States, Tribes, Local agencies, RPOs
► North America

5 Networked Environmental Information System for Global Emissions Inventories (NEISGEI)
…is both a conceptual framework and implementation effort for the development of a fully integrated, distributed air emissions inventory – and the foundation for an all-media environmental information network
– Tie together data at all spatial and temporal scales using emerging distributed database technologies
– Provide shared, online tools for processing and analysis
– Provide for the seamless merging, manipulation and analysis of Internet accessible air quality-relevant data through the development of emerging Internet-oriented technologies
– Make use of existing resources – link and partner with other efforts
– Build a broad-based air quality user community: scientists, regulators, policy analysts and the public
– Create the network and toolkit piece-wise through multiple, connected projects
Ongoing Efforts: NSF-EPA Digital Government Funded Projects:
– The California Air Resources Network
– Fire & Air Quality Data and Tools Network
Future Effort: EPA OAR RFA on Distributed Air Quality Data in Support of NEISGEI

6 CAREN: The California Air Resources Network
Eduard Hovy, Jose-Luis Ambite, Andrew Philpot, USC Information Sciences Institute
Environmental data sharing among international, national, state and local governments, the public and academic and other non-governmental research organizations is a difficult challenge.
► Barrier: Technological incompatibilities
► Barrier: Data format incompatibilities
► Barrier: Financial (staff time) limitations
The Solution Strategy (First Step): Automate the integration of heterogeneous databases. Use semi-automated information integration methods to generate translation protocols between related information sources, e.g. AQMD and CARB.
[Diagram labels: US EPA; RPOs; AQMDs; Municipalities; Tribes; States; ???]

7 Fire and Air Quality Data Network
► Data ‘wrappers’ are used to translate the format of data sets into a uniform format.
► The data are either stored on the CAPITA database server or dynamically accessed from their original source.
► The datasets are registered with metadata in the data catalog.
► GIS-type interfaces provide users with ways to view, analyze, and export the data.
[Diagram labels: Data Catalog; FS Coarse Spatial Data; NEI Fire Emissions; BLM Fire History; Wildland Fire Assessment System; HMS Fire Detection; Spatial Interpolation Service; VIEWS; CAPITA; Data Wrappers; Data Sources: ftp, text tables, RDBMS]

8 Envisioned Emissions Community Resource of Data & Tools
[Diagram labels: XML; GIS; RDBMS; Wrappers; Data Catalogs; Emissions Inventory Catalog; Geospatial One-Stop; Emissions Inventories Data; Activity Data; Emissions Factors; Surrogates; Estimation Methods; Spatial Allocation; Transport Models; Web Tools/Services; Comparison of Emissions Methods; Data Analysis; Model Development; Report Generation; Users & Projects]

9 Current Project Objectives
The project’s focus is on criteria pollutants and toxics because of their availability and accessibility.
► Recommend and demonstrate to the CEC approaches for the comparability of techniques and methodologies for data gathering and analysis, data management, and electronic data communications for promoting access to publicly available electric utility emissions.
► Identify, collect, and review existing sources of electric generating utility (EGU) emissions and activity databases, and provide a summary of the state-of-science.
► Build a prototype web browser tool to query, retrieve, and explore emissions data from heterogeneous databases.
► Demonstrate the utility of new information technologies in creating an integrated network of distributed data and tools.
► Dynamically link multiple existing datasets without requiring substantial modification on the provider’s end, and provide interfaces that make the links transparent to the end user.
► “Add value” to the linked data through the application of data analysis and processing tools.

10 Design Objectives
► Distributed
► Non-intrusive to data provider
► Transparent to end user
► Flexible
► Extendable
► Light on user requirements

11 Process of Building Demonstration
► Identify and access relevant data (build wrappers)
► Build relational database to temporarily store the data that are not accessible in a distributed manner
► Acquire authorization and access to those data that are dynamically accessible through Internet interfaces
► Create field name “mappings” among datasets
► Identify available web technologies for building a distributed emissions tool
► Develop new components necessary for the prototype
► Build web tool prototype for demonstrating the feasibility of exploring emissions data

12 Available Internet Accessible Emissions Data
Data Source | Time Coverage | Pollutants | Reporting Level
NEI (US) | 1985-1999 (criteria), 1996-1999 (HAPs) | NOx, SO2, CO, PM, VOC, HAPs | Boiler
eGrid (US) | 1996-2000 | NOx, SO2, CO2, Mercury | Boiler & Generator
Clean Air Markets (US) | 1980, 1985, 1988-1999 | NOx, SO2, CO2 | Generator
NPRI (Canada) | 1994-2001 | HAPs (Criteria starting in 2002) | Facility
These are publicly available, on-line accessible emissions data. Other data resources are available, but at this time only in hard copy form and therefore not usable in demonstrating distributed database concepts.
► NEI, NPRI, and eGrid data were downloaded and stored in relational databases on the CAPITA server.
► BRAVO Mexican emissions data were obtained in electronic format and imported to the CAPITA server.
► Clean Air Markets was identified as the source most suitable for demonstrating distributed access.

13 Emissions Data Characteristics
► Web browser query
► Web map server
► Recent database structure upgrade
► Web browser query
► Remotely accessible using SecuRemote
► Not yet publicly accessible
► Web browser query
► Remotely accessible using SecuRemote
► Oracle database
► Downloadable Excel spreadsheets
► Plans for a dynamic web system were shelved

14 Database Fields Mapping
Emissions inventories are based on different underlying data models. Each inventory uses a uniquely defined set of field names. However, many of these field names are similar to (or their content is similar to) fields in another country’s inventory. Some of the key relationships among the inventories have been captured by developing a “mapping” among fields. These mappings provide a set of connections that can subsequently be applied to automated query and integration of data from multiple inventories.
Example from the slide diagram: the field names SO2, SO2Yr, SO2_Ann, and Sulfur Dioxide all map to a common SO2 field.
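
A field mapping of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the field names SO2Yr, SO2_Ann, and Sulfur Dioxide come from the slide, but the slide does not say which inventory uses which name, so the inventory labels (`inventory_a`, etc.), the `to_common` helper, and all values are hypothetical.

```python
# Hypothetical field-name mapping. The inventory labels are placeholders,
# because the slide does not say which inventory uses which field name.
FIELD_MAP = {
    "inventory_a": {"SO2Yr": "SO2"},
    "inventory_b": {"SO2_Ann": "SO2"},
    "inventory_c": {"Sulfur Dioxide": "SO2"},
}


def to_common(inventory, record):
    """Rename one record's fields to the common names; unmapped fields pass through."""
    mapping = FIELD_MAP.get(inventory, {})
    return {mapping.get(name, name): value for name, value in record.items()}


# Made-up values, for illustration only.
records = [
    ("inventory_a", {"Plant": "P1", "SO2Yr": 120.0}),
    ("inventory_b", {"Plant": "P2", "SO2_Ann": 75.5}),
    ("inventory_c", {"Plant": "P3", "Sulfur Dioxide": 10.0}),
]

# After mapping, every record exposes the same "SO2" field and can be queried uniformly.
merged = [to_common(inventory, record) for inventory, record in records]
```

Keeping the mapping as data rather than code means a new inventory could be added by extending the table, without touching the provider's database.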

15 Leveraging Multiple Projects
The challenges in distributed information systems should be addressed by collaborative efforts across governments, agencies, researchers, and disciplines.
► Common underlying goals among projects provide opportunities to design systems that naturally interoperate.
► Collaboration avoids one-time, stand-alone solutions that cannot be reused.

16 DataFed.Net
R. Husar, CAPITA
The Aerosol Data and Services Federation (DataFed) is a network of providers and users for sharing atmospheric data and processing services. DataFed includes:
► Community: participants who share and use data and processing services
► Mediator: a software component that homogenizes data access
► Peer-to-peer: computing for composing web applications

17 Catalog of Air Quality Related Data
Metadata for each dataset are registered in a catalog, allowing users to browse available datasets and determine which datasets to use for their particular application. The catalog entries include data access instructions.
The emissions data are registered in the catalog.
[Screenshot label: select data domain: aerosols, emissions, fire …]
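
As a rough sketch of the catalog idea, the Python below registers datasets with a domain and an access instruction, then lets a user browse by domain. The class names, the `access` strings, and the methods are assumptions made for illustration; they are not the DataFed catalog's actual schema or API.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    domain: str   # e.g. "aerosols", "emissions", "fire"
    access: str   # placeholder text standing in for real access instructions


class DataCatalog:
    """Toy in-memory catalog; a real catalog would hold far richer metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def browse(self, domain):
        """List the dataset names registered under one data domain."""
        return sorted(e.name for e in self._entries.values() if e.domain == domain)

    def access_instructions(self, name):
        return self._entries[name].access


catalog = DataCatalog()
# Access descriptions are hypothetical placeholders.
catalog.register(CatalogEntry("NEI", "emissions", "stored copy on server"))
catalog.register(CatalogEntry("NPRI", "emissions", "stored copy on server"))
catalog.register(CatalogEntry("Clean Air Markets", "emissions", "remote query"))

emissions_datasets = catalog.browse("emissions")
```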

18 Preview Data

19 Modular Applications for Working with Online Data
Emissions data are multi-dimensional (plant, year, pollutant, fuel type, boiler capacity, etc.). This multidimensionality requires multiple “views” of the data in a variety of end-use applications.
► Data views can be created in the framework, including maps, time series, and tables.
► Each view is independently linked to its data sources and described (e.g. by its geographic and temporal extents).
► Using the data access instructions registered in the catalog, a view can be dynamically assembled.
► The modularly designed views can be embedded and controlled in web pages using JavaScript, ASP or other web application programming languages.
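
The idea of several views over the same multi-dimensional data can be illustrated with a small Python sketch. The records, the dimension names, and the two view functions are all made up for illustration; the framework's real views are web components, not these functions.

```python
from collections import defaultdict

# Made-up records with a few of the dimensions named on the slide.
RECORDS = [
    {"plant": "P1", "year": 1998, "pollutant": "SO2", "value": 10.0},
    {"plant": "P1", "year": 1999, "pollutant": "SO2", "value": 12.0},
    {"plant": "P2", "year": 1999, "pollutant": "SO2", "value": 5.0},
    {"plant": "P2", "year": 1999, "pollutant": "NOx", "value": 3.0},
]


def time_series_view(records, pollutant):
    """Collapse the plant dimension: total emissions of one pollutant per year."""
    totals = defaultdict(float)
    for r in records:
        if r["pollutant"] == pollutant:
            totals[r["year"]] += r["value"]
    return dict(sorted(totals.items()))


def table_view(records, pollutant, year):
    """Fix pollutant and year: one value per plant, as a table or map layer would need."""
    return {
        r["plant"]: r["value"]
        for r in records
        if r["pollutant"] == pollutant and r["year"] == year
    }


series = time_series_view(RECORDS, "SO2")
table = table_view(RECORDS, "SO2", 1999)
```

Each view fixes or collapses different dimensions of the same records, which is why the views can be built and linked independently.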

20 Web Services
Substantial progress has been achieved in data interoperability. One of the next advances required is interoperable data analysis/processing tools.
Web services are applications that are used over the Web. Because they are self-contained and use XML-based standards (SOAP, WSDL, UDDI) for describing themselves and communicating with other web resources, they can be reused in a variety of independent applications.
Many of the analysis and processing tools used by the air quality management community could benefit from web service technology. Not only can the data be shared, but the heterogeneous, distributed tools that operate on those data can be shared as well.
A longer term vision for web services is to be able to “orchestrate” or “chain” multiple services from multiple providers so that new “third party” applications can be constructed.

21 Example Web Services Application
Emissions data from multiple databases are displayed on maps, time series, and tables. Tools are included for browsing and querying the data.
In this example, the user can change the pollutant, date, map zoom, data on/off, and map/time scales.

22 North American Emissions Demonstration Data Flow
Data sources (wrappers = web service): BRAVO, NPRI, NEI, eGrid, CAM
Web services: Data Catalog, MapPointAccess, MapPointRender, MapImageAccess, MapImageRender, MapImageOverlay
The settings of each web service can be changed by the user, creating a dynamic application.
Name / Settings shown in the diagram:
► MapPointAccess: DataSet = NPRI, Year = 1999, Parameter = SO2
► MapPointAccess: DataSet = eGrid, Year = 1999, Parameter = SO2
► MapPointAccess: DataSet = NEI, Year = 1999, Parameter = SO2
► MapPointRender: Color = Yellow, Symbol = Bar, Width = 8
► MapPointRender: Color = Red, Symbol = Bar, Width = 8
► MapPointRender: Color = Blue, Symbol = Bar, Width = 8
► DataSet = N.Am. Borders, Color = Maroon, Size = 2
► Layer Order = N.Am, NEI, eGrid, NPRI
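
The chaining of access, render, and overlay services could be sketched as follows. The function names echo the service names on the slide, but their bodies, the point data, and the pairing of colors with datasets are assumptions for illustration; the slide does not show which color belongs to which dataset, and the real services are web services rather than local functions.

```python
# Made-up point data keyed by (dataset, year, parameter); values are placeholders.
FAKE_DATA = {
    ("NPRI", 1999, "SO2"): [("P1", 4.0)],
    ("eGrid", 1999, "SO2"): [("P2", 6.0)],
    ("NEI", 1999, "SO2"): [("P3", 8.0)],
}


def map_point_access(dataset, year, parameter):
    """Stand-in for a data access service: return points for one dataset query."""
    return {"dataset": dataset, "points": FAKE_DATA.get((dataset, year, parameter), [])}


def map_point_render(layer, color, symbol="Bar", width=8):
    """Stand-in for a render service: attach display settings to a layer."""
    return {**layer, "style": {"color": color, "symbol": symbol, "width": width}}


def map_image_overlay(layers, layer_order):
    """Stand-in for an overlay service: stack the layers in the requested order."""
    by_name = {layer["dataset"]: layer for layer in layers}
    return [by_name[name] for name in layer_order if name in by_name]


# Color-to-dataset pairing is assumed, not taken from the slide.
colors = {"NPRI": "Yellow", "eGrid": "Red", "NEI": "Blue"}
layers = [
    map_point_render(map_point_access(dataset, 1999, "SO2"), color)
    for dataset, color in colors.items()
]
stack = map_image_overlay(layers, ["NEI", "eGrid", "NPRI"])
```

Because each step only consumes the previous step's output plus its own settings, changing one setting (for example the year) re-runs the chain without any change to the other services.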

23 Embedded Images and Controllers in Web Page
Controllers shown: Parameter Controller, Date Controller, Query Controller
The controllers and map image view can be linked and assembled in a web page. Changing the settings of a controller changes the URL of the map image and updates the web page. The web page can be constructed using standard web application programming languages, such as JavaScript and ASP.
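
The link between controller settings and the map image URL might look like the sketch below. The base URL and the query parameter names are hypothetical, and the slide's pages were built with JavaScript and ASP; Python is used here only to show the mechanism of rebuilding the URL when a setting changes.

```python
from urllib.parse import urlencode

BASE_URL = "http://example.org/map"  # hypothetical endpoint, not the project's


def map_image_url(settings):
    """Build the map image URL from the current controller settings."""
    return BASE_URL + "?" + urlencode(sorted(settings.items()))


# Initial controller state (parameter names are assumed).
settings = {"DataSet": "NEI", "Parameter": "SO2", "Year": 1999}
url_before = map_image_url(settings)

# The user changes the parameter controller; the image URL changes with it,
# and the page would reload the image from the new URL.
settings["Parameter"] = "NOx"
url_after = map_image_url(settings)
```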

24 Project Results
► Technology has been demonstrated to be at the point where we can begin to apply some of the distributed database concepts to “real” situations.
► Emissions data present unique challenges due to their complex relational dimensionality.
► Collaborative efforts in the near future could generate a distributed North American emissions inventory.
– The initial versions of the inventory would help clarify the issues related to handling complex queries.
– Building and using a distributed emissions tool will assist in creating consensus data naming conventions.

25 Current Project Challenges
► Dynamic Access to Data
– technical snafus
– security issues
– slow performance
► Complexity of Emissions Data
► Quickly evolving technology

26 What’s Missing
► Metadata
– More complete metadata would help in relating heterogeneous databases
► More complete access to distributed datasets
– A process for creating trusted provider-user agreements would help address issues of security and data misuse
► More comprehensive content
– Networked data and tools that spark additional interest in the technology’s potential
► Actual Implementations
– FASTNet will demonstrate the use of many of these technologies in the real time monitoring of major aerosol events this summer

27 Next Steps
► Advance the prototype tool to make it more representative of what an emissions inventory would ultimately use
► EPA’s RFA on Distributed Air Quality Data in Support of NEISGEI
► Link to other web services (Geospatial One-Stop, TerraServer) and other relevant data sources, including Web Map Servers
► Establish collaborative partnerships with other researchers and agencies developing related networks and tools (ReVA, Earth Science Federation (NASA), NSF Cyberinfrastructure efforts)
► Build tools that add value to the distributed data and provide incentives for data providers to join networks
► Clarify the handling of distributed data for optimal system performance

28 Mediators
► The flow of data from provider to user passes through brokers, or mediators.
► The data continue to be maintained by the original providers.
► Contracts are used to retain a constant link to the original data.
► Data are still not centrally stored but are “cached” in a format that allows efficient queries and analyses.
► Mediators provide an interface between the user and the data that enhances the effective information exchange between the two sides.
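
A mediator that sits between users and providers and caches results can be sketched as below. This is a minimal illustration under stated assumptions: the `Mediator` class, its methods, and the stand-in provider are hypothetical, and the provider-user contracts mentioned on the slide are not modeled.

```python
class Mediator:
    """Toy broker: forwards queries to providers and caches the answers."""

    def __init__(self, providers):
        self._providers = providers  # provider name -> callable returning data
        self._cache = {}

    def query(self, provider, **params):
        key = (provider, tuple(sorted(params.items())))
        if key not in self._cache:
            # Cache miss: fetch from the original provider, which keeps ownership of the data.
            self._cache[key] = self._providers[provider](**params)
        return self._cache[key]

    def refresh(self, provider):
        """Drop cached results for one provider so the next query re-fetches."""
        self._cache = {k: v for k, v in self._cache.items() if k[0] != provider}


calls = {"count": 0}


def fake_provider(year, parameter):
    """Stand-in for a remote provider; counts calls and returns made-up values."""
    calls["count"] += 1
    return [("P1", 1.0), ("P2", 2.0)]


mediator = Mediator({"CAM": fake_provider})
first = mediator.query("CAM", year=1999, parameter="SO2")
second = mediator.query("CAM", year=1999, parameter="SO2")  # served from the cache
```

The cache holds copies only for efficient querying; calling `refresh` discards them, so the provider's database remains the authoritative source.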

29 Adding Value to Data through Services
[Diagram labels: Server; Client; Mediator Services; Data Sources; Data Users; multidimensional data cubes; analytical web services]

