EPA Big Data Analytics: EnviroAtlas Data
Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community
April 17,
Overview
EPA EnviroAtlas Data: Web Page, Description, Maps, Scales (National and Community)
Geodatabases-to-Shape Files: FME Workbench Results
Data Science Data Publication: MindTouch Knowledge Bases, Spreadsheet Knowledge Base Indices and Tables
Spotfire Analytics and Visualizations: Cover Page (Knowledge Base), Content Analytics, IRM Strategic Plan Tables, EnviroAtlas Inventories, Selected National Metrics
EPA EnviroAtlas Data: Web Page
EPA EnviroAtlas Data: Description EnviroAtlas national and community data are available to download below as geodatabases. Due to technical limitations that we are working to overcome, not all of the EnviroAtlas data (e.g., 1-meter landcover data, supplemental data) are available for download. As of February 2015, EnviroAtlas is transitioning to a more recent version of the 12-digit HUCs; data aggregated to these new boundaries will be available soon. All available EnviroAtlas data for each community, except the landcover, are included in the individual geodatabase files below.
Durham, NC metric tables in Esri FileGeodatabase format (compressed [36 MB])
Fresno, CA metric tables in Esri FileGeodatabase format (compressed [7 MB])
Green Bay, WI metric tables in Esri FileGeodatabase format (compressed [9 MB])
Milwaukee, WI metric tables in Esri FileGeodatabase format (compressed [31 MB])
New Bedford, MA metric tables in Esri FileGeodatabase format (compressed [4 MB])
Phoenix, AZ metric tables in Esri FileGeodatabase format (compressed [74 MB])
Pittsburgh, PA metric tables in Esri FileGeodatabase format (compressed [31 MB])
Portland, ME metric tables in Esri FileGeodatabase format (compressed [22 MB])
Tampa, FL metric tables in Esri FileGeodatabase format (compressed [53 MB])
Woodbine, IA metric tables in Esri FileGeodatabase format (compressed [658 KB])
EPA EnviroAtlas Data: Maps
EPA EnviroAtlas Data: National Maps at the national extent provide wall-to-wall data coverage for the coterminous U.S. These data layers are summarized by 12-digit hydrologic watershed basins (12-digit HUCs), which provide approximately 90,000 similarly sized spatial units. A list of the currently available data is accessible as a .pdf, an .xls file, or a tab-delimited text file (National file). This file shows the benefit categories under which each layer can be found. Supplemental maps for the nation provide context and additional data for exploring ecosystem services and the built environment. These data are not summarized by a specific spatial unit. Instead, these supplemental maps represent features in the landscape such as rivers and wetlands, as well as other contextual landmarks such as state boundaries. Details on each supplemental map can be found in the data fact sheets.
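The 12-digit HUC codes that key the national layers are hierarchical: in the USGS hydrologic unit system, each additional pair of digits names a finer level. A minimal Python sketch of splitting a code into its levels follows. The function name `huc_levels` and the sample code value are illustrative assumptions, not taken from the EnviroAtlas files; the sketch also assumes codes may have lost a leading zero when stored as numbers.

```python
# Prefix length and level name for each step of the hydrologic unit hierarchy.
HUC_LEVELS = [
    (2, "region"),
    (4, "subregion"),
    (6, "basin"),
    (8, "subbasin"),
    (10, "watershed"),
    (12, "subwatershed"),
]


def huc_levels(huc12):
    """Return the prefix of a 12-digit HUC code at every hierarchy level.

    Codes read from spreadsheets are often stored as integers, which drops
    a leading zero, so the value is left-padded back to 12 digits.
    """
    code = str(huc12).strip().zfill(12)
    if len(code) != 12 or not code.isdigit():
        raise ValueError(f"not a 12-digit HUC code: {huc12!r}")
    return {name: code[:length] for length, name in HUC_LEVELS}


# Illustrative value only (hypothetical code, not looked up in any dataset).
example = huc_levels(30300020604)
print(example["region"], example["subbasin"], example["subwatershed"])
```

Keeping the codes as zero-padded strings is the design choice that matters here: it makes prefix grouping (for example, rolling HUC 12 metrics up to HUC 8) a simple string slice.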
EPA EnviroAtlas Data: Community Community-level information in EnviroAtlas draws from fine-scale land cover data, census data, and models to estimate ecosystem services and their benefits within the community area. EnviroAtlas community data are consistent for each available community and are mostly summarized by census block groups. EnviroAtlas is building datasets for 50 communities in the United States; each community area boundary is based on selected block groups within the 2010 US Census Urban Area boundary. See a list of the available and upcoming communities. Learn more in the Community Fact Sheet (pp, 997K) or download a list of all the EnviroAtlas data available for each community as a .pdf, an .xls file, or a tab-delimited text file (Community file). This file shows the benefit categories under which each layer can be found. Supplemental maps for each community provide context and additional data for exploring ecosystem services and the built environment. These data are not summarized by a specific spatial unit and include the 1-meter resolution land cover data for each community. Details on each supplemental map can be found in the data fact sheets.
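Because the layer list is also offered as a tab-delimited text file, layer counts per benefit category can be tallied with the Python standard library alone. The sketch below is illustrative: the column names `Layer` and `Benefit Category` and the sample rows are placeholders, since the actual header of the Community file is not shown on the slide.

```python
import csv
import io
from collections import Counter


def count_layers_by_category(handle, category_column="Benefit Category"):
    """Count rows per benefit category in a tab-delimited layer list.

    `category_column` is a hypothetical header name; replace it with the
    header actually used in the downloaded file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[category_column] for row in reader)


# Placeholder rows standing in for the real file contents.
SAMPLE = (
    "Layer\tBenefit Category\n"
    "Placeholder layer A\tCategory X\n"
    "Placeholder layer B\tCategory X\n"
    "Placeholder layer C\tCategory Y\n"
)

counts = count_layers_by_category(io.StringIO(SAMPLE))
for category, n in counts.most_common():
    print(category, n)
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample.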
EPA EnviroAtlas Data: Map of Communities
Geodatabases-to-Shape Files. My Note: Sort by Size. My Note: 0.5 GB HUC 12 Being Updated.
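The "Sort by Size" note can be reproduced for the community downloads using the compressed sizes quoted on the Description slide. This is a small sketch, assuming binary units (1 MB = 1024 KB); the names `COMMUNITY_DOWNLOADS` and `size_in_mb` are illustrative.

```python
# Compressed download sizes as listed on the EnviroAtlas Description slide.
COMMUNITY_DOWNLOADS = {
    "Durham, NC": "36 MB",
    "Fresno, CA": "7 MB",
    "Green Bay, WI": "9 MB",
    "Milwaukee, WI": "31 MB",
    "New Bedford, MA": "4 MB",
    "Phoenix, AZ": "74 MB",
    "Pittsburgh, PA": "31 MB",
    "Portland, ME": "22 MB",
    "Tampa, FL": "53 MB",
    "Woodbine, IA": "658 KB",
}

# Assumption: binary units, so 1 MB = 1024 KB and 1 GB = 1024 MB.
UNITS_IN_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def size_in_mb(label):
    """Convert a label such as '658 KB' or '36 MB' to megabytes."""
    value, unit = label.split()
    return float(value) * UNITS_IN_MB[unit]


# Largest first; ties keep their original listing order.
ranked = sorted(
    COMMUNITY_DOWNLOADS,
    key=lambda name: size_in_mb(COMMUNITY_DOWNLOADS[name]),
    reverse=True,
)
total_mb = sum(size_in_mb(label) for label in COMMUNITY_DOWNLOADS.values())

for name in ranked:
    print(f"{name}: {COMMUNITY_DOWNLOADS[name]}")
print(f"Total: {total_mb:.1f} MB")
```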
FME Workbench: National Metrics Log File
Starting translation...
FME ( Build WIN64)
FME_HOME is 'C:\Program Files\FME\'
FME Database Edition (node locked-crc)
Serial Number: 0
Temporary License: 31 days left.
Machine host name is: BrandNiemann-PC
LOTS MORE DETAILS…..
Total Features Written 2,607,688
Translation was SUCCESSFUL with 8 warning(s) ( feature(s) output)
FME Session Duration: 6 minutes 18.3 seconds. (CPU: 326.0s user, 47.7s system)
END - ProcessID: 6016, peak process memory usage: kB, current process memory usage: kB
Translation was SUCCESSFUL
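The summary lines at the end of the log can be pulled out programmatically, for example to compare runs. The sketch below uses regular expressions written against only the lines quoted on this slide; the patterns are an assumption about the log layout, not a documented FME format, and `summarize_fme_log` is a hypothetical helper name. Fields that are missing from the text come back as `None`.

```python
import re

# Tail of the log exactly as quoted on the slide.
LOG_TAIL = (
    "Total Features Written 2,607,688\n"
    "Translation was SUCCESSFUL with 8 warning(s) ( feature(s) output)\n"
    "FME Session Duration: 6 minutes 18.3 seconds. "
    "(CPU: 326.0s user, 47.7s system)\n"
)


def summarize_fme_log(text):
    """Extract feature count, warnings, wall time and CPU time from log text."""
    features = re.search(r"Total Features Written:?\s+([\d,]+)", text)
    warnings = re.search(r"SUCCESSFUL with (\d+) warning", text)
    duration = re.search(
        r"Session Duration:\s+(?:(\d+) minutes? )?([\d.]+) seconds", text
    )
    cpu = re.search(r"CPU:\s+([\d.]+)s user,\s+([\d.]+)s system", text)
    return {
        "features_written": (
            int(features.group(1).replace(",", "")) if features else None
        ),
        "warnings": int(warnings.group(1)) if warnings else None,
        "wall_seconds": (
            int(duration.group(1) or 0) * 60 + float(duration.group(2))
            if duration
            else None
        ),
        "cpu_seconds": (
            float(cpu.group(1)) + float(cpu.group(2)) if cpu else None
        ),
    }


summary = summarize_fme_log(LOG_TAIL)
print(summary)
```

Dividing `features_written` by `wall_seconds` gives a rough throughput figure for the conversion.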
FME Workbench: National Metrics GDB-to-SHP
Data Science Data Publication: MindTouch Knowledge Base (Data Science for EPA Big Data Analytics). My Note: Use Google Chrome Find
Data Science Data Publication: Spreadsheet Knowledge Base (EPABigDataAnalytics.xlsx)
EPA EnviroAtlas National & Community Inventory (xlscurrentdata.xls)
Data Science Data Publication: Spotfire Cover Page (Content Analytics; Web Player)
Data Science Data Publication: IRM Strategic Plan (Content Analytics; Web Player)
Data Science Data Publication: IRM Strategic Plan Tables (PDF to Tables; Enterprise Data Dictionary; Web Player)
Data Science Data Publication: EnviroAtlas Inventory National (National Layer Counts; Web Player)
Data Science Data Publication: EnviroAtlas Inventory Community (Community Layer Counts; Web Player)
Data Science Data Publication: EnviroAtlas Inventory NatureServe (SHAPE Length Versus SHAPE Area; Acres per State; SHAPE Area per State; Web Player)
Data Science Data Publication: EnviroAtlas Inventory Land Cover (Percent Wetland Versus PAGP; Percent Wetland by HUC 12; Web Player)
Conclusions and Recommendations The EPA EnviroAtlas data are the most integrated databases EPA has for national and community ecosystems. The use of the proprietary Esri GDB format limits the reuse of these data in open government data applications. The Safe Software FME Workbench was used to convert selected national and community files from GDB to SHP format. A Data Science Data Publication of EPA Big Data Analytics was produced as an example of the new EPA Big Data Analytics Service in the EPA 5-year IRM Strategic Plan. EnviroAtlas data for 50 communities are coming, and many other EPA geospatial data sets could be used for Big Data Analytics in Data Science Data Publications.
Exploratory Data Science on Even Bigger Data Process: Unzipped all National Metrics and converted them GDB-to-SHP with Safe FME Workbench (70 MB grew to 282 MB in 102 files, of which 34 were SHP). Imported all 34 SHP files (30 MB) at once into one Spotfire file of 84 MB and did exploratory data analysis on them. The geometry is missing, but it was not needed initially because the tables carry HUC codes. Found the current HUC 12 geometry at the USDA Geospatial Data Gateway (700 MB GDB ZIP), unzipped it to 744 MB, and converted GDB-to-SHP to a 4.0 GB SHP. Imported into Spotfire, the resulting file was only 1.8 GB.
Safe FME Workbench log file:
Total Features Written:
Translation was SUCCESSFUL with 0 warning(s) ( feature(s) output)
FME Session Duration: 4 minutes 12.1 seconds. (CPU: 230.1s user, 6.3s system)
END - ProcessID: 10120, peak process memory usage: kB, current process memory usage: kB
Translation was SUCCESSFUL
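The size changes quoted in this process can be put side by side as growth factors. The sketch below only restates the slide's own numbers; it assumes 1 GB = 1000 MB, and the step labels are shorthand introduced here for illustration.

```python
# (step label, size before in MB, size after in MB), as quoted on the slide.
# Assumption: 1 GB is treated as 1000 MB.
STEPS = [
    ("national: download -> converted output", 70, 282),
    ("national: 34 SHP -> Spotfire file", 30, 84),
    ("huc12: GDB ZIP -> unzipped GDB", 700, 744),
    ("huc12: unzipped GDB -> SHP", 744, 4000),
    ("huc12: SHP -> Spotfire file", 4000, 1800),
]


def size_ratio(before_mb, after_mb):
    """Return how many times larger (or smaller) the output is than the input."""
    return after_mb / before_mb


ratios = {
    label: round(size_ratio(before, after), 2) for label, before, after in STEPS
}

for label, ratio in ratios.items():
    print(f"{label}: x{ratio}")
```

A ratio below 1 marks a step where the data shrank, as when the 4.0 GB shapefile was imported into Spotfire.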
Spotfire Data Tables and Relations. My Note: 35 data tables, with all their many columns of numbers, locations, and categories, with BioMass (83,029 rows by 10 columns) joined to HUC12 (100,493 rows by 27 columns), all in memory!
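The relation between the BioMass table and the HUC12 table is a key join on the HUC code. Outside Spotfire, the same idea can be sketched in plain Python as a left join through a dictionary index. The column names (`HUC_12`, `biomass`, `acres`) and the rows are placeholders, not values from the EnviroAtlas or USDA tables.

```python
def left_join(left_rows, right_rows, key):
    """Join each left row to the right row with the same key, if any.

    Left rows without a match are kept unchanged; on column name clashes
    the left value wins. Assumes the key is unique in `right_rows`.
    """
    index = {row[key]: row for row in right_rows}
    return [{**index.get(row[key], {}), **row} for row in left_rows]


# Placeholder rows: three HUC 12 units, two of which have a BioMass record.
huc12_rows = [
    {"HUC_12": "000000000001", "acres": 10.0},
    {"HUC_12": "000000000002", "acres": 20.0},
    {"HUC_12": "000000000003", "acres": 30.0},
]
biomass_rows = [
    {"HUC_12": "000000000001", "biomass": 1.5},
    {"HUC_12": "000000000002", "biomass": 2.5},
]

joined = left_join(huc12_rows, biomass_rows, key="HUC_12")
unmatched = [row["HUC_12"] for row in joined if "biomass" not in row]
print(len(joined), "rows joined;", len(unmatched), "without a BioMass match")
```

With the row counts on the slide, a left join from HUC12 would leave some HUC 12 units without BioMass values, since the BioMass table has fewer rows; keeping the key as a zero-padded string avoids silent mismatches.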
Exploratory Data Science: BioMass by HUC
Exploratory Data Science: BioMass by HUC
Exploratory Data Science: Florida BenMap
Exploratory Data Science: Florida BG_Pop