David Kaeli PROTECT Center Northeastern University Boston, MA

Slides:



Advertisements
Similar presentations
Presentation at WebEx Meeting June 15,  Context  Challenge  Anticipated Outcomes  Framework  Timeline & Guidance  Comment and Questions.
Advertisements

Part III Solid Waste Engineering
PROTECT: Puerto Rico Testsite for Exploring Contamination Threats Project 3: Phthalate Exposure and Molecular Epidemiologic Markers of Preterm Birth among.
PROTECT: Puerto Rico Testsite for Exploring Contamination Threats Why modeling? The prediction of water resources and the assessment.
TRANSFORMING HEALTH CARE THROUGH RESEARCH AND EDUCATION 2012 Illinois Performance Excellence Bronze Award Leading Improvement Across the Continuum: Skills,
Funding Networks Abdullah Sevincer University of Nevada, Reno Department of Computer Science & Engineering.
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
MS DB Proposal Scott Canaan B. Thomas Golisano College of Computing & Information Sciences.
Tools for Publishing Environmental Observations on the Internet Justin Berger, Undergraduate Researcher Jeff Horsburgh, Faculty Mentor David Tarboton,
David Kaeli PROTECT Center Northeastern University Boston, MA Developing a Data Management Model for Managing Diverse Data In Environmental Health Research.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
CceHUB A Knowledge Discovery Environment for Cancer Care Engineering Research Ann Christine Catlin HUBzero Workshop November 7, 2008.
Roger Miller, Arkansas Department of Environmental Quality Barry Jackson, USGS Arkansas Water Science Center ARKANSAS EXCHANGE NETWORK FOR GROUNDWATER-QUALITY.
AL-MAAREFA COLLEGE FOR SCIENCE AND TECHNOLOGY INFO 232: DATABASE SYSTEMS CHAPTER 1 DATABASE SYSTEMS (Cont’d) Instructor Ms. Arwa Binsaleh.
material assembled from the web pages at
CDC’s Preemie Act Activities Wanda Barfield, MD, MPH, FAAP Director, Division of Reproductive Health National Center for Chronic Disease Prevention and.
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
UCLA Enterprise Directory Identity Management Infrastructure UC Enrollment Service Technical Conference October 16, 2007 Ying Ma
1.file. 2.database. 3.entity. 4.record. 5.attribute. When working with a database, a group of related fields comprises a(n)…
5-1 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
TechCon Food systems history… Agriculture has a 10,000 year history Farmers are estimated to be 38 to 45% of the global work force In the developing.
MDL Information Systems, Inc. Powering the Process of Invention Donna del Rey Director, Business Planning
HSPD-7 Critical Infrastructure Identification, Prioritization and Protection: designates EPA as the sector-specific lead agency for critical water infrastructure.
CS570: Data Mining Spring 2010, TT 1 – 2:15pm Li Xiong.
National Center for Health Statistics (NCHS) Centers for Disease Control and Prevention.
National Institutes of Health U.S. Department of Health and Human Services William A. Suk, Ph.D., M.P.H. Director, Superfund Research Program Chief, Hazardous.
Database Principles: Fundamentals of Design, Implementation, and Management Chapter 1 The Database Approach.
© 2016 Chapter 6 Data Management Health Information Management Technology: An Applied Approach.
All-Payer Model Update
A Reusable Framework for Automated Record Creation and Population
2nd GEO Data Providers workshop (20-21 April 2017, Florence, Italy)
SNS COLLEGE OF TECHNOLOGY
NIEHS SRP P42 Research Center
Customer Support Strategic Pillars
Parallel Sessions: Pathways & Prediction
Joslynn Lee – Data Science Educator
MATLAB Distributed, and Other Toolboxes
REDCap Administration Organizational Analysis
An assessment framework for Intrusion Prevention System (IPS)
Evaluation of NCI Research Resources
Data and Big Data in MPI David Austin and Gerald Rys
An Overview of Data-PASS Shared Catalog
Database Systems: Design, Implementation, and Management Tenth Edition
Performance Management Overview
What is an attribute? How is it related to an entity?
Data challenges in the pharmaceutical industry
U.S. Department of Housing and Urban Development
Strategic Planning Update
A Guide to Shift’s Open Data ecosystem & Data workflow
Database Management System (DBMS)
Installation Restoration (IR) Collaboration Gateway
Jeannette T. Bensen, MS, PhD, Director IHSFC
Stop Data Wrangling, Start Transforming Data to Intelligence
An Industry Perspective Nicole Denjoy COCIR Secretary General
Data Warehousing and Data Mining
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie
} northeastern.edu/protect
WIS Strategy – WIS 2.0 Submitted by: Matteo Dell’Acqua(CBS) (Doc 5b)
All-Payer Model Update
Enterprise Program Management Office
Metadata Construction in Collaborative Research Networks
Incorporating NVivo into a Large Qualitative Project
Leading Improvement Across the Continuum: Skills, Tools and Teams for Success January 2014.
EESC Public Hearing 30-Jan-2019
SAMPLE ONLY Dominion Health Center: Excellence in Medicaid Managed Care (or another defining message) Dominion Health Center is a community health center.
SAMPLE ONLY Dominion Health Center: Your Community Healthcare Home (or another defining message) Dominion Health Center is a community health center.
Local Authorities and Sustainable Energy
Jack G. Conrad, Thomson R&D
SAMPLE ONLY Dominion Health Center: Your Community Partner for Excellent Care (or another defining message) Dominion Health Center is a community health.
REACHnet: Research Action for Health Network
Presentation transcript:

David Kaeli PROTECT Center Northeastern University Boston, MA Developing a Data Management Model for Managing Diverse Data In Environmental Health Research David Kaeli PROTECT Center Northeastern University Boston, MA

Key Collaborators and Contributors Akram Alshawabkeh (NU) Liza Anzalota del Toro (UPRMC) Jose Cordero (UPRMC) EarthSoft® Kelly Ferguson (UMich) Reza Ghasemizadeh (NU) Roger Giese (NU) Rachel Grashow (NU) Lauren Johns (UMich) Ingrid Padilla (UPRM) Harshi Weerasinghe (NU) Leiming Yu (NU)

Organization of this Presentation Challenges of managing diverse environmental health data to support a distributed Center Overview of the PROTECT Data Management and Modeling Core Data Analytics and Machine Learning Lessons learned

Puerto Rico Testsite for Exploring Contamination Threats (PROTECT) NIEHS SRP P42 Center since 2010 Cohort will include 1800 expectant mothers Key research questions: What is the contribution of environmental contaminants to preterm birth in Puerto Rico? Can we develop better strategies for detection and green remediation to minimize or prevent exposure to environmental contamination? Anticipated outcomes: Define the relationship between exposure to environmental contaminants and preterm birth Develop new technology for discovery, transport and exposure characterization, and green remediation of contaminants in karst systems Broader Impacts: Support environmental public health practice, policy and awareness around our theme

Puerto Rico Testsite for Exploring Contamination Threats (PROTECT) A Diversity of Data

Challenges with Managing Diverse Environmental Health Data Managing data for a multi-university (5) distributed partnership Northeastern University University of Puerto Rico Medical Campus University of Puerto Rico at Mayaguez University of Michigan University of West Virginia Diverse data cleaning requirements A range of data privacy/security requirements Diverse expertise/needs Environmental engineers, biochemists, electrochemists, toxicologists, epidemiologists, biostatisticians, pediatricians, agronomist, hydrogeologists, and social scientists Background=trasfondo

Challenges with Managing Diverse Environmental Health Data Develop a common vocabulary across disciplines Provide a common data import/cleaning mechanism Collect/manage distributed data Provide for the security/privacy of human subject data Provide on-line distributed sharing Provide on-line relational query capabilities Provide export to a large number of formats Provide customized tools and access on a per-project basis Background=trasfondo

Organization of this Presentation Challenges of managing diverse environmental health data to support a distributed Center Overview of the PROTECT Data Management and Modeling Core Data Analytics and Machine Learning Lessons learned

PROTECT Data Management and Modeling Core Responsible for the collection, storage, security and management of human subject, biological, survey data, and environmental data via a secure database system Leverages Dropbox, REDCap and Earthsoft commercial software Provides cleaning and secure/reliable storage Dropbox portal used for remote data transfer and data backup RedCap is used for data cleaning all Human Subject data (and other data in the future) EQuIS EDDs to perform data cleaning for all projects, as well as EQuIS Professional/Enterprise for managing data Enables PROTECT projects to cross-index data (based on subject ID, time and space-GIS) Provides data inventory, data analysis and mining Global data dictionary Comprehensive data inventory tools to track data collection status Data exports and analysis using EQuIS Enterprise

PROTECT Data By the Numbers (6/20/2015) Human Subject Core 3,193 total fields/participant Presently 15 different forms ~1.5M records Environmental Data Core 1048 wells (14 of them include water contaminant data) 35 springs (3 of them include water contaminant data) Field data 9 wells and 2 springs are sampled twice a year Tap water data: 34 homes 13 contaminants Targeted Biological Data Core 51 targeted chemicals * ~8 fields * # of participant 19 Phthalates and Phenols 18 Trace Metals 14 Pesticides Non-targeted Biological Data Core 5 fields, >1B data points in 6 urine samples Mass-to-charge values Data peaks

Data Producers EQuIS Schema Raw data Matlab SAS ArcGIS Data Sources Cleaned Data Cleaned Data EQuIS EDP Raw data Matlab SAS ArcGIS Data Sources Data Analysts

Data Consumers Education/Training Community Engagement Cleaned Data Cleaned Data EQuIS EZView Community Engagement Research Translation Research Aims And Project Questions Matlab SAS ArcGIS PROTECT Database

The PROTECT Database Environmental Human Subject and Sampling Epidemiologic/ Toxicologic Urine Blood Groundwater Tap Water Recruitment of Pregnant Women Hair Placenta Longitudinal Study 7 Clinics /3 Hospitals /In-House Interviews and Surveys Demographic, medical, residential, product use, food intake Medical Records Abstraction Biological Samples Blood, Urine, Hair, Placenta Water Quality Parameters Temp, pH, ions, cond, NO3-, etc. Phthalates DEHP, DEP, DBP) Chlorinated VOCs TCE, PCE, CT, TCM, etc. Non-Targeted Metabolites Phthalates Parabens & Phenols Organo-phosphates Metals Hormones Oxidative stress markers Inflammatory markers Other Non-Targeted Historical Historical Current Field Measurement Current Field Measurement We have amassed billions of data records which are potentially related, and we have the capability to analyze them efficiently!!

PROTECT Database: Framework Data Checking Data Query/Report Data Analysis/Modeling Third-party Interfaces Automated Workflow Web-based Widgets Legacy Contaminants Chlorinated Volatile Organic Compounds (CVOCs) Among the most frequently contaminants found at superfund sites Industrial use: solvents, dry/cleaning Emergent Contaminants Phthalates (DEHP, DEP, DBP) Present in commonly used products that make their way to landfills and illegal dumps DEHP – plasticizers DEP, DBP – solvents : Fragrances, Cosmetics, shampoos, etc. Other – phenols, PBA, Pharmaceuticals CVOCs and Phthalates have been associated with Preterm Birth1 (1Meeker et al. 2009; Meeker 2012; Sonnenfeld et al. 2001; Forand et al. 2012) Form Template First Screening EQuIS Professional EQuIS Enterprise

EQuIS Enterprise: Dashboards Default Dashboards Welcome Administration EDP Explorer EZView Notices Developing customized dashboards for each project

Organization of this Presentation Challenges of managing diverse environmental health data to support a distributed Center Overview of the PROTECT Data Management and Modeling Core Data Analytics and Machine Learning Lessons learned

Targeted Analysis example - Phthalates distributions over the time of the pregnancy 20 additional variables available, including hormone levels, oxidative stress and inflammation – biomarkers and chemical lists are expanding Courtesy of the Meeker Lab @ U. of Michigan

CVOC 1982 CVOC 1983 CVOC 1985 CVOC 1986 CVOC 1984 CVOC 2001 CVOC 1990 CVOC 1989 CVOC 2003 CVOC 1987-88 CVOC 2000 CVOC 2004 CVOC 1992 CVOC 1991 CVOC 1996 CVOC 2007 CVOC 2009 CVOC 1993 CVOC 2010 CVOC 2008 CVOC 1997 CVOC 2002 CVOC 1995 CVOC 2011 CVOC 1994 CVOC 1998 CVOC 2005 CVOC 2006 CVOC 1999 Courtesy of the Padilla Lab @ UPRM

Non-Targeted Analysis: Urine Data PROTECT performs mass spectrum analysis to identify non-targeted chemicals in urine samples from expectant mothers A mass spectrum is a mass-to-charge ratio plot used to identify chemicals present in samples Urine samples from Boston and Puerto Rico are analyzed 6 urine samples produce millions of (m/z, intensity) pairs Each urine sample extracted is separated into 240 droplets Laser transforms the analyte into gas phase ions, which are then registered as analyte and intensity pairs Data Core engaged in accelerating the processing of these pairs to identify chemicals

Example Data Analysis – Mass spectrometer analysis of urine samples 1176 urine sample files each containing ~130K (analyte, intensity) pairs, 4GB in total We want to identify compounds present in the samples Finding patterns in a 130K dimensional space is very difficult The result is impossible to visualize Limited processing capabilities of proprietary software that comes with the mass spectrometer ~20 minutes/sample on PCA processing Requires a very long research cycle when different scaling and weighting factors are needed for data processing

Summary of Classification Workflow 100X speedup in processing time

Principle Component Analysis Results Li et al., “Big Data Analysis on Puerto Rico Testsite for Exploring Contamination Threats”, ALLDATA, Barcelona, Spain, 2015.

General Characteristics of Human Subject Data High dimensionality Data of the first visit has more than 240 attributes Data of the second visit has more than 800 attributes Mixed sparsity Less than 30% of the data fields from the first visit are either undefined or set to their default values More than 70% of the data fields from the second visit are either undefined or set to their default values Diverse data types Numerical: working hours, gestational age, etc. Categorical: Race, type of job, etc.

Exploring Clustering Techniques for Complex Data K-prototypes Clustering [Huang98] Combines k-means (numerical) and k-modes (categorical) clustering techniques Need to fill in the missing data ROCK [Guha99] Hierarchical clustering method No need to fill in the missing data Subspace clustering for high dimensional categorical data [Gan04] Clustering based on the dense portion of the attributes (subspace) Alternative multiple spectral clustering [Niu08] Combine multiple non-redundant clustering solutions Leverages spectral clustering and dimensionality reduction

Organization of this Presentation Challenges of managing diverse environmental health data to support a distributed Center Overview of the PROTECT Data Management and Modeling Core Data Analytics and Machine Learning Lessons learned

Lessons Learned To Date Design in security and privacy from the beginning Develop a comprehensive data dictionary across the entire program Provide an extensible data schema Centralize data management to facilitate cross-project analysis Develop inventory tools to track progress Meet with data producers and consumers on a frequent basis Allow individual researchers to continue to work with their cleaned data Leverage/customize commercial tools, building partnerships and leveraging best practices Utilize the state-of-the-art in data analytics, especially given the rate of innovation in the field Do not forget to include the computational resources needed

Acknowledgements NCES - National Center for Environmental Assessment This project is supported by Grant Award Number P42ES017198 from the National Institute of Environmental Health Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences or the National Institutes of Health.