David Kaeli PROTECT Center Northeastern University Boston, MA Developing a Data Management Model for Managing Diverse Data In Environmental Health Research.

Slides:



Advertisements
Similar presentations
CVRG Presenter Disclosure Information Tahsin Kurc, PhD Center for Comprehensive Informatics Emory University CardioVascular Research Grid Core Infrastructure.
Advertisements

C6 Databases.
Presentation at WebEx Meeting June 15,  Context  Challenge  Anticipated Outcomes  Framework  Timeline & Guidance  Comment and Questions.
HP Quality Center Overview.
Part III Solid Waste Engineering
PROTECT: Puerto Rico Testsite for Exploring Contamination Threats Project 3: Phthalate Exposure and Molecular Epidemiologic Markers of Preterm Birth among.
PROTECT: Puerto Rico Testsite for Exploring Contamination Threats Why modeling? The prediction of water resources and the assessment.
Frank Yu Australian Bureau of Statistics Unstructured Data 1.
Tom Sheridan IT Director Gas Technology Institute (GTI)
TRANSFORMING HEALTH CARE THROUGH RESEARCH AND EDUCATION 2012 Illinois Performance Excellence Bronze Award Leading Improvement Across the Continuum: Skills,
Funding Networks Abdullah Sevincer University of Nevada, Reno Department of Computer Science & Engineering.
Managing Data Resources
Civil and Environmental Engineering Carnegie Mellon University Sensors & Knowledge Discovery (a.k.a. Data Mining) H. Scott Matthews April 14, 2003.
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
MS DB Proposal Scott Canaan B. Thomas Golisano College of Computing & Information Sciences.
Bindley Bioscience Center Vision: Nurture interactive communication and interdisciplinary discovery with flexible laboratory project spaces and an open.
1 MAIS & ITSS FY09 Priorities Joint UL Meeting October 27, 2008.
Tools for Publishing Environmental Observations on the Internet Justin Berger, Undergraduate Researcher Jeff Horsburgh, Faculty Mentor David Tarboton,
1 Effectively Managing Global Engineering Licenses Kimberley A. Dillman IT Solution Architect – Engineering Delphi Corporation
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Understanding Data Analytics and Data Mining Introduction.
SCIENCE-DRIVEN INFORMATICS FOR PCORI PPRN Kristen Anton UNC Chapel Hill/ White River Computing Dan Crichton White River Computing February 3, 2014.
CceHUB A Knowledge Discovery Environment for Cancer Care Engineering Research Ann Christine Catlin HUBzero Workshop November 7, 2008.
ACCELERATING CLINICAL AND TRANSLATIONAL RESEARCH A simple, flexible tool for inexpensively building secure data capture systems Andy.
Roger Miller, Arkansas Department of Environmental Quality Barry Jackson, USGS Arkansas Water Science Center ARKANSAS EXCHANGE NETWORK FOR GROUNDWATER-QUALITY.
AL-MAAREFA COLLEGE FOR SCIENCE AND TECHNOLOGY INFO 232: DATABASE SYSTEMS CHAPTER 1 DATABASE SYSTEMS (Cont’d) Instructor Ms. Arwa Binsaleh.
Chapter 6: Foundations of Business Intelligence - Databases and Information Management Dr. Andrew P. Ciganek, Ph.D.
Prepared for IAC Scott Baily, Interim Director of ACNS August 13, 2008.
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie.
material assembled from the web pages at
CDC’s Preemie Act Activities Wanda Barfield, MD, MPH, FAAP Director, Division of Reproductive Health National Center for Chronic Disease Prevention and.
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
UCLA Enterprise Directory Identity Management Infrastructure UC Enrollment Service Technical Conference October 16, 2007 Ying Ma
Japan Environment and Children’s Study (Jecs). Jecs Longitudinal birth cohort study enrolling 100,000 pregnant women and following up their children until.
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
5-1 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
1 PD3 P rofessional D evelopment D ecisions Using D ata.
Facilitate Scientific Data Sharing by Sharing Informatics Tools and Standards Belinda Seto and James Luo National Institute of Biomedical Imaging and Bioengineering.
Survey of alternative uses of ACTIVE data, review of completed work and lessons learned Friday Harbor Advanced Psychometrics Workshop June 9-13, 2014 Presenter:
Decision Support and Date Warehouse Jingyi Lu. Outline Decision Support System OLAP vs. OLTP What is Date Warehouse? Dimensional Modeling Extract, Transform,
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
Issues and Challenges for Integrated Surveillance Systems Daniel M. Sosin, MD, MPH Division of Public Health Surveillance and Informatics Epidemiology.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
The HMO Research Network (HMORN) is a well established alliance of 18 research departments in the United States and Israel. Since 1994, the HMORN has conducted.
A radiologist analyzes an X-ray image, and writes his observations on papers  Image Tagging improves the quality, consistency.  Usefulness of the data.
TechCon Food systems history… Agriculture has a 10,000 year history Farmers are estimated to be 38 to 45% of the global work force In the developing.
Managing Data Resources. File Organization Terms and Concepts Bit: Smallest unit of data; binary digit (0,1) Byte: Group of bits that represents a single.
IT Architectures for Handling Big Data in Official Statistics: the Case of Scanner Data in Istat Gianluca D’Amato, Annunziata Fiore, Domenico Infante,
The User Perspective Michelle Osmond. The Research Challenge Molecular biology, biochemistry, plant biology, genetics, toxicology, chemistry, and more.
California Water Plan Update Advisory Committee Meeting January 20, 2005.
Fundamentals of Information Systems, Sixth Edition Chapter 3 Database Systems, Data Centers, and Business Intelligence.
Clinical and Translational Science Awards (CTSAs) Future Directions and Opportunities for ACNP investigators.
Metadata Driven Clinical Data Integration – Integral to Clinical Analytics April 11, 2016 Kalyan Gopalakrishnan, Priya Shetty Intelent Inc. Sudeep Pattnaik,
HSPD-7 Critical Infrastructure Identification, Prioritization and Protection: designates EPA as the sector-specific lead agency for critical water infrastructure.
APHA, November 7, 2007 Amy Friedman Milanovich, MPH Head of Training and Dissemination Center for Managing Chronic Disease University of Michigan Using.
David Kaeli PROTECT Center Northeastern University Boston, MA
Organizations Are Embracing New Opportunities
NIEHS SRP P42 Research Center
Joslynn Lee – Data Science Educator
REDCap Administration Organizational Analysis
Data and Big Data in MPI David Austin and Gerald Rys
Operationalize your data lake Accelerate business insight
Database Management System (DBMS)
N. Capp, E. Krome, I. Obeid and J. Picone
Jeannette T. Bensen, MS, PhD, Director IHSFC
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie
} northeastern.edu/protect
Metadata Construction in Collaborative Research Networks
Bird of Feather Session
Presentation transcript:

David Kaeli PROTECT Center Northeastern University Boston, MA Developing a Data Management Model for Managing Diverse Data In Environmental Health Research

Key Collaborators and Contributors Akram Alshawabkeh (NU) Liza Anzalota del Toro (UPRMC) Jose Cordero (UPRMC) EarthSoft® Kelly Ferguson (UMich) Reza Ghasemizadeh (NU) Roger Giese (NU) Rachel Grashow (NU) Lauren Johns (UMich) Ingrid Padilla (UPRM) Harshi Weerasinghe (NU) Leiming Yu (NU)

Organization of this Presentation Challenges of managing diverse environmental health data to support a distributed Center Overview of the PROTECT Data Management and Modeling Core Data Analytics and Machine Learning Lessons learned

Puerto Rico Testsite for Exploring Contamination Threats (PROTECT) NIEHS SRP P42 Center since 2010 Cohort will include 1800 expectant mothers Key research questions: What is the contribution of environmental contaminants to preterm birth in Puerto Rico? Can we develop better strategies for detection and green remediation to minimize or prevent exposure to environmental contamination? Anticipated outcomes: Define the relationship between exposure to environmental contaminants and preterm birth Develop new technology for discovery, transport and exposure characterization, and green remediation of contaminants in karst systems Broader Impacts: Support environmental public health practice, policy and awareness around our theme

Puerto Rico Testsite for Exploring Contamination Threats (PROTECT) A Diversity of Data

Challenges with Managing Diverse Environmental Health Data Managing data for a multi-university (5) distributed partnership Northeastern University University of Puerto Rico Medical Campus University of Puerto Rico at Mayaguez University of Michigan University of West Virginia Diverse data cleaning requirements A range of data privacy/security requirements Diverse expertise/needs Environmental engineers, biochemists, electrochemists, toxicologists, epidemiologists, biostatisticians, pediatricians, agronomist, hydrogeologists, and social scientists

Challenges with Managing Diverse Environmental Health Data Challenges Develop a common vocabulary across disciplines Provide a common data import/cleaning mechanism Collect/manage distributed data Provide for the security/privacy of human subject data Provide on-line distributed sharing Provide on-line relational query capabilities Provide export to a large number of formats Provide customized tools and access on a per-project basis

Organization of this Presentation Challenges of managing diverse environmental health data to support a distributed Center Overview of the PROTECT Data Management and Modeling Core Data Analytics and Machine Learning Lessons learned

Responsible for the collection, storage, security and management of human subject, biological, survey data, and environmental data via a secure database system Leverages Dropbox, REDCap and Earthsoft commercial software Provides cleaning and secure/reliable storage Dropbox portal used for remote data transfer and data backup RedCap is used for data cleaning all Human Subject data (and other data in the future) EQuIS EDDs to perform data cleaning for all projects, as well as EQuIS Professional/Enterprise for managing data Enables PROTECT projects to cross-index data (based on subject ID, time and space-GIS) Provides data inventory, data analysis and mining Global data dictionary Comprehensive data inventory tools to track data collection status Data exports and analysis using EQuIS Enterprise PROTECT Data Management and Modeling Core

PROTECT Data By the Numbers (6/20/2015) Human Subject Core 3,193 total fields/participant Presently 15 different forms ~1.5M records Environmental Data Core 1048 wells (14 of them include water contaminant data) 35 springs (3 of them include water contaminant data) Field data 9 wells and 2 springs are sampled twice a year Tap water data: 34 homes 13 contaminants Targeted Biological Data Core 51 targeted chemicals * ~8 fields * # of participant 19 Phthalates and Phenols 18 Trace Metals 14 Pesticides Non-targeted Biological Data Core 5 fields, >1B data points in 6 urine samples Mass-to-charge values Data peaks

Data Sources Data Analysts Raw data EQuIS Schema ArcGIS Matlab SAS Cleaned Data Cleaned Data EQuIS EDP Data Producers

ArcGIS Matlab SAS Cleaned Data Cleaned Data EQuIS EZView Community Engagement Research Translation Education/Training Research Aims And Project Questions PROTECT Database Data Consumers

The PROTECT Database Environmental Human Subject and Sampling Epidemiologic/ Toxicologic GroundwaterTap Water Water Quality Parameters Temp, pH, ions, cond, NO 3 -, etc. Phthalates DEHP, DEP, DBP) Chlorinated VOCs TCE, PCE, CT, TCM, etc. Non-Targeted Historical Current Field Measurement Longitudinal Study 7 Clinics /3 Hospitals /In-House Interviews and Surveys Demographic, medical, residential, product use, food intake Medical Records Abstraction Biological Samples Blood, Urine, Hair, Placenta Recruitment of Pregnant Women HairPlacenta UrineBlood Metabolites Phthalates Parabens & Phenols Organo-phosphates Metals Hormones Oxidative stress markers Inflammatory markers Other Non-Targeted We have amassed billions of data records which are potentially related, and we have the capability to analyze them efficiently!!

PROTECT Database: Framework Data Checking Data Query/Report Data Analysis/Modeling Third-party Interfaces Automated Workflow Web-based Widgets  Form Template  First Screening  EQuIS Professional  EQuIS Enterprise

EQuIS Enterprise: Dashboards Developing customized dashboards for each project Default Dashboards  Welcome  Administration  EDP  Explorer  EZView  Notices

Organization of this Presentation Challenges of managing diverse environmental health data to support a distributed Center Overview of the PROTECT Data Management and Modeling Core Data Analytics and Machine Learning Lessons learned

Targeted Analysis example - Phthalates distributions over the time of the pregnancy 20 additional variables available, including hormone levels, oxidative stress and inflammation – biomarkers and chemical lists are expanding Courtesy of the Meeker U. of Michigan

CVOC 1982 CVOC 1983 CVOC 1984CVOC 1985 CVOC 1986 CVOC CVOC 1989 CVOC 1990 CVOC 1991 CVOC 1992 CVOC 1993 CVOC 1994 CVOC 1995 CVOC 1996 CVOC 1997 CVOC 1998 CVOC 1999 CVOC 2000 CVOC 2001 CVOC 2002 CVOC 2003 CVOC 2004 CVOC 2005 CVOC 2006 CVOC 2007 CVOC 2008 CVOC 2009CVOC 2010 CVOC 2011 Courtesy of the Padilla UPRM

PROTECT performs mass spectrum analysis to identify non-targeted chemicals in urine samples from expectant mothers A mass spectrum is a mass-to-charge ratio plot used to identify chemicals present in samples Urine samples from Boston and Puerto Rico are analyzed 6 urine samples produce millions of (m/z, intensity) pairs Each urine sample extracted is separated into 240 droplets Laser transforms the analyte into gas phase ions, which are then registered as analyte and intensity pairs Data Core engaged in accelerating the processing of these pairs to identify chemicals Non-Targeted Analysis: Urine Data

1176 urine sample files each containing ~130K (analyte, intensity) pairs, 4GB in total We want to identify compounds present in the samples Finding patterns in a 130K dimensional space is very difficult The result is impossible to visualize Limited processing capabilities of proprietary software that comes with the mass spectrometer ~20 minutes/sample on PCA processing Requires a very long research cycle when different scaling and weighting factors are needed for data processing Example Data Analysis – Mass spectrometer analysis of urine samples

Summary of Classification Workflow 100X speedup in processing time

Principle Component Analysis Results Li et al., “Big Data Analysis on Puerto Rico Testsite for Exploring Contamination Threats”, ALLDATA, Barcelona, Spain, 2015.

General Characteristics of Human Subject Data High dimensionality Data of the first visit has more than 240 attributes Data of the second visit has more than 800 attributes Mixed sparsity Less than 30% of the data fields from the first visit are either undefined or set to their default values More than 70% of the data fields from the second visit are either undefined or set to their default values Diverse data types Numerical: working hours, gestational age, etc. Categorical: Race, type of job, etc.

Exploring Clustering Techniques for Complex Data K-prototypes Clustering [Huang98] Combines k-means (numerical) and k-modes (categorical) clustering techniques Need to fill in the missing data ROCK [Guha99] Hierarchical clustering method No need to fill in the missing data Subspace clustering for high dimensional categorical data [Gan04] Clustering based on the dense portion of the attributes (subspace) No need to fill in the missing data Alternative multiple spectral clustering [Niu08] Combine multiple non-redundant clustering solutions Leverages spectral clustering and dimensionality reduction

Organization of this Presentation Challenges of managing diverse environmental health data to support a distributed Center Overview of the PROTECT Data Management and Modeling Core Data Analytics and Machine Learning Lessons learned

Lessons Learned To Date Design in security and privacy from the beginning Develop a comprehensive data dictionary across the entire program Provide an extensible data schema Centralize data management to facilitate cross-project analysis Develop inventory tools to track progress Meet with data producers and consumers on a frequent basis Allow individual researchers to continue to work with their cleaned data Leverage/customize commercial tools, building partnerships and leveraging best practices Utilize the state-of-the-art in data analytics, especially given the rate of innovation in the field Do not forget to include the computational resources needed

Acknowledgements This project is supported by Grant Award Number P42ES from the National Institute of Environmental Health Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences or the National Institutes of Health.