An Integrated and Comprehensive Data Mining System for Studying Environmental Impact of Nanomaterials: NEIMiner
Nano Working Group Presentation, 10/13/2011
Kaizhi Tang, Ph.D., David Mihalcik, Thomas Wavering, Roger Xu, Intelligent Automation Inc
Prof. Stacey Harper, OSU
Sue Pan, SAIC
Sponsor Agency: Dr. Jeff Steevens, Army ERDC
Outline
– Motivation and proposed approach
– NEI modeling framework
– Design of NEIMiner information system
– NEIMiner
Motivation and proposed approach of NEIMiner
NEED
– To reduce the risk of nanomaterials (NM) in military use, NM environmental impact analysis requires a comprehensive NEI modeling framework, a centralized NEI database, a powerful model discovery tool, and an integrated model composition strategy.
KEY COMPONENTS OF THE PROPOSED APPROACH
– Flexible data integration based on the ETL (Extract, Transform, Load) strategy of data warehousing
– Integrated and collaborative data management using a modern content management system
– Optimized data mining process covering many algorithms and parameters, which carries a heavy computational burden
– Flexible model composition based on a unified model abstraction, reusing FRAMES
DELIVERABLES
– Conceptual framework of NEI analysis
– Collaborative NEI information system with model discovery and composition capability
VALUE TO THE CUSTOMER / TRANSITION CUSTOMER
– Environmental impact estimation tool for nanomaterials
– Easy access to a large amount of NEI data in a centralized data warehouse, together with the available model generation tool
– Potentially useful evaluation models of NEI
Scope of NEI Modeling
[Figure: scope of NEI modeling. Recoverable labels: Collaboratory of Structural Nanobiology; NEI Data; NEI Data Mining Models]
NEIMiner System Architecture
[Figure: NEIMiner system architecture. Recoverable labels: NEI Data; NEI Data Mining Models]
Available NEI Data and Schemas
– Nanomaterial-Biological Interactions Knowledgebase
– Cancer Nanotechnology Laboratory portal (caNanoLab), NCI
– ICON: International Council on Nanotechnology, Rice University
– Nano-Tab: tab-delimited spreadsheet type based on EBI and ISA-TAB
– NanoParticle Ontology (NPO): implemented in OWL
Slide callouts
– Most complete characterization capture
– Largest number of publications, limited characterization capture
– Wide range of characterization and health impact data
Other Data and Schemas
– OECD Database on Research into Safety of Manufactured Nanomaterials
– National Institute for Occupational Safety and Health (NIOSH)
– SAFENANO - Institute of Occupational Health (UK)
– University of Wisconsin - Madison: Nanoscale Science and Engineering Center
– National Reference Center for Bioethics Literature - Georgetown University, Kennedy Institute of Ethics
– Nanomedicine Research Portal
– Center on Nanotechnology and Society (Chicago-Kent College of Law in the Illinois Institute of Technology)
Data Extraction Methods
Data extraction via web services
– Example: caNanoLab
Data extraction via web scraping
– Examples: ICON, NBI
– Approaches
   – Human copy-and-paste
   – HTTP programming
   – Text grepping and regular expression matching
   – HTML parsers
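The last two scraping approaches listed above (regular expression matching and HTML parsers) can be contrasted in a short sketch. This is only an illustration, not the extraction code used in NEIMiner: the HTML fragment, the column names, and the helper names `TableParser`, `parse_table`, and `grep_sizes` are all hypothetical, and the real ICON and NBI page layouts are not described in these slides.

```python
import re
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a scraped record page.
# The values are invented for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Material</th><th>Size (nm)</th></tr>
  <tr><td>TiO2</td><td>21</td></tr>
  <tr><td>Ag</td><td>10.5</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # row currently being read, or None
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_table(html):
    """HTML-parser approach: returns rows with their structure preserved."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def grep_sizes(html):
    """Regular-expression approach: pull numeric cells directly from the markup."""
    return [float(m) for m in re.findall(r"<td>\s*(\d+(?:\.\d+)?)\s*</td>", html)]


rows = parse_table(SAMPLE_HTML)
sizes = grep_sizes(SAMPLE_HTML)
```

The parser keeps the header and row structure, which makes it easier to map cells onto named attributes, while the regular expression is shorter but tied to one exact markup pattern and tends to break when the page layout changes.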
Design philosophy of NEI Data Warehouse
Data Warehouse
– Centralized data from multiple data sources for analysis => multiple nano-risk-related data sources with different formats
– Consists of an ETL tool, a database, a reporting tool, and data modeling => tools useful for NM data integration and mining
– Subject-oriented data organization => risk assessment for nanomaterials
– Multi-dimensional => various nanomaterial properties
– Star schema => extensible schema design
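To make the star schema idea concrete, here is a minimal sketch using an in-memory SQLite database. It is not the actual NEIMiner schema: every table name, column name, and data value below is a hypothetical placeholder chosen for illustration. One central fact table holds measured responses, and each dimension table describes one aspect of the measurement (material, data source, exposure scenario).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: three dimension tables around one fact table.
conn.executescript("""
CREATE TABLE dim_material (material_id INTEGER PRIMARY KEY, name TEXT, shape TEXT);
CREATE TABLE dim_source   (source_id   INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_exposure (exposure_id INTEGER PRIMARY KEY, route TEXT, duration_h REAL);
CREATE TABLE fact_response (
    material_id   INTEGER REFERENCES dim_material(material_id),
    source_id     INTEGER REFERENCES dim_source(source_id),
    exposure_id   INTEGER REFERENCES dim_exposure(exposure_id),
    mortality_pct REAL
);
""")

# Invented rows, for illustration only (not real measurements).
conn.executemany("INSERT INTO dim_material VALUES (?, ?, ?)",
                 [(1, "Ag", "sphere"), (2, "TiO2", "rod")])
conn.executemany("INSERT INTO dim_source VALUES (?, ?)",
                 [(1, "source A"), (2, "source B")])
conn.executemany("INSERT INTO dim_exposure VALUES (?, ?, ?)",
                 [(1, "waterborne", 24.0), (2, "dietary", 48.0)])
conn.executemany("INSERT INTO fact_response VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 40.0), (1, 2, 1, 60.0), (2, 1, 2, 10.0)])

# Multi-dimensional query: average response per material and exposure route,
# combining records that came from different sources.
summary = conn.execute("""
SELECT m.name, e.route, AVG(f.mortality_pct)
FROM fact_response f
JOIN dim_material m ON m.material_id = f.material_id
JOIN dim_exposure e ON e.exposure_id = f.exposure_id
GROUP BY m.name, e.route
ORDER BY m.name
""").fetchall()
```

Under this layout, adding a new property family (for example a surface chemistry dimension) means adding one dimension table and one foreign key column to the fact table, leaving existing dimensions untouched.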
NEI Model Discovery
Physical properties
– Material type
– Particle size distribution
– PDI
– Shape
– Structure
Chemical properties
– Surface reactivity
– Surface charge
– Water solubility
Exposure and study scenario
– Duration
– Continuity
– Exposure route
– Number of nanoparticles
– Number of ligands
Biological properties
– Species, age, gender, weight
Environmental ecosystem response
– Fate and transport
– Bioavailability and uptake
– Biomagnification
Biological response
– Genomic response
– Cell death
Correlation? Prediction?
Interesting Mining Problems and Solutions
How to handle missing data
– Median on numerical values
– Median-frequency categories
– Classification or regression using existing data
How to determine attribute significance
– Compare gain ratio for classification
– Compare relief ratio for numerical prediction
How to select algorithms and their parameters for training
– Meta-optimization on algorithms and parameters
How to split the data sets for high-quality models
– Comparing various splitting strategies
– Clustering as a preprocessing step
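Two of the solutions above, simple imputation and gain ratio comparison, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the NEIMiner implementation: the records are invented toy data, the helper names `impute`, `entropy`, and `gain_ratio` are hypothetical, and for categorical attributes the sketch fills gaps with the most frequent category, which is an assumption about what the slide's category rule intends. Gain ratio here is the usual definition: information gain divided by the split information of the attribute.

```python
import math
from collections import Counter
from statistics import median

# Invented toy records; None marks a missing value.
records = [
    {"size_nm": 10.0, "shape": "sphere", "route": "water", "toxic": "yes"},
    {"size_nm": None, "shape": "sphere", "route": "diet",  "toxic": "yes"},
    {"size_nm": 30.0, "shape": "rod",    "route": "water", "toxic": "no"},
    {"size_nm": 50.0, "shape": None,     "route": "diet",  "toxic": "no"},
]


def impute(rows, column, numeric):
    """Fill None in `column` with the median (numeric) or the most frequent category."""
    present = [r[column] for r in rows if r[column] is not None]
    fill = median(present) if numeric else Counter(present).most_common(1)[0][0]
    return [dict(r, **{column: fill}) if r[column] is None else dict(r) for r in rows]


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Information gain of `attribute` about `target`, divided by split information."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([r[attribute] for r in rows])
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return (entropy([r[target] for r in rows]) - remainder) / split_info


completed = impute(impute(records, "size_nm", numeric=True), "shape", numeric=False)

# Rank the categorical attributes by how informative they are about the class.
ranking = sorted(
    ["shape", "route"],
    key=lambda a: gain_ratio(completed, a, "toxic"),
    reverse=True,
)
```

On this toy data `shape` ranks above `route`, because `route` splits the records into groups with the same class mix as the whole set. The third option on the slide, predicting missing values with a classifier or regressor trained on the complete records, would replace `impute` with a learned model.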
Demonstration of NEIMiner