1
DIMACS Working Group on Data Mining and Epidemiology
2
What are the challenges for mathematical scientists in the defense against disease? This question led DIMACS, the Center for Discrete Mathematics and Theoretical Computer Science, to launch a “special focus” on this topic.
3
DIMACS Special Focus on Computational and Mathematical Epidemiology 2002-2005 Anthrax
4
Post-September 11 events soon led to an emphasis on bioterrorism. smallpox
5
Working Groups
6
Working Groups Continued
–Interdisciplinary, international groups of researchers.
–Come together at DIMACS.
–Informal presentations, lots of time for discussion.
–Emphasis on collaboration.
–Return as a full group or in subgroups to pursue problems/approaches identified in first meeting.
–By invitation; but contact the organizer.
–Junior researchers welcomed. Nominate them.
7
Working Groups
WG’s on Large Data Sets:
–Adverse Event/Disease Reporting, Surveillance & Analysis. Spin-off: Health Care Data Privacy and Confidentiality
–Data Mining and Epidemiology
8
WG’s on Analogies between Computers and Humans:
–Analogies between Computer Viruses/Immune Systems and Human Viruses/Immune Systems
–Distributed Computing, Social Networks, and Disease Spread Processes
9
WG’s on Methods/Tools of TCS:
–Phylogenetic Trees and Rapidly Evolving Diseases
–Order-Theoretic Aspects of Epidemiology
10
WG’s on Computational Methods for Analyzing Large Models for Spread/Control of Disease:
–Spatio-temporal and Network Modeling of Diseases
–Methodologies for Comparing Vaccination Strategies
11
WG’s on Mathematical Sciences Methodologies:
–Mathematical Models and Defense Against Bioterrorism
–Predictive Methodologies for Infectious Diseases
–Statistical, Mathematical, and Modeling Issues in the Analysis of Marine Diseases
12
Data Mining and Epidemiology –Interest sparked in part by availability of large and disparate computerized databases on subjects relating to disease
13
Early warning is critical in public health
–This is a crucial factor underlying government’s plans to place networks of sensors/detectors to warn of a bioterrorist attack
–Sensors will be a source of huge amounts of data
The BASIS System
14
The DIMACS Bioterrorism Sensor Location Project
15
Data Mining and Epidemiology: Some Research Issues:
16
1. Streaming Data Analysis: When you only have one shot at the data
–Widely used to detect trends and sound alarms in applications in telecommunications and finance
–AT&T uses this to detect fraudulent use of credit cards or impending billing defaults
–Columbia has developed methods for detecting fraudulent behavior in financial systems
–Uses algorithms based in TCS
–Needs modification to apply to disease detection
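To make the "one shot at the data" constraint concrete, here is a minimal sketch of a one-pass alarm on daily case counts. It is an illustration only, not one of the AT&T or Columbia methods mentioned on the slide: it assumes Welford's running mean/variance update, and the class name `StreamingAlarm`, the threshold, the warm-up length, and the counts are all hypothetical.

```python
import math


class StreamingAlarm:
    """One-pass detector: each value is seen once, then discarded."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # alarm if value > mean + threshold * std
        self.warmup = warmup        # observations needed before alarms can fire
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x):
        """Return True if x is unusually high relative to the history so far."""
        alarm = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            alarm = x > self.mean + self.threshold * max(std, 1e-9)
        # Welford's update: constant memory, no stored history.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return alarm


# Hypothetical daily counts of a syndrome at one reporting site.
daily_counts = [12, 9, 11, 10, 13, 11, 10, 42, 12]
detector = StreamingAlarm()
alarms = [day for day, count in enumerate(daily_counts) if detector.update(count)]
print(alarms)
```

Note that the spike itself is folded into the running statistics, which inflates the variance afterwards; a disease-detection variant would likely need to handle that, which is one reading of "needs modification" above.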
17
Research Issues:
–Modify methods of data collection, transmission, processing, and visualization
–Explore use of decision trees, vector-space methods, Bayesian and neural nets
–How are the results of monitoring systems best reported and visualized?
–To what extent can they trigger fast and safe automated responses?
–How are relevant queries best expressed, giving the user sufficient power while implicitly restraining him/her from incurring unwanted computational overhead?
18
2. Cluster Analysis
–Used to extract patterns from complex data
–Application of traditional clustering algorithms hindered by extreme heterogeneity of the data
–Newer clustering methods based on TCS for clustering heterogeneous data need to be modified for infectious disease and bioterrorist applications.
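One way to see why heterogeneity matters: records that mix numeric and categorical fields need a dissimilarity that handles both. The sketch below assumes a Gower-style distance with a simple single-pass "leader" grouping. It is not one of the TCS-based methods the slide refers to, and the field names, ranges, and radius are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-field dissimilarity for records mixing numbers and categories."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            # Numeric field: absolute difference scaled by its assumed range.
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            # Categorical field: 0 if equal, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def leader_clusters(records, numeric_ranges, radius):
    """Assign each record to the first cluster whose leader is within radius."""
    clusters = []  # each cluster is a list; its first element is the leader
    for rec in records:
        for cluster in clusters:
            if gower_distance(rec, cluster[0], numeric_ranges) <= radius:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


# Hypothetical patient records from different reporting sources.
records = [
    {"age": 34, "temp_c": 39.1, "symptom": "fever", "source": "ER"},
    {"age": 36, "temp_c": 38.9, "symptom": "fever", "source": "clinic"},
    {"age": 70, "temp_c": 36.8, "symptom": "rash", "source": "ER"},
    {"age": 68, "temp_c": 37.0, "symptom": "rash", "source": "ER"},
]
ranges = {"age": 100.0, "temp_c": 5.0}
clusters = leader_clusters(records, ranges, radius=0.3)
print([len(c) for c in clusters])
```

The result depends on record order and on the chosen radius, which is part of why simple schemes like this struggle on very heterogeneous data.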
19
3. Visualization Large data sets are sometimes best understood by visualizing them.
20
3. Visualization (continued) Sheer data sizes require new visualization regimes, which require suitable external memory data structures to reorganize tabular data to facilitate access, usage, and analysis. Visualization algorithms become harder when data arises from various sources and each source contains only partial information.
21
4. Data Cleaning Disease detection problem: Very “dirty” data:
22
4. Data Cleaning (continued)
Very “dirty” data due to
–manual entry
–lack of uniform standards for content and formats
–data duplication
–measurement errors
TCS-based methods of data cleaning
–duplicate removal
–“merge purge”
–automated detection
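As a rough illustration of duplicate removal in the "merge purge" style, the sketch below sorts records by a normalized key and compares each record only with a few recently kept neighbors (the sorted-neighborhood idea). The field names, window size, and similarity threshold are assumptions, and string similarity comes from Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Build a comparison key that smooths over case and stray whitespace."""
    return " ".join(str(record[f]).lower().strip() for f in ("name", "dob", "zip"))


def merge_purge(records, window=3, threshold=0.85):
    """Sort by key, then drop records too similar to a recently kept neighbor."""
    kept = []
    for rec in sorted(records, key=normalize):
        key = normalize(rec)
        recent = kept[-window:]  # compare only against nearby kept records
        is_duplicate = any(
            SequenceMatcher(None, key, normalize(other)).ratio() >= threshold
            for other in recent
        )
        if not is_duplicate:
            kept.append(rec)
    return kept


# Hypothetical manually entered records with inconsistent formatting.
reports = [
    {"name": "John Smith", "dob": "1970-01-02", "zip": "08854"},
    {"name": "Jon Smith ", "dob": "1970-01-02", "zip": "08854"},
    {"name": "Mary Jones", "dob": "1982-11-30", "zip": "10027"},
    {"name": "JOHN SMITH", "dob": "1970-01-02", "zip": "08854"},
]
deduped = merge_purge(reports)
print([r["name"] for r in deduped])
```

Sorting first keeps the number of comparisons roughly linear in the number of records, at the cost of missing duplicates whose keys sort far apart.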
23
5. Dealing with “Natural Language” Reports
–Devise effective methods for translating natural language input into formats suitable for analysis.
–Develop computationally efficient methods to provide automated responses consisting of follow-up questions.
–Develop semi-automatic systems to generate queries based on dynamically changing data.
24
6. Cryptography and Security
–Devise effective methods for protecting privacy of individuals about whom data is provided to biosurveillance teams: data from emergency dept. visits, doctor visits, prescriptions
–Develop ways to share information between databases of intelligence agencies while protecting privacy.
25
6. Cryptography and Security (continued)
Specifically: How can we make a simultaneous query to two datasets without compromising information in those data sets? (E.g., is individual xx included in both sets?)
Issues include:
–ensuring accuracy and reliability of responses
–authentication of queries
–policies for access control and authorization
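A deliberately simplified sketch of the "is individual xx in both sets?" query follows. It assumes the two data holders share a secret key and exchange only keyed hashes (HMAC-SHA256) of identifiers. This is a naive stand-in rather than a full private set intersection protocol: anyone holding the key can test guessed identifiers, and it does not address query authentication or access control. The key, identifiers, and function names are hypothetical.

```python
import hashlib
import hmac


def blind_one(identifier, key):
    """Keyed hash of a normalized identifier; the raw value is never shared."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def blind(identifiers, key):
    """Blind a whole dataset so only digests leave the data holder."""
    return {blind_one(ident, key) for ident in identifiers}


def in_both(identifier, blinded_a, blinded_b, key):
    """Check membership in both blinded sets without seeing either raw set."""
    token = blind_one(identifier, key)
    return token in blinded_a and token in blinded_b


# Hypothetical shared key and identifiers held by two separate databases.
shared_key = b"example-key-agreed-out-of-band"
hospital_ids = blind(["xx-1001", "xx-1002", "xx-1003"], shared_key)
pharmacy_ids = blind(["XX-1002", "xx-1004"], shared_key)

print(in_both("xx-1002", hospital_ids, pharmacy_ids, shared_key))
print(in_both("xx-1001", hospital_ids, pharmacy_ids, shared_key))
```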
26
7. Spatio-Temporal Mining of Sensor Data
–Sensors provide observations of the state of the world localized in space and time.
–Finding trends in data from individual sensors: time series data mining.
–Detecting general correlations in multiple time series of observations.
–This has been studied in statistics, database theory, knowledge discovery, data mining.
–Complications: proximity relationships based on geography; complex chronological effects.
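As a small illustration of detecting correlations across multiple time series, the sketch below computes the Pearson correlation between two sensors at several time lags and reports the lag with the strongest correlation. The sensor readings are made up, and lagged correlation is only one simple technique; it ignores the geographic proximity and chronological complications noted above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def best_lag(upstream, downstream, max_lag):
    """Find the delay (in time steps) at which downstream best tracks upstream."""
    scores = {}
    for lag in range(max_lag + 1):
        xs = upstream[: len(upstream) - lag]
        ys = downstream[lag:]
        scores[lag] = pearson(xs, ys)
    return max(scores, key=scores.get), scores


# Hypothetical readings: sensor_b repeats sensor_a's pattern two steps later.
sensor_a = [1, 2, 8, 3, 1, 2, 9, 2, 1, 1]
sensor_b = [0, 0, 1, 2, 8, 3, 1, 2, 9, 2]

lag, scores = best_lag(sensor_a, sensor_b, max_lag=3)
print(lag)
```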