Geoinformatics Seminar G. P. Patil March 2003

Slides:

Advertisements

Similar presentations

Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

Advertisements

Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Lecture Slides Elementary Statistics Eleventh Edition and the Triola.

Decision Making: An Introduction 1. 2 Decision Making Decision Making is a process of choosing among two or more alternative courses of action for the.

Raster Based GIS Analysis

Report on Intrusion Detection and Data Fusion By Ganesh Godavari.

Geographic Information Systems

Development of Empirical Models From Process Data

Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.

CHAPTER 6 Statistical Analysis of Experimental Data

Lecture II-2: Probability Review

Geographic Information Science

EE513 Audio Signals and Systems Statistical Pattern Classification Kevin D. Donohue Electrical and Computer Engineering University of Kentucky.

The Spatial Scan Statistic. Null Hypothesis The risk of disease is the same in all parts of the map.

RESEARCH A systematic quest for undiscovered truth A way of thinking

Lecture 12 Statistical Inference (Estimation) Point and Interval estimation By Aziza Munir.

Intelligent Vision Systems ENT 496 Object Shape Identification and Representation Hema C.R. Lecture 7.

Report on Intrusion Detection and Data Fusion By Ganesh Godavari.

ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Deterministic vs. Random Maximum A Posteriori Maximum Likelihood Minimum.

INTERACTIVE ANALYSIS OF COMPUTER CRIMES PRESENTED FOR CS-689 ON 10/12/2000 BY NAGAKALYANA ESKALA.

Various topics Petter Mostad Overview Epidemiology Study types / data types Econometrics Time series data More about sampling –Estimation.

Extent and Mask Extent of original data Extent of analysis area Mask – areas of interest Remember all rasters are rectangles.

INTRODUCTION TO GIS  Used to describe computer facilities which are used to handle data referenced to the spatial domain.  Has the ability to inter-

What is GIS? “A powerful set of tools for collecting, storing, retrieving, transforming and displaying spatial data”

Machine Vision Edge Detection Techniques ENT 273 Lecture 6 Hema C.R.

Spatial Data Models Geography is concerned with many aspects of our environment. From a GIS perspective, we can identify two aspects which are of particular.

Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.

NIEHS G. P. Patil. This report is very disappointing. What kind of software are you using?

Spatial Scan Statistic for Geographical and Network Hotspot Detection C. Taillie and G. P. Patil Center for Statistical Ecology and Environmental Statistics.

1 Forum for Interdisciplinary Mathematics Patna, India G. P. Patil December 2010.

Project Geoinformatic Surveillance NSF DGP Grant G. P. Patil, Penn State, PI EPA: Watershed Characterization and Prioritization PADOH: Disease Clusters.

1 Surveillance GeoInformatics Hotspot Detection, Prioritization, and Early Warning G. P. Patil December 2004 – January 2005.

1 NJ DHSS CES SEER G. P. Patil January 17, This report is very disappointing. What kind of software are you using?

Geographic and Network Surveillance for Arbitrarily Shaped Hotspots Overview Geospatial Surveillance Upper Level Set Scan Statistic System Spatial-Temporal.

1 Multi-criterion Ranking and Poset Prioritization G. P. Patil December 2004 – January 2005.

1 Spatial Temporal Surveillance. 2 3 Geographic Surveillance and Hotspot Detection for Homeland Security: Cyber Security and Computer Network Diagnostics.

CPH Dr. Charnigo Chap. 14 Notes In supervised learning, we have a vector of features X and a scalar response Y. (A vector response is also permitted.

Research Design

1 Fukuoka Conference, Japan G. P. Patil November 2005.

4.6.1 Upper Echelons of Surfaces

Howard Community College

APPLICATIONS FOR STRATEGIC ASSESSMENT,

Big data classification using neural network

Virtual University of Pakistan

Spatially Constrained Clustering and Upper Level Set Scan Hotspot Detection in Surveillance GeoInformatics G.P.Patil, Penn State University Reza Modarres,

3.1 Clustering Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any.

Chapter 7. Classification and Prediction

Introduction to Spatial Statistical Analysis

INTRODUCTORY STATISTICS FOR CRIMINAL JUSTICE

INTRODUCTION TO GEOGRAPHICAL INFORMATION SYSTEM

Introduction to Functions of Several Variables

We propose a method which can be used to reduce high dimensional data sets into simplicial complexes with far fewer points which can capture topological.

EPA Presentation March 13,2003 G. P. Patil

Lecture Notes for Chapter 2 Introduction to Data Mining

Fast Kernel-Density-Based Classification and Clustering Using P-Trees

Quantifying Scale and Pattern Lecture 7 February 15, 2005

Summary of Prev. Lecture

NSF Digital Government surveillance geoinformatics project, federal agency partnership and national applications for digital governance.

Mean Shift Segmentation

Introduction to Statistics

REMOTE SENSING Multispectral Image Classification

Cartographic and GIS Data Structures

Indexing and Hashing Basic Concepts Ordered Indices

Geographic and Network Surveillance for Arbitrarily Shaped Hotspots

Visualization and Analysis of Air Pollution in US East Coast Cities

EE513 Audio Signals and Systems

Causal Models Lecture 12.

Lecture # 2 MATHEMATICAL STATISTICS

Parametric Methods Berlin Chen, 2005 References:

Albany New York (1) G. P. Patil

Data Pre-processing Lecture Notes for Chapter 2

Presentation transcript:

Geoinformatics Seminar G. P. Patil March 2003

This report is very disappointing. What kind of software are you using?

Stone-Age Space-Age Syndrome Stone-age data Space-age data Stone-age analysis Space-age analysis

PROJECT GEOINFORMATIC SURVEILLANCE DECISION SUPPORT SYSTEM Geographic and Network Surveillance for Arbitrary Shaped Hotspots -Next Generation of Geographic Hotspot Detection, Prioritization, and Early Warning System-

Geographical Surveillance Discrete response Hotspot detection and upper level set scan Hotspot delineation and hot-spot rating Multiple hotspot detection and delineation Hotspot prioritization and poset ranking Space-time detection and early warning Continuous response User friendly software and downloadable website

Areas of Application Biodiversity, species-rich, and species-poor areas Water resources at watershed scales Power lines and their effects Networks of water distribution systems, subway systems, and road transport systems Urban and regional planning Disease epidemiology Medical imaging Reconnaissance Astronomy Archaeology

Spatial Scan Statistic Limitations and Needs Circles capture only compactly shaped clusters Want to identify clusters of arbitrary shape Circles handle only synoptic (tessellated ) data Want to handle data on a network Circles provide point estimate of hotspot Want to assess estimation uncertainty (hotspot confidence set)

Hotspot Detection Innovation Upper Level Set Scan Statistic Attractive Features Identifies arbitrarily shaped clusters Data-adaptive zonation of candidate hotspots Applicable to data on a network Provides both a point estimate as well as a confidence set for the hotspot Uses hotspot-membership rating to map hotspot boundary uncertainty Computationally efficient Applicable to both discrete and continuous syndromic responses Identifies arbitrarily shaped clusters in the spatial-temporal domain Provides a typology of space-time hotspots with discriminatory surveillance potential

Prioritization Innovation Partial Order Set Ranking We also present a prioritization innovation. It lies in the ability for prioritization and ranking of hotspots based on multiple indicator and stakeholder criteria without having to integrate indicators into an index, using Hasse diagrams and partial order sets. This leads us to early warning systems, and also to the selection of investigational areas.

Syndromic Surveillance Symptoms of disease such as diarrhea, respiratory problems, headache, etc Earlier reporting than diagnosed disease Less specific, more noise

The Spatial Scan Statistic Move a circular window across the map. Use a variable circle radius, from zero up to a maximum where 50 percent of the population is included.

A small sample of the circles used

For each circle: Compare Circles: Inference: Obtain actual and expected number of cases inside and outside the circle. Calculate Likelihood Function Compare Circles: Pick circle with maximum likelihood. This is the most likely cluster, i.e., the cluster least likely to have occurred by chance Inference: Generate random replicas of the data set under the null-hypothesis of no clusters (Monte Carlo sampling) Compare most likely clusters in real and random data sets (Likelihood ratio test)

Zonal p-Values -- 1 Zonal p-values depend upon: zonal intensity zonal sample size Example: 1 case among 5 persons is not significant 1 thousand cases among 5 thousand persons might well significant

Zonal p-Values -- 2 Contour of constant p-values (p = .05) Sample Size Zonal Intensity Contour of constant p-values (p = .05) Zone 1 Zone 2 Zone 3 Zone 1 is not significant (high intensity but small sample size) Zone 2 is significant even though its intensity is smaller than that of Zone 1 Zone 3 is not significant due to low intensity

Spatial Scan Statistic: Properties Adjusts for inhomogeneous population density. Simultaneously tests for clusters of any size and any location, by using circular windows with continuously variable radius. Accounts for multiple testing. Possibility to include confounding variables, such as age, sex or socio-economic variables. Aggregated or non-aggregated data (states, counties, census tracts, block groups, households, individuals).

Detecting Emerging Clusters Instead of a circular window in two dimensions, we use a cylindrical window in three dimensions. The base of the cylinder represents space, while the height represents time. The cylinder is flexible in its circular base and starting date, but we only consider those cylinders that reach all the way to the end of the study period. Hence, we are only considering ‘alive’ clusters.

West Nile Virus Surveillance in New York City 2000 Data: Simulation/Testing of Prospective Surveillance System 2001 Data: Real Time Implementation of Daily Prospective Surveillance

2000 Data - Dead birds reported by the public - Simulation of a daily prospective surveillance system - Start date: June 1, 2000.

Major epicenter on Staten Island Dead bird surveillance system: June 14 Positive bird report: July 16 (coll. July 5) Positive mosquito trap: July 24 (coll. July 7) Human case report: July 28 (onset July 20)

Hospital Emergency Admissions in New York City Hospital emergency admissions data from a majority of New York City hospitals. At midnight, hospitals report last 24 hour of data to New York City Department of Health A spatial scan statistic analysis is performed every morning If an alarm, a local investigation is conducted

Bird richness in the hexagons.

Statewide echelon map based on EMAP hexagons Statewide echelon map based on EMAP hexagons. The 4-digit number in each hexagon is the EPA-EMAP identifier, while the number below is species richness.

Aggregation-Resolution Issues

Aggregation-Resolution Issues

Ridge and Valley Hotspots Hotspots within Hotspots Ridge and Valley Hotspots Ridge and Valley hotspots within hotspots.

Candidate Zones for Hotspots Goal: Identify geographic zone(s) in which a response is significantly elevated relative to the rest of a region A list of candidate zones Z is specified a priori This list becomes part of the parameter space and the zone must be estimated from within this list Each candidate zone should generally be spatially connected, e.g., a union of contiguous spatial units or cells Longer lists of candidate zones are usually preferable Expanding circles or ellipses about specified centers are a common method of generating the list

Poor Hotspot Delineation by Circular Zones Circular zone approximations Circular zones may represent single hotspot as multiple hotspots

Circular spatial scan statistic zonation Cholera outbreak along a river flood plain • Small circles miss much of the outbreak Large circles include many unwanted cells

Urban Heat Islands Use of aircraft and spacecraft remote sensing data on a local scale to help quantify and map urban sprawl, land use change, urban heat island, air quality, and their impact on human health (e.g. pediatric asthma)

Baltimore Asthma Project Interdisciplinary Analysis of Childhood Asthma in Baltimore, MD The impact of asthma is escalating within the U.S. and children are particularly impacted with hospitalization increasing 74% since 1979. This study is investigating climate and environmental links to asthma in Baltimore, Maryland, a city in the top quintile for children’s asthma in the U.S. Asthma Assessment from GIS techniques Inner Harbor Time Aerosol Size Collect and integrate in-situ measurements, remotely sensed measurements and clinical records that have possible relationships to the occurrence of asthma in the Baltimore, Maryland region. Identify key trigger variables from the data to predict asthma occurrence on a spatial and temporal basis. Organize a multidisciplinary team to assist in model design, analysis and interpretation of model results. Develop tools for integrating, accessing and manipulating relevant health and remote sensing data and make these tools available to the scientific and health communities. Partners: Baltimore City Health Department Baltimore City School System Baltimore City Planning Council, Mayor’s Office State of Maryland Department of the Environment State of Maryland Department of Health and Human Services University of Maryland

Crime Rate Hotspots

Cylindrical space-time scan statistic zonation Outbreak expanding in time Small cylinders miss much of the outbreak Large cylinders include many unwanted cells Space Time

Some Space-Time Hotspots and Their Cylindrical Approximations Cylindrical approximation sees single hotspot as multiple hotspots Space Time

ULS Candidate Zones Question: Are there data-driven (rather than a priori) ways of selecting the list of candidate zones? Motivation for the question: A human being can look at a map and quickly determine a reasonable set of candidate zones and eliminate many other zones as obviously uninteresting. Can the computer do the same thing? A data-driven proposal: Candidate zones are the connected components of the upper level sets of the response surface. The candidate zones have a tree structure (echelon tree is a subtree), which may assist in automated detection of multiple, but geographically separate, elevated zones. Null distribution: If the list is data-driven (i.e., random), its variability must be accounted for in the null distribution. A new list must be developed for each simulated data set.

ULS Scan Statistic Data-adaptive approach to reduced parameter space 0 Zones in 0 are connected components of upper level sets of the empirical intensity function Ga = Ya / Aa Upper level set (ULS) at level g consists of all cells a where Ga  g Upper level sets may be disconnected. Connected components are the candidate zones in 0 These connected components form a rooted tree under set inclusion. Root node = entire region R Leaf nodes = local maxima of empirical intensity surface Junction nodes occur when connectivity of ULS changes with falling intensity level

Upper Level Set (ULS) of Intensity Surface Hotspot zones at level g (Connected Components of upper level set)

Changing Connectivity of ULS as Level Drops

ULS Connectivity Tree Schematic intensity “surface” N.B. Intensity surface is cellular (piece-wise constant), with only finitely many levels A, B, C are junction nodes where multiple zones coalesce into a single zone A B C

ULS Connectivity Tree -1 Ingredients: Tessellation of a geographic region: Intensity value G on each cell. Determines a cellular (piece-wise constant) surface with G as elevation. Imagine surface initially inundated with water Water evaporates gradually exposing the surface which appears as islands in the sea How does connectivity (number of connected components) of the exposed surface change with falling water level? a c b k d e f g h i j a, b, c, … are cell labels

ULS Connectivity Tree -2 Think of the tessellated surface as a landform Initially the entire surface is under water As the water level recedes, more and more of the landform is exposed At each water level, cells are colored as follows: Green for previously exposed cells (green = vegetated) Yellow for newly exposed cells (yellow = sandy beach) Blue for unexposed cells (blue = under water) For each newly exposed cell, one of three things happens: New island emerges. Cell is a local maximum. Morse index=2. Connectivity increases. Existing island increases in size. Cell is not a critical point. Connectivity unchanged. Two (or more) islands are joined. Cell is a saddle point Morse index=1. Connectivity decreases.

ULS Connectivity Tree -3 Newly exposed island ULS Tree g j a f d h b c a i e k Island grows a g j f d h b,c b c a i e k

ULS Connectivity Tree -4 Second island appears ULS Tree a g j d f d h b,c b c a New leaf node (local maximum) i e k Both islands grow a g b,c j d f d h e b f,g c a i e k

ULS Connectivity Tree -5 ULS Tree Islands join – saddle point a b,c g j d f d h e f,g b c a i e h k Junction node a Exposed land grows b,c g d j f d h e f,g b a c i e h k Root node i,j,k

Comparison of Tree-Structured and Circle-Based SATScan Agreement/Disagreement regarding hotspot locus Comparative plausibility and accuracy of hotspot delineation Execution time and computer efficiency

Confidence Region on ULS Tree A hotspot confidence set with two connected components is shown on the ULS tree. The connected components correspond to different hotspot loci while the nodes within a connected component correspond to different delineations of that hotspot – all at the appropriate confidence level.

Estimation Uncertainty in Hotspot Delineation Outer envelope MLE Inner envelope

Hotspot Delineation and Hotspot Rating Determine a confidence set for the hotspot Each member of the confidence set is a zone which is a statistically plausible delineation of the hotspot at specified confidence Confidence set lets us rate individual cells a for hotspot membership Rating for cell a is percentage of zones in confidence set that contain a. (More generally, use weighted proportion.) Map of cell ratings: Inner envelope = cells with 100% rating Outer envelope = cells with positive rating

Ranking Possible Disease Clusters in the State of New York Data Matrix

Regions of comparability and incomparability for the inherent importance ordering of hotspots. Hotspots form a scatterplot in indicator space and each hotspot partitions indicator space into four quadrants

Hotspot Prioritization and Poset Ranking Multiple hotspots with intensities significantly elevated relative to the rest of the region Ranking based on likelihood values, and additional attributes: raw intensity values, socio-economic and demographic factors, feasibility scores, excess cases, seasonal residence, atypical demographics, etc. Multiple attributes, multiple indicators Ranking without having to integrate the multiple indicators into a composite index

Second stage screening Multiple Criteria Analysis, Multiple Indicators and Choices, Health Statistics, Disease Etiology, Health Policy, Resource Allocation First stage screening Significant clusters by SaTScan and/or upper level sets Second stage screening Multicriteria noteworthy clusters by partially ordered sets and Hass diagrams Final stage screening Follow up clusters for etiology, intervention based on multiple criteria using Hass diagrams

HUMAN ENVIRONMENT INTERFACE LAND, AIR, WATER INDICATORS for land - % of undomesticated land, i.e., total land area-domesticated (permanent crops and pastures, built up areas, roads, etc.) for air - % of renewable energy resources, i.e., hydro, solar, wind, geothermal for water - % of population with access to safe drinking water RANK COUNTRY LAND AIR WATER Sweden Finland Norway 5 Iceland 13 Austria 22 Switzerland 39 Spain 45 France 47 Germany 51 Portugal 52 Italy 59 Greece 61 Belgium 64 Netherlands 77 Denmark 78 United Kingdom 81 Ireland 69.01 76.46 27.38 1.79 40.57 30.17 32.63 28.34 32.56 34.62 23.35 21.59 21.84 19.43 9.83 12.64 9.25 35.24 19.05 63.98 80.25 29.85 28.10 7.74 6.50 2.10 14.29 6.89 3.20 0.00 1.07 5.04 1.13 1.99 100 98 82

Hasse Diagram (all countries)

Hasse Diagram (Western Europe)

Ranking Partially Ordered Sets – 5 Poset (Hasse Diagram) Linear extension decision tree a b a b c d c b a d e b c d c d a e f b e d e d c e d c c d d e f d e f e d e f e f e f f f f f e f f e f e f f e f e f e Jump Size: 1 3 3 2 3 5 4 3 3 2 4 3 4 4 2 2

Cumulative Rank Frequency Operator – 5 An Example of the Procedure In the example from the preceding slide, there are a total of 16 linear extensions, giving the following cumulative frequency table. Rank Element 1 2 3 4 5 6 a 9 14 16 b 7 12 15 c 10 d e f Each entry gives the number of linear extensions in which the element (row label) receives a rank equal to or better that the column heading

Cumulative Rank Frequency Operator – 6 An Example of the Procedure 16 The curves are stacked one above the other and the result is a linear ordering of the elements: a > b > c > d > e > f

Original Poset (Hasse Diagram) Cumulative Rank Frequency Operator – 7 An example where F must be iterated F 2 Original Poset (Hasse Diagram) a f e b c g d h F

Original Poset (Hasse Diagram) Cumulative Rank Frequency Operator – 8 An example where F results in ties Original Poset (Hasse Diagram) a c b d a b, c (tied) d F Ties reflect symmetries among incomparable elements in the original Hasse diagram Elements that are comparable in the original Hasse diagram will not become tied after applying F operator

Incorporating Judgment Poset Cumulative Rank Frequency Approach Certain of the indicators may be deemed more important than the others Such differential importance can be accommodated by the poset cumulative rank frequency approach Instead of the uniform distribution on the set of linear extensions, we may use an appropriately weighted probability distribution  , e.g.,

Network-based Surveillance Subway system surveillance Drinking water distribution system surveillance Stream and river system surveillance Postal System Surveillance Road transport surveillance Syndromic Surveillance

Figure 1. Upper Juniata river network with superimposed network of Pennsylvania Department of Environmental Protection (DEP) sampling stations. In Phase 1, scan statistics methods will be used to identify hotspots of biological impairment. Phase 2 incorporates potential explanatory factors and identifies residual hotspots that are subjected to detailed network modeling in Phase 3.

Procedure for obtaining the crisis index for network elements.

(left) The overall procedure, leading from admissions records to the crisis index for a hospital. The hotspot detection algorithm is then applied to the crisis index values defined over the hospital network. (right) The -machine procedure for converting an event stream into a parse tree and finally into a probabilistic finite state automaton (PFSA).

Using -complexity for Network Behavior Analysis David Friedlander (dsf10@psu.edu) Shashi Phoha (spx26@psu.edu) Richard Brooks (rrb5@psu.edu) Penn State / ARL The goal is to recognize meaningful patterns of activity caused by intruders within the network. We have selected mechanism powerful enough to represent intentional behavior while allowing quantitative analysis such as pattern matching. We are researching regular formal languages, which can be represented as finite state automata, to achieve this purpose. This presentation will show how formal languages can be derived from sensor network data and used to classify the underlying behaviors. We are working with UCLA in this area. Finally, I will present plans for continuing this work within the ESP program.

Tools for Recognizing Target Behavior From Network Measurements Symbolization Network Measurements (streams of numbers) Stream of symbols Conversion Higher level representation This slide outlines the problem addressed by this research. We have developed and applied methods to recognize higher level behaviors in the sense that sequences of actions belonging to the same formal language will be classified together. A goal of this research is to investigate whether behaviors based on formal languages can be used to represent the behavior of an intruder as would be inferred by a human observer. Representations of Known Behaviors Target Behavior Behavior Recognition

Natural Language Definition (Merriam-Webster’s Collegiate Dictionary) Behavior: 1 b : anything that an organism does involving action and response to stimulation c : the response of an individual, group, or species to its Technical Definitions The concept of a behavior is critical to this work. The natural language definition must be closely tied to the formal definition, which is used in the mathematical analysis, so that the results will be relevant to the ultimate end users. Definitions 1b and 1c are the ones most specifically relevant to surveillance. The only change would be the possible substitution of mechanism for organism. Behavior is defined in an open-ended way, as any finite or infinite set of sequences of actions and observations. It is left to the requirements phase to determine which behaviors are of interest, and therefore should be recognized. We will be addressing only regular formal languages at this stage. They include all finite sets of sequences and some infinite ones. In real-life situations, we do not need to consider arbitrarily long sequences of actions, so this restriction should not be too limiting. The last two definitions relate to Discrete Event Systems Control Theory. This is convenient for designing systems that to take action based on the behavior of the intruding mechanisms. Behavior → Pattern of observations and actions Pattern → Formal language Observations → Uncontrollable events Actions → Controllable events

Symbolization: Network Sensor Readings to Symbolic Dynamics Phase-Space Trajectory String of Symbols h a b g d c f e ……f c g h d a d c…… Sensor 3 The first step is to convert the continuous, multidimensional sensor network data into a stream of symbols. This is done by dividing the phase space of the system into n dimensional cubes where n is the dimension of the system. A symbol is output when the phase space trajectory of the system enters the corresponding cube. For realistic systems, this would result in too large an alphabet and too many states in the resulting automaton. The effective dimension can often be reduced using techniques from nonlinear dynamics or other areas. Sensor 2 Sensor 1

Conversion Tools: Stream of Symbols to FSA ……f c g h e d a d c…… A finite sample of an infinite stream of symbols can be used to define an automaton that approximates the underlying formal language. We will explain our techniques for doing this. Which defines a formal language of the target behavior

Conversion via topological complexity method 1. Language Sample 3. Merge topologically similar subtrees ……abaaabaaababababaaabab…. 2. Tree of all substrings of length l. a b a a,b b The above slide shown a sample of a stream of symbols generated by a sensor network. Since the stream has no beginning or end, there is no well-defined starting state. A tree structure can be created from the stream that contains all of its substrings up to some length, l. This tree appears to have fractal properties, i.e., the level 1 subtree of the right contains two copies of the level 1 subtree on the left. By merging equivalent states in a systematic way, we can derive an automaton that is essentially equivalent to the original unknown formal language. The topological complexity is defined as the log, base 2, of the number of states in the autonatom.

Conversion via -complexity method 1. Language Sample 2. Tree of all substrings of length l with transition probabilities ……abaaabaaababababaaabab…. a, P(a|0) b, P(b|0) a 1 2 3 4 5 6 7 8 9 b, P(b|2) a, P(a|2) b, P(b|3) a, P(a|3) 3. Merged subtrees must be topologically similar and have similar probability structures We can also count the number of times a given link in the tree structure is crossed by the substrings in the language sample. The results can be used to derive a probabilistic FSA whose substring statistics approximate that of the language sample. This is done by merging only subtrees whose topologies match and whose transition probabilities are the same to within some tolerance, . The epsilon complexity of the resulting PFSA is defined as the Shannon entropy of the state probabilities. -complexity

Behavior Classification Tool using Finite State Automata ……f c g h e d a d c…… Sample taken over “short” time scale Behavior 1 Behavior 2 String Rejections (1) / Sec String Rejections (2) / Sec One method of classifying the unknown behavior that is generating a stream of symbols is to compare it to the automata representing known behaviors. If a string belongs to the language defined by an automaton, it will accept the string. We can therefore look at the number of string rejections per second to determine which behavior most accurately describes the symbol stream. If the intersection of the two automata is null, the rejection rates can be compared directly. If the languages overlap, such as when one is a subset of the other, more care must be taken in the comparison. This will be the case for our demonstration this afternoon. Analyze Rejection Rates to find most likely known behavior for the sample (if any are close enough) Dynamic Target Behavior Changes over “long” time scale

Conversion Tools: Formal Languages to Infinite Dimensional Vector Space The space contains a vector for all possible languages of a given alphabet: For example: …..abababaaababaaaba….. Measures are defined on the vector space that satisfy: Recent work at Penn State has resulted in a vector space representation for formal languages with quantitative measures. This allows us to determine how close two behaviors are, based on a given measure, even if they have complexity measures that are equal. These measures can include evaluation functions which provide a way to optimize network control structures.

Weighted Counting Measure for Formal Languages Various measures can be defined on the formal language vectors, such as: where ni(L) is the number of strings of length i in language L, and where k is the number of symbols in the alphabet. A simple weighted counting measure is shown above as an example. The measure of *, the set of all strings of a given alphabet, is 1. The difference between two languages is the measure of their “exclusive or”. The distance between two languages is defined as:

Behavior Classification Tool using a Formal Language Measure ……f c g h e d a d c…… Sample taken over “short” time scale Convert to Vector The derivation of formal language vector spaces and formal language measures allow the use of traditional pattern matching techniques for target behaviors. An unknown behavior can be compared to a database of known behaviors by minimizing the measure of the difference between the unknown and known behaviors, as shown on the slide. Cluster analysis can be used on vectors of unknown behaviors to determine behavior classes. Dynamic Target Behavior Changes over “long” time scale Vectors of known behaviors

Future Work – Recognizing the Behavior of Multiple Targets Recognizing the behaviors of multiple targets Stages Finding stationary targets Finding moving targets Recognizing behaviors of multiple targets Methods Sensor energy surface Sensor cross-correlation Can we extend our methods to multiple targets, especially when there is coordinated activity? There are number of research issues in this area. We will start with detecting multiple stationary targets and extend the methods to tracking multiple moving targets. We will also need a formalism for describing coordinated activity among multiple targets and extend our methods to deal with the this formalism.

Future Work – Recognizing multiple targets – Method 1 Estimate the sensor energy density surface and look for local maxima that are “large and well separated” enough. Sensor energy surface

Future Work – Recognizing multiple targets – Method 2 Sensor cross-correlation Each sensor will contain a sum of uncorrelated time series from each vehicle within range, with time offsets that depend on the distance from the vehicle to the sensor. The time-shifted cross-correlation function between sensors will contain separate peaks for each vehicle.

Future Work – Behavior Recognition of Multiple, Coordinated Enemy Assets One model for coordinated behavior among multiple vehicles is a hierarchy of discrete event automata. Can our behavior recognition mechanism be extended to this? Can we the extend model recognition techniques to hierarchical control systems?

Experimental Validation Formal Language Events: a – green to red or red to green b – green to tan or tan to green c – green to blue or blue to green d – red to tan or tan to red e – blue to red or red to blue f – blue to tan or tan to blue Pressure sensitive floor This afternoon we will provide a demonstration of a validation experiments for target behavior recognition in sensor networks. The sensor network will be a pressure sensitive floor, and the target will be a robot. The behaviors are Wall Following and Random Walking. The experiment will attempt to detect the initial behavior and a change to the other behavior at a later time. Wall following Random walk Target Behavior Analyze String Rejections

Deterministic Finite Automata (DFA) b c start Directed Graph (loops & multiple edges permitted) such that: Nodes are called States Edges are called Transitions Distinguished initial (or starting) state Transitions are labeled by symbols from a given finite alphabet,  = {a, b, c, . . . } The same symbol can label several transitions A given symbol can label at most one transition from a given state (deterministic)

Deterministic Finite Automata (DFA) Formal Definition b c start Quadruple (Q, q0 , ,  ) such that: Q is a finite set of states  is a finite set of symbols, called the alphabet q0Q is the initial state  : Q    Q  {Blocked} is the transition function:  (q, a) = Blocked if there is no transition from q labeled by a  (q, a) = q' if a is a transition from q to q'

DFA and Strings c a b start Any path through the graph starting from the initial state determines a string from the alphabet. Example: The blue dashed path determines the string a b c a Conversely, any string from the alphabet is either blocked or determines a path through the graph. Example: The following strings are blocked: c, aa, ac, abb, etc. Example: The following strings are not blocked: a, b, ab, bb, etc. The collection of all unblocked strings is called the language accepted or determined by the DFA (all states are “final” in our approach)

Strings and Languages  = (finite) alphabet * = set of all (finite) strings from  A language is any subset of *. Not all languages can be determined by a DFA. Different DFAs can accept the same language

Probabilistic Finite Automata (PFA) q0 a, .4 b, .2 b, 1 b, .5 c, .6 c, .5 start a, .8 A PFA is a DFA (Q, q0 , ,  ) with a probability attached to each transition such that the sum of the probabilities across all transitions from a given node is unity. Formally, p: Q    [0, 1] such that p(q, a) = 0 if and only if  (q, a) = Blocked Multiplying branch probabilities lets us assign a probability value (q0, s) to each string s in *. E.G., (q0, abca)=(.8)1(.6)(.4)=.192

Properties of (q0, s) For fixed q0, (q0, s) is a measure on * Support of  is the language accepted by the DFA For fixed q0, (q0, s) is a probability measure on i ( i = strings of length i ) This probability measure is written as (i). Given a probability distribution w(i) across string lengths i, defines a probability measure across *, called the w-weighted probability measure of the PFA. If all w(i) are positive, then the support of  is also the language accepted by the underlying DFA.

Distance Between Two PFA Let A and B be two PFAs on the same alphabet  Let w(i) be a probability distribution across string lengths i Let A and B be the w-weighted probability measures of A and B Define the distance between A and B as the variational distance between the probability measures A and B : d(A, B) = || A  B ||

Disease Mapping and Evaluation Elevated rate cluster detection and assessment Geographic surveillance and evaluation Comparative studies involving statistical power Geometric accuracy in cluster estimation Real data projects and validations Computing and software

Agent Orange Dioxin Hotspots, coldspots, and midspots Identification and prioritization Persistent hotspot trajectories Spatial disease monitoring, modeling, and tracking Health effect surveillance system

Agent Orange Dioxin Ecosystem impact assessment Deforestation, defoliation, remote sensing Hyperspectral change detection and accuracy assessment Emergent advanced raster map analysis system

Crop Biosurveillance/Biosecurity

Crop Biosurveillance/Biosecurity

Crop Biosurveillance/Biosecurity

Emergent Surveillance Plexus Surveillance Sensor Network Testbed Autonomous Ocean Sampling Network Types of Hotspots Hotspots due to multiple, localized, stationary sources Hotspots corresponding to areas of interest in a stationary mapped field Time-dependent, localized hotspots Hotspots due to moving point sources

Ocean SAmpling MObile Network OSAMON

Ocean SAmpling MObile Network OSAMON Feedback Loop Network sensors gather preliminary data ULS scan statistic uses available data to estimate hotspot Network controller directs sensor vehicles to new locations Updated data is fed into ULS scan statistic system

Surveillance Sensor Network Testbed High - Indoor / Outdoor GPS for cooperative navigation Cross-platform joint service experiments Operational testing for more realistic adaptation Aerial Mapping adds new perspective Expand data storage and processing Larger cross-platform teams capture realistic cooperation and dynamic adaptation

SAmpling MObile Networks (SAMON) Additional Application Contexts Hotspots for radioactivity and chemical or biological agents to prevent or mitigate the effects of terrorist attacks or to detect nuclear testing Mapping elevation, wind, bathymetry, or ocean currents to better understand and protect the environment Detecting emerging failures in a complex networked system like the electric grid, internet, cell phone systems Mapping the gravitational field to find underground chambers or tunnels for rescue or combat missions

Environmental Justice Is environmental degradation worse in poor and minority communities? Identification of persistent poverty and its trajectories Identification of pollution proximity and vulnerability Co-location research hampered by the lack of methods and tools of investigation

Space-Time Poverty Hotspot Typology Federal Anti-Poverty Programs have had little success in eradicating pockets of persistent poverty Can spatial-temporal patterns of poverty hotspots provide clues to the causes of poverty and lead to improved location- specific anti-poverty policy ?

Dimensions of Tract Poverty in Four Metropolitan Areas 1970-1990 Concentrated Persistent Camden, NJ Detroit, MI Oakland, CA Memphis, TN Growing Shifting

Growing poverty

Persistent poverty

Shifting poverty Oakland 1970 Poverty data Oakland 1980 Poverty data

Concentrated poverty Camden 1970 Poverty data Camden 1980 Poverty data

Hotspot Persistence Persistence is a property of space-time hotspots Space (census tract) 1970 1980 1990 2000 Time (census year) Persistent Hotspot Long Duration Persistence is a property of space-time hotspots Persistence can be assessed by the projection of the space-time hotspot onto the time axis Space (census tract) 1970 1980 1990 2000 Time (census year) One-shot Hotspot Duration Short

Typology of Persistent Space-Time Hotspots-1 Space (census tract) 1970 1980 1990 2000 Time (census year) Stationary Hotspot Space (census tract) 1970 1980 1990 2000 Time (census year) Expanding Hotspot Space (census tract) 1970 1980 1990 2000 Time (census year) Shifting Hotspot Space (census tract) 1970 1980 1990 2000 Time (census year) Contracting Hotspot

Typology of Persistent Space-Time Hotspots-2 Space (census tract) 1970 1980 1990 2000 Time (census year) Bifurcating Hotspot These hotspots are connected in space-time However, certain time slices of the hotspot are disconnected in space Space (census tract) 1970 1980 1990 2000 Time (census year) Merging Hotspot Spatially disconnected time slice

Trajectory of a Persistent Space-Time Hotspot A space-time hotspot is a three-dimensional object Visualization can be done by displaying the sequence of time slices---called the trajectory of the hotspot Time slices of space-time hotspot Space (census tract) 1970 1980 1990 2000 Time (census year) Merging Hotspot

Trajectory of a Merging Hotspot 1970 1980 1990 2000

Trajectory of a Shifting Hotspot 1970 1980 1990 2000

Multiscale Advanced Raster Map Analysis Geospatial Patterns and Pattern Metrics Landscape patterns, disease patterns, mortality patterns Surface Topology and Spatial Structure Hotspots, outbreaks, critical areas Intrinsic hierarchical decomposition, study areas, reference areas Change detection, change analysis, spatial structure of change

Needed Downloadable geoinformatic analysis system Identification of arbitrary shape hotspots, coldspots, midspots Multi-criteria prioritization and ranking Synoptic and network-based surveillance Multiple scales and aggregation levels Space, time, and space-time Zonal tree scan system

Terra Seer Long Island Study-1 August 13, 2002 GEOGRAPHIC TRACE OF CANCER The Geographic Distribution of BLC Cancers in Long Island NY Geographic Trace of Cancer (the pattern of cases over time and space) may even become as important as genetic research in the search for causes Need of an Exploratory, Integrative, Multi-scalar Approach

Terra Seer Long Island Study-2 August 13, 2002 STRENGTHENING SATSCAN The techniques we employ are not ‘better’ or ‘more accurate’ than the scan statistic Using a battery of approaches allows us to quantify different aspects of univariate and multivariate clusters, to explore different scales of clustering, and to evaluate how sensitive the results are to different definitions of clustering The methods we have used are not designed to replace the scan approach, rather they should be used with the scan statistic to develop a more comprehensive understanding of geographic variation of cancer morbidity

Terra Seer Long Island Study-3 August 13, 2002 HOTSPOTS, PUBLIC AND AGENCIES Public: Being singled out as an elevated cluster community is such a source of dread for many people Agency: Hotspot status creates intense pressure to study individual clusters instead of broader cancer patterns Need: Hotspot rating along with hotspot detection, prioritization, and ranking interval with multicriteria

Terra Seer Long Island Study-4 August 13, 2002 Issues and Approaches Local Moran test , Boundary Detection, Boundary Overlap, Multi-Scalar, and Spatially Constrained Clustering Adjacency: Centroid, common border, upper level set tree neighbor Multiscalar: Multivariate response, Multivariate Binomial, Poisson, Gamma, Lognormal Scale of Study: Spatial extent, Hotspot in Hotspots Scale of Process: Resolution, Aggregation Level

Recommendations Research, education, and outreach initiative National Center Multi-focal multi-disciplinary thrust Concepts, methods, tools, validations Solid, sound, sophisticated and yet user-friendly Priority activities and regional projects Downloadable user-friendly software system Innovative Synergistics

Penn State Involvements Center for Statistical Ecology and Environmental Statistics/NSF,EPA Poverty Research Center/Ford Foundation GeoVista Center/NCI, NSF Geovisualization and Spatial Analysis of Cancer Data NCI Program For: Geographic-based Research in Cancer Control and Epidemiology Appalachia Cancer Network/NCI Environmental Consortium on Biosurveillance Cooperative Wetlands Center/EPA Center for Remote Sensing of Earth Resources/NASA, EPA

Logo for Statistics, Environment, Health, Ecology, and Society