
Discovery of Patterns in the Global Climate System using Data Mining
Vipin Kumar
Army High Performance Computing Research Center, Department of Computer Science, University of Minnesota
Collaborators: G. Karypis, S. Shekhar, M. Steinbach, P.N. Tan (AHPCRC); C. Potter (NASA Ames Research Center); S. Klooster (California State University, Monterey Bay).
This work was partially funded by NASA and the Army High Performance Computing Center.
© V. Kumar Discovery of Patterns in the Global Climate System using Data Mining

Research Goals
A key interest is finding connections between the ocean and the land.
- Global snapshots of values for a number of variables on land surfaces or water.
- Monthly, over a range of 10 to 50 years.
Research goals:
- Find global climate patterns of interest to Earth scientists.

Sources of Earth Science Data
- Before 1950, very sparse, unreliable data.
- Since 1950, reliable global data.
  - Ocean temperature and pressure are based on data from ships.
  - Most land data (solar, precipitation, temperature, and pressure) comes from weather stations.
- Since 1981, data has been available from Earth-orbiting satellites.
  - FPAR, a measure related to plants and greenness.
- Since 1999, TERRA, the flagship of the NASA Earth Observing System, has been providing much more detailed data.

Importance of Global Climate Patterns
- The climate of the Earth’s land surface is strongly influenced by the behavior of the Earth’s oceans.
  - El Niño is the anomalous warming of the eastern tropical region of the Pacific.
  - It is associated with droughts in Australia and Southern Africa and heavy rainfall along the western coast of South America.
[Figure: El Niño events; sea surface temperature anomalies off Peru (ANOM 1+2)]

Importance of Global Climate Patterns and NPP
- Net Primary Production (NPP) is the net assimilation of atmospheric carbon dioxide (CO2) into organic matter by plants.
  - NPP is driven by solar radiation and can be constrained by precipitation and temperature.
- Keeping track of NPP is important because it includes the food source of humans and all other organisms.
  - Sudden changes in the NPP of a region can have a direct impact on the regional ecology.
- NPP is impacted by global climate patterns.
  - Precipitation and temperature are directly affected by global climate patterns such as El Niño.
  - Solar radiation is affected indirectly by cloudiness.

Role of Statistics and Data Mining
- Previously, Earth scientists have relied on statistical techniques.
  - The hypothesize-and-test paradigm is extremely labor-intensive.
- Data mining provides Earth scientists with tools that allow them to spend more time choosing and exploring interesting families of hypotheses.
  - By applying the proposed data mining techniques, some of the steps of hypothesis generation and evaluation will be automated, facilitated, and improved.
- However, statistics is needed to provide methods for determining the “statistical” significance of results.

Patterns of Interest
- Zone Formation
  - Find regions of the land or ocean which have similar behavior.
- Teleconnections
  - Teleconnections are the simultaneous variation in climate and related processes over widely separated points on the Earth.
- Associations
  - Find relations between climate events and land cover.
- River Discharge
  - Find the relationship between the water discharged from a river and precipitation, climate, and human activity.

Clustering for Zone Formation
- Interested in relationships between regions, not “points.”
- For ocean, clustering based on SST (Sea Surface Temperature) or SLP (Sea Level Pressure).
- For land, clustering based on NPP or other variables, e.g., precipitation, temperature.
  - Typically we work with the points.
- When “raw” NPP and SST are used, clustering can find seasonal patterns.
  - Anomalous regions have plant growth patterns which are reversed from those typically observed in the hemisphere in which they reside, and are easy to spot.

K-Means Clustering of Raw NPP and Raw SST (Num clusters = 2)

K-Means Clustering of Raw NPP and Raw SST (Num clusters = 2)
- Land cluster cohesion: North = 0.78, South = 0.59
- Ocean cluster cohesion: North = 0.77, South = 0.80

K-Means Clustering of Raw NPP and Raw SST (Num clusters = 6)

Preprocessing
- Time series preprocessing issues
  - Need to remove seasonality.
    - Earth scientists are mostly interested in anomalies.
  - Need to remove most of the autocorrelation.
    - Statistical tests are affected.
  - Need to remove trends.
    - Normally want to detect patterns and trends separately.
  - Normally interested in similarity once differences in means and scale have been considered.
    - Pearson’s correlation coefficient has this property.
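The last point can be checked numerically. The following is a minimal sketch, not code from the presentation: the function name `pearson` and the sample values are illustrative assumptions. It shows that shifting and rescaling a series leaves its Pearson correlation with another series unchanged.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical series (not real climate data).
x = [1.0, 3.0, 2.0, 5.0, 4.0]
z = [2.0, 1.0, 4.0, 3.0, 6.0]

# Same shape as x, but with a different mean and a different scale.
x_shifted_scaled = [100.0 + 3.0 * v for v in x]

r_original = pearson(x, z)
r_transformed = pearson(x_shifted_scaled, z)  # equal to r_original up to rounding
```

Because only a positive rescaling is applied here, the sign of the correlation is also preserved; a negative scale factor would flip it.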

Sample NPP Time Series
[Figure: NPP time series for Minneapolis, Atlanta, and São Paulo]
[Table: correlations between the Minneapolis, Atlanta, and São Paulo time series]

Seasonality Accounts for Much Correlation
[Figure: normalized NPP time series for Minneapolis, Atlanta, and São Paulo]
[Table: correlations between the Minneapolis, Atlanta, and São Paulo time series]
Normalized using the monthly Z score: subtract the monthly mean and divide by the monthly standard deviation.
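The monthly Z score described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the series is assumed to start in January and cover at least one full year, the function name is hypothetical, and the population standard deviation is used because the slide does not say which variant was applied.

```python
import statistics

def monthly_z_scores(series):
    """Return monthly Z scores for a monthly series whose first value is a January.

    Each value has the mean of its calendar month subtracted and is then
    divided by the standard deviation of that calendar month.
    """
    z = [0.0] * len(series)
    for month in range(12):
        values = series[month::12]          # all Januaries, all Februaries, ...
        mean = statistics.mean(values)
        std = statistics.pstdev(values)     # population std (an assumption)
        for k, value in enumerate(values):
            # A month with no variation carries no anomaly signal.
            z[month + 12 * k] = (value - mean) / std if std > 0 else 0.0
    return z

# Hypothetical two-year series: the same seasonal cycle, second year uniformly higher.
two_years = [float(m) for m in range(12)] + [float(m) + 2.0 for m in range(12)]
anomalies = monthly_z_scores(two_years)
```

In this toy input the seasonal cycle disappears entirely: every month of the first year maps to -1 and every month of the second year to +1.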

Removing Seasonality Removes Most Autocorrelation

Preprocessing: Removing Trends
A slight linear trend added to two random time series increases their correlation dramatically, from 0.01 to 0.17.
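The slide does not say how trends were removed. One common choice, shown here purely as an assumption, is to subtract the least-squares straight line from each series. The function name `detrend` and the sample values are hypothetical.

```python
def detrend(series):
    """Subtract the least-squares straight line from an evenly spaced series."""
    n = len(series)
    t_mean = (n - 1) / 2                     # mean of the time index 0..n-1
    y_mean = sum(series) / n
    slope = (
        sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
        / sum((t - t_mean) ** 2 for t in range(n))
    )
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(series)]

# Hypothetical anomaly series, and the same series with a linear trend added.
base = [1.0, -2.0, 0.5, 3.0, -1.5, 0.0, 2.0, -0.5]
trended = [v + 0.3 * t for t, v in enumerate(base)]

# Detrending is linear, so the added trend is removed exactly (up to rounding).
residual_base = detrend(base)
residual_trended = detrend(trended)
```

Applying the same detrending to both series before correlating them keeps a shared trend from inflating the correlation.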

Ocean Climate Indices: Connecting the Ocean and the Land
- An OCI is a time series of temperature or pressure.
  - Based on Sea Surface Temperature (SST) or Sea Level Pressure (SLP).
- OCIs are important because
  - They distill climate variability at a regional or global scale into a single time series.
  - They are well-accepted by Earth scientists.
  - They are related to well-known climate phenomena such as El Niño.

Ocean Climate Indices - ANOM 1+2
- ANOM 1+2 is associated with El Niño and La Niña.
- Defined as the Sea Surface Temperature (SST) anomalies in a region off the coast of Peru.
- El Niño is associated with
  - Droughts in Australia and Southern Africa
  - Heavy rainfall along the western coast of South America
  - Milder winters in the Midwest
[Figure: El Niño events]

Connection of ANOM 1+2 to Land Temp
OCIs capture teleconnections, i.e., the simultaneous variation in climate and related processes over widely separated points on the Earth.

Ocean Climate Indices - NAO
- The North Atlantic Oscillation (NAO) is associated with climate variation in Europe and North America.
- Defined as the normalized pressure difference between Ponta Delgada, Azores and Stykkisholmur, Iceland.
- Associated with warm and wet winters in Europe and with cold and dry winters in northern Canada and Greenland.
- The eastern US experiences mild and wet winter conditions.
[Figure labels: Iceland, Azores]

Connection of NAO to Land Temp

Influence of OCI on Land - Area Weighted Correlation
- Correlation of an OCI with a land variable is a standard way to evaluate its “influence.”
  - Correlation does not imply causality.
  - Temperature and precipitation are the typical land variables.
- If relatively many land points have a relatively high correlation, then an OCI is influential.
- To evaluate whether clusters (or pairs) are potential OCIs, we compute their area weighted correlation.
  - Weighted average of the correlation with land points, where the weight is based on area.
  - May exclude points whose correlation is low and then calculate the area weighted correlation.
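A minimal sketch of an area weighted correlation follows. Several details are assumptions, since the slides do not spell them out: grid cells are weighted by the cosine of their latitude (appropriate for a regular latitude-longitude grid), the absolute value of each correlation is used so that positive and negative influences do not cancel, and excluded points contribute zero while the normalization stays over all land points. The function and parameter names are hypothetical.

```python
import math

def area_weighted_correlation(correlations, latitudes_deg, threshold=0.0):
    """Area weighted average of |correlation| over land points.

    correlations  -- correlation of the index with each land point
    latitudes_deg -- latitude of each land point, in degrees
    threshold     -- points with |correlation| below this contribute zero
    """
    weights = [math.cos(math.radians(lat)) for lat in latitudes_deg]
    total_area = sum(weights)
    kept = sum(
        w * abs(r)
        for r, w in zip(correlations, weights)
        if abs(r) >= threshold
    )
    return kept / total_area

# Hypothetical correlations for three equatorial land points.
corrs = [0.5, -0.5, 0.1]
lats = [0.0, 0.0, 0.0]
awc_all = area_weighted_correlation(corrs, lats)                  # uses all points
awc_strong = area_weighted_correlation(corrs, lats, threshold=0.2)  # drops the 0.1 point
```

With this normalization the value can only stay the same or fall as the threshold rises, which matches the behavior reported for known OCIs later in the deck.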

Evaluation of Known OCIs via Area Weighted Correlation
[Figure: Area Weighted Correlation of Known OCIs to Land Temp (overlapping, threshold = 0)]

Evaluation of Known OCIs via Area Weighted Correlation …
Area weighted correlation declines as we consider only land points whose temperature correlates with the OCI above a given threshold.

Discovering OCIs via Data Mining
- Earth scientists have discovered the currently known OCIs through
  - Observation
  - Eigenvalue techniques such as Principal Components Analysis (PCA) and Singular Value Decomposition (SVD).
- Clustering provides an alternative approach.
  - Clusters represent ocean regions with relatively homogeneous behavior.
  - The centroids of these clusters are time series that summarize the behavior of these ocean areas, and thus represent potential OCIs.
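A cluster centroid in this setting is the point-by-point mean of the member time series. The sketch below assumes each ocean grid point is stored as a list of monthly values of equal length; the function name and data are hypothetical, and the clustering step that produces the membership is not shown.

```python
def cluster_centroid(member_series):
    """Average the time series of all grid points in one cluster, month by month."""
    n_members = len(member_series)
    return [sum(values) / n_members for values in zip(*member_series)]

# Two hypothetical SST anomaly series belonging to the same cluster.
cluster = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
candidate_index = cluster_centroid(cluster)  # one time series summarizing the region
```

The resulting series has the same length as its members and can be correlated with land variables like any known index.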

Finding Influential Ocean Regions
- Not all points on the ocean correlate well with land variables such as temperature and precipitation.
- The best points are those which have a high “density.”
  - Dense points are relatively homogeneous with respect to their neighboring points.

Discovery of Ocean Climate Indices
- Use clustering to find areas of the oceans that have high density, i.e., relatively homogeneous behavior.
  - Cluster centroids are potential OCIs.
  - For SLP, pairs of cluster centroids are potential OCIs.
- Evaluate the “influence” of potential OCIs on land points.
- Determine if the potential OCI matches a known OCI.
- For potential OCIs that are not well-known, conduct further evaluation.
  - Are there land points that have higher correlation for the potential OCI than for known indices?

SST Clusters

Evaluating Cluster Centroids as Potential OCIs
- Evaluation will be based on area weighted correlation.
  - Ignore clusters whose area weighted correlation is low.
- Three cases:
  - Clusters are highly similar to known OCIs (corr > 0.4).
    - May represent a known OCI.
    - Clusters may be “better,” i.e., have higher coverage.
    - Clusters may cover a different area, i.e., some points for which the new OCI is a better predictor.
  - Clusters are moderately similar to known OCIs (0.25 < corr < 0.4).
    - Again, new OCIs may be better predictors for some points.
  - Clusters are not similar to known OCIs (corr < 0.25).
    - These clusters may represent as yet undiscovered Earth Science phenomena.
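The three cases can be expressed as a small helper. This is a sketch under assumptions: the best match is taken as the largest absolute correlation with any known index, and values exactly at 0.25 or 0.4 (which the slide leaves unassigned) are placed in the moderate band. The function name and labels are hypothetical.

```python
def classify_similarity(correlations_with_known_ocis):
    """Label a cluster centroid by its best match among the known OCIs."""
    best = max(abs(r) for r in correlations_with_known_ocis)
    if best > 0.4:
        return "highly similar"
    if best >= 0.25:            # boundary handling is an assumption
        return "moderately similar"
    return "not similar"

# Hypothetical correlations of three centroids with a set of known indices.
label_a = classify_similarity([0.10, 0.93])   # strong match with one index
label_b = classify_similarity([0.30, -0.10])  # only a moderate match
label_c = classify_similarity([0.20, -0.05])  # no close match
```

Centroids in the last group are the candidates for previously unrecognized phenomena, so they would be the ones passed on for further evaluation.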

SST Clusters Highly Correlated to Known Indices
[Figure: Area Weighted Correlation of Cluster Centroids to Land Temp (overlapping, threshold = 0)]

SST Clusters Highly Correlated to Known Indices …

SST Clusters that Correspond to El Niño Climate Indices
SNN clusters of SST that are highly correlated with El Niño indices (~0.93 correlation).
[Figure: El Niño regions defined by Earth scientists]

SST Clusters Highly Correlated to Known Indices …
Examples of some SST clusters that are highly correlated to known OCIs and have high area weighted correlation with land temperature. These indices have a significant correlation with El Niño indices.

SST Clusters Highly Correlated to Known Indices
However, there are areas (yellow) where these clusters correlate better.

SST Clusters Highly Correlated to Known Indices

SST Cluster Moderately Correlated to Known Indices

Comments from our NASA collaborators
“Ocean cluster results based on SST correlations with land surface temperature suggest that
- New areas of the ocean may be identified that are unknown as being highly representative of the El Nino Southern Oscillation (ENSO) and the Arctic Oscillation (AO).
- New predictive indices for land climate over the past 40 years can be identified that will improve upon predictions using any known ocean climate index to date, including SOI and AO.”

Issues in Mining Associations from Earth Science Data
- Data is continuous rather than discrete.
- Data has spatial and temporal components.
- Data can be multilevel.
  - Time and spatial granularities.
- Observations are not i.i.d. due to spatial and temporal autocorrelations.
- Data may contain noise, missing information, and measurement errors.
  - For example, historical SST data was measured using wooden buckets.
- Data may come from heterogeneous sources.
  - Calibration issues.

Mining Associations in Earth Science Data: Challenges
- How to transform Earth Science data into transactions?
  - What are the “baskets”?
  - What are the “items”?
  - How to define “support”?

Mining Association Patterns in Earth Science Data: Challenges
- How to efficiently discover spatio-temporal associations?
  - Use existing algorithms.
  - Develop new algorithms.
- How to identify interesting patterns?
  - Use objective interest measures.
  - Use domain knowledge.
Example rules:
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)
2. FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI (support count=933, confidence=99.3%)
3. FPAR-HI PET-HI PREC-HI ==> NPP-HI (support count=1655, confidence=98.8%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI (support count=268, confidence=98.2%)
…

Event Definition
- Items are events abstracted from time series.
- Events of interest include:
  - Temporal events:
    - Anomalous temporal events such as warmer winters and droughts.
    - Changes in the periodic behavior such as longer growing seasons or earlier month of onset of greenup.
  - Spatial events:
    - Large percentage of land areas in a certain region having below-average precipitation.
  - Spatio-temporal events:
    - Changes in circulation or trajectory of jet streams.

Example of Anomalous Event Definition
If the threshold for Z is ±1.5, then on average there are ~20 events per time series.
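Thresholding a Z-score series into events can be sketched as below. The function name, the `HI`/`LO` labels attached to a month index, and the choice to treat a value exactly at the threshold as an event are illustrative assumptions.

```python
def anomalous_events(z_scores, threshold=1.5):
    """Return (month_index, label) pairs where |Z| reaches the threshold."""
    events = []
    for month, z in enumerate(z_scores):
        if z >= threshold:
            events.append((month, "HI"))
        elif z <= -threshold:
            events.append((month, "LO"))
    return events

# Hypothetical monthly Z scores for one variable at one location.
z_series = [0.2, 1.6, -1.5, 1.4]
events = anomalous_events(z_series)
```

Running this per variable and per location yields the event items (for example PREC-HI or NPP-LO) that the association analysis works with.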

Transaction and Support Definitions
- Convert the time series into a sequence of events for each spatial location.

Examples of Association Patterns
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)
2. FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI (support count=933, confidence=99.3%)
3. FPAR-HI PET-HI PREC-HI ==> NPP-HI (support count=1655, confidence=98.8%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI (support count=268, confidence=98.2%)
5. FPAR-HI PET-HI PREC-HI SOLAR-LO TEMP-HI ==> NPP-HI (support count=44, confidence=97.8%)
6. FPAR-LO PET-LO PREC-LO SOLAR-LO ==> NPP-LO (support count=216, confidence=96.9%)
7. FPAR-LO PREC-LO SOLAR-LO TEMP-HI ==> NPP-LO (support count=152, confidence=96.2%)
8. FPAR-LO PET-LO PREC-LO SOLAR-LO TEMP-LO ==> NPP-LO (support count=47, confidence=95.9%)
9. FPAR-LO PREC-LO SOLAR-LO TEMP-LO ==> NPP-LO (support count=49, confidence=94.2%)
10. FPAR-LO PREC-LO SOLAR-LO ==> NPP-LO (support count=595, confidence=93.7%)
…
75. FPAR-HI ==> NPP-HI (support count = , confidence = 55.7%)
- min support = 0.001%, min confidence = 10%
NPP = Solar * FPAR * ε * Temperature * Moisture
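The support count and confidence reported for each rule can be computed as in the sketch below. It assumes each transaction is the set of events observed together (for instance at one location in one month); the slides leave the exact transaction definition open, and the function name and toy data are hypothetical.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support_count, confidence) for the rule antecedent ==> consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return n_both, confidence

# Hypothetical transactions; each is a set of co-occurring events.
transactions = [
    {"FPAR-HI", "PREC-HI", "NPP-HI"},
    {"FPAR-HI", "NPP-HI"},
    {"FPAR-HI"},
    {"PREC-LO", "NPP-LO"},
]
support_count, confidence = rule_stats(transactions, {"FPAR-HI"}, {"NPP-HI"})
```

Here the rule holds in two of the three transactions that contain FPAR-HI, so the support count is 2 and the confidence is two thirds.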

Example of Interesting Association Patterns
FPAR-Hi ==> NPP-Hi (sup=5.9%, conf=55.7%)
- The rule has high support in shrubland areas.
[Figure label: Shrubland areas]

Land Cover Types
[Figure: land cover types, including shrublands]

Using Land Cover as Additional Features
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)
2. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI GRASSLAND ==> NPP-HI (support count=145, confidence=100%)
3. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI FOREST ==> NPP-HI (support count=44, confidence=100%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI CROPLAND ==> NPP-HI (support count=44, confidence=100%)
5. FPAR-HI PET-HI PREC-HI SOLAR-HI FOREST ==> NPP-HI (support count=75, confidence=100%)
6. FPAR-HI PET-HI PREC-HI SOLAR-HI CROPLAND ==> NPP-HI (support count=81, confidence=100%)
7. FPAR-HI PREC-HI SOLAR-HI TEMP-HI CROPLAND ==> NPP-HI (support count=58, confidence=100%)
8. FPAR-HI PET-HI PREC-HI TEMP-HI GRASSLAND ==> NPP-HI (support count=376, confidence=99.5%)
9. FPAR-HI PET-HI PREC-HI TEMP-HI CROPLAND ==> NPP-HI (support count=170, confidence=99.4%)
10. FPAR-HI PET-HI PREC-HI CROPLAND ==> NPP-HI (support count=277, confidence=99.3%)
…
This produces multiple rules that have the same form: {A} ==> {B}, {A, Grassland} ==> {B}, {A, Cropland} ==> {B}, etc.
Some of the support counts could be missing if itemsets fall below the minimum support threshold.

Finding Interesting Earth-Science Patterns
- A pattern is interesting if it occurs relatively more frequently in some homogeneous regions.
- If the relative frequency of a pattern is similar in all groups of land areas, then it is less interesting.
- If the pattern occurs mostly in a certain group of land areas, then it is potentially interesting.

Filtering Patterns using Land Cover Types
For each pattern p:
- Actual coverage for land cover type i = $s_i / S$
- Expected coverage for land cover type i = $n_i / N$
- Ratio of actual to expected coverage for land cover type i: $e_i = \frac{s_i N}{n_i S}$
Interest Measure
[Equation: interest measure]
If the pattern occurs in arbitrary regions, the interest measure will be low.
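The coverage ratio can be sketched as follows. The slide does not define its symbols in this transcript, so the reading used here is an assumption: s_i is the pattern's support count within land cover type i, S is its total support, n_i is the number of locations of type i, and N is the total number of locations. The final interest measure is not reproduced because its formula is not available in the text. Names and counts are hypothetical.

```python
def coverage_ratios(pattern_counts, land_cover_counts):
    """Ratio of actual to expected coverage, e_i = (s_i / S) / (n_i / N), per land cover type."""
    total_support = sum(pattern_counts.values())        # S
    total_locations = sum(land_cover_counts.values())   # N
    ratios = {}
    for cover_type, n_i in land_cover_counts.items():
        s_i = pattern_counts.get(cover_type, 0)
        actual = s_i / total_support
        expected = n_i / total_locations
        ratios[cover_type] = actual / expected
    return ratios

# Hypothetical counts: the pattern is concentrated in grassland.
pattern_counts = {"grassland": 30, "forest": 10}
land_cover_counts = {"grassland": 100, "forest": 300}
ratios = coverage_ratios(pattern_counts, land_cover_counts)
```

A ratio near 1 for every land cover type means the pattern is spread in proportion to area, which is the "arbitrary regions" case; ratios far above 1 for one type indicate concentration there.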

Interesting Spatial Association Pattern
- FPAR-Hi ==> NPP-Hi tends to occur in shrubland and grassland regions.
  - FPAR-Hi ==> NPP-Hi: Forest (0.1%), Croplands (4.2%), Grasslands (25.9%), Desert (3.6%)
- Possible explanation: this type of vegetation can take advantage of periodically high precipitation more quickly than forests.
- New hypothesis: FPAR-HI events in these regions could be related to unusual precipitation conditions.
[Figure labels: FPAR-Hi Prec-Hi ==> NPP-Hi]

Interesting Spatial Association Pattern…
Prec-Hi ==> NPP-Hi tends to occur in grassland and cropland regions.
[Figure: land cover]

Other Interesting Spatial Association Patterns
Temp-Hi ==> NPP-Hi tends to occur in the forest and cropland regions in the northern hemisphere (Forests (33.5%), Grassland (8.7%), Cropland (24.5%), Desert (0.4%)).
[Figure: support count by land cover]

Global River Discharge Data
- 30 rivers, 0.5 degree resolution
- Two measurement stations: mouth and source of river system/basin
- Minimum of ten continuous years of monthly station discharge records
- Interesting associations, e.g., Amazon discharge is highly correlated with ANOM 3.4 (r = -0.5)

Relationship Between River Basin PREC and OCI
- The correlation between PREC aggregated over river basins and the OCI is shown in the left figure.
- Interesting observation: the Amazon and Parana are nearby; however, their responses to the OCI are almost opposite.
[Figure labels: Amazon, Parana]

Discharge Data: Amazon vs. Parana

Relationship Between River Basin PREC and OCI …
- Interesting observation: Petchora and the Pacific Decadal Oscillation (PDO) are highly correlated.
[Figure label: Petchora]

Correlation between PREC and DISCHARGE
[Figure panels: Nile, Yenisei, Danube, Amazon]

Correlation between OCI and DIS (r = 0.3)
[Figure panels: Amazon, Columbia, Colorado, Amu-Darya, Brahmaputra]

Proposed Framework of River Analysis

Conclusions
- Association rules can uncover interesting patterns for Earth scientists to investigate.
  - Challenges arise due to the spatio-temporal nature of the data.
  - Need to incorporate domain knowledge to prune out uninteresting patterns.
- By using clustering we have made some progress towards automatically finding climate patterns that display interesting connections between the ocean and the land.
  - Need to further evaluate candidates for new climate indices.
- Correlation analysis on river discharge data can be used to evaluate the effects of climate and human activity.

Case Studies: Earth Science Data
- Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Chris Potter, Steven Klooster, Alicia Torregrosa, “Clustering Earth Science Data: Goals, Issues and Results,” Workshop on Mining Scientific Data, KDD 2001, San Francisco, CA.
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Steven Klooster, Christopher Potter, Alicia Torregrosa, “Finding Spatio-Temporal Patterns in Earth Science Data: Goals, Issues and Results,” Temporal Data Mining Workshop, KDD 2001, San Francisco, CA.
- Vipin Kumar, Michael Steinbach, Pang-Ning Tan, Steven Klooster, Chris Potter, Alicia Torregrosa, “Mining Scientific Data: Discovery of Patterns in the Global Climate System,” Joint Statistical Meetings, Atlanta, GA.
- Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Chris Potter, Steven Klooster, “Data Mining for the Discovery of Ocean Climate Indices,” Workshop on Mining Scientific Data, SDM 2002.

Statistical Issues
- Temporal autocorrelation
  - Makes it difficult to calculate degrees of freedom and determine significance levels for tests, e.g., of non-zero correlation.
  - A moving average is useful for smoothing and seeing the overall behavior, but introduces additional autocorrelation.
  - Removal of seasonality removes much of the autocorrelation (as long as it is not performed via a moving average).
- Measures of time series similarity
  - Detecting non-linear connections.
  - Detecting connections that only exist at certain times.
    - Sometimes only extreme events have an effect.
  - Automatically detecting appropriate time lags.
  - Statistical tests for more sophisticated measures.
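The effect of a moving average on autocorrelation can be illustrated with synthetic data. The sketch below is an assumption-laden demonstration rather than the authors' analysis: it uses seeded Gaussian white noise instead of climate data, a 12-point window, and hypothetical function names.

```python
import random

def lag1_autocorrelation(series):
    """Sample autocorrelation of a series at lag 1."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum(
        (series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1)
    )
    denominator = sum((v - mean) ** 2 for v in series)
    return numerator / denominator

def moving_average(series, window):
    """Simple moving average; the output is shorter by window - 1 points."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Synthetic white noise: successive values are independent by construction.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(600)]

raw_r1 = lag1_autocorrelation(noise)                          # close to zero
smoothed_r1 = lag1_autocorrelation(moving_average(noise, 12)) # strongly positive
```

Adjacent 12-point averages share 11 of their 12 inputs, which is why the smoothed series is strongly autocorrelated even though the input is not.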

Statistical Issues …
- Detecting spurious connections.
  - We are performing many correlation calculations and there is a chance of spurious correlations.
    - Given that we have ~100,000 locations on the Earth for which we have time series, how many spuriously high correlations will we get when we calculate the correlation between these locations and a climate index?
  - Because of spatial autocorrelation, these correlations are not independent.
    - Again we have trouble calculating the degrees of freedom.
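A rough back-of-envelope answer to the question above is possible under a simplifying assumption: if each individual test has a false-positive rate alpha, the expected number of spurious results is alpha times the number of tests. The significance levels used below are illustrative, since the slides do not give one, and the function names are hypothetical. The Bonferroni threshold is a standard conservative correction added here for comparison; it is not mentioned in the presentation.

```python
def expected_spurious(n_tests, alpha):
    """Expected number of false positives when each test has false-positive rate alpha."""
    return n_tests * alpha

def bonferroni_alpha(n_tests, family_alpha):
    """Per-test significance level that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests

n_locations = 100_000  # approximate number of locations quoted on the slide

spurious_at_5_percent = expected_spurious(n_locations, 0.05)
spurious_at_1_percent = expected_spurious(n_locations, 0.01)
per_test_alpha = bonferroni_alpha(n_locations, 0.05)
```

The expected count holds even when the tests are dependent, but spatial autocorrelation makes spurious results arrive in clusters of neighboring points, so the actual count can vary widely around that expectation. Temporal autocorrelation additionally means the nominal per-test alpha may itself be too optimistic.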