1 SUBSTRUCTURE DISCOVERY IN REAL WORLD SPATIO-TEMPORAL DOMAINS
Jesus A. Gonzalez
Supervisor: Dr. Lawrence B. Holder
Committee: Dr. Diane J. Cook, Dr. Lynn Peterson

2 OUTLINE
• Motivation and Goal.
• Knowledge Discovery with Subdue.
• Application to Two Real-World Relational Databases.
• Comparison of Subdue with ILP Systems.
• Conclusion and Future Work.

3 MOTIVATION AND GOAL
• Need to analyze large amounts of information in real-world databases.
• Find information that standard tools cannot detect.
• Aviation Safety Reporting System Database.
• Earthquake Database.
• Prior knowledge: spatio-temporal relations.

4 THE KDD PROCESS
[Figure: KDD process diagram. Labels: Specific Domain; Data Collection; Data Set; Data Selection; Data Preparation; Clean, Prepared Data; Data Transformation; Formatted and Structured; Data Mining; SUBDUE; Found Patterns; Pattern Evaluation; Knowledge; Application.]

5 SUBDUE KNOWLEDGE DISCOVERY SYSTEM
• SUBDUE discovers patterns (substructures) in structural data sets.
• SUBDUE represents data as a labeled graph.
• Inputs: vertices and edges.
• Outputs: discovered patterns and their instances.

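6 EXAMPLE
[Figure: example graph with vertices labeled "object", "triangle", and "square" and edges labeled "shape" and "on"; the highlighted substructure has 4 instances.]
• Vertices: objects or attributes.
• Edges: relationships.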
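The labeled-graph idea in this example can be sketched in Python. This is only an illustration, not Subdue's actual input format: the `LabeledGraph` class, the vertex numbering, the edge directions, and the "triangle on square" arrangement are assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Hypothetical container: vertex id -> label, plus labeled directed edges."""
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1  # ids are assigned sequentially from 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label):
        self.edges.append((src, dst, label))


def add_triangle_on_square(g):
    """Add one instance of the assumed pattern: a triangle object on a square object."""
    top = g.add_vertex("object")
    tri = g.add_vertex("triangle")
    bottom = g.add_vertex("object")
    sq = g.add_vertex("square")
    g.add_edge(top, tri, "shape")     # attribute edge: top object has shape triangle
    g.add_edge(bottom, sq, "shape")   # attribute edge: bottom object has shape square
    g.add_edge(top, bottom, "on")     # relationship edge between the two objects


graph = LabeledGraph()
for _ in range(4):                    # four instances, as in the slide
    add_triangle_on_square(graph)

print(len(graph.vertices), len(graph.edges))  # 16 12
```

Objects and attribute values are both vertices here, and every relationship, including "has this shape", is an edge, which matches the vertex and edge roles listed on the slide.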

7 SUBDUE’S SEARCH
• Starts with a single vertex and expands by one edge at a time.
• Computationally constrained beam search.
• The search space is all subgraphs of the input graph.
• Guided by compression heuristics.
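The search strategy listed above can be sketched with a generic beam search that stops after a fixed number of expansions. This is not Subdue's implementation. To stay short and runnable it searches over substrings of a toy string instead of subgraphs, and the names `beam_search`, `expand`, `score`, and `TEXT` are all hypothetical. The analogy is that a state starts as a single symbol (a single vertex), grows by one symbol (one edge), and is scored by a compression-like value.

```python
TEXT = "abcabcabcxyz"  # toy "input graph": a string with a repeated pattern


def expand(sub):
    """Grow a substring by one character in every way that occurs in TEXT."""
    out = []
    start = TEXT.find(sub)
    while start != -1:
        end = start + len(sub)
        if end < len(TEXT):
            out.append(TEXT[start:end + 1])
        start = TEXT.find(sub, start + 1)
    return out


def score(sub):
    """Toy compression-like score: rewards patterns that are long and repeated."""
    return (TEXT.count(sub) - 1) * (len(sub) - 1)


def beam_search(initial, expand, score, beam_width, max_expansions):
    """Keep only the best `beam_width` states per level; stop at the expansion limit."""
    beam = sorted(initial, key=score, reverse=True)[:beam_width]
    best = beam[0]
    expansions = 0
    while beam and expansions < max_expansions:
        children = []
        for state in beam:
            if expansions >= max_expansions:
                break
            children.extend(expand(state))
            expansions += 1
        # Drop duplicates (keeping first-seen order), then keep the top candidates.
        children = sorted(dict.fromkeys(children), key=score, reverse=True)
        beam = children[:beam_width]
        if beam and score(beam[0]) > score(best):
            best = beam[0]
    return best


seeds = list(dict.fromkeys(TEXT))  # one seed per distinct symbol
result = beam_search(seeds, expand, score, beam_width=2, max_expansions=20)
print(result)  # abc
```

The beam width bounds how many candidates survive each level and the expansion limit bounds total work, so the search can miss the best pattern in exchange for a predictable running time.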

8 EVALUATION CRITERION
• Minimum Encoding.
• Graph Compression.
• Substructure Size (tried, but did not work).

9 EVALUATION CRITERION: MINIMUM DESCRIPTION LENGTH
• Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set.
• DL of the graph: the number of bits necessary to completely describe the graph.
• Search for the substructure that results in the maximum compression.
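One common way to write this criterion is shown below. The slide gives no formula, so the notation is an assumption: G is the input graph, S a candidate substructure, DL(S) the number of bits needed to describe S, and DL(G | S) the number of bits needed to describe G once each instance of S has been replaced by a single vertex.

```latex
\[
  S^{*} \;=\; \arg\min_{S}\,\bigl[\, DL(S) + DL(G \mid S) \,\bigr],
  \qquad
  \mathrm{compression}(S) \;=\; \frac{DL(G)}{DL(S) + DL(G \mid S)}
\]
```

Under this reading, minimizing the total description length and maximizing the compression ratio select the same substructure, because DL(G) is fixed for a given input graph.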

10 THE ASRS DATABASE
• The Aviation Safety Reporting System (ASRS).
• Reports of incidents that might affect aviation safety.
• Some fields modified or omitted to keep the pilot’s identity confidential.
• 72,504 records, with 74 fields each.

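11 THE ASRS DATABASE: KNOWLEDGE REPRESENTATION
[Figure: graph representation of the database. Event vertices (EVENT 1, EVENT 2, ..., EVENT m); attribute values such as Small_Transport, ATC, Cockpit, Others, and Land_Plane; field names Acft_type, Detectors, Num_engine, and Surface; Near_in_distance links between events.]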

12 THE ASRS DATABASE: PRIOR KNOWLEDGE
• Connections between events whose related airports are near each other.
• An airport is near another airport if the distance between them is not more than 200 km.
• Spatial relations represented with “near_in_distance” edges.
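The 200 km rule can be sketched as follows. The slides do not say how the distance was computed, so the great-circle (haversine) formula is an assumption, as are the Earth radius constant, the function names, and the event identifiers and coordinates, which are made up for illustration.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0   # mean Earth radius (assumed constant)
NEAR_KM = 200.0            # threshold stated on the slide


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_in_distance_edges(event_airports):
    """Return an undirected edge for every pair of events whose airports are within NEAR_KM."""
    edges = []
    for (ev_a, pos_a), (ev_b, pos_b) in combinations(event_airports.items(), 2):
        if haversine_km(*pos_a, *pos_b) <= NEAR_KM:
            edges.append((ev_a, ev_b, "near_in_distance"))
    return edges


# Hypothetical events mapped to the (lat, lon) of their related airport.
event_airports = {
    "event_1": (0.0, 0.0),
    "event_2": (0.0, 1.0),   # about 111 km from event_1
    "event_3": (0.0, 3.0),   # about 222 km from event_2, 334 km from event_1
}

edges = near_in_distance_edges(event_airports)
print(edges)  # [('event_1', 'event_2', 'near_in_distance')]
```

Comparing every pair is quadratic in the number of events, which is acceptable for a sketch but would likely call for spatial indexing on a database of this size.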

13 THE ASRS DATABASE RESULTS
• Data set:
  – “CONSEQUENCES”: “ACFT_DAMAGED” or “INJURY”.
  – “ACFT_TYPE”: “MED_LARGE_TRANSPORT”.
• Graph:
  – 1,053 events, 42,723 vertices, 41,669 directed edges, and 18,373 undirected edges.
  – File size: 2,143,356 bytes.

14 THE ASRS DATABASE RESULTS: MINIMUM ENCODING HEURISTIC
• Substructure 1, found with the Minimum Encoding heuristic, with 374 instances.
[Figure: two Event-centered graphs.
First graph: attribute values Med_Large_Transport, Turbojet, IFR, Retractable, Passenger, Air_Carrier, Occ, Flight_Crew, Land_Plane, Low_Wing; field names Acft_type, Crew_size, Engine_typ, Flt_plan, Lndg_gear, Num_engine, Operator, Mission, Report_typ, Role, Surface, Wings.
Second graph: attribute values Med_Large_Transport, Turbojet, Retractable, Air_Carrier, Occ, Land_Plane, Low_Wing; field names Acft_type, Crew_size, Engine_typ, Lndg_gear, Num_engine, Operator, Report_typ, Surface, Wings; plus a Near_in_distance edge.]

15 THE ASRS DATABASE RESULTS: MINIMUM ENCODING HEURISTIC
• Substructure 3, found with the Minimum Encoding heuristic, with 286 instances.

16 THE ASRS DATABASE RESULTS: MINIMUM ENCODING HEURISTIC
• Substructure 4, found with the Minimum Encoding heuristic, with 67 instances.
[Figure: substructure with labels Sub_2, Event, and Near_in_distance.]

17 THE ASRS DATABASE RESULTS: MINIMUM ENCODING HEURISTIC
• Subdue was able to geographically relate incidents that occurred near each other and shared the same characteristics.
• This information is valuable for investigating similar events in a particular region that might have the same cause.

18 THE ASRS DATABASE RESULTS: GRAPH COMPRESSION HEURISTIC
• Substructure 3: a problem occurring in a region determined by the area where the substructures were found.
• Substructure 3 interpretation:
  – Two incidents that happened near each other.
  – If airplane identification and complete date and time were available, one might find and trace an airplane that failed near one airport, was reported, and later had to land close to that first airport due to another failure.

19 THE EARTHQUAKE DATABASE
• Several catalogs.
• Sources such as the National Geophysical Data Center.
• Each record has 35 fields describing the earthquake’s characteristics.

20 THE EARTHQUAKE DATABASE: KNOWLEDGE REPRESENTATION

21 THE EARTHQUAKE DATABASE: PRIOR KNOWLEDGE
• Connections between events whose epicenters were close to each other in distance (<= 75 kilometers).
• Connections between events that happened close to each other in time (<= 36 hours).
• Spatio-temporal relations represented with “near_in_distance” and “near_in_time” edges.
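The temporal half of this rule can be sketched as below; the distance half would apply the 75 km threshold to pairs of epicenters in the same way. The slides do not describe how the edges were generated, so sorting the events by time and scanning forward is an assumption, and the event identifiers and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=36)  # threshold stated on the slide


def near_in_time_edges(events):
    """Return (earlier_id, later_id) pairs for events at most MAX_GAP apart.

    `events` is a list of (event_id, timestamp) tuples.
    """
    ordered = sorted(events, key=lambda e: e[1])
    edges = []
    for i, (id_a, t_a) in enumerate(ordered):
        for id_b, t_b in ordered[i + 1:]:
            if t_b - t_a > MAX_GAP:
                break  # events are time-sorted, so all later ones are too far as well
            edges.append((id_a, id_b))
    return edges


# Hypothetical events.
events = [
    ("q1", datetime(2000, 1, 1, 0, 0)),
    ("q2", datetime(2000, 1, 2, 12, 0)),   # exactly 36 hours after q1
    ("q3", datetime(2000, 1, 2, 12, 1)),   # 36 hours and 1 minute after q1
    ("q4", datetime(2000, 1, 10, 0, 0)),   # far from everything else
]

time_edges = near_in_time_edges(events)
print(time_edges)  # [('q1', 'q2'), ('q2', 'q3')]
```

Sorting first lets the inner loop stop at the first event outside the 36-hour window, so dense bursts of activity are the only case where many pairs are examined.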

22 THE EARTHQUAKE DATABASE RESULTS
• Sample of the events that happened in one year.
• All the fields in the records were considered.
• Graph:
  – 10,135 events, 136,077 vertices, 125,941 directed edges, and 757,417 undirected edges.
  – Graph file size: 26,963,605 bytes.

23 THE EARTHQUAKE DB RESULTS: GRAPH COMPRESSION HEURISTIC
• Substructure 8, found with the Graph Compression heuristic, with 140 instances.
[Figure: substructure with labels Sub-1, Sub-7, Near_in_time, and Depth.]

24 THE EARTHQUAKE DB RESULTS
• Graph Compression runs faster, which allows more iterations.
• Given enough time, MDL could find those substructures. MDL finds substructures using spatio-temporal relations.
• Subdue found relations involving fields such as “Catalog”, “Month”, “Mag1 Scale”, and “Depth”.
• More earthquakes happened in the months of May and June.
• The most frequent earthquake depths were 33 and 10 kilometers.

25 DETERMINING EARTHQUAKE ACTIVITY
• Geologist Dr. Burke Burkart.
• Study of seismicity caused by the Orizaba Fault.

26 DETERMINING EARTHQUAKE ACTIVITY
• Geologist Dr. Burke Burkart.
• Study of seismicity caused by the Orizaba Fault.
• Fault: a fracture in a surface along which a displacement of rocks has also occurred.
• Selection of the area of study, two regions:
  – First: longitude 94.0W through 101.0W and latitude 17.0N through 18.0N.
  – Second: longitude 94.0W through 98.0W and latitude 18.0N through 19.0N.

27 DETERMINING EARTHQUAKE ACTIVITY
• Divide the area into 44 rectangles of one half of a degree in both longitude and latitude.
• Sample the earthquake activity in each sub-area.
• Run Subdue on each sub-area.
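The count of 44 sub-areas follows from the two study regions given earlier in the presentation (longitude 94.0W to 101.0W by latitude 17.0N to 18.0N, and longitude 94.0W to 98.0W by latitude 18.0N to 19.0N). A short sketch confirms the arithmetic. The function name and the ordering of the cells are assumptions; the slides do not say how the sub-areas were numbered.

```python
def half_degree_cells(lon_w_min, lon_w_max, lat_n_min, lat_n_max, step=0.5):
    """Split a region into step-by-step degree cells.

    Longitudes are degrees west and latitudes degrees north, as on the slides.
    Each cell is (lon_w_low, lon_w_high, lat_n_low, lat_n_high).
    """
    n_lon = round((lon_w_max - lon_w_min) / step)
    n_lat = round((lat_n_max - lat_n_min) / step)
    return [
        (lon_w_min + i * step, lon_w_min + (i + 1) * step,
         lat_n_min + j * step, lat_n_min + (j + 1) * step)
        for j in range(n_lat)
        for i in range(n_lon)
    ]


first = half_degree_cells(94.0, 101.0, 17.0, 18.0)   # 7 deg x 1 deg -> 14 x 2 cells
second = half_degree_cells(94.0, 98.0, 18.0, 19.0)   # 4 deg x 1 deg -> 8 x 2 cells
cells = first + second

print(len(first), len(second), len(cells))  # 28 16 44
```

The first region contributes 14 x 2 = 28 cells and the second 8 x 2 = 16, which gives the 44 sub-areas on which Subdue is run.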

28 DETERMINING EARTHQUAKE ACTIVITY

29 DETERMINING EARTHQUAKE ACTIVITY
• Substructure 1 (with 19 instances) and substructure 2 (with 8 instances) found in sub-area 26.

30 DETERMINING EARTHQUAKE ACTIVITY
• This pattern might give us information about the cause of the earthquakes.
• Subduction also affects this area, but at a specific depth that depends on the proximity to the Pacific Ocean.

31 SUBDUE’S POTENTIAL
• Subdue finds not only shared characteristics of events, but also spatial relations between them.
• Dr. Burke Burkart is studying the patterns to give direction to this research.
• Expect to find patterns representing parts of the path of the fault involved.
• Time relations not considered by Subdue.
• Earthquake characteristics.
• Important for other areas.

32 COMPARISON OF SUBDUE WITH ILP SYSTEMS
• Inductive Logic Programming (ILP) systems learn logical relations.
• FOIL, GOLEM, PROGOL.
• SUBDUE is competitive in several domains.

33 CONCEPT LEARNING SUBDUE
• ILP systems take positive and negative examples represented in first-order logic.
• The new Concept Learning Subdue (CLSubdue) also takes positive and negative examples.
• Can learn multiple rules.
• Evaluation is ongoing.

34 CONCLUSION
• Subdue was successful in real-world databases.
• Subdue discovered interesting patterns using the temporal and spatial relations.
• Subdue found significant patterns in the Orizaba Fault Earthquake Database.
• Subdue has the potential to compete with ILP systems.
• Subdue was compared with Progol.

35 FUTURE WORK
• Theoretical analysis.
  – Show that Subdue converges to the optimal substructure.
  – Better understanding of search space properties.
  – Bounds on complexity (e.g., PAC learning).
• Graphical user interface to visualize substructures and their instances.
• Express ranges of values (ranges of depth, magnitude, latitude, longitude, etc. in the Earthquake database).
• Continue evaluation in real-world spatio-temporal databases.