Towards Data Mining Without Information on Knowledge Structure

Presentation transcript:

Towards Data Mining Without Information on Knowledge Structure
Alexandre Vautier, Marie-Odile Cordier and René Quiniou
Université de Rennes 1 / INRIA Rennes - Bretagne Atlantique
Wednesday, September 19th, 2007

I'm going to present the paper entitled "Towards Data Mining Without Information on Knowledge Structure".

Usual KD Process

User needs: a data mining task and domain knowledge.
Pipeline: Data → Target Data (Selection) → Preprocessed Data (Preprocessing) → Transformed Data (Transformation) → Models (Data Mining) → Knowledge (Interpretation/Evaluation).

The picture shows the classical view of the knowledge discovery process. It is composed of a sequence of operations executed on the data: selection, preparation, mining and, finally, model interpretation and evaluation. If the evaluation is not acceptable, the user goes back to a previous step and reiterates the computations until he is satisfied. He can try different data formats, execute different algorithms, and evaluate different types of model, e.g. clusters, association rules, etc. However, this means that the user possesses some knowledge about his data and about the knowledge he wants to extract. [click]

Usual KD Process (continued)

What can a user extract from data without domain knowledge? [The slide crosses out "User needs: a data mining task, domain knowledge".]

[click] But what if he has very weak knowledge about his data, or no knowledge at all? Then he iterates computations on his data more or less blindly, with no guarantee that he gets close to interesting extracted knowledge.

Application context: Network Alarms

Goals: represent network alarms, understand network behavior, detect new DDoS attacks.
An alarm is composed of:
- a directed link between two IP addresses
- a date
- a severity (low, med, high), related to the link rate

To be more precise, let's look at an illustration from an application related to network security. In order to predict and prevent DDoS attacks, experts are looking for a better understanding of the network's behavior. One way is to analyze suspicious packets conveyed by network routers and stored in alarm logs. Though such logs contain interesting and heterogeneous information, they are extremely large, and it is very difficult for experts to extract useful knowledge from them. Precisely, an expert would be interested in the spatial and temporal relations structuring the data. The picture shows such relations: in the spatial view, an edge represents an alarm between two network nodes, with high severity shown in red and low severity in green. We can see that a few nodes are highly connected, which is in itself interesting information. [click]

Application context: Network Alarms (continued)

[click] In the temporal view, we can see that an alarm pattern occurred several times. So the goal of a mining framework is to help the expert locate and extract, from a voluminous alarm log, the patterns related to interesting situations.
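For concreteness, here is a minimal sketch of the alarm record described on this slide, in hypothetical Python (the representation is mine, not the paper's):

    from dataclasses import dataclass

    # One alarm, as described on the slide: a dated, severity-tagged,
    # directed link between two IP addresses.
    @dataclass(frozen=True)
    class Alarm:
        date: str          # e.g. "11/01/05"
        src: str           # source IP address
        dst: str           # target IP address
        severity: str      # "low", "med" or "high", related to the link rate

    log = [Alarm("11/01/05", "192.168.2.1", "192.168.2.5", "low")]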

Application context: Network Alarms — Models

Three models that could be extracted from alarm logs:
- Generalized links: M1 = {192.168.2.1 → *, * → 192.168.2.5, …}
- Sequences: M2 = {1.5.5.* → 2.2.3.* > 2.2.3.* → 1.2.3.4, …}
- Clustering on date and severity: M3 = {{11/01/05…11/03/05, low}, {11/07/05…11/15/05, high}}

M1 is a set of generalized links: it represents spatial information by abstracting IP addresses. M2 is a set of sequences: it represents temporal information as alarm sequences involving sets of nodes denoted by abstracted IP addresses. M3 is a set of clusters: it represents alarm properties, here occurrence date and severity. Any of these models may capture relevant information, but which is the best? So the questions are: how can an expert select the relevant data without any knowledge about attack patterns? How can he select the relevant model, the one that best explains a suspicious behavior related to an attack? And how can he select the relevant mining algorithms that would extract the models above? A sketch of what "covering" means for M1 follows.
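As a toy illustration (my code, not the paper's), a generalized link covers a concrete alarm link when both of its wildcarded addresses match:

    import fnmatch

    # A generalized link such as "192.168.2.1 -> *" covers every concrete
    # link whose source and target match its two wildcarded patterns.
    def covers(pattern: str, src: str, dst: str) -> bool:
        p_src, p_dst = (p.strip() for p in pattern.split("->"))
        return fnmatch.fnmatch(src, p_src) and fnmatch.fnmatch(dst, p_dst)

    m1 = ["192.168.2.1 -> *", "* -> 192.168.2.5"]
    print(any(covers(p, "192.168.2.1", "10.0.0.7") for p in m1))  # True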

Objectives

Goal: search for models that fit the given data.
Current assumption: the user has sufficient knowledge to
- define the type of model
- choose the relevant DM algorithm
Our proposition: alleviate this assumption by
- automatically executing DM algorithms to extract models from the data
- evaluating the resulting models in a generic manner, so as to propose the "best suited" model(s) to the user

Without any information on the knowledge that should be extracted, we can only assume that the extracted model should fit the data. "Fit" means that the model should not be too general, so that it stays informative, nor too specific, so that it abstracts the data sufficiently. A common assumption in data mining is that the user has sufficient knowledge to define the structure of the model and to choose an algorithm, hoping that the result will be interesting. The model can be a set of sequences, a set of generalized links or a clustering, as shown just before, or a decision tree, a set of itemsets, etc. Our aim is to alleviate this assumption: in the proposed framework, DM algorithms are executed automatically on the data, and the resulting models are then ranked by a generic evaluation. The evaluation has to be generic, since models of different types must be compared.

Framework DM algorithm specifications „Model extraction Here is a picture of our framework. On the one side we have a set of DM algorithms specifications. On the other side we have a specification of the data. The data mining process is decomposed as follows. The first step selects the algorithm that could be performed on the data. This is achieved by unifying the respective specifications of data and algorithms. This operation can adapt the algorithm to data features. In a second step the selected adapted algorithms are executed on the data. In a third step the covering relation relating the extracted models and the given data is analyzed in order to evaluate the quality of the model. The quality of the model is related to the complexity of the data with respect to the model and a covering relation Finally, the models are ranked according to their complexity. DM algorithm specifications ‚Data Specification ƒUnification of specifications „Model extraction …Generic evaluation †Model ranking

Schemas for specification

- Enhanced algebraic specifications (types, operations and equations)
- Category theory [Mac Lane 1942]
- Sketches [Ehresmann 1965]
- Specification inheritance

The key concept of our approach is the schema, which is used for specification. Schemas are close to algebraic specifications, which associate types, operations and equations. A schema is grounded in category theory, more precisely in Ehresmann's sketch theory, and several constructions have been introduced to cope with DM features. Don't be afraid: only a subset of category theory is used to define schemas. A schema is an operational specification: a general specification mechanism is needed to generalize the structure behind DM algorithms, but an efficient implementation framework is needed as well. As in algebraic specifications, we make strong use of inheritance between specifications in order to reuse DM algorithms.

Data specification: the Network Alarm Schema

- Node: a type
- Edge: a function or a relation
- Green dotted edge: projection ⇒ Cartesian product
- Red dashed edge: inclusion ⇒ union

Now I illustrate the concept of schema on the specification of the alarm data presented earlier. A node represents a type; an edge represents an operation, which can be a function or a relation. For example, the node L(alarm) represents the type "list of alarms", and the exa operation represents the membership relation. The node 1 represents the void type, and the relations leaving this node may be viewed as constants: the edge between 1 and date represents the constants of type date. The edge colors have the following meaning: green edges represent Cartesian products (for instance, the type alarm is the Cartesian product of the types date, link and severity), while red edges define unions of types (an example of a union will be shown in the next slide). [click]

Data specification: the Network Alarm Schema (continued)

[click] Furthermore, by using inheritance, more precisely a schema morphism, we can attach semantic information to the type L(alarm) to indicate that this node represents the type of the data to be mined. Such a schema could be encoded as a typed graph, sketched below.
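Here is one hypothetical way to encode the alarm schema as a plain typed graph in Python; the node and edge names follow the slide, but the representation itself is my own sketch:

    # Hypothetical encoding of the alarm schema as a typed graph.
    alarm_schema = {
        "nodes": ["1", "date", "link", "severity", "alarm", "L(alarm)"],
        "edges": [
            # green (projection) edges: alarm is a Cartesian product
            ("alarm", "date", "projection"),
            ("alarm", "link", "projection"),
            ("alarm", "severity", "projection"),
            # exa: the membership relation between a list and its elements
            ("L(alarm)", "alarm", "exa"),
            # constants of type date, as relations leaving the void type 1
            ("1", "date", "constant"),
        ],
        # the schema morphism tags L(alarm) as the type of the data to mine
        "data_type": "L(alarm)",
    }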

DM algorithm specification: generalized edges

After an example of data specification, here is an example of algorithm specification. This schema is a bit more complex, but not much. It specifies an algorithm that extracts generalized links in order to express the high connectivity between some nodes of a network. A graph is represented by a list of edges, and an edge by two nodes, respectively its source and its target. A generalized graph is likewise represented by a list of edges, of type edgeG (generalized edge), whose endpoints range over the union of the normal node type and the type of abstracted nodes (represented by the star symbol). To express the fact that one graph generalizes another, we introduce the notion of covering relation, which makes explicit the link between a model and the data covered by this model. In our example, the covering relation is expressed by the path cle from the type L(edgeG) to the type edge; cle is composed of the sub-covering relations cn and ce, for the nodes and the edges respectively. [click]

DM algorithm specification: generalized edges (continued)

Model type and covering relation. [click] Finally, by using inheritance, we attach semantic information to the types model and data, as well as to the two most important edges: mine_graph, which represents the mining algorithm, and cle, which represents the covering relation. A sketch of cle follows.
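To make the covering path concrete, here is a hypothetical Python rendering of cle, built from its node-level part cn and edge-level part ce as described on the previous slide (my sketch, not the paper's code):

    # Hypothetical sketch of the covering path cle = (cn, ce).
    def cn(gen_node: str, node: str) -> bool:
        # cn: a generalized node covers a concrete node if it is that node
        # or the abstracted node "*"
        return gen_node == "*" or gen_node == node

    def ce(gen_edge: tuple, edge: tuple) -> bool:
        # ce: a generalized edge covers an edge if it covers both endpoints
        return cn(gen_edge[0], edge[0]) and cn(gen_edge[1], edge[1])

    def cle(model: list, edges: list) -> list:
        # cle: the concrete edges that a model (generalized edges) covers
        return [e for e in edges if any(ce(g, e) for g in model)]

    print(cle([("192.168.2.1", "*")],
              [("192.168.2.1", "10.0.0.7"), ("8.8.8.8", "1.1.1.1")]))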

Schema unification

An operational schema is computed by unifying an algorithm schema and a data schema. This is performed automatically by the system. For example, the two previous schemas can be unified. [click]

Schema unification (continued)

[Slide diagram: the unified schema, with nodes tagged "data type" and "abstract data type".] [click] First, the types L(edge) and L(alarm) have to be unified, since both represent data. This means that the types link and edge on the one hand, and the types actor and node on the other hand, must also be unified. This leads to the insertion of the node alarm between the nodes L(edge) and edge.

Schema unification (continued)

[click] For lack of space, not all the types are represented in the unification schema. A very important point to notice is that the covering relation has been rewritten automatically in order to take into account the forgotten attributes, such as the date and the severity. Note also that the unification of two schemas is not unique: for example, source and target could be swapped.

Framework DM algorithm specifications „Model extraction Up to now , we have described the concept of schema and we have explained the first processing step of our framework. The next step is not detailed here since it is related to the data mining algorithm that is used. In the next slides, we will detail the generic evaluation. DM algorithm specifications ‚Data Specification ƒUnification of specifications „Model extraction …Generic evaluation †Model ranking

Generic evaluation

- Compares different kinds of model
- Inspired by Kolmogorov complexity: the complexity of an object x is the size s(p) of the shortest program p that outputs x when executed on a universal machine f:

  C_f(x) = min { s(p) | f(p) = x }

The goal of the evaluation step is to assess the relevance of the extracted models. In our opinion, relevance is related to information complexity: the more a model summarizes the information, the more relevant it is as an abstraction of the data. The generic evaluation is inspired by Kolmogorov complexity, and more precisely by the MDL principle. Informally, the Kolmogorov complexity of an object is the size of the shortest program that outputs that object. It is a good basis for comparing models because it is general, assumes no particular goal, and can compare models of different types.
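Kolmogorov complexity itself is uncomputable, so MDL-style systems substitute a computable code length. A common practical stand-in (my illustration, not the paper's method) is compressed size:

    import random
    import zlib

    # Compressed size as a computable stand-in for C_f(x): redundant data
    # admits a short description, noise does not.
    def code_length(text: str) -> int:
        return len(zlib.compress(text.encode()))

    regular = "192.168.2.1 -> 192.168.2.5 low " * 40
    random.seed(0)
    noise = "".join(random.choice("0123456789. ->lowmedhigh")
                    for _ in range(1200))
    print(code_length(regular))  # small: the pattern compresses well
    print(code_length(noise))    # much larger, close to the raw size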

Generic evaluation (continued)

Complexity of data d in a schema S relative to a model m with covering relation c: M ↔ D:

  K(d,m,S) = k(M)         the model structure
           + k(D)         the data structure
           + k(c)         the covering relation
           + k(m|M)       the model
           + k(d|m,c,D)   the data, knowing the model and the covering relation

To assess how well the model fits the data, we evaluate how a model m, associated with a covering relation c, covers the data d. Following the minimum description length approach, we decompose the description into several parts: the size of the model structure (the more nodes and relations it needs, the larger it is); the size of the data structure, measured the same way; the complexity of the covering relation, related to the number of basic operations it uses; the size of the model, roughly the number of bits needed to encode it; and, most importantly, the size of the data index, knowing the covering relation, the model and the data type. The data type matters because, for data not covered by the model, the index has to be built from scratch. The framework thus has to minimize the total complexity. The first four terms are fixed; only the last one can be optimized, and the way to do so is to decompose the covering relation, as briefly described on the next slide.

Path indexing: covering relation decomposition

Null decomposition, with c: M ↔ D:

  k(d|m,c,D) = k(d|c(m)) + k(d\c(m)|D)

The covering relation c relates the two types M and D, a model m and a dataset d. The relation c covers the data belonging to the set c(m), so we can build an index by first indexing the elements of d that lie inside c(m), and then indexing the elements of d that lie outside it. [click]

Path indexing: covering relation decomposition (continued)

Decomposition relying on relation composition: c = s ∘ t: M ↔ D, with t: M ↔ A and s: A ↔ D, so that c(m) = s ∘ t(m).

[click] In an operational schema, the covering relation can be rewritten as a composition of several relations. Thus the relation c can be decomposed as shown in the bottom part of the slide: here, c is defined as the composition of s with t.

Path indexing: covering relation decomposition (continued)

  k(d|m, s ∘ t, D) = k(a|t(m)) + k(d|s(a)) + k(d\s(a)|D)

[click] The intermediate set a is important: from it we can compute the set s(a), and within that set it is easier to locate the dataset d. As a result, the complexity of the data is the complexity of finding the set a inside t(m), plus the complexity of finding d inside s(a), plus, as before, the cost of describing the data that falls outside s(a).
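To give the null decomposition a numeric face, here is a hypothetical two-part cost in Python: choosing which elements of c(m) belong to d costs log2 of a binomial coefficient, and every uncovered element is charged a raw per-item price (both the function and the 64-bit price are my assumptions, not the paper's):

    import math

    # k(d|m,c,D) = k(d|c(m)) + k(d\c(m)|D), as a toy bit count.
    def index_cost(d: set, covered: set, bits_per_raw_item: float) -> float:
        inside = d & covered
        # choosing |inside| elements among the |covered| candidates
        choice_bits = math.log2(math.comb(len(covered), len(inside)))
        # each uncovered element is described from scratch in its type D
        return choice_bits + len(d - covered) * bits_per_raw_item

    d = {("1.2.3.4", "5.6.7.8"), ("1.2.3.4", "9.9.9.9"),
         ("8.8.8.8", "1.1.1.1")}
    covered = {("1.2.3.4", "5.6.7.8"), ("1.2.3.4", "9.9.9.9"),
               ("1.2.3.4", "7.7.7.7")}
    print(index_cost(d, covered, bits_per_raw_item=64.0))  # ≈ 1.58 + 64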

Experiments

- Extraction of clusters, generalized edges and sequences
- Dataset: 10,000 alarms; duration: 400 seconds (excluding the DM algorithms' own run time)
- 6 operational algorithms
- Experiments on datasets generated from known models
- Network alarms from a real network

We have experimented with our framework on several datasets. First, we ran it on several randomly generated sets in order to improve its running time; currently it processes 10,000 alarms in 400 seconds. Secondly, to validate the method, we are working on a new kind of experiment: data are generated from a specific model, processed by our framework, and the extracted models are compared with the expected ones. Finally, we have processed real data provided by the French telecommunication operator France-Telecom; the task was to extract relevant models, visualized as in the introductory examples.

Discussion

- Unification: exponential in time with respect to the number of nodes in a schema
- Generic evaluation: linear in time and space
- Adapting the evaluation method: user-defined; driven by a model visualization; based on local data instead of global data

A few words about the computational complexity of the approach. Unification can be exponential: the worst case occurs when no edge or node properties are specified in the two schemas to be unified, so that every pair of nodes can be matched and the number of possible unifications explodes. In practice, the user can avoid this drawback by providing sufficiently constrained specifications. The evaluation is linear in time and space, which makes it tractable for medium-size datasets; large datasets, however, cannot fit in memory. Furthermore, to use the KD process as an iterative and interactive one, we need to offer several ways of evaluating models, tied to visualization tools or to data localization tools.

What do schemas bring to Data Mining?

- They describe data and DM algorithms in a common language
- They allow the data structure to be unified with a DM algorithm's input
- They provide a way to compute the complexity of a model relative to a type in a schema
- They provide a way to compute the complexity of the data relative to a model, and to a covering relation and its decomposition
- They can be implemented efficiently

As a conclusion: what do schemas bring to data mining? They give us the means to do all of the above.

Towards Data Mining Without Information on Knowledge Structure
Alexandre Vautier, Marie-Odile Cordier and René Quiniou
INRIA Rennes - Bretagne Atlantique / Université de Rennes 1

Thank you!