Towards Data Mining Without Information on Knowledge Structure

Presentation transcript:

1 Towards Data Mining Without Information on Knowledge Structure
Alexandre Vautier, Marie-Odile Cordier and René Quiniou
Université de Rennes 1, INRIA Rennes - Bretagne Atlantique

I'm going to present the paper entitled "Towards Data Mining Without Information on Knowledge Structure". Wednesday, September 19th, 2007.

2 Usual KD Process

User needs:
- A data mining task
- Domain knowledge

[Diagram: the classical KDD pipeline — Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Models → Interpretation/Evaluation → Knowledge]

The picture shows the classical view of the knowledge discovery process. It is composed of a sequence of operations executed on the data: selection, preparation, mining and, finally, model interpretation and evaluation. If the evaluation is not acceptable, the user goes back to a previous step and reiterates the computations until he is satisfied. He can try different data formats, execute different algorithms, and evaluate different types of models, e.g. clusters, association rules, etc. However, this means that the user possesses some knowledge about his data and about the knowledge he wants to extract. [click]

3 What can a user extract from data without domain knowledge?

Usual KD process, but the user needs — a data mining task, domain knowledge — are marked with a ✗.

[Diagram: the same KDD pipeline as on the previous slide]

[click] But what if he has very weak knowledge about his data, or no knowledge at all? Then he iterates computations on his data more or less blindly, with no assurance of getting close to interesting extracted knowledge. What can a user extract from data without domain knowledge?

4 Application context: Network Alarms

Goals:
- Represent network alarms
- Understand network behavior
- Detect new DDoS attacks

An alarm is composed of:
- A directed link between two IP addresses
- A date
- A severity (low, med, high), related to the link rate

To be more precise, let's look at an illustration from an application related to network security. In order to predict and prevent DDoS attacks, experts are looking for a better understanding of network behavior. One way is to analyze suspicious packets conveyed by network routers and stored in alarm logs. Though such logs contain interesting and heterogeneous information, they are extremely large and it is very difficult for experts to extract useful knowledge from them. More precisely, an expert would be interested in the spatial and temporal relations structuring the data. The picture shows such relations: on the spatial view, an edge represents an alarm between two network nodes — high severity is shown in red and low severity in green. We can see that a few nodes are highly connected, which is interesting information. [click]
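To make the data concrete, here is a minimal sketch of an alarm record as a Python dataclass; the field names and sample values are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date

# Minimal sketch of an alarm record; field names and values are illustrative.
@dataclass(frozen=True)
class Alarm:
    src: str        # source IP address (the link is directed: src -> dst)
    dst: str        # target IP address
    day: date       # date of the alarm
    severity: str   # "low", "med" or "high", related to the link rate

# A tiny alarm log in this representation.
log = [
    Alarm("1.5.5.2", "10.0.0.1", date(2005, 11, 1), "low"),
    Alarm("1.5.5.7", "10.0.0.1", date(2005, 11, 7), "high"),
]
```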

5 Application context: Network Alarms

(Same slide content as slide 4.)

[click] On the temporal view, we can see that an alarm pattern occurred several times. So the goal of a mining framework would be to help the expert locate and extract patterns related to interesting situations from a voluminous alarm log.

6 Application context: Network Alarms — Models

- Generalized links: M1 = { … → *, * → …, … }
- Sequences: M2 = { 1.5.5.* → * > * → …, … }
- Clustering on date and severity: M3 = { {11/01/05…11/03/05, low}, {11/07/05…11/15/05, high} }

[Diagram: alarms fed to data mining algorithms, producing the models]

Here are three different models that could be extracted from alarm logs:
- M1 is a set of generalized links: it represents spatial information by abstracting IP addresses.
- M2 is a set of sequences: it represents temporal information by alarm sequences involving sets of nodes represented by abstracted IP addresses.
- M3 is a set of clusters: it represents alarm properties, here related to occurrence date and severity.

Any of these models represents relevant information, but which is the best? So the questions are: how can an expert select the relevant data without any knowledge about attack patterns? How can he select the relevant model that best explains a suspicious behavior related to an attack? And how can he select the relevant mining algorithms that would extract the models above?
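For intuition, the three model types might be encoded as plain Python values along the following lines. This is a hypothetical encoding, not the paper's: '*' abstracts IP addresses or octets, and a sequence is read left to right.

```python
# M1: generalized links -- pairs of IP patterns, '*' abstracting addresses.
M1 = [("1.5.5.*", "*"), ("*", "10.0.0.1")]

# M2: sequences -- generalized links ordered in time ("followed by").
M2 = [[("1.5.5.*", "*"), ("*", "10.0.0.1")]]

# M3: clusters on (date range, severity).
M3 = [(("2005-11-01", "2005-11-03"), "low"),
      (("2005-11-07", "2005-11-15"), "high")]
```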

7 Objectives

Goal: search for models that fit the given data.

Current assumption: the user has sufficient knowledge to
- define the type of model
- choose the relevant DM algorithm

Our proposition: alleviate the current assumption by
- executing DM algorithms automatically to extract models from data
- evaluating the resulting models in a generic manner, so as to propose the "best suited" model(s) to the user

Without any information on the knowledge that should be extracted, we can only assume that the extracted model should fit the data. "Fit" means that the model should not be too general, so that it stays informative, but not too specific either, so that it abstracts the data sufficiently. A common assumption made in data mining is that the user has sufficient knowledge to define the structure of the model and to choose an algorithm, hoping that the result will be interesting. The model can be a set of sequences, a set of generalized links or a clustering, as shown just before — or a decision tree, a set of itemsets, etc. Our aim is to alleviate this assumption. In the proposed framework, we execute DM algorithms automatically on the data and then rank the resulting models by a generic evaluation. The evaluation has to be generic since different types of models must be compared.

8 Framework DM algorithm specifications „Model extraction
Here is a picture of our framework. On the one side we have a set of DM algorithms specifications. On the other side we have a specification of the data. The data mining process is decomposed as follows. The first step selects the algorithm that could be performed on the data. This is achieved by unifying the respective specifications of data and algorithms. This operation can adapt the algorithm to data features. In a second step the selected adapted algorithms are executed on the data. In a third step the covering relation relating the extracted models and the given data is analyzed in order to evaluate the quality of the model. The quality of the model is related to the complexity of the data with respect to the model and a covering relation Finally, the models are ranked according to their complexity. DM algorithm specifications ‚Data Specification ƒUnification of specifications „Model extraction …Generic evaluation †Model ranking
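A minimal sketch of steps 3–6 as a control loop, assuming hypothetical unify and complexity functions supplied by the caller; the paper's actual interfaces are richer.

```python
# Steps 3-6 of the framework as a control loop. `unify` yields executable
# algorithms adapted to the data specification (step 3); each one extracts
# a model (step 4); `complexity` scores how well the model fits the data
# (step 5); models are then ranked by increasing complexity (step 6).
def rank_models(data, data_spec, algo_specs, unify, complexity):
    scored = []
    for algo_spec in algo_specs:
        for algo in unify(algo_spec, data_spec):
            model = algo(data)
            scored.append((complexity(data, model), model))
    scored.sort(key=lambda pair: pair[0])
    return scored
```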

9 Schemas for specification
Enhanced algebraic specifications (types, operations and equations)
Category theory [Mac Lane 1942]; sketches [Ehresmann 1965]
Use of specification inheritance

The key concept of our approach is the schema, which is used for specification. Schemas are close to algebraic specifications, which associate types, operations and equations. A schema is related to category theory, and more precisely to the sketch theory of Ehresmann. Several constructions have been introduced to cope with DM features. Don't be afraid: only a subset of category theory is used for defining schemas. A schema is an operational specification: a general specification mechanism is needed to capture the structure behind DM algorithms, but an efficient implementation framework is needed as well. As in algebraic specifications, we make strong use of inheritance between specifications to reuse DM algorithms.

10 Data specification: Network Alarm Schema

Node: a type.
Edge: a function or a relation.
Green dotted edge: projection ⇒ Cartesian product.
Red dashed edge: inclusion ⇒ union.

Now I illustrate the concept of schema on the specification of the alarm data presented earlier. A node represents a type. An edge represents an operation, which can be a function or a relation. For example, the node L(alarm) represents the type "list of alarms", and the exa operation represents the member relation. The node 1 represents the void type, and the relations exiting from this node may be viewed as constants; so the edge between 1 and date represents the constants of type date. The edge colors have the following meaning: green edges represent Cartesian products — for instance, the type alarm is the Cartesian product of the types date, link and severity — and red edges define unions of types. An example of a union will be shown in the next slide. [click]

11 Data specification: Network Alarm Schema

Node: a type.
Edge: a function or a relation.
Green dotted edge: projection ⇒ Cartesian product.
Red dashed edge: inclusion ⇒ union.

[click] Furthermore, by using inheritance — more exactly, a schema morphism — we can add semantic information to the type L(alarm) to indicate that this node represents the type of the data to be mined.
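As a rough illustration (not the paper's notation), such a schema could be written down as a labeled graph, with each edge's label recording whether it is a plain operation, a projection (green dotted, marking a Cartesian product) or an inclusion (red dashed, marking a union). Only edges mentioned on the slide are included; the encoding itself is an assumption.

```python
# The alarm schema as a labeled graph (illustrative encoding).
alarm_schema = {
    "nodes": {"1", "date", "link", "severity", "alarm", "L(alarm)"},
    "edges": [
        ("alarm", "date",     "projection"),  # green dotted: alarm is the
        ("alarm", "link",     "projection"),  # Cartesian product of
        ("alarm", "severity", "projection"),  # date x link x severity
        ("L(alarm)", "alarm", "relation"),    # exa: the member relation
        ("1", "date", "function"),            # constants of type date
    ],
}
```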

12 DM algorithm specification: Generalized edges

After this example of data specification, I present an example of algorithm specification. This schema is a bit more complex, but not too much. Here we give the specification of an algorithm that extracts generalized links to express the high connectivity between some nodes in a network. A graph is represented by a list of edges, and an edge is represented by two nodes, respectively the source and the target. A generalized graph is also represented by a list of edges, but here the nodes have type edgeG (generalized edge), the union of the type edge (a normal node) and the type of abstracted nodes (represented by the star symbol). To express the fact that a graph generalizes another graph, we introduce the notion of covering relation. A covering relation makes explicit the link between a model and the data covered by this model. In our example, the covering relation is expressed by the path cle from type L(edgeG) to type edge; cle is composed of the sub-covering relations cn and ce, for the nodes and the edges respectively. [click]

13 DM algorithm specification: Generalized edges

Model type. Covering relation.

[click] Finally, by using inheritance, we add semantic information to the types model and data, as well as to the two most important edges: mine_graph, which represents the mining algorithm, and cle, which represents the covering relation. A toy version of this covering relation is sketched below.
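Here is a toy Python rendering of the covering relation, with cn, ce and cle as described on the previous slide. The names follow the slides, but the code — in particular the prefix-matching rule for '*' patterns — is an illustrative assumption.

```python
STAR = "*"  # the abstracted node

def cn(gen_node, node):
    """Node covering: '*' covers anything; a pattern like '1.5.5.*' is
    assumed to cover addresses sharing that prefix."""
    if gen_node == STAR:
        return True
    if gen_node.endswith(".*"):
        return node.startswith(gen_node[:-1])
    return gen_node == node

def ce(gen_edge, edge):
    """Edge covering: both endpoints must be covered."""
    (gs, gt), (s, t) = gen_edge, edge
    return cn(gs, s) and cn(gt, t)

def cle(model, edges):
    """Covering relation from L(edgeG) to edge: the edges the model covers."""
    return [e for e in edges if any(ce(g, e) for g in model)]

# One generalized edge covering the first of two concrete edges:
print(cle([("1.5.5.*", STAR)],
          [("1.5.5.2", "10.0.0.1"), ("2.2.2.2", "3.3.3.3")]))
```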

14 Schema unification

An operational schema is computed by unifying an algorithm schema and a data schema. This is performed automatically by the system. For example, the two previous schemas can be unified. [click]

15 Schema unification — Data Type, Abstract Data Type

[click] First, the types L(edge) and L(alarm) have to be unified, since both represent data. This means that the types link and edge on the one hand, and the types actor and node on the other hand, must also be unified. This leads to the insertion of the node alarm between the nodes L(edge) and edge.

16 Schema unification — Data Type, Abstract Data Type

[click] Due to lack of space, we have not represented all the types in the unified schema. A very important point to notice is that the covering relation has been automatically rewritten in order to take into account the forgotten attributes, such as the date and the severity. We can also notice that the unification of two schemas is not unique: for example, source and target could be swapped.
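Concretely, the rewritten covering relation might simply project each alarm onto its link before testing coverage, so the forgotten attributes play no role. A one-function sketch, reusing ce from the generalized-edge sketch and the Alarm class from the earlier data sketch (both assumptions):

```python
# Illustrative rewrite of the covering relation after unification: an alarm
# is covered through its (src, dst) link; date and severity are skipped.
def covers_alarm(gen_edge, alarm):
    return ce(gen_edge, (alarm.src, alarm.dst))
```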

17 Framework DM algorithm specifications „Model extraction
Up to now , we have described the concept of schema and we have explained the first processing step of our framework. The next step is not detailed here since it is related to the data mining algorithm that is used. In the next slides, we will detail the generic evaluation. DM algorithm specifications ‚Data Specification ƒUnification of specifications „Model extraction …Generic evaluation †Model ranking

18 Generic evaluation

Compare different kinds of model. Inspired by Kolmogorov complexity: the complexity of an object x is the size s(p) of the shortest program p that outputs x when executed on a universal machine f:

Cf(x) = min { s(p) | f(p) = x }

The goal of the evaluation step is to assess the relevance of the extracted models. In our opinion, this notion of relevance is related to the notion of information complexity: the more a model can summarize the information, the more relevant it is as an abstraction of the data. The generic evaluation is inspired by Kolmogorov complexity, and more precisely by the MDL principle. Informally, the Kolmogorov complexity of an object is the size of the shortest program that outputs the object. Kolmogorov complexity is a good way to compare models because it is general, does not assume a particular goal, and can be used to compare different types of models.
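Kolmogorov complexity itself is uncomputable, but compression gives a crude, computable upper bound in the same spirit. This standard trick (not the paper's evaluation method) shows why regular data counts as "simpler" than random data:

```python
import os
import zlib

def approx_complexity(x: bytes) -> int:
    """Compressed size as a rough stand-in for Cf(x)."""
    return len(zlib.compress(x))

print(approx_complexity(b"ab" * 500))       # very regular: a few dozen bytes
print(approx_complexity(os.urandom(1000)))  # random: stays close to 1000 bytes
```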

19 Generic evaluation

Complexity of data d in a schema S relative to a model m and a covering relation c: M ↔ D:

K(d,m,S) = k(M)         (the model structure)
         + k(D)         (the data structure)
         + k(c)         (the covering relation)
         + k(m|M)       (the model)
         + k(d|m,c,D)   (the data, knowing the model, the covering relation and the data structure)

In our context, to assess how well the model fits the data, we have to evaluate how a model m associated with a covering relation c covers some data d. To achieve this, we use the minimum description length approach and decompose the program into several parts:
- the size of the model structure (the more nodes and relations it needs, the higher it is);
- the size of the data structure, defined in the same way as the model structure;
- the complexity of the covering relation, related to the number of basic operations it uses;
- the size of the model, which corresponds approximately to the number of bits needed to code the model;
- and, most importantly, the size of the data indexing, knowing the covering relation, the model and the data type. The data type is important since, for data not covered by the model, the data index has to be built from scratch.

So our framework has to find a way to minimize the total complexity. The first four terms are fixed; only the last one can be optimized, and the way to perform this optimization is to decompose the covering relation. This process is briefly described in the next slides.

20 Path indexing: covering relation decomposition

Null decomposition, with c: M ↔ D:

k(d|m,c,D) = k(d|c(m)) + k(d\c(m)|D)

The covering relation c relates two types M and D, and thereby a model m and a dataset d. The relation c covers the data belonging to the set c(m), so we can compute an index by first indexing the elements of d inside c(m), and then indexing the elements of d that lie outside c(m). [click]

21 Path indexing: covering relation decomposition

Null decomposition, with c: M ↔ D:

k(d|m,c,D) = k(d|c(m)) + k(d\c(m)|D)

Decomposition relying on relation composition: c = s ∘ t, with t: M ↔ A and s: A ↔ D.

[click] In an operational schema, the covering relation can be rewritten as a composition of several relations. Thus the relation c can be decomposed as shown in the bottom part of the slide: here, c is defined as the composition of s with t, so c(m) = s ∘ t(m).

22 Path indexing: covering relation decomposition

Null decomposition, with c: M ↔ D:

k(d|m,c,D) = k(d|c(m)) + k(d\c(m)|D)

Decomposition relying on relation composition, c = s ∘ t with t: M ↔ A and s: A ↔ D:

k(d|m, s ∘ t, D) = k(a|t(m)) + k(d|s(a)) + k(d\s(a)|D)

[click] The intermediate set a is important: from it we can find the set s(a), and from that set it is easier to find the dataset d. As a result, the complexity of the data is the complexity of finding the set a in t(m), plus the complexity of finding d in the set s(a), plus the complexity of the data left uncovered, given D.
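As a toy instantiation of these formulas, one can charge log2 C(n,k) bits to point out a k-element subset of an n-element set; under this (assumed) coding, the null decomposition reads:

```python
from math import comb, log2

def index_bits(n: int, k: int) -> float:
    """Bits to point out a k-element subset of an n-element set."""
    return log2(comb(n, k)) if n >= k >= 0 else float("inf")

def k_data_given_model(d: set, cm: set, domain: set) -> float:
    """k(d|m,c,D) under the null decomposition: index the covered part of d
    inside c(m), then the uncovered part inside the rest of the domain."""
    inside, outside = d & cm, d - cm
    return (index_bits(len(cm), len(inside))
            + index_bits(len(domain - cm), len(outside)))

# The better c(m) matches d, the cheaper the description:
d = {1, 2, 3, 4}
print(k_data_given_model(d, cm={1, 2, 3, 4, 5}, domain=set(range(20))))  # tight cover: ~2.3 bits
print(k_data_given_model(d, cm={10, 11},        domain=set(range(20))))  # poor cover: ~11.6 bits
```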

23 Experiments

Extraction of clusters, generalized edges, and sequences.
Dataset: alarms. Duration: 400 seconds (excluding the DM algorithms' running time). 6 operational algorithms.
Experiments on datasets generated by models, and on network alarms from a real network.

We have experimented with our framework for knowledge discovery on several datasets. First, we ran the framework on several randomly generated sets to improve its computation-time efficiency; currently it is able to process the alarms in 400 seconds. Second, to validate our method, we are currently working on a new kind of experiment: data are generated from a specific model, these data are processed by our framework, and the extracted models are compared to the expected models. Finally, we have processed real data provided by the French telecommunication operator France Télécom; the task was to extract relevant models, which are visualized as in the introductory examples.

24 Discussion

Unification:
- Exponential in time with respect to the number of nodes in a schema.

Generic evaluation:
- Linear in time and space.

Adapting the evaluation method:
- User defined
- According to a model visualization
- According to local data instead of global data

A few words about the computational complexity of the approach. The complexity of unification can be exponential: the worst case occurs when no edge or node properties are specified in the two schemas to be unified; in this case every pair of nodes can be matched, so the number of possible unifications is exponential. In practice, however, the user should provide specifications that are precise enough to avoid this drawback. The evaluation is linear in time and space, which means it is tractable for medium-size datasets, but large datasets cannot fit in memory. Furthermore, in order to use the KD process as an iterative and interactive process, we need to provide several means of evaluating models, related to visualization tools or to data-localization tools.

25 What do schemas bring to Data Mining?

- They describe data and DM algorithms in a common language.
- They allow unifying the data structure with DM algorithm inputs.
- They provide a way to compute the complexity of a model relative to a type in a schema.
- They provide a way to compute the complexity of the data relative to a model, and to a covering relation and its decomposition.
- They can be implemented efficiently.

As a conclusion: what do schemas bring to data mining? They give us the means to do all of the above.

26 Towards Data Mining Without Information on Knowledge Structure
Alexandre Vautier, Marie-Odile Cordier and René Quiniou
INRIA Rennes - Bretagne Atlantique, Université de Rennes 1

Thank you!

