Published by Isaac Andrews. Modified over 9 years ago.
1
Studying the Presence of Genetically Modified Variants in Organic Oilseed Rape by using Relational Data Mining
Aneta Ivanovska (1), Celine Vens (2), Sašo Džeroski (1), Nathalie Colbach (3)
(1) Department of Knowledge Technologies, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia. Emails: aneta.ivanovska@ijs.si, saso.dzeroski@ijs.si. Tel: +386 1 477 3144 (Aneta Ivanovska)
(2) Department of Computer Science, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium. Email: celine.vens@cs.kuleuven.be
(3) UMR1210, Biologie et Gestion des Adventices, INRA, 21000 Dijon, France. Email: colbach@dijon.inra.fr
2
13/09/2007, EnviroInfo 2007 - Warsaw
The GM problem
- Genetically Modified (GM) crops
  - First introduced for commercial production in 1996
  - Herbicide tolerant
  - Pest-resistant
- Concern: GM crops mixing with conventional or organic crops of the same species
3
The GM problem (2)
- Computer simulation model GENESYS
  - Estimates the rate of adventitious presence of GM varieties in non-GM crops
  - Ranks the cropping systems according to their probability of gene flow between GM and non-GM oilseed rape (OSR)
4
Motivation
- Predict the contamination of a field with GM material
- The dataset produced by GENESYS was previously analyzed using propositional data mining techniques (Ivanovska et al., 2006)
- Assumption: contamination of a field with GM seeds mostly depends on the cropping techniques and crops grown in the surrounding fields
  - Exploit neighborhood relations in the predictive model
  - Create a relational representation of the problem
- In this study: investigate the use of relational data mining to analyze the dataset produced by GENESYS, using the relational data mining system TILDE
5
[Diagram of the GENESYS model (Colbach et al., 2001)]
- Field plan
- Crop succession
- Crop management
- Rape varieties
- For each field and year: plants, seedbank, seeds produced (number per m², genotypic proportions)
6
Materials and methods: the dataset
- Output from GENESYS
- Large-risk field plan: maximizes the pollen and seed input into the central field
- Focus of the analysis: predict the rate of adventitious presence in the central field of a large-risk field plan
7
Materials and methods: the dataset (2)
- 100 000 simulations for 25 years
- Attributes:
  - Geometry of the region (field plan)
  - Genetic variables
  - For each field and year: crops and management techniques
- Full details kept only for the last 4 years
8
Materials and methods: relational data mining, relational classification trees
- Propositional data mining techniques
  - Single table
  - Popular DM techniques: classification and regression decision trees
- Relational data mining techniques
  - Multiple tables
  - Relations between them
  - Relational classification or regression trees
9
Materials and methods: relational data mining, relational classification trees (2)
- Data scattered over multiple relations (or tables):
  - can be analyzed by conventional data mining techniques after transforming it into a single propositional table (attribute-value representation): propositionalization
  - a multi-relational approach takes into account the structure of the original data
- Data represented in terms of relations: target(Field1, contaminated)
- Background knowledge is also given
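Propositionalization, as described on this slide, can be illustrated with a small Python sketch. The facts, the field names and the chosen aggregate columns are all invented for illustration (the relation names are borrowed from later slides), so this shows the general idea only, not the transformation used in the study.

```python
# Toy relational facts, invented for illustration only.
FIELD_DATA_YEAR = [
    # (field, year, crop, sowing_date)
    ("f1", 1, "non-gm-OSR", 240),
    ("f2", 1, "gm-OSR", 245),
    ("f3", 1, "wheat", 280),
]
NEIGHBOUR = [
    # (field1, field2, neighbour_type)
    ("f1", "f2", "noborder"),
    ("f1", "f3", "border"),
]

def propositionalize(target, year):
    """Flatten the facts around `target` into one attribute-value row."""
    crop_of = {(f, y): crop for f, y, crop, _sowing in FIELD_DATA_YEAR}
    neighbours = [f2 for f1, f2, _type in NEIGHBOUR if f1 == target]
    return {
        "field": target,
        "crop": crop_of.get((target, year)),
        "n_neighbours": len(neighbours),
        # Aggregate over the neighbour relation: one number per target field.
        "n_gm_osr_neighbours": sum(
            crop_of.get((f, year)) == "gm-OSR" for f in neighbours
        ),
    }

row = propositionalize("f1", 1)
print(row)
```

The resulting row fits in a single table, but the aggregate no longer records which neighbour grew GM oilseed rape or how it borders the target; that structure is what a multi-relational approach keeps.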
10
Materials and methods: relational data mining, relational classification trees (3)
- Relational vs. propositional classification trees, similarities:
  - predict the value of a dependent variable (class) from the values of a set of independent variables (attributes)
  - each inner node contains a test that compares the value of a certain attribute with a constant
  - leaf nodes give a classification that applies to all instances that reach the leaf
- Relational vs. propositional classification trees, differences:
  - Propositional trees: tests in the inner nodes compare the value of a variable (a property of the object) to a value
  - Relational trees: tests can also refer to background knowledge relations or tables
11
An example of a relational classification tree
[Tree diagram; node tests and labels in the order extracted]
- Test: targetField(FieldA) and fieldDataYear(FieldA, 0, Crop, SowingDate), SowingDate < 252
- Test: fieldDataYear(FieldA, 0, Crop, SowingDate), SowingDate < 233
- Test: neighbour(FieldA, FieldB, noborder) and fieldDataYear(FieldB, 1, gm-OSR, SowingDate)
- Leaf labels: NEG, NEG, POS
- Branch labels: yes, no
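The neighbour test in this tree is existential: it succeeds when at least one FieldB is a noborder neighbour of FieldA and grew gm-OSR in year 1. A minimal Python sketch of checking such a test against a set of facts follows. The facts and the helper name `has_gm_osr_neighbour` are invented for illustration, and the sketch makes no claim about how TILDE evaluates its tests internally.

```python
# Toy facts, invented for illustration only.
# neighbour(Field1, Field2, NeighType)
NEIGHBOUR = {
    ("a", "b", "noborder"),
    ("a", "c", "border"),
}
# fieldDataYear(Field, Year, Crop, SowingDate)
FIELD_DATA_YEAR = {
    ("a", 0, "wheat", 240),
    ("b", 1, "gm-OSR", 236),
    ("c", 1, "wheat", 250),
}

def has_gm_osr_neighbour(field, year, neigh_type="noborder"):
    """True if some neighbour of `field` (of the given type) grew gm-OSR in `year`."""
    return any(
        f1 == field
        and n_type == neigh_type
        and any(
            f == f2 and y == year and crop == "gm-OSR"
            for f, y, crop, _sowing in FIELD_DATA_YEAR
        )
        for f1, f2, n_type in NEIGHBOUR
    )

result_a = has_gm_osr_neighbour("a", 1)
result_border = has_gm_osr_neighbour("a", 1, neigh_type="border")
result_b = has_gm_osr_neighbour("b", 1)
print(result_a, result_border, result_b)
```

A propositional test could only inspect attributes of field "a" itself; here the outcome depends on facts about a different field reached through the neighbour relation.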
12
Experiments and results
- Representation of the data:
  - Target relation (data label): rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres)
  - Background relations:
    - fieldDataYear(SimulationID, FieldID, Year, CultivationTechniques)
    - lastOSR(SimulationID, FieldID, LastGM, LastNonGM)
    - neighbour(Field1ID, Field2ID, NeighType)
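One way to hold these relations in memory is sketched below in Python. The field types, the contents of `cultivation_techniques` and all example values are assumptions made for illustration; only the relation and argument names come from the slide. Grouping by SimulationID is one possible organization in which each simulation forms one learning example.

```python
from collections import defaultdict
from typing import NamedTuple

class RateOfAdvPresenceInField(NamedTuple):  # target relation
    simulation_id: int
    field_id: str
    rate_adv_pres: float

class FieldDataYear(NamedTuple):
    simulation_id: int
    field_id: str
    year: int
    cultivation_techniques: tuple  # assumed here to be (crop, sowing_date)

class LastOSR(NamedTuple):
    simulation_id: int
    field_id: str
    last_gm: int
    last_non_gm: int

class Neighbour(NamedTuple):  # carries no SimulationID on the slide
    field1_id: str
    field2_id: str
    neigh_type: str

def group_by_simulation(facts):
    """Collect facts that carry a simulation_id into one bucket per simulation."""
    grouped = defaultdict(list)
    for fact in facts:
        grouped[fact.simulation_id].append(fact)
    return dict(grouped)

# Invented example facts.
facts = [
    FieldDataYear(1, "centre", 25, ("non-gm-OSR", 240)),
    LastOSR(1, "centre", 3, 1),
    RateOfAdvPresenceInField(1, "centre", 0.4),
    FieldDataYear(2, "centre", 25, ("wheat", 280)),
]
field_plan = [Neighbour("centre", "north", "noborder")]
by_simulation = group_by_simulation(facts)
print(sorted(by_simulation))
```

The neighbour facts are kept in a separate list shared by all simulations, since the relation has no SimulationID argument.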
13
Experiments and results (2)
- Target attribute discretized at 0.9%
- Experimental settings:
  - Propositional: besides the target relation rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres), only (propositional) data for the target field is included (no relations among the fields are used), i.e., the following predicates are used:
    - fieldDataYear(FieldID, Year, Crop, SowingDate), for the target field
    - lastOSR(FieldID, LastGM, LastNonGM), for the target field
  - Neighbor: the same relations are used as in the Propositional setting, but other fields are now introduced via the neighbour relation, starting at the target field:
    - neighbour(Field1ID, Field2ID, NeighType)
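The discretization of the target can be sketched in Python as below. The slide gives only the 0.9% value; which side is labelled positive and whether the boundary itself counts as positive are assumptions here, as are the names `THRESHOLD_PERCENT` and `discretize`.

```python
THRESHOLD_PERCENT = 0.9  # value taken from the slide

def discretize(rate_adv_pres_percent):
    """Map a rate of adventitious presence (in percent) to a class label.

    Assumption: rates strictly above the threshold are 'pos', all others 'neg'.
    """
    return "pos" if rate_adv_pres_percent > THRESHOLD_PERCENT else "neg"

# Invented example rates, in percent.
labels = [discretize(r) for r in (0.3, 0.9, 1.2)]
print(labels)
```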
14
Experiments and results (3)
- TILDE’s experimental results (3-fold cross-validation):

               PROPOSITIONAL   NEIGHBOR
  TREE SIZE    15              13
  ACCURACY     78.35%          79.66%

- Example of a rule from the Propositional experiments:
  contamination([neg]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate < 252, lastOSR(T, Gm, NonGm), Gm < 20
- Example of a rule from the Neighbor experiments:
  contamination([pos]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate < 252, neighbour(T, FieldA, noborder), fieldDataYear(FieldA, 24, gm-OSR, SowingDate)
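The 3-fold cross-validation mentioned on this slide splits the examples into three parts and uses each part once for testing while training on the other two. A minimal pure-Python sketch of such a split follows. The function name `k_fold_indices` is hypothetical, the folds are contiguous and unshuffled for simplicity, and nothing here reflects how the folds were actually drawn in the study.

```python
def k_fold_indices(n_examples, k=3):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    # The first (n_examples % k) folds get one extra example.
    fold_sizes = [
        n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_examples) if i < start or i >= stop]
        yield train, test
        start = stop

folds = list(k_fold_indices(10, k=3))
for train, test in folds:
    print(len(train), len(test))
```

Each example lands in exactly one test fold, so an accuracy averaged over the folds is computed on data not used to build the corresponding tree.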
15
Conclusions
- Use of relational data mining for analyzing the output of the complex simulation model GENESYS
- Predict the contamination of the central field of a large-risk field plan
- Built relational classification trees with the first-order decision tree learner TILDE
16
Conclusions (2)
- Propositional and relational trees
  - Relational experiments: slightly better
  - Due to using a fixed field plan and a fixed target field
- Further work:
  - Performing more experiments with GENESYS
    - Different field plans
    - Different target fields
  - Analyze the results of other simulation models
17
Acknowledgement
- SIGMEA (Sustainable Introduction of Genetically Modified organisms into European Agriculture)