Development in the Ferda project December 2006 Martin Ralbovský.



Content
- History
- Changes in the 2.0 version, improved GUHA abilities
- Background knowledge and ontologies
- Further academic development

Ferda project history I
- Ferda: successor of the LISp-Miner data mining system, a visual and modular environment
- Software project at MFF UK
- KEG
  - Introduction of the system
  - Description of parts of the working environment
  - Implementation principles
- Znalosti 2006 article
- KEG
  - State of development in May 2006
  - Master thesis themes discussed

Ferda project history II
Development since May 2006
- “Experimental GUHA Procedures” by Tomáš Kuchař completed
- “Usage of Domain Knowledge for Applications of GUHA Procedures” by Martin Ralbovský completed
- Further development and testing

Available versions of Ferda
- Version 1.0 (1.1): approved MFF project version (plus improvements)
  - (Almost) a copy of the LISp-Miner system in terms of GUHA abilities
  - Dependent on the LISp-Miner hypotheses generation engine
- Version 2.0: based on the master thesis of Tomáš Kuchař
  - Ferda no longer dependent on the LISp-Miner system
  - Improved GUHA abilities (data source, definition of relevant questions, …)

Improved GUHA abilities theoretically I
Definition of a large set of relevant questions (original):
- If A is an attribute and α is a non-empty subset of the categories of A, then A(α) is a basic Boolean attribute
- Each basic Boolean attribute is a Boolean attribute
- If φ and ψ are Boolean attributes, then φ ∧ ψ, φ ∨ ψ and ¬φ are Boolean attributes
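The recursive definition above maps naturally onto a small expression tree. The following is a minimal sketch, not Ferda's implementation: the class names (`Basic`, `Not`, `And`, `Or`), the function `holds`, and the representation of a data row as a dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Union

# A data row maps attribute names to category labels (assumed representation).
Row = Mapping[str, str]


@dataclass(frozen=True)
class Basic:
    """Basic Boolean attribute A(alpha): attribute A takes a category in alpha."""
    attribute: str
    categories: frozenset

    def __post_init__(self):
        # The definition requires alpha to be a non-empty subset of categories.
        if not self.categories:
            raise ValueError("category subset must be non-empty")


@dataclass(frozen=True)
class Not:
    operand: "BoolAttr"


@dataclass(frozen=True)
class And:
    left: "BoolAttr"
    right: "BoolAttr"


@dataclass(frozen=True)
class Or:
    left: "BoolAttr"
    right: "BoolAttr"


BoolAttr = Union[Basic, Not, And, Or]


def holds(expr: BoolAttr, row: Row) -> bool:
    """Evaluate a Boolean attribute on one data row by structural recursion."""
    if isinstance(expr, Basic):
        return row[expr.attribute] in expr.categories
    if isinstance(expr, Not):
        return not holds(expr.operand, row)
    if isinstance(expr, And):
        return holds(expr.left, row) and holds(expr.right, row)
    if isinstance(expr, Or):
        return holds(expr.left, row) or holds(expr.right, row)
    raise TypeError(f"not a Boolean attribute: {expr!r}")
```

For example, `And(Basic("education", frozenset({"university"})), Not(Basic("beverage", frozenset({"beer"}))))` holds on a row where `education` is `university` and `beverage` is `wine`. The attribute and category names here are invented.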

Improved GUHA abilities theoretically II
Definition of a large set of relevant questions in LISp-Miner (and Ferda 1.0)
- Literal ~ basic Boolean attribute or its negation
- A literal can be basic or remaining
  - Basic: each partial cedent has to contain at least one basic literal
  - Remaining: the opposite
- Partial cedent ~ conjunction of literals
- Cedent ~ conjunction of partial cedents

Improved GUHA abilities theoretically III
Definition of a large set of relevant questions in Ferda 2.0
- Ferda 2.0 fully supports the original definition; the user can use conjunction, disjunction and negation multiple times
- A basic Boolean attribute can be
  - Basic: the same meaning
  - Forced: must be present in every relevant question
  - Auxiliary: a conjunction or disjunction cannot be formed only from auxiliary Boolean attributes (there must be a basic or forced attribute)
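One way to read the three roles is as a filter on which combinations of basic Boolean attributes may form a relevant question. The sketch below follows that reading under stated assumptions: the names `Role`, `is_admissible` and `admissible_sets` are hypothetical, and the actual generation in Ferda 2.0 may work differently.

```python
from enum import Enum
from itertools import combinations


class Role(Enum):
    BASIC = "basic"
    FORCED = "forced"
    AUXILIARY = "auxiliary"


def is_admissible(chosen, roles):
    """Check one combination of attribute names against the role rules.

    Assumed reading of the slide: every forced attribute must be present,
    and the combination must not consist of auxiliary attributes only.
    """
    forced = {name for name, role in roles.items() if role is Role.FORCED}
    if not forced <= set(chosen):
        return False
    return any(roles[name] is not Role.AUXILIARY for name in chosen)


def admissible_sets(roles, max_len):
    """Yield admissible combinations of up to max_len attributes."""
    names = sorted(roles)
    for size in range(1, max_len + 1):
        for combo in combinations(names, size):
            if is_admissible(combo, roles):
                yield combo
```

With the invented roles `{"age": Role.BASIC, "sex": Role.FORCED, "district": Role.AUXILIARY}` and `max_len=2`, only combinations containing `sex` are produced.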

Improved GUHA abilities practically 4FT – Ferda 1.0

Improved GUHA abilities practically 4FT – Ferda 2.0

Improved GUHA abilities practically KL – Ferda 1.0

Improved GUHA abilities practically KL – Ferda 2.0

Ferda 2.0 versus LISp-Miner
We compare only the hypotheses generation engines, not the whole systems.
- Running time of procedures
  - 4FT: approximately equal
  - KL: faster in Ferda 2.0
  - CF: faster in Ferda 2.0
  - SD procedures: much faster in LISp-Miner (no jump optimizations)
- Some quantifiers are not implemented in Ferda 2.0 (but are easy to implement)
- LISp-Miner is better tested

Background knowledge I – introduction
- Background knowledge is a vague term for knowledge supplied by domain experts to aid KDD.
- There is no central definition or theory; different authors use the term differently.
- The definition for GUHA mining: a set of various verbal rules that are accepted in a specific domain as common knowledge.
- Background knowledge can be used as an effective means of communication between the domain expert and the data miner.
- Usage of background knowledge in GUHA is described in the master thesis of Martin Ralbovský (and elsewhere).

Background knowledge II - examples
Sociomedical domain:
- If education increases, wine consumption increases as well
- Patients with greater responsibility at work tend to drive to work by car
Beer marketing domain:
- Younger consumers prefer draught beer
- Older consumers prefer beer in bottles
- More expensive brands sell better during holidays

Background knowledge III – preferred usage
[Diagram: Domain expert (knowledge about the domain) and Data miner (data mining techniques and interpretation knowledge). Labels: specification of interesting facts to the domain expert; rules can be transformed into mining tasks; task results; soundness of DM techniques.]

Background knowledge IV – in Ferda
- A formalization of background knowledge rules, sound for GUHA purposes, was created
- Modules of the Ferda system (version 1.1) were implemented to validate background knowledge rules
- Experiments were carried out to find the presence of background knowledge rules in the data with the GUHA procedures 4FT and KL
- So far the results are rather disappointing

Background knowledge V - experiment
Assumptions:
- Background knowledge rules are somehow stored in the data
- Data collection and attribute creation were done without mistakes
Question: Can the rules be found in the data with “our” techniques?
Experiment: 8 background knowledge rules tested with the 4FT and KL procedures

Background knowledge VI - results
- Founded Implication with default values (base = 0.05, p = 0.95): 1 of 8 rules approved
- Above Average with default values (base = 0.05, p = 1.2): 1 of 8 rules approved
- Modifications of Kendall: 2 of 6 rules approved
- Furthermore, the quantifiers showed strange results (4 of 8 Founded Implication results had p below 0.4)
- How good are our quantifiers?
- Bigger experiments are planned for the future
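To make the two 4FT quantifiers concrete, here is a minimal sketch over a four-fold contingency table with frequencies a, b, c, d (a counts rows satisfying both antecedent and succedent, b antecedent only, c succedent only, d neither). Two points are assumptions rather than statements about Ferda: `base` is treated as a relative frequency a/n, and Above Average is taken as a/(a+b) ≥ p · (a+c)/n, which fits the default p = 1.2. Other GUHA texts state Above Average with a factor (1 + p) instead. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourFoldTable:
    a: int  # antecedent and succedent
    b: int  # antecedent, not succedent
    c: int  # not antecedent, succedent
    d: int  # neither

    @property
    def n(self) -> int:
        return self.a + self.b + self.c + self.d


def founded_implication(t: FourFoldTable, p: float = 0.95, base: float = 0.05) -> bool:
    """a/(a+b) >= p with enough support; base is ASSUMED to be relative (a/n)."""
    if t.a + t.b == 0:
        return False
    return t.a / t.n >= base and t.a / (t.a + t.b) >= p


def above_average(t: FourFoldTable, p: float = 1.2, base: float = 0.05) -> bool:
    """Confidence is at least p times the overall succedent frequency (ASSUMED form)."""
    if t.a + t.b == 0 or t.a + t.c == 0:
        return False
    confidence = t.a / (t.a + t.b)
    overall = (t.a + t.c) / t.n
    return t.a / t.n >= base and confidence >= p * overall
```

For an invented table with a = 96, b = 4, c = 300, d = 600, the confidence is 0.96 and the overall succedent frequency is 0.396, so both quantifiers are satisfied with the default values.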

Ontologies I – introduction
- Past attempts to enhance GUHA mining with domain ontologies (also presented at KEG)
  - Data understanding
  - Attribute creation
  - Decomposition of tasks
  - Task creation
- Ralbovský’s master thesis is the first work to examine automatic processing of domain ontologies
- Deep analysis, but no tools implemented

Ontologies II – problems
Technical problems… not so bad
Conceptual problems
- Ontologies express knowledge on a very general level
- For GUHA mining, we need specific knowledge that usually is not present in ontologies
Example: for attribute creation we need
- Maximum and minimum values
- Extreme values
- Significant values dividing the domain
- Typical values (for nominal domains)
Solution: probably specific ontologies for GUHA mining
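The attribute-creation example can be sketched as follows: given a minimum, a maximum and the significant dividing values for a numeric domain, build interval categories and assign values to them. This only illustrates why such metadata is needed. The function names and the convention of left-closed intervals are assumptions, and nothing here reflects an actual Ferda module.

```python
from bisect import bisect_right


def make_intervals(minimum, maximum, dividing_values):
    """Split [minimum, maximum] at the dividing values that lie strictly inside it."""
    cuts = sorted(v for v in dividing_values if minimum < v < maximum)
    bounds = [minimum, *cuts, maximum]
    return list(zip(bounds[:-1], bounds[1:]))


def categorize(value, intervals):
    """Return the interval containing value, or None for out-of-domain values.

    Intervals are treated as left-closed; the last one also includes the maximum.
    """
    lowest, highest = intervals[0][0], intervals[-1][1]
    if not lowest <= value <= highest:
        return None  # candidate extreme value, outside the declared domain
    starts = [start for start, _ in intervals]
    return intervals[bisect_right(starts, value) - 1]
```

With invented metadata `make_intervals(10, 60, [18.5, 25, 30])`, the value 27 falls into the interval (25, 30), and 75 is reported as outside the domain.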

Further academic development I
Alexander Kuzmin: “Relational GUHA procedures” master thesis
- Implementation of relational 4FT miner (and possibly others)
- Ferda 2.0, spring 2007
Daniel Kupka: “User support for 4ft-Miner procedure for data mining” master thesis
- Help scenarios depending on the settings of the 4FT task
- Complex and modular system
- Ferda 2.0, spring 2007

Further academic development II
Martin Zeman: “Using ontologies in GUHA procedures”
- Definition of GUHA ontologies
- Tools for ontology support
- Ferda 2.0, autumn 2006
Michal Kováč: “User oriented language for solving KDD tasks”
- Only Michal knows what this is about
- Ferda 2.0, autumn 2006

Thank you for your attention.