Published by Laura Shields. Modified over 6 years ago.
1
PKDD Discovery Challenge (not only) on Financial Data
Petr Berka, Laboratory for Intelligent Systems, University of Economics, Prague
2
Cups, Challenges, Competitions
- KDD Cups (since 1997)
- KDD Sisyphus at ECML 1998
- PKDD Discovery Challenges (since 1999)
- COIL Competition 2000
- PAKDD Challenge 2000
- PT Challenge 2000, 2001
- JSAI KDD Challenge 2001
- EUNITE Competition 2001, 2002
- . . .
Bold type: events I participated in or was involved in organizing.
DMLL Workshop, ICML 2002. Petr Berka, LISp, 2002
3
PKDD Discovery Challenge Idea
Realistic data mining conditions
- collaborative rather than competitive nature
- rather vague specification of the problem
Differences from real KDD projects
- short time for analysis (2-3 months)
- only indirect access to domain and data experts during the KDD process
The idea originated with Jan Zytkow, who suggested organizing an event at PKDD99 in Prague that would differ somewhat from the KDD Cups. An ideal contribution should describe the goals of the challenge (in business terms), the method used (preprocessing, data mining) and the results achieved. Two to three months is enough to build a model that classifies a single table, but too short to understand a complex domain and the data collected.
4
Challenge Settings
- Data and their full description available on the web for all participants
- Submissions evaluated by domain experts (but no ordering, no winners and losers)
- Workshop at PKDD to present the results and discuss them with domain experts
- Results and comments of experts available on the web (after the workshop)
Web-based access: registration serves only to keep track of interested people.
5
PKDD Challenges http://lisp.vse.cz/challenge
- 1999, Prague: financial data, thrombosis data
- 2000, Lyon: financial data, modified thrombosis data
- 2001, Freiburg: modified thrombosis data
- 2002, Helsinki: atherosclerosis data, hepatitis data
I'd like to acknowledge the contribution of Shusaku Tsumoto, who provided the challenge with the medical (thrombosis, hepatitis) data.
6
Financial Challenge Background
Czech bank offering private accounts
Available data for pilot study (29000 clients)
- personal characteristics
- basic info about accounts
- transactions for three months
Proposed tasks
- segmentation (defining different types of clients w.r.t. debt)
- early detection of debts
7
Financial Challenge Data
8
Contributions
Method oriented
- show a method/system working on the data
Problem oriented (prototype solutions)
- loan and/or credit cards description
- loan and/or credit cards classification
- initial exploration
- relation between branches
- clients segmentation
9
Description of loans
Relations between loan category and account characteristics
- [Coufal et al, GUHA]
- [Mikšovský et al, EXCEL]
10
Classification of loans
Detecting risky clients before they are granted a loan [Mikšovský et al, C5.0]
- decision tree to find the relevance of attributes
- decision tree for classification (using misclassification costs)
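To illustrate what "using misclassification costs" can mean in practice, here is a minimal sketch of cost-sensitive labelling at a single tree leaf. It is not the C5.0 procedure used in the contribution; the class names, the cost values and the function `min_cost_class` are assumptions made for this example only.

```python
# Hypothetical cost matrix: COSTS[actual][predicted]. Values are illustrative only.
COSTS = {
    "good": {"good": 0.0, "bad": 1.0},  # turning away a reliable client
    "bad": {"good": 5.0, "bad": 0.0},   # granting a loan to a risky client
}


def min_cost_class(class_probs, costs=COSTS):
    """Return the predicted class with the lowest expected cost."""

    def expected_cost(predicted):
        return sum(p * costs[actual][predicted] for actual, p in class_probs.items())

    return min(costs, key=expected_cost)


# A leaf where 80% of the training clients were reliable.
leaf = {"good": 0.8, "bad": 0.2}
print(min_cost_class(leaf))
```

With these assumed costs the leaf is labelled "bad" even though most of its clients are reliable, because a wrongly granted loan is weighted five times more heavily than a wrongly refused one.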
11
Credit Cards Promotion
Description: find characteristics of a card holder
- deviation detection
Classification: predict score for "card value"
- k-nearest neighbour [Putten, 1999]
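As a rough illustration of k-nearest-neighbour scoring, the sketch below rates a client by the share of card holders among the most similar labelled clients. The features, the data and the function `knn_score` are hypothetical; the slide does not say how [Putten, 1999] defined the score.

```python
import math


def knn_score(query, labelled, k=3):
    """Fraction of card holders among the k nearest labelled clients."""
    nearest = sorted(labelled, key=lambda item: math.dist(query, item[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Hypothetical, already normalised features:
# (average balance, transactions per month); label 1 = card holder.
clients = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.7), 1),
    ((0.2, 0.1), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0),
]

print(knn_score((0.85, 0.75), clients))  # query lies among the card holders
print(knn_score((0.15, 0.20), clients))  # query lies among the non-holders
```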
12
Clients Segmentation
Description: segmentation of clients according to transactions [Hotho, Maedche, 2000]
- Kohonen map
- decision trees
Rule #1 for Cluster 3
If ATTR5 > 9945 and ATTR13 > 0 Then -> Cluster 3 (115, 0.983)
13
Challenge Organizing Lessons
- Getting and preparing real data is difficult
- The time for analysis should be as long as possible
- The response rate was rather low (~10%)
- No synergy effect was observed
14
DM Lessons (1/4)
Cooperate with experts
- domain experts
- data experts
- . . .
… and with users
15
DM Lessons (2/4)
Use knowledge-intensive preprocessing methods …
- compute age and sex from birth_number
- set flags for different types of operations
- compute monthly characteristics of transactions (sum, avg, min, max)
- $\mathit{lbalance} = \frac{1}{30} \sum_i \mathit{balance}(i) \cdot \mathit{days}(i)$
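The preprocessing steps listed on this slide can be sketched in a few lines. The sketch assumes the birth_number encoding of the challenge's financial data (YYMMDD, with 50 added to the month for women) and births in the 1900s; the function names and the sample values are hypothetical.

```python
from datetime import date


def decode_birth_number(birth_number):
    """Decode YYMMDD; a month above 50 marks a woman (assumed encoding)."""
    number = int(birth_number)
    yy, mm, dd = number // 10000, (number // 100) % 100, number % 100
    sex = "F" if mm > 50 else "M"
    if mm > 50:
        mm -= 50
    return date(1900 + yy, mm, dd), sex  # assumes births in the 1900s


def age_on(birth_date, reference):
    """Age in completed years on the reference date."""
    before_birthday = (reference.month, reference.day) < (birth_date.month, birth_date.day)
    return reference.year - birth_date.year - before_birthday


def monthly_stats(amounts):
    """Sum, average, minimum and maximum of one month's transaction amounts."""
    return {
        "sum": sum(amounts),
        "avg": sum(amounts) / len(amounts),
        "min": min(amounts),
        "max": max(amounts),
    }


def average_balance(segments, period_days=30):
    """Time-weighted balance: (1 / period_days) * sum of balance(i) * days(i)."""
    return sum(balance * days for balance, days in segments) / period_days


birth_date, sex = decode_birth_number(706213)
print(birth_date, sex, age_on(birth_date, date(1999, 1, 1)))
print(monthly_stats([100, 300, 200]))
print(average_balance([(1000, 10), (4000, 20)]))
```

The reference date passed to `age_on` is a free choice; the end of the observed transaction period would be a natural candidate.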
16
DM Lessons (3/4)
Make the results understandable [Werner, Fogarty 2001]
17
DM Lessons (4/4)
Show some (even preliminary) results soon
- experts are interested in solutions, not in applying sophisticated methods
18
Discovery Challenge Benefits
Experts
- deeper insight into the data
Participants
- experience with analyzing large real data
- motivation for further research
ML/KDD community
- prototype tasks/solutions (like the MiningMart project?)
Organizers
- … invitation to DMLL Workshop :-)
19
Thank You
20
Contributions