Download presentation
Presentation is loading. Please wait.
Published byBarrie Griffith Modified over 8 years ago
1
LIACS Data Mining course an introduction
2
Course Information Course website: http://datamining.liacs.nl/DaMi/ (will be updated periodically) Practical exercises
3
Course Schedule Start Sept 12 Last lecture Dec 12 (exam preparation) Next four weeks: 2 hours later, 15:45 – 17:30 more news later Leidens Ontzet Oct 3, no course Exam: Jan 9, 14:00 – 17:00
4
Course Textbook Data Mining Practical Machine Learning Tools and Techniques third edition, Morgan Kaufmann, ISBN 978-0-12-374856-0 by Ian Witten and Eibe Frank (EUR 45,95 at Amazon.de)
5
Introduction Data Mining an overview and some examples
6
Data Mining definitions Data Mining: the concept of extracting previously unknown and potentially useful, interesting knowledge from large sets of data secondary statistics: analyzing data that wasn ’ t originally collected for analysis
7
Data Mining, the big idea Organizations collect large amounts of data Often for administrative purposes Large body of experience Learning from experience Goals Prediction/forecasting Diagnostics Optimization …
8
2 Streams Mining for insight Understanding a domain Finding regularities between variables Interpretable models Examples: medicine, production, maintenance ‘ Black-box ’ Mining Don ’ t care how you do it, just do it well Optimization Examples: marketing, forecasting (financial, weather)
9
example: Direct Mail Optimize the response to a mailing, by targeting only those that are likely to respond: more response fewer letters Customer information response 3% response 3% Customer information test mailing final mailing response 30% response 30% remainder Data Mining customer model
10
example: Bioinformatics Find genes involved in disease (Parkinson ’ s, Celiac, Neuroblastoma) Measurements from patients (1) and controls (0) Gene expression: measurements of 20k genes dataset 20,001 x 100 Challenges many variables few examples (patients), testing is expensive interactions between genes
11
Data Mining paradigms Classification (binary) class variable predict class of future cases most popular paradigm Clustering divide dataset into groups of similar cases Regression numeric target variable Association find dependencies between variables basket analysis, …
12
Classification (decision tree) Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given). Rent Buy Other Age < 35 Age ≥ 35 Price < 200K Price ≥ 200K Yes No Yes No 0.2 0.4 0.07 0.1 0.64 0.51 0.25 0.01
13
Applying a classifier (decision tree) New customer: ( House = Rent, Age = 32, …) Rent Buy Other Age < 35 Age ≥ 35 Price < 200K Price ≥ 200K Yes No Yes No prediction = Yes
14
Classification Tree makes attribute dependencies explicit Class depends on House The influence of other attributes is less Dependencies are (often) fuzzy multiple attributes are needed Perfect predictions are rare Rent Buy Other Age < 35 Age ≥ 35 Price < 200K Price ≥ 200K Yes No Yes No
15
Graphical interpretation dataset with two attributes + 1 class (+/-) graphical interpretation of decision tree + + + + + + + + + ++ + - - - - - - - - 0 y x x < t x t y < t’ y t’
16
Graphical interpretation dataset with two attributes + 1 class (+/-) other classifiers + + + + + + + + + ++ + - - - - - - - - 0 y x Support Vector Machine Neural Network
17
Applications of DM Marketing outgoing incoming Bioinformatics & Medicine Fraud detection Risk management Insurance Enterprise resource planning
18
Break
19
Data Mining Applications
20
Training Data Speed Skating Speed skating team LottoNL-Jumbo Detailed historic data training details duration intensity competition results Finding patterns of effective training Visualise data Speed skating team LottoNL-Jumbo Detailed historic data training details duration intensity competition results Finding patterns of effective training Visualise data
21
Kjeld Nuis 178 races On average 2.89% above track record Specialises on 1000 m (2.1%) 2015-2016 Dutch champion 1000 m, 1500 m WC Distances: bronze 1000 m, silver 1500 m WC Sprint: ‘silver’ ISU World Cup: gold 1000 m, silver 1500 m
22
Total sum of load over last 5 days, morning sessions undesired result due to over-training advised upper limit
23
InfraWatch: monitoring of infrastructure Continuous monitoring of a large bridge ‘Hollandse Brug’ 145 sensors time-dependent, at frequencies up to 100Hz multi-modal (sensor, video, different freq.) managing large data quantities, >5 Gb per day
24
InfraWatch: monitoring of infrastructure 34 geo-phones (vibration sensors) 44 embedded strain-gauges, 47 gauges outside 20 thermometers video camera weather station
25
sensor mining
26
Maintenance planning at KLM Routine checks of aircraft Maintenance requires up to 10k different parts Ordering parts incurs delay (costs)… … but so does keeping stock In theory 10k individual predictions Input maintenance history flight history, Sahara/North Pole Only few parts predictable
27
Cashflow Online Online personal finance overview All bank transactions are loaded into the application transactions are classified into different categories Data Mining predicts category
28
67 Categories Gas Water Licht Onderhoud huis en tuin Telefoon + Internet + TV Contributie (sport-)verenigingen Levensverzekering / Lijfrente Rente ontvangen Boodschappen Hypotheekrente Naar spaarrekening Geldopname/chipknip Verzekeringen overig Loterijen Cadeau's Interne boeking Vakantie & Recreatie Uitgaan, hobby's en sport Creditcard Ziektekostenverzekering Brandstof Woonhuis / Opstalverzekering Huishouden overig School- en Studiekosten Inkomsten overig Kleding & Schoenen Lenen Openbaar vervoer/Taxi …
29
Fragmented results: Boodschappen (groceries) Contributie
30
Decision Tree over all categories true false
31
Data Mining at LIACS Applications bioinformatics (LUMC) Sports Analytics (LottoNL-Jumbo, PSV) Hollandse Brug (Strukton, TU Delft, RWS) fraud detection at Achmea health insurance and NZa ChartEx, medieval documents (English, Latin) Complex data graphical data (molecules) relational data (criminal careers) stream data (sensor-data, click-streams) …
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.