Presentation is loading. Please wait.

Presentation is loading. Please wait.

LIACS Data Mining course an introduction. Course Information  Course website: (will be updated periodically)  Practical.

Similar presentations


Presentation on theme: "LIACS Data Mining course an introduction. Course Information  Course website: (will be updated periodically)  Practical."— Presentation transcript:

1 LIACS Data Mining course an introduction

2 Course Information  Course website: http://datamining.liacs.nl/DaMi/ (will be updated periodically)  Practical exercises

3 Course Schedule  Start Sept 12  Last lecture Dec 12 (exam preparation)  Next four weeks: 2 hours later, 15:45 – 17:30  more news later  Leidens Ontzet Oct 3, no course  Exam: Jan 9, 14:00 – 17:00

4 Course Textbook Data Mining Practical Machine Learning Tools and Techniques third edition, Morgan Kaufmann, ISBN 978-0-12-374856-0 by Ian Witten and Eibe Frank (EUR 45,95 at Amazon.de)

5 Introduction Data Mining an overview and some examples

6 Data Mining definitions Data Mining: the concept of extracting previously unknown and potentially useful, interesting knowledge from large sets of data secondary statistics: analyzing data that wasn ’ t originally collected for analysis

7 Data Mining, the big idea  Organizations collect large amounts of data  Often for administrative purposes  Large body of experience  Learning from experience  Goals  Prediction/forecasting  Diagnostics  Optimization  …

8 2 Streams  Mining for insight  Understanding a domain  Finding regularities between variables  Interpretable models  Examples: medicine, production, maintenance  ‘ Black-box ’ Mining  Don ’ t care how you do it, just do it well  Optimization  Examples: marketing, forecasting (financial, weather)

9 example: Direct Mail Optimize the response to a mailing, by targeting only those that are likely to respond:  more response  fewer letters Customer information response 3% response 3% Customer information test mailing final mailing response 30% response 30% remainder Data Mining customer model

10 example: Bioinformatics  Find genes involved in disease (Parkinson ’ s, Celiac, Neuroblastoma)  Measurements from patients (1) and controls (0)  Gene expression: measurements of 20k genes  dataset 20,001 x 100  Challenges  many variables  few examples (patients), testing is expensive  interactions between genes

11 Data Mining paradigms  Classification  (binary) class variable  predict class of future cases  most popular paradigm  Clustering  divide dataset into groups of similar cases  Regression  numeric target variable  Association  find dependencies between variables  basket analysis, …

12 Classification (decision tree) Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given). Rent Buy Other Age < 35 Age ≥ 35 Price < 200K Price ≥ 200K Yes No Yes No 0.2 0.4 0.07 0.1 0.64 0.51 0.25 0.01

13 Applying a classifier (decision tree) New customer: ( House = Rent, Age = 32, …) Rent Buy Other Age < 35 Age ≥ 35 Price < 200K Price ≥ 200K Yes No Yes No prediction = Yes

14 Classification  Tree makes attribute dependencies explicit  Class depends on House  The influence of other attributes is less  Dependencies are (often) fuzzy  multiple attributes are needed  Perfect predictions are rare Rent Buy Other Age < 35 Age ≥ 35 Price < 200K Price ≥ 200K Yes No Yes No

15 Graphical interpretation  dataset with two attributes + 1 class (+/-)  graphical interpretation of decision tree + + + + + + + + + ++ + - - - - - - - - 0 y x x < t x  t y < t’ y  t’

16 Graphical interpretation  dataset with two attributes + 1 class (+/-)  other classifiers + + + + + + + + + ++ + - - - - - - - - 0 y x Support Vector Machine Neural Network

17 Applications of DM  Marketing  outgoing  incoming  Bioinformatics & Medicine  Fraud detection  Risk management  Insurance  Enterprise resource planning

18 Break

19 Data Mining Applications

20 Training Data Speed Skating  Speed skating team LottoNL-Jumbo  Detailed historic data  training details duration intensity  competition results  Finding patterns of effective training  Visualise data  Speed skating team LottoNL-Jumbo  Detailed historic data  training details duration intensity  competition results  Finding patterns of effective training  Visualise data

21 Kjeld Nuis  178 races  On average 2.89% above track record  Specialises on 1000 m (2.1%)  2015-2016  Dutch champion 1000 m, 1500 m  WC Distances: bronze 1000 m, silver 1500 m  WC Sprint: ‘silver’  ISU World Cup: gold 1000 m, silver 1500 m

22 Total sum of load over last 5 days, morning sessions undesired result due to over-training advised upper limit

23 InfraWatch: monitoring of infrastructure Continuous monitoring of a large bridge ‘Hollandse Brug’  145 sensors  time-dependent, at frequencies up to 100Hz  multi-modal (sensor, video, different freq.)  managing large data quantities, >5 Gb per day

24 InfraWatch: monitoring of infrastructure  34 geo-phones (vibration sensors)  44 embedded strain-gauges, 47 gauges outside  20 thermometers  video camera  weather station

25 sensor mining

26 Maintenance planning at KLM  Routine checks of aircraft  Maintenance requires up to 10k different parts  Ordering parts incurs delay (costs)…  … but so does keeping stock  In theory 10k individual predictions  Input  maintenance history  flight history, Sahara/North Pole  Only few parts predictable

27 Cashflow Online  Online personal finance overview  All bank transactions are loaded into the application  transactions are classified into different categories  Data Mining predicts category

28 67 Categories Gas Water Licht Onderhoud huis en tuin Telefoon + Internet + TV Contributie (sport-)verenigingen Levensverzekering / Lijfrente Rente ontvangen Boodschappen Hypotheekrente Naar spaarrekening Geldopname/chipknip Verzekeringen overig Loterijen Cadeau's Interne boeking Vakantie & Recreatie Uitgaan, hobby's en sport Creditcard Ziektekostenverzekering Brandstof Woonhuis / Opstalverzekering Huishouden overig School- en Studiekosten Inkomsten overig Kleding & Schoenen Lenen Openbaar vervoer/Taxi …

29 Fragmented results: Boodschappen (groceries) Contributie

30 Decision Tree over all categories true false

31 Data Mining at LIACS  Applications  bioinformatics (LUMC)  Sports Analytics (LottoNL-Jumbo, PSV)  Hollandse Brug (Strukton, TU Delft, RWS)  fraud detection at Achmea health insurance and NZa  ChartEx, medieval documents (English, Latin)  Complex data  graphical data (molecules)  relational data (criminal careers)  stream data (sensor-data, click-streams)  …


Download ppt "LIACS Data Mining course an introduction. Course Information  Course website: (will be updated periodically)  Practical."

Similar presentations


Ads by Google