DATA MINING: Extracting Knowledge From Data
Petr Olmer, CERN
petr.olmer@cern.ch

Motivation
● What if we do not know what to ask?
● How do we discover knowledge in databases without a specific query?
"Computers are useless, they can only give you answers."

Many terms, one meaning
● Data mining
● Knowledge discovery in databases
● Data exploration
● A non-trivial extraction of novel, implicit, and actionable knowledge from large databases
  – without a specific hypothesis in mind!
● Techniques for discovering structural patterns in data

What is inside?
● Databases
  – data warehousing
● Statistics
  – methods – but a different data source!
● Machine learning
  – output representations
  – algorithms

CRISP-DM: CRoss Industry Standard Process for Data Mining
A cyclic process: task specification → data understanding → data preparation → modeling → evaluation → deployment
http://www.crisp-dm.org

Input data: Instances, attributes

A    B   C  D
Mon  21  a  yes
Wed  19  b  yes
Mon  23  a  no
Sun  23  d  yes
Fri  24  c  no
Fri  18  d  yes
Sat  20  c  no
Tue  21  b  no
Mon  25  b  no

example: input data

outlook   temp.  humidity  windy  play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no

Output data: Concepts
● Concept description = what is to be learned
● Classification learning
● Association learning
● Clustering
● Numeric prediction

Task classes
● Predictive tasks
  – Predict an unknown value of the output attribute for a new instance.
● Descriptive tasks
  – Describe structures or relations of attributes.
  – Instances are not related!

Models and algorithms
● Decision trees
● Classification rules
● Association rules
● k-nearest neighbors
● Cluster analysis

Decision trees
Inner nodes test a particular attribute against a constant; leaf nodes classify all instances that reach the leaf.
Example tree:
  a >= 5           → class C1
  a < 5, b = blue:
    c > 0          → class C1
    c <= 0         → class C2
  a < 5, b = red:
    c = hot        → class C2
    c = cold       → class C1

Classification rules
If precondition then conclusion
An alternative to decision trees.
Rules can be read off a decision tree
– one rule for each leaf
– unambiguous, not ordered
– more complex than necessary
If (a >= 5) then class C1
If (a < 5) and (b = "blue") and (c > 0) then class C1
If (a < 5) and (b = "red") and (c = "hot") then class C2

Classification rules: Ordered or not ordered execution?
● Ordered
  – rules out of context can be incorrect
  – widely used
● Not ordered
  – different rules can lead to different conclusions
  – mostly used in boolean closed worlds
    ● only yes rules are given
    ● one rule in DNF

Decision trees / Classification rules: the 1R algorithm

for each attribute:
    for each value of that attribute:
        count how often each class appears
        find the most frequent class
        rule = assign that class to this attribute-value
    calculate the error rate of the attribute's rules
choose the attribute whose rules have the smallest error rate
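
A minimal Python sketch of 1R under the setup above; instances are dicts like the weather data, and the function name and representation are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, target="play"):
    """1R: one rule per attribute value (its most frequent class);
    keep the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)               # value -> class counts
        for row in instances:
            counts[row[attr]][row[target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best  # (error count, attribute, value -> class rules)
```

On the weather data this returns outlook with rules sunny→no, overcast→yes, rainy→yes and 4 errors out of 14, matching the next slide (the tie with humidity is broken by attribute order).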

example: 1R  (weather data as above)

outlook:  sunny → no (2/5), overcast → yes (0/4), rainy → yes (2/5); total errors 4/14
temp.:    hot → no* (2/4), mild → yes (2/6), cool → yes (1/4); total errors 5/14
humidity: high → no (3/7), normal → yes (1/7); total errors 4/14
windy:    false → yes (2/8), true → no* (3/6); total errors 5/14
(* tie, broken arbitrarily)

Decision trees / Classification rules: the Naïve Bayes algorithm
● Attributes are assumed to be
  – equally important
  – independent
● For a new instance, we compute the probability of each class.
● Assign the most probable class.
● We use the Laplace estimator in case of zero probability.
● Attribute dependencies reduce the power of NB.
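
A small Python sketch of this computation, assuming the same dict-of-attributes instances as before (names illustrative). With laplace=0 it reproduces the raw frequencies on the next slide; laplace=1 is the usual smoothing.

```python
from collections import Counter

def naive_bayes(instances, new, target="play", laplace=1):
    """Score each class as P(class) * prod over attributes of
    P(attribute = value | class), then normalise the scores."""
    classes = Counter(row[target] for row in instances)
    scores = {}
    for cls, n_cls in classes.items():
        score = n_cls / len(instances)              # prior P(class)
        for attr, value in new.items():
            matches = sum(1 for row in instances
                          if row[target] == cls and row[attr] == value)
            n_values = len({row[attr] for row in instances})
            score *= (matches + laplace) / (n_cls + laplace * n_values)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# e.g. naive_bayes(weather, {"outlook": "sunny", "temp": "cool",
#                            "humidity": "high", "windy": "true"}, laplace=0)
# -> roughly {"yes": 0.205, "no": 0.795}
```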

example: Naïve Bayes  (weather data as above)
New instance: sunny, cool, high, true → play?

yes: sunny 2/9, cool 3/9, high 3/9, true 3/9, overall 9/14 → 0.0053 → 20.5 %
no:  sunny 3/5, cool 1/5, high 4/5, true 3/5, overall 5/14 → 0.0206 → 79.5 %

Decision trees: ID3, a recursive algorithm
● Select the attribute with the biggest information gain to place at the root node.
● Make one branch for each possible value.
● Build the subtrees recursively.
● Information required to specify the class:
  – when a branch is empty: zero
  – when the branches are equal: a maximum
  – f(a, b, c) = f(a, b + c) + g(b, c)
● Entropy: entropy(p1, …, pn) = −p1 log2 p1 − … − pn log2 pn
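
A short Python sketch of entropy and information gain as defined above (function names are mine, not from the slides).

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution, e.g. entropy([9, 5]) for 9 yes / 5 no."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    """Gain = entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    after = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - after

# gain of splitting the weather data on outlook:
# before: 9 yes / 5 no; branches sunny 2:3, overcast 4:0, rainy 3:2
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247 bits
```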

example: ID3  (weather data as above)

Class counts (yes:no) for each split of the full data:
outlook:  sunny 2:3, overcast 4:0, rainy 3:2
temp.:    hot 2:2, cool 3:1, mild 4:2
humidity: high 3:4, normal 6:1
windy:    false 6:2, true 3:3
outlook wins; within the sunny branch:
temp.:    hot 0:2, cool 1:0, mild 1:1
humidity: high 0:3, normal 2:0
windy:    false 1:2, true 1:1

example: ID3, the resulting tree

outlook
├─ sunny    → humidity: high → no, normal → yes
├─ overcast → yes
└─ rainy    → windy: false → yes, true → no

Classification rules: PRISM, a covering algorithm
● For each class, seek a way of covering all instances in it.
● Start with: If ? then class C1.
● Choose the attribute-value pair that maximizes the probability of the desired classification.
  – include as many positive instances as possible
  – exclude as many negative instances as possible
● Improve the precondition until the rule is correct.
● There can be more than one rule for a class!
  – Delete the covered instances and try again.
PRISM produces only correct, unordered rules.
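
A Python sketch of growing a single PRISM rule under these slides' setup (dict instances; names illustrative). Ties on accuracy p/t are broken by taking the first candidate; the textbook version prefers larger coverage.

```python
def prism_rule(instances, cls, attributes, target="play"):
    """Grow one rule 'if <tests> then class cls' by repeatedly adding
    the attribute-value test with the highest accuracy p/t, until the
    rule covers only instances of the desired class."""
    conditions, covered = [], list(instances)
    while any(row[target] != cls for row in covered):
        best = None
        for attr in attributes:
            if any(a == attr for a, _ in conditions):
                continue                      # test each attribute at most once
            for value in {row[attr] for row in covered}:
                subset = [row for row in covered if row[attr] == value]
                p = sum(1 for row in subset if row[target] == cls)
                if best is None or p / len(subset) > best[0]:
                    best = (p / len(subset), attr, value)
        if best is None:                      # no tests left to add
            break
        _, attr, value = best
        conditions.append((attr, value))
        covered = [row for row in covered if row[attr] == value]
    return conditions
```

The outer covering loop (not shown) removes the instances a finished rule covers and calls prism_rule again until every instance of the class is covered.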

example: PRISM  (weather data as above)
Seeking: If ? then P=yes

O = sunny    2/5     T = hot  2/4     H = high   3/7     W = false 6/8
O = overcast 4/4     T = mild 4/6     H = normal 6/7     W = true  3/6
O = rainy    3/5     T = cool 3/4

Best: If O=overcast then P=yes

example: PRISM  (covered overcast instances removed)
Seeking: If ? then P=yes

O = sunny 2/5     T = hot  0/2     H = high   1/5     W = false 4/6
O = rainy 3/5     T = mild 3/5     H = normal 4/5     W = true  1/4

Best: If H=normal then P=yes

example: PRISM  (refining the second rule)
Seeking: If H=normal and ? then P=yes

O = sunny 2/2     T = mild 2/2     W = false 3/3
O = rainy 2/3     T = cool 2/3     W = true  1/2

Best: If H=normal and W=false then P=yes

example: PRISM, the resulting rules

If O=overcast then P=yes
If H=normal and W=false then P=yes
If T=mild and H=normal then P=yes
If O=rainy and W=false then P=yes

Association rules
● Structurally the same as classification rules: If – then
● Can predict any attribute or combination of attributes
● Not intended to be used together
● Characteristics, from the contingency table below:
  – Support = a
  – Accuracy = a / (a + b)

           C    non C
P          a    b
non P      c    d

Association rules: Multiple consequences
● If A and B then C and D
● If A and B then C
● If A and B then D
● If A and B and C then D
● If A and B and D then C

Association rules: Algorithm
● Algorithms for classification rules can be used
  – very inefficient
● Instead, we seek rules with a given minimum support, and test their accuracy.
● Item sets: combinations of attribute-value pairs
● Generate item sets with the given support.
● From them, generate rules with the given accuracy.
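
A brute-force Python sketch of this item-set approach (illustrative names; a real implementation would prune candidates Apriori-style rather than enumerate every combination). Only single-consequence rules are generated here; multiple consequences follow the same pattern.

```python
from itertools import combinations

def mine_rules(instances, min_support, min_accuracy):
    """Count support for every attribute-value item set, then turn the
    frequent ones into 'premise -> single consequence' rules, keeping
    those with accuracy = support(itemset) / support(premise) high enough."""
    attrs = list(instances[0])
    support, seen = {}, set()
    for k in range(1, len(attrs) + 1):
        for combo in combinations(attrs, k):
            for row in instances:
                itemset = tuple((a, row[a]) for a in combo)
                if itemset in seen:
                    continue
                seen.add(itemset)
                count = sum(all(r[a] == v for a, v in itemset)
                            for r in instances)
                if count >= min_support:
                    support[itemset] = count
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue
        for i in range(len(itemset)):
            premise = itemset[:i] + itemset[i + 1:]   # leave one item out
            accuracy = count / support[premise]
            if accuracy >= min_accuracy:
                rules.append((premise, itemset[i], count, accuracy))
    return rules
```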

k-nearest neighbors
● Instance-based representation
  – no explicit structure
  – lazy learning
● A new instance is compared with existing ones using a distance metric; per attribute:
  – a = b: d(a, b) = 0
  – a ≠ b: d(a, b) = 1
● The closest k instances are used for classification
  – majority vote (or average, for numeric prediction)
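
A compact Python sketch with this 0/1 per-attribute distance (dict instances as before; names illustrative).

```python
from collections import Counter

def knn_classify(instances, new, k=3, target="play"):
    """Distance = number of attributes on which two instances differ;
    classify by majority vote among the k closest training instances."""
    def distance(row):
        return sum(row[attr] != value for attr, value in new.items())
    nearest = sorted(instances, key=distance)[:k]
    return Counter(row[target] for row in nearest).most_common(1)[0][0]

# e.g. knn_classify(weather, {"outlook": "sunny", "temp": "cool",
#                             "humidity": "high", "windy": "true"}, k=1)
# -> "no" (the single closest instance on the next slide is at distance 1)
```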

example: kNN
New instance: sunny, cool, high, true → ?

outlook   temp.  humidity  windy  play  distance
sunny     hot    high      false  no    2
sunny     hot    high      true   no    1
overcast  hot    high      false  yes   3
rainy     mild   high      false  yes   3
rainy     cool   normal    false  yes   3
rainy     cool   normal    true   no    2
overcast  cool   normal    true   yes   2
sunny     mild   high      false  no    2
sunny     cool   normal    false  yes   2
rainy     mild   normal    false  yes   4
sunny     mild   normal    true   yes   2
overcast  mild   high      true   yes   2
overcast  hot    normal    false  yes   4
rainy     mild   high      true   no    2

Cluster analysis
● Diagram: how the instances fall into clusters.
● One instance can belong to more than one cluster.
● Membership can be probabilistic or fuzzy.
● Clusters can be hierarchical.

Data mining: Conclusion
● Different algorithms discover different knowledge in different formats.
● Simple ideas often work very well.
● There's no magic!

Text mining
● Data mining discovers knowledge in structured data.
● Text mining works with unstructured text. It can:
  – group similar documents
  – classify documents into a taxonomy
  – find the probable author of a document
  – …
● Is it a different task?

How do mathematicians work?

Setting 1:
– empty kettle, fire, source of cold water, tea bag
How to prepare tea:
– put water into the kettle
– put the kettle on the fire
– when the water boils, put the tea bag in the kettle

Setting 2:
– kettle with boiling water, fire, source of cold water, tea bag
How to prepare tea:
– empty the kettle
– follow the previous case

Text mining: Is it different?
● Maybe it is, but we do not care.
● We convert free text to structured data…
● … and "follow the previous case".

Google News: How does it work?
● http://news.google.com
● Search the web for news.
  – Parse the content of given web sites.
● Convert news (documents) to structured data.
  – Documents become vectors.
● Cluster analysis.
  – Similar documents are grouped together.
● Importance analysis.
  – Important documents go to the top.

From documents to vectors
● We match documents with terms
  – can be given (an ontology)
  – can be derived from the documents
● Documents are described as vectors of weights
  – d = (1, 0, 0, 1, 1)
  – t1, t4, t5 are in d
  – t2, t3 are not in d
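
A minimal Python sketch of the presence/absence weighting on this slide (the tokenization and the term list are illustrative assumptions).

```python
def binary_vector(document, terms):
    """Weight 1 if the term occurs in the document, 0 otherwise."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in terms]

# e.g. binary_vector("Text mining converts text to data",
#                    ["data", "mining", "text", "cluster", "vector"])
# -> [1, 1, 1, 0, 0]
```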

TFIDF: Term Frequency / Inverse Document Frequency
TF(t, d) = how many times t occurs in d
DF(t) = in how many documents t occurs at least once
A term is important if
– its TF is high
– its IDF is high (i.e., its DF is low)
Weight(d, t) = TF(t, d) · IDF(t)
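
The slide leaves IDF unspecified; a common choice is IDF(t) = log(N / DF(t)) with N the number of documents, which this sketch assumes (documents as word lists; names illustrative).

```python
from math import log

def tfidf(term, document, corpus):
    """Weight(d, t) = TF(t, d) * IDF(t), with IDF(t) = log(N / DF(t))."""
    tf = document.count(term)                      # TF(t, d)
    df = sum(1 for doc in corpus if term in doc)   # DF(t)
    if df == 0:
        return 0.0
    return tf * log(len(corpus) / df)
```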

Cluster analysis
● Vectors
  – cosine similarity
● On-line analysis
  – A new document arrives.
  – Try k-nearest neighbors.
  – If the neighbors are too far, leave it alone.
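
Cosine similarity on the document vectors from the previous slides, as a short sketch.

```python
from math import sqrt

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|): 1 for documents with the same
    term profile, 0 for documents that share no terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```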

Text mining: Conclusion
● Text mining is very young.
  – Research is heavily on-going.
● We convert text to data.
  – Documents become vectors.
  – Term weights: TFIDF
● We can use data mining methods.
  – Classification
  – Cluster analysis
  – …

References
● Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
● Michael W. Berry: Survey of Text Mining: Clustering, Classification, and Retrieval
● http://kdnuggets.com/
● http://www.cern.ch/Petr.Olmer/dm.html

Questions?
Petr Olmer
petr.olmer@cern.ch
"Computers are useless, they can only give you answers."