Berendt: Knowledge and the Web, 2015, 1 Knowledge and the Web – Inferring new knowledge from data(bases):


Slide 1: Knowledge and the Web – Inferring new knowledge from data(bases): Knowledge Discovery in Databases. Bettina Berendt, KU Leuven, Department of Computer Science. Last update: 25 November 2015

Slide 2: Where are we?

Slide 3: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 4: What should we recommend to a customer/user?

Slide 5: What's spam and what isn't?

Slide 6: Classification / prediction: how is that done? In which weather will someone play (tennis etc.)?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Slide 7: Classification / prediction: What makes people happy?

Slide 8: "Classification along a numerical scale": other forms of sentiment analysis

Slide 9: When we don't know the classes yet, but need to discover them: What "news stories" are there today?

Slide 10: What "circles" of friends do you have?

Slide 11: What "circles" of friends do you have?

Slide 12: Topic detection: What topics exist in a collection of texts, and how do they evolve? News texts, scientific publications, speeches, …

Slide 13: From your questions to the speakers: "These days you hear a lot about Big Data. Nobody seems to have a really good definition for it, though. Do you see linked data as a part of Big Data, or more as something separate?"

Slide 14: A note on last week's remark on the challenges of wrong data: "used by machines" vs. "used by people" (1)

Slide 15: A note on last week's remark on the challenges of wrong data: "used by machines" vs. "used by people" (2)

Slide 16: A note on last week's remark on the challenges of wrong data: "used by machines" vs. "used by people" (3)

Slide 17: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 18: Forms of data analysis.
- Confirmatory: hypothesis testing; an experimental procedure, with data gathered for this purpose; inferential statistics; causality.
- Exploratory: data mining; already-existing data; data-mining and machine-learning models; "correlation" (in a wide sense).
Different basic assumptions and different evaluation methodologies, even when they use the same models (e.g. regression)!

Slide 19: Styles of reasoning: descriptive vs. predictive; deductive vs. inductive inference. Data mining prediction is always inductive inference!

Slide 20: From your questions: Are there any economic indicators, related to the (country of representation of a) speaker, that influence how many speeches are given by a certain country in the European Parliament? Are economically more powerful countries more influential in the European Parliament? Why does Germany have so much influence on European politics, or is this a false statement?

Slide 21: Empiricism and apophenia

Slide 22: Empiricism and apophenia: correlation, causation, and instrumentality

Slide 23: "Correlation replaces causation": business logic and prediction vs. explanation...

Slide 24: A related issue: the number of data points / From your questions: Does the weather in Finland during the European Parliament elections affect the voting behaviour of the Finnish people?

Slide 25: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 26: The KDD process: The output. "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" - Fayyad, Piatetsky-Shapiro, Smyth (1996). Non-trivial process: a multi-step process; valid: justified patterns/models; novel: previously unknown; useful: can be used; understandable: by human and machine.

Slide 27: The process part of knowledge discovery: CRISP-DM (CRoss Industry Standard Process for Data Mining), a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

Slide 28: Knowledge discovery, machine learning, data mining. Knowledge discovery = the whole process. Machine learning = the application of induction algorithms and other algorithms that can be said to "learn" = the "modelling" phase. Data mining: sometimes = KD, sometimes = ML.

Slide 29: How much time will you actually spend modelling?

Slide 30: Standard data mining algorithms work on single tables → Important question for data preparation: How to get from an RDF graph to a table?
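As a minimal sketch of this data-preparation step (illustrative only: the triples, subject names, and predicate names below are made up, and multi-valued predicates are simplistically resolved by keeping the last value), (subject, predicate, object) triples can be pivoted into one row per subject and one column per predicate:

```python
# Sketch: turning RDF-style triples into the single table that standard
# data-mining algorithms expect -- one row per subject, one column per predicate.
from collections import defaultdict

triples = [
    ("mep1", "country", "DE"),
    ("mep1", "party", "EPP"),
    ("mep2", "country", "FI"),
    ("mep2", "party", "S&D"),
]

def triples_to_table(triples):
    rows = defaultdict(dict)
    for s, p, o in triples:
        rows[s][p] = o                     # last value wins for multi-valued predicates
    columns = sorted({p for _, p, _ in triples})
    header = ["subject"] + columns
    table = [[s] + [attrs.get(c) for c in columns]
             for s, attrs in sorted(rows.items())]
    return header, table

header, table = triples_to_table(triples)  # header: subject, country, party
```

Real RDF graphs need more care (multi-valued predicates, missing values, blank nodes), which is exactly why this step takes effort in practice.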

Slide 31: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 32: Descriptive and predictive modelling / learning

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Slide 33: From your questions: Are economically more powerful countries more influential in the European Parliament? ... Economic power can be based on different factors, including Gross Domestic Product per Capita...

Slide 34: A simple descriptive statistic: correlation

Slide 35: "Truly numerical data": Pearson correlation

Slide 36: From your questions: Is there a correlation between the countries of the speakers who give speeches about the environment and the countries that have the best environmental policies (pollution, renewable energy, waste generation, etc.)?

Slide 37: Rank data: Spearman's rank correlation coefficient
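The two coefficients are easy to compute directly; a small sketch (not from the slides; the ranking helper below assumes no ties, which real rank data would need to handle with average ranks):

```python
# Pearson correlation on the raw values; Spearman = Pearson on the ranks.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """Rank positions 1..n (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
linear = [2, 4, 6, 8, 10]     # perfectly linear relationship
squares = [1, 4, 9, 16, 25]   # monotone but non-linear
```

On `squares`, Spearman is still a perfect 1.0 while Pearson is below 1, which illustrates why rank correlation suits ordinal or non-linear monotone data.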

Slide 38: Unclear to me / From your questions: Is there a correlation between BBC coverage and the topic of the talks given at the European Parliament? Is there a correlation between the government type of a country and how much its members talk about democracy?

Slide 39: Understand your data (1): Understand your concepts and how your variables measure them

Slide 40: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 41: Attributes

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

Slide 42: What's in an attribute? Each instance is described by a fixed predefined set of features, its "attributes". But: the number of attributes may vary in practice → possible solution: an "irrelevant value" flag. Related problem: the existence of an attribute may depend on the value of another one. Possible attribute types ("levels of measurement", aka "scales of measurement"): nominal, ordinal, interval, and ratio.

Slide 43: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 44: Task: align example measures, scale of measurement, and allowed operations.
Examples: Temperature (Celsius); Grades at school/university; Pass or no pass (exam); Metres; Temperature ("warm", "cold", ...); Weather ("good", "bad"); Weather ("sunny", "windy", "cold crisp day", ...); Likert-scale values ("on a scale of 1-7, ..."); Duration of work tasks (in minutes); ECTS credits.
Scale levels: Nominal; Ordinal; Interval; Ratio.
Operations: =, ≠; +, -; *, /; %; mode; median; arithmetic mean; geometric mean.

Slide 45: Nominal quantities. Values are distinct symbols; the values themselves serve only as labels or names ("nominal" comes from the Latin word for name). Example: attribute "outlook" from the weather data, with values "sunny", "overcast", and "rainy". No relation is implied among nominal values (no ordering or distance measure). Only equality tests can be performed.

Slide 46: Ordinal quantities. These impose an order on values, but no distance between values is defined. Example: attribute "temperature" in the weather data, with values "hot" > "mild" > "cool". Note: addition and subtraction don't make sense. Example rule: temperature < hot → play = yes. The distinction between nominal and ordinal is not always clear (e.g. attribute "outlook").

Slide 47: Interval quantities. Interval quantities are not only ordered but measured in fixed and equal units. Example 1: attribute "temperature" expressed in degrees Fahrenheit. Example 2: attribute "year". The difference of two values makes sense, but a sum or product doesn't → the zero point is not defined!

Slide 48: Ratio quantities. Ratio quantities are ones for which the measurement scheme defines a zero point. Example: attribute "distance"; the distance between an object and itself is zero. Ratio quantities are treated as real numbers: all mathematical operations are allowed. But: is there an "inherently" defined zero point? The answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature).
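The four scale levels can be made concrete in code (a sketch with made-up samples, not from the slides): each level adds operations on top of the previous one.

```python
# Which summary statistics are meaningful at which scale level.
from statistics import mean, median, mode

outlook = ["sunny", "sunny", "sunny", "overcast", "rainy"]   # nominal
temps_ordinal = ["cool", "mild", "mild", "hot"]              # ordinal
temps_fahrenheit = [64, 68, 70, 85]                          # interval
distances_m = [0.0, 2.5, 5.0]                                # ratio

# Nominal: only equality tests -> the mode is the only meaningful "average".
most_common_outlook = mode(outlook)

# Ordinal: an order exists, so the median is meaningful; sums are not.
rank = {"cool": 0, "mild": 1, "hot": 2}
median_rank = median(rank[t] for t in temps_ordinal)

# Interval: differences and means are meaningful, but not ratios
# (80 F is not "twice as warm" as 40 F, since 0 F is arbitrary).
mean_temp = mean(temps_fahrenheit)

# Ratio: a true zero exists, so ratios make sense too.
distance_ratio = distances_m[2] / distances_m[1]
```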

Slide 49: Task: What issues does this data collection have? (Curriculum mapping: crosses for the Bachelor course "Databases".) BTW, I think it does make a lot of sense for instructors to reflect on what they cover and what they test, & such lists can be helpful for this exercise.


Slide 52: Understanding your data (2): Visualize!

Slide 53: Understanding your data (3): How to visualize non-numerical data? Is there a correlation between the government type of a country and how much its members talk about democracy? → How could you visualize data on this to avoid drawing wrong conclusions already at the outset?

Slide 54: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 55: Supervised and unsupervised learning, and the examples dealt with here. Supervised learning: classification / classifier learning; regression. Unsupervised learning: association rule mining; clustering.

Slide 56: A question to the speakers that I don't quite understand: "A lot of hierarchies in RDF specifications are built using some human compromise between the properties of a concept and the hierarchy in which the concept is classified. Unsupervised learners already outperform humans in some classification tasks. How does this automation influence the availability of linked open data?"

Slide 57: How to: our proposal. Basic KDD techniques: frame your research question in terms of one of these tasks and use software to analyse your data (e.g. RapidMiner). Advanced KDD techniques (topic detection, sentiment analysis): use 3rd-party software (Sebastijan will provide a list). More advanced ideas? Ask / consult with us!

Slide 58: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 59: From your questions: Which European politicians have a high chance of receiving a Nobel Prize? → For the sake of the argument, let us rephrase this a bit to give a typical classification task (see later for a more appropriate formalization): People with what features (feature values) get a Nobel Prize?

Slide 60: Constructing decision trees. Strategy: top down, in recursive divide-and-conquer fashion. First: select an attribute for the root node and create a branch for each possible attribute value. Then: split the instances into subsets, one for each branch extending from the node. Finally: repeat recursively for each branch, using only the instances that reach the branch. Stop if all instances have the same class. We will illustrate the key ideas with ID3, a very simple decision-tree learning algorithm.
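The top-down strategy can be sketched as a compact recursive ID3 on the weather data (illustrative only, not the lecture's implementation; function and attribute names are my own):

```python
# Minimal ID3: pick the attribute with the highest information gain,
# branch on its values, recurse on each subset.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in subsets.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:
        return labels[0]                                  # pure leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]       # majority leaf
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[value] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                          [a for a in attrs if a != best])
    return {best: tree}

weather = [  # (Outlook, Temp, Humidity, Windy, Play)
    ("Sunny", "Hot", "High", False, "No"), ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"), ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"), ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"), ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"), ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]
rows = [{"Outlook": o, "Temp": t, "Humidity": h, "Windy": w} for o, t, h, w, _ in weather]
labels = [p for *_, p in weather]
tree = id3(rows, labels, ["Outlook", "Temp", "Humidity", "Windy"])
```

On this data the sketch reproduces the classic tree: Outlook at the root, Humidity under sunny, Windy under rainy, and a pure "Yes" leaf for overcast.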

Slide 61: Which attribute to select?

Slide 62: Which attribute to select?

Slide 63: Criterion for attribute selection. Which is the best attribute? We want to get the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes. A popular impurity criterion is information gain: information gain increases with the average purity of the subsets. Strategy: choose the attribute that gives the greatest information gain.

Slide 64: Computing information. Measure information in bits: given a probability distribution, the information required to predict an event is the distribution's entropy. Entropy gives the information required in bits (this can involve fractions of bits!). Formula for computing the entropy: entropy(p1, p2, ..., pn) = -p1 log2 p1 - p2 log2 p2 ... - pn log2 pn.

Slide 65: Example: attribute Outlook. info([2,3]) = 0.971 bits (sunny), info([4,0]) = 0 bits (overcast), info([3,2]) = 0.971 bits (rainy). Expected information for Outlook: (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits.

Slide 66: Computing information gain. Information gain = information before splitting - information after splitting. gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits. Information gain for the attributes from the weather data: gain(Outlook) = 0.247 bits, gain(Temperature) = 0.029 bits, gain(Humidity) = 0.152 bits, gain(Windy) = 0.048 bits.
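These gains can be reproduced with a few lines of Python (a sketch: `info` mirrors the entropy formula, and the class-count splits per attribute value are taken from the weather data):

```python
# Information gain for the weather data (9 "yes" / 5 "no" overall).
import math

def info(counts):
    """Entropy of a class distribution given as a list of counts, in bits."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

before = info([9, 5])                    # information before splitting

def gain(*splits):
    n = sum(sum(s) for s in splits)
    return before - sum(sum(s) / n * info(s) for s in splits)

gain_outlook     = gain([2, 3], [4, 0], [3, 2])   # sunny / overcast / rainy
gain_temperature = gain([2, 2], [4, 2], [3, 1])   # hot / mild / cool
gain_humidity    = gain([3, 4], [6, 1])           # high / normal
gain_windy       = gain([3, 3], [6, 2])           # true / false
```

Outlook wins clearly, which is why it becomes the root of the tree.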

Slide 67: Continuing to split (within the Outlook = sunny branch): gain(Temperature) = 0.571 bits, gain(Humidity) = 0.971 bits, gain(Windy) = 0.020 bits.

Slide 68: Final decision tree. Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when the data can't be split any further.

Slide 69: Wishlist for a purity measure. Properties we require: when a node is pure, the measure should be zero; when impurity is maximal (i.e. all classes equally likely), the measure should be maximal; the measure should obey the multistage property (i.e. decisions can be made in several stages), e.g. measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4]). Entropy is the only function that satisfies all three properties!

Slide 70: Properties of the entropy. The multistage property: entropy(p, q, r) = entropy(p, q+r) + (q+r) × entropy(q/(q+r), r/(q+r)). Simplification of computation: info([2,3,4]) = -(2/9) log2(2/9) - (3/9) log2(3/9) - (4/9) log2(4/9) = [-2 log2 2 - 3 log2 3 - 4 log2 4 + 9 log2 9] / 9. Note: instead of maximizing information gain we could just minimize information.

Slide 71: Variants. Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan. Various improvements exist, e.g. C4.5: deals with numeric attributes, missing values, and noisy data; other measures can be used instead of information gain (details: see exercise session / individual study).

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

Slide 72: Classification rules. A popular alternative to decision trees. Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree); the tests are usually logically ANDed together (but may also be general logical expressions). Consequent (conclusion): the classes, set of classes, or probability distribution assigned by the rule. Individual rules are often logically ORed together; conflicts arise if different conclusions apply.

Slide 73: An example
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
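Read as code, this rule set is a first-match classifier (a sketch; representing attribute values as lowercase strings is my own choice, not the slides'):

```python
# The rule set above, applied top to bottom; the last rule is the default.
def play(outlook, humidity, windy):
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"          # "if none of the above"
```

Checking the rules in a fixed order is one simple way to resolve the conflicts that arise when ORed rules reach different conclusions.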

Slide 74: Transition: Trees for numeric prediction. Regression: the process of computing an expression that predicts a numeric quantity. Regression tree: a "decision tree" where each leaf predicts a numeric quantity; the predicted value is the average value of the training instances that reach the leaf. Model tree: a "regression tree" with linear regression models at the leaf nodes; linear patches approximate the continuous function.

Slide 75: An example

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   0
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40
…         …            …         …      …

Slide 76: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 77: From your questions: Are economically more powerful countries more influential in the European Parliament? ... Economic power can be based on different factors, including Gross Domestic Product per Capita...

Slide 78: Lead question: "How does the dependent variable depend on the independent one?" "Can we predict the likely value of the dependent variable for a new data instance (with a given value of the independent variable)?"

Slide 79: Introduction to Linear Regression (the statistical approach). The Pearson correlation measures the degree to which a set of data points forms a straight-line relationship. Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data. Slides 44-49: slightly adapted from

Slide 80: Introduction to Linear Regression (cont.). Any straight line can be represented by an equation of the form Y = bX + a, where b and a are constants. The value of b is called the slope constant and determines the direction and degree to which the line is tilted. The value of a is called the Y-intercept and determines the point where the line crosses the Y-axis.


Slide 82: Introduction to Linear Regression (cont.). How well a set of data points fits a straight line can be measured by calculating the distance between the data points and the line. The total error between the data points and the line is obtained by squaring each distance and then summing the squared values. The regression equation is designed to produce the minimum sum of squared errors.

Slide 83: Introduction to Linear Regression (cont.). The equation for the regression line is Ŷ = bX + a, with slope b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and intercept a = Ȳ − b X̄.
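The least-squares slope and intercept follow directly from those definitions; a minimal sketch (my own helper name, with made-up data that lies exactly on Y = 2X + 1):

```python
# Least-squares fit: b = sum((X - meanX)(Y - meanY)) / sum((X - meanX)^2),
# a = meanY - b * meanX.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    b = sp / ss_x
    a = my - b * mx
    return b, a

b, a = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Because the sample data is perfectly linear, the fit recovers b = 2 and a = 1 exactly; with noisy data the same formulas give the line minimizing the sum of squared errors.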


Slide 85: From your questions: Are economically more powerful countries more influential in the European Parliament? ... Economic power can be based on different factors, including Gross Domestic Product per Capita, Human Development Index... → Multiple regression (details: see exercise session)

Slide 86: From your questions: Is there a correlation between the government type of a country and how much its members talk about democracy? → This has (assumed) categorical predictors, which can be modelled by dummy variables in a linear regression. → Dummy variables

Slide 87: Maybe better to frame as multiple linear regression / From your questions: Is there a correlation between BBC coverage and the topic of the talks given at the European Parliament? Is there a correlation between the government type of a country and how much its members talk about democracy? → Both have (assumed) categorical predictors, which can be modelled by dummy variables in a linear regression.

Slide 88: From your questions: Which European politicians have a high chance of receiving a Nobel Prize?

Slide 89: Logistic regression – input data

Slide 90: Logistic regression – fitting a curve

Slide 91: Logistic regression – prediction

Slide 92: From your questions: Which European politicians have a high chance of receiving a Nobel Prize? → Note: logistic regression also exists in multivariate form (= with multiple predictor variables).
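Univariate logistic-regression prediction can be sketched as follows (the coefficients b and a below are made up for illustration, not fitted to any real data; the fitting itself, e.g. by maximum likelihood, is beyond this sketch):

```python
# The fitted curve gives P(class = 1 | x) = 1 / (1 + exp(-(b*x + a)));
# predict class 1 when that probability is at least 0.5.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, b=1.5, a=-3.0):   # b, a: illustrative coefficients
    return sigmoid(b * x + a)

def predict(x):
    return 1 if predict_proba(x) >= 0.5 else 0
```

With these coefficients the decision boundary sits at x = 2 (where b*x + a = 0 and the probability is exactly 0.5).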

Slide 93: Agenda: Motivation: application examples · Forms of data analysis and styles of reasoning · The process of knowledge discovery · Description and prediction · Data understanding: two important notes (among other issues) · Types of learning tasks: Classification, Regression, Association-rule mining, Clustering

Slide 94: From your questions: To what extent are a politician's topics of choice influenced by their field of study during higher education?* (*phrasing: see the remark on "correlation vs. causation" above!) Are speeches in the European Parliament related to what the public think or search online?

Slide 95: Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...). Where to put: spaghetti, butter?

Slide 96: Data. "Market basket data": attributes with boolean domains. In a table, each row is a basket (aka transaction).

Transaction ID  Attributes (basket items)
1               Spaghetti, tomato sauce
2               Spaghetti, bread
3               Spaghetti, tomato sauce, bread
4               Bread, butter
5               Bread, tomato sauce

Slides 97-100: Solution approach: the Apriori principle and the pruning of the search tree (1)-(4). [Figure: the lattice of itemsets over {spaghetti, tomato sauce, bread, butter}, from the four 1-itemsets through the 2- and 3-itemsets up to the full 4-itemset; when an itemset is found infrequent, all of its supersets are pruned from the search.]

Slide 101: More formally: Generating large k-itemsets with Apriori (baskets as on slide 96). Min. support = 40%. Step 1: candidate 1-itemsets: Spaghetti: support = 3 (60%); tomato sauce: support = 3 (60%); bread: support = 4 (80%); butter: support = 1 (20%).

Contd.

step 2: large 1-itemsets
- Spaghetti
- tomato sauce
- bread

candidate 2-itemsets
- {Spaghetti, tomato sauce}: support = 2 (40%)
- {Spaghetti, bread}: support = 2 (40%)
- {tomato sauce, bread}: support = 2 (40%)

Contd.

step 3: large 2-itemsets
- {Spaghetti, tomato sauce}
- {Spaghetti, bread}
- {tomato sauce, bread}

candidate 3-itemsets
- {Spaghetti, tomato sauce, bread}: support = 1 (20%)

step 4: large 3-itemsets
- { }
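The whole level-wise search (steps 1-4) can be sketched as follows. This is a straightforward, unoptimized rendering of the Apriori idea on the slide's example, not any particular production implementation:

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_count):
    """Return a dict mapping k to the list of large k-itemsets."""
    items = {i for t in transactions for i in t}
    large = {1: [frozenset([i]) for i in items
                 if support(frozenset([i]), transactions) >= min_count]}
    k = 1
    while large[k]:
        # Join step: combine large k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in large[k] for b in large[k]
                      if len(a | b) == k + 1}
        # Prune step (apriori principle): every k-subset must be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in large[k]
                             for s in combinations(c, k))}
        large[k + 1] = [c for c in candidates
                        if support(c, transactions) >= min_count]
        k += 1
    return large

transactions = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]
large = apriori(transactions, min_count=2)  # 40% of 5 transactions
```

Running this reproduces the slides: three large 1-itemsets, three large 2-itemsets, and no large 3-itemsets.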

From itemsets to association rules

Schema: If subset then large k-itemset, with support s and confidence c
- s = (support of large k-itemset) / # tuples
- c = (support of large k-itemset) / (support of subset)

Example: If {spaghetti} then {spaghetti, tomato sauce}
- Support: s = 2 / 5 (40%)
- Confidence: c = 2 / 3 (66%)
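The two formulas can be checked directly against the example transactions; a small sketch (function and variable names are mine):

```python
transactions = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support_count(itemset):
    """Number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Rule: If {spaghetti} then {spaghetti, tomato sauce}
itemset = {"spaghetti", "tomato sauce"}
subset = {"spaghetti"}

s = support_count(itemset) / len(transactions)      # 2 / 5 = 40%
c = support_count(itemset) / support_count(subset)  # 2 / 3 = 66%
```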

From local associations to global models: clustering

To what extent are a politician's topics of choice influenced by their field of study during higher education?
→ Can we find clusters of educational background and topics?

Agenda

Motivation: application examples
Forms of data analysis and styles of reasoning
The process of knowledge discovery
Description and prediction
Data understanding: two important notes (among other issues)
Types of learning tasks
- Classification
- Regression
- Association-rule mining
- Clustering

The basic idea of clustering: group similar things

[Figure: a scatterplot over Attribute 1 and Attribute 2, with the points falling into two groups, Group 1 and Group 2.]

Concepts in Clustering

Defining distance between points
- Euclidean distance
- any other distance (cityblock metric, Levenshtein, Jaccard similarity, ...)

A good clustering is one where
- (Intra-cluster distance) the sum of distances between objects in the same cluster is minimized,
- (Inter-cluster distance) while the distances between different clusters are maximized
- Objective to minimize: F(Intra, Inter)

Clusters can be evaluated with "internal" as well as "external" measures
- Internal measures are related to the intra-/inter-cluster distances
- External measures are related to how representative the clusters are of the "true" classes
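The distance measures named above are short to state as code. A sketch of Euclidean and cityblock distance for numeric points, plus Jaccard similarity for sets (standard definitions; function names are mine):

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cityblock(p, q):
    # Sum of absolute coordinate differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard(a, b):
    # Similarity of two sets: shared items over all items.
    return len(a & b) / len(a | b)
```

For example, `euclidean((0, 0), (3, 4))` is 5.0, while `cityblock((0, 0), (3, 4))` is 7.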

K-means example (K = 2)

[Figure, shown step by step: pick seeds; reassign clusters; compute centroids; reassign clusters; recompute centroids; reassign clusters; converged.]

K-means algorithm
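The two alternating steps of the example above can be written as a minimal K-means sketch, assuming Euclidean distance and mean centroids (a standard textbook formulation, not the original slide's listing):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    centroids = random.Random(seed).sample(points, k)  # pick k seeds
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as its cluster's mean
        # (an empty cluster keeps its old centroid).
        new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:  # converged: assignments no longer change
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
```

`math.dist` requires Python 3.8+; on older versions the `euclidean` function from the previous slide works in its place.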

From local associations to global models: clustering

To what extent are a politician's topics of choice influenced by their field of study during higher education?
→ Can we find clusters of educational background and topics?

Clustering non-numerical data (to follow)

Agenda

Motivation: application examples
Forms of data analysis and styles of reasoning
The process of knowledge discovery
Description and prediction
Data understanding: two important notes (among other issues)
Types of learning tasks
- Classification
- Regression
- Association-rule mining
- Clustering

Next lecture

More on KDD concepts and methods for your projects

Supervised and unsupervised learning, and examples dealt with here

Supervised learning
- Classification / classifier learning
- Regression

Unsupervised learning
- Association rule mining
- Clustering

What's the human input in both types?

References / background reading; acknowledgements

The slides are based on:
- Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann.
- In particular, pp. are based on the instructor slides for that book (chapters 1-4), available as PDF (chapter2.pdf, chapter3.pdf, chapter4.pdf) or as ODP (chapter2.odp, chapter3.odp, chapter4.odp)

Scales (aka levels) of measurement are explained well here: [15 Nov 2014]