Advanced Topics in On-line Social Networks Analysis Social networks analysis seminar, second introductory lecture Danny Hendler Presentation prepared by Yehonatan Cohen and Danny Hendler. Some of the slides are based on the online book “Social Media Mining”.

Talk outline  Node centrality: Degree, Eigenvector, Closeness, Betweenness  Data mining & machine learning concepts  Classification

Node centrality  Name the most central/significant node

Node centrality (continued)  Name it now!

Node centrality: Applications  Detection of the most popular actors in a network o Advertising  Identification of “super spreader” nodes o Health care / Epidemics  Identification of vulnerabilities in the network structure o Network design  …

Node centrality (continued)  What makes a node central? o Number of connections o Its removal disconnects the graph o High number of shortest paths passing through the node o Proximity to all other nodes o A central node is one whose neighbors are central o …

Degree centrality  Degree centrality is the number of a node’s neighbours (see the definition below)  Alternative definitions are possible: take into account connection strengths, take into account connection directions, …
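(The formula on this slide was lost in extraction; the following is the standard definition, as in “Social Media Mining”.) For a graph with adjacency matrix $A$,

$$C_d(v_i) = d_i = \sum_{j=1}^{n} A_{j,i}$$

i.e., the degree $d_i$ of node $v_i$; for directed graphs, in-degree and out-degree give two variants.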

Degree centrality: an example  [Table: node vs. degree for the example graph]

Eigenvector centrality  Not all neighbours are equal: popular ones (with high degree) should weigh more!  The eigenvector centrality of node v_i is defined through the adjacency matrix A, as reconstructed below.
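(Reconstruction of the slide’s lost math, following the standard formulation.) The eigenvector centrality $c_e(v_i)$ satisfies

$$c_e(v_i) = \frac{1}{\lambda} \sum_{j=1}^{n} A_{j,i}\, c_e(v_j), \qquad \text{i.e.,} \qquad \lambda\, \mathbf{c}_e = A^{T} \mathbf{c}_e$$

so $\mathbf{c}_e$ is an eigenvector of $A^T$; by the Perron-Frobenius theorem, choosing the maximum eigenvalue $\lambda$ guarantees all centrality values are positive.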

Eigenvector centrality: an example
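(The worked example is not reproduced here; as a stand-in, a minimal sketch assuming NumPy and a small hypothetical graph rather than the one on the original slide:)

import numpy as np

# Adjacency matrix of a small undirected example graph (hypothetical).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# The eigenvector of the largest eigenvalue gives the centralities;
# by Perron-Frobenius it can be chosen entrywise positive.
vals, vecs = np.linalg.eigh(A)           # eigh: A is symmetric
c = np.abs(vecs[:, np.argmax(vals)])     # dominant eigenvector
c /= c.sum()                             # normalize for readability
print({i: round(x, 3) for i, x in enumerate(c)})

Nodes 1 and 2, the two high-degree nodes, come out most central.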

Closeness centrality  If a node is central, it can reach other nodes “quickly”, i.e., with a smaller average shortest-path length (formula reconstructed below).
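(Reconstruction of the lost formula, following the standard definition.) With $d(v,u)$ the length of a shortest path from $v$ to $u$, and $\bar{\ell}_v$ the average length of shortest paths from $v$,

$$C_c(v) = \frac{1}{\bar{\ell}_v}, \qquad \bar{\ell}_v = \frac{1}{n-1} \sum_{u \neq v} d(v, u)$$

the smaller the average distance, the larger the closeness.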

Closeness centrality: an example  [Table: node vs. closeness for the example graph]

Betweenness centrality  A node is central if many shortest paths between other pairs of nodes pass through it (formula below).
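(The slide body was lost; the standard definition is the following.) With $\sigma_{st}$ the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(v)$ the number of those paths that pass through $v$,

$$C_b(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$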

Betweenness centrality: an example  [Table: node vs. betweenness for the example graph]

Talk outline  Node centrality: Degree, Eigenvector, Closeness, Betweenness  Data mining & machine learning concepts  Classification

Big Data  Data production rate has increased dramatically o Social media data, mobile phone data, healthcare data, purchase data… [Image taken from “Data Science and Prediction”, CACM, December 2013]

Data mining / Knowledge Discovery in Databases (KDD)  Infer actionable knowledge/insights from data o When men buy diapers on Fridays, they also buy beer o Spamming accounts tend to cluster in communities o Both love & hate drive reality ratings  Involves several classes of tasks o Anomaly detection o Association rule learning o Classification o Regression o Summarization o Clustering

Data mining process

Data instances

Data instances (continued)  Example task: predict whether an individual who visits an online bookseller will buy a specific book  [Figure: a labeled example and an unlabeled example of such a data instance]

Talk outline  Node centrality Degree Eigenvector Closeness Betweeness  Data mining & machine learning concepts  Classification

Machine Learning  Herbert Alexander Simon: “Learning is any process by which a system improves performance from experience.”  “Machine Learning is concerned with computer programs that automatically improve their performance through experience.”  [Photo: Herbert Simon, Turing Award 1975, Nobel Prize in Economics 1978]

Machine Learning  Learning = improving with experience at some task: o Improve at task T o With respect to performance measure P o Based on experience E

Machine Learning Applications?

Categories of ML algorithms  Supervised learning algorithms: o Classification (class attribute is discrete): assign data into predefined classes (spam detection, fraudulent credit-card detection) o Regression (class attribute takes real values): predict a real value for a given data instance (predict the price of a given house)  Unsupervised learning algorithms: o Group similar items together into clusters (detect communities in a given social network)

Supervised learning process  We are given a set of labeled examples: records/instances of the form (x, y), where x is a vector of attribute values and y is the class attribute, commonly a scalar  The supervised learning task is to build a model that maps x to y (find a mapping m such that m(x) = y)  Given an unlabeled instance (x’, ?), we compute m(x’)  E.g., fraud/non-fraud prediction

Decision tree learning - an example

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class attribute):

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Model (splitting attributes at internal nodes, class labels at leaves):

Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → TaxInc?
                              ├─ < 80K → NO
                              └─ > 80K → YES

Decision tree construction  Decision trees are constructed recursively from training data using a top-down greedy approach in which features are selected sequentially.  After a feature is selected for a node, a branch is created for each of its values and the training set is partitioned into subsets, each subset falling under the respective feature-value branch; the process then continues recursively on these subsets at the child nodes (see the sketch below).  When selecting features, we prefer features that partition the set of instances into subsets that are more pure; a pure subset has instances that all have the same class attribute value.
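(The deck gives no code; the following is a minimal ID3-style sketch in plain Python. The names entropy, info_gain and build_tree are hypothetical helpers, and it handles categorical features only, so a continuous attribute such as Taxable Income would additionally need threshold selection:)

from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    n = len(rows)
    split = 0.0
    for value in {row[feature] for row in rows}:
        subset = [y for row, y in zip(rows, labels) if row[feature] == value]
        split += (len(subset) / n) * entropy(subset)
    return entropy(labels) - split

def build_tree(rows, labels, features):
    """Top-down greedy construction: pick the feature with maximal
    information gain, branch on its values, recurse on each subset."""
    if len(set(labels)) == 1:              # pure subset: make a leaf
        return labels[0]
    if not features:                       # no features left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [f for f in features if f != best])
    return tree

For the example above one would call build_tree(rows, cheat, ['Refund', 'MarSt']) with each row given as a dict, e.g. {'Refund': 'Yes', 'MarSt': 'Single'}.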

Purity is measured by entropy  Features are selected based on set purity; to measure purity we can use (and minimize) entropy.  For a subset of training instances T with a binary class attribute (values in {+, -}), the entropy of T is defined as

$$\text{entropy}(T) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$

where p_+ is the proportion of positive examples in T and p_- is the proportion of negative examples in T.

Entropy example  Assume a subset T containing 10 instances, seven with a positive class attribute value and three with a negative one [7+, 3-]. The entropy of T is

$$\text{entropy}(T) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 \approx 0.881$$

What is the range of entropy values? [0, 1]: 0 for a pure subset, 1 for a perfectly balanced one.

Information gain (IG)  We select the feature that is most useful in separating between the classes to be learnt, based on IG  IG is the difference between the entropy of the parent node and the weighted average of the child nodes’ entropies, weighted by subset size (see below)  We select the feature that maximizes IG
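(Reconstruction of the lost formula.) For a candidate split of a parent set $T$ into child subsets $T_1, \ldots, T_k$,

$$IG(T) = \text{entropy}(T) - \sum_{i=1}^{k} \frac{|T_i|}{|T|}\, \text{entropy}(T_i)$$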

Information gain calculation example
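(The calculation on this slide is not reproduced here. As a substitute, a self-contained sketch computing the information gain of splitting the 10-record training set above on Refund; the ≈ 0.192 value follows from the entropy definition and is not taken from the slide:)

from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Cheat labels and Refund values of the 10 training records, in Tid order.
cheat  = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']
refund = ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No']

parent = entropy(cheat)                                  # H(3+, 7-) ~ 0.881
yes = [c for c, r in zip(cheat, refund) if r == 'Yes']   # 3 records, all No
no  = [c for c, r in zip(cheat, refund) if r == 'No']    # 7 records, 3 Yes
weighted = (len(yes) / 10) * entropy(yes) + (len(no) / 10) * entropy(no)
print(round(parent - weighted, 3))                       # IG(Refund) ~ 0.192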

Decision tree construction: example

(A sequence of slides animates the construction over the training data above, one split at a time:)
Step 1: Refund is chosen as the first splitting attribute; the Refund = Yes branch is pure (all records have Cheat = No) and becomes a NO leaf.
Step 2: On the Refund = No branch, MarSt is chosen next; the Married branch is pure and becomes a NO leaf.
Step 3: On the Single, Divorced branch, TaxInc is chosen with a split at 80K: < 80K becomes a NO leaf and > 80K becomes a YES leaf.

Classification quality metrics  Binary classification (instances, class labels): (x1, y1), (x2, y2), ..., (xn, yn), with yi ∈ {1, -1}  A classifier provides a class prediction Ŷ for an instance  Outcomes for a prediction:

                    True class 1          True class -1
Predicted class 1   True positive (TP)    False positive (FP)
Predicted class -1  False negative (FN)   True negative (TN)

Classification quality metrics (cont'd)  P(Ŷ = Y): accuracy ( (TP+TN) / (TP+FP+FN+TN) )  P(Ŷ = 1 | Y = 1): true positive rate / recall / sensitivity ( TP / (TP+FN) )  P(Ŷ = 1 | Y = -1): false positive rate ( FP / (FP+TN) )  P(Y = 1 | Ŷ = 1): precision ( TP / (TP+FP) )  See the sketch below.
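(A small illustrative sketch, with hypothetical counts, tying the four outcome counts to these metrics:)

def metrics(tp, fp, fn, tn):
    """Compute the quality metrics above from outcome counts."""
    return {
        'accuracy':  (tp + tn) / (tp + fp + fn + tn),
        'recall':    tp / (tp + fn),          # true positive rate / sensitivity
        'fpr':       fp / (fp + tn),          # false positive rate
        'precision': tp / (tp + fp),
    }

print(metrics(tp=70, fp=10, fn=30, tn=90))    # hypothetical counts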

Classification quality metrics: example  Consider a diagnostic test for a disease  The test has 2 possible outcomes: ‘positive’, suggesting presence of the disease, and ‘negative’  An individual can test either positive or negative for the disease

Classification quality metrics: example  [Figure: two overlapping distributions of test results, one for individuals with the disease and one for individuals without it]

Machine Learning: Classification  (A sequence of slides overlays a decision threshold on the test-result axis: patients whose result falls above the threshold are called “positive”, those below it are called “negative”.)  [Figure: with-disease results above the threshold are the true positives; without-disease results above it are the false positives; without-disease results below it are the true negatives; with-disease results below it are the false negatives]

Machine Learning: Cross-Validation  What if we don’t have enough data to set aside a test dataset?  Cross-validation: each data point is used both as training and as test data  Basic idea: fit the model on 90% of the data and test it on the remaining 10%; then do this on a different 90/10 split, cycling through all 10 cases  10 “folds” is a common rule of thumb

Machine Learning: Cross-Validation  Divide the data into 10 equal pieces P1…P10.  Fit 10 models, each on 90% of the data.  Each data point is treated as an out-of-sample data point by exactly one of the models (see the sketch below).
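(Not from the original deck: a minimal sketch of the procedure, assuming scikit-learn and its bundled iris dataset:)

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10 folds: each model is fit on 9/10 of the data and scored on the
# held-out 1/10, so every instance is out-of-sample exactly once.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores.mean(), scores.std())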