Advanced Topics in On-line Social Networks Analysis
Social networks analysis seminar, second introductory lecture
Danny Hendler
Presentation prepared by Yehonatan Cohen and Danny Hendler. Some of the slides are based on the online book "Social Media Mining".
Talk outline
- Node centrality: degree, eigenvector, closeness, betweenness
- Data mining & machine learning concepts: classification
Node centrality
Name the most central/significant node (graph shown in figure):
Node centrality (continued) Name it now!
Node centrality: applications
- Detection of the most popular actors in a network (advertising)
- Identification of "super-spreader" nodes (health care / epidemics)
- Identification of vulnerabilities in the network structure (network design)
- …
Node centrality (continued)
What makes a node central?
- Number of connections
- Its removal disconnects the graph
- A high number of shortest paths pass through it
- Proximity to all other nodes
- Its neighbors are themselves central
- …
Degree centrality
Degree centrality is the number of a node's neighbours: C_d(v_i) = d_i, the degree of v_i.
Alternative definitions are possible:
- Take into account connection strengths
- Take into account connection directions (in-degree / out-degree)
- …
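To make the definition concrete, here is a minimal Python sketch; the adjacency-list graph and node names are illustrative, not taken from the slides:

```python
# Degree centrality of every node in an undirected graph, represented
# as an adjacency list (dict of node -> set of neighbours).
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

def degree_centrality(graph):
    """Degree centrality C_d(v) = number of neighbours of v."""
    return {v: len(neighbours) for v, neighbours in graph.items()}

print(degree_centrality(graph))  # {'A': 2, 'B': 3, 'C': 2, 'D': 1}
```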
Degree centrality: an example (figure with a table of nodes and their degree values omitted)
Eigenvector centrality
Not all neighbours are equal: popular ones (with high degree) should weigh more!
Eigenvector centrality of node v_i: C_e(v_i) = (1/λ) Σ_j A_(j,i) C_e(v_j), i.e., λ c_e = A^T c_e, where A is the adjacency matrix (A_(i,j) = 1 if v_i is connected to v_j, and 0 otherwise).
Choosing the maximum eigenvalue λ guarantees that all centrality values are positive.
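A common way to compute this in practice is power iteration; the sketch below, with an illustrative adjacency matrix, repeatedly multiplies by A and normalizes until the dominant eigenvector emerges:

```python
import numpy as np

# Eigenvector centrality by power iteration. A is an illustrative
# adjacency matrix (undirected, so A is symmetric). Iterating c <- A c
# and normalizing converges to the eigenvector of the largest
# eigenvalue, whose entries are all positive for a connected graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

def eigenvector_centrality(A, iterations=100):
    c = np.ones(A.shape[0])
    for _ in range(iterations):
        c = A @ c
        c = c / np.linalg.norm(c)   # normalize to avoid overflow
    return c

print(eigenvector_centrality(A))
```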
Eigenvector centrality: an example
Closeness centrality
If a node is central, it can reach other nodes "quickly": a smaller average shortest-path length means higher centrality.
C_c(v_i) = 1 / L(v_i), where L(v_i) = (1/(n−1)) Σ_(v_j ≠ v_i) ℓ_(i,j) is the average length of the shortest paths from v_i to all other nodes.
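A minimal sketch, assuming an unweighted, connected graph: compute shortest-path lengths by breadth-first search and invert their average (the graph and names are illustrative, as above):

```python
from collections import deque

# Closeness centrality: C_c(v) = 1 / (average shortest-path length
# from v to all other nodes), for an unweighted, connected graph.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

def bfs_distances(graph, source):
    """Shortest-path lengths from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(graph, v):
    dist = bfs_distances(graph, v)
    avg = sum(d for u, d in dist.items() if u != v) / (len(graph) - 1)
    return 1.0 / avg

print({v: round(closeness_centrality(graph, v), 3) for v in graph})
```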
Closeness centrality: an example (figure with a table of nodes and their closeness values omitted)
Betweenness centrality
Betweenness centrality measures how often a node lies on shortest paths between other pairs of nodes:
C_b(v_i) = Σ_(s ≠ v_i ≠ t) σ_st(v_i) / σ_st,
where σ_st is the number of shortest paths between s and t, and σ_st(v_i) is the number of those paths that pass through v_i.
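A brute-force sketch of this definition (same illustrative graph as above; Brandes' algorithm is the faster standard choice, but the direct computation mirrors the formula):

```python
from collections import deque
from itertools import combinations

# Betweenness centrality for an unweighted, connected graph. The number
# of shortest s-t paths through v equals sigma_s(v) * sigma_v(t)
# whenever dist_s(v) + dist_v(t) == dist_s(t).
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

def bfs_counts(graph, source):
    """Distances and number of distinct shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u lies on a shortest path to w
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(graph, v):
    info = {s: bfs_counts(graph, s) for s in graph}
    total = 0.0
    for s, t in combinations(graph, 2):  # unordered pairs, undirected graph
        if v in (s, t):
            continue
        dist_s, sigma_s = info[s]
        dist_v, sigma_v = info[v]
        if dist_s[v] + dist_v[t] == dist_s[t]:
            total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total

print({v: betweenness(graph, v) for v in graph})
```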
Betweenness centrality: an example (figure with a table of nodes and their betweenness values omitted)
Talk outline
- Node centrality: degree, eigenvector, closeness, betweenness
- Data mining & machine learning concepts: classification
Big Data
The data production rate has increased dramatically:
- social media data, mobile phone data, healthcare data, purchase data, …
(Image taken from "Data Science and Prediction", CACM, December 2013)
Data mining / Knowledge Discovery in Databases (KDD)
Infer actionable knowledge/insights from data, e.g.:
- When men buy diapers on Fridays, they also buy beer
- Spamming accounts tend to cluster in communities
- Both love & hate drive reality ratings
Involves several classes of tasks:
- Anomaly detection
- Association rule learning
- Classification
- Regression
- Summarization
- Clustering
Data mining process
Data instances
Data instances (continued)
Task: predict whether an individual who visits an online bookseller will buy a specific book.
- Labeled example: the value of the class attribute (bought / did not buy) is known.
- Unlabeled example: the class attribute value is unknown.
Talk outline
- Node centrality: degree, eigenvector, closeness, betweenness
- Data mining & machine learning concepts: classification
Machine Learning
Herbert Alexander Simon: "Learning is any process by which a system improves performance from experience."
"Machine Learning is concerned with computer programs that automatically improve their performance through experience."
(Herbert Simon: Turing Award 1975, Nobel Prize in Economics 1978)
Machine Learning
Learning = improving with experience at some task:
- improve over task T,
- with respect to performance measure P,
- based on experience E.
Machine Learning Applications?
Categories of ML algorithms
Supervised learning:
- Classification (class attribute is discrete): assign data into predefined classes, e.g., spam detection, fraudulent credit-card detection
- Regression (class attribute takes real values): predict a real value for a given data instance, e.g., predict the price of a given house
Unsupervised learning:
- Clustering: group similar items together into clusters, e.g., detect communities in a given social network
Supervised learning process
We are given a set of labeled examples. These examples are records/instances of the form (x, y), where x is a feature vector and y is the class attribute, commonly a scalar.
The supervised learning task is to build a model that maps x to y (find a mapping m such that m(x) = y).
Given an unlabeled instance (x', ?), we compute m(x'), e.g., for fraud/non-fraud prediction.
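As a minimal illustration of this workflow, the sketch below fits a model on a tiny, made-up labeled dataset using scikit-learn and applies it to an unlabeled instance; the data values and feature encoding are purely illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

# Supervised learning in miniature: learn a mapping m from labeled
# examples (x, y), then apply it to an unlabeled instance x'.
X = [[0, 125], [1, 100], [1, 70], [0, 120]]   # feature vectors x (illustrative)
y = [0, 0, 1, 0]                              # class attribute y (illustrative)

m = DecisionTreeClassifier().fit(X, y)        # build the model m(x) = y
print(m.predict([[1, 80]]))                   # classify an unlabeled x'
```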
Decision tree learning: an example
Training data (Refund and Marital status are categorical, Taxable income is an integer, Cheat is the class label):

Tid  Refund  Marital status  Taxable income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Resulting decision tree (splitting attributes in internal nodes, class labels in leaves):
- Refund = Yes → NO
- Refund = No → MarSt:
  - Married → NO
  - Single, Divorced → TaxInc:
    - < 80K → NO
    - > 80K → YES
Decision tree construction
Decision trees are constructed recursively from training data using a top-down greedy approach in which features are selected one at a time. After a feature is selected for a node, a branch is created for each of its values, and the training set is partitioned into subsets accordingly, each subset falling under its respective branch; the process then continues recursively on these subsets at the child nodes. A sketch of this recursion appears below.
When selecting features, we prefer those that partition the instances into purer subsets; a pure subset is one whose instances all have the same class attribute value.
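A minimal sketch of this recursion for categorical features, using entropy-based purity (picking the split with the lowest weighted child entropy is equivalent to maximizing information gain, both defined on the following slides); the instance format and names are illustrative:

```python
import math
from collections import Counter

# Top-down greedy decision-tree construction for categorical features:
# at each node pick the feature whose split yields the purest (lowest
# weighted-entropy) subsets, then recurse. Instances are
# (features_dict, label) pairs.

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_entropy(instances, feature):
    """Weighted average entropy of the subsets induced by `feature`."""
    groups = {}
    for x, y in instances:
        groups.setdefault(x[feature], []).append(y)
    n = len(instances)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def build_tree(instances, features):
    labels = [y for _, y in instances]
    if len(set(labels)) == 1 or not features:      # pure subset, or no features left
        return Counter(labels).most_common(1)[0][0]
    best = min(features, key=lambda f: split_entropy(instances, f))
    tree = {"feature": best, "branches": {}}
    remaining = [f for f in features if f != best]
    for value in {x[best] for x, _ in instances}:  # one branch per feature value
        subset = [(x, y) for x, y in instances if x[best] == value]
        tree["branches"][value] = build_tree(subset, remaining)
    return tree
```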
Purity is measured by entropy
Features are selected based on set purity; to measure purity we can use (and minimize) entropy. For a subset of training instances T with a binary class attribute (values in {+, −}), the entropy of T is defined as
entropy(T) = −p₊ log₂ p₊ − p₋ log₂ p₋,
where p₊ is the proportion of positive examples in T and p₋ is the proportion of negative examples in T.
Entropy example
Assume a subset T containing 10 instances, seven with a positive class attribute value and three with a negative one [7+, 3−]. The entropy of T is
entropy(T) = −0.7 log₂ 0.7 − 0.3 log₂ 0.3 ≈ 0.881.
What is the range of entropy values? [0, 1]: 0 for a pure subset, 1 for a perfectly balanced one.
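The same arithmetic in a few lines of Python, including the two extreme cases:

```python
import math

# entropy(T) = -p_plus*log2(p_plus) - p_minus*log2(p_minus),
# with the convention 0 * log2(0) = 0.
def entropy(p_plus):
    p_minus = 1 - p_plus
    return sum(-p * math.log2(p) for p in (p_plus, p_minus) if p > 0)

print(round(entropy(0.7), 3))  # 0.881 -- the [7+, 3-] subset above
print(entropy(1.0))            # 0.0   -- pure subset, e.g. [10+, 0-]
print(entropy(0.5))            # 1.0   -- perfectly balanced subset
```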
Information gain (IG)
We select the feature that is most useful for separating the classes to be learnt, based on IG. IG is the difference between the entropy of the parent node and the weighted average entropy of the child nodes (each child weighted by its fraction of the parent's instances):
IG = entropy(T) − Σᵢ (|Tᵢ| / |T|) · entropy(Tᵢ).
We select the feature that maximizes IG.
Information gain calculation example
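The slide's own figure is not reproduced here; the sketch below instead works through a hypothetical split of the [7+, 3−] subset from the previous example into children [6+, 1−] and [1+, 2−]:

```python
import math

# IG = entropy(parent) - weighted average entropy of the children.
# The child subsets below are illustrative, not from the slides.
def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

parent = entropy(7, 3)                                     # [7+, 3-]
children = (7 / 10) * entropy(6, 1) + (3 / 10) * entropy(1, 2)
print(round(parent - children, 3))                         # IG of this candidate split
```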
Decision tree construction: example
Using the training data above, the tree is grown greedily, one split at a time:
1. Refund is selected as the first splitting attribute. The Refund = Yes branch (Tids 1, 4, 7) is pure (all Cheat = No) and becomes a leaf labeled NO.
2. On the Refund = No branch, MarSt is selected next. The Married branch (Tids 2, 6, 9) is pure and becomes a leaf labeled NO.
3. On the Single/Divorced branch, TaxInc is selected. Instances with income < 80K (Tid 3) form a leaf labeled NO; instances with income > 80K (Tids 5, 8, 10) form a leaf labeled YES.
The resulting model is the decision tree shown with the training data above.
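The finished model can be read off as plain rules; the following snippet mirrors the tree above (attribute names abbreviated as in the slides):

```python
# The final tree from the example: Refund -> MarSt -> TaxInc.
def predict_cheat(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K
    return "Yes" if taxable_income_k > 80 else "No"

print(predict_cheat("No", "Single", 85))    # "Yes" (matches Tid 8)
print(predict_cheat("Yes", "Single", 125))  # "No"  (matches Tid 1)
```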
Classification quality metrics
Binary classification. (Instances, class labels): (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), where each yᵢ is {1, −1}-valued.
A classifier provides a class prediction Ŷ for an instance. The possible outcomes of a prediction are:

                 Y = 1                 Y = −1
Ŷ = 1      True positive (TP)    False positive (FP)
Ŷ = −1     False negative (FN)   True negative (TN)
Classification quality metrics (cont'd)
- P(Ŷ = Y): accuracy = (TP + TN) / (TP + FP + FN + TN)
- P(Ŷ = 1 | Y = 1): true positive rate / recall / sensitivity = TP / (TP + FN)
- P(Ŷ = 1 | Y = −1): false positive rate = FP / (FP + TN)
- P(Y = 1 | Ŷ = 1): precision = TP / (TP + FP)
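The four metrics computed from confusion-matrix counts; the counts here are illustrative:

```python
# Confusion-matrix counts (illustrative values).
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # P(Y_hat = Y)
recall    = tp / (tp + fn)                    # P(Y_hat = 1 | Y = 1), sensitivity
fpr       = fp / (fp + tn)                    # P(Y_hat = 1 | Y = -1)
precision = tp / (tp + fp)                    # P(Y = 1 | Y_hat = 1)

print(accuracy, recall, fpr, precision)
```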
Classification quality metrics: example
Consider a diagnostic test for a disease. The test has two possible outcomes: 'positive', suggesting the presence of the disease, and 'negative', suggesting its absence. Each individual tests either positive or negative for the disease.
Classification quality metrics: example (continued)
(Figure: overlapping distributions of test results for individuals with and without the disease. A decision threshold splits the axis: patients on one side are called "positive", the rest "negative". Diseased individuals called positive are true positives; healthy individuals called positive are false positives; healthy individuals called negative are true negatives; diseased individuals called negative are false negatives.)
Machine Learning: Cross-Validation
What if we don't have enough data to set aside a test dataset? Cross-validation: each data point is used both as training and as test data.
Basic idea: fit the model on 90% of the data and test it on the remaining 10%; then repeat with a different 90/10 split, cycling through all 10 cases. Using 10 "folds" is a common rule of thumb.
Machine Learning: Cross-Validation
Divide the data into 10 equal pieces P₁ … P₁₀. Fit 10 models, each on 90% of the data. Each data point is treated as an out-of-sample point by exactly one of the models.
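A minimal sketch of this procedure; `fit` and `evaluate` are caller-supplied placeholders, not a specific library API:

```python
import random

# k-fold cross-validation: every point is held out by exactly one model.
def cross_validate(data, fit, evaluate, k=10, seed=0):
    data = data[:]                       # shuffle a copy, leave input intact
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]                                  # the held-out ~10%
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(evaluate(fit(train), test))        # fit on the other ~90%
    return sum(scores) / k               # average score over the k folds
```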