Advanced Topics in On-line Social Networks Analysis
Social networks analysis seminar, second introductory lecture
Danny Hendler
Presentation prepared by Yehonatan Cohen and Danny Hendler. Some of the slides are based on the online book "Social Media Mining".
Talk outline
o Node centrality
  - Degree
  - Eigenvector
  - Closeness
  - Betweenness
o Data mining & machine learning concepts
  - Classification
Node centrality
Name the most central/significant node:
[Figure: an example network on nodes 1-13]
Node centrality (continued)
Name it now!
[Figure: another example network on nodes 1-13]
Node centrality: Applications
o Detection of the most popular actors in a network (advertising)
o Identification of "super spreader" nodes (health care / epidemics)
o Identification of vulnerabilities in network structure (network design)
o ...
Node centrality (continued)
What makes a node central?
o Number of connections
o Its removal disconnects the graph
o High number of shortest paths passing through the node
o Proximity to all other nodes
o A central node is one whose neighbors are central
o ...
Degree centrality
Degree centrality is the number of a node's neighbours:
$C_d(v_i) = d_i = \sum_j A_{ij}$
where $A$ is the graph's adjacency matrix.
Alternative definitions are possible:
o Take into account connection strengths
o Take into account connection directions
o ...
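A minimal sketch of this computation in plain Python (the example graph is hypothetical):

    # Degree centrality: sum each row of a symmetric adjacency matrix
    # to count the node's neighbours.
    def degree_centrality(A):
        return [sum(row) for row in A]

    # A small hypothetical undirected graph on 4 nodes.
    A = [[0, 1, 1, 0],
         [1, 0, 1, 1],
         [1, 1, 0, 0],
         [0, 1, 0, 0]]
    print(degree_centrality(A))  # [2, 3, 2, 1]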
Degree centrality: an example
[Figure: the 13-node example network]
Node  Degree
4     4
6     3
7     3
8     3
9     3
10    3
11    2
12    2
Eigenvector centrality
Not all neighbours are equal: popular ones (with high degree) should weigh more!
The eigenvector centrality of node $v_i$ is
$c_e(v_i) = \frac{1}{\lambda} \sum_j A_{ij} \, c_e(v_j)$
where $A$ is the adjacency matrix ($A_{ij} = 1$ iff $v_i$ and $v_j$ are connected) and $\lambda$ is a constant. In matrix form, $\lambda \mathbf{c} = A \mathbf{c}$, so $\mathbf{c}$ is an eigenvector of $A$. Choosing the maximum eigenvalue guarantees all centrality values are positive.
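A sketch of the standard power-iteration approach to computing this (plain Python; the fixed iteration budget is an assumption, a real implementation would test for convergence):

    # Power iteration: repeatedly multiply by the adjacency matrix and
    # renormalize; for a connected graph this converges to the
    # eigenvector associated with the largest eigenvalue.
    def eigenvector_centrality(A, iterations=100):
        n = len(A)
        c = [1.0] * n
        for _ in range(iterations):
            c_new = [sum(A[i][j] * c[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in c_new) ** 0.5
            c = [x / norm for x in c_new]
        return c

    A = [[0, 1, 1, 0],    # the same hypothetical 4-node graph as above
         [1, 0, 1, 1],
         [1, 1, 0, 0],
         [0, 1, 0, 0]]
    print(eigenvector_centrality(A))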
Eigenvector centrality: an example
Closeness centrality
If a node is central, it can reach other nodes "quickly", i.e., it has a small average shortest-path length:
$C_c(v_i) = \frac{1}{\bar{\ell}_{v_i}}$, where $\bar{\ell}_{v_i} = \frac{1}{n-1} \sum_{v_j \ne v_i} \ell_{i,j}$
is the average length of the shortest paths from $v_i$ to all other nodes.
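A sketch for unweighted graphs using breadth-first search (plain Python; the adjacency-list format is an assumption):

    # Closeness centrality: BFS shortest-path lengths from the source,
    # then the inverse of their average.
    from collections import deque

    def closeness_centrality(adj, source):
        # adj: dict mapping each node to a list of its neighbours
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        lengths = [d for node, d in dist.items() if node != source]
        return len(lengths) / sum(lengths) if lengths else 0.0

    adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}  # a hypothetical path graph
    print(closeness_centrality(adj, 2))           # 3 / (1 + 1 + 2) = 0.75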
Closeness centrality: an example
[Figure: the 13-node example network]
Node  Closeness
4     0.353
6     0.438
7     0.444
8     0.4
9     0.428
10    0.342
11
12
Betweenness centrality
A node is central if many shortest paths pass through it:
$C_b(v_i) = \sum_{s \ne t \ne v_i} \frac{\sigma_{st}(v_i)}{\sigma_{st}}$
where $\sigma_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(v_i)$ is the number of those paths that pass through $v_i$.
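As an illustrative sketch, networkx (an assumed dependency) computes this definition directly on a small hypothetical graph:

    # Betweenness centrality of every node, left unnormalized so the
    # raw shortest-path counts are visible.
    import networkx as nx

    G = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4), (4, 5)])
    print(nx.betweenness_centrality(G, normalized=False))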
Betweenness centrality: an example
[Figure: the 13-node example network]
Node  Betweenness
4     30
6     39
7     36
8     21.5
9     7.5
10    20.5
11
12
Talk outline
o Node centrality
  - Degree
  - Eigenvector
  - Closeness
  - Betweenness
o Data mining & machine learning concepts
  - Classification
Big Data
The data production rate has dramatically increased:
o Social media data, mobile phone data, healthcare data, purchase data, ...
[Image taken from "Data Science and Prediction", CACM, December 2013]
Data mining / Knowledge Discovery in Databases (KDD)
Infer actionable knowledge/insights from data:
o When men buy diapers on Fridays, they also buy beer
o Email spamming accounts tend to cluster in communities
o Both love & hate drive reality ratings
Involves several classes of tasks:
o Anomaly detection
o Association rule learning
o Classification
o Regression
o Summarization
o Clustering
Data mining process
Data instances
Data instances (continued)
Predict whether an individual who visits an online book seller will buy a specific book.
[Figure: a labeled example and an unlabeled example]
Talk outline
o Node centrality
  - Degree
  - Eigenvector
  - Closeness
  - Betweenness
o Data mining & machine learning concepts
  - Classification
Machine Learning
Herbert Alexander Simon (Turing Award 1975, Nobel Prize in Economics 1978):
"Learning is any process by which a system improves performance from experience."
"Machine Learning is concerned with computer programs that automatically improve their performance through experience."
Machine Learning (continued)
Learning = improving with experience at some task:
o improve over task T,
o with respect to performance measure P,
o based on experience E.
Machine Learning Applications?
Categories of ML algorithms
Supervised learning algorithms
o Classification (class attribute is discrete): assign data into predefined classes, e.g., spam detection, fraudulent credit-card detection
o Regression (class attribute takes real values): predict a real value for a given data instance, e.g., predict the price of a given house
Unsupervised learning algorithms
o Clustering: group similar items together into clusters, e.g., detect communities in a given social network
Supervised learning process
o We are given a set of labeled examples: records/instances of the form (x, y), where x is a feature vector and y is the class attribute, commonly a scalar.
o The supervised learning task is to build a model that maps x to y (find a mapping m such that m(x) = y).
o Given an unlabeled instance (x', ?), we compute m(x'), e.g., fraud/non-fraud prediction.
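A minimal sketch of this process with scikit-learn (an assumed dependency; the features and labels are hypothetical):

    # Supervised learning: fit a model m on labeled (x, y) pairs,
    # then apply it to an unlabeled instance x'.
    from sklearn.tree import DecisionTreeClassifier

    X_train = [[1, 125], [0, 100], [0, 70], [1, 120]]  # feature vectors x
    y_train = [0, 0, 1, 0]                             # class attribute y
    model = DecisionTreeClassifier().fit(X_train, y_train)
    print(model.predict([[0, 80]]))                    # m(x') for an unlabeled x'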
Decision tree learning - an example
Training data (Refund and Marital status are categorical, Taxable income is an integer, Cheat is the class attribute):

Tid  Refund  Marital status  Taxable income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

The learnt decision tree (splitting attributes at internal nodes, class labels at leaves):
Refund?
o Yes -> No
o No -> MarSt?
  - Married -> No
  - Single, Divorced -> TaxInc?
    - < 80K -> No
    - > 80K -> Yes
Decision tree construction
o Decision trees are constructed recursively from training data, using a top-down greedy approach in which features are sequentially selected.
o After a feature is selected for a node, a branch is created for each of its values and the training set is partitioned into subsets accordingly, each subset falling under the respective branch. The process then continues recursively for these subsets at the child nodes.
o When selecting features, we prefer those that partition the set of instances into subsets that are more pure. A pure subset is one in which all instances have the same class attribute value.
Purity is measured by entropy
Features are selected based on set purity. To measure purity we can use (and minimize) entropy. Over a subset of training instances $T$ with a binary class attribute (values in {+, -}), the entropy of $T$ is defined as
$\mathrm{entropy}(T) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
where $p_+$ is the proportion of positive examples in $T$ and $p_-$ is the proportion of negative examples in $T$.
Entropy example
Assume a subset $T$ containing 10 instances, where seven have a positive class attribute value and three have a negative one [7+, 3-]. The entropy of $T$ is
$\mathrm{entropy}(T) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 \approx 0.881$
What is the range of entropy values? [0, 1]: 0 for a pure subset, 1 for a balanced one.
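A small sketch that checks this computation (plain Python):

    # Entropy of a binary class distribution; entropy(0.7) reproduces
    # the [7+, 3-] example (~0.881).
    from math import log2

    def entropy(p_pos):
        result = 0.0
        for p in (p_pos, 1.0 - p_pos):
            if p > 0:              # 0 * log2(0) is taken to be 0
                result -= p * log2(p)
        return result

    print(entropy(0.7))   # ~0.8813
    print(entropy(1.0))   # 0.0 (pure)
    print(entropy(0.5))   # 1.0 (balanced)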
Information gain (IG)
o We select the feature that is most useful in separating between the classes to be learnt, based on IG.
o IG is the difference between the entropy of the parent node and the weighted average entropy of the child nodes.
o We select the feature that maximizes IG.
Information gain calculation example
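A sketch of such a calculation on hypothetical counts (plain Python): a parent subset [7+, 3-] is split by a binary feature into children [6+, 1-] and [1+, 2-].

    # Information gain: parent entropy minus the size-weighted average
    # of the children's entropies.
    from math import log2

    def entropy(p):  # binary entropy, as in the previous sketch
        return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

    def information_gain(parent, children):
        # parent: (n_pos, n_neg); children: list of (n_pos, n_neg) pairs
        total = sum(parent)
        child_term = sum((p + n) / total * entropy(p / (p + n))
                         for p, n in children)
        return entropy(parent[0] / total) - child_term

    print(information_gain((7, 3), [(6, 1), (1, 2)]))  # ~0.19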
Decision tree construction: example
Applying this procedure to the training data above: Refund is selected as the first splitting attribute (all Refund = Yes instances have Cheat = No); for the Refund = No branch, MarSt is selected next (all Married instances have Cheat = No); finally, for the Single/Divorced branch, TaxInc is selected, separating incomes below 80K (Cheat = No) from those above (Cheat = Yes). Step by step, this grows the tree shown earlier:
Refund?
o Yes -> NO
o No -> MarSt?
  - Married -> NO
  - Single, Divorced -> TaxInc?
    - < 80K -> NO
    - > 80K -> YES
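A compact sketch of this greedy recursive construction in plain Python (the record format and field names are hypothetical; it selects the split minimizing the weighted child entropy, which is equivalent to maximizing IG since the parent entropy is the same for every candidate feature):

    from math import log2

    def label_entropy(labels):
        # Entropy of a list of class labels (handles any number of classes).
        n = len(labels)
        return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                    for v in set(labels))

    def build_tree(rows, features):
        labels = [r["label"] for r in rows]
        if len(set(labels)) == 1 or not features:
            return max(set(labels), key=labels.count)     # leaf: majority class
        def weighted_child_entropy(f):
            groups = {}
            for r in rows:
                groups.setdefault(r[f], []).append(r["label"])
            return sum(len(g) / len(rows) * label_entropy(g)
                       for g in groups.values())
        best = min(features, key=weighted_child_entropy)  # feature with max IG
        rest = [f for f in features if f != best]
        branches = {}
        for r in rows:
            branches.setdefault(r[best], []).append(r)
        return {best: {v: build_tree(sub, rest) for v, sub in branches.items()}}

    rows = [{"Refund": "Yes", "MarSt": "Single",  "label": "No"},
            {"Refund": "No",  "MarSt": "Married", "label": "No"},
            {"Refund": "No",  "MarSt": "Single",  "label": "Yes"}]
    print(build_tree(rows, ["Refund", "MarSt"]))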
Classification quality metrics
Binary classification: (instances, class labels) (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where each y_i is {1, -1}-valued.
A classifier provides a class prediction Ŷ for an instance. Possible outcomes for a prediction:

                     True class 1          True class -1
Predicted class 1    True positive (TP)    False positive (FP)
Predicted class -1   False negative (FN)   True negative (TN)
Classification quality metrics (cont'd)
o P(Ŷ = Y): accuracy ((TP+TN) / (TP+TN+FP+FN))
o P(Ŷ = 1 | Y = 1): true positive rate / recall / sensitivity (TP / (TP+FN))
o P(Ŷ = 1 | Y = -1): false positive rate (FP / (FP+TN))
o P(Y = 1 | Ŷ = 1): precision (TP / (TP+FP))
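A sketch computing these metrics from label vectors (plain Python; the labels are hypothetical):

    # Confusion-matrix counts and the four metrics above, for
    # {1, -1}-valued true and predicted labels.
    def classification_metrics(y_true, y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == -1 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == -1)
        tn = sum(1 for t, p in pairs if t == -1 and p == -1)
        return {"accuracy":  (tp + tn) / len(pairs),
                "recall":    tp / (tp + fn) if tp + fn else 0.0,
                "fpr":       fp / (fp + tn) if fp + tn else 0.0,
                "precision": tp / (tp + fp) if tp + fp else 0.0}

    print(classification_metrics([1, 1, -1, -1, 1], [1, -1, -1, 1, 1]))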
Classification quality metrics: example
Consider a diagnostic test for a disease. The test has two possible outcomes:
o 'positive', suggesting presence of the disease
o 'negative'
An individual can test either positive or negative for the disease.
Classification quality metrics: example (continued)
[Figure: distributions of test results for individuals with the disease and individuals without the disease]
Machine Learning: Classification
[Figure: a decision threshold on the test result. Patients above the threshold are called "positive", patients below it are called "negative". Against the two distributions this yields four regions: true positives (with the disease, called positive), false positives (without the disease, called positive), true negatives (without the disease, called negative), and false negatives (with the disease, called negative).]
Machine Learning: Cross-Validation
What if we don't have enough data to set aside a test dataset?
Cross-validation: each data point is used both as training and as test data. Basic idea:
o Fit the model on 90% of the data; test it on the other 10%.
o Now do this on a different 90/10 split.
o Cycle through all 10 cases.
10 "folds" is a common rule of thumb.
Machine Learning: Cross-Validation (continued)
o Divide the data into 10 equal pieces P_1, ..., P_10.
o Fit 10 models, each on 90% of the data.
o Each data point is treated as an out-of-sample data point by exactly one of the models.
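A sketch of this scheme in plain Python (fit and predict are hypothetical user-supplied callables; assumes at least k data points):

    # k-fold cross-validation: every point is out-of-sample for exactly
    # one of the k models; returns the average per-fold accuracy.
    def cross_validate(X, y, fit, predict, k=10):
        scores = []
        for fold in range(k):
            train = [i for i in range(len(X)) if i % k != fold]
            test = [i for i in range(len(X)) if i % k == fold]
            model = fit([X[i] for i in train], [y[i] for i in train])
            correct = sum(predict(model, X[i]) == y[i] for i in test)
            scores.append(correct / len(test))
        return sum(scores) / k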