
1 Danny Hendler, Advanced Topics in on-line Social Networks Analysis
Social networks analysis seminar, second introductory lecture
Presentation prepared by Yehonatan Cohen and Danny Hendler. Some of the slides are based on the online book “Social Media Mining”.

2 Talk outline: Node centrality (Degree, Eigenvector, Closeness, Betweenness); Data mining & machine learning concepts; Classification

3 Node centrality
Name the most central/significant node: [Graph figure with nodes 1–13]

4 Node centrality (continued)
Name it now! [Graph figure with nodes 1–13]

5  Detection of the most popular actors in a network  Advertising  Identification of “super spreader” nodes  Health care / Epidemics  Identify vulnerabilities in network structure  Network design  … Node centrality: Applications

6 Node centrality (continued)  What makes a node central? Number of connections It is central if its removal disconnects the graph High number of shortest paths passing through the node Proximity to all other nodes Central node is the one whose neighbors are central …

7 Degree centrality
Degree centrality is the number of a node’s neighbours: C_d(v_i) = d_i, the degree of v_i (for an undirected graph, d_i = Σ_j A_{i,j}).
Alternative definitions are possible: take into account connection strengths, take into account connection directions, ...

8 Degree centrality: an example
[Graph figure with nodes 1–13]
Node:    4  6  7  8  9  10  11  12
Degree:  4  3  3  3  3   3   2   2
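A minimal sketch of computing degree centrality in Python with networkx; the lecture's 13-node graph is not recoverable from the transcript, so the edge list below is a small hypothetical stand-in.

    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)])

    # Degree centrality as defined on the slide: the number of neighbours.
    print(dict(G.degree()))        # {1: 1, 2: 3, 3: 2, 4: 4, 5: 2, 6: 2}

    # networkx also offers a normalized variant (degree / (n - 1)).
    print(nx.degree_centrality(G))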

9 Eigenvector centrality
Not all neighbours are equal: popular ones (with high degree) should weigh more!
Eigenvector centrality of node v_i: c_e(v_i) = (1/λ) · Σ_j A_{j,i} · c_e(v_j), where A is the adjacency matrix (A_{j,i} = 1 if v_j is connected to v_i and 0 otherwise). In vector form, λ·c_e = A^T c_e, i.e., c_e is an eigenvector of A^T. Choosing the maximum eigenvalue guarantees all centrality values are positive.

10 Eigenvector centrality: an example
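A minimal sketch of eigenvector centrality via the power method, assuming numpy; the adjacency matrix is a small hypothetical 4-node graph, since the lecture's example figure is not in the transcript.

    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 0],
                  [0, 1, 0, 0]], dtype=float)

    c = np.ones(A.shape[0])      # start from a uniform vector
    for _ in range(100):         # repeatedly apply A and renormalize
        c = A @ c
        c = c / np.linalg.norm(c)

    print(c)  # converges to the principal eigenvector (up to scaling)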

11 Closeness centrality
If a node is central, it can reach other nodes “quickly”, i.e., its average shortest-path distance to them is small:
C_c(v_i) = 1 / l̄_{v_i}, where l̄_{v_i} = (1/(n-1)) · Σ_{j≠i} l_{i,j} is the average length of the shortest paths from v_i to all other nodes.

12 Closeness centrality: an example
[Graph figure with nodes 1–13]
Node:       4      6      7      8      9      10     11  12
Closeness:  0.353  0.438  0.444  0.4    0.428  0.342  –   –
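A minimal sketch of closeness centrality with networkx, again on a small hypothetical graph rather than the lecture's example.

    import networkx as nx

    G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 5), (3, 5)])

    # Same definition as the slide: (n - 1) divided by the sum of
    # shortest-path distances from the node to all other nodes.
    print(nx.closeness_centrality(G))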

13 Betweenness centrality
A node is central if many shortest paths between other nodes pass through it:
C_b(v_i) = Σ_{s≠t≠v_i} σ_{st}(v_i) / σ_{st}, where σ_{st} is the number of shortest paths between s and t, and σ_{st}(v_i) is the number of those paths that pass through v_i.

14 Betweenness centrality: an example
[Graph figure with nodes 1–13]
Node:         4   6   7   8     9    10    11  12
Betweenness:  30  39  36  21.5  7.5  20.5  –   –
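A minimal sketch of betweenness centrality with networkx on a hypothetical graph; normalized=False returns raw pair counts, matching the scale of the slide's example values.

    import networkx as nx

    G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 5), (2, 5)])

    print(nx.betweenness_centrality(G, normalized=False))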

15 Talk outline: Node centrality (Degree, Eigenvector, Closeness, Betweenness); Data mining & machine learning concepts; Classification

16 Big Data
The rate of data production has increased dramatically: social media data, mobile phone data, healthcare data, purchase data, ...
Image taken from “Data Science and Prediction”, CACM, December 2013

17 Data mining / Knowledge Discovery in Databases (KDD)
Infer actionable knowledge/insights from data, e.g.: when men buy diapers on Fridays, they also buy beer; email spamming accounts tend to cluster in communities; both love & hate drive reality ratings.
Involves several classes of tasks: anomaly detection, association rule learning, classification, regression, summarization, clustering.

18 Data mining process

19 Data instances

20 Data instances (continued)
Example task: predict whether an individual who visits an online bookseller will buy a specific book. [Figure contrasting a labeled example with an unlabeled example]

21 Talk outline: Node centrality (Degree, Eigenvector, Closeness, Betweenness); Data mining & machine learning concepts; Classification

22 Machine Learning
Herbert Alexander Simon: “Learning is any process by which a system improves performance from experience.” “Machine Learning is concerned with computer programs that automatically improve their performance through experience.”
Herbert Simon: Turing Award 1975, Nobel Prize in Economics 1978

23 Learning = Improving with experience at some task Improve over task, T With respect to performance measure, P Based on experience, E Machine Learning Herbert Simon Turing Award 1975 Nobel Prize in Economics 1978

24 Machine Learning Applications?

25 Supervised Learning Algorithm Classification (class attribute is discrete) Assign data into predefined classes Spam Detection, fraudulent credit card detection Regression (class attribute takes real values) Predict a real value for a given data instance Predict the price for a given house Unsupervised Learning Algorithm Group similar items together into some clusters Detect communities in a given social network Categories of ML algorithms

26 Supervised learning process
We are given a set of labeled examples. These examples are records/instances of the form (x, y), where x is a feature vector and y is the class attribute, commonly a scalar. The supervised learning task is to build a model that maps x to y (find a mapping m such that m(x) = y). Given an unlabeled instance (x', ?), we compute m(x'), e.g., a fraud/non-fraud prediction.
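A minimal sketch of this fit-then-predict process, assuming scikit-learn; the data is synthetic and the choice of classifier is illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Labeled examples (x, y): feature vectors with a class attribute.
    X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)   # learn a mapping m(x) = y

    X_new = X_train[:3]            # stand-in for unlabeled instances (x', ?)
    print(model.predict(X_new))    # predicted class attribute for each x'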

27 Decision tree learning - an example
Training data (categorical and integer features, binary class label “Cheat”):

    Tid  Refund  Marital status  Taxable income  Cheat
     1   Yes     Single          125K            No
     2   No      Married         100K            No
     3   No      Single           70K            No
     4   Yes     Married         120K            No
     5   No      Divorced         95K            Yes
     6   No      Married          60K            No
     7   Yes     Divorced        220K            No
     8   No      Single           85K            Yes
     9   No      Married          75K            No
    10   No      Single           90K            Yes

Learned decision tree (splitting attributes at internal nodes, class labels at leaves):
Refund = Yes → NO; Refund = No → MarSt: Married → NO; Single or Divorced → TaxInc: < 80K → NO, > 80K → YES.
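A minimal sketch of fitting a decision tree on this toy table, assuming pandas and scikit-learn; sklearn requires the categorical features to be encoded first, and its binary splits need not reproduce the slide's tree exactly.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = pd.DataFrame({
        "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
        "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
                   "Married", "Divorced", "Single", "Married", "Single"],
        "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
        "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    })

    X = pd.get_dummies(data[["Refund", "MarSt", "TaxInc"]])  # one-hot encode categoricals
    y = data["Cheat"]

    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))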

28 Decision tree construction
Decision trees are constructed recursively from training data using a top-down greedy approach in which features are selected one at a time. After a feature is selected for a node, a branch is created for each of its values and the training set is partitioned into subsets, one per branch; the process then continues recursively on these subsets at the child nodes. When selecting features, we prefer those that partition the instances into purer subsets; a pure subset is one in which all instances have the same class attribute value.

29 Purity is measured by entropy
Features are selected based on set purity; to measure purity we can use (and minimize) entropy. For a subset of training instances T with a binary class attribute (values in {+, -}), the entropy of T is defined as
entropy(T) = -p+ · log2(p+) - p- · log2(p-),
where p+ is the proportion of positive examples in T and p- is the proportion of negative examples in T.

30 Entropy example
Assume a subset T containing 10 instances, seven with a positive class attribute value and three with a negative one [7+, 3-]. The entropy of T is
entropy(T) = -0.7 · log2(0.7) - 0.3 · log2(0.3) ≈ 0.881.
What is the range of entropy values? [0, 1]: 0 for a pure subset, 1 for a perfectly balanced one.
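A minimal sketch of this binary entropy computation in Python.

    from math import log2

    def entropy(p_pos):
        # Entropy of a subset with fraction p_pos of positive examples.
        p_neg = 1.0 - p_pos
        result = 0.0
        for p in (p_pos, p_neg):
            if p > 0:               # by convention 0 * log2(0) = 0
                result -= p * log2(p)
        return result

    print(entropy(0.7))   # [7+, 3-]  -> ~0.881
    print(entropy(1.0))   # pure      -> 0.0
    print(entropy(0.5))   # balanced  -> 1.0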

31 Information gain (IG)
We select the feature that is most useful for separating the classes, as measured by IG. IG is the difference between the entropy of the parent node and the weighted average entropy of the child nodes (each child weighted by its fraction of the parent's instances). We select the feature that maximizes IG.

32 Information gain calculation example
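The slide's worked example is not in the transcript; below is a minimal sketch of an information-gain calculation for one candidate categorical split, with made-up labels and feature values.

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, feature_values):
        # IG = entropy(parent) - weighted average entropy of the children.
        n = len(labels)
        children = {}
        for value, label in zip(feature_values, labels):
            children.setdefault(value, []).append(label)
        weighted = sum(len(child) / n * entropy(child) for child in children.values())
        return entropy(labels) - weighted

    labels  = ["+", "+", "+", "-", "-", "-"]
    feature = ["a", "a", "b", "b", "b", "b"]   # hypothetical feature values
    print(information_gain(labels, feature))   # ~0.46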

33–51 Decision tree construction: example
Slide 33 shows the ten-record training table from slide 27 (Tid, Refund, Marital status, Taxable income, Cheat); slides 34–51 then build the decision tree on it step by step:
Slides 34–37: Refund is selected as the splitting attribute; the Refund = Yes branch becomes a NO leaf.
Slides 38–41: the Refund = No branch is split on MarSt; the Married branch becomes a NO leaf.
Slides 42–47: the Single/Divorced branch is split on TaxInc; the > 80K branch becomes a YES leaf.
Slides 48–51: the < 80K branch becomes a NO leaf, completing the tree from slide 27.

52 Classification quality metrics
Binary classification. (Instances, class labels): (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), with y_i ∈ {1, -1}.
A classifier provides a class prediction Ŷ for an instance. Possible outcomes of a prediction:

                      True class = 1       True class = -1
    Predicted Ŷ = 1   True positive (TP)   False positive (FP)
    Predicted Ŷ = -1  False negative (FN)  True negative (TN)

53 Classification quality metrics (cont'd)
P(Ŷ = Y): accuracy = (TP + TN) / (TP + FP + FN + TN)
P(Ŷ = 1 | Y = 1): true positive rate / recall / sensitivity = TP / (TP + FN)
P(Ŷ = 1 | Y = -1): false positive rate = FP / (FP + TN)
P(Y = 1 | Ŷ = 1): precision = TP / (TP + FP)
(TP, FP, FN, TN as in the confusion matrix on the previous slide.)
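A minimal sketch of these metrics computed from raw outcome counts; the numbers are made up for illustration.

    TP, FP, FN, TN = 40, 10, 5, 45

    accuracy  = (TP + TN) / (TP + FP + FN + TN)   # P(Y_hat = Y)
    recall    = TP / (TP + FN)                    # true positive rate / sensitivity
    fpr       = FP / (FP + TN)                    # false positive rate
    precision = TP / (TP + FP)                    # P(Y = 1 | Y_hat = 1)

    print(accuracy, recall, fpr, precision)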

54 Classification quality metrics: example
Consider a diagnostic test for a disease. The test has 2 possible outcomes: ‘positive’, suggesting the presence of the disease, and ‘negative’, suggesting its absence. An individual can test either positive or negative for the disease.

55 Classification quality metrics: example
[Figure: overlapping distributions of test results for individuals with the disease and individuals without the disease]

56–60 Machine Learning: Classification
[Figure sequence: a decision threshold on the test-result axis splits patients into those called “positive” and those called “negative”; successive slides highlight the true positives, false positives, true negatives, and false negatives among individuals with and without the disease.]

61 Machine Learning: Cross-Validation
What if we don’t have enough data to set aside a test dataset? Cross-validation: each data point is used both as training data and as test data. Basic idea: fit the model on 90% of the data and test it on the other 10%; then do this on a different 90/10 split, and cycle through all 10 cases. 10 “folds” is a common rule of thumb.

62 Machine Learning: Cross-Validation
Divide the data into 10 equal pieces P_1 ... P_10. Fit 10 models, each on 90% of the data. Each data point is treated as an out-of-sample data point by exactly one of the models.
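A minimal sketch of 10-fold cross-validation, assuming scikit-learn; the dataset is synthetic and the choice of classifier is illustrative.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # cv=10: each point is held out by exactly one of the 10 fitted models.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    print(scores.mean())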

