Advanced Topics in On-line Social Networks Analysis
Social networks analysis seminar, second introductory lecture
Danny Hendler
Presentation prepared by Yehonatan Cohen and Danny Hendler. Some of the slides are based on the online book "Social Media Mining".
Talk outline
o Node centrality
  - Degree
  - Eigenvector
  - Closeness
  - Betweenness
o Data mining & machine learning concepts
  - Classification
Node centrality
Name the most central/significant node:
[Figure: an example network on nodes 1-13]
Node centrality (continued)
Name it now!
[Figure: another example network on nodes 1-13]
Node centrality: Applications
o Detection of the most popular actors in a network (advertising)
o Identification of "super spreader" nodes (health care / epidemics)
o Identification of vulnerabilities in network structure (network design)
o ...
Node centrality (continued)
What makes a node central?
o Number of connections
o Its removal disconnects the graph
o High number of shortest paths passing through the node
o Proximity to all other nodes
o A central node is one whose neighbors are central
o ...
Degree centrality
Degree centrality is the number of a node's neighbours:
$C_d(v_i) = d_i = \sum_j A_{ij}$
where $A$ is the graph's adjacency matrix.
Alternative definitions are possible:
o Take into account connection strengths
o Take into account connection directions
o ...
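A minimal sketch of this computation in plain Python (the example graph is hypothetical):

    # Degree centrality: sum each row of a symmetric adjacency matrix
    # to count the node's neighbours.
    def degree_centrality(A):
        return [sum(row) for row in A]

    # A small hypothetical undirected graph on 4 nodes.
    A = [[0, 1, 1, 0],
         [1, 0, 1, 1],
         [1, 1, 0, 0],
         [0, 1, 0, 0]]
    print(degree_centrality(A))  # [2, 3, 2, 1]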
Degree centrality: an example
[Figure: the 13-node example network]
Node  Degree
4     4
6     3
7     3
8     3
9     3
10    3
11    2
12    2
Eigenvector centrality
Not all neighbours are equal: popular ones (with high degree) should weigh more!
The eigenvector centrality of node $v_i$ is
$c_e(v_i) = \frac{1}{\lambda} \sum_j A_{ij} \, c_e(v_j)$
where $A$ is the adjacency matrix ($A_{ij} = 1$ iff $v_i$ and $v_j$ are connected) and $\lambda$ is a constant. In matrix form, $\lambda \mathbf{c} = A \mathbf{c}$, so $\mathbf{c}$ is an eigenvector of $A$. Choosing the maximum eigenvalue guarantees all centrality values are positive.
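A sketch of the standard power-iteration approach to computing this (plain Python; the fixed iteration budget is an assumption, a real implementation would test for convergence):

    # Power iteration: repeatedly multiply by the adjacency matrix and
    # renormalize; for a connected graph this converges to the
    # eigenvector associated with the largest eigenvalue.
    def eigenvector_centrality(A, iterations=100):
        n = len(A)
        c = [1.0] * n
        for _ in range(iterations):
            c_new = [sum(A[i][j] * c[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in c_new) ** 0.5
            c = [x / norm for x in c_new]
        return c

    A = [[0, 1, 1, 0],    # the same hypothetical 4-node graph as above
         [1, 0, 1, 1],
         [1, 1, 0, 0],
         [0, 1, 0, 0]]
    print(eigenvector_centrality(A))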
Eigenvector centrality: an example
Closeness centrality
If a node is central, it can reach other nodes "quickly", i.e., it has a small average shortest-path length:
$C_c(v_i) = \frac{1}{\bar{\ell}_{v_i}}$, where $\bar{\ell}_{v_i} = \frac{1}{n-1} \sum_{v_j \ne v_i} \ell_{i,j}$
is the average length of the shortest paths from $v_i$ to all other nodes.
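A sketch for unweighted graphs using breadth-first search (plain Python; the adjacency-list format is an assumption):

    # Closeness centrality: BFS shortest-path lengths from the source,
    # then the inverse of their average.
    from collections import deque

    def closeness_centrality(adj, source):
        # adj: dict mapping each node to a list of its neighbours
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        lengths = [d for node, d in dist.items() if node != source]
        return len(lengths) / sum(lengths) if lengths else 0.0

    adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}  # a hypothetical path graph
    print(closeness_centrality(adj, 2))           # 3 / (1 + 1 + 2) = 0.75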
Closeness centrality: an example
[Figure: the 13-node example network]
Node  Closeness
4     0.353
6     0.438
7     0.444
8     0.4
9     0.428
10    0.342
11
12
Betweenness centrality
A node is central if many shortest paths pass through it:
$C_b(v_i) = \sum_{s \ne t \ne v_i} \frac{\sigma_{st}(v_i)}{\sigma_{st}}$
where $\sigma_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(v_i)$ is the number of those paths that pass through $v_i$.
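As an illustrative sketch, networkx (an assumed dependency) computes this definition directly on a small hypothetical graph:

    # Betweenness centrality of every node, left unnormalized so the
    # raw shortest-path counts are visible.
    import networkx as nx

    G = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4), (4, 5)])
    print(nx.betweenness_centrality(G, normalized=False))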
Betweenness centrality: an example
[Figure: the 13-node example network]
Node  Betweenness
4     30
6     39
7     36
8     21.5
9     7.5
10    20.5
11
12
Talk outline
o Node centrality
  - Degree
  - Eigenvector
  - Closeness
  - Betweenness
o Data mining & machine learning concepts
  - Classification
Big Data
The data production rate has dramatically increased:
o Social media data, mobile phone data, healthcare data, purchase data, ...
[Image taken from "Data Science and Prediction", CACM, December 2013]
Data mining / Knowledge Discovery in Databases (KDD)
Infer actionable knowledge/insights from data:
o When men buy diapers on Fridays, they also buy beer
o Email spamming accounts tend to cluster in communities
o Both love & hate drive reality ratings
Involves several classes of tasks:
o Anomaly detection
o Association rule learning
o Classification
o Regression
o Summarization
o Clustering
Data mining process
Data instances
Data instances (continued)
Predict whether an individual who visits an online book seller will buy a specific book.
[Figure: a labeled example and an unlabeled example]
Talk outline
o Node centrality
  - Degree
  - Eigenvector
  - Closeness
  - Betweenness
o Data mining & machine learning concepts
  - Classification
Machine Learning
Herbert Alexander Simon (Turing Award 1975, Nobel Prize in Economics 1978):
"Learning is any process by which a system improves performance from experience."
"Machine Learning is concerned with computer programs that automatically improve their performance through experience."
Machine Learning (continued)
Learning = improving with experience at some task:
o improve over task T,
o with respect to performance measure P,
o based on experience E.
Machine Learning Applications?
Categories of ML algorithms
Supervised learning algorithms
o Classification (class attribute is discrete): assign data into predefined classes, e.g., spam detection, fraudulent credit-card detection
o Regression (class attribute takes real values): predict a real value for a given data instance, e.g., predict the price of a given house
Unsupervised learning algorithms
o Clustering: group similar items together into clusters, e.g., detect communities in a given social network
Supervised learning process
o We are given a set of labeled examples: records/instances of the form (x, y), where x is a feature vector and y is the class attribute, commonly a scalar.
o The supervised learning task is to build a model that maps x to y (find a mapping m such that m(x) = y).
o Given an unlabeled instance (x', ?), we compute m(x'), e.g., fraud/non-fraud prediction.
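A minimal sketch of this process with scikit-learn (an assumed dependency; the features and labels are hypothetical):

    # Supervised learning: fit a model m on labeled (x, y) pairs,
    # then apply it to an unlabeled instance x'.
    from sklearn.tree import DecisionTreeClassifier

    X_train = [[1, 125], [0, 100], [0, 70], [1, 120]]  # feature vectors x
    y_train = [0, 0, 1, 0]                             # class attribute y
    model = DecisionTreeClassifier().fit(X_train, y_train)
    print(model.predict([[0, 80]]))                    # m(x') for an unlabeled x'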
Decision tree learning - an example
Training data (Refund and Marital status are categorical, Taxable income is an integer, Cheat is the class attribute):

Tid  Refund  Marital status  Taxable income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

The learnt decision tree (splitting attributes at internal nodes, class labels at leaves):
Refund?
o Yes -> No
o No -> MarSt?
  - Married -> No
  - Single, Divorced -> TaxInc?
    - < 80K -> No
    - > 80K -> Yes
Decision tree construction
o Decision trees are constructed recursively from training data, using a top-down greedy approach in which features are sequentially selected.
o After a feature is selected for a node, a branch is created for each of its values and the training set is partitioned into subsets accordingly, each subset falling under the respective branch. The process then continues recursively for these subsets at the child nodes.
o When selecting features, we prefer those that partition the set of instances into subsets that are more pure. A pure subset is one in which all instances have the same class attribute value.
Purity is measured by entropy
Features are selected based on set purity. To measure purity we can use (and minimize) entropy. Over a subset of training instances $T$ with a binary class attribute (values in {+, -}), the entropy of $T$ is defined as
$\mathrm{entropy}(T) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
where $p_+$ is the proportion of positive examples in $T$ and $p_-$ is the proportion of negative examples in $T$.
Entropy example
Assume a subset $T$ containing 10 instances, where seven have a positive class attribute value and three have a negative one [7+, 3-]. The entropy of $T$ is
$\mathrm{entropy}(T) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 \approx 0.881$
What is the range of entropy values? [0, 1]: 0 for a pure subset, 1 for a balanced one.
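A small sketch that checks this computation (plain Python):

    # Entropy of a binary class distribution; entropy(0.7) reproduces
    # the [7+, 3-] example (~0.881).
    from math import log2

    def entropy(p_pos):
        result = 0.0
        for p in (p_pos, 1.0 - p_pos):
            if p > 0:              # 0 * log2(0) is taken to be 0
                result -= p * log2(p)
        return result

    print(entropy(0.7))   # ~0.8813
    print(entropy(1.0))   # 0.0 (pure)
    print(entropy(0.5))   # 1.0 (balanced)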
Information gain (IG)
o We select the feature that is most useful in separating between the classes to be learnt, based on IG.
o IG is the difference between the entropy of the parent node and the weighted average entropy of the child nodes.
o We select the feature that maximizes IG.
Information gain calculation example
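A sketch of such a calculation on hypothetical counts (plain Python): a parent subset [7+, 3-] is split by a binary feature into children [6+, 1-] and [1+, 2-].

    # Information gain: parent entropy minus the size-weighted average
    # of the children's entropies.
    from math import log2

    def entropy(p):  # binary entropy, as in the previous sketch
        return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

    def information_gain(parent, children):
        # parent: (n_pos, n_neg); children: list of (n_pos, n_neg) pairs
        total = sum(parent)
        child_term = sum((p + n) / total * entropy(p / (p + n))
                         for p, n in children)
        return entropy(parent[0] / total) - child_term

    print(information_gain((7, 3), [(6, 1), (1, 2)]))  # ~0.19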
Decision tree construction: example
Applying this procedure to the training data above: Refund is selected as the first splitting attribute (all Refund = Yes instances have Cheat = No); for the Refund = No branch, MarSt is selected next (all Married instances have Cheat = No); finally, for the Single/Divorced branch, TaxInc is selected, separating incomes below 80K (Cheat = No) from those above (Cheat = Yes). Step by step, this grows the tree shown earlier:
Refund?
o Yes -> NO
o No -> MarSt?
  - Married -> NO
  - Single, Divorced -> TaxInc?
    - < 80K -> NO
    - > 80K -> YES
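A compact sketch of this greedy recursive construction in plain Python (the record format and field names are hypothetical; it selects the split minimizing the weighted child entropy, which is equivalent to maximizing IG since the parent entropy is the same for every candidate feature):

    from math import log2

    def label_entropy(labels):
        # Entropy of a list of class labels (handles any number of classes).
        n = len(labels)
        return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                    for v in set(labels))

    def build_tree(rows, features):
        labels = [r["label"] for r in rows]
        if len(set(labels)) == 1 or not features:
            return max(set(labels), key=labels.count)     # leaf: majority class
        def weighted_child_entropy(f):
            groups = {}
            for r in rows:
                groups.setdefault(r[f], []).append(r["label"])
            return sum(len(g) / len(rows) * label_entropy(g)
                       for g in groups.values())
        best = min(features, key=weighted_child_entropy)  # feature with max IG
        rest = [f for f in features if f != best]
        branches = {}
        for r in rows:
            branches.setdefault(r[best], []).append(r)
        return {best: {v: build_tree(sub, rest) for v, sub in branches.items()}}

    rows = [{"Refund": "Yes", "MarSt": "Single",  "label": "No"},
            {"Refund": "No",  "MarSt": "Married", "label": "No"},
            {"Refund": "No",  "MarSt": "Single",  "label": "Yes"}]
    print(build_tree(rows, ["Refund", "MarSt"]))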
Classification quality metrics
Binary classification: (instances, class labels) (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where each y_i is {1, -1}-valued.
A classifier provides a class prediction Ŷ for an instance. Possible outcomes for a prediction:

                     True class 1          True class -1
Predicted class 1    True positive (TP)    False positive (FP)
Predicted class -1   False negative (FN)   True negative (TN)
Classification quality metrics (cont'd)
o P(Ŷ = Y): accuracy ((TP+TN) / (TP+TN+FP+FN))
o P(Ŷ = 1 | Y = 1): true positive rate / recall / sensitivity (TP / (TP+FN))
o P(Ŷ = 1 | Y = -1): false positive rate (FP / (FP+TN))
o P(Y = 1 | Ŷ = 1): precision (TP / (TP+FP))
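A sketch computing these metrics from label vectors (plain Python; the labels are hypothetical):

    # Confusion-matrix counts and the four metrics above, for
    # {1, -1}-valued true and predicted labels.
    def classification_metrics(y_true, y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == -1 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == -1)
        tn = sum(1 for t, p in pairs if t == -1 and p == -1)
        return {"accuracy":  (tp + tn) / len(pairs),
                "recall":    tp / (tp + fn) if tp + fn else 0.0,
                "fpr":       fp / (fp + tn) if fp + tn else 0.0,
                "precision": tp / (tp + fp) if tp + fp else 0.0}

    print(classification_metrics([1, 1, -1, -1, 1], [1, -1, -1, 1, 1]))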
Classification quality metrics: example
Consider a diagnostic test for a disease. The test has two possible outcomes:
o 'positive', suggesting presence of the disease
o 'negative'
An individual can test either positive or negative for the disease.
Classification quality metrics: example (continued)
[Figure: distributions of test results for individuals with the disease and individuals without the disease]
Machine Learning: Classification
[Figure: a decision threshold on the test result. Patients above the threshold are called "positive", patients below it are called "negative". Against the two distributions this yields four regions: true positives (with the disease, called positive), false positives (without the disease, called positive), true negatives (without the disease, called negative), and false negatives (with the disease, called negative).]
Machine Learning: Cross-Validation
What if we don't have enough data to set aside a test dataset?
Cross-validation: each data point is used both as training and as test data. Basic idea:
o Fit the model on 90% of the data; test it on the other 10%.
o Now do this on a different 90/10 split.
o Cycle through all 10 cases.
10 "folds" is a common rule of thumb.
Machine Learning: Cross-Validation (continued)
o Divide the data into 10 equal pieces P_1, ..., P_10.
o Fit 10 models, each on 90% of the data.
o Each data point is treated as an out-of-sample data point by exactly one of the models.
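A sketch of this scheme in plain Python (fit and predict are hypothetical user-supplied callables; assumes at least k data points):

    # k-fold cross-validation: every point is out-of-sample for exactly
    # one of the k models; returns the average per-fold accuracy.
    def cross_validate(X, y, fit, predict, k=10):
        scores = []
        for fold in range(k):
            train = [i for i in range(len(X)) if i % k != fold]
            test = [i for i in range(len(X)) if i % k == fold]
            model = fit([X[i] for i in train], [y[i] for i in train])
            correct = sum(predict(model, X[i]) == y[i] for i in test)
            scores.append(correct / len(test))
        return sum(scores) / k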