1
MA4404 Winter 2018 Introduction to Machine Learning
Dr. Ruriko Yoshida
2
Agenda
Background/Motivation: Application of Network Analysis; How Network Analysis Is Done; The Problem of Incomplete Information; Network Classification
Methodology: Graph Statistics Generation & Analysis; Machine Learning
Preliminary Results
Conclusion/Further Work
3
Network Analysis: ISIS, Barabasi-Albert, Erdos-Renyi, Small World
4
Application of Network Analysis
Source: The Daily Mail
5
However, there is the problem of incomplete information
Network analysis is done in three steps: (1) mapping the network, (2) measuring network metrics, and (3) classifying the network. This graphic illustrates the process. Starting from one initial subject of interest (e.g. a terror suspect), a query of all of his relationships (links) is made. Next, in an entity-resolution step, any newly added nodes that are duplicates are collapsed into one. Finally, a further expansion is made, in which the relationships (links) of the new nodes are investigated. A large proportion of these leads, however, remain unexplored, which is the problem of incomplete information.
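A minimal sketch of this query-collapse-expand cycle, using R's igraph package (named later in the deck). The graph `g`, the seed vertex, and all parameters are illustrative assumptions, not data from the study.

```r
library(igraph)

# Hypothetical known network and a seed suspect of interest (illustrative only)
set.seed(1)
g    <- sample_gnp(n = 50, p = 0.08)  # stand-in for the full, partly unobserved network
seed <- 1

# Step 1 (map): query the seed's direct relationships -- the 1-hop ego network
first_hop <- make_ego_graph(g, order = 1, nodes = seed)[[1]]

# Step 2 (collapse): entity resolution would merge duplicate identities here;
# igraph's contract() can merge vertices once duplicates have been identified

# Step 3 (expand): query the relationships of the newly found nodes -- the 2-hop ego network
second_hop <- make_ego_graph(g, order = 2, nodes = seed)[[1]]

vcount(first_hop)   # nodes uncovered after the first query
vcount(second_hop)  # nodes uncovered after the expansion
```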
6
The problem of incomplete information
“Criminal network data is also inevitably incomplete; i.e. some existent links or nodes will be unobserved or unrecorded. Little research has been done on the effects of incomplete information on apparent structure.” (Sparrow 1991) A large proportion of leads are unexplored due to analysts’ limited ability to process all the available data (Huddleston et al. 2004). Criminals tend to make a concerted effort to keep a low profile to avoid detection (Hopkins 2010). For criminal networks, the determination of node centrality (the most important person in a network) may reflect who there is most information on, rather than who is structurally the most important person (Sparrow 1991).
7
Network Classification
Erdos-Renyi network: random network.
Small World network: “6 degrees of separation”, clusters joined by weak ties; vulnerable to attacks on key clusters and ties (e.g. Al Qaeda).
Barabasi-Albert (scale-free) network: some nodes with high degree and some with low degree; vulnerable to targeted attacks on key nodes with many connections.
(Network diagrams: Schulllz, own work, CC BY-SA 3.0; HeMath, own work, CC BY-SA 4.0, via Wikimedia Commons)
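As an illustration of the three candidate models, a minimal sketch with R's igraph generators; the size and generator parameters below are assumptions, not the values used in the study.

```r
library(igraph)
set.seed(42)
n <- 100  # assumed number of nodes

# Erdos-Renyi: every pair of nodes is connected independently with probability p
er <- sample_gnp(n, p = 0.1)

# Small World (Watts-Strogatz): ring lattice with neighbourhood size nei, rewired with probability p
sw <- sample_smallworld(dim = 1, size = n, nei = 5, p = 0.1)

# Barabasi-Albert (scale-free): preferential attachment produces a few high-degree hubs
ba <- sample_pa(n, power = 1, m = 5, directed = FALSE)

# The degree distributions make the structural differences visible
summary(degree(er)); summary(degree(sw)); summary(degree(ba))
```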
8
Simulating hidden edges & vertices
(Figure: example removals at 30% edge removal and 27% vertex removal.) As part of my descriptive analysis, I removed nodes and edges to simulate the hidden nodes and edges, and stored the graph statistics at each step of the removals. As the diagram shows, for an initial network with 20 edges, removing 15% of the edges (i.e. 3 edges) leaves a final graph with 17 edges. I used R’s igraph package. A key thing to note is that edge removals delete only the chosen edges, whereas vertex removals delete the chosen vertices and all the edges connected to them.
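The deck says this was done with R's igraph package; the sketch below shows one way the removal loop could look. The starting graph, step size, and the statistics recorded are assumptions for illustration.

```r
library(igraph)
set.seed(7)

# Remove a proportion of edges or vertices and report what is left.
# delete_edges() removes only the chosen edges; delete_vertices() also removes
# every edge incident to the removed vertices.
remove_fraction <- function(g, type = c("edge", "vertex"), prop) {
  type <- match.arg(type)
  if (type == "edge") {
    g <- delete_edges(g, sample(ecount(g), round(prop * ecount(g))))
  } else {
    g <- delete_vertices(g, sample(vcount(g), round(prop * vcount(g))))
  }
  c(vertices = vcount(g), edges = ecount(g), mean_distance = mean_distance(g))
}

g <- sample_smallworld(dim = 1, size = 100, nei = 5, p = 0.1)

# Store graph statistics at each step of the removals (10%, 20%, ..., 90% loss)
props <- seq(0.1, 0.9, by = 0.1)
edge_loss   <- sapply(props, function(p) remove_fraction(g, "edge",   p))
vertex_loss <- sapply(props, function(p) remove_fraction(g, "vertex", p))
```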
9
Confusion? Small World Network or Erdos-Renyi Network?
A Small World network with 90% of its edges removed looks like an Erdos-Renyi network, and a Barabasi-Albert network with 70% of its vertices removed looks like a Small World network. (Figure panels: SW network with 90% of information lost on edges vs. an ER network; BA network with 70% of information lost on vertices vs. a SW network.) When we remove nodes and edges, the graphs start to resemble each other, and not just in appearance: the values of their statistics start to converge as well.
10
Scenarios for Confusion
(Figure: effect of increasing the probability of connection between vertices for SW (O) and ER (T); axes are a distance measure vs. the proportion of edge removals.) There are a couple of reasons for the confusion. The first, which drives the confusion between SW and ER graphs, is that as p, the starting probability of connection of a graph, changes, the SW and ER graphs start to look more like each other. The same does not apply to BA and SW graphs. Parameters were varied so that the scenarios in which network types can be confused could be explored.
11
Descriptive Analysis
List of statistics: mean distance; edge density; transitivity; assortativity; Kullback-Leibler statistic against the theoretical ER, SW, and BA degree distributions; Hellinger statistic against the theoretical ER, SW, and BA degree distributions.
Graph generation design: graph type (ER, SW, BA); size of starting graph (100, 500, 1000 nodes); probability of connection (p = 0.1, 0.2, 0.3, 0.4, 0.5).
Deletion design: graph type (ER, SW, BA); deletion type (edge, vertex); 10 graphs per setting, with a number of iterations per graph over a number of removal steps.
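A sketch of how the first four statistics on this list can be collected per graph with igraph; the generator calls and sizes are illustrative assumptions, the statistic functions are igraph's own.

```r
library(igraph)

# One row of descriptive statistics per graph
graph_stats <- function(g) {
  c(mean_distance = mean_distance(g),
    edge_density  = edge_density(g),
    transitivity  = transitivity(g, type = "global"),
    assortativity = assortativity_degree(g))
}

set.seed(1)
graphs <- list(ER = sample_gnp(100, p = 0.1),
               SW = sample_smallworld(dim = 1, size = 100, nei = 5, p = 0.1),
               BA = sample_pa(100, m = 5, directed = FALSE))

t(sapply(graphs, graph_stats))  # rows: ER, SW, BA; columns: the four statistics
```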
12
Methodology: Graph Statistics
Mean Distance: a low mean distance is indicative of an efficient network.
Transitivity: a high clustering coefficient indicates that the mechanism for recruiting new members is through a mutual friend, or transitive linking (Friemel 2011).
Assortativity: indicates a preference for the nodes in a graph to attach to other nodes that are similar to them. “In positively assortative networks, high-degree nodes tend to cluster together as core groups, a phenomenon evident in the GSJ network in which bin Laden and his closest cohorts form the core of the network and issue commands to other parts of the network.” (Xu 2008, 61)
Edge Density / Probability of Connection: a high link density means that the network is not easily fragmented.
Kullback-Leibler Divergence and Hellinger Distance: measures of how far the observed degree distribution is from the theoretical ER, SW, or BA degree distribution.
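The slide names Kullback-Leibler divergence and Hellinger distance against theoretical degree distributions but gives no formulas. A hedged sketch follows, assuming the ER reference is a Poisson degree distribution with mean equal to the observed mean degree; the guard against zero probabilities is a simplification of whatever handling the study used.

```r
library(igraph)

# Discrete KL divergence and Hellinger distance between two probability vectors
kl_div <- function(p, q) {
  keep <- p > 0 & q > 0  # simple guard against log(0); a smoothing scheme could be used instead
  sum(p[keep] * log(p[keep] / q[keep]))
}
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)

set.seed(3)
g   <- sample_gnp(100, p = 0.1)
obs <- degree_distribution(g)               # observed P(degree = 0, 1, 2, ...)
ks  <- seq_along(obs) - 1
ref <- dpois(ks, lambda = mean(degree(g)))  # assumed ER reference: Poisson degree distribution
ref <- ref / sum(ref)                       # renormalise over the truncated support

kl_div(obs, ref)
hellinger(obs, ref)
```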
13
Model on Training Dataset
Preliminary Results – Predictive Analysis. (Decision tree splits shown in the figure: p < 0.061, mean distance < 2.1, mean distance < 8.3.) As can be seen in this diagram, the confusion arises because, with information loss, the BA and SW graphs look like each other, and the ER and SW graphs look like each other; this confirms the findings of my descriptive analysis. In particular, with a high starting probability of connection (edge density > 0.061), ER graphs are correctly classified 89% of the time, but 9% of the graphs in that node are actually SW graphs. The differentiating factor is the mean distance threshold of 2.1, with which they can be correctly classified as ER graphs 97% of the time and as SW graphs 79% of the time. SW and BA graphs can also be confused with each other when the starting probability of connection is low (below 0.061); there the differentiating factors are the Hellinger distance statistic for SW graphs and the mean distance threshold of 8.3. In terms of variable importance, the mean distance, transitivity, the Hellinger statistic against the theoretical ER degree distribution, and the edge density/probability of connection between vertices are the most important variables in classifying the graph. The probability of connection between vertices (p) determines the scenario; the mean distance determines the classification of the network.
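A minimal sketch of how such a classification tree could be fit in R with rpart. Here `graph_data` is a hypothetical data frame with one row of statistics per simulated graph and a factor label `type` (ER/SW/BA); the column names are assumptions.

```r
library(rpart)

# graph_data: hypothetical data frame, one row per simulated graph, with a
# factor column `type` (ER/SW/BA) and columns of graph statistics.
fit <- rpart(type ~ mean_distance + edge_density + transitivity +
               assortativity + hellinger_er,
             data = graph_data, method = "class")

printcp(fit)              # complexity/accuracy table for the fitted tree
fit$variable.importance   # which statistics drive the splits
```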
14
Preliminary Results – Predictive Analysis
Correct Classification Rate on Test Sets
                 DOE (10%-80% loss of information)   80% loss of information   90% loss of information
CART             96%                                  79.74%                    56.57%
Random Forest    99.14%                               91.025%                   72.63%

Classification error by class: BA 1.59%, ER 5.56%, SW 2.36%.

CART and RF only require 20% of the information to be ~80% accurate in the classification of graphs.
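A comparable sketch for the random forest, again on the hypothetical `graph_data` frame introduced above, split into training and test sets with an assumed 70/30 ratio.

```r
library(randomForest)

set.seed(11)
idx   <- sample(nrow(graph_data), 0.7 * nrow(graph_data))  # hypothetical 70/30 split
train <- graph_data[idx, ]
test  <- graph_data[-idx, ]

rf <- randomForest(type ~ ., data = train, ntree = 500, importance = TRUE)

pred <- predict(rf, newdata = test)
mean(pred == test$type)                      # correct classification rate on the test set
table(predicted = pred, actual = test$type)  # confusion by class
rf$confusion                                 # per-class out-of-bag classification error
```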
15
Conclusion/Research Contributions
Mean distance is the most important variable in the classification of networks, and there are conditions in which it is reasonable to assert the true and accurate classification of a network. In conclusion, I want to round up by giving the answers to my three research questions. From my research, I found that edge density gave the highest predictive power in determining the network type. Changing the probability of connection between vertices, the proportion of information loss, and the type of information loss all had effects on the ability to classify a graph correctly, and I established a framework with which any network statistic can be evaluated for its ability to classify graphs. In this context, hidden edges (compared to hidden vertices) can lead to a mis-estimation of the probability of connection between vertices, p, making the graph look vastly different and possibly leading to a misclassification of the graph. A framework was developed with which the utility of any statistic in classifying a network can be evaluated.
16
Conclusion / Future Work
Evaluation of node centrality measures.
Evaluation of statistics as nodes and edges are progressively added, as per the Query-Collapse-Expand model.
Evaluation of statistics in directed graphs.
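For the node-centrality follow-on, igraph already exposes the standard measures; a brief sketch of what could be tabulated and then re-evaluated under progressive information loss. The generator and its parameters are illustrative assumptions.

```r
library(igraph)

set.seed(5)
g <- sample_pa(100, m = 3, directed = FALSE)  # illustrative scale-free network

centralities <- data.frame(
  degree      = degree(g),
  betweenness = betweenness(g),
  closeness   = closeness(g),
  eigenvector = eigen_centrality(g)$vector
)
head(centralities[order(-centralities$betweenness), ])  # the structurally most central nodes
```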