Visualizing Truly Large Datasets Nashville Analytics Summit 2015 Philip Best, Director Analytic Products HCA.

Presentation transcript:

Visualizing Truly Large Datasets Nashville Analytics Summit 2015 Philip Best, Director Analytic Products HCA

About Me (Us) HCA Lipscomb Trevecca R Users Group

The issue with ‘BIG’ data sets Does it have to be big? Reducing the boundaries without lying. Can’t get any smaller! Series versus batch Vector versus Raster

Why Python? R Julia Java.NET CUDA

Python is the Golden of Languages Python is always ready to play (weakishly typed) Almost always instantly gratifying (dynamic) Python loves you unconditionally (polyglot friendly)

Case in Point – Networks (Graphs) What is a network (graph)?

Case in Point - Networks What is a network (graph)? Graph: a way of representing the relationships among a collection of objects. It consists of a set of objects, called nodes, with certain pairs of these objects connected by links called edges. A graph can be undirected or directed. Two nodes are neighbors if they are connected by an edge. The degree of a node is the number of edges ending at that node. For a directed graph, the in-degree and out-degree of a node are the numbers of edges incoming to or outgoing from the node.
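The definitions above can be checked by hand on a tiny graph. The following is a minimal sketch in plain Python; the edge list and variable names are made up for illustration and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical directed edge list: (source, target).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

# In-degree and out-degree: count edges arriving at / leaving each node.
out_degree = defaultdict(int)
in_degree = defaultdict(int)
for source, target in edges:
    out_degree[source] += 1
    in_degree[target] += 1

# Ignore direction to get neighbor sets and the plain (undirected) degree.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)
degree = {node: len(nbrs) for node, nbrs in neighbors.items()}

print(dict(in_degree), dict(out_degree), degree)
```

Node "c" receives three edges and sends none, so it has in-degree 3, out-degree 0, and undirected degree 3.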

Identifying the hubs (centrality) Ego-centric methods focus on the individual rather than on the network as a whole. By collecting information on the connections among the nodes connected to a focal ego, we can build a picture of the individual's local network.
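As a rough illustration of an ego-centric view, NetworkX offers `ego_graph`, which returns the subgraph induced on a focal node and its neighbors within a given radius. The graph below and the node name "ego" are invented for this sketch.

```python
import networkx as nx

# Hypothetical small social network.
g = nx.Graph()
g.add_edges_from([
    ("ego", "a"), ("ego", "b"),  # the ego's direct contacts
    ("a", "b"),                  # a tie between two of the contacts
    ("b", "c"), ("c", "d"),      # nodes further away from the ego
])

# Local network: the ego, its direct neighbors, and the ties among them.
ego_net = nx.ego_graph(g, "ego", radius=1)
alters = sorted(n for n in ego_net if n != "ego")

print(alters, ego_net.number_of_edges())
```

Nodes "c" and "d" fall outside radius 1, so only the ego, its two contacts, and the three ties among them remain.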

Identifying the hubs (centrality) A graph is connected if there is a path between every pair of nodes in the graph. A connected component is a subset of the nodes where: (1) a path exists between every pair in the subset; (2) the subset is not part of a larger set with the above property. In many empirical social networks a large proportion of all nodes will belong to a single giant component.
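One way to make the component definition concrete is a breadth-first search over an adjacency mapping. This is a plain-Python sketch with a made-up graph, not code from the talk; NetworkX provides its own `connected_components` for real use.

```python
from collections import deque

def connected_components(adjacency):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    component.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components

# Hypothetical undirected graph: a triangle, a pair, and an isolated node.
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}, 6: set()}

components = connected_components(adjacency)
giant = max(components, key=len)       # the largest component
is_connected = len(components) == 1    # connected means exactly one component

print(components, giant, is_connected)
```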

Identifying the hubs (centrality) The neighbourhood of a node is the set of nodes connected to it by an edge, not including itself. The clustering coefficient of a node is the fraction of pairs of its neighbors that have edges between one another. Locally, it indicates how concentrated the neighbourhood of a node is; globally, it indicates the level of clustering in a graph. The global score is the average over all nodes.
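The clustering coefficient can be computed directly from its definition. The sketch below uses plain Python and an invented four-node graph; it assumes a coefficient of 0 for nodes with fewer than two neighbors, which is a convention rather than something stated in the slides.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Fraction of pairs of `node`'s neighbors that are themselves linked."""
    nbrs = adjacency[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs means 0
    linked = sum(1 for u, v in combinations(nbrs, 2) if v in adjacency[u])
    possible = k * (k - 1) / 2
    return linked / possible

# Hypothetical undirected graph: triangle a-b-c plus a pendant node d on a.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

local = {n: local_clustering(adjacency, n) for n in adjacency}
global_score = sum(local.values()) / len(local)  # average over all nodes

print(local, global_score)
```

Node "a" has three neighbor pairs of which only (b, c) is linked, giving 1/3, while "b" and "c" each score 1.0.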

Identifying the hubs (centrality) A variety of different measures exist to measure the importance, popularity, or social capital of a node in a social network. Degree centrality focuses on individual nodes - it simply counts the number of edges that a node has. Hub nodes with high degree usually play an important role in a network. For directed networks, in-degree is often used as a proxy for popularity.
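A short sketch of degree-based measures with NetworkX follows. The graphs and node names are hypothetical; `degree_centrality` divides each degree by n - 1, the maximum possible degree in a simple graph.

```python
import networkx as nx

# Hypothetical undirected network with one obvious hub.
g = nx.Graph([("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")])

raw_degree = dict(g.degree())            # plain edge counts per node
normalized = nx.degree_centrality(g)     # degree divided by (n - 1)
top = max(raw_degree, key=raw_degree.get)

# Hypothetical directed network: in-degree as a proxy for popularity.
dg = nx.DiGraph([("a", "star"), ("b", "star"), ("star", "c")])
popularity = dict(dg.in_degree())

print(top, normalized, popularity)
```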

Identifying the hubs (centrality) The eigenvector centrality of a node is proportional to the sum of the centrality scores of its neighbors. A node is important if it is connected to other important nodes. A node with a small number of influential contacts may outrank one with a larger number of mediocre contacts. Computation: Calculate the eigendecomposition of the pairwise adjacency matrix of the graph. Select the eigenvector associated with the largest eigenvalue. Element ‘i’ in the eigenvector gives the centrality of the ‘i th’ node.
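The computation described above can be sketched with NumPy. The adjacency matrix is made up for illustration, and `numpy.linalg.eigh` is used on the assumption that the graph is undirected (symmetric matrix) and connected; rescaling the scores to sum to 1 is just one possible normalization.

```python
import numpy as np

# Hypothetical undirected graph: a is linked to b, c, d; b and c are linked.
nodes = ["a", "b", "c", "d"]
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Eigendecomposition of the symmetric adjacency matrix.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Eigenvector belonging to the largest eigenvalue.
leading = eigenvectors[:, np.argmax(eigenvalues)]

# The sign of an eigenvector is arbitrary, so take absolute values,
# then rescale so that the scores sum to 1.
centrality = np.abs(leading)
centrality = centrality / centrality.sum()

scores = dict(zip(nodes, centrality))
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Node "d" has a single contact, but that contact is the best-connected node, so its score is driven by the importance of "a" rather than by its own degree.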

Introduction to NetworkX - Python in one slide Python is an interpreted, general-purpose high-level programming language whose design philosophy emphasises code readability. “there should be one (and preferably only one) obvious way to do it”. Use of indentation for block delimiters Multiple programming paradigms: primarily object oriented and imperative, but also functional programming style. Dynamic type system, automatic memory management, late binding. Primitive types: int, float, complex, string, bool. Primitive structures: list, tuple, dictionary, set. Complex features: generators, lambda functions, list comprehension, list slicing.

Introduction to NetworkX “Python package for the creation, manipulation and study of the structure, dynamics and functions of complex networks.” Data structures for representing many types of networks, or graphs Nodes can be any (hashable) Python object, edges can contain arbitrary data Flexibility ideal for representing networks found in many different fields Easy to install on multiple platforms Online up-to-date documentation

Introduction to NetworkX Tool to study the structure and dynamics of social, biological, and infrastructure networks Ease-of-use and rapid development Open-source tool base that can easily grow in a multidisciplinary environment with non-expert users and developers An easy interface to existing code bases written in C, C++, and FORTRAN To painlessly slurp in relatively large nonstandard data sets

Introduction to NetworkX NetworkX defines no custom node objects or edge objects; it takes a node-centric view of the network. Nodes can be any hashable object, while edges are tuples with optional edge data. Any Python object is allowed as edge data, and it is assigned and stored in a Python dictionary (empty by default). NetworkX is written entirely in Python, whereas other projects combine custom compiled code with Python: Boost Graph, igraph, Graphviz. The focus is on computational network modelling, not software tool development: move fast to design new algorithms or models and get immediate results.

Introduction to NetworkX
Find the shortest path in an unweighted and a weighted network (Dijkstra's algorithm is used when edge weights are given):
>>> import networkx as nx
>>> g = nx.Graph()
>>> g.add_edge('a', 'b', weight=0.1)
>>> g.add_edge('b', 'c', weight=1.5)
>>> g.add_edge('a', 'c', weight=1.0)
>>> g.add_edge('c', 'd', weight=2.2)
>>> print(nx.shortest_path(g, 'b', 'd'))  # fewest edges
['b', 'c', 'd']
>>> print(nx.shortest_path(g, 'b', 'd', weight='weight'))  # smallest total weight
['b', 'a', 'c', 'd']

Introduction to NetworkX
>>> import networkx as nx
There are different Graph classes for undirected and directed networks. Let's create a basic Graph class:
>>> g = nx.Graph()  # empty graph
The graph g can be grown in several ways. NetworkX includes many graph generator functions and facilities to read and write graphs in many formats.

Introduction to NetworkX
# One node at a time
>>> g.add_node(1)  # method of nx.Graph
# A list of nodes
>>> g.add_nodes_from([2, 3])
# A container of nodes
>>> h = nx.path_graph(10)
>>> g.add_nodes_from(h)  # g now contains the nodes of h
# Conversely, you can remove any node of the graph
>>> g.remove_node(2)

Introduction to NetworkX
A node can be any hashable object, such as a string, number, file, or function. This flexibility is useful across projects.
>>> import math
>>> g.add_node(math.cos)  # cosine function
>>> fh = open('tmp.txt', 'w')  # file handle
>>> g.add_node(fh)
>>> print(list(g.nodes()))  # the list includes the cosine function and the file handle

Introduction to NetworkX
# Single edge
>>> g.add_edge(1, 2)
>>> e = (2, 3)
>>> g.add_edge(*e)  # unpack edge tuple
# List of edges
>>> g.add_edges_from([(1, 2), (1, 3)])
# Container of edges
>>> g.add_edges_from(h.edges())
# Conversely, you can remove any edge of the graph
>>> g.remove_edge(1, 2)

Introduction to NetworkX
>>> g.add_edges_from([(1, 2), (1, 3)])
>>> g.add_node('a')
>>> g.number_of_nodes()  # also g.order()
4
>>> g.number_of_edges()  # also g.size()
2
>>> list(g.nodes())
[1, 2, 3, 'a']
>>> list(g.edges())
[(1, 2), (1, 3)]
>>> list(g.neighbors(1))
[2, 3]
>>> g.degree(1)
2

Introduction to NetworkX
NetworkX takes advantage of Python dictionaries to store node and edge measures. The dict type is a data structure that represents a key-value mapping.
# Keys can be any hashable object; values can be of any data type
>>> fruit_dict = {"apple": 1, "orange": [0.23, 0.11], 42: True}
# Can retrieve the keys as a Python list
>>> list(fruit_dict.keys())
['apple', 'orange', 42]
# Or the (key, value) tuples
>>> list(fruit_dict.items())
[('apple', 1), ('orange', [0.23, 0.11]), (42, True)]
# This becomes especially useful when you master Python list comprehensions
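To show why dictionaries and list comprehensions pair well for network measures, here is a small plain-Python sketch. The `degree` mapping and the threshold of 2 are invented; in practice the mapping might come from something like `dict(g.degree())`.

```python
# Hypothetical node -> degree mapping.
degree = {"a": 3, "b": 1, "c": 2}

# List comprehension over (key, value) pairs: keep nodes with degree >= 2.
hubs = [node for node, d in degree.items() if d >= 2]

# Sort the (node, degree) tuples by degree, largest first.
ranked = sorted(degree.items(), key=lambda item: item[1], reverse=True)

print(hubs, ranked)
```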

Introduction to NetworkX
Any NetworkX graph behaves like a Python dictionary with nodes as primary keys (only for access!)
>>> g.add_node(1, time='5pm')
>>> g.nodes[1]['time']
'5pm'
>>> g.nodes[1]  # Python dictionary
{'time': '5pm'}
The special edge attribute 'weight' should always be numeric and holds values used by algorithms requiring weighted edges.
>>> g.add_edge(1, 2, weight=4.0)
>>> g[1][2]['weight'] = 5.0  # edge already added
>>> g[1][2]
{'weight': 5.0}

Introduction to NetworkX
Many applications require iteration over nodes or over edges, which is simple in NetworkX:
>>> g.add_edge(1, 2)
>>> for node in g.nodes():
...     print(node, g.degree(node))
1 1
2 1
>>> g.add_edge(1, 3, weight=2.5)
>>> g.add_edge(1, 2, weight=1.5)
>>> for n1, n2, attr in g.edges(data=True):  # unpacking
...     print(n1, n2, attr['weight'])
1 2 1.5
1 3 2.5

Introduction to NetworkX
>>> dg = nx.DiGraph()
>>> dg.add_weighted_edges_from([(1, 4, 0.5), (3, 1, 0.75)])
>>> dg.out_degree(1, weight='weight')
0.5
>>> dg.degree(1, weight='weight')
1.25
>>> list(dg.successors(1))
[4]
>>> list(dg.predecessors(1))
[3]
Some algorithms work only for undirected graphs and others are not well defined for directed graphs. If you want to treat a directed graph as undirected for some measurement, you should probably convert it using to_undirected().
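A minimal sketch of the conversion mentioned above, reusing the two weighted edges from the slide. `connected_components` is chosen here only as an example of an undirected-only measure.

```python
import networkx as nx

dg = nx.DiGraph()
dg.add_weighted_edges_from([(1, 4, 0.5), (3, 1, 0.75)])

# Drop edge direction; edge attributes such as 'weight' are carried over.
ug = dg.to_undirected()

# connected_components is only defined for undirected graphs,
# so it is run on the converted copy.
components = list(nx.connected_components(ug))

print(ug.is_directed(), components)
```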

Introduction to NetworkX
NetworkX provides classes for graphs which allow multiple edges between any pair of nodes, MultiGraph and MultiDiGraph. This can be powerful for some applications, but many algorithms are not well defined on such graphs: shortest path is one example. Where results are not well defined, you should convert to a standard graph in a way that makes the measurement well defined.
>>> mg = nx.MultiGraph()
>>> mg.add_weighted_edges_from([(1, 2, .5), (1, 2, .75), (2, 3, .5)])
>>> dict(mg.degree(weight='weight'))
{1: 1.25, 2: 1.75, 3: 0.5}
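One possible conversion is sketched below: parallel edges are collapsed into a single edge whose weight is their sum. Summing is an assumption made for this example; for shortest paths, keeping the minimum weight would usually be the more sensible rule.

```python
import networkx as nx

mg = nx.MultiGraph()
mg.add_weighted_edges_from([(1, 2, 0.5), (1, 2, 0.75), (2, 3, 0.5)])

# Collapse parallel edges into one edge per node pair, summing the weights.
g = nx.Graph()
for u, v, data in mg.edges(data=True):
    w = data.get("weight", 1.0)
    if g.has_edge(u, v):
        g[u][v]["weight"] += w
    else:
        g.add_edge(u, v, weight=w)

print(list(g.edges(data=True)))
```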

Introduction to NetworkX
Classic graph operations:
subgraph(G, nbunch) - induce subgraph of G on nodes in nbunch
union(G1, G2) - graph union
disjoint_union(G1, G2) - graph union assuming all nodes are different
cartesian_product(G1, G2) - return Cartesian product graph
compose(G1, G2) - combine graphs identifying nodes common to both
complement(G) - graph complement
create_empty_copy(G) - return an empty copy of the same graph class
to_undirected(G) - return an undirected representation of G
to_directed(G) - return a directed representation of G
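A few of these operations on two tiny graphs, as a hedged sketch. The graphs are invented for illustration, and only operators that exist as top-level NetworkX functions or Graph methods are used.

```python
import networkx as nx

g1 = nx.path_graph(3)             # path 0 - 1 - 2
g2 = nx.Graph([(2, 3), (3, 4)])   # path 2 - 3 - 4, shares node 2 with g1

# compose: merge the graphs, identifying nodes common to both (node 2).
combined = nx.compose(g1, g2)

# disjoint_union: treat all nodes as different and relabel them 0..5.
separate = nx.disjoint_union(g1, g2)

# subgraph: keep only the listed nodes and the edges among them.
sub = combined.subgraph([0, 1, 2])

# complement: same nodes, edges exactly where g1 has none.
comp = nx.complement(g1)

print(combined.number_of_nodes(), separate.number_of_nodes(), list(comp.edges()))
```

`compose` yields five nodes because node 2 is shared, while `disjoint_union` yields six because the two copies of node 2 are kept apart.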

Introduction to NetworkX
# small famous graphs
>>> petersen = nx.petersen_graph()
>>> tutte = nx.tutte_graph()
>>> maze = nx.sedgewick_maze_graph()
>>> tet = nx.tetrahedral_graph()
# classic graphs
>>> K_5 = nx.complete_graph(5)
>>> K_3_5 = nx.complete_bipartite_graph(3, 5)
>>> barbell = nx.barbell_graph(10, 10)
>>> lollipop = nx.lollipop_graph(10, 20)
# random graphs
>>> er = nx.erdos_renyi_graph(100, 0.15)
>>> ws = nx.watts_strogatz_graph(30, 3, 0.1)
>>> ba = nx.barabasi_albert_graph(100, 5)
>>> red = nx.random_lobster(100, 0.9, 0.9)

Introduction to NetworkX
NetworkX is not primarily a graph drawing package, but it provides basic drawing capabilities by using matplotlib. For more complex visualization techniques it provides an interface to the open source Graphviz software package.
>>> import matplotlib.pyplot as plt  # Matplotlib plotting interface
>>> g = nx.erdos_renyi_graph(100, 0.15)
>>> nx.draw(g)
>>> nx.draw_random(g)
>>> nx.draw_circular(g)
>>> nx.draw_spectral(g)
>>> plt.savefig('graph.png')

Introduction to NetworkX
import d3py
import networkx as nx

G = nx.Graph()
G.add_edge(1, 2)
G.add_edge(1, 3)
G.add_edge(3, 2)
G.add_edge(3, 4)
G.add_edge(4, 2)

# use 'with' if you are writing a script and want to serve this up forever
with d3py.NetworkXFigure(G, width=500, height=500) as p:
    p += d3py.ForceLayout()
    p.show()

Plotly

Bokeh Client/Server ‘Static’ HTML