Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com.


1 Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com

2 DocGraph
- Based on a FOIA request to CMS by Fred Trotter
- Pre-released at Strata RX 2012
- Medicare providers (more than doctors)
- CY 2011 dates of service
- Share 11 or more patients in a 30-day forward window
- Initial access restricted to MedStartr funders

3 DocGraph by the numbers
- Directed graph
- Average total degree 52.8
- 940,492 providers (graph nodes/vertices)
- 49,685,810 shared edges
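As a quick check on these figures, the arithmetic below divides the edge count by the node count. The 52.8 figure matches edges per node, which in a directed graph equals the mean in-degree (and the mean out-degree); adding in-degree and out-degree per node would give twice that. This is only a reading of the numbers on the slide, not a statement about how the authors computed them.

```python
# Node and edge counts as given on the slide.
nodes = 940_492
edges = 49_685_810

# In a directed graph every edge adds one out-degree and one in-degree,
# so mean in-degree == mean out-degree == edges / nodes.
mean_in_or_out_degree = edges / nodes

# Counting both directions per node doubles the figure.
mean_in_plus_out_degree = 2 * edges / nodes

print(round(mean_in_or_out_degree, 1))    # 52.8
print(round(mean_in_plus_out_degree, 1))  # 105.7
```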

4 Geographic visualization
http://isurfsoftware.com/blog/2012/12/13/visualizing-geographic-connections-between-us-doctors/

5 DocGraph data

6

7 NPPES
- National Plan and Provider Enumeration System
- Source of the NPI (National Provider Identifier)
- No-cost download
- Information is entered and updated by the provider
  - Data quality ranges from good to poor
- CSV file with 314 columns
- A custom MySQL load script is used to normalize the database
- BloomAPI open source project makes the data easier to access
  - http://www.bloomapi.com/
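With 314 columns, most work on the NPPES file starts by picking out a handful of fields. The sketch below does that with Python's csv module. The column names are assumed to match the NPPES header and should be checked against the downloaded file; the sample rows are made up and the helper select_columns is hypothetical, not part of the authors' load script.

```python
import csv
import io

# Column names assumed to match the NPPES header; verify against the real file.
WANTED = [
    "NPI",
    "Provider Business Practice Location Address City Name",
    "Provider Business Practice Location Address State Name",
]

def select_columns(csv_file, wanted=WANTED):
    """Yield one dict per row containing only the wanted columns."""
    for row in csv.DictReader(csv_file):
        yield {name: row[name] for name in wanted}

# Tiny made-up sample standing in for the full download.
sample = io.StringIO(
    "NPI,Entity Type Code,"
    "Provider Business Practice Location Address City Name,"
    "Provider Business Practice Location Address State Name\n"
    "0000000001,1,JAMESTOWN,NY\n"
    "0000000002,2,ALBANY,NY\n"
)
rows = list(select_columns(sample))
print(rows[0])
```

Reading row by row keeps memory use flat, which matters for a file of this width.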

8 Tabular data

9 Things we can do with tabular data

10 Graph data
Relation between authors and MeSH terms from PubMed
http://dx.doi.org/10.6084/m9.figshare.94595

11 Graph types
- Undirected graph
  - Facebook friendships
- Directed graph
  - Twitter: follow and be followed
- Bipartite graph
- Multipartite
  - RDF graph model
  - Property graph model
- Allow parallel edges
  - RDF graph model
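The difference between the first two graph types can be sketched with plain dictionaries. The helper build_adjacency and the account names are hypothetical, chosen only to mirror the friendship and follower examples on the slide.

```python
from collections import defaultdict

def build_adjacency(edges, directed):
    """Return {node: set(neighbors)} for a list of (source, target) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target]  # touch the key so the target always appears
        if not directed:
            adjacency[target].add(source)
    return dict(adjacency)

# Hypothetical accounts, for illustration only.
pairs = [("ann", "bob"), ("bob", "cy")]

friends = build_adjacency(pairs, directed=False)  # friendship-style, mutual
follows = build_adjacency(pairs, directed=True)   # follower-style, one-way

print(sorted(friends["bob"]))  # ['ann', 'cy']
print(sorted(follows["bob"]))  # ['cy']
```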

12 Components of a network/graph

13 Graphs in healthcare
- Prescriber and patient (bipartite)
  - NCPDP data with NPI
- Referral data sets
- Shared patients
  - DocGraph
- Social networks
  - Tweeting about a disease
- Limited by imagination

14 Generating GraphML
- XML-based file format for graphs
- Readable by a large number of tools
  - Gephi
  - Mathematica
  - igraph (R)
- NetworkX, a Python library for graphs, can export to GraphML
- GraphML is not a file format for really large graphs
- GraphML is not readable by d3.js
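To show what the format looks like, here is a minimal sketch that writes GraphML using only Python's standard library. In practice NetworkX would normally do this for you; the function to_graphml, the key ids d0 and d1, and the node identifiers are illustrative assumptions, not taken from the DocGraph code.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(nodes, edges):
    """Serialize a small directed graph to a GraphML string.

    nodes: {node_id: label}; edges: [(source_id, target_id, weight)].
    """
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # <key> elements declare the attributes used on nodes and edges.
    ET.SubElement(root, "key", {"id": "d0", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "d1", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for node_id, label in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="d0").text = label
    for source, target, weight in edges:
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="d1").text = str(weight)
    return ET.tostring(root, encoding="unicode")

# Hypothetical identifiers and labels, not real NPIs.
xml_text = to_graphml({"n1": "Provider A", "n2": "Provider B"},
                      [("n1", "n2", 25)])
print(xml_text)
```

Every node, edge, and attribute is wrapped in its own element, which helps explain why the format becomes unwieldy for very large graphs.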

15 GraphML can be loaded into Mathematica

16 Gephi

17 Gephi
- Java-based open source tool
- Focused on interactivity
  - Fast graphics
  - Multi-threaded
  - Visual updates
- Strong graph analytics
- Graphs stored in memory
  - Upper limit is about 100,000 nodes
- NetBeans plugin architecture
  - Integration with Neo4J
  - Additional layout algorithms

18 Downloading Gephi
http://gephi.org/users/download/

19 Downloading sample files
https://dl.dropboxusercontent.com/u/21690634/DocGraph/docgraph_tutorial_examples.zip

20 Subsets are generated using a Python script

    python extract_providers_to_graphml.py "npi='1750499653'" sterrence
    Leaf-edges
    Opening connection referral
    Configuration
    Selection criteria for subset graph: npi='1750499653'
    Referral table _name: referral.referral2011
    NPI detail table name: referral.npi_summary_primary_taxonomy
    Nodes will be labeled by: provider_name
    Leaf-to-leaf edges will be exported? False
    …
    Imported 1 nodes
    …
    Imported 986 nodes
    …
    Imported 1724 edges
    Edge types imported {'core-to-leaf': 866, 'leaf-to-core': 856, None: 2}
    Leaf-to-leaf edges were not selected for export
    Writing GraphML file

21 Generating a subset: some concepts
- Core nodes
- Adding leaf nodes
- Connecting core nodes
- Connecting to leaf nodes
- Connecting leaf nodes
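The concepts above can be sketched in a few lines of Python. This is not the authors' extraction script: the functions classify_edges and extract_subset, the "core-to-core" and "leaf-to-leaf" labels, and the node identifiers are assumptions made for illustration. The sketch treats a leaf as any non-core node that shares an edge with a core node.

```python
from collections import Counter

def classify_edges(edges, core):
    """Label each directed (source, target) edge relative to a core node set."""
    labels = []
    for source, target in edges:
        src_core, tgt_core = source in core, target in core
        if src_core and tgt_core:
            labels.append("core-to-core")
        elif src_core:
            labels.append("core-to-leaf")
        elif tgt_core:
            labels.append("leaf-to-core")
        else:
            labels.append("leaf-to-leaf")
    return labels

def extract_subset(edges, core, include_leaf_to_leaf=False):
    """Keep edges touching the core; optionally keep edges between leaves."""
    # Leaf nodes are non-core nodes that share an edge with a core node.
    leaves = set()
    for source, target in edges:
        if source in core and target not in core:
            leaves.add(target)
        elif target in core and source not in core:
            leaves.add(source)
    kept = []
    for (source, target), label in zip(edges, classify_edges(edges, core)):
        if label != "leaf-to-leaf":
            kept.append((source, target, label))
        elif include_leaf_to_leaf and source in leaves and target in leaves:
            kept.append((source, target, label))
    return leaves, kept

# Hypothetical identifiers, not real NPIs.
all_edges = [("A", "B"), ("B", "A"), ("A", "C"),
             ("B", "C"), ("C", "D"), ("D", "E")]
leaves, kept = extract_subset(all_edges, core={"A"})
print(sorted(leaves))
print(Counter(label for _, _, label in kept))
```

Leaving leaf-to-leaf edges out by default keeps the subset small; turning them on adds only the edges whose two endpoints are both already leaves.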

22 Sample files
- jamestown_core_provider_graph.graphml
  - Providers selected with practice addresses in Jamestown, NY
  - Small city in far western New York (approximately 30,000 residents)
  - 179 nodes with 5,560 edges
- jamestown_core_and_leaf_provider_graph.graphml
  - Includes providers above and those who are linked to them
  - 1,322 nodes with 12,457 edges
- albany_core_provider_graph.graphml
  - Providers selected with practice addresses in Albany, NY
  - A small city in New York (approximately 100,000 residents)
  - 1,368 nodes with 44,711 edges

23 Sample files (continued)
- bronx_core_provider_graph.graphml
  - Providers selected with practice addresses in Bronx, NY
  - Urban community (1.4 million residents)
  - 3,268 nodes and 53,828 edges

24 Opening a graph file

25 Import report

26 Force directed layout of the graph

27 Results of the layout

28 ForceAtlas 2 works well for larger graphs

29 Navigating the graph
- Best experience with a three-button mouse with a scroll wheel
  - Right click and hold to pan
  - Scroll wheel to zoom in and out
  - Left click to select
  - Right click for context menus
- MacBook users
  - Command key plus click and hold on the trackpad to pan
  - Two fingers to zoom on the trackpad
  - Click on the trackpad to select
  - Control click for context menus

30 Coloring the graph (partitioning)

31 Coloring the graph (partitioning)

32 Varying node size based on importance
- Step 1: Select a measure for node importance
  - Degree
  - PageRank
  - Eigenvector centrality
- Step 2: Run the measure against the graph
- Step 3: Open the Ranking tab and choose “Size/Weight”
- Step 4: Set the size range
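The four steps amount to computing a score per node and mapping it onto a size range. The sketch below uses total degree as the measure and a linear mapping. It illustrates the idea only and is not Gephi's implementation; the function names and the 10 to 50 size range are assumptions.

```python
def total_degree(edges):
    """Count in-degree plus out-degree per node from (source, target) pairs."""
    degree = {}
    for source, target in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return degree

def scale_to_range(scores, min_size, max_size):
    """Linearly map each score onto [min_size, max_size]."""
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    if span == 0:
        # All nodes are equally important; give them the midpoint size.
        return {node: (min_size + max_size) / 2 for node in scores}
    return {node: min_size + (score - low) / span * (max_size - min_size)
            for node, score in scores.items()}

# Hypothetical edges for illustration.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "A")]
sizes = scale_to_range(total_degree(edges), min_size=10, max_size=50)
print(sizes)  # {'A': 50.0, 'B': 30.0, 'C': 30.0, 'D': 10.0}
```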

33 Graph measures
- Degree
  - In-degree
  - Out-degree
- Graph structure measures
  - Clustering (global and local)
  - Network diameter
- Centrality measures
  - Eigenvector centrality
  - PageRank (Google search)
- Community measures
- And more...
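Of the centrality measures listed, PageRank is the easiest to sketch from scratch. The power-iteration version below is a textbook formulation with a damping factor of 0.85. It is not the code Gephi runs; the function name, the parameter defaults, and the example edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank

# Hypothetical edges: three nodes point at C, and C points at D.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "C")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
```

A node scores highly when other high-scoring nodes point to it, so the result can differ from a simple in-degree count.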

34 Interactively viewing node attributes
Click the “T” icon on the bottom to turn on node labeling

35 Data Laboratory

36 Selecting visible fields

37 Viewing edge attributes

38 Saving your graph
- Save your graph in .gephi format
  - XML-based format
  - Preserves layout, size, and color
- Save in GraphML format for use with outside programs

39 Filtering nodes by attributes

40 Hints for filtering nodes
- Drag the field filter “is_physician” from the top pane to the lower pane
- Set the value to filter on
  - Value should equal 1
  - 1 is equivalent to true
- Click “Filter” to apply
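Outside Gephi, the same attribute filter can be sketched in Python: keep the nodes whose field equals the chosen value, then keep only edges whose two endpoints both survive. The field name is_physician comes from the slide; the function filter_nodes and the sample nodes are hypothetical.

```python
def filter_nodes(node_attrs, edges, field, value):
    """Keep nodes whose attribute equals value, and edges among kept nodes."""
    kept_nodes = {node: attrs for node, attrs in node_attrs.items()
                  if attrs.get(field) == value}
    kept_edges = [(s, t) for s, t in edges
                  if s in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges

# Hypothetical nodes; "is_physician" follows the field named on the slide.
node_attrs = {
    "n1": {"is_physician": 1},
    "n2": {"is_physician": 0},
    "n3": {"is_physician": 1},
}
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n3")]
physicians, physician_edges = filter_nodes(node_attrs, edges, "is_physician", 1)
print(sorted(physicians), physician_edges)  # ['n1', 'n3'] [('n1', 'n3')]
```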

41 Producing a final graph
We need to rescale the edge weights in the graph

42 Producing a final graph after scaling

43 Bronx core provider graph

44 Challenge questions
- Which institution is the most “important” provider for the Bronx?
  - Hint: try a centrality measure
- Can you determine if geography plays a role in patient sharing in the Bronx?
  - Which parameter could be used to partition the graph?
- Can you filter the graph to show only radiologists?
- Which radiologist has the highest “authority” in the graph?

45 Other tools for graph analysis
- NetworkX
  - Python
  - Lots of algorithms
- igraph
  - R and Python
- Gremlin: graph traversal and manipulation
  - Groovy shell
  - Gremlin interface is implemented for Neo4J
- And more...

46 Scaling the analysis to the entire DocGraph
- Most healthcare graphs will be big (millions of nodes)
- What we learn at the local level can be applied at the global level
  - Importance of geography
  - Supernodes (radiologists, ER docs, pathologists, transportation, …)
- Many graph measures don’t scale well
  - Maximal cliques
- Currently exploring how to use Faunus to scale the analysis with Hadoop

47 Links
- http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html (information)
- https://github.com/jhajagos/DocGraph (code)
- http://notonlydev.com/docgraph-data/ (open source $1 covers bandwidth fees)
- https://groups.google.com/forum/#!forum/docgraph (mailing list)

48 Questions
Try to publish your own healthcare dataset as a graph!

