CINet: A CyberInfrastructure for Network Science NSF Software Development for CyberInfrastructure Grant OCI-1032677 Additional support by grants from DTRA.

Slides:



Advertisements
Similar presentations
Building Portals to access Grid Middleware National Technical University of Athens Konstantinos Dolkas, On behalf of Andreas Menychtas.
Advertisements

A Workflow Engine with Multi-Level Parallelism Supports Qifeng Huang and Yan Huang School of Computer Science Cardiff University
Oracle Labs Graph Analytics Research Hassan Chafi Sr. Research Manager Oracle Labs Graph-TA 2/21/2014.
ASCR Data Science Centers Infrastructure Demonstration S. Canon, N. Desai, M. Ernst, K. Kleese-Van Dam, G. Shipman, B. Tierney.
1 Coven a Framework for High Performance Problem Solving Environments Nathan A. DeBardeleben Walter B. Ligon III Sourabh Pandit Dan C. Stanzione Jr. Parallel.
Objective Understand web-based digital media production methods, software, and hardware. Course Weight : 10%
GridRPC Sources / Credits: IRISA/IFSIC IRISA/INRIA Thierry Priol et. al papers.
The road to reliable, autonomous distributed systems
1 Software & Grid Middleware for Tier 2 Centers Rob Gardner Indiana University DOE/NSF Review of U.S. ATLAS and CMS Computing Projects Brookhaven National.
Information Retrieval in Practice
Course Instructor: Aisha Azeem
Overview of Search Engines
A Lightweight Infrastructure for Graph Analytics Donald Nguyen Andrew Lenharth and Keshav Pingali The University of Texas at Austin.
Understanding and Managing WebSphere V5
Department of Computer Science, University of California, Irvine Site Visit for UC Irvine KD-D Project, April 21 st 2004 The Java Universal Network/Graph.
2012 National BDPA Technology Conference Creating Rich Data Visualizations using the Google API Yolanda M. Davis Senior Software Engineer AdvancED August.
A Free sample background from © 2001 By Default!Slide 1.NET Overview BY: Pinkesh Desai.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 18 Slide 1 Software Reuse.
Introduction to Information Retrieval CS 5604: Information Storage and Retrieval ProjCINETViz by Maksudul Alam, S M Arifuzzaman, and Md Hasanuzzaman Bhuiyan.
OpenAlea An OpenSource platform for plant modeling C. Pradal, S. Dufour-Kowalski, F. Boudon, C. Fournier, C. Godin.
Effective User Services for High Performance Computing A White Paper by the TeraGrid Science Advisory Board May 2009.
CINET 1 GDS-calculator Graph Dynamical Systems Visualization Group CINETgds Sichao Wu, Yao Zhang Virginia Tech Dec CS5604 Final Project Presentation.
ARGONNE  CHICAGO Ian Foster Discussion Points l Maintaining the right balance between research and development l Maintaining focus vs. accepting broader.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Future role of DMR in Cyber Infrastructure D. Ceperley NCSA, University of Illinois Urbana-Champaign N.B. All views expressed are my own.
Ohio State University Department of Computer Science and Engineering 1 Cyberinfrastructure for Coastal Forecasting and Change Analysis Gagan Agrawal Hakan.
Through the development of advanced middleware, Grid computing has evolved to a mature technology in which scientists and researchers can leverage to gain.
BLU-ICE and the Distributed Control System Constraints for Software Development Strategies Timothy M. McPhillips Stanford Synchrotron Radiation Laboratory.
IPlant cyberifrastructure to support ecological modeling Presented at the Species Distribution Modeling Group at the American Museum of Natural History.
Invitation to Computer Science 5 th Edition Chapter 6 An Introduction to System Software and Virtual Machine s.
CS 390 Unix Programming Summer Unix Programming - CS 3902 Course Details Online Information Please check.
OOI CI LCA REVIEW August 2010 Ocean Observatories Initiative OOI Cyberinfrastructure Architecture Overview Michael Meisinger Life Cycle Architecture Review.
11 CORE Architecture Mauro Bruno, Monica Scannapieco, Carlo Vaccari, Giulia Vaste Antonino Virgillito, Diego Zardetto (Istat)
Frontiers in Massive Data Analysis Chapter 3.  Difficult to include data from multiple sources  Each organization develops a unique way of representing.
DataNet – Flexible Metadata Overlay over File Resources Daniel Harężlak 1, Marek Kasztelnik 1, Maciej Pawlik 1, Bartosz Wilk 1, Marian Bubak 1,2 1 ACC.
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Discovery Environment Overview.
Tool Integration with Data and Computation Grid GWE - “Grid Wizard Enterprise”
Class 10: Introduction to CINET Using CINET for network analysis and visualization Network Science: Introduction to CINET 2015 Prof. Boleslaw K. Szymanski.
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
Ames Research CenterDivision 1 Information Power Grid (IPG) Overview Anthony Lisotta Computer Sciences Corporation NASA Ames May 2,
Holding slide prior to starting show. A Portlet Interface for Computational Electromagnetics on the Grid Maria Lin and David Walker Cardiff University.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Discovery Environment Overview.
GRID Overview Internet2 Member Meeting Spring 2003 Sandra Redman Information Technology and Systems Center and Information Technology Research Center National.
Nature Reviews/2012. Next-Generation Sequencing (NGS): Data Generation NGS will generate more broadly applicable data for various novel functional assays.
The EDGeS project receives Community research funding 1 Porting Applications to the EDGeS Infrastructure A comparison of the available methods, APIs, and.
NEES Cyberinfrastructure Center at the San Diego Supercomputer Center, UCSD George E. Brown, Jr. Network for Earthquake Engineering Simulation NEES TeraGrid.
Most of contents are provided by the website Graph Essentials TJTSD66: Advanced Topics in Social Media.
Architecture View Models A model is a complete, simplified description of a system from a particular perspective or viewpoint. There is no single view.
CS 484 Load Balancing. Goal: All processors working all the time Efficiency of 1 Distribute the load (work) to meet the goal Two types of load balancing.
Data Structures and Algorithms in Parallel Computing Lecture 3.
Development of e-Science Application Portal on GAP WeiLong Ueng Academia Sinica Grid Computing
Mike Hildreth DASPOS Update Mike Hildreth representing the DASPOS project 1.
Tool Integration with Data and Computation Grid “Grid Wizard 2”
Lesson 1 1 LESSON 1 l Background information l Introduction to Java Introduction and a Taste of Java.
INFSO-RI JRA2 Test Management Tools Eva Takacs (4D SOFT) ETICS 2 Final Review Brussels - 11 May 2010.
Tutorial on Science Gateways, Roma, Catania Science Gateway Framework Motivations, architecture, features Riccardo Rotondo.
GUILLOU Frederic. Outline Introduction Motivations The basic recommendation system First phase : semantic similarities Second phase : communities Application.
Information Retrieval in Practice
Tools and Services Workshop
Central nodes (Python and Gephi).
Joslynn Lee – Data Science Educator
EPANET-MATLAB Toolkit An Open-Source Software for Interfacing EPANET with MATLAB™ Demetrios ELIADES, Marios KYRIAKOU, Stelios VRACHIMIS and Marios POLYCARPOU.
Supporting Fault-Tolerance in Streaming Grid Applications
Module 01 ETICS Overview ETICS Online Tutorials
Central Nodes (Python and Gephi).
AWS Cloud Computing Masaki.
Overview of Workflows: Why Use Them?
Resource Allocation for Distributed Streaming Applications
Analyzing Massive Graphs - ParT I
Presentation transcript:

CINet: A CyberInfrastructure for Network Science NSF Software Development for CyberInfrastructure Grant OCI Additional support by grants from DTRA V&V, DTRA CNIMS, NSF NetSE, NSF DIBBS Team Virginia Tech, Indiana Uni., SUNY Albany, Jackson State, Argonne National Lab, U. Chicago, NCAT, U. Houston Downtown

Network Science Research in network science has been increasing very rapidly in the last decade, in many different scientific fields. Networks can be very large: ~10 8 nodes, ~10 10 edges, requiring HPC for analysis There is a need for middleware, i.e., an interface layer – Domain experts don’t need to become experts in graph theory, data mining, and high-performance computing Number of papers with “Complex Networks” in the title “Network science is the study of network representations of physical, biological, and social phenomena” “Network Science”, National Academies Press,

Challenges Challenge 1: How can we make an analytic engine that – Reduces programming overhead, – Reuse s existing resources, – Enhances Portability Challenge 2: Can we make a simple computational interface so that domain experts can use the available resources interactively and with little or no training 3

What is CINet A broadly accessible cyberinfrastructure – A web portal that hides the details of computation and data management, thereby minimizing the learning effort required A flexible framework – Allows easy extension by integrating off-the-shelf network analysis suites for analysis and visualization; this means new algorithms can be added easily over time A common repository – Managing data, models, and results through a digital library that maintains metadata Fostering research, teaching and collaboration – Building a broad user base, from multiple disciplines, including incorporation into courses on network science at many different universities 4

CINet Vision CINet should be self-sustainable. User can contribute – new networks (graphs) – new network algorithms – hardware (compute resources) – research results CINet should be self-manageable. End users insulated from – complexities of resource allocation – scheduling, cross-platform interactions – other low-level concerns 5

CINet Concept and Overview 6

CINet Concept: Enabling Crowd-sourced Self- sustainability VizApp: App for Network Visualization Granite: Tool for Structural Analysis GDS-Cal: Tool for Phase Space Analysis of Graph dynamics Computing resources and data storage Middleware: Simfrastructure Network Science Courses (Albany, NCAT,JSU) Case Studies Add Network or course material Add Structural method Store meta- data and results Add Data and Statistical analysis method 7

Structural Analysis Tool (Granite) – 110+ networks (graphs) – 18+ network generators – 70+ network algorithms (measures) Adapted from 3 different graph libraries: GaLib, SNAP, NetworkX – Visualization of networks (uses Gephi toolkit) – Service for adding new networks (graphs) – Service for adding new structural analysis tools (graph algorithms) – Python-based domain specific language, NetScript, supporting complex workflows Graph Dynamical System Calculator (GDS Calc) – Analyzing the phasic structure of a graph dynamical system (smaller networks but complete phase structure) Resource Manager – Allows multiple computational resources to be used and selected – uses FutureGrid (more than 5000 processing cores) Website – Additional resources (course notes, videos, tutorials, research papers etc). Middleware and Digital Library CINet Overview 8

Granite: A Tool for Structural Analysis of Complex Networks 9

Granite: A Tool for Structural Analysis Granite allows users to run various network algorithms (measures) on a variety of networks – Measures can compute structural properties (e.g., degree distribution, clustering coefficient) of networks – Network size can range from tiny (10s of nodes) to very large (100s of millions of nodes) – Provides both exact and (provably) approximate algorithms Granite automatically picks best implementation of specified measure Granite automatically picks most appropriate compute resource 10

Granite includes network algorithmic modules from three graph algorithm libraries: – GaLib (A New Graph Algorithm Library) Developed at NDSSL, Virginia Tech Carefully designed simple memory-efficient data structure Streaming algorithms – Make one or more passes through the data file Sampling-based approximation algorithms External memory algorithms – Only a part of the data are stored in memory at a time Parallel algorithms – NetworkX Developed at Los Alamos National Lab by Aric Hagberg et al. Open source Python graph library – SNAP Developed at Stanford University by Jure Leskovec et al. Stanford Network Analysis Platform Graph Libraries 11

Currently CINet has about 80+ implementations of network algorithms (measures) available They cover different areas of algorithms such as – Graph Centrality related (15+ such algorithms) PageRank, Betweenness centrality, Clustering coefficient, k-Core, etc. – Shortest Path and Connectivity related (25+ such algorithms) BFS, DFS, MST, Dijkstra’s SSSP, Center, Diameter, Articulation point, Bridge edges, Strongly and Weakly connected component, Component size distribution, etc. – Subgraph and Motif Counting related (6+ such algorithms) Count triangles, Clique counts, Find maximal clique, etc. – Flow related (2+ such algorithms) Maximum flow, minimum cut, etc. – Others (10+ such algorithms) Biparite graph, Maximal independent set, Common neighbors, Chordal graph, etc. – Approximation Algorithms (10+) All pair shortest path related (Diameter, Shortest path distribution), Vertex cover, Dominating set, etc. More to come … Network Algorithms (Measures) in CINet 12

Network Generators in CINet 18+ network generators available – Random network generators (8+ such generators) Erdos-Renyi, Watts-Strogatz small-world, Barabasi-Albert, Waxman random graph, etc. – Deterministic network generators (8+ such generators) Binary, Star, Wheel, Grid, Torus, Hypercube, etc. – Many of them can generate networks with millions of nodes and edges – User can add networks after generating them in CINet Screenshot 13

Networks (Graphs) 130+ small and large networks – Larger network contains ~10M nodes and ~400M edges – Social contact networks Miami, Chicago, Washington DC, Detroit, New York, Seattle – Multi-modal urban transportation networks (e.g., subway, cars, buses). Portland, Oregon – Adolescent friendship networks High school in New River Valley – Blog and other online networks Twitter, Slashdot, Epinions – Infrastructure networks Ad hoc and mesh, phone call, electrical power – Biological networks Networks can be weighted, and labeled. Karate Club Network [Zachery 1977] 14

Network Visualization User can submit visualization job for a network. User can view & download generated visualization. Visualization has 2 user interfaces in Granite – Quick static view while selecting network for analysis – Detailed interactive view Uses Gephi toolkit 15

Network Visualization (Features) User chooses parameters for visualization – Layout algorithms such as Force atlas, Random, Yifan Hu, and Fruchterman Reingold – Feature based organization such as determining node size and color using degree, pagerank, betweenness centrality values – Applying community detection algorithm to visualize different communities in different colors For a very large network, a subgraph can be visualized User can import links to embed visualizations in other webpages Store rendered networks into organized structure (pdf, png, gefx,etc.) Community detection in Miami subgraph with 1000 nodes Amazon co-purchase network 16

NetScript: A Domain Specific Language for Network Analytics 17

NetScript: A Domain Specific Language that will allow domain experts to analyze graph across libraries without learning to code in them separately Challenges: – Incompatible graph representation formats. – Libraries developed in many languages (e.g., C/C++, Java, Python). – Multiple programming paradigms (sequential programming, parallel programming : MPI, OpenMP, multithreading). – Large data communication between the libraries. – Error handling across the systems – Ability to configure the systems dynamically NetScript: A Domain Specific Language 18

An Example Usecase: Generate a set of random graphs with a specific property (e.g., diameter>=6) and compute the average degree distribution. Significance: NetworkX and GaLib have random graph generators but they do not support graph generation with user defined properties. DSL Code Snippet: 1 G = generateGraphNP(1000,0.7) 2 #No of nodes=1000, 3 #Prob. of choosing an edge between 4 #any two node = 0.7, 5 #mapped to NetworkX API 6 while diameter(G)<6: 7 #NetworkX measure 8 G=shuffle(G, 0.5) 9 #shuffle or edge swap 10 #ratio=0.5, 11 #mapped to GaLib API NetworkX (Python lib) NetworkX (Python lib) Galib (C++ lib) u u v v x x y y v v y y u u x x Graph conversion & data passing through extended interpreter 19

Features of NetScript Python data structures for graphs, node, edge. Node and edge can be of any type and can be labeled. 16 Generators for classic, random and synthetic graphs. Standard 50+ graph algorithms (graph traversal, shortest path calculation, associatively computation, centrality detection etc.) Network visualization through Matplotlib Graph converters to and form LEDA., GEXF, GML, Pickle, YAML, Graph6 Automatic graph conversion and data passing between libraries. Extended Python interpreter for data passing between C++ compiler and Python interpreter Minimal internal graph conversion. 20

Making Granite Self-Sustainable: Concept of Services and Apps 21

Service 1 – User Management User can request account – Its FREE to use To open an account, go to: Go to: Tools -> Granite Click Register All the entities – Networks, Measures, Generators, Analyses have owners. Due to this, access privileges can be inherited. 22

Service 2 – Add Networks User can add network by uploading a file A network can be in any one of two formats – Adjacency list – Edge list For each uploaded network, – syntactic and semantic checks are done to validate the network – meta-data (no. of edges & nodes) are automatically calculated – network is automatically converted from one format to another User can specify – metadata for the uploaded network – visibility of the uploaded network as Public : available to all users for analysis Private: available only to the owner for analysis 23

Service 2 – Add Networks (Screenshot) 24

User can add network algorithms with the help of CINET team Resource manager ensures security by sandbox – Run the new executable in Virtual Machine Similar service for adding network generators Service 3 – Add Network Algorithms (Measures) 25

Service 4 – Memoization Provides a platform enabling repeatable science A digital library service – Stores the result of deterministic network algorithms, e.g., minimum spanning tree – All input parameters are captured and archived – Retrieve the results from previous runs when the same algorithm is run with same parameters without recomputing – Keeps the system manageable by saving compute resources, time, storage space, etc. Performance: 26

GDSCalc: Graph Dynamical System Calculator 27

GDSCalc: Graph Dynamical System Calculator Elements of a GDS (graph dynamical system). – (Dependency) graph, nodes (vertices) have state. – Vertex functions, f. Quantifies when a node changes state, and to what state, based on its neighborhood. – Update Scheme: the execution order of the vertex functions. Sequential: Nodes update one-at-a-time in a prescribed order. Synchronous: Nodes update simultaneously. A primary use: compute phase space. – Set of all state transitions for all possible system states. Example sequential update scheme Node update sequence: Permutation π = (3,5,2,4,1) v1v1 v2v2 v3v3 v4v4 v5v5 f4f4 f3f3 f1f1 f2f2 f5f5 Graph X: Example system state transition for ratcheted threshold-1 node function and Boolean {0,1} state space. (0,0,1,0,0)(0,1,1,1,0) (0,0,1,0,0)(1,1,1,1,0) sequential update: synchronous update: Abstracts many different application domains. Same input; different output: update scheme matters. 28

Resource Manager 29

Resource Manager Vision of Resource Manger: Find best fit available resource to allocate job Achieves load balancing and fault-tolerance – by health monitoring Resources made readily pluggable. 30

Resource Manager Integrated Resources: HPC Clusters Shadowfax (hosted by Virginia Tech) Contains more than 1100 processing cores Pecos (hosted by Virginia Tech) Contains more than 900 processing cores Cloud Infrastructures Openstack and Eucalyptus hosted in FutureGrid Contains more than 5000 processing cores Integrated resources have large processing ability Around 7000 processing cores in total Shared with other users 31

Fostering Research, Teaching, Collaboration CINet is being used for teaching purpose in Network Science related courses in different universities – SUNY Albany, NCAT, Virginia Tech, Univ. of Maryland CINet provides a platform for collaborations – Virginia Tech, Indiana Uni., SUNY Albany, Jackson State, Argonne National Lab, U. Chicago, NCAT Evaluators from different universities – SUNY Albany, Maryland, NCAT, Virginia Tech 32

Where is CINet heading The base funding is over; we plan to host 4 workshops to get user feedback. Hope to improve CINet using other ongoing programs New version of CINet will be deployed in a month: a new website is being created as well Each of the modules is being enhanced to some extent; additionally new modules are being added (E.g. general simulation of contagions on networks). Will add Giraph based implementations and algorithms we develop under DIBBs here. Main goal is to try and increase the user base -- CINet is ready to be used by the academic community. 33

Demonstration of Granite Go to: cinet.vbi.vt.educinet.vbi.vt.edu Go to: Tools -> Granite Username: demo Password: demo

Conclusion A CyberInfrastructure that – provides both static and dynamical network analysis tool – contains a bunch of network algorithms, network generators, real-world and artificial networks – is Self-sustainable and self-manageable – has lots of apps and services, e.g., visualization, memoization, add measures, networks, etc. – HPC Resources and resource manager CINet makes life easy for network scientists. 35

36 Thank You!