Marko Grobelnik, Dunja Mladenic JSI Parts of the presentation taken from the tutorial “Structure and function of real-world graphs and networks” by Jure Leskovec, CMU/JSI

Presentation transcript:

Marko Grobelnik, Dunja Mladenic JSI Parts of the presentation taken from the tutorial “Structure and function of real-world graphs and networks” by Jure Leskovec, CMU/JSI

• What are networks?
  ◦ …few examples
• Network properties
  ◦ Small worlds
  ◦ Power law
  ◦ Long tail
  ◦ Network resilience
  ◦ Structure of networks
• Applications
  ◦ Mining e-mail server logs
  ◦ Mining MSN Messenger data

[Diagram: (complex) networks at the intersection of statistics, computer systems, theory and algorithms, and machine learning / data mining]

[Diagram: the same picture extended: (complex) networks connect computer science (statistics, computer systems, theory and algorithms, machine learning / data mining) with social sciences, biology, physics, and industry & applications]

[Diagram: a graph annotated step by step with its elements: vertex / node, edge / link, direction, probabilities]
…in dynamic networks all the elements of the graph are changing
…dealing with dynamic networks is an active research topic

Example of Dynamic Graph (1/3): a query whose topic is active during a limited time period

Example of Dynamic Graph (2/3): at one point in time, Clinton and Chicago are connected

Example of Dynamic Graph (3/3): at another point in time, Clinton and Chicago are NOT connected

• Information networks:
  ◦ World Wide Web: hyperlinks
  ◦ Citation networks
  ◦ Blog networks
• Social networks: people + interactions
  ◦ Organizational networks
  ◦ Communication networks
  ◦ Collaboration networks
  ◦ Sexual networks
• Technological networks:
  ◦ Power grid
  ◦ Airline, road, river networks
  ◦ Telephone networks
  ◦ Internet
  ◦ Autonomous systems
[Figures: Florence families, karate club network, collaboration network, friendship network]

• Biological networks
  ◦ Metabolic networks
  ◦ Food web
  ◦ Neural networks
  ◦ Gene regulatory networks
• Language networks
  ◦ Semantic networks
• Software networks
• …
[Figures: yeast protein interactions, semantic network, language network, XFree86 network]

• Directed/undirected
• Multigraphs (multiple edges between nodes)
• Hypergraphs (edges connecting multiple nodes)
• Bipartite graphs (e.g., papers to authors)
• Weighted networks
• Different types of nodes and edges
• Evolving networks:
  ◦ Nodes and edges only added
  ◦ Nodes and edges added and removed

• Sociologists were the first to study networks:
  ◦ Study of patterns of connections between people to understand the functioning of society
  ◦ People are nodes, interactions are edges
  ◦ Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective)
  ◦ Typical questions: centrality and connectivity
• Limited to small graphs (~10 nodes) and to properties of individual nodes and edges

• Large networks (e.g., web, internet, on-line social networks) with millions of nodes
• Many traditional questions are not useful anymore:
  ◦ Traditional: What happens if a node U is removed?
  ◦ Now: What percentage of nodes needs to be removed to affect network connectivity?
• Focus moves from a single node to the study of statistical properties of the network as a whole
• We cannot draw (plot) the network and examine it

• What does the network “look like” even if I can’t look at it?
• Need for statistical methods and tools to quantify large networks
• 3 parts/goals:
  ◦ Statistical properties of large networks
  ◦ Models that help understand these properties
  ◦ Predicting the behavior of networked systems based on measured structural properties and local rules governing individual nodes

• Features common to networks of different types:
  ◦ Properties of static networks:
    ▪ Small-world effect
    ▪ Transitivity or clustering
    ▪ Degree distributions (scale-free networks)
    ▪ Network resilience
    ▪ Community structure
    ▪ Subgraphs or motifs
  ◦ Temporal properties:
    ▪ Densification
    ▪ Shrinking diameter

• Six degrees of separation (Milgram, 1960s)
  ◦ Random people in Nebraska were asked to send letters to stockbrokers in Boston
  ◦ Letters could only be passed to first-name acquaintances
  ◦ Only 25% of the letters reached the goal
  ◦ But they reached it in about 6 steps
• Measuring path lengths:
  ◦ Diameter (longest shortest path): max d_ij
  ◦ Effective diameter: distance at which 90% of all connected pairs of nodes can be reached
  ◦ Mean geodesic (shortest) distance l
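The three path-length measures can be illustrated on a toy graph. This is a minimal sketch, assuming an unweighted, undirected graph stored as an adjacency dictionary; the function names are hypothetical, and the effective diameter is taken here as the smallest whole number of hops that covers at least 90% of connected pairs (interpolated variants also exist).

```python
from collections import deque
import math


def bfs_distances(adj, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_stats(adj, quantile=0.9):
    """Return (diameter, effective diameter, mean geodesic distance).

    All three are computed over connected ordered pairs of distinct nodes.
    """
    lengths = sorted(
        d
        for s in adj
        for t, d in bfs_distances(adj, s).items()
        if t != s
    )
    if not lengths:
        raise ValueError("graph has no connected pairs")
    diameter = lengths[-1]  # longest shortest path
    # Smallest distance within which at least `quantile` of the pairs fall.
    effective = lengths[math.ceil(quantile * len(lengths)) - 1]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean


# Toy example: a path graph a - b - c - d - e.
toy = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
diameter, effective, mean = path_length_stats(toy)
print(diameter, effective, mean)
```

Running one BFS per node is quadratic in the graph size, so for networks with millions of nodes these quantities are usually estimated from a sample of source nodes.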

• The empirical observation for the Web graph is that its diameter is small relative to the size of the network
  ◦ …this property is called “Small World”
  ◦ …formally, small-world networks have a diameter exponentially smaller than their size
• By simulation it was shown that for a Web size of 1B pages the diameter is approx. 19 steps
  ◦ …empirical studies confirmed the findings
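"Exponentially smaller than the size" means that distances grow like log N. The slide does not say which simulation it refers to; one widely cited fit, reported by Albert, Jeong and Barabási (1999) for the average shortest path between Web pages, is d ≈ 0.35 + 2.06 log10(N). The sketch below assumes that fit (the coefficients are not taken from this presentation) to show how slowly the distance grows with N.

```python
import math


def estimated_avg_distance(n_pages, a=0.35, b=2.06):
    """Logarithmic scaling d = a + b * log10(N).

    The default coefficients are an assumption: they follow the fit
    published by Albert, Jeong and Barabasi (1999), not this slide deck.
    """
    return a + b * math.log10(n_pages)


for n in (10**6, 10**8, 10**9):
    print(f"N = {n:>13,d}  ->  about {estimated_avg_distance(n):.1f} steps")
```

A thousandfold increase in the number of pages adds only about six steps under this fit, which is in line with the figure of roughly 19 steps quoted for 1B pages.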

• The network represents collaboration between institutions on FP5-IST projects funded by the European Union
  ◦ …there are 7886 organizations collaborating on 2786 projects
  ◦ …in the network, each node is an organization; two organizations are connected if they collaborate on at least one project
• Small-world properties of the collaboration network:
  ◦ The main connected part of the network contains 94% of the nodes
  ◦ The max distance between any two organizations is 7 steps, meaning that any organization can be reached in up to 7 steps from any other organization
  ◦ The average distance between any two organizations is 3.15 steps (with standard deviation 0.38)
  ◦ 38% (2770) of organizations have an avg. distance of 3 or less
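The construction described above is a one-mode projection of a bipartite project-to-organization relation. A minimal sketch, assuming the input is a mapping from project to participating organizations (the project and organization names below are made up):

```python
from collections import defaultdict
from itertools import combinations


def project_collaborations(project_members):
    """Connect two organizations if they share at least one project.

    `project_members` maps a project id to an iterable of organizations.
    Returns an adjacency dict: organization -> set of collaborators.
    """
    adj = defaultdict(set)
    for orgs in project_members.values():
        orgs = set(orgs)
        for org in orgs:
            adj[org]  # touch the key so organizations without partners still appear
        for a, b in combinations(sorted(orgs), 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


# Hypothetical toy data, not the FP5-IST data set.
projects = {
    "P1": ["OrgA", "OrgB", "OrgC"],
    "P2": ["OrgC", "OrgD"],
    "P3": ["OrgE"],
}
network = project_collaborations(projects)
print(sorted(network["OrgC"]))
```

The resulting adjacency dictionary can be fed directly into a breadth-first search to obtain the distance statistics quoted on the slide.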

1856 collaborations; avg. distance is 1.95; max. distance is 4

179 collaborations; avg. distance is 2.42; max. distance is 4

8 collaborations; max. distance is 7

• Distribution of shortest path lengths
• Microsoft Messenger network
  ◦ 180 million people
  ◦ 1.3 billion edges
  ◦ Edge if two people exchanged at least one message in a one-month period
[Figure: hop plot. Pick a random node and count how many nodes are at distance 1, 2, 3, ... hops. Axes: distance (hops) vs. number of nodes.]

• A power law describes relations between the objects in the network
  ◦ …it is very characteristic of networks generated by some kind of social process
  ◦ …it describes the scale invariance found in many natural phenomena (in physics, biology, sociology, economics and linguistics)

• In the context of the Web, the power law appears in many cases:
  ◦ Web page sizes
  ◦ Web page connectivity
  ◦ Sizes of the Web’s connected components
  ◦ Web page access statistics
  ◦ Web browsing behavior
• Formally, the power law describing web page degrees is: [formula missing in the transcript] (This property has been preserved as the Web has grown)
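For orientation only, the general form of a power-law degree distribution is given below. Here P(k) is the fraction of pages with degree k, C is a normalizing constant and γ is the exponent; these symbols are assumptions, not notation from the slide. Exponents of roughly 2.1 for in-degree and 2.7 for out-degree are commonly reported for the Web graph, but those values are not taken from this presentation.

```latex
P(k) = C\,k^{-\gamma}
\qquad\Longleftrightarrow\qquad
\log P(k) = \log C - \gamma \log k
```

The second form explains why a power law shows up as a straight line with slope -γ on a log-log plot.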

[Figure: degree distribution of the Microsoft Messenger network, i.e., the number of people a person talks to. Axes: node degree vs. count; the highest degree is marked with an X.]
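A degree distribution like the one plotted here can be computed from an edge list in a few lines. The sketch below is an illustration on a toy graph, not the procedure used for the Messenger data; the function names are hypothetical, and the least-squares slope on log-log axes is only a crude estimate of a power-law exponent.

```python
from collections import Counter
import math


def degree_counts(edges):
    """Map each degree k to the number of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(degree).

    Needs at least two distinct degrees; for a power law the slope
    approximates minus the exponent.
    """
    pts = [(math.log(k), math.log(c)) for k, c in counts.items()]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    return sxy / sxx


# Toy example: a star with one hub (node 0) and four leaves.
star_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
counts = degree_counts(star_edges)
slope = loglog_slope(counts)
print(dict(counts), round(slope, 3))
```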

This is not directly related to graphs, but it nicely explains the “long tail” effect. It shows that there is a big market for niche products.

• We observe how the connectivity (length of the paths) of the network changes as vertices are removed
• It is important for epidemiology
  ◦ Removal of vertices corresponds to vaccination
• Real-world networks are resilient to random attacks
  ◦ One has to remove all web pages of degree > 5 to disconnect the web
  ◦ …but this is a very small percentage of web pages
• A random network has better resilience to targeted attacks
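The difference between random and targeted removal can be tried out on a toy graph. This is a minimal sketch, assuming an undirected graph as an adjacency dictionary and using the size of the largest connected component as the connectivity measure (the slide mentions path lengths; component size is a simpler stand-in). The function names are hypothetical.

```python
import random


def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def attack(adj, fraction, targeted, seed=0):
    """Remove a fraction of nodes, either highest-degree first or at random."""
    k = int(fraction * len(adj))
    if targeted:
        victims = sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:k]
    else:
        victims = random.Random(seed).sample(sorted(adj), k)
    return largest_component_size(adj, victims)


# Toy hub-and-spoke graph: node 0 is connected to nodes 1..9.
star = {0: list(range(1, 10)), **{i: [0] for i in range(1, 10)}}
print("intact:  ", largest_component_size(star, []))
print("targeted:", attack(star, 0.1, targeted=True))
print("random:  ", attack(star, 0.1, targeted=False))
```

Removing the single hub shatters the toy graph, while a randomly chosen node is most likely a leaf whose removal barely matters; this is the asymmetry the slide describes for networks with highly skewed degree distributions.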

• What are the building blocks (motifs) of networks?
• Do motifs have specific roles in networks?
• Network motif detection process:
  ◦ Count how many times each subgraph appears
  ◦ Compute the statistical significance of each subgraph: the probability that it appears in a random network as often as in the real network
[Figure: 3-node motifs]
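The two-step detection process can be sketched for a single 3-node subgraph, the feed-forward loop (a→b, b→c, a→c). The null model below draws random directed graphs with the same nodes and the same number of edges; this is a simplifying assumption, since motif studies typically use randomized graphs that also preserve each node's degrees. All function names are hypothetical.

```python
import random
from itertools import permutations


def count_feed_forward(edges):
    """Count node triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = {(u, v) for u, v in edges if u != v}  # ignore self-loops
    succ = {}
    for u, v in edge_set:
        succ.setdefault(u, set()).add(v)
    count = 0
    for a, b in edge_set:
        for c in succ.get(b, ()):
            if c != a and (a, c) in edge_set:
                count += 1
    return count


def motif_p_value(edges, n_random=200, seed=0):
    """Empirical probability that a random graph has at least as many
    feed-forward loops as the real one."""
    rng = random.Random(seed)
    nodes = sorted({n for e in edges for n in e})
    n_edges = len({(u, v) for u, v in edges if u != v})
    possible = list(permutations(nodes, 2))  # all ordered pairs of distinct nodes
    real = count_feed_forward(edges)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(possible, n_edges)
        if count_feed_forward(sample) >= real:
            hits += 1
    return real, hits / n_random


# Toy graph: a transitive tournament on four nodes (order a, b, d, c).
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"),
             ("b", "c"), ("b", "d"), ("d", "c")]
real_count, p_value = motif_p_value(toy_edges)
print(real_count, p_value)
```

A small p-value suggests the subgraph is over-represented relative to the chosen null model, which is the sense in which it qualifies as a motif.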

• Biological networks
  ◦ Feed-forward loop
  ◦ Bi-fan motif
• Web graph:
  ◦ Feedback with two mutual dyads
  ◦ Mutual dyad
  ◦ Fully connected triad

• Intuition says that distances between nodes slowly grow as the network grows (like log n)
• But in measured networks the distances between nodes slowly decrease as the network grows (shrinking diameter)
[Figures: Internet, Citations]

• In November 1999, a large-scale study using AltaVista crawls of over 200M nodes and 1.5B links reported a “bow tie” structure of web links
  ◦ …we suspect that, because of the scale-free nature of the Web, this structure is still preserved

• SCC: strongly connected component (the core), where pages can reach each other via directed paths
• IN: pages that can reach the core via a directed path but cannot be reached from the core
• OUT: pages that can be reached from the core via a directed path but cannot reach the core
• TENDRILS: components attached to IN or OUT via directed paths, with no directed path from or to the core
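The SCC, IN and OUT sets follow directly from forward and backward reachability. The sketch below assumes the caller already knows one node that lies in the core (finding the largest strongly connected component in general needs an algorithm such as Tarjan's or Kosaraju's); everything that is neither core, IN nor OUT, including tendrils and disconnected pieces, is lumped into a hypothetical "OTHER" bucket.

```python
def reachable(adj, source):
    """All nodes reachable from `source` by following edges in `adj`."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bow_tie(edges, core_node):
    """Split the nodes of a directed graph into SCC / IN / OUT / OTHER,
    relative to the strongly connected component containing `core_node`."""
    forward, backward, nodes = {}, {}, set()
    for u, v in edges:
        forward.setdefault(u, set()).add(v)
        backward.setdefault(v, set()).add(u)
        nodes.update((u, v))
    descendants = reachable(forward, core_node)   # core_node can reach these
    ancestors = reachable(backward, core_node)    # these can reach core_node
    scc = descendants & ancestors
    return {
        "SCC": scc,
        "IN": ancestors - scc,
        "OUT": descendants - scc,
        "OTHER": nodes - ancestors - descendants,
    }


# Toy web: i -> a <-> b -> o, a tendril i -> t, and a separate pair x -> y.
links = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("i", "t"), ("x", "y")]
parts = bow_tie(links, core_node="a")
print({name: sorted(members) for name, members in parts.items()})
```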

• We address the problem of how to construct a taxonomy from social network data.
  ◦ …we adapt the approach used when dealing with text
• As an example we use the e-mail graph of a mid-size research institution
  ◦ …communication records of 770 people at JSI
• The experiments and evaluation show our approach to be useful and applicable in real-life situations
  ◦ …the approach could easily be reused in case studies (and elsewhere)

• The main contribution of the deliverable is an architecture and software consisting of 5 major steps:
  1. We start with log files from the institutional server, where the data include information about transactions with three fields: time, sender and the list of receivers.
  2. After cleaning, we get the data in the form of transactions that include the addresses of sender and receiver.
  3. From the set of transactions we construct a graph whose vertices are addresses, connected if there is a transaction between them.
  4. The graph is transformed into a sparse matrix, which allows data manipulation and analysis operations.
  5. The sparse-matrix representation of the graph is analyzed with ontology learning tools, producing an ontological structure that corresponds to the organizational structure of the institution the e-mails came from.
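Steps 2 to 4 can be sketched as follows. This is an illustration under stated assumptions, not the deliverable's software: transactions are taken to be already parsed into (time, sender, receivers) tuples, the sparse matrix is a plain dictionary of dictionaries, and storing message counts as edge weights is an added choice (the slide only says that two addresses are connected if a transaction exists). The addresses are made up.

```python
from collections import defaultdict


def build_sparse_matrix(transactions):
    """Turn (time, sender, receivers) transactions into a symmetric sparse matrix.

    Returns (index, rows): `index` maps an address to its row number and
    rows[i][j] is the number of messages exchanged between i and j.
    """
    index = {}
    rows = defaultdict(lambda: defaultdict(int))
    for _time, sender, receivers in transactions:
        for receiver in receivers:
            if receiver == sender:
                continue  # skip messages to oneself
            i = index.setdefault(sender, len(index))
            j = index.setdefault(receiver, len(index))
            rows[i][j] += 1
            rows[j][i] += 1
    return index, {i: dict(row) for i, row in rows.items()}


# Hypothetical cleaned transactions.
transactions = [
    ("2005-03-28 13:59:05", "a@example.org", ["b@example.org", "c@example.org"]),
    ("2005-03-28 14:02:10", "b@example.org", ["a@example.org"]),
]
index, matrix = build_sparse_matrix(transactions)
print(index)
print(matrix)
```

Only non-zero entries are stored, which matters because an e-mail graph with thousands of addresses has very few edges compared with the number of possible address pairs.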

• The data is the collection of log files with e-mail transactions from the local spam filter software Amavis (
  ◦ Each line of the log files denotes one event at the spam filter software
  ◦ We were interested in the events on successful e-mail transactions, which carry information on time, sender, and list of receivers
  ◦ An example of a successful transaction is the following line:
    ▪ 2005 Mar 28 13:59:05 patsy amavis[33972]: ( ) Passed CLEAN, [ ] [ ] ->, Message-ID:, Hits: , 6389 ms

• The log files include e-mail data from Sep 5th 2003 to Mar 28th 2005:
  ◦ …this sums up to 12.8 GB of data.
  ◦ After filtering out the successful transactions, 564 MB remain
    ▪ …which contain approx. 2.7 million successful transactions used for further processing
  ◦ The whole dataset contains references to approx. [count missing] addresses
    ▪ …after the data cleaning phase the number is reduced to approx. [count missing] addresses
    ▪ …out of which 770 addresses are internal, from the home institution (with the “ijs.si” domain name)

Organizational structure of JSI produced from cleaned transactions with OntoGen in <5 minutes

Organizational structure of JSI visualized from transactions with Document-Atlas

Part of the clustering results for the “Jozef Stefan Institute” e-mail data, grouped into 10 clusters (C-0, C-1, …, C-9), showing the distribution of the clustered e-mails over the Institute’s departments.

By Jure Leskovec

• For every conversation (session) we have a list of users who participated in the conversation
• There can be multiple people per conversation
• For each conversation and each user:
  ◦ User Id
  ◦ Time Joined
  ◦ Time Left
  ◦ Number of Messages Sent
  ◦ Number of Messages Received

• For every user (self reported):
  ◦ Age
  ◦ Gender
  ◦ Location (Country, ZIP)
  ◦ Language
  ◦ IP address (we can do reverse GeoIP lookup)

• 150 GB of compressed logs per day
  ◦ Just copying over the network takes 8 to 10 hours
  ◦ Parsing and processing takes another 4 to 6 hours
• After parsing, collapsing, saving as binary and compressing: ~40 GB per day
• Collected data for all of June 2006:
  ◦ 1.3 TB of data

[Figure: user age distribution (self reported). Axes: age vs. count.]

[Figure: number of participants in the conversation. Axes: conversation size vs. count. Limit of 20 users per session.]

• Data for June 1:
  ◦ 982,005,323 sessions (conversations)
  ◦ 980,219,231 2-user conversations
  ◦ 471,837,591 conversations with 0 exchanged messages
  ◦ 508,315,719 “good” sessions
  ◦ 63,949,711 different users talking
  ◦ 65,921 unknown users talking (users who never log in)

• Over June 2006:
  ◦ 242,720,596 users logged in
  ◦ 179,792,538 users engaged in conversations
  ◦ 17,510,905 new users (never logged in before)
  ◦ More than 30 billion conversations

[Figures (4): plots over age, each with a scale running from Low to High]

Where are the users coming from?

• Using only 2-user conversations from June 2006 we build a graph:
  ◦ 179,792,538 nodes
  ◦ 1,342,246,427 edges
  ◦ 15,010,572,090 2-user conversations
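The graph construction can be sketched on a handful of sessions. The record layout below (one dictionary per session mapping a user id to the number of messages that user sent) is a hypothetical simplification of the per-user fields listed earlier, and requiring at least one sent message per edge follows the edge definition given for the Messenger network; whether the original pipeline applied exactly this filter is an assumption.

```python
def build_conversation_graph(sessions):
    """Build an undirected graph from 2-user sessions with at least one message.

    `sessions` is an iterable of dicts: user id -> messages sent in that session.
    Returns (nodes, edges, n_conversations).
    """
    nodes, edges = set(), set()
    n_conversations = 0
    for session in sessions:
        if len(session) != 2:
            continue  # keep only 2-user conversations
        if sum(session.values()) == 0:
            continue  # nothing was exchanged
        n_conversations += 1
        u, v = sorted(session)
        nodes.update((u, v))
        edges.add((u, v))  # repeated conversations collapse into one edge
    return nodes, edges, n_conversations


# Hypothetical toy sessions.
sessions = [
    {1: 3, 2: 1},
    {2: 0, 1: 2},
    {1: 1, 2: 1, 3: 1},   # three users: ignored
    {3: 0, 4: 0},         # no messages: ignored
    {4: 2, 5: 0},
]
nodes, edges, n_conversations = build_conversation_graph(sessions)
print(len(nodes), len(edges), n_conversations)
```

Because many conversations take place between the same pair of users, the number of edges is far smaller than the number of conversations, as the figures on the slide show.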

[Figure and table: hop plot. Pick a random node and count how many nodes are at distance 1, 2, 3, ... hops. Axes: distance (hops) vs. number of nodes; table columns: Hops, Nodes.]

• In ACTIVE we will perform analytics along three main dimensions:
  ◦ content (text, tags, semi-structured data)
  ◦ social network (graph of social linkages)
  ◦ time
• The content dimension is well studied and covered by many text-mining methods
• …the static social network analysis aspect will be covered well by existing methods
• …core research will happen on “dynamic social networks”

• Network analysis is a very active research topic at the intersection of several areas
  ◦ …the area deals primarily with graph representation, which is fundamental to many problems in nature and society
  ◦ …a currently hot research topic in network analysis is dealing with “dynamic networks”
  ◦ …in ACTIVE we will perform research and provide solutions for large dynamic social networks extracted from enterprise data