Sampling online communities: using triplets as basis for a (semi-) automated hyperlink web crawler. Yoann VENY Université Libre de Bruxelles (ULB) - GERME.

Presentation transcript:

Sampling online communities: using triplets as basis for a (semi-) automated hyperlink web crawler. Yoann VENY, Université Libre de Bruxelles (ULB) - GERME. This research is funded by the FRS-FNRS. Paper presented at the 15th General Online Research Conference, 4-6 March, Mannheim.

Online communities – a theoretical definition
What is an online community?
- “social aggregations that emerge from the Net when enough people carry on those public discussions long enough, with sufficient human feeling, to form webs of personal relationships in cyberspace” (Rheingold 2000)
- long-term involvement (Jones 2006)
- sense of community (Blanchard 2008)
- temporal perspective (Lin et al. 2006)
These aspects are probably important, but the first step should be to take the ‘hyperlink environment’ into account → a graph analysis / SNA issue

Online communities – a graphical definition (1)
Community = more ties among members than with non-members
Three general classes of ‘community’ definition in graph partitioning algorithms (Fortunato 2010):
- a local definition: focus on sub-graphs (e.g. cliques, n-cliques (Luce, 1950), k-plexes (Seidman & Foster, 1978), lambda sets (Borgatti et al., 1990), …)
- a global definition: focus on the graph as a whole (is the observed graph significantly different from a random graph, e.g. an Erdős–Rényi graph?)
- vertex similarity: focus on actors (e.g. Euclidean distance & hierarchical clustering, max-flow/min-cut (Elias et al., 1956; Flake et al., 2000))
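The informal criterion "more ties among members than with non-members" can be checked by simple counting. The following is a minimal sketch, not taken from the presentation: the function name `tie_counts`, the edge-list representation and the toy data are all assumptions made for illustration.

```python
def tie_counts(edges, members):
    """Count ties inside a candidate group and ties crossing its boundary.

    edges   -- iterable of (source, target) hyperlinks (hypothetical toy format)
    members -- the sites proposed as one community
    """
    members = set(members)
    internal = external = 0
    for source, target in edges:
        source_in = source in members
        target_in = target in members
        if source_in and target_in:
            internal += 1          # tie among members
        elif source_in or target_in:
            external += 1          # tie between a member and a non-member
        # ties between two non-members are ignored
    return internal, external


# Toy hyperlink network (illustrative data only)
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
internal, external = tie_counts(edges, {"a", "b", "c"})
is_community_like = internal > external
```

Here the group {a, b, c} has three internal ties and one boundary tie, so it satisfies the informal criterion. The local, global and vertex-similarity definitions listed above each refine this basic comparison in a different way.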

Online communities – a graphical definition (2)
Two main problems of graph partitioning in a hyperlink environment:
1) network size and form (e.g. tree structure)
2) edge direction
→ communities are better discovered with an efficient web crawler

Web crawling - Generalities
The general idea of a web crawling process (source: Jacomy & Ghitalla (2007)):
- We have a number of starting blogs (seeds)
- All hyperlinks are retrieved from these seed blogs
- For each new website discovered, decide whether this new site is accepted or refused
- If the site is accepted, it becomes a seed and the process is reiterated on this site.
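The four steps above can be sketched as a short loop. This is an illustrative sketch only: `crawl`, `get_links`, `accept` and the toy `web` dictionary are hypothetical names, and a real crawler would fetch pages over HTTP instead of reading a dictionary.

```python
def crawl(seeds, get_links, accept, max_iterations=10):
    """Generic seed-based crawl.

    seeds     -- starting sites
    get_links -- function: site -> iterable of sites it links to (assumed)
    accept    -- function: site -> True if the site is accepted (assumed;
                 in a manual crawler this is a human decision)
    """
    accepted = set(seeds)
    frontier = set(seeds)
    edges = set()
    for _ in range(max_iterations):
        if not frontier:
            break                              # closure: nothing new to visit
        next_frontier = set()
        for site in frontier:
            for target in get_links(site):
                if target == site:
                    continue                   # ignore self-links
                if target in accepted:
                    edges.add((site, target))
                elif accept(target):
                    accepted.add(target)       # accepted site becomes a seed
                    next_frontier.add(target)
                    edges.add((site, target))
        frontier = next_frontier
    return accepted, edges


# Toy web (illustrative data only)
web = {
    "seed1": ["blogA", "shop"],
    "blogA": ["blogB", "seed1"],
    "blogB": [],
    "shop": ["blogC"],
}
sites, links = crawl(["seed1"], lambda s: web.get(s, []),
                     lambda s: s.startswith("blog"))
```

In this toy run `shop` is refused, so `blogC`, which is reachable only through it, is never discovered. This illustrates how much the sampled network depends on each accept/refuse decision.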

Web crawling – constraint-based web crawler (1)
Two problems of a manual crawler:
- number and quality of decisions
- closure?
A solution: take advantage of the local structural properties of a network
→ Assume that a network is the outcome of the aggregation of local social processes. Examples in SNA:
- general philosophy of ERG models (see e.g. Robins et al. 2007)
- local clustering coefficient (see e.g. Watts & Strogatz, 1998)
→ Constrain the crawler to identify local social structures (e.g. triangles, mutual dyads, transitive triads, …)
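As a reminder of what a "local structural property" looks like, the local clustering coefficient of a node is the share of pairs of its neighbours that are themselves tied. The sketch below uses the standard undirected form, C = 2T / (k(k - 1)), with k neighbours and T ties among them. The function name and the toy adjacency data are assumptions for illustration, not the presentation's code.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Local clustering coefficient of `node` in an undirected graph.

    adjacency -- dict: node -> set of neighbours (assumed symmetric)
    """
    neighbours = adjacency[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0                      # no pair of neighbours to close
    tied_pairs = sum(1 for u, v in combinations(neighbours, 2)
                     if v in adjacency[u])
    return 2.0 * tied_pairs / (k * (k - 1))


# Toy undirected network (illustrative data only)
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
coefficient_a = local_clustering(adjacency, "a")
```

Node `a` has three neighbours and only the pair (b, c) is tied, so one of its three possible triangles is closed.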

Web crawling – constraint-based web crawler (2)
An example of a constrained web crawler based on the identification of triangles
Generalisation
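The slide's diagram is not reproduced in the transcript, so the following is only one possible reading of a triangle constraint, not the author's sampler. It assumes a candidate site is accepted when it is tied to two already accepted sites that are themselves tied, with link direction ignored. All names and the toy data are hypothetical.

```python
def closes_triangle(candidate, accepted, links):
    """True if `candidate` forms a triangle with two accepted sites.

    links -- dict: site -> set of sites it links to (assumed already fetched);
             direction is ignored in this sketch.
    """
    def tied(a, b):
        return b in links.get(a, set()) or a in links.get(b, set())

    partners = [site for site in accepted if tied(candidate, site)]
    return any(tied(u, v)
               for i, u in enumerate(partners)
               for v in partners[i + 1:])


def triangle_crawl(seeds, links, max_iterations=10):
    """Grow the sample only with sites that close a triangle."""
    accepted = set(seeds)
    for _ in range(max_iterations):
        candidates = {target
                      for site in accepted
                      for target in links.get(site, set())} - accepted
        new_sites = {c for c in candidates
                     if closes_triangle(c, accepted, links)}
        if not new_sites:
            break                        # closure reached
        accepted |= new_sites
    return accepted


# Toy hyperlink data (illustrative only)
links = {
    "s1": {"s2", "x"},
    "s2": {"x", "y"},
    "x": {"s1"},
    "y": set(),
}
sample = triangle_crawl({"s1", "s2"}, links)
```

In the toy data `x` is accepted because it is tied to both seeds, which are tied to each other. `y` is tied to one accepted site only and is refused. Swapping `closes_triangle` for another local test (mutual dyads, directed triplets) would be one way to generalise the same loop.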

Experimental results - method

Experimental results – results (1)
Starting set: 6 « political ecological » blogs
Remarks:
- dyad and triplet samplers → closure
- unsupervised and triangle samplers → manually stopped

Experimental results – results (2)
Figure panels: Triangles, Dyads, Triplets

- Unsupervised crawler: not manageable (… actors after 4 iterations!)
- Dyads: did not select ‘authoritative’ sources; sensitive to the number of seeds?
- Triplets: seem to be the best solution. They take tie direction into account, take advantage of authoritative sources, and are conservative.
- Triangles: problem of network size, but the sampled network can have interesting properties.

Conclusion and further research
Pitfalls to avoid:
- Not necessarily all relevant information is in the core: there is a lot of information in the periphery of this core.
- Based on human behaviour patterns: not adapted at all to other kinds of networks (word occurrences, protein chains, …)
- Do not throw away more classical graph partitioning methods
- Always question your results: how to assess the efficiency of a crawler? Should communities in a web graph always be topic-centred?
Further research:
- Analysis and detection of ‘multi-core’ networks
- ‘Random walks’ in complete networks to find recursive patterns using T.C. assumptions
- Code of the samplers in ‘R’