Decoding the Structure of the WWW: A Comparative Analysis of Web Crawls AUTHORS: M. Ángeles Serrano, Ana Maguitman, Marián Boguñá, Santo Fortunato, Alessandro Vespignani

Decoding the Structure of the WWW: A Comparative Analysis of Web Crawls AUTHORS: M. Ángeles Serrano, Ana Maguitman, Marián Boguñá, Santo Fortunato, Alessandro Vespignani

AGENDA: INTRODUCTION; MAIN OBJECTIVE; MAIN CONTRIBUTIONS; PREVIOUS RESEARCH AND RESULTS; RELATED WORK DONE AND RESULTS; IMPLEMENTATION; CONCLUSION AND FUTURE WORK

INTRODUCTION The topological structure of the World Wide Web can be characterized by the properties of its representative graphs. Vertices are identified with Web pages and directed edges with hyperlinks. A comparative analysis shows that different WWW graphs differ quantitatively and qualitatively depending on the domain and the crawl used to gather the data. Degree distributions, degree-degree correlation functions, and the statistics of reciprocal connections are used as measurements for the analysis of Web graphs.

CONT. The dynamical nature of the Web and its huge size make compressing, ranking, indexing, or mining the Web very difficult. Knowing the exact policies and strategies followed by crawl engines helps in decoding the huge structure of the WWW.

MAIN OBJECTIVE Give a clear picture of the reliability of the widely accepted large-scale statistical properties of the Web, and provide new measures to discover whether inconsistencies appear when the same properties are measured across different crawls.

MAIN CONTRIBUTIONS A careful comparative analysis of the large-scale structural and statistical properties of different Web graphs, with evident qualitative and quantitative differences across samples. Single- and two-vertex correlations and reciprocal links are introduced to describe the full connectivity pattern and structural ordering of the Web graph. All these properties depend on the communication patterns among the constituent sites of the network.

PREVIOUS RESEARCH AND RESULTS Earlier studies measured the directed degree distributions P(k_in) and P(k_out), where the in-degree k_in and the out-degree k_out are defined as the number of incoming and outgoing links, respectively, connecting a page to its neighbors. Kumar et al. [1999] worked on a big crawl of about 40M nodes, and Barabási and Albert [1999] on a smaller set of over 0.3M nodes restricted to the domain of the University of Notre Dame. Both studies reported a scale-free nature for the WWW, with power-law behavior for both the in- and out-degree distributions.
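To make the definitions of in-degree, out-degree, and their distributions concrete, here is a minimal Python sketch. It assumes the crawl is available as a list of (source, target) hyperlink pairs; the function name `degree_distributions` and the toy edge list are illustrative and are not taken from the paper.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (P_in, P_out): dicts mapping a degree k to the fraction of pages with that degree."""
    nodes = set()
    k_in, k_out = Counter(), Counter()
    for src, dst in edges:
        nodes.update((src, dst))
        k_out[src] += 1   # one more outgoing hyperlink for the source page
        k_in[dst] += 1    # one more incoming hyperlink for the target page
    n = len(nodes)
    # Pages that never appear as a target (or source) count with degree 0.
    hist_in = Counter(k_in[v] for v in nodes)
    hist_out = Counter(k_out[v] for v in nodes)
    return ({k: c / n for k, c in hist_in.items()},
            {k: c / n for k, c in hist_out.items()})

# Hypothetical toy crawl with four pages.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
p_in, p_out = degree_distributions(toy_edges)
```

On a real crawl these histograms would typically be plotted on log-log axes to inspect whether the tail is compatible with a power law.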

CONT. Broder et al. [2000] worked on two sets from AltaVista crawls, taken in May and October 1999, with 200 million pages and 1.5 billion links. They concluded that the structure of the Web was relatively insensitive to the particular large crawl used and that the connectivity structure was resilient to the removal of a significant number of nodes. Donato et al. [2004] worked along the same lines with a large 2001 data set of 200M pages and about 1.4 billion edges made available by the WebBase project at Stanford. The results were compared with those presented by Broder et al. [2000]. One of the reported differences is the deviation of the out-degree distribution from power-law behavior.

RELATED WORK DONE AND RESULTS Analysis and comparison of four data sets from different years, 2001 to 2004, and different domains: general, .uk, and .it. The sets were gathered within two different projects, the WebBase project and the WebGraph project, each with its own Web crawler, WebVac and UbiCrawler, respectively. Pages in the .uk domain may have a higher probability of pointing to pages outside the domain because they are written in English, while links in the Italian .it domain may be much more endogenous. This could potentially have a strong effect on the Web description derived from the data.

CONT. Measurements carried out during the research: 1. Structural Properties 2. Degree Correlation 3. The Role of Reciprocal Links

Table I. Number of Nodes and Edges of the Networks Considered, After Extracting Multiple Links and Self-Connections

Data set    WBGC01         WGUK02         WBGC03           WGIT04
# nodes     80,571,247     18,520,486     49,296,313       41,291,594
# links     752,527,660    292,243,663    1,185,396,953    1,135,718,909
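The preprocessing named in the caption of Table I can be sketched as follows. This assumes that "extracting" multiple links and self-connections means removing them before counting, and that the raw crawl is a list of (source, target) pairs; `clean_edges` is a hypothetical helper, not the authors' tool.

```python
def clean_edges(raw_edges):
    """Drop self-connections and collapse repeated links, keeping first-seen order."""
    seen = set()
    cleaned = []
    for src, dst in raw_edges:
        if src == dst:
            continue              # self-connection: a page linking to itself
        if (src, dst) in seen:
            continue              # multiple link: same directed pair seen before
        seen.add((src, dst))
        cleaned.append((src, dst))
    return cleaned

# Hypothetical raw crawl with one repeated link and one self-connection.
raw = [("a", "b"), ("a", "b"), ("b", "b"), ("b", "a")]
cleaned = clean_edges(raw)
```

Note that ("a", "b") and ("b", "a") are kept as two distinct directed links; together they form a reciprocal connection.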

STRUCTURAL PROPERTIES Connected Components
1. Strongly Connected Component (SCC): All pages mutually connected by a directed path.
2. In-Component (IN): Vertices from which it is possible to reach the SCC using a directed path.
3. Out-Component (OUT): Vertices which can be reached from the SCC using a directed path.
4. Tendrils: Pages which cannot reach the SCC and cannot be reached from it.
5. Tubes: Directly connect the IN and OUT components without crossing the SCC.
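The component definitions above can be illustrated with a short Python sketch based on forward and backward reachability. It assumes a seed page known to lie inside the giant SCC; tendrils, tubes, and disconnected pages are lumped together under a hypothetical "OTHER" label. This is a toy illustration, not the procedure used in the paper.

```python
from collections import defaultdict, deque

def reachable(start, adjacency):
    """Breadth-first search: every node reachable from start via adjacency."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges, seed):
    """Split the nodes into SCC, IN, OUT and OTHER relative to the seed's SCC."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)
        nodes.update((src, dst))
    forward = reachable(seed, succ)    # pages the seed can reach
    backward = reachable(seed, pred)   # pages that can reach the seed
    scc = forward & backward
    return {"SCC": scc,
            "IN": backward - scc,
            "OUT": forward - scc,
            "OTHER": nodes - forward - backward}

# Hypothetical toy graph: i -> {a <-> b} -> o, a tendril t, and a tube u.
toy = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"),
       ("i", "t"), ("i", "u"), ("u", "o")]
parts = bow_tie(toy, seed="a")
```

For graphs of the size listed in Table I, a linear-time SCC algorithm such as Tarjan's would be used to find the giant SCC instead of relying on a known seed.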

CONT. Table II. Sizes of the IN, OUT, and SCC Components and Their Union, MAIN

CONT. Crawlers perform a directed exploration: they follow outgoing hyperlinks to reach the pages pointed to, but cannot navigate backwards along incoming hyperlinks. In summary, the structure of Web graphs is strongly dependent on the data set considered. Degree Distribution: For directed networks, the in-degree distribution P(k_in) and the out-degree distribution P(k_out) are the probabilities of a page having k_in incoming links and k_out outgoing links, respectively. The in-degree of a vertex is the total number of hyperlinks pointing to it from all other Web pages in the WWW. There is no limit to the number of incoming hyperlinks, which is determined only by the popularity of the Web page itself. The out-degree is determined by the number of hyperlinks present in the page, which is controlled by Web administrators.
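The crawling limitation described at the start of this slide can be sketched in a few lines of Python. The example assumes a toy Web stored as a dictionary from each page to the pages it links to; `crawl` and the page names are hypothetical and do not describe WebVac or UbiCrawler.

```python
from collections import deque

def crawl(seed, outlinks):
    """Breadth-first crawl that only follows outgoing hyperlinks from the seed."""
    discovered = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in outlinks.get(page, []):
            if target not in seen:
                seen.add(target)
                discovered.append(target)
                queue.append(target)
    return discovered

# "upstream" links to "hub", but nothing reachable from "hub" links back to it.
web = {"upstream": ["hub"], "hub": ["a", "b"], "a": ["hub"], "b": []}
found = crawl("hub", web)
```

Because the crawler never traverses a link backwards, the page "upstream" stays invisible when crawling from "hub", which is one reason the measured size of the IN component can depend on the seeds and policy of the crawl.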

DEGREE CORRELATION Correlations between in- and out-degrees help determine whether or not the network has a bow-tie structure and are useful for quantitative model validation. 1. Single-Vertex Degree Correlation: Shows that more popular pages tend to point to a higher number of other pages. This positive correlation holds for a range of in-degrees that spans from k_in = 1 to k_in = 10^2 ∼ 10^3, depending on the specific set. The set for the Italian domain is noisier, but the pattern appears to be independent of the crawl used to gather the data.
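One simple way to look at the single-vertex correlation is to average the out-degree over all pages that share the same in-degree. The sketch below does this for a list of (source, target) pairs; the function name and toy data are illustrative assumptions, and the paper's exact estimator may differ.

```python
from collections import Counter, defaultdict

def mean_out_degree_by_in_degree(edges):
    """Map each in-degree k_in to the average out-degree of pages with that k_in."""
    k_in, k_out = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        k_out[src] += 1
        k_in[dst] += 1
    totals, counts = defaultdict(int), defaultdict(int)
    for v in nodes:
        totals[k_in[v]] += k_out[v]
        counts[k_in[v]] += 1
    return {k: totals[k] / counts[k] for k in counts}

# Hypothetical toy crawl.
toy = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"),
       ("c", "a"), ("c", "b"), ("c", "d")]
curve = mean_out_degree_by_in_degree(toy)
```

A curve that increases with k_in corresponds to the positive correlation described on the slide.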

CONT. 2. Two-Vertex Degree Correlation: The implications of the two-vertex degree correlation can help in the study of PageRank, as the analysis divides the neighborhood of each single node i into neighboring nodes connected to it by incoming links and by outgoing links. It relates the popularity of Web pages, measured as the number of pages pointing to them, to the popularity of their neighbors.
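A minimal sketch of the neighborhood split behind two-vertex correlations: for each page, compute the average in-degree of the pages that point to it and, separately, of the pages it points to. The function name, the choice of in-degree as the neighbor property, and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict

def neighbor_in_degree(edges):
    """For each node return (mean k_in of predecessors, mean k_in of successors)."""
    succ, pred = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
        pred[dst].add(src)
    nodes = set(succ) | set(pred)
    k_in = {v: len(pred[v]) for v in nodes}

    def mean(values):
        return sum(values) / len(values) if values else None

    return {v: (mean([k_in[u] for u in pred[v]]),
                mean([k_in[u] for u in succ[v]]))
            for v in nodes}

# Hypothetical toy graph: a -> c, b -> c, c -> d.
result = neighbor_in_degree([("a", "c"), ("b", "c"), ("c", "d")])
```

Averaging these per-node values over all nodes with the same degree would give a correlation curve comparable in spirit to the single-vertex one.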

ROLE OF RECIPROCAL LINKS Reciprocal links play an important role as percolation catalysts, in the fine structure of the Web components, and in the navigability of the Web.
1. Degree Distribution: Depends on the crawl examined.
2. Single-Vertex Degree Correlation: No clear relation between nonreciprocal in- and out-degrees, but there is a positive correlation between reciprocal and nonreciprocal in-degrees.
3. Degree-Degree Correlation: Shows that most of the correlations in Web graphs are found between vertices connected by reciprocal links, which depends on the Web structure.
Reciprocal Subgraph: The reciprocal subgraph is organized as a set of star-like structures combined with cliques, or communities, of highly interconnected pages.
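Extracting the reciprocal subgraph can be sketched as follows, treating a link as reciprocal when the opposite directed link is also present. The reciprocity value computed here (the fraction of directed links that belong to a reciprocal pair) is a common summary and is an assumption of this sketch, not necessarily the quantity reported in the paper.

```python
def reciprocal_subgraph(edges):
    """Return (set of reciprocal pairs, fraction of directed links that are reciprocated)."""
    edge_set = {(s, d) for s, d in edges if s != d}      # ignore self-connections
    pairs = {frozenset((s, d)) for s, d in edge_set if (d, s) in edge_set}
    reciprocity = 2 * len(pairs) / len(edge_set) if edge_set else 0.0
    return pairs, reciprocity

# Hypothetical toy graph with two reciprocal pairs: a <-> b and c <-> d.
toy = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c"), ("a", "d")]
pairs, reciprocity = reciprocal_subgraph(toy)
```

The undirected graph formed by `pairs` is the reciprocal subgraph on which stars and cliques can then be searched for.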

STATISTICAL PROPERTIES OF THE RECIPROCAL SUBGRAPH Average degree q_r, maximum degree q_r^max, standard deviation σ_r, heterogeneity parameter κ_r, and maximum likelihood estimate of the exponent of the power-law in-degree distribution γ_r (precision error ±0.1). (The symbol ∞ means that the distribution decays faster than a power law.)
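The quantities named in this caption can be computed from a list of reciprocal degrees as sketched below. Two points are assumptions of this sketch rather than statements from the slides: the heterogeneity parameter is taken as the ratio of the second moment to the mean, and the exponent uses the standard continuous maximum likelihood formula with a hypothetical cutoff `q_min`.

```python
import math
import statistics

def degree_stats(degrees, q_min=1):
    """Summary statistics of a degree sequence (sketch under the stated assumptions)."""
    mean = statistics.fmean(degrees)
    second_moment = statistics.fmean([q * q for q in degrees])
    tail = [q for q in degrees if q >= q_min]
    log_sum = sum(math.log(q / q_min) for q in tail)
    # Continuous power-law MLE: gamma = 1 + n / sum(ln(q_i / q_min)).
    gamma = 1 + len(tail) / log_sum if log_sum > 0 else None
    return {"mean": mean,
            "max": max(degrees),
            "std": statistics.pstdev(degrees),
            "kappa": second_moment / mean,   # assumed definition: <q^2> / <q>
            "gamma": gamma}

# Hypothetical toy degree sequence.
stats = degree_stats([1, 1, 2, 4])
```

For discrete degree data and small `q_min`, the continuous formula is only approximate, so values obtained this way should be read as rough estimates.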

CONCLUSION AND FUTURE WORK

CONT. Despite the approximate view of the Web provided by Web crawlers, an exact, definitive description of its large-scale properties and architecture is still lacking, even though these properties can affect navigation, indexing, searching, and mining. FUTURE WORK: Differences among crawls should be further investigated in relation to the crawling policies adopted in the design of the engines. The role of reciprocal links has to be explored in detail.