Sampling a web subgraph Paraskevas V. Lekeas Proceedings of the 5 th Algorithms, Scientific Computing, Modeling and Simulation (ASCOMS), Web conference,


Sampling a web subgraph
Paraskevas V. Lekeas
Proceedings of the 5th Algorithms, Scientific Computing, Modeling and Simulation (ASCOMS), Web conference, New York, USA, Sept , 2003.

Web Sampling
In order to study the web we have to crawl it.
We cannot exhaustively crawl the whole web because
i) it is very big
ii) it grows exponentially
Instead, we use sampling techniques to collect representative samples (pages) of the web and then study these pages.
The 2 main methods of web sampling are
i) “stochastic sampling” (random walks)
ii) “deterministic sampling” (IP sampling)

Stochastic Sampling
A stochastic sampler starts from a node of the web graph (pages are nodes, links are edges), picks (with some probability) a link in that node, follows it, visits another node, and so on.
The sampler stops when it reaches the equilibrium distribution (if the transition matrix of the process is P and the sampler is at state π, then the equilibrium distribution is a state for which π = πP) and outputs the sample (all the visited nodes).
Problems are
i) We need connectivity (links) between nodes
ii) We don’t know how to choose a node uniformly at random to start the stochastic sampler
iii) We don’t know how long it takes to reach the equilibrium distribution
iv) There is statistical dependency among the nodes that the sampler visits (no clean statistics)
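
The random walk and the condition π = πP can be illustrated with a minimal sketch. The three-node toy graph, the uniform choice among out-links, and the function names below are assumptions made for illustration only; they are not taken from the slides.

```python
import random

# Hypothetical toy web graph: node -> list of nodes it links to.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def random_walk_sample(graph, start, steps, seed=0):
    """Follow up to `steps` out-links chosen uniformly at random; return visited nodes."""
    rng = random.Random(seed)
    node = start
    visited = [node]
    for _ in range(steps):
        links = graph[node]
        if not links:  # dead end: the walk cannot continue
            break
        node = rng.choice(links)
        visited.append(node)
    return visited

def stationary_distribution(graph, iterations=1000):
    """Approximate pi with pi = pi P by power iteration, where P[u][v] = 1/outdeg(u).

    Assumes every node has at least one out-link and the chain is
    irreducible and aperiodic, so the iteration converges.
    """
    nodes = sorted(graph)
    pi = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            share = pi[u] / len(graph[u])
            for v in graph[u]:
                nxt[v] += share
        pi = nxt
    return pi

walk = random_walk_sample(toy_graph, "a", steps=20)
pi = stationary_distribution(toy_graph)
```

On a graph this small the equilibrium can be computed directly. On the real web the transition matrix is unknown, which is why problem iii) above has no easy answer.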

Deterministic Sampling
A deterministic sampler does not sample the web graph but the IPv4 (Internet Protocol version 4) address space.
The sampler collects IPs from the IPv4 space (pre-sample) and converts them into their web representation (final-sample).
Problems are
i) difficulties in accessing many hosts when converting the IP addresses into web nodes
ii) multihosting (one IP may belong to various web nodes but the resolution mechanism shows only one node)
iii) scalability problems (the new internet IPv6)

Sampling a web Subgraph 1/4
Say we want to study a web subgraph (for example a country code Top Level Domain such as .gr, .uk etc.).
We can’t use a stochastic sampler: if we start it from a node inside the domain, the sampler is not going to stay there (and if we force the sampler to stay inside, we ruin the stochasticity of the process).
We also can’t use a deterministic sampler as it is, since IPv4 is a huge pool of IPs and our subgraph contains only a small part of them.
In this work we built a modified deterministic sampler that solves the above problem.

Sampling a web Subgraph 2/4
[Diagram: IP addresses of web subgraph + random number generator -> pre-sample (IP addresses) -> Resolver -> final-sample (web nodes)]
The sampler gets as input the IP addresses of the subgraph (population). The IPs of the subgraph are collected from Regional Internet Registries (such as RIPE).

Sampling a web Subgraph 3/4
The sampler uses sampling theory to compute the size of the sample, produces the appropriate amount of random numbers and draws a pre-sample of IP addresses.

Sampling a web Subgraph 4/4
The sampler resolves the pre-sample and outputs the final sample, which contains web nodes (pages).
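
The three stages (population of IPs, pre-sample, final sample) can be sketched as follows. This is only an illustration under stated assumptions: the address blocks are reserved documentation ranges rather than real registry data, the blocks are assumed not to overlap, and `fake_resolver` is a hypothetical stand-in for the resolver, which in practice would have to contact each host.

```python
import ipaddress
import random

def draw_pre_sample(cidr_blocks, n, seed=0):
    """Draw n distinct IPv4 addresses uniformly at random from the given blocks.

    Assumes the blocks do not overlap, so every address is equally likely.
    """
    networks = [ipaddress.ip_network(block) for block in cidr_blocks]
    sizes = [net.num_addresses for net in networks]
    rng = random.Random(seed)
    picks = rng.sample(range(sum(sizes)), n)  # n distinct positions in the population
    pre_sample = []
    for k in picks:
        # Map the position k to the block that contains it.
        for net, size in zip(networks, sizes):
            if k < size:
                pre_sample.append(net[k])
                break
            k -= size
    return pre_sample

def resolve(pre_sample, resolver):
    """Keep only the addresses that the resolver maps to a web node."""
    nodes = (resolver(ip) for ip in pre_sample)
    return [node for node in nodes if node is not None]

def fake_resolver(ip):
    """Hypothetical resolver: pretends that even-numbered addresses host a web node."""
    return f"http://{ip}/" if int(ip) % 2 == 0 else None

# Hypothetical population: two documentation-only blocks, not real subgraph data.
blocks = ["192.0.2.0/24", "198.51.100.0/24"]
pre_sample = draw_pre_sample(blocks, 10)
final_sample = resolve(pre_sample, fake_resolver)
```

Passing the resolver in as a function keeps the sampling step free of network access, so the random draw can be checked on its own.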

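Testing the Sampler (test 1)
We want to predict the percentage of web nodes in a domain (.gr). Say that inside this domain there exist N IPs; some of them are web nodes while others are not.
Define an indicator variable for each IP (the formula is not preserved in the transcript). Summing it over all the IPs then gives the total number of web nodes in N.
An estimator of the percentage p of web nodes in N is the corresponding proportion in the sample.
The size n of the sample we need to draw in order to estimate p with error of magnitude B depends on N, p, q and B, where q = 1 - p (the formula is not preserved in the transcript).
From the above we obtained an estimate for late 2002 (the value is not preserved in the transcript) which agrees with RIPE statistics for the same period.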
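
Since the slide's formulas did not survive, the sketch below uses the standard textbook expressions for simple random sampling without replacement from a finite population. That choice is an assumption and may differ from what the slide showed. The sample proportion estimates p, and the sample size is n = Npq / ((N - 1)D + pq) with D = B²/4. The function names and the numbers in the example are hypothetical.

```python
import math

def estimate_proportion(indicators):
    """Sample proportion: mean of 0/1 indicators (1 = the IP is a web node)."""
    return sum(indicators) / len(indicators)

def sample_size(N, p, B):
    """Textbook sample size for estimating a proportion p with error bound B.

    Assumed formula (not confirmed by the slides):
        n = N*p*q / ((N - 1)*D + p*q),  q = 1 - p,  D = B**2 / 4
    The result is rounded up to a whole number of IPs.
    """
    q = 1.0 - p
    D = B ** 2 / 4.0
    return math.ceil(N * p * q / ((N - 1) * D + p * q))

# Hypothetical numbers: 1000 IPs, no prior knowledge of p (so p = 0.5), bound B = 0.05.
n = sample_size(N=1000, p=0.5, B=0.05)
p_hat = estimate_proportion([1, 0, 0, 1])
```

Using p = 0.5 when nothing is known about p gives the largest, and therefore safest, sample size under this formula.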

Testing the Sampler (test 2)
The out-degree distribution of the sample obeys a power law, which is an intrinsic property of the web graph.
[Plot: out degrees, InTree links chopped; (x, y) = (log degree, log rank); Fit: 11, x]
The roughly linear plot is skewed at y=4, and this is due to a porn site with hundreds of repetitions of the same links.
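
A rough way to check for a power law is to fit a straight line to the (log degree, log rank) points, as in the plot above. The sketch below uses ordinary least squares for this. It is an illustration only: the fitting method used for the slide is not stated, and the function names and test data are assumptions.

```python
import math

def rank_degree_points(out_degrees):
    """Return (log10 degree, log10 rank) points, with rank 1 for the largest degree.

    Nodes with out-degree 0 are skipped because log10(0) is undefined.
    """
    positive = sorted((d for d in out_degrees if d > 0), reverse=True)
    return [(math.log10(d), math.log10(rank))
            for rank, d in enumerate(positive, start=1)]

def fit_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical out-degrees; a real run would use the degrees of the final sample.
points = rank_degree_points([1, 100, 10, 0])
slope, intercept = fit_line([x for x, _ in points], [y for _, y in points])
```

A roughly straight line with negative slope on these axes is consistent with a power law. A single page with many repeated links, like the one mentioned above, shows up as a deviation from that line.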

Uses of the sampler
The above sampler
i) can help us collect information about web communities or validate laws in internet domains
ii) can be used as input to stochastic samplers, which need to start from random sets of web nodes
iii) can be used as a crawler if we force it not to draw samples, but to exhaustively visit all the IP addresses that we give to it