Published by Audrey Burke. Modified over 9 years ago.
1
Addressing Incompleteness and Noise in Evolving Web Snapshots KJDB2007 Masashi Toyoda IIS, University of Tokyo
2
Web as a projection of the world
The Web now reflects various events in the real and virtual world
Evolution of past topics can be tracked by observing the Web
Identifying and tracking new information is important for observing new trends
–Sociology, marketing, and survey research
[Figure: events (war, tsunami, sports, computer virus) reflected in online news, weblogs, and BBS]
3
Massive Periodic Crawling for Observing Trends on the Web
[Figure: a crawler archives the WWW at times T1, T2, ..., TN; the archived snapshots are compared]
4
Web Archive
5
Observing Trends on the Web
WebRelievo [Toyoda 2005]
–Evolution of link structure
6
Issues of Observing Evolution
Incompleteness of snapshots
・ Cannot crawl the entire Web
・ Time of creation/deletion is uncertain for many pages
Spam & mirror sites
・ Increasing number of spam sites that deceive search engines (9% to 25% of sites)
・ Many mirror sites (22% to 29% of pages)
[Figure: example of link spamming]
7
What's Really New on the Web? Identifying New Pages from a Series of Unstable Web Snapshots [WWW2006] Masashi Toyoda and Masaru Kitsuregawa IIS, University of Tokyo
8
Problems in Massive Periodic Crawling
The whole of the Web cannot be crawled
–# of uncrawled pages overwhelms # of crawled pages even after crawling 1B pages [Eiron et al. 2004]
Web sites may be temporarily unavailable
–Server and network troubles
Novelty of a page crawled for the first time remains uncertain
–The page might have existed at the previous crawl time
–“Last-Modified” time guarantees only that the page is older than that time
9
Our Contribution
Propose a novelty measure for estimating the certainty that a newly crawled page is really new
–New pages can be extracted from a series of unstable snapshots
Evaluate the precision, recall, and miss rate of the novelty measure
Apply the novelty measure to our Web time machine application
10
Novelty Measure Old and Unknown Pages
11
Novelty Measure
If all in-links come from pages crawled the last 2 times (L2(t)), N(p) ≒ 1
[Figure: page p at time t whose new in-links all come from pages in L2(t), crawled at both t-1 and t]
12
Novelty Measure
N(p) is discounted when the novelty of some in-links is unknown
[Figure: page p at time t with an in-link from a page q of unknown novelty; N(p) ≒ 0.75]
13
Novelty Measure
If some in-links come from U(t), N(p) ≒ ?
[Figure: page p at time t with an in-link from a page q in U(t) whose own novelty is unknown]
14
Novelty Measure
Determine the novelty measure recursively
–N(q) ≒ 0.5
–N(p) ≒ (3 + 0.5) / 4
[Figure: the novelty N(q) of the unknown page q is propagated to p along the link from q to p]
15
Definition of Novelty Measure
δ: damping factor
–probability that there were links to p before t-1
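The formula itself is not preserved in this transcript. The following is a minimal sketch of a recursive novelty computation, assuming N(p) is δ times the average contribution of p's in-links: 1 for a link known to be new, 0 for a link known to be old, and N(q) for a link from a page q of unknown novelty. That form is an assumption chosen to reproduce the worked value (3 + 0.5) / 4 from the previous slide; the function name, argument layout, and iteration count are hypothetical.

```python
def novelty(in_links, known, delta=1.0, iterations=50):
    """Iteratively estimate N(p) for pages whose novelty is unknown.

    in_links: dict mapping a page to the list of pages linking to it.
    known:    dict mapping a source page to a fixed contribution
              (1.0 = its link is known to be new, 0.0 = known to be old).
    Sources missing from `known` are unknown pages; their own N(q) is
    used, which makes the definition recursive.
    ASSUMPTION: N(p) = delta * mean(contribution of each in-link).
    """
    n = {p: 0.0 for p in in_links}
    for _ in range(iterations):
        updated = {}
        for p, sources in in_links.items():
            if not sources:
                updated[p] = 0.0
                continue
            total = sum(known[q] if q in known else n.get(q, 0.0)
                        for q in sources)
            updated[p] = delta * total / len(sources)
        n = updated
    return n


# p has three new in-links and one in-link from the unknown page q;
# q has one new in-link and one old in-link.
scores = novelty(
    in_links={"p": ["a", "b", "c", "q"], "q": ["d", "e"]},
    known={"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 0.0},
)
```

With δ = 1 this yields N(q) = 0.5 and N(p) = 0.875, matching the slide's example; a smaller δ discounts every score uniformly.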
16
Experiment: Data Set
A massively crawled Japanese web archive
Time   Period       Crawled pages   Links
1999   Jul to Aug   17M             120M
2000   Jun to Aug   17M             112M
2001   Oct          40M             331M
2002   Feb          45M             375M
2003   Feb          66M             1058M
2003   Jul          97M             1589M
2004   Jan          81M             3452M
2004   May          96M             4505M
17
Experiment: Precision
Given threshold θ, p is judged to be novel when θ < N(p)
–Precision: #(correctly judged) / #(judged to be novel)
–Recall: #(correctly judged) / #(all novel pages)
Use URLs including dates as a golden set
–Assume that they appeared at the time included in the URL
–E.g. http://foo.com/2004/05
–Patterns: YYYYMM, YYYY/MM, YYYY-MM
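The evaluation described on this slide can be sketched as follows. This is an illustration only: the regular expression is one possible reading of the date patterns (year and month, optionally separated by "/" or "-"), and the names `DATE_IN_URL`, `url_month`, and `precision_recall` are hypothetical.

```python
import re

# ASSUMPTION: a 4-digit year (19xx or 20xx) followed by a 2-digit month,
# optionally separated by "/" or "-", and not embedded in a longer number.
DATE_IN_URL = re.compile(r"(?<!\d)((?:19|20)\d{2})[/-]?(0[1-9]|1[0-2])(?!\d)")


def url_month(url):
    """Return (year, month) if the URL contains a date pattern, else None."""
    match = DATE_IN_URL.search(url)
    return (int(match.group(1)), int(match.group(2))) if match else None


def precision_recall(novelty, truly_new, theta):
    """Judge p novel when theta < N(p), then score against the golden set.

    novelty:   dict mapping page -> N(p)
    truly_new: set of pages that the golden set marks as new
    """
    judged = {p for p, n in novelty.items() if n > theta}
    correct = judged & truly_new
    precision = len(correct) / len(judged) if judged else 0.0
    recall = len(correct) / len(truly_new) if truly_new else 0.0
    return precision, recall
```

For example, `url_month("http://foo.com/2004/05")` gives `(2004, 5)`, and sweeping `theta` over a range of values traces out a precision curve for the measure.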
18
Experiment: Precision
Precision jumps from the baseline when θ becomes positive, then gradually increases
Positive novelty provides 80% to 90% precision
19
A Large-Scale Study of Link Spam Detection by Graph Algorithms AIRWEB’07 Workshop, WWW2007 Hiroo Saito University of Tokyo. JST, ERATO Masashi Toyoda University of Tokyo Masaru Kitsuregawa University of Tokyo Kazuyuki Aihara University of Tokyo. JST, ERATO
20
Outline
Propose a link farm detection method using graph algorithms
1. SCC decomposition
–Around the largest SCC (CORE), large SCCs are link farms
2. Maximal clique enumeration
–Link farms in CORE can be found as maximal cliques
3. Minimum cut
–Link farms are expanded by min-cut. How many links for cutting them out?
Distribution of detected link farms in the Web graph structure
21
Dataset
Japanese Web archive crawled in May 2004
–96 million pages, 4.5 billion links
–60% of pages in Japanese, 40% in other languages
Site graph
–Top of site: URL linked from 3 or more servers
–A site is a set of URLs below the top URL
–5.9 million sites, 283 million links
[Figures: Domains; Degree]
22
SCC decomposition
Size distribution follows the power law (1 ≦ n ≦ 100) with a long and thick tail
Large SCCs (100 < n) are spam
–552 SCCs, 0.57M sites
–550 sample sites
[Figure: sampling results]
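As an illustration of this step, here is a minimal sketch of SCC decomposition on a directed site graph. It uses Kosaraju's two-pass algorithm, which is a standard choice and not necessarily the one used in the study; the function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Return the SCCs of a directed graph as lists of nodes (Kosaraju)."""
    graph, reverse = defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)

    # Pass 1: iterative DFS, recording nodes in order of completion.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
sccs = strongly_connected_components(toy_nodes, toy_edges)
# Candidate link farms are the components above a size threshold
# (the slide uses 100 < n; 2 is used here only to suit the toy graph).
large = [c for c in sccs if len(c) > 2]
```

The iterative DFS avoids Python's recursion limit, which matters on graphs with millions of sites.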
23
Distribution of SCCs in the bow tie
Bow-tie structure [Broder et al. 2000]
Component   Sites   Share
CORE        1.78M   30%
IN          0.05M   1%
OUT         3.50M   60%
Tendrils    0.14M   2%
Disc.       0.40M   7%
Distribution of large SCCs
–450 / 552 (81%) SCCs in OUT
–385 / 450 (85%) SCCs directly connected to CORE
CORE has many spam sites connecting to them
[Figure: SCCs whose size is larger than 1,000]
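The bow-tie regions can be derived from reachability relative to CORE. The sketch below assumes CORE (the largest SCC) has already been computed and, as a simplification, lumps tendrils and disconnected sites together under a single "OTHER" label; the function names are hypothetical.

```python
from collections import defaultdict, deque


def reachable(starts, adjacency):
    """Breadth-first search: all nodes reachable from any start node."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(nodes, edges, core):
    """Label each node CORE, IN, OUT, or OTHER relative to the core SCC."""
    forward, backward = defaultdict(list), defaultdict(list)
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)
    core = set(core)
    downstream = reachable(core, forward)   # CORE plus OUT
    upstream = reachable(core, backward)    # CORE plus IN
    labels = {}
    for node in nodes:
        if node in core:
            labels[node] = "CORE"
        elif node in downstream:
            labels[node] = "OUT"
        elif node in upstream:
            labels[node] = "IN"
        else:
            labels[node] = "OTHER"  # tendrils and disconnected sites
    return labels
```

Counting how many large SCCs fall entirely under the "OUT" label gives the kind of breakdown reported on this slide.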
24
Maximal clique enumeration
Use maximal cliques for extracting spam from CORE
–Link farms tend to include cliques
Maximal clique enumeration [Makino, Uno 2004]
–Ignore nodes with high degree (80 < d) because of the O(max. degree^4) cost
–Large cliques (40 < n) are link farms
26,931 maximal cliques, 8,346 sites (many duplicates)
165 sample sites
[Figure: sampling result]
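To illustrate the idea, the sketch below enumerates maximal cliques with the Bron-Kerbosch algorithm with pivoting. This is a stand-in, not the Makino-Uno algorithm cited on the slide. It assumes an undirected view of the graph given as a dict of neighbour sets, and it applies the degree cut-off before enumeration; the function name is hypothetical.

```python
def maximal_cliques(adjacency, max_degree=80):
    """Enumerate maximal cliques, skipping nodes above the degree cut-off.

    adjacency: dict mapping node -> set of neighbouring nodes (undirected).
    """
    # Drop high-degree nodes first, mirroring the slide's 80 < d filter.
    keep = {v for v, nbrs in adjacency.items() if len(nbrs) <= max_degree}
    adj = {v: (adjacency[v] & keep) - {v} for v in keep}
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored.
        if not p and not x:
            if r:
                cliques.append(sorted(r))
            return
        # Pivot on the node covering the most candidates to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(keep), set())
    return cliques


toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
found = maximal_cliques(toy)
# Candidate link farms are the cliques above a size threshold
# (the slide uses 40 < n; 2 is used here only to suit the toy graph).
big = [c for c in found if len(c) > 2]
```

The recursion depth is bounded by the clique size, so it stays small once high-degree nodes are removed.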
25
Minimum cut
How many spam sites are around large SCCs and cliques?
How many links must be cut to separate the spam sites?
Apply max-flow / min-cut on the directed site graph
[Figure: flow network with a virtual source, a virtual sink, cliques, SCCs, and CORE; labels: 210 white sites, 8,000 sites, 450,000 sites, 57,000 sites; min-cut: 18,000; sampling result]
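A minimal sketch of the seed expansion step follows. It assumes the virtual source is attached to the seed spam sites and the virtual sink to the white sites with unbounded capacity, that every hyperlink has unit capacity, and that flow runs along link direction; none of these details are confirmed by the transcript. The max-flow routine is plain Edmonds-Karp, and the names `min_cut`, `"SOURCE"`, and `"SINK"` are hypothetical.

```python
from collections import defaultdict, deque

INF = float("inf")


def min_cut(edges, spam_seeds, white_sites):
    """Return (cut size, sites on the spam side, links crossing the cut).

    ASSUMPTION: spam_seeds and white_sites are disjoint, and no site is
    literally named "SOURCE" or "SINK".
    """
    assert not set(spam_seeds) & set(white_sites)
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1   # unit capacity per link
        cap[v][u] += 0   # make sure the residual arc exists
    for s in spam_seeds:
        cap["SOURCE"][s] = INF
        cap[s]["SOURCE"] += 0
    for w in white_sites:
        cap[w]["SINK"] = INF
        cap["SINK"][w] += 0

    def residual_bfs():
        parent = {"SOURCE": None}
        queue = deque(["SOURCE"])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = residual_bfs()
        if "SINK" not in parent:
            break
        # Recover the augmenting path and push its bottleneck capacity.
        path, v = [], "SINK"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= push
            cap[w][u] += push
        flow += push

    # Sites still reachable from the source form the expanded spam side.
    spam_side = set(parent) - {"SOURCE"}
    cut = [(u, v) for u, v in edges if u in spam_side and v not in spam_side]
    return flow, spam_side, cut


links = [("s1", "s2"), ("s2", "s1"), ("s1", "f1"), ("s2", "f1"),
         ("f1", "w1"), ("w1", "w2")]
size, spam, cut_links = min_cut(links, {"s1", "s2"}, {"w1", "w2"})
```

In the toy graph the non-seed site `f1` ends up on the spam side, which is the expansion effect: a single link (`f1` to `w1`) separates three sites from the white sites.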
26
Conclusions and future work
An automatic link farm detection method
–Based on graph algorithms
  Seed extraction: SCC and maximal clique
  Seed expansion: Max-flow / min-cut
–High precision (95% to 99%)
Distribution of link farms in the Web graph structure
–Large SCCs around CORE, maximal cliques in CORE
–Only 18,000 links for cutting off 0.5M spam sites
Future work
–Improving recall (small SCCs, large cliques in CORE)
–Experiments on other datasets