Addressing Incompleteness and Noise in Evolving Web Snapshots KJDB2007 Masashi Toyoda IIS, University of Tokyo.

Presentation transcript:

Addressing Incompleteness and Noise in Evolving Web Snapshots KJDB2007 Masashi Toyoda IIS, University of Tokyo

Web as a projection of the world
The Web now reflects various events in the real and virtual worlds
The evolution of past topics can be tracked by observing the Web
Identifying and tracking new information is important for observing new trends
–Sociology, marketing, and survey research
(Figure labels: War, Tsunami, Sports, Computer virus; Online news, weblogs, BBS)

Massive Periodic Crawling for Observing Trends on the Web
(Figure: a crawler archives snapshots of the WWW at times T1, T2, ..., TN; the archived snapshots are then compared)

Web Archive

Observing Trends on the Web
WebRelievo [Toyoda 2005]
–Evolution of link structure

Issues of Observing Evolution
Incompleteness of snapshots
・ Cannot crawl the entire Web
・ Time of creation/deletion is uncertain for many pages
Spam & mirror sites
・ Increasing number of spam sites that deceive search engines (9% to 25% of sites)
・ Many mirror sites (22% to 29% of pages)
(Figure: example of link spamming)

What's Really New on the Web? Identifying New Pages from a Series of Unstable Web Snapshots [WWW2006] Masashi Toyoda and Masaru Kitsuregawa IIS, University of Tokyo

Problems in Massive Periodic Crawling
The whole Web cannot be crawled
–The number of uncrawled pages overwhelms the number of crawled pages even after crawling 1B pages [Eiron et al. 2004]
Web sites may be temporarily unavailable
–Server and network troubles
The novelty of a page crawled for the first time remains uncertain
–The page might have existed at the previous crawl time
–The "Last-Modified" time guarantees only that the page was created no later than that time
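
The uncertainty described on this slide can be illustrated with a minimal sketch, assuming each snapshot is simply a set of crawled URLs. The function name and the sample URLs are hypothetical. A plain set difference yields only pages crawled for the first time, which are candidates for being new rather than confirmed new pages.

```python
def first_time_crawled(snapshots):
    """Return URLs in the latest snapshot that appear in no earlier snapshot.

    `snapshots` is a list of sets of URLs ordered by crawl time.
    The result is only a set of candidates: a page may have existed
    earlier without having been crawled.
    """
    if not snapshots:
        return set()
    seen_before = set().union(*snapshots[:-1])
    return snapshots[-1] - seen_before


# Hypothetical snapshots: b.example is missing at t2 (e.g. server trouble)
# and reappears at t3, so it must not be reported as new.
t1 = {"http://a.example/", "http://b.example/"}
t2 = {"http://a.example/", "http://c.example/"}
t3 = {"http://a.example/", "http://b.example/", "http://d.example/"}

candidates = first_time_crawled([t1, t2, t3])
```

Comparing against all earlier snapshots, not only the previous one, handles temporarily unavailable sites, but it cannot tell whether `http://d.example/` existed before and was simply never reached.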

Our Contribution
Propose a novelty measure for estimating the certainty that a newly crawled page is really new
–New pages can be extracted from a series of unstable snapshots
Evaluate the precision, recall, and miss rate of the novelty measure
Apply the novelty measure to our Web time machine application

Novelty Measure Old and Unknown Pages

Novelty Measure
If all in-links of p come from pages crawled in the last 2 snapshots (L2(t)), then N(p) ≒ 1
(Figure: page p, new at time t, with in-links only from pages crawled at both t-1 and t)

Novelty Measure
N(p) is discounted when the novelty of some in-links is unknown: N(p) ≒ 0.75
(Figure: page p at time t with an in-link from a page q whose novelty is unknown)

Novelty Measure
If some in-links come from U(t), N(p) ≒ ?
(Figure: page p at time t with an in-link from a page q in U(t))

Novelty Measure
Determine the novelty measure recursively
N(p) ≒ ( ) / 4
N(q) ≒
(Figure: pages q and p at times t-1 and t)

Definition of Novelty Measure
δ: damping factor
–probability that there were links to p before t-1
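
The formula on this slide did not survive in the transcript, so the recurrence below is an assumption and not the authors' definition. It is a minimal sketch of the idea the preceding slides describe: an in-link from a page crawled at both t-1 and t counts as full evidence of novelty, an in-link from a page of unknown novelty contributes that page's own score, and N(p) is the average over all in-links. The function name, the data layout, and the way `delta` is applied (as a weight on unknown sources) are all hypothetical.

```python
def novelty_scores(in_links, crawled_last_two, delta=0.9, iterations=50):
    """Iteratively estimate N(p) for every page listed in `in_links`.

    in_links: dict mapping a page to the list of pages linking to it at time t.
    crawled_last_two: set of pages crawled at both t-1 and t; an in-link
        from such a page contributes 1.0 (assumption).
    delta: assumed weight applied to the score of an unknown in-link source.
    """
    scores = {p: 0.0 for p in in_links}  # unknown pages start at 0
    for _ in range(iterations):
        updated = {}
        for p, sources in in_links.items():
            if not sources:
                updated[p] = 0.0
                continue
            total = 0.0
            for q in sources:
                if q in crawled_last_two:
                    total += 1.0
                else:
                    # Unknown source: use its current estimate, damped.
                    total += delta * scores.get(q, 0.0)
            updated[p] = total / len(sources)
        scores = updated
    return scores


# Hypothetical example: p has four in-links, one from the unknown page q.
links = {"p": ["a", "b", "c", "q"], "q": ["d", "e"]}
scores = novelty_scores(links, crawled_last_two={"a", "b", "c", "d"}, delta=1.0)
```

With `delta=1.0`, a page whose unknown in-link source has no evidence of its own gets 3/4 = 0.75, matching the discounted value on the earlier slide. In this example q itself scores 0.5, so p rises to (3 + 0.5) / 4.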

Experiment: Data Set
A massively crawled Japanese web archive

Time   Period       Crawled pages   Links
1999   Jul to Aug   17M             120M
2000   Jun to Aug   17M             112M
2001   Oct          40M             331M
2002   Feb          45M             375M
2003   Feb          66M             1058M
2003   Jul          97M             1589M
2004   Jan          81M             3452M
2004   May          96M             4505M

Experiment: Precision
Given threshold θ, p is judged to be novel when θ < N(p)
–Precision: #(correctly judged) / #(judged to be novel)
–Recall: #(correctly judged) / #(all novel pages)
Use URLs that include dates as a golden set
–Assume that they appeared at the time included in the URL
–E.g.
–Patterns: YYYYMM, YYYY/MM, YYYY-DD
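
The two ratios defined on this slide can be computed as in the following minimal sketch. The function name, the score dictionary, and the sample data are hypothetical; the golden set is assumed to be available as a set of pages known to be new.

```python
def precision_recall(novelty, truly_new, theta):
    """Compute precision and recall of the rule "p is novel when theta < N(p)".

    novelty: dict mapping a page to its novelty score N(p).
    truly_new: set of pages known to be new (the golden set).
    """
    judged_new = {p for p, score in novelty.items() if theta < score}
    correct = judged_new & truly_new
    precision = len(correct) / len(judged_new) if judged_new else 0.0
    recall = len(correct) / len(truly_new) if truly_new else 0.0
    return precision, recall


# Hypothetical scores and golden set.
sample_scores = {"a": 0.9, "b": 0.6, "c": 0.2, "d": 0.0}
golden = {"a", "c"}

strict = precision_recall(sample_scores, golden, theta=0.5)
loose = precision_recall(sample_scores, golden, theta=0.0)
```

Sweeping `theta` over a range of values gives one precision and one recall value per threshold, which is how a precision curve against θ can be drawn.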

Experiment: Precision
Precision jumps from the baseline when θ becomes positive, then gradually increases
Positive novelty provides 80% to 90% precision

A Large-Scale Study of Link Spam Detection by Graph Algorithms AIRWEB’07 Workshop, WWW2007 Hiroo Saito University of Tokyo. JST, ERATO Masashi Toyoda University of Tokyo Masaru Kitsuregawa University of Tokyo Kazuyuki Aihara University of Tokyo. JST, ERATO

Outline
Propose a link farm detection method using graph algorithms
1. SCC decomposition: around the largest SCC (CORE), large SCCs are link farms
2. Maximal clique enumeration: link farms in CORE can be found as maximal cliques
3. Minimum cut: link farms are expanded by min-cut. How many links are needed to cut them out?
Distribution of detected link farms in the Web graph structure

Dataset
Japanese Web archive crawled in May 2004
–96 million pages, 4.5 billion links
–60% of pages in Japanese, 40% in other languages
Site graph
–Top of site: URL linked from 3 or more servers
–A site is a set of URLs below the top URL
–5.9 million sites, 283 million links
(Figures: Domains; Degree)
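
The "top of site" rule on this slide can be sketched as follows. This is a minimal illustration under two assumptions that the slide does not state: a "server" is identified by the hostname of the linking URL, and links from the target's own host are not counted. The function name and sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def find_site_tops(links, min_servers=3):
    """Return URLs that are linked from at least `min_servers` distinct hosts.

    links: iterable of (source_url, destination_url) pairs.
    """
    servers_linking_to = defaultdict(set)
    for src, dst in links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Assumption: links within the same host do not count as a server.
        if src_host and src_host != dst_host:
            servers_linking_to[dst].add(src_host)
    return {url for url, hosts in servers_linking_to.items()
            if len(hosts) >= min_servers}


# Hypothetical link data.
sample_links = [
    ("http://a.example/1", "http://x.example/~user/"),
    ("http://b.example/2", "http://x.example/~user/"),
    ("http://c.example/3", "http://x.example/~user/"),
    ("http://a.example/1", "http://x.example/page"),
    ("http://b.example/2", "http://x.example/page"),
    ("http://x.example/", "http://x.example/page"),
]
tops = find_site_tops(sample_links)
```

Grouping the remaining URLs under their nearest top URL would then give the sites that form the nodes of the site graph.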

SCC decomposition
Size distribution follows a power law (1 ≦ n ≦ 100) with a long and thick tail
Large SCCs are spam (100 < n)
–552 SCCs, 0.57M sites
–550 sample sites
(Figure: sampling results)
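
The slide does not say which SCC algorithm was used. As a minimal sketch, the following uses Kosaraju's algorithm (a swapped-in textbook method) with iterative depth-first search, then keeps the components above a size threshold. The function names and the tiny sample graph are hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph: dict node -> iterable of successor nodes."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: record nodes in order of DFS completion (iterative DFS).
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = {n: [] for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)

    # Pass 2: DFS on the reversed graph in decreasing finish time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def large_sccs(graph, min_size):
    """Keep only SCCs with more than `min_size` nodes (e.g. 100 < n)."""
    return [c for c in strongly_connected_components(graph) if len(c) > min_size]


# Hypothetical site graph: {1, 2, 3} and {4, 5} are cycles, 6 is isolated.
sample_graph = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: []}
components = strongly_connected_components(sample_graph)
```

On a real site graph the largest component (CORE) would be set aside before treating the remaining large SCCs as link farm candidates.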

Distribution of SCCs in the bow tie
Bow-tie structure [Broder et al. 2000]

Component   Sites   Share
CORE        1.78M   30%
IN          0.05M   1%
OUT         3.50M   60%
Tendrils    0.14M   2%
Disc.       0.40M   7%

Distribution of large SCCs
–450 / 552 (81%) SCCs in OUT
–385 / 450 (85%) SCCs directly connected to CORE
CORE contains many spam sites that connect to these SCCs
(Figure: bow-tie diagram showing SCCs whose size is larger than 1,000)
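
The bow-tie components can be derived from CORE with two breadth-first searches, as in this minimal sketch. It assumes CORE (the largest SCC) is already known and passed in as a set. Tendrils and disconnected nodes are not distinguished here and are returned together as "OTHER". The function names and sample graph are hypothetical.

```python
from collections import deque


def reachable(adjacency, sources):
    """Return all nodes reachable from `sources` by following `adjacency`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(graph, core):
    """Split nodes into CORE, IN, OUT and OTHER relative to a given CORE set."""
    nodes = set(graph)
    reverse = {}
    for src, targets in graph.items():
        for dst in targets:
            nodes.add(dst)
            reverse.setdefault(dst, []).append(src)
    core = set(core)
    out_part = reachable(graph, core) - core    # reachable from CORE
    in_part = reachable(reverse, core) - core   # can reach CORE
    other = nodes - core - out_part - in_part   # tendrils and disconnected
    return {"CORE": core, "IN": in_part, "OUT": out_part, "OTHER": other}


# Hypothetical graph: i -> CORE{a, b} -> o, t links into OUT, d is isolated.
sample_graph = {"i": ["a"], "a": ["b"], "b": ["a", "o"],
                "o": [], "t": ["o"], "d": []}
parts = bow_tie(sample_graph, core={"a", "b"})
```

Counting how many large SCCs fall inside `parts["OUT"]` gives the kind of distribution reported on the slide.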

Maximal clique enumeration
Use maximal cliques for extracting spam from CORE
–Link farms tend to include cliques
Maximal clique enumeration [Makino, Uno 2004]
–Ignore nodes with high degree (80 < d) because the cost is O(max. degree^4)
–Large cliques are link farms (40 < n)
–26,931 maximal cliques, 8,346 sites (many duplicates)
–165 sample sites
(Figure: sampling result)
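
The slides cite the Makino and Uno enumeration method, which is not reproduced here. As a stand-in, the following minimal sketch uses the Bron-Kerbosch algorithm with pivoting, after dropping high-degree nodes as the slide describes. It assumes the directed site graph has already been turned into an undirected one (for example by keeping only mutual links), which is an assumption and not something the slide states. The function name and sample graph are hypothetical.

```python
def maximal_cliques(adjacency, max_degree=None):
    """Enumerate maximal cliques of an undirected graph (Bron-Kerbosch, pivoting).

    adjacency: dict node -> set of neighbours (must be symmetric).
    max_degree: if given, nodes with a larger degree are ignored.
    """
    if max_degree is not None:
        keep = {v for v, nbrs in adjacency.items() if len(nbrs) <= max_degree}
    else:
        keep = set(adjacency)
    # Restrict the graph to the kept nodes and remove self-loops.
    adj = {v: (set(adjacency[v]) & keep) - {v} for v in keep}
    if not adj:
        return []

    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed nodes.
        if not p and not x:
            cliques.append(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical undirected graph: triangle a-b-c plus the edge c-d.
sample = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cliques = maximal_cliques(sample)
large = [c for c in cliques if len(c) > 2]
```

The recursion is fine for small inputs like this one; on a graph of the scale described in the slides, a dedicated enumeration implementation would be needed.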

Minimum cut
How many spam sites are around large SCCs and cliques?
How many links are needed to cut off the spam sites?
Apply max-flow / min-cut on the directed site graph
(Figure labels: Cliques, SCCs, Virtual source, Virtual sink, 210 white sites, 8,000 sites, 450,000 sites, 57,000 sites, Min-cut: 18,000, Sampling result, CORE)
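
A minimal sketch of seed expansion by max-flow / min-cut follows, using the Edmonds-Karp algorithm (a swapped-in textbook method; the slides do not name the algorithm used). Several details are assumptions: the virtual source is attached to the seed sites and the virtual sink to the white sites, every link has capacity 1, and seeds and white sites are disjoint. The function name and sample graph are hypothetical.

```python
from collections import defaultdict, deque


def min_cut(edges, seeds, white_sites):
    """Return (cut_size, source_side) separating seeds from white sites.

    edges: iterable of directed (src, dst) links, each with capacity 1.
    seeds and white_sites are assumed to be disjoint.
    """
    INF = float("inf")
    SOURCE, SINK = "__source__", "__sink__"  # hypothetical virtual node names
    capacity = defaultdict(lambda: defaultdict(int))
    for src, dst in edges:
        capacity[src][dst] += 1
        capacity[dst][src] += 0  # make sure the reverse residual edge exists
    for s in seeds:
        capacity[SOURCE][s] = INF
        capacity[s][SOURCE] += 0
    for w in white_sites:
        capacity[w][SINK] = INF
        capacity[SINK][w] += 0

    def bfs_parents():
        # Breadth-first search over edges with remaining capacity.
        parents = {SOURCE: None}
        queue = deque([SOURCE])
        while queue:
            node = queue.popleft()
            for nxt, cap in capacity[node].items():
                if cap > 0 and nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return parents

    flow = 0
    while True:
        parents = bfs_parents()
        if SINK not in parents:
            break
        # Find the bottleneck along the augmenting path, then push flow.
        bottleneck, node = INF, SINK
        while parents[node] is not None:
            bottleneck = min(bottleneck, capacity[parents[node]][node])
            node = parents[node]
        node = SINK
        while parents[node] is not None:
            prev = parents[node]
            capacity[prev][node] -= bottleneck
            capacity[node][prev] += bottleneck
            node = prev
        flow += bottleneck

    # Nodes still reachable from the source form the expanded seed side.
    source_side = set(parents) - {SOURCE}
    return flow, source_side


# Hypothetical graph: seeds s1, s2 both link to x, which links to white site w1.
sample_edges = [("s1", "x"), ("x", "s1"), ("s1", "s2"), ("s2", "x"),
                ("x", "w1"), ("w1", "w2")]
cut_size, expanded = min_cut(sample_edges, ["s1", "s2"], ["w1", "w2"])
```

In this example a single link (x to w1) separates the seeds from the white sites, and x ends up on the seed side, which is how the cut expands the seed set.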

Conclusions and future work
An automatic link farm detection method
–Based on graph algorithms
  Seed extraction: SCC and maximal clique
  Seed expansion: max-flow / min-cut
–High precision (95% to 99%)
Distribution of link farms in the Web graph structure
–Large SCCs around CORE, maximal cliques in CORE
–Only 18,000 links for cutting off 0.5M spam sites
Future work
–Improving recall (small SCCs, large cliques in CORE)
–Experiments on other datasets