The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, Kunal Punera (IIT Bombay); David M. Pennock (NEC Research Institute)


Graph structure of the Web
- Over two billion nodes, two trillion links
- Power-law degree distribution: $\Pr(\text{degree} = k) \propto 1/k^{2.1}$
- Looks like a “bow-tie” at large scale
[Figure: bow-tie diagram with regions IN, OUT, and the strongly connected core (SCC); caption “This is the Web”]

The need for content-based models
- Why does a radius-1 expansion help in topic distillation?
- Why does topic-specific focused crawling work?
- Why is a global PageRank useful for specific queries?
[Figure labels: search engine, query, root set (topic distillation); classifier, crawler, check frontier topic, prune if irrelevant (focused crawling); uniform jump, walk to out-neighbor (PageRank)]

The need for content-based models
- How are different topics linked to each other?
- Are topic directories representative of Web topic populations?
- Are standard collections (e.g., TREC W10G) representative of Web topics?
[Figure caption: “This is the Web with topics”]

How to characterize “topics”
- Web directories: the most natural choice
- Started with
- Keep pruning until all leaf topics have enough (>300) samples
- Approx. 120k sample URLs
- Flatten to approx. 482 topics
- Train text classifier (Rainbow)
- Characterize a new document d as a vector of probabilities $p_d = (\Pr(c \mid d)\ \forall c)$
[Figure labels: test doc, classifier]

Critique and defense
- Cannot capture fine-grained or emerging topics
  - Emerging topics most often specialize existing broad topics
  - Broad topics rarely change
- Classifier may be inaccurate
  - Adequate if much better than random guessing of topic label
  - Can compensate for errors using held-out validation data

Background topic distribution
- What fraction of Web pages are about Health?
- Sampling via random walk
  - PageRank walk (Henzinger et al.)
  - Undirected regular walk (Bar-Yossef et al.)
- Make graph undirected
- Add self-loops so that all nodes have the same degree
- Sample with large stride
- Collect topic histograms
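The undirected regular walk can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy adjacency-dictionary input, and the `topic_of` lookup (standing in for the classifier) are all assumptions. It treats links as undirected, pads every node with implicit self-loops up to the maximum degree so that the walk's stationary distribution is uniform over nodes, and keeps only every `stride`-th node.

```python
import random
from collections import Counter

def regular_walk_sample(adj, start, steps, stride, rng=None):
    """Sample nodes by walking an undirected, degree-regularized graph.

    adj maps each node to a list of out-neighbors (hypothetical toy input).
    """
    rng = rng or random.Random(0)
    # Step 1: make the graph undirected.
    und = {u: set() for u in adj}
    for u, outs in adj.items():
        for v in outs:
            if u != v:
                und.setdefault(u, set()).add(v)
                und.setdefault(v, set()).add(u)
    neighbors = {u: sorted(vs) for u, vs in und.items()}
    # Step 2: every node gets max_deg "slots"; unused slots act as self-loops.
    max_deg = max(1, max(len(vs) for vs in neighbors.values()))
    node, samples = start, []
    for t in range(1, steps + 1):
        slot = rng.randrange(max_deg)
        if slot < len(neighbors[node]):
            node = neighbors[node][slot]   # follow a real edge
        # otherwise: self-loop, stay at the same node
        # Step 3: sample with a large stride to reduce correlation.
        if t % stride == 0:
            samples.append(node)
    return samples

def topic_histogram(samples, topic_of):
    """Step 4: fraction of sampled pages per topic (samples must be non-empty)."""
    counts = Counter(topic_of[u] for u in samples)
    return {c: n / len(samples) for c, n in counts.items()}
```

On a real crawl the in-links needed to make the graph undirected are not directly available, so this sketch only applies to a graph that is already held in memory.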

Convergence
- Start from pairs of diverse topics
- Two random walks, sample from each walk
- Measure distance between topic distributions
  - L1 distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)|$, which lies in $[0, 2]$
  - Below .05 to .2 within 300 to 400 physical pages
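As a small worked example of the distance used here, the sketch below computes the L1 distance between two topic distributions stored as dictionaries. The dictionary representation and the function name are assumptions made for illustration.

```python
def l1_distance(p1, p2):
    """L1 distance between two topic distributions given as {topic: probability}.

    Topics missing from one distribution are treated as probability 0.
    """
    topics = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in topics)
```

Two distributions with disjoint support give the maximum value 2, and identical distributions give 0, matching the stated range.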

Biases in topic directories
- Use Dmoz to train a classifier
- Sample the Web
- Classify samples
- Diff Dmoz topic distribution from Web sample topic distribution
- Report maximum deviation in fractions
- NOTE: Not exactly Dmoz
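The comparison step might look like the sketch below. It assumes each page has already been assigned a single hard topic label, which simplifies the probability vectors used elsewhere in the talk; the function names and the `top` parameter are hypothetical.

```python
from collections import Counter

def topic_fractions(labels):
    """Fraction of pages per topic; labels is a non-empty list of topic names."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def largest_deviations(directory_labels, web_labels, top=3):
    """Topics whose Web-sample fraction differs most from the directory fraction.

    Returns (topic, web_fraction - directory_fraction) pairs, largest
    absolute deviation first.
    """
    d = topic_fractions(directory_labels)
    w = topic_fractions(web_labels)
    diffs = {c: w.get(c, 0.0) - d.get(c, 0.0) for c in set(d) | set(w)}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]
```

A positive deviation means the topic is under-represented in the directory relative to the Web sample, and a negative one means it is over-represented.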

Topic-specific degree distribution
- Preferential attachment: connect u to v with probability proportional to the degree of v, regardless of topic
- More realistic: u has a topic, and links to v with related topics
- Unclear if power-law should be upheld
[Figure labels: intra-topic linkage, inter-topic linkage]
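The two attachment rules can be contrasted with a toy generator. This is only a sketch: the one-link-per-new-node growth process, the `affinity` table, and the function names are assumptions for illustration, not a model taken from the talk.

```python
import random

def preferential_attachment(n, rng=None):
    """Grow a graph of n >= 2 nodes; each new node links to an existing node
    chosen with probability proportional to that node's degree."""
    rng = rng or random.Random(0)
    edges = [(0, 1)]
    endpoints = [0, 1]          # each node appears once per unit of degree
    for u in range(2, n):
        v = rng.choice(endpoints)
        edges.append((u, v))
        endpoints.extend([u, v])
    return edges

def topic_aware_attachment(topics, affinity, rng=None):
    """Same growth process, but the weight of target v is
    degree(v) * affinity[topic(u)][topic(v)].

    topics[i] is the topic of node i; affinity is a hypothetical nested dict.
    At least one earlier node must have positive weight for every new node.
    """
    rng = rng or random.Random(0)
    edges = [(0, 1)]
    degree = [1, 1]
    for u in range(2, len(topics)):
        weights = [degree[v] * affinity[topics[u]][topics[v]] for v in range(u)]
        v = rng.choices(range(u), weights=weights, k=1)[0]
        edges.append((u, v))
        degree[v] += 1
        degree.append(1)
    return edges
```

With an affinity table that is zero off the diagonal, every link after the seed edge stays inside one topic, which is the extreme case of intra-topic linkage.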

Random forward walk without jumps
- Sampling walk is designed to mix topics well
- How about walking forward without jumping?
  - Start from a page $u_0$ on a specific topic
  - Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
  - Compare $(\Pr(c \mid u_i)\ \forall c)$ with $(\Pr(c \mid u_0)\ \forall c)$ and with the background distribution
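A forward walk and the comparison against the starting page could be sketched as below. The `out_links` and `topic_vec` dictionaries are hypothetical stand-ins for a crawled link graph and for classifier output, and stopping at a page with no out-links is an assumption of this sketch.

```python
import random

def forward_walk(out_links, u0, length, rng=None):
    """Follow out-links uniformly at random from u0, with no jumps.

    Stops early at a page without out-links.
    """
    rng = rng or random.Random(0)
    path = [u0]
    while len(path) <= length:
        outs = out_links.get(path[-1], [])
        if not outs:
            break
        path.append(rng.choice(outs))
    return path

def drift_from_start(path, topic_vec):
    """L1 distance between each page's topic vector and that of the start page.

    topic_vec maps a page to {topic: probability}.
    """
    p0 = topic_vec[path[0]]
    drift = []
    for u in path:
        pu = topic_vec[u]
        topics = set(p0) | set(pu)
        drift.append(sum(abs(pu.get(c, 0.0) - p0.get(c, 0.0)) for c in topics))
    return drift
```

The same distance can be computed against a background distribution by substituting it for the start page's vector.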

Observations and implications
- Forward walks wander away from the starting topic slowly
- But do not converge to the background distribution
- Global PageRank is OK also for topic-specific queries
  - Jump parameter d = .1 to .2
  - Topic drift not too bad within path length of 5 to 10
  - Prestige conferred mostly by same-topic neighbors
- Also explains why focused crawling works
[Figure: PageRank walk. With probability d, jump to a random node; with probability (1-d), move to an out-neighbor chosen uniformly at random. Labels: high-prestige node, jump.]
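For reference, the random walk with jump parameter d can be written as a short power iteration. This is a minimal sketch under stated assumptions: the default d = 0.15 is simply a value inside the range quoted above, the fixed iteration count is arbitrary, and spreading the rank of pages without out-links uniformly is one common convention that the talk does not specify.

```python
def pagerank(out_links, d=0.15, iters=50):
    """Power iteration for PageRank.

    With probability d the walker jumps to a uniformly random node;
    with probability 1 - d it follows a uniformly random out-link.
    out_links maps each node to a list of out-neighbors (toy input).
    """
    nodes = sorted(set(out_links) | {v for outs in out_links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: d / n for u in nodes}          # mass arriving via jumps
        for u in nodes:
            outs = out_links.get(u, [])
            if outs:
                share = (1 - d) * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Assumption: a dangling node spreads its rank uniformly.
                share = (1 - d) * rank[u] / n
                for v in nodes:
                    nxt[v] += share
        rank = nxt
    return rank
```

A smaller d means longer uninterrupted forward paths between jumps, which is where the observed path-length range for topic drift becomes relevant.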

Citation matrix
- Given a page is about topic i, how likely is it to link to topic j?
  - Matrix C[i,j] = probability that a page about topic i links to a page about topic j
  - Soft counting: C[i,j] += Pr(i|u) Pr(j|v) for a link from page u to page v
- Applications
  - Classifying Web pages into topics
  - Focused crawling for topic-specific pages
  - Finding relations between topics in a directory
[Figure: link from page u to page v]
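The soft-counting rule can be illustrated with a few lines of code. The input format (a list of `(u, v)` link pairs and a `topic_vec` dictionary of classifier probabilities) and the final row normalization, which turns accumulated counts into conditional probabilities, are assumptions of this sketch.

```python
def citation_matrix(edges, topic_vec, topics):
    """Estimate C[i][j], the probability that a page about topic i links to
    a page about topic j, by soft counting over links (u, v).

    topic_vec maps a page to {topic: probability}; topics lists all topics.
    """
    C = {i: {j: 0.0 for j in topics} for i in topics}
    for u, v in edges:
        for i in topics:
            pu = topic_vec[u].get(i, 0.0)
            if pu == 0.0:
                continue
            for j in topics:
                # Soft count: C[i,j] += Pr(i|u) * Pr(j|v)
                C[i][j] += pu * topic_vec[v].get(j, 0.0)
    # Normalize each row so it sums to 1 (rows with no mass are left at zero).
    for i in topics:
        total = sum(C[i].values())
        if total > 0:
            for j in topics:
                C[i][j] /= total
    return C
```

Soft counting avoids committing each page to a single label, so an uncertain classification spreads its link evidence across several rows and columns.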

Citation, confusion, correction
[Figure: matrix plots over the top-level topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports. Axes: from topic vs. to topic; true topic vs. guessed topic.]
- Classifier’s confusion on held-out documents can be used to correct confusion matrix

Fine-grained views of citation
- Clear block-structure derived from coarse-grain topics
- Strong diagonals reflect tightly-knit topic communities
- Prominent off-diagonal entries raise design issues for taxonomy editors and maintainers

Concluding remarks
- A model for content-based communities
  - New characterization and measurement of topical locality on the Web
  - How to set the PageRank jump parameter?
  - Topical stability of topic distillation
  - Better crawling and classification
- A tool for Web directory maintenance
  - Fair sampling and representation of topics
  - Block-structure and off-diagonals
  - Taxonomy inversion