The Structure of Broad Topics on the Web. Soumen Chakrabarti, Mukul M. Joshi, et al. Presentation by Na Dai.

Presentation transcript:

Introduction & Contribution
Convergence of topic distribution on undirected random walks
Degree distribution restricted to topics
How topic-biased are breadth-first crawls?
Representation of topics in Web directories
Topic convergence on directed walks
Link-based vs. content-based Web communities

Building Blocks
Sampling Web pages
–PageRank-based random walk → Wander walk
–The Bar-Yossef random walk → Sampling walk
  Undirected graph
  Regular
Taxonomy design & document classification
–271,954 topics, 6 levels, 1,697,266 sample URLs
–Pruned taxonomy: 482 leaf nodes, 144,859 sample URLs
–Classification: Rainbow naïve Bayes classifier
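The sampling walk can be illustrated with a minimal sketch. It assumes the usual description of the Bar-Yossef construction: links are treated as undirected, and every node is padded with self-loops up to a common degree so that the graph is regular and the walk converges to a uniform distribution over nodes. The names `sampling_walk`, `neighbors`, and `max_degree` are hypothetical, and the degree bound is an assumed input rather than anything given in the slides.

```python
import random


def sampling_walk(neighbors, start, steps, max_degree, rng=None):
    """Walk on an undirected graph made regular with implicit self-loops.

    neighbors  -- dict: node -> list of adjacent nodes (links in either direction)
    max_degree -- assumed upper bound on any node's degree; a node with
                  degree d behaves as if it had (max_degree - d) self-loops
    Returns the node reached after `steps` steps.
    """
    rng = rng or random.Random()
    node = start
    for _ in range(steps):
        nbrs = neighbors[node]
        # Follow a real edge with probability d / max_degree;
        # otherwise take a self-loop and stay where we are.
        if rng.random() < len(nbrs) / max_degree:
            node = rng.choice(nbrs)
    return node


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    print(sampling_walk(toy_graph, "a", steps=100, max_degree=2))
```

On a real crawl the full neighbor lists (especially in-links) are not directly available, so this toy version only shows the transition rule, not how the links are discovered.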

Convergence
Sampling method
–Sampling walk
Topic distribution of a set
–Soft counting
Difference measure
–L1 distance
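As a rough illustration of soft counting and the L1 difference measure: each page contributes its full class-probability vector (for example, the classifier's output) instead of a single hard label, and two topic distributions are compared by summing absolute differences. The function names and the toy probabilities below are assumptions made for the example.

```python
def soft_count_distribution(class_probs):
    """Average per-page class-probability vectors into one topic distribution.

    class_probs -- list of equal-length lists; each inner list is one page's
                   probability over topics and sums to 1
    """
    if not class_probs:
        raise ValueError("need at least one page")
    totals = [0.0] * len(class_probs[0])
    for probs in class_probs:
        for c, p in enumerate(probs):
            totals[c] += p  # soft count: fractional vote for every topic
    return [t / len(class_probs) for t in totals]


def l1_distance(p, q):
    """Sum of absolute differences between two topic distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    sample = soft_count_distribution([[1.0, 0.0], [0.5, 0.5]])
    reference = [0.5, 0.5]
    print(sample, l1_distance(sample, reference))
```

Tracking the L1 distance between the distribution of the walk's samples so far and a reference distribution is one way to watch convergence.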

The background distribution vs. breadth-first crawls

Faithful representation of topics in Web directory

Topic-specific degree distributions
Power law distribution
–Pr(i) = k / i^x (x > 1)
Contribution to class c
–Soft counting: page d contributes p_c(d)
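A sketch of how a topic-specific degree distribution could be built and checked against Pr(i) = k / i^x. It assumes each page comes with a degree and a class-probability vector, and it estimates x with a plain least-squares line on the log-log histogram. That estimator is a simple stand-in chosen for the example, not necessarily the fitting method used in the paper, and all names are hypothetical.

```python
import math
from collections import defaultdict


def soft_degree_histogram(pages, c):
    """Degree histogram for class c; a page adds p_c(d) rather than 1.

    pages -- iterable of (degree, class_probability_vector) pairs
    """
    hist = defaultdict(float)
    for degree, probs in pages:
        hist[degree] += probs[c]
    return dict(hist)


def fit_power_law_exponent(hist):
    """Estimate x in Pr(i) = k / i^x from the slope of log(count) vs log(i)."""
    points = [(math.log(i), math.log(w)) for i, w in hist.items() if i > 0 and w > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive degrees with positive weight")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # the slope is -x on a log-log plot


if __name__ == "__main__":
    synthetic = {i: 1000.0 / i ** 2.1 for i in range(1, 50)}
    print(round(fit_power_law_exponent(synthetic), 3))
```

A log-log line fit is sensitive to the noisy tail of real degree data, so it should be read as a quick check only.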

Topical locality and link-based prestige ranking
Sampling method
–Wander walk
Class selection
–Dmoz, well-populated
Collect all the pages at distance i (i > 0)

Topical locality and link-based prestige ranking

Relations between topics
Topic citation matrix
Contribution to topic citation matrix C
–C ← C + p(u)^T p(v)
Implications and application
–Improved hypertext classification
–Enhanced focused crawling
–Reorganizing topic directories
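The update C ← C + p(u)^T p(v) can be sketched as follows, reading p(u) and p(v) as row vectors of class probabilities for the source and target of a link, so that each link adds their outer product to C. The function name, the dictionary layout, and the toy probabilities are assumptions for illustration.

```python
def topic_citation_matrix(links, class_probs, num_classes):
    """Accumulate soft topic-to-topic citation counts over a set of links.

    links       -- iterable of (u, v) pairs, meaning page u links to page v
    class_probs -- dict: page -> list of num_classes probabilities
    Returns C, where C[i][j] is the soft count of links from topic i to topic j.
    """
    C = [[0.0] * num_classes for _ in range(num_classes)]
    for u, v in links:
        pu, pv = class_probs[u], class_probs[v]
        for i in range(num_classes):
            for j in range(num_classes):
                C[i][j] += pu[i] * pv[j]  # outer product p(u)^T p(v)
    return C


if __name__ == "__main__":
    probs = {"a": [1.0, 0.0], "b": [0.5, 0.5]}
    for row in topic_citation_matrix([("a", "b"), ("b", "a")], probs, 2):
        print(row)
```

Because every probability vector sums to 1, each link adds a total mass of exactly 1 to C, so the entries of C sum to the number of links.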

Concluding remarks
Characterize some important notions of topical locality on the Web
Open problems
–PageRank jump parameter
–Topical stability of distillation algorithms
–Better crawling algorithms

Q & A?