Improving Hypertext Data using Pagelets and Templates
Ziv Bar-Yossef (U.C. Berkeley and IBM Almaden), Sridhar Rajagopalan (IBM Almaden)



Non-Relevant Data on the Web
A fundamental problem on the Web: many pages contain lots of non-relevant data
“Non-relevant”: not directly related to the main topic / functionality of the page

Hypertext IR Principles
Underlying principles of all link based IR tools:
Relevant Linkage Principle [Kleinberg 1997]
– p links to q ⇒ q is relevant to p
Topical Unity Principle [Kessler 1963, Small 1973]
– q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other
Lexical Affinity Principle [Maarek et al. 1991]
– The closer the links to q1 and q2 are, the stronger the relation between them

Example: HITS & Clever [Kleinberg 1997, Chakrabarti et al. 1998]
Uses the Relevant Linkage Principle
– All links propagate score from hubs to authorities and vice versa
Uses the Topical Unity Principle
– Co-cited authorities propagate score to each other
Clever uses the Lexical Affinity Principle
– Text around links is used to weight relevance of the links
(Figure: hubs linking to authorities)
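The score propagation between hubs and authorities can be sketched as a short iteration. This is a minimal illustration of the basic HITS update only, not the Clever system: it omits Clever's text-based link weighting, and the names `hits`, `normalize`, and the `iterations` parameter are assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def normalize(scores, nodes):
    """Scale scores to unit Euclidean length; nodes without a score get 0."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: scores.get(n, 0.0) / norm for n in nodes}


def hits(links, iterations=50):
    """links: iterable of (source, target) pairs. Returns (hub, authority) dicts."""
    links = list(links)
    nodes = {n for edge in links for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        new_auth = defaultdict(float)
        for src, dst in links:
            new_auth[dst] += hub[src]
        # Hub score: sum of the fresh authority scores of the pages linked to.
        new_hub = defaultdict(float)
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth, nodes)
        hub = normalize(new_hub, nodes)
    return hub, auth


hub, auth = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
print(sorted(auth, key=auth.get, reverse=True)[0])  # most authoritative page
```

Because every link contributes equally here, a navigational or advertisement link boosts its target just as much as a topical one, which is the weakness the later slides address.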

Example: Focused Crawler [Chakrabarti et al. 1999]
Goal
– Fetch pages relevant to a given topic
Technique
– Order already crawled pages according to relevance to the topic
– Crawl over the links from the top page in the list
– Remove top page from the list
Uses the Relevant Linkage Principle and the Topical Unity Principle
– All the links from the top page are assumed relevant to the topic
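The three-step technique above maps naturally onto a priority queue. The sketch below is an illustration under stated assumptions, not the FOCUS crawler itself: `fetch_links` and `relevance` are hypothetical callables supplied by the caller (a real crawler would fetch over HTTP and use a trained topic classifier), and `max_pages` is an invented stopping rule.

```python
import heapq


def focused_crawl(seeds, fetch_links, relevance, max_pages=100):
    """Return pages in the order they were crawled.

    fetch_links(page) -> iterable of linked pages (hypothetical helper).
    relevance(page)   -> float, higher means more on-topic (hypothetical helper).
    """
    crawled = set(seeds)
    order = list(seeds)
    # Already crawled pages ordered by relevance; scores are negated because
    # heapq is a min-heap.
    ranked = [(-relevance(p), p) for p in seeds]
    heapq.heapify(ranked)
    while ranked and len(crawled) < max_pages:
        _, top = heapq.heappop(ranked)  # take the top page and remove it
        for target in fetch_links(top):  # crawl over all of its links
            if target in crawled:
                continue
            if len(crawled) >= max_pages:
                break
            crawled.add(target)
            order.append(target)
            heapq.heappush(ranked, (-relevance(target), target))
    return order


# Toy in-memory web standing in for real fetching and classification.
graph = {"s": ["good", "bad"], "good": ["g2"], "bad": ["b2"], "g2": [], "b2": []}
score = {"s": 1.0, "good": 0.9, "bad": 0.1, "g2": 0.8, "b2": 0.0}
print(focused_crawl(["s"], graph.__getitem__, score.__getitem__))
```

Note that every link out of the top page is followed regardless of where it sits on the page, so template links (navigation bars, advertisements) pull the crawl off topic.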

Link Based Web IR Tools
Search algorithms
– HITS and Clever [Kleinberg 1997, Chakrabarti et al. 1998]
– Google [Brin and Page 1998]
– SALSA [Lempel and Moran 2000]
Finding similar pages
– Co-Citation [Dean and Henzinger 1999]
Hypertext classification
– Hyperclass [Chakrabarti et al. 1998]
Focused crawling
– FOCUS [Chakrabarti et al. 1999]
Page clustering
– [Modha and Spangler 2000]

Violations of the Hypertext IR Principles
– All hypertext IR principles are frequently violated
– Violations are caused by systematic phenomena on the Web
– Violations significantly deteriorate the accuracy of hypertext IR tools

Violations of Relevant Linkage Principle
– Navigational links
– Download links
– Advertisement links
– Endorsement links
– Spam links

Violations of Topical Unity Principle
Violations of the Relevant Linkage Principle
Bookmark pages
– Kjhan's Bookmark Lists
General resource lists
– Links of Interest to Electrical Engineers
Personal homepages
– Ron Fagin's Home Page

Violations of Lexical Affinity Principle
Alphabetical index lists
– Computer and Communication Companies ("M" entries)
HTML representation
– Adjacent cells in the same column are far from each other in the HTML text

Templates
Semantic Definition: A template is a master HTML shell page that is used as a basis for composing new pages
– Content of new pages is plugged into the template shell
– All pages share a common look & feel
Usually controlled by a central authority
– Not necessarily confined to a single site (e.g., Amazon and drugstore.com)
May include a variety of data
– Navigational bars
– Advertisements
– Company info and policies

Why are Templates Bad for IR Tools?
Violate the hypertext IR principles
– Relevant linkage principle
– Topical unity principle
Extremely common
– Became standard in web site design

IR Tool Problems
Generalization
– Search for “Frequency Division Multiplexing” and get back general Electrical Engineering sites
Topic drift
– Search for “Finite Model Theory” and get SF 49’ers fan web sites
Irrelevance
– Get “Yahoo” as a result regardless of the query
Bias
– Search for “computing companies” and get Microspy highly ranked

Hypertext Improvement Problem
Main Goal: develop hypertext processing techniques that
– automatically improve hypertext data (remove violations of the Hypertext IR principles)
– are efficient and scalable (quickly process millions of pages)

Hypertext Cleaning
(Figure: pipeline Web Crawler → Hypertext Cleaner → IR Tool)

Previous Hypertext Improvement Techniques
Heuristics
– Ignore intra-site (“nepotistic”) links [Kleinberg 1997]
– Ignore links to popular sites (“stop sites”) [Chakrabarti et al. 1998, Bharat and Henzinger 1998]
Query dependent techniques
– Weight links according to relevance to query [Chakrabarti et al. 1999, Bharat and Henzinger 1998]
Pre-processing techniques
– Eliminate duplicate pages [Broder et al. 1997]
– Identify “noisy” links automatically [Davison 2000]

Pagelets
Semantic Definition: A pagelet is a maximal region of a page that has a single topic or functionality
– Not too large: has only one topic / functionality
– Not too small: any larger region that contains it has other topics / functionalities

IR with Pagelets
Main Idea 1: Use pagelets rather than pages as atomic units for information retrieval
– Satisfy Relevant Linkage Principle
– Satisfy Topical Unity Principle

IR with Pagelets (cont.)
Drawbacks
– Lose some semantic data latent in pages
– No natural link structure on pagelets
Issues
– How to divide a page into pagelets?
– How to adapt IR tools to work with pagelets?

Pagelets: Syntactic Definition
A pagelet is a node in the HTML parse tree of a page satisfying the following:
– Its HTML tag is one of a fixed list of tags (the tag names did not survive in this transcript)
– It contains at least 3 hyperlinks
– None of its children is a pagelet
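The syntactic definition can be tried out on a small parse tree. The sketch below rests on several assumptions that are not in the slides: the tag set `PAGELET_TAGS` is purely illustrative (the slide's own list is missing), "children" is read as "descendants", a hyperlink is an `a` element with an `href` attribute, and the tree builder is a simplified one built on Python's `html.parser` rather than a full HTML5 parser.

```python
from html.parser import HTMLParser

# Assumption: illustrative tag set only; the slide's actual list is not known.
PAGELET_TAGS = {"table", "ul", "div", "p"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
MIN_LINKS = 3  # "at least 3 hyperlinks", from the slide


class Node:
    def __init__(self, tag, parent=None, is_link=False):
        self.tag, self.parent, self.is_link = tag, parent, is_link
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree; text and comments are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        is_link = tag == "a" and "href" in dict(attrs)
        node = Node(tag, self.current, is_link)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def find_pagelets(html):
    """Return the nodes that satisfy the three conditions, innermost first."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    pagelets = []

    def visit(node):
        # Returns (hyperlinks in subtree, whether the subtree holds a pagelet).
        links = 1 if node.is_link else 0
        has_pagelet = False
        for child in node.children:
            child_links, child_has = visit(child)
            links += child_links
            has_pagelet = has_pagelet or child_has
        if node.tag in PAGELET_TAGS and links >= MIN_LINKS and not has_pagelet:
            pagelets.append(node)
            has_pagelet = True
        return links, has_pagelet

    visit(builder.root)
    return pagelets


page = ('<div><ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li>'
        '<li><a href="/c">C</a></li></ul><p>Text <a href="/d">D</a></p></div>')
print([node.tag for node in find_pagelets(page)])
```

In the sample page the `ul` qualifies, the `p` has too few links, and the outer `div` is rejected because it already contains a pagelet.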

Template Elimination
Main Idea 2: Eliminate pagelets belonging to templates
How to recognize templates efficiently?
– Templates vs. mirrors
– Templates vs. accidental pagelet similarities

Templates: Syntactic Definition
A template is a collection T = (p1,…,pk) of pagelets satisfying:
Similarity
– p1,…,pk are identical or almost identical
Connectivity
– Every two pages owning pagelets in T are reachable from each other (undirectedly) through other pages owning pagelets in T
Template Recognition Problem: Given a set of pages S, find all the templates in S.

Template Recognition in Small Sets
In small sets:
– Hard to validate the connectivity requirement
– Low chance of accidental pagelet similarities
Algorithm:
– Eliminate duplicate pages from S
– Calculate shingle(p) for each pagelet p ∈ S
– Cluster pagelets in S according to shingle
– Output clusters of size > 1
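The four steps for small sets can be sketched directly. Everything specific here is an assumption for illustration: `shingle` is a crude single min-hash over word windows standing in for the shingle used in the talk (it matches identical text reliably but near-identical text only some of the time), pagelets are given as plain text, and duplicate pages are detected by comparing their pagelet shingles.

```python
import hashlib
from collections import defaultdict


def shingle(text, w=4):
    """Crude stand-in: the smallest SHA-1 digest over all w-word windows."""
    words = text.lower().split()
    windows = [" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))]
    return min(hashlib.sha1(win.encode("utf-8")).hexdigest() for win in windows)


def templates_small_set(pages):
    """pages: dict page_id -> list of pagelet texts.

    Returns clusters (lists of (page_id, pagelet_index)) of size > 1.
    """
    # Step 1: eliminate duplicate pages (same sequence of pagelet shingles).
    seen, unique = set(), {}
    for page_id, pagelets in pages.items():
        fingerprint = tuple(shingle(p) for p in pagelets)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique[page_id] = pagelets
    # Steps 2 and 3: compute shingle(p) and cluster pagelets by it.
    clusters = defaultdict(list)
    for page_id, pagelets in unique.items():
        for index, text in enumerate(pagelets):
            clusters[shingle(text)].append((page_id, index))
    # Step 4: output clusters of size > 1.
    return [members for members in clusters.values() if len(members) > 1]


pages = {
    "a": ["Home About Contact Careers Press", "story about cats and dogs today"],
    "b": ["home about contact careers press", "report on quarterly earnings this year"],
    "c": ["Home About Contact Careers Press", "story about cats and dogs today"],
}
print(templates_small_set(pages))
```

Page "c" is an exact copy of "a", so it is dropped first; otherwise the mirrored story pagelet would be reported as a template, which is the "templates vs. mirrors" problem.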

Template Recognition in Large Sets
– Calculate shingle(p) for each pagelet p ∈ S
– Cluster pagelets in S according to shingle
– Discard clusters of size 1
– For each remaining cluster C:
  – Construct graph G_C of pages that own pagelets in C
  – Find undirected connected components of G_C
  – Output components of size > 1
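For large sets the connectivity check is added on top of the clustering. The following sketch assumes shingles have already been computed and that the page link graph is available as a list of page pairs; the function name `templates_large_set` and its input shapes are invented for this example. Components are found with a breadth-first search over links treated as undirected.

```python
from collections import defaultdict, deque


def templates_large_set(pagelet_shingles, links):
    """pagelet_shingles: dict pagelet_id -> (owning page_id, shingle).
    links: iterable of (page, page) hyperlinks, treated as undirected.

    Returns one sorted list of pagelet ids per detected template.
    """
    # Cluster pagelets by shingle.
    clusters = defaultdict(list)
    for pagelet_id, (page_id, sh) in pagelet_shingles.items():
        clusters[sh].append((pagelet_id, page_id))
    # Undirected adjacency over all pages.
    neighbours = defaultdict(set)
    for src, dst in links:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    templates = []
    for members in clusters.values():
        if len(members) < 2:  # discard clusters of size 1
            continue
        owners = {page for _, page in members}  # vertices of G_C
        seen = set()
        for start in sorted(owners):
            if start in seen:
                continue
            # BFS restricted to pages that own a pagelet in this cluster.
            component, queue = {start}, deque([start])
            seen.add(start)
            while queue:
                page = queue.popleft()
                for nxt in neighbours.get(page, set()) & owners:
                    if nxt not in seen:
                        seen.add(nxt)
                        component.add(nxt)
                        queue.append(nxt)
            if len(component) > 1:  # output components of size > 1
                templates.append(
                    sorted(pid for pid, page in members if page in component))
    return templates


shingles = {"p1": ("A", "nav"), "p2": ("B", "nav"), "p3": ("C", "nav"),
            "p4": ("Z", "nav"), "p5": ("A", "body1")}
print(templates_large_set(shingles, [("A", "B"), ("C", "B"), ("Z", "Q")]))
```

Page "Z" carries the same shingle but is not linked to the other owners, so its pagelet is treated as an accidental similarity rather than part of the template.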

Scalability
– Store pages and pagelets in database tables
– Template recognition & elimination can be carried out by a few cheap database operations
– Finding the connected components can be done in main memory using BFS

Example: Clever
– Hubs: all non-template pagelets in the base set
– Authorities: all pages in the base set

Classical Clever vs. “Clean” Clever

Classical Clever vs. “Clean” Clever (cont.)

Classical Clever vs. “Clean” Clever (cont.)

Conclusions
Contributions
– Formulation of the hypertext improvement problem
– Introduction of pagelets and templates as means of improving hypertext IR
– Efficient algorithms for pagelet and template recognition
– Demonstration of the technique’s effectiveness by improving Clever’s precision
Future work
– Test the new technique with other IR algorithms
– Find new hypertext improvement techniques

Thank You!

The Yahoo Template

The Yahoo Pagelets