CSCE822 Data Mining and Warehousing

Lecture 19: Web Data Mining
MW 4:00PM-5:15PM
Dr. Jianjun Hu
mleg.cse.sc.edu/edu/csce822
University of South Carolina
Department of Computer Science and Engineering

Outline
- Web mining applications
- Background on Web Search
- VIPS (VIsion-based Page Segmentation)
- Block-based Web Search
- Block-based Link Analysis
Data Mining: Principles and Algorithms 5/12/2018

Web mining
Categories of mining
- Web usage mining: analyse and discover interesting patterns in users' usage data on the web
  - Server logs
- Web content mining: discover useful information from text, image, audio or video data on the web
  - Text mining, image mining, etc.
- Web structure mining: use graph theory to analyse the node and connection structure of a web site
  - Link mining in search engines

Research Problems in Web Mining
Google: what is the next step?
- How to find pages that approximately match sophisticated documents, incorporating user profiles or preferences?
Looking back at Google: inverted indices
- Construction of indices for sophisticated documents, incorporating user profiles or preferences
- Similarity search of such pages using such indices

Search Engine – Two Rank Functions
- Importance Ranking (Link Analysis): ranking based on link structure analysis
- Relevance Ranking: similarity based on content or text
(Figure: search engine architecture. Labeled components: Web Pages, Web Page Parser, Meta Data, Forward Index, Inverted Link, Backward Link (Anchor Text), Web Topology Graph, Indexer, Anchor Text Generator, Web Graph Constructor, URL Dictionary, Term Dictionary (Lexicon), Rank Functions, Search)

Relevance Ranking
Inverted index
- A data structure for supporting text queries
- Like the index in a book
Example entries (term, document IDs):
aalborg , , …
arm , 19, 29, 98, 143, ...
armada , 457, 789, ...
armadillo , 2134, 3970, ...
armani , 256, 372, 511, ...
zz , 1189, 3209, ...
(Figure: disks with documents are indexed to produce the inverted index)
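
As a minimal sketch of the inverted index idea, the Python below maps each term to the sorted IDs of the documents containing it and answers a boolean AND query by intersecting those lists. The toy corpus, the document IDs, and the function names `build_inverted_index` and `search_all` are made up for illustration; a real engine would also store positions and weights.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return IDs of documents containing every query term (boolean AND)."""
    term_sets = [set(index.get(term, ())) for term in query.lower().split()]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))


# Hypothetical toy corpus; the IDs and texts are illustrative only.
docs = {
    19: "arm wrestling tournament",
    29: "robot arm control",
    457: "spanish armada history",
    789: "armada fleet arm",
}
index = build_inverted_index(docs)
print(index["arm"])
print(search_all(index, "arm armada"))
```

Looking up a term is a single dictionary access, which is why the structure is compared to the index in a book.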

Page Rank Algorithm (Intuitive)
Basic idea: the significance of a page is determined by the significance of the pages linking to it

The PageRank Algorithm
More precisely:
- Link graph: adjacency matrix A
- Construct a probability transition matrix M by renormalizing each row of A to sum to 1
- Treat the web graph as a Markov chain (random surfer)
- The vector of PageRank scores p is then defined to be the stationary distribution of this Markov chain
- Equivalently, since M is row-normalized, p is the principal left eigenvector of M (the principal right eigenvector of M^T): p^T M = p^T
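
The definition above can be sketched with power iteration: start from a uniform vector and repeatedly apply the row-normalized matrix M until p stops changing. The damping factor of 0.85 and the uniform jump for pages without out-links are assumptions added here so that the chain has a unique stationary distribution; the slide does not mention them. The three-page graph is a made-up example.

```python
def pagerank(adjacency, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for the stationary distribution of the random surfer."""
    n = len(adjacency)
    # Row-normalize A into M; a page with no out-links jumps uniformly (assumption).
    transition = []
    for row in adjacency:
        total = sum(row)
        transition.append([x / total for x in row] if total else [1.0 / n] * n)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # new_p[j] = (1 - d) / n + d * sum_i p[i] * M[i][j], i.e. one step of p^T M
        new_p = [
            (1 - damping) / n
            + damping * sum(p[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
scores = pagerank(adjacency)
print(scores)
```

Page 2 receives links from both other pages and ends up with the highest score, while page 1, linked only from page 0, gets the lowest.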

Layout Structure
- Compared to plain text, a web page is a 2D presentation
- Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
- Different parts of a page are not equally important
Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD …
TEXT BODY (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials including low enriched uranium and plutonium that could be used to develop nuclear weapons according to a confidential report obtained by CNN…
Hyperlink: URL: Anchor Text: Al Qaeda…
Image: URL: Alt & Caption: Iran nuclear … Anchor Text: CNN Homepage News …

Web Page Block—Better Information Unit
(Figure: a web page divided into blocks, labeled Importance = High, Importance = Med and Importance = Low)

Motivation for VIPS (VIsion-based Page Segmentation)
Problems of treating a web page as an atomic unit
- A web page usually contains more than pure content
  - Noise: navigation, decoration, interaction, …
  - Multiple topics
  - Different parts of a page are not equally important
- A web page has internal structure
  - Two-dimensional logical structure & visual layout presentation
  - More structured than a free-text document, less structured than a structured document
- Layout: the 3rd dimension of a web page
  - 1st dimension: content
  - 2nd dimension: hyperlink
Noise: First, web pages usually do not contain pure content. A web page typically contains various types of material that are not related to the topic of the page.
Multiple topics: Secondly, a web page usually contains multiple topics, for example a news page containing many different comments on a particular event or politician, or a conference web page containing sponsors from different companies and organizations. Although traditional documents also often have multiple topics, they are less diverse, so the impact on retrieval performance is smaller.
There are also some new characteristics in web pages:
Two-Dimension Logical Structure: Unlike free-text documents, web pages have a 2-D view and a more sophisticated internal content structure. Each block of a web page can have relationships with blocks in up to four directions and can contain or be contained in other blocks. A content structure at the semantic level exists for most pages and can be used to enhance retrieval.
Visual Layout Presentation: To facilitate browsing and attract attention, web pages usually carry much visual information in the tags and properties of HTML [20]. Typical visual hints include lines, blank areas, colors, pictures, fonts, etc. Visual cues are very helpful for detecting the semantic regions in web pages.

Is DOM a Good Representation of Page Structure?
Page segmentation using DOM
- Extract structural tags such as P, TABLE, UL, TITLE, H1~H6, etc.
- DOM is more related to content display and does not necessarily reflect semantic structure
How about XML?
- A long way to go to replace HTML
DOM in general provides a useful structure for a web page. But tags such as TABLE and P are used not only for content organization but also for layout presentation. In many cases, DOM tends to reveal presentation structure rather than content structure, and is often not accurate enough to discriminate different semantic blocks in a web page.

VIPS Algorithm
Motivation:
- In many cases, topics can be distinguished with visual cues such as position, distance, font, color, etc.
Goal:
- Extract the semantic structure of a web page based on its visual presentation.
Procedure:
- Partition the web page top-down based on the separators
Result:
- A tree structure; each node in the tree corresponds to a block in the page.
- Each node is assigned a value (Degree of Coherence) indicating how coherent the content in the block is, based on visual perception.
- Each block is assigned an importance value
- Hierarchy or flat

VIPS: An Example
A hierarchical structure of layout blocks
A Degree of Coherence (DoC) is defined for each block
- Shows the intra-block coherence
- The DoC of a child block must be no less than its parent's
The Permitted Degree of Coherence (PDoC) can be pre-defined to achieve different granularities for the content structure
- Segmentation stops only when every block's DoC is no less than the PDoC
- The smaller the PDoC, the coarser the content structure
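
The stopping rule can be illustrated with a small sketch that walks an already-built block tree top-down and keeps a block whole once its DoC reaches the PDoC. This is not the VIPS implementation: the `Block` class, the integer DoC scale, the example tree and the choice to keep an indivisible block even when its DoC is below the PDoC are all assumptions made for illustration, and the visual separator detection that produces the tree is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    name: str
    doc: int  # Degree of Coherence; the integer scale is an assumption
    children: List["Block"] = field(default_factory=list)


def blocks_at_granularity(block, pdoc):
    """Return the names of the blocks where top-down segmentation stops.

    A block is kept whole once its DoC >= PDoC, or when it has no children
    to divide into (assumption); otherwise its children are examined.
    """
    if block.doc >= pdoc or not block.children:
        return [block.name]
    result = []
    for child in block.children:
        result.extend(blocks_at_granularity(child, pdoc))
    return result


# Hypothetical block tree; each child's DoC is no less than its parent's.
page = Block("page", 1, [
    Block("header", 8),
    Block("body", 3, [Block("article", 9), Block("sidebar", 6)]),
    Block("footer", 10),
])

for pdoc in (1, 2, 5):
    print(pdoc, blocks_at_granularity(page, pdoc))
```

Raising the PDoC from 1 to 5 moves the result from the whole page to four finer blocks, matching the statement that a smaller PDoC gives a coarser content structure.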

Example of Web Page Segmentation (1)
(Figure: the same page segmented two ways, DOM Structure and VIPS Structure)

Example of Web Page Segmentation (2)
(Figure: the same page segmented two ways, DOM Structure and VIPS Structure)
Can be applied to web image retrieval
- Surrounding text extraction

Web Page Block—Better Information Unit
Page Segmentation
- Vision-based approach
Block Importance Modeling
- Statistical learning
(Figure: a web page divided into blocks, labeled Importance = High, Importance = Med and Importance = Low)

Block-based Web Search
- Index blocks instead of whole pages
- Block retrieval
  - Combining DocRank and BlockRank
- Block query expansion
  - Select expansion terms from relevant blocks

Experiments
Dataset
- TREC 2001 Web Track: WT10g corpus (1.69 million pages), crawled in 1997; 50 queries (topics)
- TREC 2002 Web Track: .GOV corpus (1.25 million pages), crawled in 2002; 49 queries (topics)
Retrieval System
- Okapi, with weighting function BM2500
Preprocessing
- Stop-word list (about 220)
- Do not use stemming
- Do not consider phrase information
Tune b, k1 and k3 to achieve the best baseline
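
To show where b, k1 and k3 enter the ranking, here is a sketch of the commonly published Okapi BM25 term-weighting form: b controls document-length normalization, k1 saturates the within-document term frequency, and k3 saturates the within-query term frequency. The slide gives neither the formula nor the tuned values, so the formula choice, the default parameters (k1=1.2, b=0.75, k3=1000) and the toy collection statistics below are assumptions.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75, k3=1000.0):
    """Score one document against a query with the standard Okapi BM25 form."""
    tf = Counter(doc_terms)
    qtf = Counter(query_terms)
    # Length normalization: K grows with the document length relative to the average.
    K = k1 * ((1 - b) + b * len(doc_terms) / avg_doc_len)
    score = 0.0
    for term, q_count in qtf.items():
        f = tf.get(term, 0)
        n = doc_freq.get(term, 0)
        if f == 0 or n == 0:
            continue
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        doc_part = (k1 + 1) * f / (K + f)
        query_part = (k3 + 1) * q_count / (k3 + q_count)
        score += idf * doc_part * query_part
    return score


# Hypothetical collection statistics, for illustration only.
doc_freq = {"web": 10, "mining": 5}
num_docs, avg_doc_len = 1000, 4

doc_a = ["web", "mining", "web", "page"]
doc_b = ["page", "layout", "block", "tree"]
doc_c = ["web", "layout", "block", "tree"]

query = ["web", "mining"]
print(bm25_score(query, doc_a, doc_freq, num_docs, avg_doc_len))
print(bm25_score(query, doc_b, doc_freq, num_docs, avg_doc_len))
```

Tuning then amounts to searching over b, k1 and k3 for the values that give the best baseline on the evaluation queries.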

Block Retrieval on TREC 2001 and TREC 2002
These figures and tables show the experimental results on block retrieval using different page segmentation methods. FullDoc is not listed since it always yields the baseline. The third column shows the results of using the single-best block rank, and the last column shows the results of combining block rank and document rank, with α set to the optimal value for each method. The dependency of performance on α is illustrated in Figure 3, in which all the curves converge to the baseline when α = 1.
If only the best block from each document is used to rank pages, DomPS performs the worst and FixedPS slightly better; both are worse than the baseline on both data sets. VIPS is slightly better than the baseline on TREC 2001 but fails to exceed it on TREC 2002, though it is the best among all the methods there. CombPS wins on TREC 2001 but is worse than VIPS on TREC 2002. For TREC 2002, no method outperforms the baseline.
When block rank is combined with the original document rank, the performance of all four methods increases significantly and exceeds the baseline. This shows the effect of rank combination, similar to traditional passage retrieval [4]. DomPS is still the worst, and FixedPS is slightly better. VIPS and CombPS remain better than the former two and compare to each other as in the non-combined case, except that the result of CombPS (0.2379) is now much closer to that of VIPS (0.2408) on TREC 2002.
Furthermore, the figures show that the winner on either data set improves consistently over the other methods, and thus does not win by chance. For TREC 2001, CombPS performs better in almost every combination, and for TREC 2002 CombPS follows trends similar to VIPS when α exceeds 0.4.
(Figures: TREC 2001 Result, TREC 2002 Result)
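
The role of α can be sketched as a linear interpolation between a document's own score and the score of its single best block, so that α = 1 reproduces the document-only baseline. Whether the original experiments interpolate scores or rank positions is not stated in these notes; interpolating scores, the function name `combined_ranking` and the toy scores are assumptions.

```python
def combined_ranking(doc_scores, block_scores, alpha):
    """Rank documents by alpha * document score + (1 - alpha) * best block score."""
    combined = {}
    for doc_id, doc_score in doc_scores.items():
        # Single-best block: only the highest-scoring block of the page counts.
        best_block = max(block_scores.get(doc_id, []), default=0.0)
        combined[doc_id] = alpha * doc_score + (1 - alpha) * best_block
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical scores, for illustration only.
doc_scores = {"d1": 0.9, "d2": 0.6, "d3": 0.5}
block_scores = {"d1": [0.2, 0.3], "d2": [0.95, 0.1], "d3": [0.4]}

print(combined_ranking(doc_scores, block_scores, alpha=1.0))  # document rank only
print(combined_ranking(doc_scores, block_scores, alpha=0.0))  # best block only
print(combined_ranking(doc_scores, block_scores, alpha=0.5))
```

In this toy data, d2 has a mediocre page-level score but one highly relevant block, so it moves to the top as soon as block evidence is given weight.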

Query Expansion on TREC 2001 and TREC 2002
We apply each web page segmentation method and choose the top-ranked blocks (top-ranked documents for FullDoc) to do query expansion. In Table 4, the value for each segmentation method is the best performance achieved, as seen in Figure 5. Figure 5 illustrates the values for different numbers of blocks (documents for FullDoc). Figure 6 shows the same comparison for TREC 2001 using average precision (AvP) as the evaluation metric.
From the experimental results, a general conclusion can be made that partitioning pages into blocks can improve the performance of query expansion, regardless of which page segmentation method is used. Furthermore, a “good” segmentation method can improve the performance significantly and stably.
Among all the page segmentation methods, FullDoc performs no segmentation and thus may get good results (in TREC 2001) or bad results (in TREC 2002), but FixedPS, VIPS and CombPS always get better results. DomPS is still unstable and sometimes even worse than the baseline. The performance of VIPS and FixedPS is similar, except that VIPS performs better in AvP and that they normally peak at different numbers of blocks. CombPS, on the other hand, is always the best method, with improvements of up to 17.3% and, in AvP, 28.5%.
(Figures: TREC 2001 Result, TREC 2002 Result)
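
Block-based query expansion can be sketched as pseudo-relevance feedback over blocks: take the top-ranked blocks from a first retrieval pass, collect the terms they contain, and append the best candidates to the query. The notes do not say how expansion terms are selected, so ranking candidates by raw frequency, the function name `expand_query` and the example blocks are assumptions standing in for whatever term-selection criterion was actually used.

```python
from collections import Counter


def expand_query(query_terms, ranked_blocks, num_blocks=2, num_terms=2,
                 stop_words=frozenset()):
    """Append the most frequent new terms from the top-ranked blocks to the query."""
    counts = Counter()
    for block in ranked_blocks[:num_blocks]:
        for term in block.lower().split():
            if term not in query_terms and term not in stop_words:
                counts[term] += 1
    # Descending frequency, then alphabetical order, for a deterministic result.
    candidates = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + candidates[:num_terms]


# Hypothetical blocks, already ordered by a first retrieval pass.
ranked_blocks = [
    "iran nuclear agency report uranium",
    "agency report on iran uranium enrichment",
    "site navigation home sports weather",
]
expanded = expand_query(["iran", "nuclear"], ranked_blocks, stop_words={"on"})
print(expanded)
```

Because only the top blocks contribute terms, navigation text from the same pages never enters the expanded query, which is the intuition behind expanding from blocks rather than whole documents.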

Block-level Link Analysis

A Sample of User Browsing Behavior

Improving PageRank using Layout Structure
- Z: block-to-page matrix (link structure)
- X: page-to-block matrix (layout structure)
- Block-level PageRank: compute PageRank on the page-to-page graph
- BlockRank: compute PageRank on the block-to-block graph
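
The slide's formulas for the two graphs did not survive extraction. One reading that is consistent with the matrix dimensions is that the page-to-page graph is the product X Z (pages × blocks times blocks × pages) and the block-to-block graph is the product Z X. The sketch below assumes exactly that; the product form, the helper `matmul` and the toy matrices are assumptions, not the published method.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical example with 2 pages and 3 blocks.
# X[p][b]: weight of block b within page p (layout structure); rows sum to 1.
X = [
    [0.7, 0.3, 0.0],
    [0.0, 0.0, 1.0],
]
# Z[b][p]: probability that a link in block b leads to page p (link structure).
Z = [
    [0.0, 1.0],
    [0.5, 0.5],
    [1.0, 0.0],
]

page_graph = matmul(X, Z)   # 2 x 2 page-to-page transition weights
block_graph = matmul(Z, X)  # 3 x 3 block-to-block transition weights
print(page_graph)
print(block_graph)
```

Running PageRank on `page_graph` would give a block-level PageRank for pages, and running it on `block_graph` would give BlockRank for blocks. The layout weights in X let links from prominent blocks count for more than links from peripheral ones.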

Using Block-level PageRank to Improve Search
Search = a * IR_Score + (1 - a) * PageRank
Block-level PageRank achieves 15-25% improvement over PageRank (SIGIR’04)

Mining Web Images Using Layout & Link Structure (ACMMM’04)

Image Graph Model & Spectral Analysis
- Block-to-block graph
- Block-to-image matrix (container relation): Y
- Image-to-image graph
- ImageRank: compute PageRank on the image graph
- Image clustering: graphical partitioning on the image graph
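
The formula for the image graph is missing from the extracted slide. A dimensionally consistent reading is that image-to-image weights are obtained by projecting the block-to-block graph through the container relation, W_I = Y^T W_B Y, where Y[b][i] is nonzero when block b contains image i. The sketch below assumes that form; the helper functions and the toy matrices are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical block-to-block graph over 3 blocks.
W_B = [
    [0.0, 0.0, 1.0],
    [0.35, 0.15, 0.5],
    [0.7, 0.3, 0.0],
]
# Y[b][i] = 1 if block b contains image i: block 0 holds image 0, block 2 holds image 1.
Y = [
    [1, 0],
    [0, 0],
    [0, 1],
]

W_I = matmul(matmul(transpose(Y), W_B), Y)  # 2 x 2 image-to-image weights
print(W_I)
```

Two images become connected when the blocks that contain them are connected, so PageRank on `W_I` ranks images and a graph partitioning of `W_I` clusters them.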

ImageRank
- Relevance Ranking
- Importance Ranking
- Combined Ranking

ImageRank vs. PageRank
Dataset
- 26.5 million web pages
- 11.6 million images
Query set
- 45 hot queries in Google image search statistics
Ground truth
- Five volunteers were chosen to evaluate the top 100 results returned by the system (iFind)
Ranking method

ImageRank vs PageRank
Image search accuracy using ImageRank and PageRank. Both achieved their best results at a parameter value of 0.25.

Summary
- More improvement on web search can be made by mining web page layout structure
- Leverage visual cues for web information analysis & information extraction
- Demos: Papers, VIPS demo & dll

Slides Credits
Slides in this presentation are partially based on the work of Han. Textbook Slides
