Data Mining Chapter 5 Web Data Mining Introduction to Data Mining with Case Studies Author: G. K. Gupta Prentice Hall India, 2006.

December 2008 ©GKGupta
Web Data Mining
Use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services.
Web mining may be divided into three categories:
1. Web content mining
2. Web structure mining
3. Web usage mining

Web details
More than 20 billion pages in 2008.
Many more documents in databases accessible from the Web.
More than 4 million servers.
A total of perhaps 100 terabytes.
More than a million pages are added daily.
Several hundred gigabytes change every month.
Hyperlinks are used for navigation, endorsement, citation, criticism or plain whim.

Web Index Size in Pages
The total Web is estimated to be about 23B pages. Source 20 November 2008
Search Engine    Number of Pages in Billions
Google           16
MSN Search       7
Yahoo            50
Ask              4.2

Size of the Web
[Figure omitted. Legend: GG = Google, ATW = AllTheWeb, INK = Inktomi, TMA = Teoma, AV = AltaVista. September 2003]

Size Trends
[Figure omitted. Legend: GG = Google, ATW = AllTheWeb, INK = Inktomi, TMA = Teoma, AV = AltaVista. September 2003]

Web
Some 80% of Web pages are in English.
About 30% of domains are in the .com domain.

Graph terminology
The Web is a graph of vertices and edges (V, E).
Directed graph: directed edges (p, q).
Undirected graph: undirected edges (p, q).
Strongly connected component: a set of nodes such that for any pair (u, v) in the set there is a path from u to v.
Breadth first search
Diameter of a graph
Average distance of the graph

Graph terminology
Breadth first search: layer 1 consists of all nodes that are pointed to by the root; layer k consists of all nodes that are pointed to by nodes in layer k-1 and have not appeared in an earlier layer.
Diameter of a graph: the maximum, over all ordered pairs (u, v), of the length of the shortest path from u to v.
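The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the graph is stored as a dictionary mapping each node to the list of nodes it points to; the toy graph and the function names are hypothetical. Pairs with no connecting path are simply ignored when computing the diameter, which is an assumption of this sketch.

```python
from collections import deque

def bfs_layers(graph, root):
    """Return {node: layer}, where layer is the BFS distance from root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(graph):
    """Longest shortest-path length over ordered pairs (u, v) joined by a path."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    return max(max(bfs_layers(graph, u).values()) for u in nodes)

# Hypothetical toy graph: a -> b, a -> c, b -> d, c -> d, d -> a
toy = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
layers = bfs_layers(toy, "a")
toy_diameter = diameter(toy)
```

Here b and c land in layer 1 and d in layer 2 when searching from a.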

Web size
In-degree is the number of links to a node.
Out-degree is the number of links from a node.
The fraction of pages with i in-links is proportional to $1/i^{2.1}$. With i out-links, it is proportional to $1/i^{2.72}$.
i    In-links    Out-links
2    23%         15%
3    10%         5%
4    5%          2%
5    3%          1%
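The percentages in the table can be reproduced directly from the two power laws. The sketch below assumes a proportionality constant of 1, which is an assumption made here because it matches the tabulated values; the function name is hypothetical.

```python
def power_law_share(i, exponent):
    """Value of 1 / i**exponent, with the proportionality constant taken as 1."""
    return 1.0 / i ** exponent

for i in range(2, 6):
    in_share = power_law_share(i, 2.1)     # in-link exponent from the slide
    out_share = power_law_share(i, 2.72)   # out-link exponent from the slide
    print(f"i={i}: in-links {in_share:.0%}, out-links {out_share:.0%}")
```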

Citations
Lotka's Inverse-Square Law: the number of authors publishing n papers is about $1/n^2$ of those publishing only one.
About 60% of all authors make a single contribution.
Less than 1% publish 10 or more papers.
Most Web pages are linked to only one other page (many are not linked to any).
The number of pages with multiple in-links declines quickly.
Rich get richer concept!

Web graph structure
The Web may be considered to have four major components:
Central core: the strongly connected component (SCC), pages that can reach one another along directed links; about 30% of the Web.
IN group: pages that can reach the SCC but cannot be reached from it; about 20%.
OUT group: pages that can be reached from the SCC but cannot reach it; about 20%.

Web graph structure
Tendrils: pages that cannot reach the SCC and cannot be reached from it; about 20%.
Unconnected: about 10%.
The Web is hierarchical in nature.
The Web has a strong locality feature: almost two thirds of all links are to sites within the enterprise domain, and only one third of the links are external.
A higher percentage of external links are broken.
The distance between local links tends to be quite small.
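The component definitions above reduce to two reachability searches, one along the links and one against them. The following is a minimal sketch, assuming the graph is a dictionary of out-links and that one page known to lie in the central core is supplied; the toy graph, the function names and the label "other" (covering both tendrils and unconnected pages) are hypothetical.

```python
def reachable(graph, start):
    """Set of nodes reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(graph, core_node):
    """Split nodes into SCC, IN, OUT and other, relative to core_node's SCC."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    reverse = {n: [] for n in nodes}
    for u, vs in graph.items():
        for v in vs:
            reverse[v].append(u)
    forward = reachable(graph, core_node)     # pages the core can reach
    backward = reachable(reverse, core_node)  # pages that can reach the core
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "other": nodes - forward - backward,  # tendrils and unconnected pages
    }

# Hypothetical toy Web: a and b form the core, i links in, o is linked out to.
toy = {"i": ["a", "x"], "a": ["b"], "b": ["a", "o"], "o": [], "x": [], "u": []}
parts = bow_tie(toy, "a")
```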

Web Size
The diameter is over 500.
The diameter of the central core (SCC) is about 30.
The probability that a path exists from a node a to a node b is 24%.
When a path exists, its average length is 16.

Terminology
Child (u) and parent (v): obvious.
Bipartite graph: BG(T, I) is a graph whose node set can be partitioned into two subsets T and I. Every directed edge of BG joins a node in T to a node in I.
A single page may be considered an example of a BG, with T consisting of the page and I consisting of all its children.

Terminology
Dense BG (DBG): let p and q be nonzero integer variables, and let $t_c$ and $i_c$ be the number of nodes in T and I. A DBG is a BG where each node of T establishes an edge with at least p ($1 \le p \le i_c$) nodes of I, and at least q ($1 \le q \le t_c$) nodes of T establish an edge with each node of I.
Complete BG (CBG): a DBG where $p = i_c$ and $q = t_c$.
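The DBG and CBG conditions can be checked by counting edges per node. This is a minimal sketch, assuming the edges are given as a set of (t, i) pairs with t in T and i in I; the function names and the toy data are hypothetical.

```python
def is_dbg(edges, T, I, p, q):
    """True if every node of T links to at least p nodes of I and every
    node of I is linked to by at least q nodes of T."""
    out_deg = {t: 0 for t in T}
    in_deg = {i: 0 for i in I}
    for t, i in set(edges):              # a set avoids counting duplicates
        if t in out_deg and i in in_deg:
            out_deg[t] += 1
            in_deg[i] += 1
    return (all(d >= p for d in out_deg.values())
            and all(d >= q for d in in_deg.values()))

def is_cbg(edges, T, I):
    """A CBG is the special case p = |I| and q = |T|."""
    return is_dbg(edges, T, I, len(I), len(T))

T = {"t1", "t2"}
I = {"i1", "i2", "i3"}
sparse = {("t1", "i1"), ("t1", "i2"), ("t2", "i2"), ("t2", "i3")}
full = {(t, i) for t in T for i in I}
```

With the toy data, `sparse` is a DBG for p = 2 and q = 1 but not a CBG, while `full` is a CBG.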

Web Content mining
Discovering useful information from the contents of Web pages.
Web content is very rich, consisting of text, images, audio, video and so on, as well as metadata and hyperlinks.
The data may be unstructured (free text), structured (data from a database) or semi-structured (HTML), although much of the Web is unstructured.

Web Content mining
Using search engines to find content does not work well in most cases, because of the abundance problem. Searching for the phrase "data mining" gives 2.6 million documents.
A search provides no information about the structure of the content we are searching for and no information about the various categories of documents that are found.
More sophisticated tools are needed for searching or discovering Web content.

Similar Pages
Proliferation of similar or identical documents on the Web.
It has been found that almost 30% of all Web pages are very similar to other pages and about 22% are virtually identical to other pages.
There are many reasons for identical pages (and mirror sites), for example, faster access or reducing international network traffic.
How to find similar documents?

Web Content
New search algorithms
Image searching
Dealing with similar pages

Web Usage mining
Making sense of Web users' behaviour.
Based on data logs of users' interactions with the Web, including referring pages and user identification.
The logs may be Web server logs, proxy server logs, browser logs, etc.
Useful in finding user habits, which can assist in reorganizing a Web site so that a high quality of service may be provided.

Web Usage mining
Existing tools report the number of hits of Web pages and where the hits came from.
Although useful, the information is not sufficient to learn user behaviour. Tools providing further analysis of such information are useful.
May involve usage pattern discovery, e.g. the most commonly traversed paths through a Web site or all paths traversed through a Web site.

Web Usage mining
These patterns need to be interpreted, analyzed, visualized and acted upon.
One useful way to represent this information might be graphs.
The traversal patterns provide very useful information about the logical structure of the Web site.
Association rule methodology may also be used in discovering affinity between various Web pages.
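As a rough illustration of usage pattern discovery, the sketch below counts the most commonly traversed page-to-page paths. It assumes the logs have already been split into per-user sessions, each an ordered list of page names; the session data, the function name and the choice of path length 2 are all hypothetical.

```python
from collections import Counter

def count_paths(sessions, length=2):
    """Count every contiguous sub-path of the given length across sessions."""
    counts = Counter()
    for pages in sessions:
        for k in range(len(pages) - length + 1):
            counts[tuple(pages[k:k + length])] += 1
    return counts

# Hypothetical sessions reconstructed from a Web server log.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "faq"],
    ["home", "faq"],
]
top_path = count_paths(sessions).most_common(1)
```

In this toy data the step from home to products is traversed twice, more often than any other step.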

Web data mining
Early work by Kleinberg in
Links represent human judgement.
If the creator of p provides a link to q, then it confers some authority on q, although that is not always true.
When search engine results are obtained, should they be ranked by number of in-links?

Web Structure mining
Discovering the link structure or model underlying the Web.
The model is based on the topology of the hyperlinks.
This can help in discovering similarity between sites, in discovering authority sites for a particular topic or discipline, or in discovering overview or survey sites that point to many authority sites (called hubs).

Authority and Hub
A hub is a page that has links to many authorities.
An authority is a page with good content on the query topic that is pointed to by many hub pages, that is, it is relevant and popular.

Kleinberg's algorithm
We want all authority pages for, say, "data mining".
Google returns 2.6 million pages, and these may not even include topics like clustering and classification.
We want a set S of pages such that
1. S is relatively small
2. S is rich in relevant pages
3. S contains most of the strongest authorities

Kleinberg's algorithm
Conditions 1 and 2 (small and relevant) are satisfied by selecting the 100 or 200 most highly ranked pages from a search engine.
An algorithm is needed to satisfy condition 3.

Kleinberg's algorithm
1. Let R be the set of pages selected from the search result, and let S = R
2. For each page in S, do steps 3-5
3. Let T be the set of all pages pointed to by S
4. Let F be the set of all pages that point to S
5. Let S := S + T + some or all of F
6. Delete all links between pages with the same domain name
7. Return S
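The steps above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the out-link and in-link tables are assumed to be available as dictionaries, `max_in` is a hypothetical cap standing in for "some or all of F", and `domain_of` is a hypothetical helper that treats the text before the first slash as the domain name.

```python
def domain_of(page):
    """Hypothetical helper: the part of the address before the first '/'."""
    return page.split("/")[0]

def base_set(root, out_links, in_links, max_in=50):
    """Grow the root set R into the base set S and keep only cross-domain links."""
    S = set(root)
    for page in root:
        S |= set(out_links.get(page, []))                  # T: pages pointed to
        S |= set(sorted(in_links.get(page, []))[:max_in])  # some or all of F
    edges = {
        (u, v)
        for u in S
        for v in out_links.get(u, [])
        if v in S and domain_of(u) != domain_of(v)         # drop same-domain links
    }
    return S, edges

# Hypothetical toy link tables.
out_links = {
    "a.com/1": ["b.com/1", "a.com/2"],
    "c.com/1": ["a.com/1"],
    "b.com/1": [],
}
in_links = {"a.com/1": ["c.com/1"]}
S, edges = base_set(["a.com/1"], out_links, in_links)
```

The link from a.com/1 to a.com/2 is dropped because both pages share a domain name, although a.com/2 itself stays in S.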

Kleinberg's algorithm
S is the base for a given query; hubs and authorities are found from it.
One could order the pages in S by in-degree, but this does not always work.
Authority pages should not only have large in-degrees but also considerable overlap in the sets of pages that point to them.
Thus we find hubs and authorities together.

Kleinberg's algorithm
Hubs and authorities have a mutually reinforcing relationship: a good hub points to many good authorities; a good authority is pointed to by many good hubs.
We need a method of breaking this circularity.

An iterative algorithm
Let page p have an authority weight $x_p$ and a hub weight $y_p$.
Weights are normalized: the sum of the squares of the x weights is 1, and likewise for the y weights.
If p points to many pages with large x weights, its y weight is increased (call this operation O).
If p is pointed to by many pages with large y weights, its x weight is increased (call this operation I).

An iterative algorithm
On termination, the pages with the largest weights are the query authorities and hubs.
Theorem: The sequences of x and y weights converge.
Proof: Let A be the adjacency matrix of G, that is, the (i, j) entry of A is 1 if $(p_i, p_j)$ is an edge in G, and 0 otherwise. The I and O operations may be written as
$x \leftarrow A^T y$
$y \leftarrow A x$

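Kleinberg's algorithm
The I and O operations may be written as $x \leftarrow A^T y$ and $y \leftarrow A x$.
Starting from an initial hub weight vector $y_0$ and applying I then O in each round:
$x_k$ is the unit vector in the direction of $(A^T A)^{k-1} A^T y_0$.
$y_k$ is the unit vector in the direction of $(A A^T)^k y_0$.
Standard linear algebra shows that both converge.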
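The I and O operations with normalization can be sketched in plain Python. This is a minimal illustration, assuming the link graph is given as a list of (source, target) pairs and that a fixed number of rounds is enough; the iteration count, function names and toy edges are hypothetical choices, and a production implementation would test for convergence instead.

```python
import math

def normalize(weights):
    """Scale weights so that the sum of their squares is 1."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {n: w / norm for n, w in weights.items()}

def hits(edges, iterations=50):
    """Return (authority weights x, hub weights y) after repeated I and O steps."""
    nodes = {n for edge in edges for n in edge}
    y = normalize({n: 1.0 for n in nodes})
    x = dict(y)
    for _ in range(iterations):
        new_x = {n: 0.0 for n in nodes}
        for u, v in edges:               # I operation: x <- A^T y
            new_x[v] += y[u]
        x = normalize(new_x)
        new_y = {n: 0.0 for n in nodes}
        for u, v in edges:               # O operation: y <- A x
            new_y[u] += x[v]
        y = normalize(new_y)
    return x, y

# Hypothetical toy graph: h1 links to a1 and a2, h2 links to a1 only.
authority, hub = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

In the toy graph a1 receives the largest authority weight and h1 the largest hub weight, reflecting the mutually reinforcing relationship.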

Intuition
A page creator creates a page and displays his or her interests by putting links to other pages of interest. Multiple pages are created with similar interests.
This is captured by a DBG abstraction.
A CBG would extract a smaller set of potential members, although there is no guarantee that such a core will always exist in a community.

Intuition
A DBG abstraction is perhaps better than a CBG abstraction for a community because a page creator would rarely put links to all the pages of interest in the community.

Web Communities
A Web community is a collection of Web pages that have a common interest in a topic.
Many Web communities are independent of geography or time zones.
How to find communities given a collection of pages?
It has been suggested that the community core is a complete bipartite graph (CBG). Can we use Kleinberg's algorithm?

Discovering communities
Definition of community: a Web community is characterized by a collection of pages that form a linkage pattern equal to a DBG(T, I, α, β), where α and β are thresholds that specify linkage density (the roles played by p and q in the definition of a dense bipartite graph).

Discovering communities
Assert that Web communities are characterized by a collection of pages that form a linkage pattern equal to a DBG.
A seed set is given.
Apply a related page algorithm (e.g. Kleinberg) to each seed to derive related pages.

Cocitation
Citation analysis is the study of links between publications.
Cocitation measures the number of documents that cite both of the two documents.
Bibliographic coupling measures the number of citations in common between two documents.

Cocite
Under cocite, a set of pages is related if there exists a set of common children.
n pages are related under cocite if the number of children they have in common is at least equal to the cocite-factor.
If a group of pages is related according to the cocite relationship, these pages form an appropriate CBG.
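The cocite test amounts to intersecting the children of the candidate pages. Below is a minimal sketch, assuming out-links are stored as a dictionary from page to list of children; the function names and toy pages are hypothetical.

```python
def common_children(pages, out_links):
    """Children shared by every page in the group."""
    child_sets = [set(out_links.get(p, [])) for p in pages]
    return set.intersection(*child_sets) if child_sets else set()

def related_under_cocite(pages, out_links, cocite_factor):
    """True if the pages share at least cocite_factor common children."""
    return len(common_children(pages, out_links)) >= cocite_factor

# Hypothetical toy link table.
out_links = {
    "p1": ["c1", "c2", "c3"],
    "p2": ["c1", "c2"],
    "p3": ["c2", "c4"],
}
shared = common_children(["p1", "p2"], out_links)
```

The related pages together with their common children form a CBG with T equal to the pages and I equal to the shared children, since every page links to every shared child.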