Web Intelligence Web Communities and Dissemination of Information and Culture on the www.

Slides:



Advertisements
Similar presentations
Hubs and Authorities on the world wide web (most from Rao’s lecture slides) Presentor: Lei Tang.
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Our purpose Giving a query on the Web, how can we find the most authoritative (relevant) pages?
Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Estimating the Global PageRank of Web Communities Paper by Jason V. Davis & Inderjit S. Dhillon Dept. of Computer Sciences University of Texas at Austin.
1 Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presented by Yongqiang Li Adapted from
Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Authoritative Sources in a Hyperlinked Environment By: Jon M. Kleinberg Presented by: Yemin Shi CS-572 June
Link Analysis, PageRank and Search Engines on the Web
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Link Structure and Web Mining Shuying Wang
(hyperlink-induced topic search)
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Prestige (Seeley, 1949; Brin & Page, 1997; Kleinberg,1997) Use edge-weighted, directed graphs to model social networks Status/Prestige In-degree is a good.
Link Analysis HITS Algorithm PageRank Algorithm.
The Further Mathematics network
Chapter 8 Web Structure Mining Part-1 1. Web Structure Mining Deals mainly with discovering the model underlying the link structure of the web Deals with.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Presented by, Lokesh Chikkakempanna Authoritative Sources in a Hyperlinked environment.
Link Analysis on the Web An Example: Broad-topic Queries Xin.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
CS 533 Information Retrieval Systems.  Introduction  Connectivity Analysis  Kleinberg’s Algorithm  Problems Encountered  Improved Connectivity Analysis.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
PageRank. s1s1 p 12 p 21 s2s2 s3s3 p 31 s4s4 p 41 p 34 p 42 p 13 x 1 = p 21 p 34 p 41 + p 34 p 42 p 21 + p 21 p 31 p 41 + p 31 p 42 p 21 / Σ x 2 = p 31.
Ch 14. Link Analysis Padmini Srinivasan Computer Science Department
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg ACM-SIAM Symposium, 1998 Krishna Venkateswaran 1.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
1 CS 430: Information Discovery Lecture 5 Ranking.
Block-level Link Analysis Presented by Lan Nie 11/08/2005, Lehigh University.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.
Google's Page Rank. Google Page Ranking “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
Dynamic Network Analysis Case study of PageRank-based Rewiring Narjès Bellamine-BenSaoud Galen Wilkerson 2 nd Second Annual French Complex Systems Summer.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Automated Information Retrieval
HITS Hypertext-Induced Topic Selection
Search Engines and Link Analysis on the Web
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.
7CCSMWAL Algorithmic Issues in the WWW
DTMC Applications Ranking Web Pages & Slotted ALOHA
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Lecture 22 SVD, Eigenvector, and Web Search
Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

Web Intelligence Web Communities and Dissemination of Information and Culture on the www

The HITS Algorithm: Web Communities Google’s PageRank gives a score to every page, in order to help with relevance and usefulness in search. HITS (Hyperlink-Induced Topic Search) is an alternative method, which tries to find the key pages for specific web communities. HITS focuses on finding authorities (pages which many inlinks) and hubs (pages with many outlinks) that are relevant to specific topics (such as may be gleaned from a search query).

Authorities and Hubs Suppose Rq is a set of pages that have been retrieved by a search engine for a specific query q. Let Ai be the authority score for page i, and let Hi be the hub score for page i. We can initialise these at 1 for every page, and then iterate the following two equations until the numbers settle down:

Authorities and Hubs example A C D B 1. Aa = Hb = 1; Ab = Ha = 1; Ac = Ha + Hb = 2; Ad = Ha + Hb + Hc = 3 Normalise: Aa = ; Ab = 0.143; Ac = 0.286; Ad = Ha = Ab + Ac + Ad = 0.858; Hb = Aa + Ac + Ad = 0.858; Hc = 0.429; Hd = 0 Normalise: Ha = 0.4; Hb = 0.4; Hc = 0.2; Hd = 0 Initially Ha = Hb = Hc = Hd =1

Authorities and Hubs example A C D B 2. Aa = Hb = 0.4; Ab = Ha = 0.4; Ac = Ha + Hb = 0.8; Ad = Ha + Hb + Hc = 1 Normalise: Aa = ; Ab = 0.154; Ac = 0.308; Ad = Ha = Ab + Ac + Ad = 0.848; Hb = Aa + Ac + Ad = 0.848; Hc = 0.386; Hd = 0 Normalise: Ha = 0.356; Hb = 0.356; Hc = 0.288; Hd = 0 Initially Ha = Hb = Hc = Hd =1

Authorities and Hubs example A C D B 3. Aa = Hb = 0.356; Ab = Ha = 0.356; Ac = Ha + Hb = 0.712; Ad = Ha+Hb+Hc = 1 Normalise: Aa = ; Ab = 0.146; Ac = 0.292; Ad = Ha = Ab + Ac + Ad = 0.854; Hb = Aa + Ac + Ad = 0.854; Hc = 0.416; Hd = 0 Normalise: Ha = 0.402; Hb = 0.402; Hc = 0.196; Hd = 0 Initially Ha = Hb = Hc = Hd =1

Authorities and Hubs example A C D B 4. Aa = Hb = 0.402; Ab = Ha = 0.402; Ac = Ha + Hb = 0.804; Ad = Ha+Hb+Hc = 1 Normalise: Aa = ; Ab = 0.154; Ac = 0.308; Ad = Ha = Ab + Ac + Ad = 0.846; Hb = Aa + Ac + Ad = 0.846; Hc = 0.384; Hd = 0 Normalise: Ha = 0.408; Hb = 0.408; Hc = 0.184; Hd = 0 Initially Ha = Hb = Hc = Hd =1

… eventually the numbers converge Authorities and Hubs exhibit mutually reinforcing relationships. A good hub points to many good authorities A good authority is pointed to by many good hubs. (as is also true with PageRank …) the calculation is done in a different way. This is indicated in the HITS algorithm pseudocode on the next slide. This alg says how HITS responds to a query, q Contrast this with how google deals with q.

The HITS Algorithm 1.Get the r highest ranked pages for query q; call the pages Rq 2.Expand these to set Sq, containing all pages pointed to by pages in Rq, and add up to d pages that point to pages in Rq. 3.Consider the link graph of Sq, G. There are transverse links (between pages in Sq that have different domain names), and intrinsic links (between pages with the same domain name). Delete all intrinsic links of G 4.Obtain a ranked list of authorities in G. This can be done by the simple repeated iteration of authority scores and hubs scores. But it is done in practice by: Form the adjacency matrix of G, A, and its transpose A T Find the normalised principal eigenvector e of A T A Values in this eigenvector correspond to authority scores. (reasonable parameter values: r = 200; d = 50 – leads to around 1000—5000 pages in Sq)

A Problem with HITS Examinable Reading: the Nomura et al paper, sections 1, 2 and 3. Understand the problem. Main point: any technique for deciding on the importance of a web page can be misled, either deliberately or not, by certain link structures. For example, how might you deceive the PageRank method into thinking that your www page was important?

Cultural Dissemination: Axelrod’s Model Axelrod formulated a simple and very influential model of cultural dissemination. That is: how ideas, traits, characteristics, fashions, etc …, spread in communities. This model (and culture dissemination models in general) help us understand the factors that lead to: Globalisation: where an entire community becomes very similar in some aspect (e.g. everyone using google? everyone using English as the language of science?) Polarization: for a particular aspect, the community divides into two distinct choices – e.g. Windows users and Mac users Differentiation: more than two stable sub-communities: e.g. the presence of different religious groups.

Assumptions in Cultural Models The two key bases in a cultural model are: People like to change, to become a little more like the people in their own social group. E.g. wear similar clothes, go to similar restaurants, adopt similar views. People are more likely to be influenced by those who are already similar to them. E.g. a Norwegian goatherder will consider buying boots that are like his respected neighbour’s boots, but not the Australian prime minister’s boots. These assumptions are demonstrably true. So, why doesn’t everyone eventually become the same? How long does it take for globalisation to occur for a particular aspect? These and other questions are explored by using cultural models.

Axelrod’s model of cultural dissemination Individuals are placed on a spatial grid – although any spatial structure can be used. Each individual has F features (e.g. religion, fashion, diet, …), and each feature has q possible values. The feature vectors are initially random (or otherwise, depending on what experiment you want to do). The model typically runs as follows: 1. A random individual a is chosen, and a random neighbour of that individual, b, is chosen 2.If a and b have x features in common (1 < x < F), then a will change to match another one of b’s features, with probability q/F.

Simple Example PC, Jedi, Car, C Mac, Sith, Bike, CMac, Jedi, Bike, C Mac, Jedi, Car, JavaPC, Jedi, Bike, Java PC, Jedi, Car, Java Mac, Jedi, Car, JavaPC, Sith, Bike, C Mac, Jedi, Car, Java PC, Sith, Bike, CMac, Jedi, Bike, Java

Simple Example PC, Jedi, Car, C Mac, Sith, Bike, CMac, Jedi, Bike, C Mac, Jedi, Car, JavaPC, Jedi, Bike, Java PC, Jedi, Car, Java Mac, Jedi, Car, JavaPC, Sith, Bike, C Mac, Jedi, Car, Java PC, Sith, Bike, CMac, Jedi, Bike, Java Choose a random individual (red) and a neigbour (blue)

Simple Example PC, Jedi, Car, C Mac, Sith, Bike, CMac, Jedi, Bike, C PC, Jedi, Car, JavaPC, Jedi, Bike, Java PC, Jedi, Car, Java Mac, Jedi, Car, JavaPC, Sith, Bike, C Mac, Jedi, Car, Java PC, Sith, Bike, CMac, Jedi, Bike, Java This individual changes to be a bit closer to neighbour

Simple Example PC, Jedi, Car, C Mac, Sith, Bike, CMac, Jedi, Bike, C PC, Jedi, Car, JavaPC, Jedi, Bike, Java PC, Jedi, Car, Java PC, Sith, Bike, C Mac, Sith, Car, Java PC, Sith, Bike, CMac, Jedi, Bike, Java Another random individual and a neighbour

Simple Example PC, Jedi, Car, C Mac, Sith, Bike, C Mac, Jedi, Bike, Java PC, Jedi, Car, JavaPC, Jedi, Bike, Java PC, Jedi, Car, Java PC, Sith, Bike, C Mac, Sith, Car, Java PC, Sith, Bike, CMac, Jedi, Bike, Java Again a random individual is chosen, and a neighbour, but these two are already the same, so no change.

Simple Example PC, Jedi, Car, C Mac, Sith, Bike, CMac, Jedi, Bike, C Mac, Jedi, Bike, Java PC, Jedi, Car, JavaPC, Jedi, Bike, Java PC, Jedi, Car, Java PC, Sith, Bike, C Mac, Sith, Car, Java PC, Sith, Bike, CMac, Jedi, Bike, Java No change – they are too different, so the individual is not influenced by the neighbour

Eventually, something like this happens …

Mac, Sith, Bike, C PC, Jedi, Car, Java Mac, Sith, Bike, C The community has polarized into two distinct types. Alternatively, it may have globalized, or it may have differentiated. What happens depends on subtleties of the parameters in context. I.e. the degree of difference thresholds within which individuals will be influenced by neighbours; the size of the neighbourhood, and so on.

Proper examples: evolution of `cultural domains’

Interesting transitions This axis shows the size of stable communities that emerge Globalisation Major differentiation polarisation Globalisation apparent when few choices for a feature – polarisation more common in small communities?

Lots of research starting to be done on spread of ideas and culture on the WWW – using Axelrod- style models on web graphs. Think about examples of polarisation and globalisation on the www that you think have happened, or are likely to happen. The papers I got the last three figures from are on the www as recommended reading. The end