On Search, Ranking, and Matchmaking in Information Networks Sergei Maslov Brookhaven National Laboratory.

Slides:



Advertisements
Similar presentations
Regularization David Kauchak CS 451 – Fall 2013.
Advertisements

1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
Link Analysis: PageRank
Graph Laplacian Regularization for Large-Scale Semidefinite Programming Kilian Weinberger et al. NIPS 2006 presented by Aggeliki Tsoli.
CS345 Data Mining Page Rank Variants. Review Page Rank  Web graph encoded by matrix M N £ N matrix (N = number of web pages) M ij = 1/|O(j)| iff there.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 CE 530 Molecular Simulation Lecture 8 Markov Processes David A. Kofke Department of Chemical Engineering SUNY Buffalo
Detecting topological patterns in complex networks Sergei Maslov Brookhaven National Laboratory.
Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA.
Algorithmic and Economic Aspects of Networks Nicole Immorlica.
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Link Analysis, PageRank and Search Engines on the Web
Lecture 4. Modules/communities in networks What is a module? Nodes in a given module (or community group or a functional unit) tend to connect with other.
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
PageRank Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 27, 2014.
1cs542g-term Notes  Extra class next week (Oct 12, not this Friday)  To submit your assignment: me the URL of a page containing (links to)
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Improved results for a memory allocation problem Rob van Stee University of Karlsruhe Germany Leah Epstein University of Haifa Israel WADS 2007 WAOA 2007.
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Friends and Locations Recommendation with the use of LBSN
Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.
Google’s Billion Dollar Eigenvector Gerald Kruse, PhD. John ‘54 and Irene ‘58 Dale Professor of MA, CS and I T Interim Assistant Provost Juniata.
1 Applications of Relative Importance  Why is relative importance interesting? Web Social Networks Citation Graphs Biological Data  Graphs become too.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
EMIS 8381 – Spring Netflix and Your Next Movie Night Nonlinear Programming Ron Andrews EMIS 8381.
Author(s): Rahul Sami and Paul Resnick, 2009 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution.
DATA MINING LECTURE 13 Pagerank, Absorbing Random Walks Coverage Problems.
Center for E-Business Technology Seoul National University Seoul, Korea BrowseRank: letting the web users vote for page importance Yuting Liu, Bin Gao,
Overview of Web Ranking Algorithms: HITS and PageRank
Comp. Genomics Recitation 3 The statistics of database searching.
Collaborative Filtering  Introduction  Search or Content based Method  User-Based Collaborative Filtering  Item-to-Item Collaborative Filtering  Using.
PageRank. s1s1 p 12 p 21 s2s2 s3s3 p 31 s4s4 p 41 p 34 p 42 p 13 x 1 = p 21 p 34 p 41 + p 34 p 42 p 21 + p 21 p 31 p 41 + p 31 p 42 p 21 / Σ x 2 = p 31.
STAR Sti, main features V. Perevoztchikov Brookhaven National Laboratory,USA.
Ch 14. Link Analysis Padmini Srinivasan Computer Science Department
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
STAR Kalman Track Fit V. Perevoztchikov Brookhaven National Laboratory,USA.
Ranking Link-based Ranking (2° generation) Reading 21.
Workshop on Optimization in Complex Networks, CNLS, LANL (19-22 June 2006) Application of replica method to scale-free networks: Spectral density and spin-glass.
LOGO Identifying Opinion Leaders in the Blogosphere Xiaodan Song, Yun Chi, Koji Hino, Belle L. Tseng CIKM 2007 Advisor : Dr. Koh Jia-Ling Speaker : Tu.
KPS 2007 (April 19, 2007) On spectral density of scale-free networks Doochul Kim (Department of Physics and Astronomy, Seoul National University) Collaborators:
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
1 CS 430: Information Discovery Lecture 5 Ranking.
Community structure in graphs Santo Fortunato. More links “inside” than “outside” Graphs are “sparse” “Communities”
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
Ljiljana Rajačić. Page Rank Web as a directed graph  Nodes: Web pages  Edges: Hyperlinks 2 / 25 Ljiljana Rajačić.
CS 440 Database Management Systems Web Data Management 1.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Dynamic Network Analysis Case study of PageRank-based Rewiring Narjès Bellamine-BenSaoud Galen Wilkerson 2 nd Second Annual French Complex Systems Summer.
Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest.
Topics In Social Computing (67810) Module 1 (Structure) Centrality Measures, Graph Clustering Random Walks on Graphs.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Web Mining Link Analysis Algorithms Page Rank. Ranking web pages  Web pages are not equally “important” v  Inlinks.
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.
Degree and Eigenvector Centrality
Advanced Artificial Intelligence
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Junghoo “John” Cho UCLA
Presentation transcript:

On Search, Ranking, and Matchmaking in Information Networks Sergei Maslov Brookhaven National Laboratory

Information networks: WWW and beyond First part of my talk: webpages in the world: need to search and rank!! Second part: opinion networks WWW can be thought of as a network of opinions (hyperlinks – positive votes) Our choices and opinions on products, services and each other – a much larger opinion network! Very incomplete (sparse)  one could use intelligent “matchmaking” to match users to new products or each other

Ranking webpages Assign an “importance factor” G i to every webpage Given a keyword (say “jaguar”) find all the pages that have it in their text and display them in the order of descending G i. One solution still used in scientific publishing is G i =K in (i) (the number of incoming links), but: Too democratic: It doesn’t take into account the importance of nodes sending links Easy to trick and artificially boost the ranking

How Google works Google’s recipe (circa 1998) is to simulate the behavior of many virtual “random surfers” PageRank: G i ~ the number of virtual hits the page gets. It is also ~ the steady state number of random surfers at a given page Popular pages send more surfers your way  PageRank ~ K in weighted by the popularity of a source of each hyperlink Surfers get bored following links  with probability  =0.15 at any timestep a surfer jumps to a randomly selected page (not following any hyperlinks) Last rule also solves the ergodicity problem

Mathematics of the Google To calculate the PageRank Google solves a self-consistent Eq.: G i ~  j  i G j /K out (j) To account for random jumps: G i = (1-  )  j  i G j /K out (j) +   j G j /N = (1-  )  j  i G j /K out (j) +  (uses normalization: =  j G j /N =1) Pages with K out (j)=0 are removed

Matrix formulation Equivalent to finding the principal eigenvector (with =1) of the matrix (1-  ) T +  U, where T ij = 1/K out (j) if j  i and 0 otherwise, and U ij =1/N Could be easily solved iteratively by starting with G i (0)= 1 and repeating G (n+1) = (1-  ) T G (n) +  All G i > 

How Communities in the WWW influence the Google ranking H. Xie, K.-K. Yan, SM, cond-mat/

How do WWW communities influence their average G i ? Pages in a web-community preferentially link to each other. Examples: Pages from the same organization (e.g. SFI) Pages devoted to a common topic (e.g. physics) Pages in the same geographical location (e.g Santa Fe) Naïve argument: communities tend to “trap” random surfers to spend more time inside them  they should increase the Google ranking of individual webpages in the community

log 10 ( c ) # of intra-community links Test of a naïve argument Naïve argument is wrong! The effect could go either way Community #1 Community #2

E cc E ww

G c – average Google rank of pages in the community; G w =1 – in the outside world E cw G c / c – current from C to W It must be equal to: E wc G w / w – current from W to C Thus G c depends on the ratio between E cw and E wc – the number of edges (hyperlinks) between the community and the world

Balancing currents for nonzero  J cw =(1-  ) E cw G c / c +  G c N c – current from C to W It must be equal to: J cw =(1-  ) E wc G w / +  G w N w (N c /N w ) – current from W to C

For very isolated communities (E cw /E (r) cw <  and E wc /E (r) wc <  ) one has G c =1. Their Google rank is decoupled from the outside world! Overall range:  <G c <1/  What are the consequences?

WWW - the empirical data We have data for ~10 US universities (+ all UK and Australian Universities) Looked closely at Long Island University 4 large campuses 45,000 webpages and 160,000 hyperlinks After removing K out =0 left with ~15,000 webpages and 90,000 links Can do a mini-Google PageRank on this set alone

LIU communities LI University has 4 campuses. We looked at one of them (CWP Campus) E cw =1393; E (r) cw  16,000; E cw /E (r) cw ~ 0.09 <  =0.15 E wc =336; E (r) wc  12,500; E wc /E (r) wc ~ 0.03<  =0.15 This community should be decoupled from the outside world

Ratios are renormalized to E cw /E (r) cw ~0.01 and E wc /E (r) wc ~0.005

But: the community effect could be also strong!

 =0.001 Abnormally high PageRank

Top PageRank LIU websites for  =0.001 don’t make sense #1 hear.html' #5 …/higher_ed/ index.html #9 …/higher_ed/courses.html World Strongly connected component

Collaborators and postdoc info: Collaborators: Huafeng Xie – City University of NY Koon-Kiu Yan - Stony Brook U. Looking for a postdoc to work in my group at Brookhaven National Laboratory in New York starting Fall/Winter 2005 or even 2006 Topics: Large-scale properties of (mostly) bionetworks (partially supported by a NIH/NSF grant with Ariadne Genomics) Internet/Google/Opinion networks CV and 3 letters of recommendation to: See

Part 2: Opinion networks "Extracting Hidden Information from Knowledge Networks", S. Maslov, and Y-C. Zhang, Phys. Rev. Lett. (2001). "Exploring an opinion network for taste prediction: an empirical study", M. Blattner, Y.-C. Zhang, and S. Maslov, in preparation.

Predicting customers’ tastes from their opinions on products Each of us has personal tastes Information about them is contained in our opinions on products Matchmaking: opinions of customers with tastes similar to mine could be used to forecast my opinions on untested products Internet allows to do it on large scale (see amazon.com and many others)

Opinion networks Webapges Other webpages WWW Opinions of movie-goers on movies Customers Movies opinion

Storing opinions XXX29?? XXX?8?8 XXX??1? 2??XXXX 98?XXXX ??1XXXX ?8?XXXX Matrix of opinions  IJ Network of opinions Customers Movies

Using correlations to reconstruct customer’s tastes Similar opinions  similar tastes Simplest model: Movie-goers  M-dimensional vector of tastes T I Movies  M-dimensional vector of features F J Opinions  scalar product:  IJ = T I  F J Customers Movies

Loop correlation Predictive power 1/M (L-1)/2 One needs many loops to best reconstruct unknown opinions L=5 known opinions: Predictive power of an unknown opinion is 1/M 2 An unknown opinion

Main parameter: density of edges The larger is the density of edges p the easier is the prediction At p 1  1/N (N=N costomers +N movies ) macroscopic prediction becomes possible. Nodes are connected but vectors T I and F J are not fixed: ordinary percolation threshold At p 2  2M/N > p 1 all tastes and features ( T I and F J ) can be uniquely reconstructed: rigidity percolation threshold

Real empirical data (EachMovie dataset) on opinions of customers on movies: 5-star ratings of 1600 movies by users 1.6 million opinions!

Spectral properties of  For M<N the matrix  IJ has N-M zero eigenvalues and M positive ones:  = R  R +. Using SVD one can “diagonalize” R = U  D  V + such that matrices V and U are orthogonal V +  V = 1, U  U + = 1, and D is diagonal. Then  = U  D 2  U + The amount of information contained in  : NM-M(M-1)/2 << N(N-1)/2 - the # of off- diagonal elements

Recursive algorithm for the prediction of unknown opinions 1. Start with  0 where all unknown elements are filled with (zero in our case) 2. Diagonalize and keep only M largest eigenvalues and eigenvectors 3. In the resulting truncated matrix  ’ 0 replace all known elements with their exact values and go to step 1

Convergence of the algorithm Above p 2 the algorithm exponentially converges to the exact values of unknown elements The rate of convergence scales as (p-p 2 ) 2

Reality check: sources of errors Customers are not rational!  IJ = r I  b J +  IJ (idiosyncrasy) Opinions are delivered to the matchmaker through a narrow channel: Binary channel S IJ = sign(  IJ ) : 1 or 0 (liked or not) Experience rated on a scale 1 to 5 or 1 to 10 at best If number of edges K, and size N are large, while M is small these errors could be reduced

How to determine M? In real systems M is not fixed: there are always finer and finer details of tastes Given the number of known opinions K one should choose M eff  K/(N readers +N books ) so that systems are below the second transition p 2  tastes should be determined hierarchically

Avoid overfitting Divide known votes into training and test sets Select M eff so that to avoid overfitting !!! Reasonable fit Overfit

Knowledge networks in biology Interacting biomolecules: key and lock principle Matrix of interactions (binding energies)  IJ = k I  l J + l I  k J Matchmaker (bioinformatics researcher) tries to guess yet unknown interactions based on the pattern of known ones Many experiments measure S IJ =  (  IJ -  th ) k (1) k (2) l (2) l (1)

Collaborators: Yi-Cheng Zhang – U. of Fribourg Marcel Blattner – U. of Fribourg

Postdoc position Looking for a postdoc to work in my group at Brookhaven National Laboratory in New York starting Fall 2005 Topic - large-scale properties of (mostly) bionetworks (partially supported by a NIH/NSF grant with Ariadne Genomics) CV and 3 letters of recommendation to: See

THE END

Information networks Why the research into properties of complex networks is so active lately? Biology: lots of large-scale experimental data is generated in the last 10 years: most of it is on the level of networks The explosive growth of information networks (WWW and the Internet) is what fuels it all (directly or indirectly)!

Analysis Derived for  =0 Uses a strong mean field approximation that nodes that send current to and from the community have average G i for the outside world (G w =1) and community (G c ) In a true community both E cw and E wc are smaller than in randomized network but the effect depends on the competition between them

 =0.15

Networks with artificial communities To test we generate a scale-free network with an artificial community of N c pre-selected nodes Use Metropolis Algorithm with H=-(# of intra-community nodes) and some inverse temperature  Detailed balance:

Modules in networks and how to detect them using the Random walks/diffusion K. Eriksen, I. Simonsen, SM, K. Sneppen, PRL (2003)

What is a module? Nodes in a given module (or community group or functional unit) tend to connect with other nodes in the same module Biology: proteins of the same function or sub-cellular localization WWW – websites on a common topic Internet – geography or organization (e.g. military)

Do you see any modules here?

Random walkers on a network Study the behavior of many VIRTUAL random walkers on a network At each time step each random walker steps on a randomly selected neighbor They equilibrate to a steady state n i ~ k i (solid state physics: n i = const) Slow modes allow to detect modules and extreme edges

Matrix formalism

Eigenvectors of the transfer matrix T ij

US Military Russia

RU RU RU RU CA RU RU ?? ?? US US US US ?? (US Department of Defence) ?? FR FR FR ?? FR ?? RU RU RU ?? ?? RU ?? US ?? US ?? ?? ?? ?? (US Navy) NZ NZ NZ NZ NZ NZ NZ KR KR KR KR KR ?? KR UA UA UA UA UA UA UA

Hacked site