Mining social networks for knowledge management Prabhakar Raghavan.

Overview Sampling of social network studies The knowledge management challenge › Enterprise complications Extended social networks › Network and tensor-based models Power law behaviors › Why, and what they mean for mining A research agenda

Milgram’s experiments Began with volunteers from Omaha, NE. Asked to get a letter to a physician near Boston. Could only send to first-name acquaintance, to be forwarded etc. Median path length of successful deliveries was 6. Led to famous “6 degrees of separation” folklore.

Email cliques: Schwartz/Wood Studied the email (sub)graph. Proposed metrics for groups of people to share interests; cluster analysis. Qualitatively “good” results. Raised issues of ethical use of data and privacy.

Various other projects PHOAKS › Extracting heavily cited resources in newsgroups, etc. Call graphs › Discerning home, business and fax lines › Calling circles. Recommendation systems › Input: users’ product endorsements. › Output: product recommendations to each user.

Trawling bipartite cliques Take the (directed) Web link graph. Enumerate all (small) bipartite cliques. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins (1998).

Insights from hubs Link-based hypothesis: Dense bipartite subgraph ⇒ Web community. [Figure: hubs (fans) pointing to authorities (centers).]

Communities from cores What is a “dense bipartite subgraph”? Define (i,j)-core: complete bipartite subgraph with i nodes all of which point to each of j others. Enumerate (i,j)-cores for various small i,j. (2,3) core
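The (i,j)-core definition above can be made concrete with a brute-force sketch. This is only an illustration on a toy graph, not the trawling algorithm itself, which relies on pruning to scale to the Web; the function name `enumerate_cores` and the `toy_web` data are hypothetical.

```python
from itertools import combinations


def enumerate_cores(out_links, i, j):
    """Return all (i, j)-cores as (fans, centers) pairs of frozensets.

    out_links maps each page to the set of pages it links to.
    A core is a set of i fans that all link to each of j centers.
    """
    cores = []
    pages = sorted(out_links)
    for fans in combinations(pages, i):
        # Candidate centers: pages linked to by every fan, excluding the fans.
        common = set.intersection(*(set(out_links[f]) for f in fans)) - set(fans)
        for centers in combinations(sorted(common), j):
            cores.append((frozenset(fans), frozenset(centers)))
    return cores


# Hypothetical toy link graph: a and b both point to x, y and z.
toy_web = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z"},
    "c": {"x", "y"},
    "x": set(),
    "y": set(),
    "z": set(),
}
cores_2_3 = enumerate_cores(toy_web, 2, 3)
```

On this toy graph the only (2,3)-core has fans a and b and centers x, y and z. The cost grows combinatorially with the number of pages, which is why enumeration is restricted to small i and j.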

Results for cores

Japanese Elementary Schools [Slide table of fans and centers for one core. Most page titles were Japanese text that did not survive encoding. Legible titles include: The American School in Japan; The Link Page; Kids' Space; KEIMEI GAKUEN Home Page (Japanese); Shiranuma Home Page; fuzoku-es.fukui-u.ac.jp; welcome to Miasa E&J school; fukui haruyama-es HomePage; Torisu primary school; Yakumo Elementary, Hokkaido, Japan; FUZOKU Home Page; Kamishibun Elementary School; schools LINK Page-13; 100 Schools Home Pages (English); K-12 from Japan; Koulutus ja oppilaitokset; TOYODA HOMEPAGE; Cay's Homepage (Japanese); DRAGON97-TOP.]

Yenta: Foner Analyzes documents “associated” with each user. Distils significant “interests” for each. Matches/clusters groups of users with overlapping interests. Decentralized; aims for privacy protection. Elements of peer-to-peer operation.

ReferralWeb: Kautz/Selman Establishes links between people, e.g., › co-authorship › colleagues in an organization Allows search through this social network, e.g., › find me someone within distance 2 who will referee a paper on xyz...
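A query such as "someone within distance 2" can be sketched as a bounded breadth-first search over the social graph. This is a minimal illustration under assumed data structures, not ReferralWeb's actual implementation; the names `people_within`, `find_experts`, `graph` and `expertise` are hypothetical.

```python
from collections import deque


def people_within(graph, start, max_dist):
    """Breadth-first search: map each person within max_dist hops to a distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if dist[person] == max_dist:
            continue  # do not expand beyond the distance bound
        for neighbor in graph.get(person, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[person] + 1
                queue.append(neighbor)
    del dist[start]
    return dist


def find_experts(graph, expertise, start, topic, max_dist=2):
    """Return people within max_dist of start who list topic, nearest first."""
    reachable = people_within(graph, start, max_dist)
    hits = [(d, p) for p, d in reachable.items() if topic in expertise.get(p, ())]
    return [p for _, p in sorted(hits)]


# Hypothetical undirected network (each link stored in both directions).
graph = {
    "me": ["ann", "bob"],
    "ann": ["me", "cat"],
    "bob": ["me", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}
expertise = {
    "cat": {"expander graphs"},
    "eve": {"expander graphs"},
    "bob": {"databases"},
}
referees = find_experts(graph, expertise, "me", "expander graphs")
```

Here only cat qualifies, because eve is three hops away. Edge types and strengths could be handled by filtering or weighting the adjacency lists before the search.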

Who can I ask to review a paper on “expander graphs”? Source: H. Kautz & B. Selman

Paths to Experts Source: H. Kautz & B. Selman

Observations Source: H. Kautz & B. Selman The official company hierarchy is only a sparse subset of the corporate social network. Shortest (and often best) paths involve a combination of official and unofficial links. › Conditions for trust and evaluation may greatly differ. › The global social network is the union of many different kinds of sub-networks. Search is greatly aided when the user can choose different views of the network: › types of edge › strength of edge

Knowledge management The big challenge: Increase productivity in knowledge workers by getting them the expertise they need at all times › the right information (documents?) › the right experts. Enterprise: group of people engaged in a collective endeavour, typically with proprietary content.

Enterprise knowledge mgmt. Examples - Schwartz/Wood, ReferralWeb, Yenta, PHOAKS all have some applicability. › ReferralWeb was originally devised and deployed at AT&T Labs. Enterprise knowledge management introduces some novel challenges.

Challenges in the enterprise Information resides in heterogeneous › formats (email, pdf, word, …) › repositories (Lotus, Exchange, Documentum, databases …) › applications (HR, ERP, Siebel, …) Need to combine structured relations (from applications) with learning.

Challenges in the enterprise Data security: information units have many different access classes. › e.g., compound documents have pieces, each with its own access lists. › My search should hit the doc only if it hits the pieces I can see. Knowledge security is the deeper issue. › Learning: consider class models learned from security-limited information. › Inferences (what does a recommendation tell me about confidential data?)
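The rule that a search should hit a document only through the pieces the searcher can see can be sketched as a per-piece access check. This is a simplified illustration assuming group-based access lists and whitespace tokenization; `search_visible` and the sample `documents` are hypothetical.

```python
def search_visible(documents, user_groups, term):
    """Return ids of documents where term occurs in a piece the user may read.

    documents maps a doc id to a list of (text, allowed_groups) pieces.
    """
    term = term.lower()
    hits = []
    for doc_id, pieces in documents.items():
        for text, allowed_groups in pieces:
            # A piece counts only if the user shares a group with its access list.
            readable = bool(allowed_groups & user_groups)
            if readable and term in text.lower().split():
                hits.append(doc_id)
                break  # one visible matching piece is enough
    return hits


# Hypothetical compound documents, each piece with its own access list.
documents = {
    "plan": [
        ("public roadmap summary", {"staff", "finance"}),
        ("confidential merger budget", {"finance"}),
    ],
    "memo": [("merger rumours are unfounded", {"staff"})],
}
staff_hits = search_visible(documents, {"staff"}, "merger")
finance_hits = search_visible(documents, {"finance"}, "merger")
```

A staff member searching for "merger" sees only the memo, even though the plan also contains the term in a restricted piece. This addresses data security only; the knowledge-security problem of leaking information through learned models is not covered by such a filter.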

General formulation How do we combine different sources of content and context? › terms in docs › links between docs › users’ access patterns › users’ profile information.

General formulation Every item of interest - each term, query, doc, person, treated as a node. Impose similarity metric between pairs of nodes. Need to be able to measure proximity from sets of nodes (a person+a doc they’re viewing+a query they’ve issued) to nodes of a target type (a person).

Issues in formulation If a user is close to two docs d1 and d2, are the docs d1 and d2 close to each other? How do you measure proximity from a set of nodes? How do you capture collaborative (as opposed to content- and context-based) filtering? How do you succinctly represent and manipulate similarities?

Graph-based models Each node an entity, associated with a set of features. Pairwise similarities based on feature matches. Issues: › Not easy to do proximity from sets of nodes. › Have to maintain (quadratic) pairwise information. › Consistency.

Tensor-based models Turn every entity into a vector. Axes are terms, profile features, … Combination of user, context++ becomes a tensor. Measure proximity to tensors of a certain type (e.g., user, doc recommendation).

Context with content Docs’ content captured in term axes. Other attributes (user profile, etc.) captured in other axes. A probe consists of 1 : a tensor t (say, a user vector plus a query) 2 : a type of vector to be retrieved (say, a user). Result = vectors of chosen type closest to t.
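A probe of this kind can be sketched with sparse vectors and cosine similarity. This is a deliberate simplification: the combination of a user vector and a query is modeled as a plain vector sum rather than a true tensor, and cosine similarity is an assumed choice of proximity. All names (`cosine`, `combine`, `probe`, `entities`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(axis, 0.0) for axis, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combine(*vectors):
    """Sum sparse vectors axis by axis (stand-in for the tensor combination)."""
    total = {}
    for vec in vectors:
        for axis, w in vec.items():
            total[axis] = total.get(axis, 0.0) + w
    return total


def probe(entities, t, target_type, k=1, exclude=()):
    """Return the k entities of target_type whose vectors are closest to t."""
    scored = [
        (cosine(t, vec), name)
        for name, (etype, vec) in entities.items()
        if etype == target_type and name not in exclude
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for _, name in scored[:k]]


# Hypothetical entities: axes mix term features and profile features.
entities = {
    "alice": ("user", {"term:graphs": 1.0, "dept:research": 1.0}),
    "bob": ("user", {"term:graphs": 2.0, "term:expanders": 1.0, "dept:research": 1.0}),
    "carol": ("user", {"term:databases": 2.0, "dept:sales": 1.0}),
    "doc_expanders": ("doc", {"term:expanders": 3.0, "term:graphs": 1.0}),
    "doc_sql": ("doc", {"term:databases": 2.0}),
}
t = combine(entities["alice"][1], {"term:expanders": 1.0})
nearest_user = probe(entities, t, "user", k=1, exclude={"alice"})
nearest_doc = probe(entities, t, "doc", k=1)
```

The same probe vector retrieves either a person or a document depending on the requested type, which is the point of placing all entities in one space.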

“Standard” mining tricks Dimensionality reduction - for collaborative filtering. Hierarchical clustering - for fast near-neighbor search. Incremental indexes - real-time updates.

Upshot Verity social networks project. [Screenshots not reproduced.] Security issues remain thorny. What aspects of social behavior can we exploit in the algorithms? › Power laws

Power laws in mining

Recurring phenomena Many interesting distributions › term frequencies in a corpus › citations › in-links to web pages › document access frequencies … follow an inverse polynomial function.

Zipf versus power laws We call a distribution on the positive integers › a power law if it’s of the form $p(i) \sim 1/i^{\alpha}$. › a Zipf law if $p(i) \sim 1/j^{\alpha}$ where j is the rank of i. Typically $\alpha > 1$.
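One rough way to check a rank-based (Zipf) law on data is to regress log-frequency on log-rank. The sketch below uses ordinary least squares, which is a crude estimator chosen here for brevity rather than a recommended fitting method; `zipf_exponent` and the sample counts are hypothetical.

```python
import math


def zipf_exponent(counts):
    """Estimate the Zipf exponent as minus the log-log slope of frequency vs. rank."""
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two items to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic counts constructed so that frequency = 1200 / rank.
counts = {"a": 1200, "b": 600, "c": 400, "d": 300, "e": 240, "f": 200}
exponent = zipf_exponent(counts)
```

Because the synthetic counts are exactly proportional to 1/rank, the fitted exponent is 1 up to floating-point error. Real corpora are noisier, especially in the tail.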

Other Zipf/power laws Populations of US cities Degrees of internet nodes See

What leads to power laws? “Scale free” growth. “Highly optimized tolerance”. Behavioral models. › Model behavior of individuals in social network.

In-degrees on the Web graph Web in-degrees are distributed as $p(i) \sim 1/i^{2.1}$. › Consistently across many independent studies. Erdos-Renyi random graphs would not lead to such a power law. Need a new stochastic model for such graphs.

Random replication graphs Central thesis - random replication in an evolving graph. › Some page creators create content without regard to what exists on the web. › Many are inspired by pre-existing content. › i.e., some links are random, others are copied from pre-existing pages.

Model details Evolution: Nodes are created in a sequence of discrete time steps › e.g. at each time step, a new node is created with d=O(1) out-links Probabilistic copying › links go to random nodes with probability $\beta$ › copy d links from a random “existing” node with probability $1-\beta$
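The evolution described above can be simulated in a few lines. This sketch follows the slide's per-node description (either all d links are random or all d are copied); the seed graph, the parameter name `copy_prob`, and the function names are assumptions made for illustration.

```python
import random


def copying_graph(n, d, copy_prob, seed=0):
    """Grow a directed graph one node per step; return each node's out-link list."""
    if n < d + 1:
        raise ValueError("need at least d + 1 nodes for the seed graph")
    rng = random.Random(seed)
    # Seed graph (an assumption): d + 1 nodes, each linking to the other d.
    out_links = [[v for v in range(d + 1) if v != u] for u in range(d + 1)]
    for new_node in range(d + 1, n):
        if rng.random() < copy_prob:
            # Copy all d links from a uniformly chosen existing node.
            prototype = rng.randrange(new_node)
            links = list(out_links[prototype])
        else:
            # Link to d existing nodes chosen uniformly at random.
            links = [rng.randrange(new_node) for _ in range(d)]
        out_links.append(links)
    return out_links


def in_degrees(out_links):
    """Count incoming links for every node."""
    degrees = [0] * len(out_links)
    for links in out_links:
        for target in links:
            degrees[target] += 1
    return degrees


graph = copying_graph(n=2000, d=3, copy_prob=0.8, seed=1)
degrees = in_degrees(graph)
```

Plotting a histogram of `degrees` on log-log axes for different values of `copy_prob` is one way to explore how copying affects the tail of the in-degree distribution.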

Theoretical Results New model yields › convergence to power-law in-degrees; › a number of bipartite cliques that grows with time. Evolution without copying would not yield these phenomena. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal (2000).

Compound structures “First order” structures (term frequencies, in-degrees, citations) exhibit power laws. What about “higher order” structures (pairs of terms, bipartite cliques, etc.)? Motivations: › Criteria for mining interesting higher order structures. › Tuning algorithms for higher order mining.

Pair frequencies for terms Analyzed several corpora of news items. Studied frequencies of k-tuples of terms (k=1, 2, …) in › corpus › documents › sentences › windows of width w. Ongoing work with P. Tsaparas.

Sentence log-rank vs. log-frequency

Pair distributions Based on term frequencies, compute pair frequencies under independence assumption. Measure actual pair frequencies. › Outliers under mutual information measure. Higher order outliers: useful for building clusters/concept maps in corpora.
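The comparison of actual pair frequencies against the independence assumption can be sketched with pointwise mutual information at the sentence level. The choice of pointwise mutual information and of sentences as the co-occurrence unit are assumptions for this example; `pair_scores` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_scores(sentences):
    """Score each co-occurring term pair by log2(actual / expected under independence)."""
    n = len(sentences)
    term_count = Counter()
    pair_count = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence))  # count each term once per sentence
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))
    scores = {}
    for (a, b), actual in pair_count.items():
        # Expected number of co-occurrences if a and b appeared independently.
        expected = term_count[a] * term_count[b] / n
        scores[(a, b)] = math.log2(actual / expected)
    return scores


# Hypothetical tokenized sentences.
sentences = [
    ["social", "network"],
    ["social", "network"],
    ["power", "law"],
    ["power", "network"],
]
scores = pair_scores(sentences)
```

Pairs with large positive scores co-occur more often than independence predicts and are the candidate outliers. Pairs never observed together are not scored in this sketch.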

Pairs: independence vs. actual

Computational speedup Inspired by pruning algorithms in trawling. As higher order associations are built up, keep discarding obviated terms. › Docs keep getting shorter. › Fit in memory quickly. › Not an issue with relational tables?

A research agenda New ways of combining content, context and collaboration in the social network. Analyze and model structures in social networks. › Tune algorithms on models. › Build on “standard” mining paradigms: associations, clustering... Incorporate enterprise constraints: › Roles and profile information from apps › Security and Privacy!