Big Data Infrastructure


Big Data Infrastructure
CS 489/698 Big Data Infrastructure (Winter 2017)
Week 5: Analyzing Graphs (2/2)
February 2, 2017
Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo
These slides are available at http://lintool.github.io/bigdata-2017w/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Parallel BFS in MapReduce
Data representation:
  Key: node n
  Value: d (distance from start), adjacency list
  Initialization: for all nodes except for start node, d = ∞
Mapper:
  ∀ m ∈ adjacency list: emit (m, d + 1)
  Remember to also emit distance to yourself
Sort/Shuffle:
  Groups distances by reachable nodes
Reducer:
  Selects minimum distance path for each reachable node
  Additional bookkeeping needed to keep track of actual path
Remember to pass along the graph structure!
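The mapper, shuffle, and reducer described above can be sketched in plain Python. This is a minimal single-process sketch, not Hadoop code: the function names (`bfs_map`, `bfs_reduce`, `bfs_iteration`, `parallel_bfs`) and the `"graph"`/`"dist"` tags are hypothetical, and a dictionary of lists stands in for the sort/shuffle phase.

```python
import math
from collections import defaultdict

def bfs_map(node, value):
    """Emit the node's own record plus a tentative distance for each neighbor."""
    dist, adj = value
    yield node, ("graph", adj)        # pass along the graph structure
    yield node, ("dist", dist)        # emit distance to yourself
    if dist != math.inf:              # unreached nodes have nothing to propagate
        for m in adj:
            yield m, ("dist", dist + 1)

def bfs_reduce(node, values):
    """Keep the minimum distance seen and reattach the adjacency list."""
    adj, best = [], math.inf
    for kind, payload in values:
        if kind == "graph":
            adj = payload
        else:
            best = min(best, payload)
    return node, (best, adj)

def bfs_iteration(graph):
    """One map/shuffle/reduce round over {node: (distance, adjacency list)}."""
    groups = defaultdict(list)        # stands in for sort/shuffle
    for node, value in graph.items():
        for key, val in bfs_map(node, value):
            groups[key].append(val)
    return dict(bfs_reduce(k, v) for k, v in groups.items())

def parallel_bfs(adjacency, start):
    """Driver loop: iterate until no distance changes."""
    graph = {n: (0 if n == start else math.inf, adj)
             for n, adj in adjacency.items()}
    while True:
        updated = bfs_iteration(graph)
        if updated == graph:          # converged
            return {n: d for n, (d, _) in updated.items()}
        graph = updated
```

For example, `parallel_bfs({"a": ["b"], "b": ["c"], "c": []}, "a")` assigns distances 0, 1, and 2 to `a`, `b`, and `c`. The sketch tracks distances only; recovering the actual path would need the extra bookkeeping the slide mentions.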

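Implementation Practicalities
[Figure: each iteration reads from HDFS, runs map and reduce, and writes back to HDFS; a "Convergence?" check decides whether to iterate again.]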

Visualizing Parallel BFS

Non-toy?

Application: Social Search Source: Wikipedia (Crowd)

Social Search
When searching, how to rank friends named “John”?
  Assume undirected graphs
  Rank matches by distance to user
Naïve implementations:
  Precompute all-pairs distances
  Compute distances at query time
Can we do better?

All Pairs?
Floyd-Warshall Algorithm: difficult to MapReduce-ify…
Multiple-source shortest paths in MapReduce:
  Run multiple parallel BFS simultaneously
  Assume source nodes { s0 , s1 , … sn }
  Instead of emitting a single distance, emit an array of distances, one per source
  Reducer selects minimum for each element in array
Does this scale?
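A minimal sketch of the array-of-distances idea, in plain Python rather than MapReduce. The function name `multi_source_bfs` is hypothetical, and the sketch assumes every node appears as a key in `adjacency`. For brevity the adjacency lists stay in a local dictionary instead of being passed through the shuffle.

```python
import math
from collections import defaultdict

def multi_source_bfs(adjacency, sources):
    """Run one BFS per source at once; each node carries a distance array."""
    dist = {n: [0 if n == s else math.inf for s in sources] for n in adjacency}
    while True:
        groups = defaultdict(list)
        for n, adj in adjacency.items():              # map phase
            groups[n].append(dist[n])                 # distance to yourself
            for m in adj:
                groups[m].append([d + 1 for d in dist[n]])
        # Reduce phase: element-wise minimum over all arrays sent to a node.
        updated = {n: [min(col) for col in zip(*arrays)]
                   for n, arrays in groups.items()}
        if updated == dist:                           # no entry changed
            return dist
        dist = updated
```

Each emitted value grows linearly with the number of sources, which is one way to read the slide's closing question about scale.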

Landmark Approach (aka sketches)
Select n seeds { s0 , s1 , … sn }
Compute distances from seeds to every node:
  Nodes   Distances to seeds
  A       [2, 1, 1]
  B       [1, 1, 2]
  C       [4, 3, 1]
  D       [1, 2, 4]
What can we conclude about distances?
Insight: landmarks bound the maximum path length
Run multi-source parallel BFS in MapReduce!
Lots of details:
  How to more tightly bound distances
  How to select landmarks (random isn’t the best…)
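One way to read the distance arrays above: in an undirected graph, a path through any seed gives an upper bound on the distance between two nodes. The sketch below computes that bound for the four example nodes. The helper name `landmark_bounds` is hypothetical, and the lower bound (from the triangle inequality) is an addition that the slide does not state.

```python
def landmark_bounds(du, dv):
    """Bounds on d(u, v) from distances to shared seeds (undirected graph)."""
    upper = min(a + b for a, b in zip(du, dv))       # go via the best seed
    lower = max(abs(a - b) for a, b in zip(du, dv))  # triangle inequality
    return lower, upper

# Distance arrays taken from the example above.
sketches = {"A": [2, 1, 1], "B": [1, 1, 2], "C": [4, 3, 1], "D": [1, 2, 4]}

for u, v in [("A", "B"), ("A", "C"), ("C", "D")]:
    print(u, v, landmark_bounds(sketches[u], sketches[v]))
```

When the two bounds coincide, as they do for A and C here, the sketch pins down the distance exactly without another traversal.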

Graphs and MapReduce (and Spark)
A large class of graph algorithms involves:
  Local computations at each node
  Propagating results: “traversing” the graph
Generic recipe:
  Represent graphs as adjacency lists
  Perform local computations in mapper
  Pass along partial results via outlinks, keyed by destination node
  Perform aggregation in reducer on inlinks to a node
  Iterate until convergence: controlled by external “driver”
  Don’t forget to pass the graph structure between iterations

(The original “secret sauce” for evaluating the importance of web pages) (What’s the “Page” in PageRank?)

Random Walks Over the Web
Random surfer model:
  User starts at a random Web page
  User randomly clicks on links, surfing from page to page
PageRank:
  Characterizes the amount of time spent on any given page
  Mathematically, a probability distribution over pages
Use in web ranking:
  Correspondence to human intuition?
  One of thousands of features used in web search

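PageRank: Defined
Given page x with inlinks t_1 … t_n:
  $$PR(x) = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$
where C(t) is the out-degree of t, \alpha is the probability of a random jump, and N is the total number of nodes in the graph.
[Figure: pages t_1, t_2, … t_n each link to page x.]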

Computing PageRank
Remember this? A large class of graph algorithms involves:
  Local computations at each node
  Propagating results: “traversing” the graph
Sketch of algorithm:
  Start with seed PR_i values
  Each page distributes PR_i mass to all pages it links to
  Each target page adds up mass from in-bound links to compute PR_{i+1}
  Iterate until values converge

Simplified PageRank
First, tackle the simple case:
  No random jump factor
  No dangling nodes
Then, factor in these complexities…
  Why do we need the random jump?
  Where do dangling nodes come from?

Sample PageRank Iteration (1)
[Figure: example graph at the start of iteration 1. Node labels shown: n1 (0.2), n3 (0.2), n4 (0.2), n5 (0.2). Edge masses shown: 0.1, 0.066, and 0.2.]

Sample PageRank Iteration (2)
[Figure: example graph at the start of iteration 2. Node labels shown: n1 (0.066), n3 (0.166), n4 (0.3), n5 (0.3). Edge masses shown: 0.033, 0.083, 0.1, 0.166, and 0.3.]
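The two sample iterations can be reproduced with a few lines of Python for the simplified case (no random jump, no dangling nodes). The adjacency lists for n1, n2, and n3 appear on the "PageRank in MapReduce" slide. Those for n4 and n5 are an assumption, chosen because they yield the node values shown in the sample iterations. The names `GRAPH`, `simplified_step`, and `iterate` are hypothetical.

```python
from collections import defaultdict

GRAPH = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],                 # assumed, not in the transcript
    "n5": ["n1", "n2", "n3"],     # assumed, not in the transcript
}

def simplified_step(graph, ranks):
    """Each page splits its mass evenly across its outlinks; targets sum it."""
    incoming = defaultdict(float)
    for node, outlinks in graph.items():
        share = ranks[node] / len(outlinks)
        for target in outlinks:
            incoming[target] += share
    return {node: incoming[node] for node in graph}

def iterate(graph, iterations):
    """Start from a uniform distribution and apply the step repeatedly."""
    ranks = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        ranks = simplified_step(graph, ranks)
    return ranks
```

Under these assumptions, `iterate(GRAPH, 1)` gives n1 about 0.066, n3 about 0.166, and n4 and n5 both 0.3, matching the labels in the second figure.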

PageRank in MapReduce
[Figure: Map and Reduce stages applied to adjacency lists, including n1 [n2, n4], n2 [n3, n5], n3 [n4].]

PageRank Pseudo-Code
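The pseudo-code on this slide did not survive in the transcript. As a stand-in, here is a minimal Python sketch of one simplified PageRank iteration split into a mapper and a reducer, assuming no random jump and no dangling nodes. It is not the slide's original pseudo-code: the names `pr_map`, `pr_reduce`, and `pr_iteration` and the `"graph"`/`"mass"` tags are hypothetical, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict

def pr_map(node, value):
    """Pass the graph structure along and send mass down each outlink."""
    rank, adj = value
    yield node, ("graph", adj)
    for m in adj:
        yield m, ("mass", rank / len(adj))

def pr_reduce(node, values):
    """Sum incoming mass and reattach the adjacency list."""
    adj, total = [], 0.0
    for kind, payload in values:
        if kind == "graph":
            adj = payload
        else:
            total += payload
    return node, (total, adj)

def pr_iteration(graph):
    """One map/shuffle/reduce round over {node: (rank, adjacency list)}."""
    groups = defaultdict(list)        # stands in for sort/shuffle
    for node, value in graph.items():
        for key, val in pr_map(node, value):
            groups[key].append(val)
    return dict(pr_reduce(k, v) for k, v in groups.items())
```

The shape mirrors parallel BFS: the mapper propagates a value along outlinks and the reducer aggregates per destination, with a sum in place of a minimum.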

PageRank vs. BFS
            PageRank   BFS
  Map       PR/N       d+1
  Reduce    sum        min
A large class of graph algorithms involves:
  Local computations at each node
  Propagating results: “traversing” the graph

Complete PageRank
Two additional complexities:
  What is the proper treatment of dangling nodes?
  How do we factor in the random jump factor?
Solution: second pass to redistribute “missing PageRank mass” and account for random jumps
  $$p' = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \left( \frac{m}{N} + p \right)$$
  p is the PageRank value from before, p' is the updated PageRank value
  N is the number of nodes in the graph
  m is the missing PageRank mass
  \alpha is the probability of a random jump
One final optimization: fold into a single MR job
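A minimal sketch of the second pass, assuming the update takes the form p' = α(1/N) + (1 − α)(m/N + p), which is how this correction is usually written for the symbols defined above. The function name `redistribute` is hypothetical, and the missing mass m is computed here as whatever the rank vector lacks to sum to 1.

```python
def redistribute(ranks, alpha):
    """Spread missing mass evenly over all nodes, then mix in the random jump."""
    n = len(ranks)
    missing = 1.0 - sum(ranks.values())   # mass lost at dangling nodes
    return {
        node: alpha / n + (1 - alpha) * (missing / n + p)
        for node, p in ranks.items()
    }
```

After this pass the values sum to 1 again, since the jump contributes α in total and the remaining terms contribute (1 − α)(m + Σp) = 1 − α.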

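Implementation Practicalities
[Figure: each iteration reads from HDFS and runs map, map, and reduce, writing to HDFS along the way; a "Convergence?" check decides whether to iterate again.]
What’s the optimization?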

PageRank Convergence
Alternative convergence criteria:
  Iterate until PageRank values don’t change
  Iterate until PageRank rankings don’t change
  Fixed number of iterations
Convergence for web graphs?
  Not a straightforward question
Watch out for link spam and the perils of SEO:
  Link farms
  Spider traps
  …
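The first two criteria might be checked as follows. This is a hedged sketch: "don't change" is interpreted as a total absolute change below a tolerance, the default `tol` is an arbitrary assumption, and both function names are hypothetical.

```python
def values_converged(old, new, tol=1e-6):
    """True when the total absolute change in PageRank values is below tol."""
    return sum(abs(new[n] - old[n]) for n in old) < tol

def rankings_converged(old, new):
    """True when sorting nodes by PageRank gives the same order both times."""
    def order(ranks):
        # Break ties by node name so the order is deterministic.
        return sorted(ranks, key=lambda n: (-ranks[n], n))
    return order(old) == order(new)
```

The ranking test can stop earlier than the value test, because small numeric changes often leave the order untouched.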

Log Probs
PageRank values are really small…
Solution?
  Product of probabilities = Addition of log probs
  Addition of probabilities?
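One common answer to the "addition of probabilities?" question is to add in log space using the identity log(a + b) = log a + log(1 + e^(log b − log a)), with a taken as the larger value. The sketch below assumes natural logarithms, and the function name `log_add` is hypothetical.

```python
import math

def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b), staying in log space."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a       # make log_a the larger one
    if log_b == -math.inf:                # adding a zero probability
        return log_a
    return log_a + math.log1p(math.exp(log_b - log_a))
```

Because the exponent `log_b - log_a` is never positive, `math.exp` cannot overflow, and the result stays accurate even when both probabilities are far too small to represent directly.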

More Implementation Practicalities How do you even extract the webgraph? Lots of details…

Beyond PageRank
Variations of PageRank:
  Weighted edges
  Personalized PageRank
Variants on graph random walks:
  Hubs and authorities (HITS)
  SALSA

Applications
  Static prior for web ranking
  Identification of “special nodes” in a network
  Link recommendation
  Additional feature in any machine learning problem

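Implementation Practicalities
[Figure: each iteration reads from HDFS and runs map, map, and reduce, writing to HDFS along the way; a "Convergence?" check decides whether to iterate again.]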

MapReduce Sucks
  Java verbosity
  Hadoop task startup time
  Stragglers
  Needless graph shuffling
  Checkpointing at each iteration
Spark to the rescue?

Let’s Spark!
[Figure: chain of MapReduce jobs: HDFS → map → reduce → HDFS → map → reduce → HDFS → map → reduce → HDFS → …]

[Figure: HDFS → map → reduce → map → reduce → map → reduce → …]

[Figure: HDFS → map → reduce → map → reduce → map → reduce → …, with Adjacency Lists and PageRank Mass shown as separate inputs to each stage.]

[Figure: HDFS → map → join → map → join → map → join → …, with Adjacency Lists and PageRank Mass shown as separate inputs to each stage.]

[Figure: Adjacency Lists and PageRank vector loaded from HDFS; each iteration applies join → flatMap → reduceByKey to produce the next PageRank vector; final result written to HDFS.]

Cache!
[Figure: same dataflow as before (join → flatMap → reduceByKey per iteration, reading from and writing to HDFS), with the Adjacency Lists marked as cached.]
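The join, flatMap, reduceByKey dataflow can be emulated in plain Python to show how the adjacency lists and the PageRank vector interact. This is a sketch under loud assumptions: it does not use Spark, and `join`, `flat_map`, and `reduce_by_key` are hypothetical list-based helpers that only mimic the shape of the RDD operations with those names. It also covers only the simplified case (no random jump, and every node has both inlinks and outlinks).

```python
from collections import defaultdict

def join(left, right):
    """Inner join of two lists of (key, value) pairs on key."""
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]

def flat_map(pairs, fn):
    """Apply fn to each pair and flatten the resulting lists."""
    return [out for pair in pairs for out in fn(pair)]

def reduce_by_key(pairs, fn):
    """Combine all values that share a key using fn."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return list(acc.items())

def spread(record):
    """Split a node's rank evenly across its outlinks."""
    _, (outlinks, rank) = record
    return [(dst, rank / len(outlinks)) for dst in outlinks]

def pagerank(adjacency, iterations):
    links = list(adjacency.items())   # built once and reused: the "cache" step
    ranks = [(n, 1.0 / len(links)) for n, _ in links]
    for _ in range(iterations):
        contribs = flat_map(join(links, ranks), spread)
        ranks = reduce_by_key(contribs, lambda a, b: a + b)
    return dict(ranks)
```

Only the rank vector changes between iterations, while `links` is reused untouched, which is the part of the dataflow the "Cache!" slide highlights.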

MapReduce vs. Spark Source: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-2-amp-camp-2012-standalone-programs.pdf

Spark to the rescue?
  Java verbosity
  Hadoop task startup time
  Stragglers
  Needless graph shuffling
  Checkpointing at each iteration
What have we fixed?

Still not particularly satisfying?
[Figure: Spark PageRank dataflow with cached Adjacency Lists: PageRank vector → join → flatMap → reduceByKey → PageRank vector → …, reading from and writing to HDFS.]

Stay Tuned! Source: https://www.flickr.com/photos/smuzz/4350039327/

Questions? Source: Wikipedia (Japanese rock garden)