DISC-Finder: A distributed algorithm for identifying galaxy clusters.


Friends-of-Friends (FoF) technique: Identification of galaxy clusters
-Two galaxies are “friends” if they are close to each other
-We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges
-We need to identify its connected components
Sequential algorithms
-Exact: O((n ∙ log n)^1.5)
-Approximate: O(n)
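The graph view above can be illustrated with a minimal Python sketch. It is not one of the sequential algorithms cited on the slide: it checks every pair of galaxies (O(n^2)) and joins friends with union-find. The names `friends_of_friends` and `linking_length` are assumptions made for this example.

```python
from itertools import combinations
from math import dist


def find(parent, i):
    """Return the representative of i, shortening the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def friends_of_friends(points, linking_length):
    """Brute-force FoF: connected components of the 'friendship' graph.

    Two points are friends if their distance is at most linking_length
    (a hypothetical parameter name). Returns clusters as lists of indices.
    """
    parent = list(range(len(points)))
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                parent[rj] = ri  # merge the two components
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


galaxies = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
print(friends_of_friends(galaxies, linking_length=0.6))
# prints [[0, 1, 2], [3]]
```

Galaxies 0 and 2 are not friends directly, but both are friends of galaxy 1, so all three land in one cluster.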

Distributed procedure
Load balancing: Divide the space into “slightly overlapping” cubes
-Randomly select a subset of galaxies
-Apply the kd-tree construction to build a balanced partition for the subset
-Use it for the full set of galaxies
Distributed computation: Apply a sequential FoF algorithm to find the clusters within each cube
-Use any sequential FoF
-Allocate different cores to cubes
Identify cross-cube edges and merge the respective clusters
-Apply the union-find algorithm to the galaxies in the cube overlaps
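The load-balancing step might be sketched as follows. This is an illustration under assumptions, not the DISC-Finder code: the functions `build_kd_partition` and `assign_cells`, the `margin` parameter, the 2-D points, and the sample size are all hypothetical. The sketch builds median splits from a random sample, then routes every galaxy to each cell whose margin-expanded region contains it, so galaxies near a split land in more than one cell.

```python
import random


def build_kd_partition(sample, depth_limit, depth=0):
    """Median-split the sample along alternating axes.

    Returns None for a leaf cell, else (axis, split, left, right).
    """
    if depth == depth_limit or len(sample) < 2:
        return None
    axis = depth % len(sample[0])
    ordered = sorted(sample, key=lambda p: p[axis])
    mid = len(ordered) // 2
    split = ordered[mid][axis]
    return (axis, split,
            build_kd_partition(ordered[:mid], depth_limit, depth + 1),
            build_kd_partition(ordered[mid:], depth_limit, depth + 1))


def assign_cells(tree, point, margin, cell=0):
    """Ids of all leaf cells whose region, widened by margin, holds point."""
    if tree is None:
        return [cell]
    axis, split, left, right = tree
    cells = []
    if point[axis] < split + margin:
        cells += assign_cells(left, point, margin, 2 * cell + 1)
    if point[axis] >= split - margin:
        cells += assign_cells(right, point, margin, 2 * cell + 2)
    return cells


random.seed(0)
galaxies = [(random.random(), random.random()) for _ in range(1000)]
sample = random.sample(galaxies, 100)      # random subset of galaxies
tree = build_kd_partition(sample, depth_limit=2)

margin = 0.05                              # width of the overlap (assumed)
cells = {}
for g in galaxies:                         # apply the partition to the full set
    for c in assign_cells(tree, g, margin):
        cells.setdefault(c, []).append(g)
print({c: len(members) for c, members in sorted(cells.items())})
```

Because the splits are medians of the sample, the cells should hold roughly equal numbers of galaxies when the sample is representative; with a margin of zero every galaxy falls into exactly one cell.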

Distributed procedure
[Diagram labels: galaxy sets; divide the space into cubes; apply local sequential FoF; local clusters]
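A small end-to-end sketch of this flow, including the later merge of local clusters, is given below. It is a hedged illustration rather than the authors' implementation: the two "cubes" are split at a hand-picked boundary, `local_fof` is a brute-force stand-in for any sequential FoF, and `merge_local_clusters` is a hypothetical helper. It assumes the overlap margin is at least the linking length, so every cross-cube friendship is visible inside at least one cube.

```python
from math import dist


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def local_fof(cube, linking_length):
    """Brute-force FoF inside one cube (dict: galaxy id -> position)."""
    uf = UnionFind()
    ids = sorted(cube)
    for i, a in enumerate(ids):
        uf.find(a)
        for b in ids[i + 1:]:
            if dist(cube[a], cube[b]) <= linking_length:
                uf.union(a, b)
    groups = {}
    for a in ids:
        groups.setdefault(uf.find(a), set()).add(a)
    return list(groups.values())


def merge_local_clusters(local_clusters):
    """Join local clusters that share a galaxy (one lying in an overlap)."""
    uf = UnionFind()
    first_label = {}  # galaxy id -> first local cluster that contained it
    for label, cluster in enumerate(local_clusters):
        uf.find(label)
        for g in cluster:
            if g in first_label:
                uf.union(first_label[g], label)
            else:
                first_label[g] = label
    merged = {}
    for label, cluster in enumerate(local_clusters):
        merged.setdefault(uf.find(label), set()).update(cluster)
    return sorted(sorted(c) for c in merged.values())


galaxies = {0: (0.5, 0.0), 1: (0.75, 0.0), 2: (0.95, 0.0),
            3: (1.2, 0.0), 4: (3.0, 0.0), 5: (1.45, 0.0)}
linking_length = margin = 0.3
boundary = 1.0  # hand-picked split between the two cubes

cube_a = {g: p for g, p in galaxies.items() if p[0] < boundary + margin}
cube_b = {g: p for g, p in galaxies.items() if p[0] >= boundary - margin}

local = local_fof(cube_a, linking_length) + local_fof(cube_b, linking_length)
print(merge_local_clusters(local))
# prints [[0, 1, 2, 3, 5], [4]]
```

Galaxies 1, 2, and 3 lie in the overlap and appear in local clusters of both cubes, which is what lets union-find stitch the two halves of the chain together.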

Advantages
-Scalable: We can apply it to massive datasets and use all available cores
-Black-box use of a sequential FoF: We can utilize any FoF algorithm
-Hadoop friendly: We have mapped all main operations into the Hadoop framework, which has resulted in very compact code (800 lines)

Scalability
[Plot: time (min) versus number of cores, with curves for 500 mln, 1000 mln, and 14,800 mln galaxies]