1
DISC-Finder: A distributed algorithm for identifying galaxy clusters
2
Friends-of-Friends (FoF) technique: Identification of galaxy clusters
Two galaxies are “friends” if they are close to each other
We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges
We need to identify its connected components
Sequential algorithms:
- Exact: O((n ∙ log n)^1.5)
- Approximate: O(n)
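The graph view above can be sketched in a few lines of Python. This is only an illustration, not the exact or approximate algorithm cited on the slide: it scans all pairs in O(n²) time, and the names `friends_of_friends` and `linking_length` (the distance below which two galaxies count as friends) are assumptions introduced here.

```python
from itertools import combinations
from math import dist


def friends_of_friends(points, linking_length):
    """Return clusters as sorted lists of point indices.

    A cluster is a connected component of the graph whose edges join
    every pair of points at most `linking_length` apart.
    """
    parent = list(range(len(points)))

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pair scan (illustration only, quadratic in n).
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


galaxies = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (10.5, 0, 0)]
print(friends_of_friends(galaxies, 1.2))  # [[0, 1, 2], [3, 4]]
```

Galaxies 0 and 2 are not friends directly (they are 2 units apart), but they end up in the same cluster through galaxy 1, which is what "friends of friends" means.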
3
Distributed procedure
Load balancing: Divide the space into “slightly overlapping” cubes
- Randomly select a subset of galaxies
- Apply the kd-tree construction to build a balanced partition for the subset
- Use it for the full set of galaxies
Distributed computation: Apply a sequential FoF algorithm to find the clusters within each cube
- Use any sequential FoF
- Allocate different cores to cubes
Identify cross-cube edges and merge the respective clusters
- Apply the union-find algorithm to the galaxies in the cube overlaps
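The three stages can be sketched as a single-process Python simulation. This is a minimal sketch under stated assumptions, not the Hadoop implementation from the talk: the function names, the brute-force local FoF, the overlap width (each box is padded by one linking length on every side), and the parameters `sample_size` and `max_depth` are all hypothetical choices made for illustration. In a real deployment each box would be handled by a separate core.

```python
from itertools import combinations
from math import dist
import random


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)


def kd_boxes(sample, lo, hi, depth, max_depth):
    """Split the box [lo, hi] at sample medians, cycling through the axes."""
    if depth == max_depth or len(sample) < 2:
        return [(lo, hi)]
    axis = depth % len(lo)
    cut = sorted(p[axis] for p in sample)[len(sample) // 2]
    left = [p for p in sample if p[axis] < cut]
    right = [p for p in sample if p[axis] >= cut]
    left_hi = hi[:axis] + (cut,) + hi[axis + 1:]
    right_lo = lo[:axis] + (cut,) + lo[axis + 1:]
    return (kd_boxes(left, lo, left_hi, depth + 1, max_depth)
            + kd_boxes(right, right_lo, hi, depth + 1, max_depth))


def local_fof(indices, points, linking_length):
    """Black-box sequential FoF (brute force here) on one box's galaxies."""
    uf = UnionFind(len(indices))
    for a, b in combinations(range(len(indices)), 2):
        if dist(points[indices[a]], points[indices[b]]) <= linking_length:
            uf.union(a, b)
    groups = {}
    for a, idx in enumerate(indices):
        groups.setdefault(uf.find(a), []).append(idx)
    return list(groups.values())


def distributed_fof(points, linking_length, sample_size=100, max_depth=3, seed=0):
    dims = len(points[0])
    lo = tuple(min(p[k] for p in points) for k in range(dims))
    hi = tuple(max(p[k] for p in points) for k in range(dims))

    # Load balancing: build the partition from a random subset only.
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    boxes = kd_boxes(sample, lo, hi, 0, max_depth)

    merged = UnionFind(len(points))
    for box_lo, box_hi in boxes:  # one iteration per core in a real run
        # Pad the box so neighbouring boxes overlap slightly.
        members = [
            i for i, p in enumerate(points)
            if all(box_lo[k] - linking_length <= p[k] <= box_hi[k] + linking_length
                   for k in range(dims))
        ]
        # Galaxies in the overlaps appear in several local clusters,
        # so the unions below stitch those clusters together.
        for cluster in local_fof(members, points, linking_length):
            for idx in cluster[1:]:
                merged.union(cluster[0], idx)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(merged.find(i), []).append(i)
    return sorted(clusters.values())
```

Padding each box by one linking length means that every pair of friends lands together in at least one box, so the merged result matches what a single global FoF pass would produce; the price is that galaxies near box boundaries are processed more than once.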
4
Distributed procedure (diagram): divide the space into cubes → galaxy sets → apply local sequential FoF → local clusters
5
Advantages
- Scalable: We can apply it to massive datasets and use all available cores
- Black-box use of a sequential FoF: We can utilize any FoF algorithm
- Hadoop friendly: We have mapped all main operations into the Hadoop framework, which has resulted in very compact code (800 lines)
6
Scalability
[Plot: Time (min), axis ticks 4, 15, 60, 240, versus Number of cores, axis ticks 8, 16, 32, 64, 128, 256; one curve each for 500 mln, 1000 mln, and 14,800 mln galaxies]