Joint Laboratory of Astronomical Information Technology
New Progress on Astronomical Cross-Match Research
Zhao Qing

Contents
Our Previous Function
New Improvements and Attempts
–Discussion of Adaptability on HTM-Indexed Data
–New function based on Boundary Growing Model
–Cross-match in distributed environment based on MapReduce model
Plan & Discussion

Our Previous Function
PHIXmatch: Paralleled HEALPix-Indexing Xmatch
Test dataset: SDSS (100 million) × 2MASS (470 million)
Function: spatial join
Results columns: SDSS_ID, Twomass_ID, Distance
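
To make the spatial join concrete, here is a minimal brute-force sketch in Python. It is not the PHIXmatch implementation: the function names, the (id, ra, dec) record layout, and the use of the haversine formula for the distance are assumptions chosen for illustration. It compares every pair, which is exactly the cost that an index such as HEALPix is meant to avoid.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def spatial_join(catalog_a, catalog_b, radius_deg):
    """Return (id_a, id_b, distance) for every pair closer than radius_deg.

    Records are hypothetical (id, ra_deg, dec_deg) tuples.
    Cost is O(|A| * |B|), so this is only usable on tiny inputs.
    """
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        for id_b, ra_b, dec_b in catalog_b:
            d = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if d <= radius_deg:
                matches.append((id_a, id_b, d))
    return matches
```

Each output row has the same shape as the results table on the slide: an identifier from each catalog plus the distance between the two objects.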

HEALPix Index Function
HEALPix: Hierarchical Equal Area isoLatitude Pixelization of a sphere.
Quadtree pixel numbering
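
The quadtree numbering can be illustrated with a short Python sketch. It assumes the nested numbering scheme and looks at a single base face only (the 12 base faces of the full sphere are ignored); the function names are hypothetical. In a quadtree, each pixel splits into four children whose numbers share the parent's number as a prefix, and the index inside a face is obtained by interleaving the bits of the pixel's x and y coordinates.

```python
def children(pixel):
    """The four child pixels at the next finer level."""
    return [4 * pixel + k for k in range(4)]

def parent(pixel):
    """The enclosing pixel at the next coarser level (drop two bits)."""
    return pixel >> 2

def interleave(x, y, order):
    """Index inside one face: x bits go to even positions, y bits to odd ones."""
    index = 0
    for bit in range(order):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index
```

Because nearby pixels share long index prefixes, sorting a catalog by this index tends to keep neighboring sky regions close together in storage.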

What we have resolved
Resolved the border-block problem
A fast bitwise-operation algorithm to deduce the neighbor blocks’ index numbers
Realized parallel cross-match computation in a multi-core environment
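
The slide does not give the neighbor algorithm itself, so the following Python sketch only shows the general idea under stated assumptions: a bit-interleaved index inside one square face, with neighbors that would fall on another face simply omitted. All function names are hypothetical. The index is split into x and y with bit operations, each coordinate is shifted by one step, and the result is interleaved again.

```python
def spread(value, order):
    """Move bit b of value to bit position 2*b."""
    out = 0
    for bit in range(order):
        out |= ((value >> bit) & 1) << (2 * bit)
    return out

def compact(index, order):
    """Inverse of spread: collect the even-position bits of index."""
    out = 0
    for bit in range(order):
        out |= ((index >> (2 * bit)) & 1) << bit
    return out

def neighbors_in_face(index, order):
    """Indices of the up to 8 blocks adjacent to `index` within one face."""
    side = 1 << order                  # blocks per face edge
    x = compact(index, order)          # even bits hold x
    y = compact(index >> 1, order)     # odd bits hold y
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            if 0 <= nx < side and 0 <= ny < side:   # stay inside this face
                result.append(spread(nx, order) | (spread(ny, order) << 1))
    return sorted(result)
```

Objects near a block border can then be compared against the objects in the neighboring blocks as well, which is what the border-block problem requires.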

Results & Performance Analysis
Function | Table A | Data amount of A | Table B | Data amount of B | Time | Records finished per second
PHIXmatch function | SDSS | 100,106,811 | 2MASS | … | … min | 52,139
GaoDan’s function | Part of GSC | …,832 | Part of GSC | …,832 | 5.6 min | 880
Conclusion
PHIXmatch markedly outperforms previous functions and is applicable to large-scale cross-matching on multi-core systems.
Paper: Qing Zhao, Jizhou Sun, Ce Yu, Chenzhou Cui, Liqiang Lv, and Jian Xiao, A Paralleled Large-Scale Astronomical Cross-Matching Function, International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) 2009, LNCS 5574, pp. 604-614

Adaptability Research on HTM-Indexed Data
HTM: Hierarchical Triangular Mesh
Resolved the border-data problem in HTM

Results of HTM version Xmatch
42 min
Why is the result poor compared with the HEALPix version?
Answer: the triangle shape!

New function based on Boundary Growing Model
Database read operations are too time-consuming, especially for the border data!

MapReduce
A software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
–Huge datasets
–Distributable application
–Data stored either in a filesystem (unstructured) or within a database (structured)
Map step & Reduce step
–Map: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
–Reduce: The master node then takes the answers to all the sub-problems and combines them to get the output: the answer to the problem it was originally trying to solve.
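
The map and reduce steps can be sketched for cross-matching in plain Python, without Hadoop. Everything here is an assumption made for illustration: the (id, ra, dec) record layout, the one-degree grid used as the block key in place of a HEALPix index, and the function names. The sketch also ignores the border problem, so pairs that straddle two blocks are missed.

```python
import math
from collections import defaultdict

def block_key(ra, dec, cell_deg=1.0):
    """Hypothetical block index: a coarse RA/Dec grid cell."""
    return (int(ra // cell_deg), int(dec // cell_deg))

def map_step(tag, catalog):
    """Map: emit (block key, tagged record) for each catalog object."""
    for obj_id, ra, dec in catalog:
        yield block_key(ra, dec), (tag, obj_id, ra, dec)

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def reduce_step(values, radius_deg):
    """Reduce: match catalog A against catalog B inside one block."""
    a = [v for v in values if v[0] == "A"]
    b = [v for v in values if v[0] == "B"]
    return [(id_a, id_b)
            for _, id_a, ra_a, dec_a in a
            for _, id_b, ra_b, dec_b in b
            if separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg]

def cross_match(catalog_a, catalog_b, radius_deg):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = list(map_step("A", catalog_a)) + list(map_step("B", catalog_b))
    matches = []
    for values in shuffle(pairs).values():
        matches.extend(reduce_step(values, radius_deg))
    return sorted(matches)
```

In a real cluster each block's reduce call could run on a different worker, since blocks share no state once the shuffle has grouped the records.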

Map Step & Reduce Step
[Figure: data flow with stages Input, Map, Reduce, Result; transitions labeled Chop/replicate and Shuffle/Sort]

Apache Hadoop
A Java software framework inspired by Google’s MapReduce and Google File System papers.
What function does it perform? Easy programming, automatic scheduling, error detection & correction
Who uses Hadoop?
–Yahoo! – web search; advertising businesses
–Amazon – S3, EC2
–IBM & Google – computation platform for universities
–Institute of Computing Technology, Chinese Academy of Sciences – PBminer
Page links: 1 T; output: over 300 TB, compressed; number of cores in a job: over 10,000; disk in the cluster: over 5 PB

Hadoop Architecture

Why use MapReduce for Xmatch
Near-linear speedup, compared with an MPI cluster
Suitable for data-intensive, compute-intensive applications; low-cost!
Has been used in many data mining applications; may be useful for more complex cross-match functions.

Plan & Discussion
Service for larger data sets (TB) and various catalogs such as…
–Interfaces for more kinds of catalogs
–Additional measures to deal with TB-level data
–Parallelizing other cross-match functions

Joint Laboratory of Astronomical Information Technology
Thank you! We need your help!