Joint Laboratory of Astronomical Information Technology (天文信息技术联合实验室): New Progress On Astronomical Cross-Match Research. Zhao Qing.

Contents
Our Previous Function
New Improvements and Attempts
–Discussion of Adaptability on HTM-Indexed Data
–New function based on Boundary Growing Model
–Cross-match in distributed environment based on MapReduce model
Plan & Discussion

Our Previous Function
PHIXmatch: Paralleled HEALPix-Indexing Xmatch
Test dataset: SDSS (100 million) × 2MASS (470 million)
Function: spatial join
Result columns: SDSS_ID, Twomass_ID, Distance
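To make "spatial join" concrete, here is a minimal brute-force sketch in Python. It is not the PHIXmatch implementation: the function names, the `(id, ra_deg, dec_deg)` record layout, and the matching radius are assumptions chosen for illustration. It pairs every source in one catalog with every source in the other that lies within a given angular distance and returns rows shaped like the result columns above.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def spatial_join(catalog_a, catalog_b, radius_deg):
    """Brute-force join of two catalogs (hypothetical record layout).

    Each catalog is a list of (id, ra_deg, dec_deg) tuples. Returns
    (id_a, id_b, distance_deg) for every pair closer than radius_deg.
    """
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        for id_b, ra_b, dec_b in catalog_b:
            d = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if d <= radius_deg:
                matches.append((id_a, id_b, d))
    return matches
```

The nested loop costs one distance computation per pair of sources, which is why catalogs of this size need a spatial index such as HEALPix to restrict comparisons to nearby blocks.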

HEALPix Index Function
HEALPix: Hierarchical Equal Area isoLatitude Pixelization of a sphere.
Quadtree pixel numbering
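The quadtree numbering can be sketched with a few lines of Python. This assumes the HEALPix NESTED ordering, in which the sphere starts from 12 base pixels and each pixel splits into four children at the next level; the helper names are hypothetical and the slides do not say which ordering PHIXmatch uses.

```python
def npix(level):
    """Total pixel count at a level: 12 base pixels, each split 4**level times."""
    return 12 * 4 ** level

def children(pixel):
    """Indices of the four child pixels one level deeper (NESTED ordering)."""
    return [4 * pixel + k for k in range(4)]

def parent(pixel):
    """Index of the enclosing pixel one level shallower."""
    return pixel >> 2

def ancestor_at_level(pixel, level, target_level):
    """Index of the enclosing pixel at a coarser target_level <= level."""
    return pixel >> (2 * (level - target_level))
```

Because moving between levels is a two-bit shift, coarse blocks and fine pixels can be related without any trigonometry.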

What we have resolved
Resolved the border-block problem
A fast bitwise algorithm to deduce the neighbor blocks’ index numbers
Realized parallel cross-match computation in a multi-core environment
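The slides do not give the bitwise neighbor algorithm itself, so the following is only an illustrative sketch of the general idea, under the assumption that a block index inside one base face is the bit-interleaving of its (x, y) grid position, with x in the even bits. It handles neighbors inside a single face only; neighbors across face edges, which a real HEALPix implementation must also resolve, are simply omitted. All function names are hypothetical.

```python
def deinterleave(index, level):
    """Split an in-face index into (x, y): x from even bits, y from odd bits."""
    x = y = 0
    for bit in range(level):
        x |= ((index >> (2 * bit)) & 1) << bit
        y |= ((index >> (2 * bit + 1)) & 1) << bit
    return x, y

def interleave(x, y, level):
    """Inverse of deinterleave: weave the bits of x and y into one index."""
    index = 0
    for bit in range(level):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index

def in_face_neighbors(index, level):
    """Indices of up to 8 neighbors that lie in the same base face.

    Neighbors that would fall outside the face are dropped; this sketch
    does not resolve them.
    """
    side = 1 << level
    x, y = deinterleave(index, level)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            if 0 <= nx < side and 0 <= ny < side:
                result.append(interleave(nx, ny, level))
    return sorted(result)
```

Sources near a block border are then compared against the neighboring blocks as well as their own, which is one common way to address the border-block problem.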

Results & Performance Analysis
Function | Table A | Data amount of A | Table B | Data amount of B | Time | Finished amount/sec
PHIXmatch function | SDSS | 100,106,811 | 2MASS | … | … min | 52,139
GaoDan’s Function | Part of GSC | …,832 | Part of GSC | …,832 | 5.6 min | 880
Conclusion: PHIXmatch markedly outperforms the previous functions and is applicable to large-scale cross-match on multi-core systems.
Paper: Qing Zhao, Jizhou Sun, Ce Yu, Chenzhou Cui, Liqiang Lv, and Jian Xiao, A Paralleled Large-Scale Astronomical Cross-Matching Function, International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) 2009, LNCS 5574, pp. 604-614

Adaptability Research on HTM-Indexed Data
HTM: Hierarchical Triangular Mesh
Resolve the border-data problem in HTM

Results of the HTM version of Xmatch: 42 min
Why is the result poor compared with the HEALPix version? Answer: the triangle shape!

New function based on Boundary Growing Model
Database read operations are too time-consuming, especially for the border data!

MapReduce
A software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
–Huge datasets
–Distributable application
–Data stored either in a filesystem (unstructured) or within a database (structured)
Map step & Reduce step
–Map: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem and passes the answer back to its master node.
–Reduce: The master node then takes the answers to all the sub-problems and combines them to get the output: the answer to the problem it was originally trying to solve.
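The map and reduce steps can be sketched as a single-process Python simulation applied to a toy cross-match. This is not Hadoop code and not the authors' implementation: the `(catalog_tag, id, ra, dec)` record layout and the `block_of` and `match` callbacks are assumptions made for illustration. The map phase keys each source by its sky block, a shuffle groups sources by key, and the reduce phase compares the two catalogs within each block.

```python
from collections import defaultdict

def map_phase(records, block_of):
    """Emit (block_id, record) pairs; block_of is a hypothetical index function."""
    for record in records:
        yield block_of(record), record

def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, match):
    """Within each block, pair catalog 'A' sources with matching 'B' sources."""
    for _block_id, records in sorted(groups.items()):
        sources_a = [r for r in records if r[0] == "A"]
        sources_b = [r for r in records if r[0] == "B"]
        for rec_a in sources_a:
            for rec_b in sources_b:
                if match(rec_a, rec_b):
                    yield rec_a[1], rec_b[1]

def run_cross_match(records, block_of, match):
    """Chain the three phases and collect the matched id pairs."""
    return list(reduce_phase(shuffle(map_phase(records, block_of)), match))
```

For example, `block_of` could return `(int(ra), int(dec))` for a one-degree grid. A source close to a block edge would additionally have to be emitted under its neighboring blocks' keys, otherwise matches that straddle a border are missed; this sketch leaves that out.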

Map Step & Reduce Step
[Figure: data flow from Input through Map and Reduce to Result, with Chop/replicate and Shuffle/Sort stages]

Apache Hadoop
A Java software framework inspired by Google’s MapReduce and Google File System papers.
What function does it perform? Easy programming, auto scheduling, error detection & correction.
Who uses Hadoop?
–Yahoo!: web search; advertising businesses
–Amazon: S3, EC2
–IBM & Google: computation platform for universities
–Institute of Computing Technology, Chinese Academy of Sciences: PBminer
Page links: 1 T; output: over 300 TB, compressed! Number of cores in a job: over 10,000; disk in the cluster: over 5 P

Hadoop Architecture

Why use MapReduce for Xmatch?
Near-linear speedup, compared with an MPI cluster
Suitable for data-intensive, compute-intensive applications; low-cost!
It has been used in many data mining applications and may be useful for more complex cross-match functions.

Plan & Discussion
Service for larger data sets (TB) and various catalogs such as…
–Interfaces for more kinds of catalogs
–Additional measures to deal with TB-level data
–Parallelizing other cross-match functions

Joint Laboratory of Astronomical Information Technology (天文信息技术联合实验室)
Thank you! We need your help!