A Web service for Distributed Covariance Computation on Astronomy Catalogs Presented by Haimonti Dutta CMSC 691D.



ROADMAP
- Background Information
- Interesting Astronomy Data Mining Problems
- What has / has not been done (Literature review)
- My project objectives
- The problem of Alignment in astronomy catalogs
- The Fundamental Plane
- A case study for recreating the Fundamental Plane from astronomy catalogs
- Experimental Results
- Efforts towards building Web services

Background Information
- Next-generation astronomy catalogs will contain data for most of the sky
- Existing astronomy sky surveys: SDSS, 2MASS, FIRST, etc.
- Terabytes and petabytes of data
- Data avalanche in astronomy
- Getting useful information is like looking for a needle in a haystack
- The National Virtual Observatory (NVO) has been set up to facilitate scientific discovery
- Obvious need for distributed data mining

What kinds of data mining activities are astronomers interested in?
- Detection of transient objects such as supernovae (online transient object detection in real time)
- Obtaining statistics of variable and moving objects (model variability, refine existing models, fit models to irregularly sampled data)
- Parameterizing shapes of objects using rotationally invariant quantities
- Efficient cluster and outlier detection
- Supervised data mining problems (match objects detected in multiple bands, derive photometric redshifts)

What has / has not been done
- Many efforts in centralized data mining (NVO, FMass, Class X, FIRST, etc.)
- Some grid mining (notably the GRIST project)
- Very few distributed data mining efforts, and those are in their preliminary stages

Objectives of this project
- Alignment of catalogs (the Fundamental Plane problem)
- Implementation of algorithms for distributed data mining on astronomy catalogs
- Development of web services for the catalogs, and investigation into what needs to be done to integrate them into the NVO

Alignment of Astronomy Catalogs
Cross matching is a nontrivial problem in itself. We assume cross matching happens offline and that there exists an indexing scheme by which the catalogs know the exact cross-matched tuples.

Some interesting numbers
- The current SDSS catalog is 3.0 TB and contains about 180 million objects (as of Data Release 4)
- 2MASS has already observed 99% of the sky and reports 470,992,970 point sources and 1,647,599 extended sources
[Figure: portion of the sky observed by SDSS]

Problems
- Cross matching is an inherently difficult problem for astronomy catalogs
- We assume the data sets are cross matched and that this computation is done offline
- This is a strong assumption and may often be unacceptable to astronomers

A real-life cross-matching exercise: problems encountered
- Which catalogs to use?
- We tried several: SDSS, 2MASS, HyperLeda, CfA Redshift Catalog
- Catalogs have different indexing schemes: more recent ones use HTM (Hierarchical Triangular Mesh), others use (ra, dec) or even the names of objects
- Some attributes are really not available! (SDSS has for most of its redshift values)
- Different catalogs observe different portions of the sky (SDSS covers only about 16% of the sky in the latest release, while 2MASS covers the entire sky), so select the subsets to cross match wisely!

The successful cross matching
- Chose a region of the sky between 0 and 15 degrees (dec) and between 150 and 200 degrees (ra), observed by both SDSS and 2MASS
- Used a web interface provided by SDSS to do the cross matching
- Selected the K-band for obtaining redshift and surface brightness (astronomical significance)

Case Study
- Centralized database of 1249 cross-matched objects
- Attributes are size, surface brightness, and velocity dispersion
- Does not really make a case for a distributed data mining scenario!
- Solution: try a larger subset of the data from both catalogs
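The slides performed the cross match through an SDSS web interface, so the following is only an illustrative sketch of what a positional cross match over the chosen sky region involves. The record layout (`id`, `ra`, `dec` keys), the function names, and the 2-arcsecond tolerance are all assumptions introduced for the example, not details from the presentation.

```python
import math

def in_region(ra, dec, ra_min=150.0, ra_max=200.0, dec_min=0.0, dec_max=15.0):
    """True if (ra, dec), in degrees, lies in the region used in the slides."""
    return ra_min <= ra <= ra_max and dec_min <= dec <= dec_max

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, computed with the haversine formula."""
    r1, d1, r2, d2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2.0) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(cat_a, cat_b, tol_arcsec=2.0):
    """Brute-force nearest-neighbor match; tol_arcsec is an assumed tolerance."""
    tol_deg = tol_arcsec / 3600.0
    b_in_region = [o for o in cat_b if in_region(o["ra"], o["dec"])]
    matches = []
    for a in cat_a:
        if not in_region(a["ra"], a["dec"]):
            continue
        best, best_sep = None, tol_deg
        for b in b_in_region:
            sep = angular_separation_deg(a["ra"], a["dec"], b["ra"], b["dec"])
            if sep <= best_sep:
                best, best_sep = b, sep
        if best is not None:
            matches.append((a["id"], best["id"]))
    return matches
```

The quadratic loop is only workable for small subsets; catalogs of the sizes quoted earlier would need a spatial index such as the HTM scheme mentioned in the slides.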

The Fundamental Plane
- Interesting problem in astronomy: identify correlations in high-dimensional spaces
- For the class of elliptical and spiral galaxies
- Observed features: radius, mean surface brightness, and central velocity dispersion
- A two-dimensional plane exists in the observed 3D parameter space, called the Fundamental Plane

An illustration of the Fundamental Plane

Experimental Results
- First PC captured % of variance
- Second PC captured % of the variance
- The astronomy literature suggests the 1st and 2nd PCs together should capture about 88% of the variance
- Reasonably close recreation of the Fundamental Plane from two cross-matched data sets in the centralized setting
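For reference, the share of variance captured by each principal component can be read off the eigenvalues of the 3 x 3 covariance matrix of the three attributes. The sketch below is one way to compute it in plain Python, using the closed-form trigonometric method for symmetric 3 x 3 matrices. The function names are hypothetical, and whether the attributes should first be log-transformed or standardized is not stated in the slides, so no such preprocessing is applied here.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of rows of three attributes each."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(3)]
            for i in range(3)]

def symmetric_3x3_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix, largest first (trigonometric method)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3
    return [e1, e2, e3]

def explained_variance_fractions(rows):
    """Fraction of total variance captured by each principal component."""
    eig = symmetric_3x3_eigenvalues(covariance_matrix(rows))
    total = sum(eig)
    return [e / total for e in eig]
```

If the points lie close to a plane in the three-dimensional attribute space, the first two fractions sum to nearly 1 and the third is close to 0.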

Algorithm for Distributed Covariance Computation
- A central coordination site S sends A and B a random number generation seed
- A and B each generate the same n x l random matrix R, where l << n
- A and B send R^T A and R^T B to S
- S computes (R^T A)^T (R^T B) / n
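The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: the function names are hypothetical, R is assumed to have independent standard normal entries, and the coordinator divides by l, which under that assumption gives an estimate of A^T B. Dividing additionally by the number of rows n would turn this into a covariance estimate for mean-centered columns; which normalization the slide intends is not clear from the text.

```python
import random

def random_matrix(seed, n, l):
    """n x l matrix of standard normal entries, reproducible from the shared seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(l)] for _ in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]

def site_projection(seed, data, l):
    """Run at site A or B: project the local n x p data to an l x p matrix R^T data."""
    r = random_matrix(seed, len(data), l)
    return matmul(transpose(r), data)

def coordinator_estimate(proj_a, proj_b):
    """Run at site S: estimate A^T B from the two projections (assumed 1/l scaling)."""
    l = len(proj_a)
    product = matmul(transpose(proj_a), proj_b)
    return [[v / l for v in row] for row in product]
```

Because both sites seed their generators identically, they build the same R without ever exchanging it, and each site ships only l rows to S instead of n. The estimate is noisy; its error shrinks roughly as 1/sqrt(l), so l trades communication cost against accuracy.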

Experimental Results: Distributed Setting Case Study
- 1249 cross-matched objects at sites A and B
- 2 attributes at site A and 1 attribute at site B

More results

Development of a Web Service
Architecture of the Proposed System
[Diagram labels: Client, Site A, Site B, Web Service for Distributed Covariance Computation, SOAP message]

Current Implementation
- Uses Apache Axis (a SOAP engine: a framework for building SOAP processors such as clients and servers)
- Tomcat version 4.1
- SOAP version 1.2
- Short demo
- Further system development issues (use of SOAP with attachments)
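To give a feel for what travels between a site and the web service, here is a hedged sketch that builds a SOAP 1.2 envelope carrying a projected matrix. The actual service was built with Apache Axis in Java and its message schema is not shown in the slides, so the operation name `submitProjection`, the element names, and the `urn:example:distributed-covariance` namespace are purely hypothetical. Only the SOAP 1.2 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"  # standard SOAP 1.2 namespace
SERVICE_NS = "urn:example:distributed-covariance"      # hypothetical service namespace

def build_projection_request(site_id, projection):
    """Serialize a site's projected matrix as a SOAP 1.2 request (illustrative schema)."""
    ET.register_namespace("soap", SOAP12_NS)
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}submitProjection")
    ET.SubElement(request, f"{{{SERVICE_NS}}}site").text = site_id
    for row in projection:
        # One element per matrix row, values separated by spaces.
        ET.SubElement(request, f"{{{SERVICE_NS}}}row").text = " ".join(repr(v) for v in row)
    return ET.tostring(envelope, encoding="unicode")
```

Encoding matrix rows as text inside the body grows quickly with l, which is presumably one motivation for the SOAP-with-attachments item listed above.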

QUESTIONS ?