Clive Page, 2003 May 24: Implementation of XMATCH function


Cross-matching
Cross-matching is very important functionality: by combining datasets we often get new scientific results.
In DBMS terms it needs a spatial join, where the join criterion is the overlap of the error-regions.
An error-region is always a small patch of sky, never just a point, because of errors of measurement, extended objects, proper motions, etc.
Shapes of error-regions vary: often elliptical, sometimes circular, occasionally more complex.
The size depends on the confidence level of the match, often expressed as, say, x% confidence or y-sigma (the latter assumes some error distribution, e.g. Gaussian).
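To illustrate how the size of an error-region depends on the chosen confidence level, here is a minimal sketch that assumes a circular, two-dimensional Gaussian error distribution (an assumption made only for this example; the slides note that real error-regions are often elliptical). The function names are hypothetical.

```python
from math import exp, log, sqrt

def radius_for_confidence(sigma, confidence):
    """Radius enclosing `confidence` of a circular 2-D Gaussian with per-axis sigma."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return sigma * sqrt(-2.0 * log(1.0 - confidence))

def confidence_for_radius(sigma, radius):
    """Probability that the true position lies within `radius` (same assumption)."""
    return 1.0 - exp(-0.5 * (radius / sigma) ** 2)

r_999 = radius_for_confidence(1.0, 0.999)   # radius, in units of sigma, for 99.9%
p_3sigma = confidence_for_radius(1.0, 3.0)  # confidence enclosed by a 3-sigma circle
```

Under this assumption a 3-sigma circle encloses somewhat less than 99% of the probability, which shows why an N-sigma value only has a definite meaning once an error distribution has been fixed.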

Other cross-match complications
A difference in the epochs of the catalogues means some objects will have moved. Apply proper motions?
– Need epoch metadata.
Different users will want different confidence levels and hence sizes of error regions.
– A large region may produce too many false positives.
– May need to adjust the confidence level in the light of experience.
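If proper motions are applied, each position has to be moved to a common epoch before matching. The sketch below is one simple way to do this, assuming a linear approximation, epochs in years, proper motions in milliarcseconds per year, and an RA proper motion that already includes the cos(dec) factor. The function name and parameter names are hypothetical, and the approximation breaks down close to the poles.

```python
from math import cos, radians

MAS_PER_DEG = 3.6e6  # milliarcseconds in one degree

def apply_proper_motion(ra_deg, dec_deg, pm_ra_cosdec_mas_yr, pm_dec_mas_yr,
                        epoch_from, epoch_to):
    """Move a position from epoch_from to epoch_to (linear approximation)."""
    dt = epoch_to - epoch_from  # elapsed time in years
    d_dec = pm_dec_mas_yr * dt / MAS_PER_DEG
    # Divide by cos(dec) to turn an on-sky displacement into a change in RA.
    d_ra = pm_ra_cosdec_mas_yr * dt / MAS_PER_DEG / cos(radians(dec_deg))
    return (ra_deg + d_ra) % 360.0, dec_deg + d_dec

new_ra, new_dec = apply_proper_motion(10.0, 60.0, 3600.0, -1800.0, 1000.0, 2000.0)
```

This is why epoch metadata is needed: without the two epochs the time interval cannot be computed.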

Current ADQL Syntax
…WHERE XMATCH(x, y, !z) > 3 AND …
Some problems:
– XMATCH is not quite a function: if a confidence level or N-sigma value is needed it should be one of the arguments.
– It is better to express matching probability as X% confidence than as N-sigma, as the former makes no assumptions about the functional form of the error distribution.
– There is no syntax for LEFT OUTER JOIN (return unmatched sources as well as matched ones).

Specifying a join in SQL
Two methods in the standards:
– SELECT * FROM t1, t2 WHERE t1.x = t2.y …
– SELECT * FROM t1 [LEFT OUTER] JOIN t2 ON t1.x = t2.y WHERE …
The latter form is more verbose but
– allows a number of different types of join,
– makes the join criterion explicit.
Propose that we use the latter, e.g. ON XMATCH(…)
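The two join forms can be compared on a toy example. The sketch below uses Python's built-in sqlite3 module with an ordinary equality join standing in for XMATCH, since no cross-match function is available there; the table contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (x INTEGER, name TEXT);
CREATE TABLE t2 (y INTEGER, flux REAL);
INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO t2 VALUES (1, 10.0), (3, 30.0);
""")

# Form 1: join criterion hidden among the WHERE conditions.
implicit = conn.execute(
    "SELECT t1.name, t2.flux FROM t1, t2 WHERE t1.x = t2.y ORDER BY t1.x"
).fetchall()

# Form 2: explicit JOIN ... ON, same result as form 1.
explicit = conn.execute(
    "SELECT t1.name, t2.flux FROM t1 JOIN t2 ON t1.x = t2.y ORDER BY t1.x"
).fetchall()

# Form 2 also allows other join types: unmatched rows of t1 come back with NULL.
left_outer = conn.execute(
    "SELECT t1.name, t2.flux FROM t1 LEFT OUTER JOIN t2 ON t1.x = t2.y ORDER BY t1.x"
).fetchall()
```

Only the join type changes between the last two queries, which is the flexibility the explicit form offers.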

Some cross-match implementations
Cross-match of two catalogues of ~N sources is an O(N²) operation unless some form of indexing/sorting is used, which can reduce it to O(N log N) or better.
Known algorithms:
– Join using a spatial index such as R-tree (Oracle, Sybase, MySQL, Postgres…) or Grid-file (DB2)
– Join using pixel code and B-tree (more complex and slower, but feasible with just about any DBMS)
– Sort/sweep algorithm of Dave Abel and colleagues.
Note: all these use bounding boxes drawn around the error ellipses (or whatever shape the error-region has). A refinement stage weeds out the false matches.
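The filter-then-refine pattern described above can be sketched with a simple uniform grid standing in for the spatial index. This is not the implementation used by any of the DBMSs named; it assumes a small flat patch of sky (no cos(dec) correction, no wrap-around at RA = 0) and a single circular match radius, and the function name is hypothetical.

```python
from collections import defaultdict
from math import floor, hypot

def grid_crossmatch(cat_a, cat_b, radius):
    """Match (id, x, y) sources of cat_a to cat_b within `radius`; returns id pairs."""
    cell = radius  # with cell size == radius, only the 3x3 neighbouring cells matter
    grid = defaultdict(list)
    for id_b, xb, yb in cat_b:  # build the index once, in O(N)
        grid[(floor(xb / cell), floor(yb / cell))].append((id_b, xb, yb))

    matches = []
    for id_a, xa, ya in cat_a:
        cx, cy = floor(xa / cell), floor(ya / cell)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for id_b, xb, yb in grid.get((cx + dx, cy + dy), ()):
                    # Filter stage: cheap bounding-box test.
                    if abs(xa - xb) <= radius and abs(ya - yb) <= radius:
                        # Refine stage: exact distance test removes false matches.
                        if hypot(xa - xb, ya - yb) <= radius:
                            matches.append((id_a, id_b))
    return sorted(matches)

cat_a = [(1, 0.0, 0.0), (2, 5.0, 5.0)]
cat_b = [(10, 0.05, 0.0), (11, 0.09, 0.09), (12, 9.0, 9.0)]
pairs = grid_crossmatch(cat_a, cat_b, 0.1)
```

In this example source 11 passes the bounding-box test against source 1 but is rejected by the refinement stage, because it lies in the corner of the box, outside the circle.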

Algorithm limitations
With most algorithms, if the size of the error region changes it is necessary to generate new indices, which is slow.
But: the hard part is done by the DBMS in reducing an O(N²) problem to one of O(N log N) or better.
Proposed solution:
– Always carry out the cross-match using the largest error region that is scientifically justifiable (99.9% or 3σ).
– The user can then refine the crude selection using the relatively small table of results, rejecting sources too far apart for the actual error-regions and confidence level.
– In this case the confidence level (or N-σ) value can be omitted from the XMATCH function and applied only in the refine stage.
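The refine stage of the proposed solution could look roughly like the sketch below. It assumes the crude cross-match has already produced a small table of candidate pairs with their separations and positional errors, and that errors are circular Gaussians combined in quadrature; both the table layout and the function name are hypothetical.

```python
from math import hypot, log, sqrt

def refine_matches(candidates, confidence):
    """Keep candidate pairs whose separation is consistent with `confidence`.

    Each candidate is (id_a, id_b, separation, sigma_a, sigma_b), all in the
    same angular units.
    """
    # Radius, in units of the combined sigma, enclosing `confidence`
    # of a circular 2-D Gaussian.
    k = sqrt(-2.0 * log(1.0 - confidence))
    kept = []
    for id_a, id_b, separation, sigma_a, sigma_b in candidates:
        combined_sigma = hypot(sigma_a, sigma_b)  # errors added in quadrature
        if separation <= k * combined_sigma:
            kept.append((id_a, id_b))
    return kept

# Candidate table as it might come out of a crude cross-match.
candidates = [("a1", "b1", 1.0, 1.0, 1.0), ("a2", "b2", 5.0, 1.0, 1.0)]
strict = refine_matches(candidates, 0.95)
loose = refine_matches(candidates, 0.999)
```

Because the candidate table is small, the user can rerun this step with different confidence levels without regenerating any index.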

Selecting sources which have no counterpart
Syntax is: XMATCH(x, !z)
A standard RDBMS can do this using
– LEFT OUTER JOIN of x and z
– INNER JOIN of x and z
– Take the difference between the results of the last two.
Or is there a simpler way?
I think that it is at least as important to have a defined syntax for LEFT OUTER JOIN, as knowing which sources have no counterpart is often scientifically important.
Propose plus symbol, e.g. XMATCH(x+, y)
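On the question of a simpler way: one common SQL idiom is a single LEFT OUTER JOIN that keeps only the rows where the right-hand side is NULL. The sketch below compares it with the difference method using Python's sqlite3 module. A precomputed table of matched pairs, here called xz_match, stands in for the spatial join, and all table names and contents are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE x (id INTEGER PRIMARY KEY);
CREATE TABLE xz_match (x_id INTEGER, z_id INTEGER);  -- hypothetical match table
INSERT INTO x VALUES (1), (2), (3), (4);
INSERT INTO xz_match VALUES (1, 101), (3, 103), (3, 104);
""")

# Difference method: (LEFT OUTER JOIN) minus (INNER JOIN).
by_difference = sorted(row[0] for row in conn.execute("""
    SELECT x.id FROM x LEFT OUTER JOIN xz_match m ON x.id = m.x_id
    EXCEPT
    SELECT x.id FROM x INNER JOIN xz_match m ON x.id = m.x_id
"""))

# Single query: unmatched rows are those where the joined side is NULL.
by_null_filter = sorted(row[0] for row in conn.execute("""
    SELECT x.id FROM x LEFT OUTER JOIN xz_match m ON x.id = m.x_id
    WHERE m.x_id IS NULL
"""))
```

Both queries return the sources of x with no counterpart; the second needs only one join.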

Cross-match can also find clusters of objects
Scientific examples: find clusters of stars, galaxies, objects affected by gravitational lensing.
Method: cross-match the catalogue with itself but with a much larger maximum offset than the error-regions.
Problem: this needs an index generated using bounding boxes much larger than the error-regions used for finding counterparts.
May need an additional function like XMATCH but with additional parameters, e.g. for maximum offset.
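A self-match with a maximum offset can be sketched as follows. The pairing step is a brute-force O(N²) loop for brevity, where a real implementation would use an index as discussed earlier. Linking the pairs into groups with a union-find structure (a "friends-of-friends" grouping) is an assumption of this example, not something the slides specify, and flat-sky coordinates are assumed.

```python
from itertools import combinations
from math import hypot

def self_match_clusters(sources, max_offset):
    """Group (id, x, y) sources linked by pairs closer than `max_offset`."""
    parent = {sid: sid for sid, _, _ in sources}

    def find(s):
        # Follow parent links to the group representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Self cross-match: every distinct pair is tested once, so a source is
    # never matched with itself and no pair is counted twice.
    for (ia, xa, ya), (ib, xb, yb) in combinations(sources, 2):
        if hypot(xa - xb, ya - yb) <= max_offset:
            parent[find(ia)] = find(ib)

    groups = {}
    for sid, _, _ in sources:
        groups.setdefault(find(sid), []).append(sid)
    # Isolated sources are not clusters, so groups of one are dropped.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

sources = [(1, 0.0, 0.0), (2, 0.5, 0.0), (3, 1.0, 0.0),
           (4, 10.0, 10.0), (5, 10.4, 10.0), (6, 50.0, 50.0)]
clusters = self_match_clusters(sources, 0.6)
```

Sources 1 and 3 are farther apart than the maximum offset but end up in the same group because each is linked to source 2.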