Algorithms for Large Sets of Points Dave Abel CSIRO ICT Centre.

Who are we?  CSIRO: Australia’s govt-funded R&D agency. About 6500 staff.  ICT Centre: the ICT unit. About 170 staff.  eScience: data grids, workflow. About 6 people.

Why choose this topic?  Solution times of days were a challenge.  Joins are clearly a core problem in data integration;  ‘Must solve’ to tackle data fusion in collections of databases contributed by members of a community of interest.

Roadmap  2 case studies:  Catalogue matching;  Neighbour finding.  Work in progress:  More point operations;  Searching unstructured collections.

Not the same problem …  Catalogue Matching  Determine equivalent pairs of bodies in two catalogues, on the basis of imprecise locations;  Not quite a spatial join.  Neighbour Finding  Determine, within a data set, the pairs of objects whose asserted positions are less than n arcseconds apart;  Aka fixed-radius all-neighbors problem.

Basis of Algorithm  Deal with points from the ‘Test’ set in ascending sequence by declination;  Maintain an ‘active list’ of points from the other set that could possibly match the current point from Test or points still to come;  Event is maintenance of the active list, and searching within it for a match.
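The sweep described on this slide can be sketched as follows. This is a minimal illustration, not the author's code: the function names, the `(ra, dec)` tuple layout in degrees, and the single fixed matching radius `radius_deg` are all assumptions made for the example.

```python
from collections import deque
from math import radians, degrees, sin, cos, asin, sqrt

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(radians, (ra1, dec1, ra2, dec2))
    h = sin((d2 - d1) / 2) ** 2 + cos(d1) * cos(d2) * sin((r2 - r1) / 2) ** 2
    return degrees(2 * asin(min(1.0, sqrt(h))))

def sweep_match(test, other, radius_deg):
    """Match (ra, dec) points of `test` against `other` by a declination sweep.

    Hypothetical helper: assumes one fixed radius rather than per-object
    imprecisions.
    """
    test = sorted(test, key=lambda p: p[1])    # ascending declination
    other = sorted(other, key=lambda p: p[1])
    active = deque()                           # the 'active list'
    nxt = 0                                    # next point of `other` to admit
    matches = []
    for t in test:
        # Admit points that could match the current test point.
        while nxt < len(other) and other[nxt][1] <= t[1] + radius_deg:
            active.append(other[nxt])
            nxt += 1
        # Retire points too far south to match this or any later test point.
        while active and active[0][1] < t[1] - radius_deg:
            active.popleft()
        # Search the active list for matches.
        for a in active:
            if angular_sep_deg(t[0], t[1], a[0], a[1]) <= radius_deg:
                matches.append((t, a))
    return matches
```

Because the test points arrive in ascending declination, a point dropped from the head of the deque can never be needed again, so each point of the other set enters and leaves the active list exactly once.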

Basis for Algorithms  Filter and Refine  But dec and ra in turn;  Plane Sweep (on the Sphere)  Structure algorithm as processing for regular events;  Force regular events by processing in order by declination.

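The Active List  Figure labels: Active Set; Test Set; T[dec]; lower window bound $T[\mathrm{dec}] - 1.96\sqrt{\max\sigma_a^2 + \max\sigma_t^2}$; and a second bound combining T[dec] with $(\max\sigma_a^2 + \sigma_t^2)$, whose operator and coefficient did not survive extraction.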
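As a hedged reading of the window on this slide: the lower bound appears to be T[dec] minus 1.96 times the square root of the sum of the largest squared imprecisions in the two sets. The square root and the sigma symbols are assumed here, since those glyphs are missing from the transcript, and the function and parameter names are hypothetical.

```python
from math import sqrt

def dec_half_width(sigmas_active, sigmas_test, z=1.96):
    """Half-width of the declination window, in the units of the sigmas.

    Assumed form: z * sqrt(max(sigma_a)^2 + max(sigma_t)^2), where z = 1.96
    is the coefficient shown on the slide.
    """
    return z * sqrt(max(sigmas_active) ** 2 + max(sigmas_test) ** 2)

def dec_window(t_dec, sigmas_active, sigmas_test, z=1.96):
    """(low, high) declination bounds around the current test point."""
    w = dec_half_width(sigmas_active, sigmas_test, z)
    return t_dec - w, t_dec + w
```

Using the maxima over each whole set keeps the window conservative: no pair is excluded on declination that could still pass a per-object test later. The symmetric upper bound is an assumption of this sketch.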

Maintaining The Active List  Figure labels: Active Set; Test Set; previous; current.

Testing Matches  The update discipline gives a fair first-pass test on declination;  Test members against current point in terms of range of right ascension;  The list (double-ended queue) is typically small:  ‘Scan all’ is good, usually;  Apply a binary tree as an index on ra for large lists (0.5B x 0.5B matches, or high imprecisions).  Then test by angular distance.
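The right-ascension test can be sketched as a 'scan all' filter over the active list. The names and the single radius `radius_deg` are assumptions for illustration; the half-width uses the standard bound for the RA extent of a small circle on the sphere, asin(sin r / cos dec), and the binary-tree index mentioned for large lists is not shown.

```python
from math import radians, degrees, sin, cos, asin

def ra_diff_deg(ra1, ra2):
    """Smallest absolute difference between two right ascensions, in degrees."""
    d = abs(ra1 - ra2) % 360.0
    return min(d, 360.0 - d)

def ra_candidates(active, t_ra, t_dec, radius_deg):
    """Members of `active` ((ra, dec) tuples) passing the RA range test.

    Hypothetical helper: survivors still need the angular-distance test.
    """
    if abs(t_dec) + radius_deg >= 90.0:
        # The search circle reaches a pole, so RA cannot exclude anything.
        return list(active)
    # Widest RA offset at which a point can lie within radius_deg.
    half = degrees(asin(sin(radians(radius_deg)) / cos(radians(t_dec))))
    return [a for a in active if ra_diff_deg(a[0], t_ra) <= half]
```

The wrap-around in `ra_diff_deg` matters for points either side of 0 degrees RA, and the window widens with declination because meridians converge toward the poles.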

Some empirical tests

For the sake of completeness …  Also report the unmatched objects from the two sets;  Allow object-by-object imprecisions;  Test by confidence levels on the angular separation by reference to the imprecisions of the two objects. But all of these are easily parameterised.
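One way to read the confidence-level test is a threshold on the angular separation scaled by the combined imprecision of the two objects. The form below, including the default `z = 1.96`, is an assumption modelled on the active-list window, and both helper names are hypothetical.

```python
from math import sqrt

def within_confidence(sep, sigma_a, sigma_t, z=1.96):
    """True if separation `sep` is within z combined standard deviations.

    `sep`, `sigma_a` and `sigma_t` must share one unit (e.g. arcseconds).
    """
    return sep <= z * sqrt(sigma_a ** 2 + sigma_t ** 2)

def unmatched(ids, matched_pairs, side):
    """Ids that never appear at position `side` (0 or 1) of a matched pair."""
    seen = {pair[side] for pair in matched_pairs}
    return [i for i in ids if i not in seen]
```

Here `z` is the parameter that sets the confidence level, and calling `unmatched` once per catalogue gives the two lists of unmatched objects.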

Neighbour finding  Same approach:  Current is tail of the active list;  Test by ra to generate candidates;  Test candidates in terms of angular separation.  Figure labels: T[dec]; T[dec] + X.
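A minimal sketch of the same sweep applied within a single data set, assuming `(ra, dec)` tuples in degrees and a radius `x_deg`. The names are hypothetical, and this variant looks back from the current point over a window of width X, which is one of several equivalent ways to arrange the sweep. The RA pre-filter is omitted for brevity.

```python
from collections import deque
from math import radians, degrees, sin, cos, asin, sqrt

def sep_deg(p, q):
    """Great-circle separation in degrees between two (ra, dec) points."""
    r1, d1, r2, d2 = map(radians, (p[0], p[1], q[0], q[1]))
    h = sin((d2 - d1) / 2) ** 2 + cos(d1) * cos(d2) * sin((r2 - r1) / 2) ** 2
    return degrees(2 * asin(min(1.0, sqrt(h))))

def neighbour_pairs(points, x_deg):
    """All pairs of points less than x_deg apart, each reported once."""
    pts = sorted(points, key=lambda p: p[1])   # ascending declination
    active = deque()
    pairs = []
    for cur in pts:
        # Drop points more than x_deg south of the current point.
        while active and active[0][1] < cur[1] - x_deg:
            active.popleft()
        # Compare the current point with everything still active.
        for a in active:
            if sep_deg(a, cur) < x_deg:
                pairs.append((a, cur))
        active.append(cur)                     # current becomes the tail
    return pairs
```

Each point is compared only with earlier points that are still active, so every qualifying pair is emitted exactly once.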

Some Preliminary Results

Work in progress  Implementations  Implement as a db stored procedure?  How best to implant in distributed db?  Where will it work?  Essentially the approach is to localise operations within a batched problem;  Exploits, and is dependent on, some domain-specific aspects;  Evaluation of local clusters should fall into this class; 1 is lucky, 2 is intriguing, 3 says there could be something worthwhile.

Other Work in Progress  Data mining, as searching for bodies with a certain signature;  Can we trawl an unstructured collection of VO data nodes?  Without a global schema?  What provisions for inconsistency?  Encouraging sound science?  Performance?

More Information Code and reports available real soon now.