Spatial Issues in DBGlobe
Dieter Pfoser
Location Parameter in Services
Entering the harbor (x, y position) … triggers an information request
Spatial Data in DBGlobe
• Spatial information might be the predominant type of data for structuring information content
• PMOs contain spatially (and temporally) referenced data
• These data are distributed over a set of devices
• How can we relate all these data to one spatial location? “What have we stored for this location?”
• This introduces space as the organizing criterion for data, i.e., as a distinguished context
Spatial Data… (cont’d)
• Each PMO contains a set of positions that reference content
• The job of DBGlobe is to find this content based on a given positional reference:
  Position → {PMO (id)} → content
• BUT!
  – Content is referenced with position as the only argument!
  – The question is how to introduce further filters that retrieve only relevant (interesting) content based on additional parameters.
Distributed Indexes
• With tree-based structures, a global index needs to be constructed and some portion of the index replicated in the CAS
• Alternatively, given the set of locations for each PMO, one could compute a signature that the PMO communicates to a CAS (where it is aggregated further)
• This signature can then be used to screen all PMOs for potentially relevant spatial information
Bloom Filters: High Level Idea
• Everyone thinks they need to know exactly what everyone else has: “Give me a list of what you have.”
• Lists are long and unwieldy.
• Using Bloom filters, you can get small, approximate lists: “Give me information so I can figure out what you have.”
Bloom Filter Example
[Figure: bit vector and hash functions]
A Bloom filter: to check an object’s name against a Bloom filter summary, the name is hashed with n different hash functions (here, n = 3) and the bits corresponding to the results are checked.
Bloom Filter
• Multiple hash functions are used to map values onto a bit vector
• Example: Web proxy cache sharing
  – URLs are hashed using MD5, a cryptographic message digest algorithm that hashes arbitrary-length strings to 128 bits
  – The hash functions are built by
    · first calculating the MD5 signature of a URL (128 bits),
    · dividing the 128 bits into four 32-bit words, and finally
    · taking the modulus of each 32-bit word by the table size
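The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the proxy-cache implementation itself: the names `md5_positions` and `BloomFilter`, the big-endian byte order, and the table size of 1024 are all assumptions made for the example.

```python
import hashlib


def md5_positions(key: str, table_size: int) -> list[int]:
    """Derive four bit positions from the 128-bit MD5 digest of key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()  # 16 bytes = 128 bits
    # Split the digest into four 32-bit words (byte order is an assumption).
    words = [int.from_bytes(digest[i:i + 4], "big") for i in range(0, 16, 4)]
    # Reduce each word modulo the table size to get a bit position.
    return [word % table_size for word in words]


class BloomFilter:
    """Bit vector plus the four MD5-derived hash functions."""

    def __init__(self, table_size: int) -> None:
        self.table_size = table_size
        self.bits = [False] * table_size

    def add(self, key: str) -> None:
        for pos in md5_positions(key, self.table_size):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False is always correct; True may be a false positive.
        return all(self.bits[pos] for pos in md5_positions(key, self.table_size))


cache = BloomFilter(table_size=1024)
cache.add("http://example.org/index.html")
print(cache.might_contain("http://example.org/index.html"))  # True
```

An added key is always reported as present. A key that was never added is usually reported as absent, but not always, which is the price of the compact summary.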
Spatial Hashing
• Alphanumeric hashing: string → hash value
• Spatial coordinates as strings?
  – (Long/Lat) deg. East, deg. North
  – equal to
  – deg. East, deg. North ???
• If the two coordinate pairs are hashed as strings, their hash values will not match (they will be totally different, given a good hash function such as MD5)
• Spatial data differ from alphanumeric data, since their semantics have to be seen in the context of a reference system
• When matching hash values, a tolerance is needed to test for equality
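A small sketch of the mismatch described here. The two coordinate pairs are hypothetical values chosen for illustration (the slide's own figures are not reproduced), and `md5_of`, `same_place` and the tolerance of 0.001 degrees are likewise assumptions.

```python
import hashlib
import math

# Hypothetical (longitude, latitude) pairs about ten metres apart.
a = (23.6431, 37.9420)
b = (23.6432, 37.9420)


def md5_of(position: tuple[float, float]) -> str:
    """Hash the textual form of a coordinate pair."""
    text = f"{position[0]:.4f},{position[1]:.4f}"
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def same_place(p: tuple[float, float], q: tuple[float, float],
               tol_deg: float = 0.001) -> bool:
    """Compare two positions with an absolute tolerance in degrees."""
    return (math.isclose(p[0], q[0], abs_tol=tol_deg)
            and math.isclose(p[1], q[1], abs_tol=tol_deg))


print(md5_of(a) == md5_of(b))  # False: the string hashes do not match
print(same_place(a, b))        # True: equal within the tolerance
```

The string hash treats the two positions as unrelated, while a comparison with tolerance treats them as the same place. The difficulty is that a tolerance test cannot be applied to hash values directly.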
Spatial Subdivisions
• Regular subdivisions
• Occupation-based, e.g., adaptive k-d-tree
Spatial Subdivision
[Figure: Earthquakes]
• Computing spatial subdivisions of space based on existing data
Spatial Hashing
• Linearize the spatial subdivisions using space-filling curves
• Space-filling curves as hash functions
  – Z-ordering (Peano curves)
  – Hilbert curves
  – …
• Example: hashing positions using the above space-filling curves
  – Determine the spatial subdivision the position falls into
  – Compute the respective linearization value for each of the space-filling curves (hash functions)
  – Take the modulus of each value by the size of the bit vector
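The three steps of the example can be sketched as follows. This is only an illustration under stated assumptions: the subdivision is taken to be a regular 2^bits × 2^bits grid over longitude/latitude (the slides also allow occupation-based subdivisions), and the function names, the default of 10 bits and the vector size are hypothetical. The Hilbert index uses the common iterative coordinate-to-index conversion.

```python
def z_order(ix: int, iy: int, bits: int) -> int:
    """Z-order value: interleave the bits of the two cell indices."""
    z = 0
    for i in range(bits):
        z |= ((ix >> i) & 1) << (2 * i)
        z |= ((iy >> i) & 1) << (2 * i + 1)
    return z


def hilbert(ix: int, iy: int, bits: int) -> int:
    """Hilbert-curve index of a cell in a 2**bits x 2**bits grid."""
    n = 1 << bits
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if ix & s else 0
        ry = 1 if iy & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                ix, iy = n - 1 - ix, n - 1 - iy
            ix, iy = iy, ix
        s //= 2
    return d


def grid_cell(lon: float, lat: float, bits: int) -> tuple[int, int]:
    """Step 1: find the cell of a regular grid the position falls into."""
    n = 1 << bits
    ix = min(int((lon + 180.0) / 360.0 * n), n - 1)
    iy = min(int((lat + 90.0) / 180.0 * n), n - 1)
    return ix, iy


def spatial_hashes(lon: float, lat: float, vector_size: int,
                   bits: int = 10) -> list[int]:
    """Steps 2 and 3: linearize the cell with each curve, then reduce."""
    ix, iy = grid_cell(lon, lat, bits)
    return [z_order(ix, iy, bits) % vector_size,
            hilbert(ix, iy, bits) % vector_size]


# Two hypothetical nearby positions that fall into the same grid cell.
print(spatial_hashes(23.6431, 37.9420, vector_size=256)
      == spatial_hashes(23.6432, 37.9420, vector_size=256))  # True
```

Positions in the same cell set the same bits, so the cell size acts as the tolerance. Two nearby positions on opposite sides of a cell boundary still hash differently.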
Overall Scenario
• PMOs containing spatial data communicate signatures to the CAS
• The CAS “ORs” the signatures and keeps track of the associations
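One way to picture this scenario in code, assuming signatures are bit vectors stored as Python integers. The class name `CAS`, its methods, and the PMO identifiers are hypothetical and serve only to illustrate the OR-aggregation.

```python
class CAS:
    """Keeps one signature per PMO plus the OR of all signatures."""

    def __init__(self) -> None:
        self.signatures: dict[str, int] = {}  # PMO id -> signature
        self.aggregate = 0                    # OR of all signatures

    def register(self, pmo_id: str, signature: int) -> None:
        # Sketch only: bits of a replaced signature are not cleared.
        self.signatures[pmo_id] = signature
        self.aggregate |= signature

    def candidates(self, probe: int) -> list[str]:
        """PMOs whose signature has every bit of the probe set."""
        if self.aggregate & probe != probe:
            return []  # no PMO behind this CAS can match
        return sorted(pmo_id for pmo_id, sig in self.signatures.items()
                      if sig & probe == probe)


cas = CAS()
cas.register("pmo-1", 0b0101)
cas.register("pmo-2", 0b0110)
print(bin(cas.aggregate))      # 0b111
print(cas.candidates(0b0100))  # ['pmo-1', 'pmo-2']
print(cas.candidates(0b0011))  # []
```

The last query passes the aggregate test but matches no single PMO. The ORed signature is thus a coarser filter than the individual ones, which is why the CAS also keeps the per-PMO associations.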
Questions
• Types of queries, e.g., range queries vs. “point” queries
• Spatial hash functions using grids and space-filling curves
• Is spatial data a distinct type of data that deserves special treatment?
• Can position stand as a single query parameter, or does it need more context?
END
Lookup Problem
• Given a set S = {x_1, x_2, x_3, …, x_n} on a universe U, we want to answer queries of the form: is y ∈ S?
• Example: a set of URLs from the universe of all possible URL strings.
• A Bloom filter provides an answer
  – in “constant” time (the time to hash),
  – using a small amount of space,
  – but with some probability of being wrong.
Optimal Choice of Parameters
• Given m bits for the filter and n elements, choose the number k of hash functions
• The optimum, found by calculus, is k = (ln 2) · m/n
• Example: for m/n = 8, the optimal k = 8 ln 2 ≈ 5.545
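The numbers on this slide can be checked with a short calculation. The sketch below relies on the standard approximation for the false-positive rate, (1 − e^(−kn/m))^k, which is not stated on the slide; the function names are hypothetical.

```python
import math


def optimal_k(m: int, n: int) -> float:
    """Number of hash functions that minimizes the false-positive rate."""
    return math.log(2) * m / n


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation for a Bloom filter with m bits, n elements."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n = 8000, 1000  # m/n = 8 bits per element
print(round(optimal_k(m, n), 3))  # 5.545
for k in (5, 6):                  # k must be an integer in practice
    print(k, false_positive_rate(m, n, k))
```

Since k has to be an integer, one would pick 5 or 6 here. Both give a false-positive rate of roughly 2%.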
Spatial Subdivision