Spatial in Lucene and Solr David Smiley Lucene/Solr search developer / consultant 2016-05 at Harvard CGA.

Presentation transcript:

Spatial in Lucene and Solr David Smiley Lucene/Solr search developer / consultant at Harvard CGA

About David Smiley
- Software engineer (16 years); search (7 years)
- Java (full-stack), Web, Spatial
- Freelance search consultant / developer
- Expert Lucene/Solr search advice / training
- Expert Lucene/Solr development skills
- Apache Lucene / Solr committer & PMC; Eclipse LocationTech PMC
- Authored 1st book on Solr, updated twice
- Presents at conferences & meetups
- Taught several Solr classes, self-developed & LucidWorks

Agenda
- Search background
- Spatial in Solr: features, how-to
- Recent Lucene developments, future

Search Technology
Some major features…
- Keyword search
- Text analysis: stemming, synonyms, tokenization, phonetics
- Relevance ordering
- Query completion (find-as-you-type)
- Query did-you-mean
- Highlighted snippets
- Faceting, for navigation & analytics
- Result clustering
- Query operators like fuzzy match and the “near” operator

Faceted Navigation & Analytics
by example…
- Notice the counts
- Optionally start with a keyword search or filter
- Extremely useful feature supported by very few platforms: Solr, ElasticSearch, Sphinx, … (no DBs)

Search Platforms
A NoSQL solution of the search variety
A search platform has search features plus others like:
- A query language: Boolean logic, numerics & dates, regexp, standard sorting
- Joins & grouping
- Configuration
- Horizontal scaling options
- Administration tools, incl. a UI
Note: Crawlers (web/file/content-repository) are sometimes separate

Apache Lucene & Solr
Lucene:
- Provides most of the “technology” behind search, plus some non-search but important capabilities (e.g. dates & numbers)
- But it’s just a toolkit/library/framework
Solr & ElasticSearch:
- Add everything else needed to have a search platform / server / NoSQL solution
- Add some more of their own search technology too (Lucene & ElasticSearch too)

Spatial in Solr

Geospatial Features
Lucene/Solr can index text, numbers, dates, and spatial data. Features:
- Index latitude & longitude coordinates or any X Y pairs
- Index polygons or other geometry
- Query by point-radius, rectangle, polygon, or other geometry, including “Within” vs “Intersects” vs “Contains” predicates
- 2D/flat Euclidean OR geodetic spherical world model
- Sort or relevancy-boost by distance to indexed points
- Heatmaps: spatial grid faceting
- GeoJSON & WKT formats
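To make the “Intersects” vs “Within” vs “Contains” distinction concrete, here is a minimal sketch on axis-aligned rectangles in a flat 2D model. The `Rect` class and the function names are hypothetical illustrations, not Solr or Lucene API, and the sketch assumes each predicate is read as "indexed shape <predicate> query shape", with no dateline handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in a flat X/Y model (hypothetical helper type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(indexed: Rect, query: Rect) -> bool:
    # True when the two rectangles share at least one point.
    return (indexed.min_x <= query.max_x and query.min_x <= indexed.max_x
            and indexed.min_y <= query.max_y and query.min_y <= indexed.max_y)


def within(indexed: Rect, query: Rect) -> bool:
    # True when the indexed rectangle lies entirely inside the query rectangle.
    return (query.min_x <= indexed.min_x and indexed.max_x <= query.max_x
            and query.min_y <= indexed.min_y and indexed.max_y <= query.max_y)


def contains(indexed: Rect, query: Rect) -> bool:
    # True when the indexed rectangle entirely covers the query rectangle.
    return within(query, indexed)


def disjoint(indexed: Rect, query: Rect) -> bool:
    # True when the rectangles share no points at all.
    return not intersects(indexed, query)


small = Rect(0, 0, 1, 1)
big = Rect(-5, -5, 5, 5)
far = Rect(10, 10, 11, 11)
```

With these sample shapes, `small` is within `big`, `big` contains `small`, both pairs intersect, and `far` is disjoint from the other two.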

Big Picture
- Different spatial field types to choose from
  - Vary in what features they support
  - Syntax can vary too
  - Vary in performance for different features
- Shapes (AKA geometry):
  - Index a shape: put it in a document’s field
  - Query by another shape
  - The default relation predicate is “intersects”
- Spatial code lives in 4 places: Solr, Lucene (several modules), Spatial4j, JTS

How-to: Index Points (LatLonType)
Configuration: schema.xml:
Index a point (JavaScript syntax, “lat,lon” format):
  {"id":"1", "point":"45.15,-93.85"}
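A minimal sketch of producing such a document from application code, assuming the field is named `point` as on the slide. The helper `point_doc` is hypothetical; it only builds the JSON body and does not talk to Solr.

```python
import json


def point_doc(doc_id: str, lat: float, lon: float, field: str = "point") -> str:
    """Build a JSON document whose spatial field uses the "lat,lon" string format."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    # Latitude comes first in this format, unlike WKT, which is "x y" (lon lat).
    return json.dumps({"id": doc_id, field: f"{lat},{lon}"})


doc = point_doc("1", 45.15, -93.85)
```

Validating the ranges up front catches the common mistake of swapping latitude and longitude, at least when the longitude falls outside ±90.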

How-to: Index Polygons (RPT Type)
Configuration: schema.xml:
Index a polygon (JavaScript syntax around WKT):
  {"id":"1", "geo_rpt": "POLYGON((30 10, 10 20, 20 40, 40 40, 30 10))"}
or any supported shape, even just points
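The WKT string above can be generated from a list of coordinates. The sketch below uses a hypothetical helper, `polygon_wkt`, and assumes a single outer ring with no holes; it closes the ring automatically because a WKT polygon ring must end on its starting point.

```python
def polygon_wkt(ring):
    """Format one outer ring of (x, y) pairs, i.e. (lon, lat), as a WKT POLYGON."""
    points = list(ring)
    if len(points) < 3:
        raise ValueError("a polygon ring needs at least 3 distinct points")
    # WKT rings are closed: repeat the first point at the end if needed.
    if points[0] != points[-1]:
        points.append(points[0])
    coords = ", ".join(f"{x} {y}" for x, y in points)
    return f"POLYGON(({coords}))"


wkt = polygon_wkt([(30, 10), (10, 20), (20, 40), (40, 40)])
```

Note the axis order: WKT is "x y", so longitude comes before latitude, the reverse of the "lat,lon" point format.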

How-to: Search/filter
Search for documents intersecting a 5 kilometer circle at 45.15, -93.85:
  fq={!geofilt}&sfield=geo_rpt&pt=45.15,-93.85&d=5
Search for documents intersecting a lat-lon box (range query style):
  fq=geo_rpt:[-90,-180 TO 90,180]
Search for documents intersecting a polygon (WKT syntax):
  fq=geo_rpt:"Intersects(POLYGON((-10 30, , , 40 20, 0 0, ))) distErrPct=0"
Predicates: Intersects, Within, Contains, Disjoint
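When these filters are sent over HTTP, the braces, colons, and brackets need URL encoding. Here is a minimal sketch using only the Python standard library. The helper names are hypothetical, and `q=*:*` (match all documents) is an assumption added for illustration; the field name `geo_rpt` comes from the slide.

```python
from urllib.parse import urlencode


def geofilt_params(field: str, lat: float, lon: float, km: float) -> dict:
    """Parameters for a point-radius filter, mirroring the slide's geofilt example."""
    return {"fq": "{!geofilt}", "sfield": field, "pt": f"{lat},{lon}", "d": str(km)}


def box_fq(field: str, min_lat, min_lon, max_lat, max_lon) -> str:
    """Range-query style box filter: lower-left corner TO upper-right corner."""
    return f"{field}:[{min_lat},{min_lon} TO {max_lat},{max_lon}]"


# urlencode percent-encodes characters such as '{', '!', ':' and ','.
query_string = urlencode({"q": "*:*", **geofilt_params("geo_rpt", 45.15, -93.85, 5)})
world_box = box_fq("geo_rpt", -90, -180, 90, 180)
```

The resulting `query_string` can be appended to a `/select?` URL on whichever host and collection you use.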

GeoJSON examples (Solr 6.1)
Schema:
Index by GeoJSON (literal):
  {"type":"Point","coordinates":[1,2]}
Search by GeoJSON, return GeoJSON:
  /select?q={!field f=geo_rpt}Intersects({"type":"Point","coordinates":[1,2]})
    &wt=geojson&geojson.field=geo_rpt
  {"response":{"type":"FeatureCollection",
    "numFound":1,"start":0,"features":[
      {"type":"Feature",
       "geometry":{"type":"Point","coordinates":[1,2]},
       "properties":{... the normal solr doc fields here... }}]
  }}
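A minimal sketch of reading such a response on the client side, assuming the response shape shown on the slide. The `sample` text is hand-built for illustration (the `properties` content is made up), and the helper names are hypothetical.

```python
import json


def geojson_point(x: float, y: float) -> dict:
    """GeoJSON point; coordinates are [x, y], i.e. [lon, lat]."""
    return {"type": "Point", "coordinates": [x, y]}


def feature_geometries(response_text: str) -> list:
    """Pull the geometry of every feature out of a wt=geojson style response."""
    body = json.loads(response_text)
    return [feature["geometry"] for feature in body["response"]["features"]]


# Hand-built sample mirroring the response structure on the slide.
sample = json.dumps({
    "response": {
        "type": "FeatureCollection",
        "numFound": 1,
        "start": 0,
        "features": [{
            "type": "Feature",
            "geometry": geojson_point(1, 2),
            "properties": {"id": "1"},  # stand-in for the normal doc fields
        }],
    }
})

geometries = feature_geometries(sample)
```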

How-to: Distance Sort / Boost
Sort with geodist():
  &sort=geodist() asc
  &pt=45.15,
  &sfield=myField
Relevancy boost (this example is RPT only; alternatives exist for LatLonType):
  &defType=edismax
  &boost=query($mysq)
  &mysq={!geofilt filter=false score=recipDistance pt=45.15, d=5}
  &sfield=geo_rpt
Note: points only
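To illustrate what sorting by distance means on a spherical model, here is a sketch of a great-circle distance using the haversine formula, followed by a client-side sort. This is an illustration only, not Solr's implementation of `geodist()`; the earth radius constant and the sample documents are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def sort_by_distance(docs, lat: float, lon: float):
    """Order (id, lat, lon) tuples by ascending distance from the given point."""
    return sorted(docs, key=lambda d: haversine_km(lat, lon, d[1], d[2]))


# Made-up sample documents for illustration.
docs = [("far", 10.0, 10.0), ("near", 45.2, -93.9), ("mid", 44.0, -90.0)]
ordered = [d[0] for d in sort_by_distance(docs, 45.15, -93.85)]
```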

How-to: Index Rects (BBoxField)
Configuration: schema.xml
Index a rectangle (JavaScript syntax around WKT):
  {"id":"1", "bbox":"ENVELOPE(-10, 20, 15, 10)"}
Note: minX, maxX, maxY, minY order
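The ENVELOPE argument order is easy to get wrong, so a small formatting helper can help. The sketch below takes the more common (min_x, min_y, max_x, max_y) bounding-box order and emits the order noted on the slide; `envelope_wkt` is a hypothetical name.

```python
def envelope_wkt(min_x: float, min_y: float, max_x: float, max_y: float) -> str:
    """Format a bounding box as ENVELOPE(minX, maxX, maxY, minY)."""
    if min_y > max_y:
        raise ValueError("min_y must not exceed max_y")
    # X values are deliberately not validated against each other here.
    return f"ENVELOPE({min_x}, {max_x}, {max_y}, {min_y})"


bbox = envelope_wkt(-10, 10, 20, 15)
```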

How-to: Filter and sort by overlap
Use this syntax:
  &q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))
BBoxField has more precision than the RPT field and supports more predicates (e.g. Equals)
Note: BBoxField only

Heatmaps: Spatial Grid Faceting
- Spatial density summary grid faceting, also useful for point-plotting search results
- Lucene & Solr APIs
- Scalable & fast, usually…
- Usually rendered with a gradient radius
See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html

How-to: Heatmaps
- On an RPT field
- Might customize prefixTree & worldBounds
Query:
  /select?facet=true
    &facet.heatmap=geo_rpt
    &facet.heatmap.geom=["-180 -90" TO "180 90"]
    &facet.heatmap.format=ints2D or png
  // Normal Solr response...
  "facet_counts":{
    ... // facet response fields
    "facet_heatmaps":{
      "geo_rpt":[
        "gridLevel",2,
        "columns",32,
        "rows",32,
        "minX",-180.0,
        "maxX",180.0,
        "minY",-90.0,
        "maxY",90.0,
        "counts_ints2D", [null, null, [0, 1,... ]]
        ...
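A minimal sketch of turning the `ints2D` structure into plottable cell centers. It assumes the flat key/value list layout shown in the response above, that a `null` row means "all zeros", and that row 0 is the northernmost row; treat the row orientation as an assumption to verify against your Solr version. `heatmap_cells` is a hypothetical helper and the sample grid is made up.

```python
def heatmap_cells(heatmap: list) -> list:
    """Return (center_x, center_y, count) for every non-empty grid cell."""
    # The response is a flat [key, value, key, value, ...] list.
    info = dict(zip(heatmap[0::2], heatmap[1::2]))
    cell_w = (info["maxX"] - info["minX"]) / info["columns"]
    cell_h = (info["maxY"] - info["minY"]) / info["rows"]
    cells = []
    for r, row in enumerate(info["counts_ints2D"] or []):
        if row is None:  # a null row carries no counts
            continue
        for c, count in enumerate(row):
            if count:
                x = info["minX"] + (c + 0.5) * cell_w
                y = info["maxY"] - (r + 0.5) * cell_h  # assumes row 0 is the top
                cells.append((x, y, count))
    return cells


# Made-up 4x2 grid for illustration.
sample = ["gridLevel", 2, "columns", 4, "rows", 2,
          "minX", -180.0, "maxX", 180.0, "minY", -90.0, "maxY", 90.0,
          "counts_ints2D", [None, [0, 1, 0, 3]]]
cells = heatmap_cells(sample)
```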

New in Lucene Spatial (2015 and 2016; the parts that aren’t in Solr yet)

Geo3D: Shapes on a Sphere
… or an ellipsoid of configurable axes
- Not a general 3D space geometry lib
- Internally uses geocentric X, Y, Z coordinates (hence 3D) with 3D planar geometry mathematics
- Shapes: Point, Lat-Lon Rect, Circle, Polygons, Path (LineString) with optional buffer
- Distance computations: Arc (angular or surface), Linear (straight-line), Normal
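The idea of geocentric X, Y, Z coordinates can be sketched on a unit sphere as follows. This is not the Geo3D API, only an illustration of the underlying math, and it ignores the ellipsoid case. All names are hypothetical.

```python
import math


def to_unit_xyz(lat_deg: float, lon_deg: float) -> tuple:
    """Convert lat/lon in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def dot(p: tuple, q: tuple) -> float:
    return sum(a * b for a, b in zip(p, q))


def arc_angle(p: tuple, q: tuple) -> float:
    """Angular (arc) distance in radians between two unit vectors."""
    return math.acos(max(-1.0, min(1.0, dot(p, q))))


def within_circle(point: tuple, center: tuple, radius_rad: float) -> bool:
    # A point is inside the circle when its dot product with the center is at
    # least cos(radius); per point this costs only multiplies and adds.
    return dot(point, center) >= math.cos(radius_rad)


center = to_unit_xyz(45.15, -93.85)
nearby = to_unit_xyz(45.2, -93.85)
```

Once points are stored as X, Y, Z, membership tests like `within_circle` reduce to a dot product and a comparison, and `cos(radius)` can be computed once per query.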

2D Maps Distort Straight Lines
A straight bird-flies path from Anchorage to Miami doesn’t actually cross the ocean!
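The distortion can be checked numerically by comparing the naive midpoint (averaging latitude and longitude, as a straight line on a flat lat/lon map would) with the great-circle midpoint. The city coordinates below are approximate values supplied for illustration, and the sketch assumes a spherical earth.

```python
import math


def great_circle_midpoint(lat1: float, lon1: float, lat2: float, lon2: float) -> tuple:
    """Midpoint of the shortest path on a sphere; undefined for antipodal points."""
    def xyz(lat_deg, lon_deg):
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        return (math.cos(lat) * math.cos(lon),
                math.cos(lat) * math.sin(lon),
                math.sin(lat))

    a, b = xyz(lat1, lon1), xyz(lat2, lon2)
    # Sum the two unit vectors, then project the result back onto the sphere.
    mx, my, mz = (a[i] + b[i] for i in range(3))
    norm = math.sqrt(mx * mx + my * my + mz * mz)
    if norm == 0.0:
        raise ValueError("antipodal points have no unique midpoint")
    return (math.degrees(math.asin(mz / norm)), math.degrees(math.atan2(my, mx)))


ANCHORAGE = (61.22, -149.90)  # approximate lat, lon
MIAMI = (25.76, -80.19)       # approximate lat, lon

naive_mid = ((ANCHORAGE[0] + MIAMI[0]) / 2, (ANCHORAGE[1] + MIAMI[1]) / 2)
gc_mid = great_circle_midpoint(*ANCHORAGE, *MIAMI)
```

Under these assumptions the great-circle midpoint falls several degrees north and east of the naive one, which is why the true path bows toward the pole relative to a straight line drawn on a flat lat/lon map.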

Geo3D, continued…
Benefits:
- Inherently more accurate than 2D projected spatial, especially for big shapes or near the poles
- Many computations are fast; no expensive trigonometry
- An alternative to JTS without the LGPL license (still)
Has its own Lucene module (spatial3d), thus its own jar file:
- Maven groupId: org.apache.lucene, artifactId: lucene-spatial3d
- Index/Search: Geo3DPoint & Geo3DDocValuesField
- Limited RPT & Spatial4j integration; see Geo3dShape
- No Solr integration yet; pending more Spatial4j integration

New Competing Spatial Fields
GeoPointField, LatLonPoint, Geo3DPoint
All of these:
- Naming is a challenge; don’t read into the names too much
- Exist outside Lucene’s spatial-extras module
- Don’t use abstractions like SpatialStrategy or the Spatial4j lib
- Worked on by various contributors
- Limited to indexed point data (not polygons, etc.)
Note: in Lucene 4 & 5 there was one spatial module. In Lucene 6, that module was effectively renamed to “spatial-extras”, with a new “spatial” module now, plus “spatial3d”.

New Fields continued…
GeoPointField (in “spatial”)
- Supports distance sort/boost without a separate field
- Approximate grid index + docValues (2-phase iterator impl)
Geo3DPoint (in “spatial3d”)
- See Geo3D geometry slides earlier
- Uses new “BKD” PointValues index; 3 dimensions
LatLonPoint (in “sandbox”)
- Most efficient
- Uses new “BKD” PointValues index; 2 dimensions

Performance
Summary:
- LatLonPoint is currently 2x faster than the other 2 (changes often)
- LatLonPoint has the smallest index if you don’t also need distance sorting
- If you need that (i.e. need “docValues”), GeoPoint is smallest
- No sort performance comparison yet; Geo3D looks promising
Comparison to RPT (in spatial-extras):
- RPT is similar to GeoPoint in search performance
- RPT’s indexes are huge
- Remember: RPT supports index-based heatmaps & non-point indexed shapes (and predicates), and custom shapes

Future
- The dust hasn’t settled in Lucene spatial land… lots of activity lately, lots of performance enhancements
- Need to add Solr adapters
- Some Solr spatial ease-of-use / consistency / better docs would be good
- Heatmap performance: planned/funded
- Heatmap with stats (instead of counts): planned/funded