When worlds collide Metasearching meets central indexes Mike Taylor – Index Data –
Search
[Diagram: Data]
Problem solved!
??
Metasearch
[Diagram: Data, Magic box, Searching]
Examples:
- 360 Search
- EHIS (EBSCO)
- MetaLib
- Pazpar2 (open source)
A.K.A.:
- federated search
- broadcast search
- distributed search
- ?
Back to the sad searcher
[Diagram: Data]
??
Central index
[Diagram: Data, Harvesting, Fat database]
Examples:
- Summon
- WorldCat
- Primo Central
- MasterKey
A.K.A.:
- local index
- vertical search
- discovery services
- ?
We need a controlled vocabulary!
Metasearch = Federated search = Distributed search = Broadcast search
Central index = Local index = Discovery services = Vertical search (if you ever heard anything so dumb)
Which approach is better?
Central indexing compared with metasearching:
- requires harvesting infrastructure
- requires lots of local storage
- requires co-operation from services to be harvested
- does not have access to all searchable data
- will always be somewhat out of date
- is faster at search time (or SHOULD be)
- allows data to be normalised (e.g. dates extracted)
- allows for better relevance ranking
- can provide pre-baked facets
- may have access to some data that is not searchable
Let's do both!
Integrated Search
[Diagram: metasearch (Data, Magic box, Searching) alongside central index (Data, Harvesting, Fat database); "!"]
Metasearch hides the complexity
[Diagram: Data, Magic box, Searching]
- Nine tenths under the surface
- What you see looks beautiful
Problems that need solving
A. Problems with pure metasearching
B. How those problems change when you add a central index
Problems with metasearching
Examples based on Index Data's suite:
- Pazpar2 is a free metasearching engine with a stupid name
- MasterKey is a non-open suite that wraps it
- MasterKey is only one way to use Pazpar2
- Also integrated into other vendors' UIs
Problems with metasearching #1: No data server at all!
- Data is often only in a user-facing Web UI
- Must be made available via a standard protocol
- Option 1: build a gateway in Perl
- Option 2: MasterKey Connect (non-open)
Problems with metasearching #2: data server is crap^H^H^H^Hsuboptimal
- Catalogs searchable using ANSI/NISO Z39.50
- Support is very nominal in some cases
- IRSpy probes behaviour
- MasterKey target profiles describe behaviour
Problems with metasearching #3: Data servers don't support relevance
- Pazpar2 does its own relevance ranking (part of merging/deduplication)
Problems with metasearching #4: Data servers don't return facets
- Pazpar2 calculates its own facets
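As a rough illustration of what calculating facets on the engine side involves, occurrence counts can be tallied over whatever records have been retrieved so far. This is a minimal sketch, not Pazpar2's actual implementation: the record layout (a dict of field name to list of values) and the function name `compute_facets` are assumptions made for the example.

```python
from collections import Counter

def compute_facets(records, fields):
    """Tally field values across retrieved records.

    `records` is a list of dicts mapping a field name to a list of values.
    This layout is hypothetical, chosen only for illustration.
    """
    facets = {field: Counter() for field in fields}
    for record in records:
        for field in fields:
            # Count each distinct value at most once per record.
            for value in set(record.get(field, [])):
                facets[field][value] += 1
    return facets

# Made-up sample records.
records = [
    {"author": ["Kernighan", "Ritchie"], "title": ["C"]},
    {"author": ["Kernighan", "Pike"], "title": ["Unix"]},
    {"author": ["Pike"]},
]
facets = compute_facets(records, ["author", "title"])
for field, counts in facets.items():
    print(field, dict(counts))
```

Because the counts come only from records actually fetched, they describe a sample of each server's result set rather than the whole of it.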
There is a lot of magic in the magic box
[Diagram: Data, Magic box (Pazpar2)]
- Searching
- Sorting
- Merging
- Deduplication
- Relevance
- Facet generation
- Time travel...
Remember, our engine is free:
What happens when we add a central index?
[Diagram: metasearch (Data, Magic box, Searching) alongside central index (Data, Harvesting, Fat database); "!"]
Problems with integrated search #1: No data server at all!
- Data is often only in a user-facing Web UI
- You can't harvest Google
- You just can't
Problems with integrated search #2: data server is crap^H^H^H^Hsuboptimal
- Repositories harvestable using OAI-PMH (an even worse name than Pazpar2)
- Support is very nominal in some cases
- OAI-PMH client must be very tolerant
- Extensive data-cleaning is usually required
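One small example of the kind of data-cleaning a harvester ends up doing is pulling a usable year out of free-text date fields. The sketch below is illustrative only: the function name `extract_year`, the accepted year range (1500 to 2099) and the sample inputs are all assumptions, not anything taken from a particular harvester.

```python
import re

# A four-digit year between 1500 and 2099, not embedded in a longer digit run.
# The range is an arbitrary assumption for this sketch.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")

def extract_year(raw):
    """Return the first plausible four-digit year in a free-text date, or None."""
    if not raw:
        return None
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None

# Made-up examples of messy harvested date values.
samples = ["2009-11-03T10:15:00Z", "c1987.", "[19--?]", "", None, "March 2010"]
for raw in samples:
    print(repr(raw), "->", extract_year(raw))
```

Returning `None` instead of raising keeps the harvest running when a record's date is missing or unparseable, which is the tolerant behaviour the slide asks for.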
Problems with integrated search #3: Central index does support relevance
- Returned records carry relevance scores
- Must be merged with records scored by engine
- Requires score normalisation into same range
- Existing ordering may be used in merge
[Diagram: Unranked #1, Ranked #1, Ranked #2, Solr, Sort, Merged, Unranked #2, Sort]
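Score normalisation into a common range can be sketched as follows. The slides do not say which normalisation is used, so min-max scaling into 0..1 is an assumption here, as are the function names and the sample scores.

```python
def normalise_scores(hits):
    """Rescale (record, score) pairs into the range 0..1 using min-max scaling."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    if high == low:
        # No spread to rescale: treat every record as equally top-ranked.
        return [(record, 1.0) for record, _ in hits]
    return [(record, (score - low) / (high - low)) for record, score in hits]

def merge_ranked(*result_lists):
    """Normalise each result list separately, then merge, best score first."""
    merged = []
    for hits in result_lists:
        merged.extend(normalise_scores(hits))
    # sorted() is stable, so ties keep the order in which the lists were given.
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Made-up scores on two different scales.
central_index_hits = [("A", 12.0), ("B", 7.5), ("C", 3.0)]
engine_hits = [("X", 0.9), ("Y", 0.4), ("Z", 0.1)]

merged = merge_ranked(central_index_hits, engine_hits)
for record, score in merged:
    print(f"{record}\t{score:.2f}")
```

Min-max scaling is only one option. It makes the best hit from every source score 1.0, which may over-promote a source whose best hit is weak, so a real engine might weight sources differently.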
Problems with integrated search #4: Central index does return facets

Lists of field values with occurrence counts:
Author
  Kernighan     27
  Pike          13
  Ritchie        7
  Thompson       4
Title
  C              7
  Unix          35
  Programming   16
Date

Lists are returned or calculated for each server:
Server 1 (central index) (all facets from 2000 hits)
  Cat           68
  Dinosaur     162
  Fish         145
  Frog          19
Server 2 (metasearch) (1000 hits, 100 records)
  Cat            7
  Dog           10
  Dinosaur      87
  Fish          23

Metasearched counts normalised by total hit-count:
Server 2 (metasearch) (normalised to 1000 hits)
  Cat           70
  Dog          100
  Dinosaur     870
  Fish         230

Facet lists are merged:
Servers 1+2 (integrated) (as though for all records in result sets)
  Cat        68+70   = 138
  Dog         0+100  = 100
  Dinosaur  162+870  = 1032
  Fish      145+230  = 375
  Frog       19+0    = 19

Fringe benefit: facet-count normalisation is also useful when doing pure metasearching.
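The arithmetic on these slides can be reproduced in a few lines. This is a sketch of the calculation only, not of any engine's code: the function name `normalise_counts` and its parameters are assumptions, while the numbers are the ones from the slides.

```python
from collections import Counter

def normalise_counts(counts, total_hits, records_fetched):
    """Scale counts seen in a sample of records up to the server's total hit count."""
    factor = total_hits / records_fetched
    return Counter({value: round(n * factor) for value, n in counts.items()})

# Server 1 (central index): facets already cover all 2000 hits.
central = Counter({"Cat": 68, "Dinosaur": 162, "Fish": 145, "Frog": 19})

# Server 2 (metasearch): 1000 hits reported, but only 100 records fetched.
metasearch_sample = Counter({"Cat": 7, "Dog": 10, "Dinosaur": 87, "Fish": 23})
metasearch = normalise_counts(metasearch_sample, total_hits=1000, records_fetched=100)

# Adding Counters sums the counts value by value.
merged = central + metasearch
for value in sorted(merged):
    print(f"{value:10}{merged[value]:6}")
```

The scaled counts are estimates: they assume the 100 fetched records are representative of all 1000 hits on that server.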
Summary of search issues

Issue             Metasearch solution                   Central index solution
No data server    Build gateways; MasterKey Connect     ---
Bad data server   Probe capabilities; Profile targets   Tolerant harvester; Data-cleaning
Relevance scores  Magic engine; Normalise scores        Ingest from server
Facets            Magic engine; Normalise counts        Ingest from server