Published by Kristian Park. Modified over 6 years ago.
1
AutoSuggest
This is for ELM
Ralph LeVan
Sr. Research Scientist
7/14/2016 Code4Lib Midwest
2
Goals
Return records at keystroke speeds
Run on an underpowered Unix box
3
Result
Precalculate a response record for every possible legitimate keystroke combination
Load those records into a Pears database and expose them via SRW
Client JavaScript takes keystrokes and turns them into queries to an AutoSuggest servlet
The thin gateway servlet takes queries, turns them into SRW requests, and passes through the record returned
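The design above means each keystroke costs one exact-key lookup, with no ranking at query time. A minimal sketch of that idea follows; the dictionary stands in for the Pears database, and the key normalization (lowercasing, collapsing whitespace) and all function names are assumptions for illustration, not the servlet's actual behavior.

```python
# Hypothetical stand-in for the Pears database: key -> precalculated record.
PRECALCULATED = {
    "a": ["Andrew", "Anthony", "Astrid"],
    "an": ["Andrew", "Anthony"],
    "andrew": ["Andrew"],
}


def normalize(keystrokes: str) -> str:
    """Fold typed text to the key form assumed here: lowercase, single spaces."""
    return " ".join(keystrokes.lower().split())


def suggest(keystrokes: str) -> list[str]:
    """Return the precalculated record for the typed text, or an empty list."""
    return PRECALCULATED.get(normalize(keystrokes), [])
```

Because all ranking happens ahead of time, the gateway does no per-request computation beyond forming the key.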
4
How are the records precalculated?
For each source record, a relevance score is calculated
For VIAF, that’s a value in the record
Names are extracted from the record
The names are ranked
The best name gets the score of the record and subsequent names get a reduced score
For each name, a tuple is generated containing the name, the recordID of the source record, the score for the name, and any other data extracted from the record
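The tuple generation described above might look like the following sketch. The slides do not say how much the score is reduced for each subsequent name, so the `decay` factor is an assumption, as are the type and function names.

```python
from typing import NamedTuple


class NameTuple(NamedTuple):
    name: str
    record_id: str
    score: float
    extra: str


def make_tuples(record_id, record_score, ranked_names, extra="", decay=0.5):
    """Build one tuple per name; ranked_names must be ordered best first.

    The best name gets the record's score. Each later name gets the previous
    score multiplied by `decay` (an assumed reduction rule).
    """
    tuples = []
    score = float(record_score)
    for name in ranked_names:
        tuples.append(NameTuple(name, record_id, score, extra))
        score *= decay
    return tuples
```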
5
How are the records precalculated?
The tuples are sorted
A process reads in all the names that start with the same letter
The first two terms are compared and a top-10 list is started for each set of letters in common
E.g. Andrew and Anthony each go into the top-10 list for A and AN
AutoSuggest records are generated for the singletons Andrew and Anthony
The full name is the key for these records
6
How are the records precalculated?
The next term is compared to the one that preceded it
E.g. Anthony and Astrid are compared
Astrid is added to the top-10 list for A
An AutoSuggest record is written for the AN list
The key for the record is AN
Each of the names (and associated data) is included in the record
An AutoSuggest record is generated for the singleton Astrid
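The adjacent-name comparison walked through on these slides can be sketched as a single pass over the sorted names. This is one reading of the slides, not the author's code: names are assumed to be lowercased already, a prefix list is written only when at least two names share the prefix, and the function names are hypothetical.

```python
import heapq

TOP_N = 10


def shared_length(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def emit(prefix, entries):
    """Return a (key, names) record, or None if the prefix needs no record."""
    # Write a list shared by several names, or a singleton keyed by its full name.
    if len(entries) > 1 or entries[0][1] == prefix:
        return prefix, [name for _, name in heapq.nlargest(TOP_N, entries)]
    return None


def build_records(sorted_tuples):
    """sorted_tuples: (name, score) pairs sorted by name. Yields (key, names)."""
    open_lists = {}  # prefix -> [(score, name)], only for prefixes of the previous name
    prev = ""
    for name, score in sorted_tuples:
        shared = shared_length(prev, name)
        # Prefixes of the previous name that this name does not share are complete.
        for prefix in [p for p in open_lists if len(p) > shared]:
            record = emit(prefix, open_lists.pop(prefix))
            if record:
                yield record
        for i in range(1, len(name) + 1):
            open_lists.setdefault(name[:i], []).append((score, name))
        prev = name
    for prefix, entries in open_lists.items():
        record = emit(prefix, entries)
        if record:
            yield record
```

Note that the list for a one-letter prefix still accumulates every name under that letter before it is trimmed, which is the memory problem a later slide raises.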
7
Top-10 is complicated
The naïve assumption is that the 10 names with the highest scores would be in the list
But then all the variations on Shakespeare that start with S would fill the S record
So a candidate name for the top-10 list is checked to see whether a higher-ranking name with the same recordID exists before it is added
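The recordID check can be illustrated with a small sketch that keeps only the highest-scoring name per source record before taking the top entries. The tuple layout `(name, record_id, score)` and the function name are assumptions for illustration.

```python
def top_n_distinct(candidates, n=10):
    """Keep the best-scoring name per record_id, then return the n best overall.

    candidates: iterable of (name, record_id, score) tuples.
    """
    best = {}
    for name, record_id, score in candidates:
        if record_id not in best or score > best[record_id][2]:
            best[record_id] = (name, record_id, score)
    return sorted(best.values(), key=lambda t: t[2], reverse=True)[:n]
```

With this rule, variant spellings from one Shakespeare record occupy a single slot in the S list, leaving room for other records.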
8
It’s not really that easy
All the names that start with A won’t fit into memory
We do all of this work in Hadoop
We partition the tuple input on the first 5 letters in common
Process as described before, but write the shorter fragments (fewer than 5 letters) to a separate directory
Combine those lists to produce unified lists (and records)
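The partition-then-combine step can be simulated in plain Python, as sketched below. This is not Hadoop code: in-memory dictionaries stand in for the partitions and the separate directory, and every function name is hypothetical. The combine step works because the overall top 10 for a short prefix must come from the union of the per-partition top-10 lists.

```python
import heapq
from collections import defaultdict

PARTITION_LEN = 5
TOP_N = 10


def process_partition(entries):
    """entries: (name, score) pairs that share their first five letters.

    Returns (long_records, short_lists), split on prefix length.
    """
    lists = defaultdict(list)
    for name, score in entries:
        for i in range(1, len(name) + 1):
            lists[name[:i]].append((score, name))
    long_records, short_lists = {}, {}
    for prefix, members in lists.items():
        target = short_lists if len(prefix) < PARTITION_LEN else long_records
        target[prefix] = heapq.nlargest(TOP_N, members)
    return long_records, short_lists


def combine_short(all_short_lists):
    """Merge per-partition lists for short prefixes into unified top-N lists."""
    merged = defaultdict(list)
    for short_lists in all_short_lists:
        for prefix, members in short_lists.items():
            merged[prefix].extend(members)
    return {p: heapq.nlargest(TOP_N, m) for p, m in merged.items()}


def run(tuples):
    """Partition on the first five letters, process each part, then combine."""
    partitions = defaultdict(list)
    for name, score in tuples:
        partitions[name[:PARTITION_LEN]].append((name, score))
    records, shorts = {}, []
    for entries in partitions.values():
        long_records, short_lists = process_partition(entries)
        records.update(long_records)
        shorts.append(short_lists)
    records.update(combine_short(shorts))
    return records
```

Prefixes of five or more letters never span partitions, so only the shorter fragments need the extra merge pass.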
9
Loaded into Pears
All these generated records are loaded into Pears
Lots and lots of records
The latest AutoSuggest database for VIAF has 341 million records in it
VIAF itself has only 31 million records
10
Ralph LeVan