Published by Daniela Wheeler. Modified over 9 years ago.
1
Collections Management Museums EMu Searching
EMu Searching Explained (What’s going on under the hood!)
Bernard Marshall, Chief Technical Officer, KE Software
2
Overview
  The basic theory
  Tools and tuning
  Searching issues
3
EMu search mechanism
  Two-level superimposed coding scheme for partial match retrieval
  Developed from research at the University of Melbourne (early 1980s)
  Designed to provide very high speed retrieval from very large datasets
  The more search terms provided, the faster the search
  One set of indexes for all searching (except key searches)
4
Record descriptor
  Encodes the contents of one record into a single bit string
  Descriptors stored sequentially in the rec file
  Each record descriptor has the data offset (from the data file) appended
  [Diagram: the rec file holds rec descriptor 1 to rec descriptor 5, each followed by an offset; each offset points to that record's data in the data file (record data 1, record data 3 and record data 2 are shown).]
5
Field           Terms         Bits set (k = 2)  Descriptor (b = 15)
First Name      Boris         3, 10             00010 00000 10000
Surname         Badenov       1, 4              01001 00000 00000
City            Frostbite     3, 7              00010 00100 00000
                Falls         8, 14             00000 00010 00001
Country         Pottsylvania  4, 9              00001 00001 00000
Rec Descriptor                                  01011 00111 10001

[Diagram: the term, k, b and column no are the inputs to a pseudo random number generator, which outputs the bit numbers.]
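The table above can be sketched in a few lines of Python. This is a minimal illustration of superimposed coding, not EMu's implementation: the slides do not say which pseudo random number generator is used, so SHA-256 stands in for it here, and `term_bits` and `record_descriptor` are hypothetical names. The bit numbers it produces will therefore differ from the ones on the slide.

```python
import hashlib


def term_bits(term: str, column: int, k: int, b: int) -> set[int]:
    """Pick k distinct bit numbers in [0, b) for a term in a given column.

    Assumption: SHA-256 replaces the slide's pseudo random number generator.
    Requires k <= b, otherwise the loop cannot terminate.
    """
    bits: set[int] = set()
    counter = 0
    while len(bits) < k:
        seed = f"{column}:{term.lower()}:{counter}".encode()
        digest = hashlib.sha256(seed).digest()
        bits.add(int.from_bytes(digest[:8], "big") % b)
        counter += 1
    return bits


def record_descriptor(record: dict[int, list[str]], k: int, b: int) -> int:
    """OR the bits of every term in the record into one b-bit descriptor."""
    descriptor = 0
    for column, terms in record.items():
        for term in terms:
            for bit in term_bits(term, column, k, b):
                descriptor |= 1 << bit
    return descriptor


# Column numbers 0..3 are arbitrary labels for the four fields on the slide.
record = {0: ["Boris"], 1: ["Badenov"], 2: ["Frostbite", "Falls"], 3: ["Pottsylvania"]}
desc = record_descriptor(record, k=2, b=15)
print(format(desc, "015b"))
```

Because every term's bits are ORed into the same string, a descriptor built from any subset of the record's terms is always contained in the record descriptor, which is the property the search relies on.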
6
Record descriptor (searching)
  Generate record descriptor for search term(s)
  AND with all record descriptors to find matching record(s)

Field             Terms  Bits set (k = 2)  Descriptor (b = 15)
First Name        Boris  3, 10             00010 00000 10000
Query Descriptor                           00010 00000 10000

      00010 00000 10000   Boris query descriptor
AND   01011 00111 10001   record descriptor
  =   00010 00000 10000   resultant descriptor

The record matches because the resultant descriptor equals the query descriptor.
7
False matches
  Query descriptor matches a record descriptor that does not contain the search term

Field             Terms    Bits set (k = 2)  Descriptor (b = 15)
First Name        Natasha  7, 9              00000 00101 00000
Query Descriptor                             00000 00101 00000

      00000 00101 00000   Natasha query descriptor
AND   01011 00111 10001   record descriptor
  =   00000 00101 00000   resultant descriptor
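The Boris and Natasha examples can be replayed directly with the bit strings from the slides. The sketch below is only an illustration: `parse_bits` and `may_match` are hypothetical helper names, and the set of stored terms stands in for the exact-match check against the real record.

```python
def parse_bits(text: str) -> int:
    """Convert a slide-style bit string to an int; the leftmost character is bit 0."""
    digits = text.replace(" ", "")
    return sum(1 << i for i, ch in enumerate(digits) if ch == "1")


def may_match(query: int, record: int) -> bool:
    """A record is a candidate when every query bit is also set in the record."""
    return query & record == query


record = parse_bits("01011 00111 10001")   # rec descriptor from the table
boris = parse_bits("00010 00000 10000")    # bits 3 and 10
natasha = parse_bits("00000 00101 00000")  # bits 7 and 9

# Terms actually stored in the example record.
stored_terms = {"boris", "badenov", "frostbite", "falls", "pottsylvania"}

for name, query in [("boris", boris), ("natasha", natasha)]:
    candidate = may_match(query, record)
    genuine = candidate and name in stored_terms
    print(f"{name}: candidate={candidate} genuine={genuine}")
```

Both queries pass the descriptor test, but only Boris survives the check against the stored terms. Natasha is a false match: bits 7 and 9 were set by Frostbite and Pottsylvania.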
8
False matches
  Chance of a false match is related to bit density
  The lower the bit density, the lower the probability of a false match
  EMu uses a bit density of < 25%; that is, less than 25% of bits are one
  Probability of a false match with k = 5 is 1 in 1,024 record descriptors checked for a single term query
  Probability for a two term query is 1 in 1,048,576
  Lower bit density requires more disk space and produces longer record descriptors
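The two figures above follow from a simple model, sketched here under the assumption that each bit of a record descriptor is set independently with probability equal to the bit density. A query term then false-matches only if all k of its bits happen to be set. The function name is hypothetical.

```python
def false_match_probability(density: float, k: int, terms: int = 1) -> float:
    """Chance that all k bits of every query term are set by coincidence."""
    return density ** (k * terms)


one_term = false_match_probability(0.25, k=5)
two_terms = false_match_probability(0.25, k=5, terms=2)

print(f"single term: 1 in {round(1 / one_term):,}")
print(f"two terms:   1 in {round(1 / two_terms):,}")
```

At a density of 25% this gives 1 in 1,024 for one term and 1 in 1,048,576 for two, matching the slide.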
9
Segment descriptor
  Encodes the contents of multiple records into a bit string
  Descriptors stored sequentially in the seg file (bitsliced)
  [Diagram: consecutive record descriptors (rec descriptor 1, 2, 3, ...) are grouped, and each group is summarised by one segment descriptor (seg descriptor 1, seg descriptor 2, ...).]
10
Segment descriptor
  For each group of Nr records, a single descriptor is calculated in the same way as a record descriptor
  The segment level has its own values for k (number of bits to set) and b (length of bit string)
11
Segment descriptor (searching)
  Segment searching checks Nr records per descriptor
  For efficient disk access for searching, “flip” seg file (bitslicing)
  Penalty is slower record insertions / updates (use oflow file)

  00001 00000 00100 00000 01000   seg query descriptor
  10011 00010 00111 00001 11001   seg descriptor 1
  00011 10000 00001 01100 00100   seg descriptor 2
  01000 00110 11000 00011 01001   seg descriptor 3
  01001 00100 01100 00101 01000   seg descriptor 4
12
Segment descriptor (bitsliced)
  [Bit slices in the order shown on the slide, one per line]
  1000 …
  0011 …
  0000 …
  1100 …
  1101 …
  0100 …
  0000 …
  0011 …
  1010 …
  0000 …
  0010 …
  0011 …
  1001 …
  …
  1001 …   AND
  Each bit slice is ANDed to determine matching segments
  Matching segments are given by bit positions with a value of one
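The relationship between the row-wise segment descriptors and their bit slices can be checked with the four example descriptors from the segment searching slide. This is a sketch only: the in-memory lists stand in for the seg file, and the variable names are hypothetical.

```python
SEGMENTS = [
    "10011 00010 00111 00001 11001",  # seg descriptor 1
    "00011 10000 00001 01100 00100",  # seg descriptor 2
    "01000 00110 11000 00011 01001",  # seg descriptor 3
    "01001 00100 01100 00101 01000",  # seg descriptor 4
]
QUERY = "00001 00000 00100 00000 01000"  # seg query descriptor


def to_bits(text: str) -> list[bool]:
    """Slide-style bit string to a list of booleans; index 0 is the leftmost bit."""
    return [ch == "1" for ch in text.replace(" ", "")]


rows = [to_bits(s) for s in SEGMENTS]
query_bits = [i for i, on in enumerate(to_bits(QUERY)) if on]

# Row-wise search: every segment descriptor has to be read and tested.
row_matches = [all(row[i] for i in query_bits) for row in rows]

# Bit-sliced search: slice i holds bit i of every segment descriptor,
# so only the slices for the query's set bits need to be read.
slices = [[row[i] for row in rows] for i in range(len(rows[0]))]
result = [True] * len(rows)
for i in query_bits:
    result = [r and s for r, s in zip(result, slices[i])]

print("query bits:", query_bits)
print("result:", "".join("1" if r else "0" for r in result))
```

The query sets bits 4, 12 and 21, so three slices are ANDed. The result is 1001: segments 1 and 4 match, the same answer as the row-wise test.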
13
Complete search sequence
  Build segment query descriptor for query terms
  Search bitsliced segment file for list of matching segments
  Build record query descriptor for query terms
  Search record descriptors in matching segments for matching records
  Exact-match each candidate record against the query before showing it to the user
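The five steps above can be put together in a toy, in-memory index. Everything here is an assumption made for illustration: the `ToyIndex` class, the SHA-256 based bit selection, the parameter values and the sample records are not from EMu, and the segment level is scanned row-wise rather than bitsliced to keep the sketch short.

```python
import hashlib


def term_mask(term: str, k: int, b: int) -> int:
    """Hypothetical stand-in for the index's bit selection: k distinct bits of b."""
    mask, counter = 0, 0
    while bin(mask).count("1") < k:
        digest = hashlib.sha256(f"{term.lower()}:{counter}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % b)
        counter += 1
    return mask


def descriptor(terms: list[str], k: int, b: int) -> int:
    """Superimpose (OR) the masks of all terms."""
    result = 0
    for term in terms:
        result |= term_mask(term, k, b)
    return result


class ToyIndex:
    def __init__(self, records, nr=2, rec_params=(3, 64), seg_params=(3, 128)):
        self.records, self.nr = records, nr
        self.rec_params, self.seg_params = rec_params, seg_params
        self.rec_desc = [descriptor(r, *rec_params) for r in records]
        # One segment descriptor per group of nr consecutive records.
        self.seg_desc = [
            descriptor([t for r in records[i:i + nr] for t in r], *seg_params)
            for i in range(0, len(records), nr)
        ]

    def search(self, terms: list[str]) -> list[int]:
        seg_query = descriptor(terms, *self.seg_params)   # step 1
        rec_query = descriptor(terms, *self.rec_params)   # step 3
        wanted = {t.lower() for t in terms}
        hits = []
        for s, seg in enumerate(self.seg_desc):           # step 2
            if seg & seg_query != seg_query:
                continue
            last = min((s + 1) * self.nr, len(self.records))
            for n in range(s * self.nr, last):            # step 4
                if self.rec_desc[n] & rec_query != rec_query:
                    continue
                # Step 5: exact match removes any false matches.
                if wanted <= {t.lower() for t in self.records[n]}:
                    hits.append(n)
        return hits


# Invented sample records, each a flat list of terms.
index = ToyIndex([
    ["Boris", "Badenov", "Frostbite", "Falls", "Pottsylvania"],
    ["Natasha", "Pottsylvania"],
    ["Frostbite", "Falls"],
])
print(index.search(["Pottsylvania"]))
print(index.search(["Frostbite", "Falls"]))
```

The final exact-match step is what makes the result trustworthy: the descriptors can only rule records out, never confirm them.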
14
Number of disk accesses (logical)
  For a single search term with one matching record, ks + 1 logical reads:
    ks – bits set per term (segment level)
    1 – disk read to read the segment and match the record descriptor
  Number of logical reads is independent of the table size
  Number of physical reads increases as the table grows (but disk read ahead helps here)
15
Client query evaluation
  Attachment searches performed and matching IRNs on reference column added to query statement
  Reverse attachment searches performed and matching reference values added to query statement
  Local search terms added to query statement
  Also search columns added to query statement
  Search performed
16
What is a term?

Type      Term                 Query examples
Text      word                 Frostbite, falls
Float     number               9.12
Integer   number               12
Date      day, month, year     12-10-2010
Time      hour, min, sec       13:12:10.0
Lat/Long  deg, min, sec, dir   120 12 10.43 N
String    value                A1-124/7

A term is the basic index component
17
Term modifiers

Modifier  Applicable types  Query examples
Null      all types         *, !*
Partial   text, string      ab*, a{a-z}*
Stem      text              ~electric
Phonetic  text              @smythe
Phrase    text              “Red house”

Modifiers alter how the term is indexed
18
Indexing tools
  texdensity
    Prints out the bit density for segment and record descriptors
  texanalyse
    Prints the number of terms per record
  texconf
    Calculates a suitable index configuration
  Adjust configuration parameters manually
19
Configuration parameters
  params file in table directory
  Overrides default configuration parameters
    Bit density (rec/seg)
    File system block size
    False match probability (rec/seg)
    Minimum number of records per segment
  XML based file
20
Searching Issues – false matches
  Issue
    Some queries are slow but disk activity is high
  Diagnose
    texadmin database usage shows a high number of index false matches
    texdensity shows high density or large standard deviation with high maximum density (check seg and rec)
    texanalyse shows a large standard deviation for the number of index terms (check seg and rec)
  Fix
    Reconfigure table
    Set configuration parameters manually
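To make the density diagnosis concrete, the sketch below measures mean, standard deviation and maximum bit density over a handful of descriptors. It is not texdensity and does not reproduce its output; `bit_density` is a hypothetical helper, and the data are the four toy segment descriptors from the segment searching slide.

```python
from statistics import mean, pstdev


def bit_density(descriptor: str) -> float:
    """Fraction of bits set to one in a slide-style bit string."""
    digits = descriptor.replace(" ", "")
    return digits.count("1") / len(digits)


seg_descriptors = [
    "10011 00010 00111 00001 11001",
    "00011 10000 00001 01100 00100",
    "01000 00110 11000 00011 01001",
    "01001 00100 01100 00101 01000",
]

densities = [bit_density(d) for d in seg_descriptors]
print(f"mean={mean(densities):.2f} stdev={pstdev(densities):.2f} max={max(densities):.2f}")
```

These toy descriptors average 35% and peak at 44%, well above the < 25% quoted for EMu, which is why short teaching examples false-match so readily. A wide spread with a high maximum means a few descriptors are much denser than the rest, and those are the ones most likely to produce false matches.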
21
Searching Issues – common terms
  Issue
    Some queries containing common terms are slow
    “False” segment matches
  Diagnose
    Querying on each term individually results in a large number of matches (query is quick)
    Querying on the combination of terms becomes slow
  Fix
    Cluster table on a common term
    Sort data before indexing
22
Searching Issues – block size mismatch
  Issue
    Overall searching is slow but disk activity is high
    Using zfs with large record size
  Diagnose
    Determine the block size of the file system used to hold index files
    Use texconf to determine the block size used for indexing
  Fix
    Set blocksize configuration parameter manually
    Adjust zfs record size to 16K
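One way to carry out the first diagnosis step on a Unix system is to ask the operating system what block size it reports for the directory holding the index files. This is a generic sketch using Python's `os.statvfs`, not an EMu tool; the path is a placeholder, and how a given file system (zfs in particular) maps its record size onto these two fields is an assumption worth verifying with the file system's own utilities.

```python
import os


def fs_block_sizes(path: str) -> tuple[int, int]:
    """Return (preferred block size, fundamental block size) for the file system at path."""
    stats = os.statvfs(path)
    return stats.f_bsize, stats.f_frsize


# "." is a placeholder; point this at the table directory holding the index files.
preferred, fundamental = fs_block_sizes(".")
print(f"preferred I/O block size: {preferred} bytes")
print(f"fundamental block size:   {fundamental} bytes")
```

The value reported can then be compared with the block size that texconf shows for the index.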
23
Searching Issues – RAID configuration
  Issue
    Record updates are very slow
    Fast disks but performance less than optimal
  Diagnose
    Disk controller or driver is configured to use RAID 5 or 6
  Fix
    RAID 1+0 (RAID 10, stripe/mirror) gives optimal performance in a RAID environment
    Ensure striping agrees with block size of file system
    Enable striping where possible
24
Searching Issues – Unindexed fields
  Issue
    Wildcard / stem / phonetic based queries are extremely slow
  Diagnose
    Use emuindexing to check indexing of fields being queried
  Fix
    Add Registry entries to enable the indexing required:
      System|Setting|Table|table|Stem Index|colname;colname;...
      System|Setting|Table|table|Phonetic Index|colname;colname;...
      System|Setting|Table|table|Null Index|colname;colname;...
      System|Setting|Table|table|Partial Index|colname=parts;...
25
Searching Issues – Range queries slow
  Issue
    Queries containing ranges are slow
  Diagnose
    Use emuindexing to check if range indexing is enabled
  Fix
    Use emurangeupdate to optimise range based searching
    Add Registry entries to enable the indexing required:
      System|Setting|Table|table|Range Buckets|colname|bucket;...
26
Searching Issues – Large attachment queries
  Issue
    Query is very slow when performing a query containing attachments and other terms
  Diagnose
    “Optimising query” status is displayed for a long time
  Cause
    The search engine is re-organising the query
      (a AND b) AND (c OR d OR e OR f OR g)
    becomes
      (a AND b AND c) OR (a AND b AND d) OR (a AND b AND e) OR (a AND b AND f) OR (a AND b AND g)
  Fix
    Rewrite the query optimiser
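The re-organisation shown above is a distribution of AND over OR, and the number of resulting clauses is the product of the sizes of the OR groups. The sketch below reproduces the slide's example and shows how quickly the clause count grows. It illustrates the transformation only; how EMu's optimiser actually implements it is not described in the slides, and the 1,000-value groups are an invented example.

```python
from itertools import product
from math import prod


def distribute(groups: list[list[str]]) -> list[tuple[str, ...]]:
    """Expand ANDed OR-groups into ORed AND-clauses."""
    return list(product(*groups))


# (a AND b) AND (c OR d OR e OR f OR g): single terms are groups of one.
query = [["a"], ["b"], ["c", "d", "e", "f", "g"]]
clauses = distribute(query)
print(" OR ".join("(" + " AND ".join(clause) + ")" for clause in clauses))

# Two attachment searches that each return 1,000 values (invented sizes):
# count the clauses without materialising them.
big_query_sizes = [1, 1, 1000, 1000]
print(f"{prod(big_query_sizes):,} clauses")
```

With one group of five alternatives the expansion is harmless, but each additional attachment search multiplies the clause count, which is consistent with the long “Optimising query” status described above.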
27
References
  EMu 4.0.01 Release Notes
    System Tuning
  Configuration
    www.kesoftware.com/downloads/EMu/documents/configuration.pdf
  Range Indexing
    www.kesoftware.com/downloads/EMu/documents/Range Indexing/rangeindexing.pdf