Published by Morgan Haynes. Modified over 9 years ago.
1
Index Building
2
Overview
- Database tables
- Building flow (logical)
- Sequential
- Drawbacks
- Parallel processing
- Recovery
- Helpful rules
3
Database tables
Word index:
- Z97: word dictionary
- Z98: bitmap
- Z980: cache of bitmap updates
- Z95: words in document
4
Database tables: Z97
- translation from word to internal representation (sequence)
- same character set as documents
5
Database tables: Z98
- “bitmap” of word occurrence in documents
- each bitmap is physically made up of one or more records
- compressed
- one bitmap for every combination of word and index
6
Database tables: Z980
- cache of bitmap updates
- increases speed of large bitmap updates
- 1/1000
7
Database tables: Z95
- list of words and their location in a document
- adjacency
8
Database tables
Heading index:
- Z01: phrase dictionary
- Z02: phrase->document mapping
9
Database tables: Z01
- filing phrase
- connection to authority database
- hash key (display text)
10
Building flow - word
Stage 1: Retrieval + Sort
- read document
- prepare list of words and locations
- for each word find list of indices it belongs to
- sort according to words
11
Building flow - word
Stage 2: Word Dictionary
- read intermediate file from stage 1
- build up word dictionary (check + load)
- replace word with internal representation
- create 2nd intermediate file
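Stages 1 and 2 of the word flow can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the function names, the regular-expression tokenizer, and the tuple layout are hypothetical, the per-index assignment of words is left out, and the real build works on intermediate files rather than Python lists.

```python
import re

def extract_postings(docs):
    """Stage 1 (sketch): list (word, doc_no, position) for every word, sorted by word."""
    postings = []
    for doc_no, text in docs.items():
        words = re.findall(r"\w+", text.lower())  # assumed tokenizer
        for position, word in enumerate(words, start=1):
            postings.append((word, doc_no, position))
    return sorted(postings)

def build_dictionary(postings):
    """Stage 2 (sketch): give each distinct word a sequence number and
    replace the word by that number in a second intermediate list."""
    dictionary = {}
    numbered = []
    for word, doc_no, position in postings:
        word_no = dictionary.setdefault(word, len(dictionary) + 1)
        numbered.append((word_no, doc_no, position))
    return dictionary, numbered

docs = {1: "red fox", 2: "fox den"}
dictionary, numbered = build_dictionary(extract_postings(docs))
print(dictionary)  # {'den': 1, 'fox': 2, 'red': 3}
```

Because the input is sorted by word and numbers are handed out in that order, the second list comes out ordered by word number, which is the property stage 4 relies on.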
12
Building flow - word
Stage 3: Sort + Build Z95
- sort intermediate file from stage 2 by document number
- create Z95 records
- load Z95 sequential file to database
13
Building flow - word
Stage 4: Merge + Build Z98
- intermediate file from stage 2 is already sorted by word number
- split words into a number of files according to range of word numbers
- merge into Z98 records
- load sequential files
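A rough sketch of what stages 3 and 4 produce from the stage 2 output, assuming that output is a list of (word_no, doc_no, position) tuples. The function names are hypothetical, and a Python integer stands in for the bitmap; the actual Z98 records are compressed and split across one or more records, which is not modelled here.

```python
from collections import defaultdict

def build_z95(numbered):
    """Stage 3 (sketch): regroup by document as doc_no -> [(word_no, position), ...]."""
    by_doc = defaultdict(list)
    for word_no, doc_no, position in sorted(numbered, key=lambda r: (r[1], r[2])):
        by_doc[doc_no].append((word_no, position))
    return dict(by_doc)

def build_bitmaps(numbered):
    """Stage 4 (sketch): one bitmap per word; bit d is set when the word occurs in document d."""
    bitmaps = defaultdict(int)
    for word_no, doc_no, _position in numbered:
        bitmaps[word_no] |= 1 << doc_no
    return dict(bitmaps)

numbered = [(1, 2, 2), (2, 1, 2), (2, 2, 1), (3, 1, 1)]
print(build_z95(numbered))      # {1: [(3, 1), (2, 2)], 2: [(2, 1), (1, 2)]}
print(build_bitmaps(numbered))  # {1: 4, 2: 6, 3: 2}
```

With bitmaps in this form, the documents containing two words are the set bits of the bitwise AND of their bitmaps, while the per-document positions support adjacency checks.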
14
Building flow - heading
Stage 1: Retrieval + Sort
- read document
- prepare list of phrases
- for each phrase find list of indices it belongs to
- sort according to hash key
15
Building flow - heading
Stage 2: Phrase Dictionary
- read intermediate file from stage 1
- build up phrase dictionary
- generate unique key - acc sequence
- load Z01 sequential file to database
- build Z02 - non unique
16
Building flow - heading
Stage 3: Sort + Load Z02
- sort non unique Z02 sequential file
- load Z02 sequential file to database
17
Sequential - word
- every stage is handled by a single process
- a stage proceeds only after the previous stage has handled the data
- stage 4 proceeds only after all other stages have finished
18
Sequential - word
Example from version 12.1:
    csh -f p_manage_01_a $1 >& $data_scratch/p_manage_01_a.log &
    csh -f p_manage_01_b $1 >& $data_scratch/p_manage_01_b.log &
    csh -f p_manage_01_c $1 >& $data_scratch/p_manage_01_c.log &
    csh -f p_manage_01_d $1 >& $data_scratch/p_manage_01_d.log
    csh -f p_manage_01_e $1 >& $data_scratch/p_manage_01_e.log
19
Sequential - word
- p_manage_01_a: retrieval
- p_manage_01_b: sort (by word)
- p_manage_01_c: build Z97
- p_manage_01_d: build Z95
- p_manage_01_e: merge + build Z98
20
Drawbacks
- minimum parallel processing
- single process per stage
- no recoverability: Z97 could be reused, but the whole building process needed to be rerun
- computer resources not fully utilized
- long run time
21
Parallel processing
- large databases - multiple processors
- identify stages that are not “workflow” bottlenecks
- coordinate parallel processes with assignment/progress table
22
Parallel processing (word)
Stage 1: Retrieval + Sort
- retrieval is parallel: it is an “io” bottleneck, not a “workflow” bottleneck
- split into cycles, each covering a range of document numbers
23
Parallel processing (word)
p_manage_01_a.cycles - initial
0001 - - - - 000000001 000010000
0002 - - - - 000010001 000020000
0003 - - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
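The initial cycles table is simple to derive from the highest document number and the cycle size. The sketch below reproduces the layout shown on the slide; the function name and its parameters are hypothetical, and it is not the script that creates the real file.

```python
def make_cycles(last_doc, cycle_size, n_status=4):
    """Return cycle rows: cycle number, status signs, first and last document number."""
    rows = []
    start = 1
    while start <= last_doc:
        end = min(start + cycle_size - 1, last_doc)
        signs = " ".join("-" * n_status)  # every sub-stage starts as "not processed"
        rows.append(f"{len(rows) + 1:04d} {signs} {start:09d} {end:09d}")
        start = end + 1
    return rows

rows = make_cycles(110511, 10000)
print(rows[0])   # 0001 - - - - 000000001 000010000
print(rows[-1])  # 0012 - - - - 000110001 000110511
```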
24
Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 1st retrieval cycle
0001 ? - - - 000000001 000010000
0002 ? - - - 000010001 000020000
0003 ? - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
25
Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 2nd retrieval cycle
0001 + + ? - 000000001 000010000
0002 + ? - - 000010001 000020000
0003 + - - - 000020001 000030000
0004 ? - - - 000030001 000040000
0005 ? - - - 000040001 000050000
0006 ? - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
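The way processes pick up work from such a table can be sketched as follows. The reading of the signs follows the deck ("-" not processed, "?" in process, "+" done); treating the four status columns as retrieval plus the three later sub-stages is an assumption, as are the function names. A real implementation would also need file locking so that two processes cannot claim the same cycle, which is omitted here.

```python
def parse(line):
    """Split a cycle row into its number, status signs, and document range."""
    fields = line.split()
    return {"cycle": fields[0], "status": fields[1:-2],
            "first": fields[-2], "last": fields[-1]}

def claim_next(rows, stage):
    """Mark the first cycle that is ready for `stage` as in process and return its number.

    A cycle is ready when the previous stage is done ('+') and this stage is untouched ('-').
    """
    for row in rows:
        ready = stage == 0 or row["status"][stage - 1] == "+"
        if ready and row["status"][stage] == "-":
            row["status"][stage] = "?"
            return row["cycle"]
    return None

def mark_done(rows, cycle, stage):
    """Flip a finished cycle from in process to done for `stage`."""
    for row in rows:
        if row["cycle"] == cycle:
            row["status"][stage] = "+"

rows = [parse(l) for l in [
    "0003 + - - - 000020001 000030000",
    "0004 ? - - - 000030001 000040000",
    "0007 - - - - 000060001 000070000",
]]
print(claim_next(rows, 0))  # 0007
print(claim_next(rows, 1))  # 0003
```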
26
Parallel processing (word)
- whenever possible, stages were split into separate sub-stages
- usually in cases of non-parallel stages
- stages 2 and 3 were not made into parallel processes, since retrieval was by far the most costly stage
27
Parallel processing (word)
Stages 2 and 3 were subdivided into 3 sub-stages:
- build Z97 + load
- sort intermediate file by document number
- build Z95 + load
28
Parallel processing (word)
p_manage_01_a.cycles - example
0001 + + + + 000000001 000010000
0002 + + + ? 000010001 000020000
0003 + + ? - 000020001 000030000
0004 + + - - 000030001 000040000
0005 + ? - - 000040001 000050000
0006 + - - - 000050001 000060000
0007 ? - - - 000060001 000070000
0008 ? - - - 000070001 000080000
0009 ? - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
29
Parallel processing (word)
Stage 4 is split into sub-stages:
- pre-processing of intermediate files from stage 2 - distribution of words
- build Z98 - parallel
- load Z98 sequential file
Input files are compressed and stored in a separate directory.
30
Parallel processing (word)
Pre-processing:
- generate histogram - # of lines per 5000 words
- determine range of words - no more than 1G in intermediate files
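One plausible way to turn the histogram into word ranges is a greedy pass that closes a range whenever adding the next bucket would exceed the budget. This is a guess at the approach, not the documented algorithm: the function name is hypothetical, the budget is expressed in lines (converting the 1G limit into a line count is left to the caller), and the open-ended last range ending at 999999999 mirrors the p_manage_01_e.cycles listing.

```python
def word_ranges(histogram, max_lines, bucket_words=5000, last_word=999_999_999):
    """Group histogram buckets into (first_word, last_word) ranges.

    histogram[i] is the number of intermediate-file lines for word numbers
    i*bucket_words + 1 .. (i + 1)*bucket_words.
    """
    ranges = []
    start = 1
    total = 0
    for i, lines in enumerate(histogram):
        if total and total + lines > max_lines:
            ranges.append((start, i * bucket_words))  # close at end of previous bucket
            start = i * bucket_words + 1
            total = 0
        total += lines
    ranges.append((start, last_word))  # last range is left open-ended
    return ranges

print(word_ranges([40, 50, 30, 60, 10], max_lines=100))
# [(1, 10000), (10001, 999999999)]
```

A single bucket larger than the budget still gets its own range in this sketch; how the real pre-processing treats that case is not stated in the deck.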
31
Parallel processing (word)
p_manage_01_e.cycles
0001 - - 000000001 000600000
0002 - - 000600001 000900000
0003 - - 000900001 999999999
32
Parallel processing (word)
Build Z98:
- intermediate files - split into discrete range of words
- parallel merging and building of Z98
33
Parallel processing (word)
p_manage_01_e.cycles - example
0001 + ? 000000001 000600000
0002 ? - 000600001 000900000
0003 ? - 000900001 999999999
34
Parallel processing (heading)
Stage 1: Retrieval + Sort
- same handling as word index stage 1
- “io” bottleneck
- split into cycles, each covering a range of document numbers
35
Parallel processing (heading)
p_manage_02.cycles
0001 - - - - 000000001 000005000
0002 - - - - 000005001 000010000
0003 - - - - 000010001 000015000
0004 - - - - 000015001 000020000
0005 - - - - 000020001 000025000
0006 - - - - 000025001 000030000
0007 - - - - 000030001 000035000
0008 - - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435
36
Parallel processing (heading)
Stages 2 and 3 were subdivided into 3 sub-stages:
- build Z01 + load + build Z02
- sort non unique Z02 sequential file
- load Z02
37
Parallel processing (heading)
p_manage_02.cycles - example
0001 + + + ? 000000001 000005000
0002 + + ? - 000005001 000010000
0003 + + - - 000010001 000015000
0004 + ? - - 000015001 000020000
0005 + - - - 000020001 000025000
0006 ? - - - 000025001 000030000
0007 ? - - - 000030001 000035000
0008 ? - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435
38
Parallel processing (heading)
Building of headings is conceptually and practically similar to word building, except for the building of bitmaps (Z98).
39
Recovery
Word index:
- stages 1-3 and stage 4 are separate
- stage 4 runs only after all processing is done in stage 3
40
Recovery
Stages 1-3 - scenarios:
- database tables need to be enlarged
- not enough disk space - intermediate files
- not enough disk space - sort
- general disaster?
41
Recovery
Stages 1-3:
- identify last successful section
- change “in process” signs (?) to “not processed” signs (-)
- rerun discrete stage scripts:
  - p_manage_01_a
  - p_manage_01_c
  - p_manage_01_d
  - p_manage_01_d1
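The sign change in the recovery procedure amounts to a small edit of the cycles file. A minimal sketch, assuming the whitespace-separated layout shown in the cycles listings (cycle number, status signs, two document numbers); the function name is hypothetical, and identifying the last successful section and rerunning the scripts remain manual steps.

```python
def reset_in_process(lines):
    """Turn every 'in process' sign (?) back into 'not processed' (-)."""
    out = []
    for line in lines:
        fields = line.split()
        # Only the status columns are touched; cycle number and range stay as-is.
        status = ["-" if sign == "?" else sign for sign in fields[1:-2]]
        out.append(" ".join([fields[0], *status, *fields[-2:]]))
    return out

before = [
    "0002 + + + ? 000010001 000020000",
    "0007 ? - - - 000060001 000070000",
]
for row in reset_in_process(before):
    print(row)
# 0002 + + + - 000010001 000020000
# 0007 - - - - 000060001 000070000
```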
42
Recovery
Stage 4:
- must be rerun in totality
- input files are saved and compressed in $word_compress_dir
- p_manage_01_e
43
Helpful rules
Stage 1 outrunning stages 2-3:
- decide on number of stage 1 processes to stop (p_manage_01_a)
- kill shell and program process
- reset associated cycle in p_manage_01_a.cycles
44
Helpful rules
Log file names:
- p_manage_01_a_{process_number}.log
- p_manage_01_e_{process_number}.log
Others are without process_number:
- p_manage_01_c.log
- p_manage_01_d.log
- p_manage_01_d1.log
- p_manage_01_e1.log
- p_manage_01_e2.log
45
Helpful rules
Cycle size:
- # docs < 2M: 50k
- # docs < 4M: 100k
- otherwise: 200k
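The cycle size rule maps directly to a small function. The function name is hypothetical, and the strict "less than" comparisons are taken literally from the slide; how the boundary values themselves are meant to be treated is an assumption.

```python
def cycle_size(n_docs):
    """Suggested number of documents per cycle for a database of n_docs documents."""
    if n_docs < 2_000_000:
        return 50_000
    if n_docs < 4_000_000:
        return 100_000
    return 200_000

print(cycle_size(1_500_000))  # 50000
print(cycle_size(3_000_000))  # 100000
print(cycle_size(5_000_000))  # 200000
```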
46
Helpful rules
Disk space calculation:
- d = no. documents
- c = no. cycles
- p = no. processors
- s = size of retrieval file
47
Helpful rules
Sort space ($TMPDIR):
- sort = p*s + 20%
- stage 1 sort (parallel) + stage 2,3 sorting (single file)
48
Helpful rules
Scratch space:
- scratch = p*1.5*s + c*s*1/3
- output from stage 1 (in process and not yet processed) + output from stage 3
49
Helpful rules
Example: UBU
- d = 2M, cycle size = 50k
- p = 4, c = 40, s = ~0.5G
- sort = 4*0.5*1.2 = 2.4G
- scratch = 4*1.5*0.5 + 40*0.5*1/3 = 3G + 6.67G = 9.67G
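The two space formulas can be checked with a few lines of code. The function names are hypothetical; the formulas are the ones from the slides (sort = p*s plus 20%, scratch = p*1.5*s + c*s/3), with sizes in gigabytes. Evaluated for the UBU figures, the scratch terms 3G and 6.67G add up to 9.67G.

```python
def sort_space(p, s):
    """Sort space in $TMPDIR: p retrieval files of size s, plus 20%."""
    return p * s * 1.2

def scratch_space(p, c, s):
    """Scratch space: stage 1 output (p*1.5*s) plus stage 3 output (c*s/3)."""
    return p * 1.5 * s + c * s / 3

# UBU example: p = 4 processors, c = 40 cycles, s = 0.5G per retrieval file
print(round(sort_space(4, 0.5), 2))         # 2.4
print(round(scratch_space(4, 40, 0.5), 2))  # 9.67
```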