Project Description 2: Inverted List Database
Create an Inverted File
– Tokenize a text document, and attach to each token the list of locations where that token appears.
– Sort the results and store them in an Oracle database.
Tokenizer
– Define the admissible symbols for a token; we will not use delimiters to capture the tokens.
– Keep a record of the position of each token.
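The two tokenizer requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which symbols are admissible, so the pattern below assumes that a run of letters or digits forms one token and that any other non-space character (such as "!" or ".") is a token by itself. The names TOKEN_PATTERN and tokenize are hypothetical.

```python
import re

# Assumption: a token is either a run of letters/digits or a single
# non-space punctuation character. Tokens are matched directly by this
# pattern; the text is never split on delimiters.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Return (token, position) pairs, with positions counted from 1."""
    matches = TOKEN_PATTERN.finditer(text)
    return [(m.group(), i) for i, m in enumerate(matches, start=1)]


doc1 = "He is a dumb teacher Dumb! Dumb! and Dumb!"
pairs = tokenize(doc1)
print(pairs[:4])  # [('He', 1), ('is', 2), ('a', 3), ('dumb', 4)]
```

Positions here are ordinal token numbers rather than character offsets, which matches the numbering used in the example slides.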
Tokenizer Example
Document 1: He is a dumb teacher Dumb! Dumb! and Dumb!
Document 2: He is a great council. His advices are really great. He truly helps.
Tokenizer Example: Inverted File for Document 1
  Token     Position
  !         12
  !         7
  !         9
  a         3
  and       10
Tokenizer Example: Inverted File for Document 1 (continued)
  Token     Position
  dumb      4
  Dumb      6
  Dumb      8
  Dumb      11
  He        1
  is        2
  teacher   5
Tokenizer: Inverted File for Document 1
  Token     Positions
  !         7, 9, 12
  a         3
  and       10
  Dumb      4, 6, 8, 11
  He        1
  is        2
  teacher   5
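The step from individual (token, position) entries to one merged entry per token might look like the sketch below. It assumes that tokens are compared case-insensitively, since the merged listing groups "dumb" and "Dumb" under one entry; folding to lowercase is one way to do that and is an assumption here, as is the name build_inverted_file. The pairs are hard-coded from the Document 1 example.

```python
from collections import defaultdict

# (token, position) pairs for Document 1, copied from the example.
pairs = [("He", 1), ("is", 2), ("a", 3), ("dumb", 4), ("teacher", 5),
         ("Dumb", 6), ("!", 7), ("Dumb", 8), ("!", 9), ("and", 10),
         ("Dumb", 11), ("!", 12)]


def build_inverted_file(pairs):
    """Merge positions per token and return entries sorted by token.

    Assumption: case is folded to lowercase so that "dumb" and "Dumb"
    share one entry.
    """
    postings = defaultdict(list)
    for token, position in pairs:
        postings[token.lower()].append(position)
    return [(token, sorted(positions))
            for token, positions in sorted(postings.items())]


for token, positions in build_inverted_file(pairs):
    print(token, ", ".join(str(p) for p in positions))
```

Sorting the positions numerically gives 7, 9, 12 for "!", as in the merged listing.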
Tokenizer: Inverted File for Document 2
  Token        Positions
  . (period)   6, 12
  a            3
  advices      8
  are          9
  council      5
  great        4, 11
  He           1
  His          7
  is           2
  really       10
Token Database
– Store the tokens in the database.
– The first column holds the sorted tokens.
– The second column holds the document name or number.
– The rest of the tuple keeps the locations of the token.
– This is the so-called inverted list.
– (Optional) Compress the sequence of locations into a new data type.
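A rough sketch of this table layout follows. It makes several assumptions: Python's built-in sqlite3 module stands in for the Oracle database so that the example runs anywhere; the table and column names (inverted_list, token, doc_no, locations) are hypothetical; and the optional compression is illustrated with simple gap encoding (each location stored as its distance from the previous one), which is one possible choice and not necessarily the one the project intends.

```python
import sqlite3


def encode_gaps(positions):
    """Store each position as its distance from the previous position."""
    return [p - q for p, q in zip(positions, [0] + positions[:-1])]


def decode_gaps(gaps):
    """Rebuild the original positions from the stored gaps."""
    positions, total = [], 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


# sqlite3 is a stand-in for Oracle here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inverted_list ("
    " token TEXT NOT NULL,"
    " doc_no INTEGER NOT NULL,"
    " locations TEXT NOT NULL,"
    " PRIMARY KEY (token, doc_no))"
)

# A few entries taken from the two example documents.
rows = [("!", 1, [7, 9, 12]), ("dumb", 1, [4, 6, 8, 11]), ("great", 2, [4, 11])]
conn.executemany(
    "INSERT INTO inverted_list VALUES (?, ?, ?)",
    [(tok, doc, ",".join(map(str, encode_gaps(locs)))) for tok, doc, locs in rows],
)


def lookup(conn, token):
    """Return (doc_no, positions) for every document containing the token."""
    cur = conn.execute(
        "SELECT doc_no, locations FROM inverted_list"
        " WHERE token = ? ORDER BY doc_no",
        (token,),
    )
    return [(doc, decode_gaps([int(g) for g in text.split(",")]))
            for doc, text in cur]


print(encode_gaps([7, 9, 12]))  # [7, 2, 3]
print(lookup(conn, "dumb"))     # [(1, [4, 6, 8, 11])]
```

Gaps tend to be smaller numbers than absolute positions, which is why they are a common starting point for compressing location lists.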