The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report.


1 The Development of a search engine & Comparison according to algorithms 20032017 Sungsoo Kim 20032066 Haebeom Lee The mid-term progress report

2 Topic of our term project
- Compare the performance of the algorithms used in information retrieval.
- On the basis of that comparison, build an efficient search engine and demonstrate it.

3 Procedures
- Extract the positions of the text information from the raw files.
- Extract the keywords or index terms from the text.
- Make the index file.
- Gather and sort the index files.
- Get information about the index terms.
- Boolean retrieval.
- Natural-language retrieval using the vector and probability models.

4 Procedure (I)-1
- Raw document: the HTML files are put together into a single file.
  ex) … document … … document …
- Get the text information with a string-matching algorithm.

5 Procedure (I)-2
- Tuned Boyer-Moore algorithm
  [Figure: the pattern "Park" aligned at successive shift positions against the text "BalkParcMoraPark"]
- Modified from the Boyer-Moore algorithm
- Uses the bad-character shift function
- Easy to apply
- Can search in about 1/3 of the time of a general search algorithm
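The bad-character shift idea can be sketched in a few lines of Python. This is only an illustration and not the authors' code: it uses the simpler Horspool variant of Boyer-Moore rather than the tuned version named on the slide, and the function names `bad_character_table` and `horspool_search` are hypothetical.

```python
def bad_character_table(pattern):
    """Map each character of the pattern (except the last one) to the distance
    from its rightmost occurrence to the end of the pattern."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Assumption: this is the Horspool simplification, which shifts using only
    the text character aligned with the last position of the pattern.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    shift = bad_character_table(pattern)
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters that do not occur in the pattern allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


if __name__ == "__main__":
    print(horspool_search("BalkParcMoraPark", "Park"))
```

With the text and pattern from the slide, the search skips most positions and reports the match at index 12.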

6 Procedure (II)
- Statistical information from the extracted text
- The result contains:
  - average text length
  - total number of texts
  - average number of texts per document
- This information is not used directly in analyzing the search engine.

7 Procedure (III)
- Making a temporary index
- There are several ways of making index words.
- Exclude stopwords from the index words.
  Ex) Stopwords: "the", "of", "and", "to"
- The index words are stored in an AVL tree.
- An AVL tree lets the machine insert or delete nodes and helps it search efficiently.
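As a rough sketch of this step, the Python below filters out stopwords and inserts the remaining words into an AVL tree. It is an assumption-laden illustration, not the project's code: the tokenizer, the names `build_index`, `insert`, and `contains`, and the four-word stopword list (taken from the slide's examples) are all hypothetical, and deletion is omitted for brevity.

```python
import re

STOPWORDS = {"the", "of", "and", "to"}  # only the examples given on the slide


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(y):
    x = y.left
    y.left = x.right
    x.right = y
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    update(x)
    update(y)
    return y


def insert(node, key):
    """Insert key and rebalance; duplicates are ignored."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:                      # left subtree too tall
        if key > node.left.key:          # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:                     # right subtree too tall
        if key < node.right.key:         # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def contains(node, key):
    while node:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)


def build_index(text):
    """Tokenize text (hypothetical tokenizer), drop stopwords, store the rest."""
    root = None
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in STOPWORDS:
            root = insert(root, word)
    return root
```

Because the tree rebalances on every insertion, its height stays logarithmic in the number of index words, which is what keeps lookups fast.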

8 Procedure (IV)-1
- Gathering and getting information about the index terms.
- The document index consists of pairs of an index word from a document and the location where that index word appeared.
- That location information is linked to the lexicon and posting files.

9 Procedure (IV)-2

Sample document

Document No.  Contents
1             Pease porridge hot, pease porridge cold
2             Pease porridge in the pot
3             Nine days old
4             Some like it hot, some like it cold
5             Some like it in the pot
6             Nine days old

Lexicon file (term, count) and posting file (document numbers)

Term  Count  Postings
cold  2      1, 4
days  2      3, 6
hot   2      1, 4
in    2      2, 5
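The lexicon/posting structure, and the Boolean retrieval it supports, can be sketched as follows. This is a minimal illustration under stated assumptions: the sample texts are the nursery-rhyme documents from the slide with the cut-off word "cold" restored in documents 1 and 4 and "pot" in document 2, and the function names are hypothetical rather than taken from the project.

```python
import re
from collections import defaultdict

# Sample documents from the slide (truncated words restored as an assumption).
DOCS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def build_inverted_index(docs):
    """Return (lexicon, postings): term -> document count, term -> sorted doc ids."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(re.findall(r"[a-z]+", docs[doc_id].lower())):
            postings[term].append(doc_id)
    lexicon = {term: len(ids) for term, ids in postings.items()}
    return lexicon, dict(postings)


def boolean_and(postings, a, b):
    return sorted(set(postings.get(a, [])) & set(postings.get(b, [])))


def boolean_or(postings, a, b):
    return sorted(set(postings.get(a, [])) | set(postings.get(b, [])))


def boolean_not(postings, all_ids, a):
    return sorted(set(all_ids) - set(postings.get(a, [])))


if __name__ == "__main__":
    lexicon, postings = build_inverted_index(DOCS)
    for term in ("cold", "days", "hot", "in"):
        print(term, lexicon[term], postings[term])
```

Each Boolean operator reduces to a set operation on posting lists, which is why this model is simple to implement once the posting file exists.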

10 Typical information retrieval
- Boolean model
  - set model: the query is expressed in terms of sets
  - operators: "not", "or", "and"
  - easy to understand but difficult for users to use
- Vector model
  - assigns a weight to each index term
  - calculates the similarity and ranks the results
  - the most popular model
- Probability model
  - proposed by Robertson & Sparck Jones in 1976
  - based on probability and Bayes' theorem
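To make the vector model concrete, here is a small Python sketch that weights terms and ranks documents by cosine similarity. The slides do not say which weighting scheme the project uses, so TF-IDF with an added constant of 1 is an assumption, as are the whitespace tokenizer and the names `tf_idf_vectors`, `cosine`, and `rank`.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed weighting: idf = log(N / df) + 1, so common terms keep some weight.
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return indices of matching documents, most similar first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for score, i in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]


if __name__ == "__main__":
    docs = ["nine days old", "some like it hot", "some like it in the pot"]
    print(rank("some pot", docs))
```

Unlike the Boolean model, this returns a ranked list, so a document that matches only part of the query still appears, just lower down.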

11 Until now … & next
- Extracted information from the raw files.
- Extracted the keywords and index words.
- Currently making the index file and the lexicon/posting files.
- Will survey the models (Boolean, vector, probability).
- Will build an engine consisting of three parts (one for each of the 3 models).
- Will compare their performance and propose a simple engine.

12 Development system
- System: Pentium 4 (1.6 GHz), Windows XP
- OS: Red Hat Linux on VMware
- Interface: runs on the console, with text-based results

