Keys to Building a Multilingual Search Engine Thierry Sourbier.


1 Keys to Building a Multilingual Search Engine Thierry Sourbier

2 Search Engine Overview
- Client-Side (browser): how to make the best use of browsers when dealing with multiple languages
- Server-Side: how to provide efficient multilingual information retrieval
[Diagram labels: Submit query, Display results, HTTP, Process query, Create index]

3 Overview of the Server-side
- Index creation steps:
  - Normalization gives the pages a standard format
  - Segmentation breaks the pages into units that will be stored in the index
  - Index building
- Query processing steps:
  - Normalization makes sure that the query has the same format as the indexed pages
  - Segmentation breaks the query into units that will be looked up in the index
  - Index search
Typically only Normalization and Segmentation are language dependent. The goal is to reduce these dependencies as much as possible.
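The two pipelines on this slide can be sketched in a few lines. This is only an illustration under simple assumptions, not the presenter's implementation: `normalize`, `segment`, `build_index` and `search` are hypothetical names, normalization is reduced to whitespace collapsing and lowercasing, and segmentation uses overlapping 4-character units.

```python
from collections import defaultdict


def normalize(text):
    # Placeholder normalization (assumption): collapse whitespace, lowercase.
    return " ".join(text.split()).lower()


def segment(text, n=4):
    # Placeholder segmentation (assumption): overlapping character n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(pages):
    # Index building: map each unit to the set of page ids containing it.
    index = defaultdict(set)
    for page_id, text in pages.items():
        for unit in segment(normalize(text)):
            index[unit].add(page_id)
    return index


def search(index, query):
    # The query goes through the same normalization and segmentation
    # as the pages, then every unit is looked up in the index.
    units = segment(normalize(query))
    if not units:
        return set()
    results = set(index.get(units[0], set()))
    for unit in units[1:]:
        results &= index.get(unit, set())
    return results


pages = {
    "p1": "Unicode   Conference in San Jose",
    "p2": "Building a search engine",
}
index = build_index(pages)
print(search(index, "unicode conference"))  # {'p1'}
```

Only `normalize` and `segment` would need to change from one language to another; the index structure and the lookup stay the same.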

4 Multilingual Normalization
- Normalizing the character encoding
  - One size fits all: Unicode
- Removing the unnecessary HTML tags, extra white spaces, etc.
- Character normalization
  - Mapping together characters that have the same meaning
  - Locale dependent
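As a rough sketch of these three steps using only the Python standard library: the function name `normalize_page` is hypothetical, and Unicode NFKC plus case folding is used here as a locale-independent stand-in for the locale-dependent character mapping the slide refers to.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and drops the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Note: the content of <script> and <style> also arrives here;
        # a real system would filter it out.
        self.chunks.append(data)


def normalize_page(raw_bytes, charset):
    # 1. Encoding normalization: decode the page's own charset to Unicode.
    text = raw_bytes.decode(charset, errors="replace")
    # 2. Remove HTML tags and extra white space.
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    text = re.sub(r"\s+", " ", " ".join(extractor.chunks)).strip()
    # 3. Character normalization (assumption: NFKC + case folding).
    return unicodedata.normalize("NFKC", text).casefold()


page = "<p>Ｕｎｉｃｏｄｅ   <b>Café</b></p>".encode("utf-8")
print(normalize_page(page, "utf-8"))  # unicode café
```

The fullwidth letters are mapped onto their ASCII equivalents by NFKC, which is one instance of "mapping together characters that have the same meaning".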

5 Multilingual Segmentation
- Linguistic features can't be used
  - Too complex and/or costly to implement
- Relying on N-Grams
  - N-Gram = a sequence of N contiguous characters
  - N-Grams may overlap
  - Example with N=4: "unicode conference" => "unic", "nico", "icod", "code", "ode ", "de c", "e co", " con", ...
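The N-Gram definition above translates directly into code. A minimal sketch, with `ngrams` as an illustrative name:

```python
def ngrams(text, n):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("unicode conference", 4)[:8])
# ['unic', 'nico', 'icod', 'code', 'ode ', 'de c', 'e co', ' con']
```

Spaces are treated like any other character here, so N-Grams can span word boundaries. No word list or language-specific rule is involved.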

6 N-Grams Advantages
- Simple to implement
- Increased tolerance for typos
- Free morphology
- Language independent

7 N-Grams Disadvantages
- Index is bigger
- Minimum query length is N characters
  - A shorter query yields no results
- May introduce noise
  - Sometimes the system is too tolerant (e.g. a query for "standing" may return pages containing "understand")
- Not as good as a linguistics-based IR system
  - No explicit word normalization possible
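Two of these drawbacks are easy to reproduce. The sketch below assumes N=4 and a simple scoring rule (the fraction of the query's N-Grams found in the page); the slides do not say how matches are scored, so `overlap` is purely illustrative.

```python
def ngrams(text, n=4):
    # Set of overlapping character n-grams.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap(query, page, n=4):
    # Fraction of the query's n-grams that also occur in the page.
    q = ngrams(query, n)
    if not q:
        return 0.0
    return len(q & ngrams(page, n)) / len(q)


# Noise: "standing" and "understand" share "stan" and "tand".
print(overlap("standing", "understand"))  # 0.4

# Minimum query length: a 3-character query has no 4-grams at all.
print(ngrams("cat"))  # set()
```

Any system that accepts partial N-Gram matches, which is what gives the tolerance for typos, will also surface this kind of unrelated match.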

8 What value should N have?
- N is language dependent
  - Typically we use a value between 1 and 6
- A larger N-Gram size improves quality, but reduces tolerance and increases the minimum query size
- Some languages may require more than one N-Gram size
  - Japanese example
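The transcript does not preserve the Japanese example itself. As one possible reading, the sketch below indexes both 1-grams and 2-grams so that a single-character query still finds something; the sizes `(1, 2)` and the name `multi_ngrams` are assumptions.

```python
def multi_ngrams(text, sizes=(1, 2)):
    # Concatenate the n-grams for every requested size.
    units = []
    for n in sizes:
        units.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return units


print(multi_ngrams("日本語"))
# ['日', '本', '語', '日本', '本語']
```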

9 Client-side
- Must be compatible with most browsers
  - We restrict ourselves to HTML
  - We use the standard encodings for each language for our pages:
    - many people still use browsers that are not Unicode friendly
    - this makes content editing easier

10 Using a FORM
- The parameters of the query are passed via the URL to a CGI script
  - e.g. http://www.my_site.com/my_script?query=%22San+Jose%22
- What is the charset of the data sent back from the client?

11 URL Encoding Issues
- Different browsers have different behaviors
- Example: a Japanese query
  - Could be submitted to the server as: ...search.pl?Query=%93%FA%96%7B%8C%EA
  - Or by another browser as: ...search.pl?Query=%26%2326085%3B%26%2326412%3B%26%2335486%3B
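Both query strings can be decoded to the same text once the server knows what it is looking at. The sketch below assumes the first browser sent Shift_JIS bytes and the second sent HTML numeric character references; the slide does not name either convention, but the values decode consistently under that assumption.

```python
import html
from urllib.parse import unquote, unquote_to_bytes

raw_a = "%93%FA%96%7B%8C%EA"
raw_b = "%26%2326085%3B%26%2326412%3B%26%2335486%3B"

# Browser A: percent-encoded bytes in the page's charset (assumed Shift_JIS).
text_a = unquote_to_bytes(raw_a).decode("shift_jis")

# Browser B: percent-encoded "&#NNNNN;" numeric character references.
text_b = html.unescape(unquote(raw_b))

print(text_a, text_b, text_a == text_b)  # 日本語 日本語 True
```

Nothing in the first URL says which charset was used, which is why the server needs extra information from the client.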

12 FORM and CGI
- The server tells the client which encoding to use at the HTTP level
  … ….
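The slide's own example is not preserved in the transcript. One common way to declare the encoding at the HTTP level is the `charset` parameter of the `Content-Type` header; the sketch below assumes that mechanism, and `cgi_response` is a hypothetical helper.

```python
def cgi_response(body_text, charset):
    # Declare the charset in the HTTP header and encode the body to match.
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    return header.encode("ascii") + body_text.encode(charset)


print(cgi_response("<p>日本語</p>", "shift_jis"))
```

A browser that honors the header will typically submit form data from that page in the same charset, but as the previous slide shows, this cannot be relied on.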

13 FORM and CGI
- The client returns the information to the script using the Private FORM/CGI Protocol
  - A hidden form field adds a parameter to the query which identifies the locale
  ......
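On the server, the hidden field lets the script pick the right charset before decoding the query. In the sketch below the field name `locale`, the `LOCALE_CHARSETS` table and the function `decode_query` are all hypothetical; the slides only say that a hidden field identifies the locale.

```python
from urllib.parse import parse_qs

# Hypothetical mapping from the hidden field's value to the page charset.
LOCALE_CHARSETS = {"ja": "shift_jis", "fr": "iso-8859-1", "en": "iso-8859-1"}


def decode_query(query_string, default_locale="en"):
    # Parse as Latin-1 so each percent-encoded byte survives unchanged
    # as one character, whatever the real charset turns out to be.
    params = parse_qs(query_string, encoding="latin-1")
    locale = params.get("locale", [default_locale])[0]
    charset = LOCALE_CHARSETS.get(locale, LOCALE_CHARSETS[default_locale])
    # Recover the raw bytes, then decode them with the locale's charset.
    raw = params.get("Query", [""])[0].encode("latin-1")
    return locale, raw.decode(charset, errors="replace")


print(decode_query("Query=%93%FA%96%7B%8C%EA&locale=ja"))  # ('ja', '日本語')
```

Decoding is deferred until the locale is known; decoding the query string as UTF-8 up front would corrupt the Shift_JIS bytes.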

14 Displaying the Results
- Simple if only one code set per page is required
- For multilingual content:
  - use UTF-8
  - use multiple frames
- Unexpected browser behavior

15 Conclusion
- Solutions exist to provide a robust multilingual search engine
- Code set issues on the client side can be a limitation, but they will soon disappear as more and more people use UTF-8 friendly browsers

16 Q&A
Thierry Sourbier, Software Developer
tsourbier@research.intl.com

