Information Retrieval and Web Design

Lecture (10) Prepared by Dr. Dunia Hamid Hameed

Web Page Pre-Processing
1. Identifying different text fields: In HTML, there are different text fields, e.g., title, metadata, and body. Identifying them allows the retrieval system to treat terms in different fields differently.

2. Identifying anchor text: Anchor text associated with a hyperlink is treated specially in search engines because the anchor text often represents a more accurate description of the information contained in the page pointed to by its link.

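3. Removing HTML tags: The removal of HTML tags can be dealt with similarly to punctuation. One issue needs careful consideration because it affects proximity queries and phrase queries: a page is usually laid out in separate visual blocks, and simply deleting the tags can join text from different blocks that should not be joined.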

4. Identifying main content blocks: A typical Web page, especially a commercial page, contains a large amount of information that is not part of the main content of the page. For example, it may contain banner ads, navigation bars, copyright notices, etc., which can lead to poor results for search and mining.
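The first three steps can be sketched with Python's standard `html.parser` module. This is a minimal illustration rather than a production parser: the class name `PagePreprocessor`, the helper `preprocess`, and the `BLOCK_TAGS` set are assumptions made for this example. It keeps the title, metadata, and body separate, records anchor text with its link target, and keeps text from different blocks apart when tags are dropped.

```python
from html.parser import HTMLParser

# Tags assumed to start a new visual block (an illustrative, incomplete list).
BLOCK_TAGS = {"p", "div", "td", "th", "tr", "li", "table", "br",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class PagePreprocessor(HTMLParser):
    """Collects title, meta data, body text blocks, and anchor text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}       # meta name -> content
        self.blocks = []     # body text, one string per visual block
        self.anchors = []    # (href, anchor text) pairs
        self._in_title = False
        self._in_body = False
        self._block = []     # text pieces of the block being read
        self._href = None    # href of the <a> being read, if any
        self._anchor = []    # text pieces of the anchor being read

    def end_block(self):
        # Close the current block and store it if it holds any text.
        text = " ".join("".join(self._block).split())
        if text:
            self.blocks.append(text)
        self._block = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "body":
            self._in_body = True
        elif tag == "a" and attrs.get("href"):
            self._href, self._anchor = attrs["href"], []
        if tag in BLOCK_TAGS:
            self.end_block()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self.end_block()
            self._in_body = False
        elif tag == "a" and self._href is not None:
            text = " ".join("".join(self._anchor).split())
            self.anchors.append((self._href, text))
            self._href = None
        if tag in BLOCK_TAGS:
            self.end_block()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            self._block.append(data)
            if self._href is not None:
                self._anchor.append(data)


def preprocess(html):
    parser = PagePreprocessor()
    parser.feed(html)
    parser.close()
    parser.end_block()  # flush text left over if </body> was missing
    return parser
```

For a page whose body contains `<td>New</td><td>York Times</td>`, this sketch stores "New" and "York Times" as separate blocks, so a phrase query for "New York Times" would not match text that merely sits in neighbouring table cells.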

Finding Main Blocks Methods
1. Partitioning based on visual cues: This method uses visual information to help find main content blocks in a page. Visual or rendering information of each HTML element in a page can be obtained from the Web browser.

A machine learning model can then be built based on the location and appearance features for identifying main content blocks of pages.

2. Tree matching: This method is based on the observation that, on most commercial Web sites, pages are generated from a small number of fixed templates. The method thus aims to find such hidden templates.

Tree matching of multiple pages from the same site can be performed to find such templates. Once a template is found, we can identify which blocks are likely to be the main content blocks based on the following observation: the text in main content blocks is usually quite different across different pages of the same template, whereas the text in non-main-content blocks is often quite similar across those pages.
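The observation about varying and repeated text can be illustrated with a small sketch. It does not perform tree matching: it assumes the template has already been found, so that each page is given as a list of block texts in the same template order. The function name `find_main_blocks` and the threshold `min_distinct_ratio` are hypothetical choices for this example.

```python
def find_main_blocks(pages, min_distinct_ratio=0.8):
    """Return the indices of blocks whose text varies across pages.

    pages: list of pages, each a list of block texts aligned by template
    position (an assumption; real pages need tree matching to align them).
    """
    n_blocks = min(len(page) for page in pages)
    main = []
    for i in range(n_blocks):
        # Count how many different texts appear in block i across the pages.
        distinct = {page[i] for page in pages}
        if len(distinct) / len(pages) >= min_distinct_ratio:
            main.append(i)
    return main
```

Blocks such as navigation bars and copyright notices repeat on every page, so they yield few distinct texts and fall below the threshold, while the article text differs on each page.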

Duplicate Detection

Copying a page is usually called duplication or replication. Copying an entire site is called mirroring.

Advantages of Duplication
Duplicate pages and mirror sites are often used to improve the efficiency of browsing and file downloading worldwide, because bandwidth between geographic regions is limited and network performance can be poor or unpredictable.

Advantages of Duplication Detection
1. Reduce the index size.
2. Improve search results.

Duplication Detection Methods
Several methods can be used to find duplicate information:
1. The simplest method is to hash the whole document, e.g., using the MD5 algorithm.
2. Computing an aggregated number (e.g., checksum).
3. One efficient duplicate detection technique is based on n-grams (also called shingles).
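The whole-document hashing method can be sketched with Python's standard `hashlib` module. The function names `fingerprint` and `find_exact_duplicates` are assumptions made for this example. Note that this approach only finds exact copies: changing a single character produces a different hash.

```python
import hashlib


def fingerprint(text):
    # MD5 digest of the whole document, as a hexadecimal string.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_exact_duplicates(docs):
    """docs: dict of document id -> text.

    Returns the groups of ids whose texts are byte-for-byte identical.
    """
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(fingerprint(text), []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Because near-duplicates such as a page with a changed date or banner are missed by this method, the n-gram technique is usually the more useful one for Web pages.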

N-gram (Shingles) Methods
An n-gram is simply a consecutive sequence of words of a fixed window size n. For example, the sentence “John went to school with his brother” can be represented by five 3-grams: “John went to”, “went to school”, “to school with”, “school with his”, and “with his brother”. Note that 1-grams are simply the individual words.
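A minimal sketch of shingling is given below. The function names `shingles` and `jaccard` are assumptions made for this example, and comparing two documents by the Jaccard coefficient of their shingle sets is a commonly used choice that these slides do not spell out. Words are split on whitespace only, so punctuation handling is left out.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)
```

Usage: `shingles("John went to school with his brother")` yields the five 3-grams listed above. Replacing "brother" with "sister" changes only the last shingle, so the two sentences share 4 of 6 distinct shingles and their Jaccard coefficient is 4/6. Two pages would then be treated as near-duplicates when this value exceeds a chosen threshold.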
