
1 CROSSMARC Web Pages Collection: Crawling and Spidering Components
Vangelis Karkaletsis, Institute of Informatics & Telecommunications, NCSR "Demokritos"
Final Project Review, Luxembourg, October 31, 2003

2 Web Pages Collection: Focused Crawler
Final Review "Crawling and Spidering", Luxembourg, 31 October 2003
Identifies web sites that are relevant to a particular domain. It combines:
- a crawler that exploits the topic-based Web site hierarchies used by various search engines
- a crawler that submits queries built from the CROSSMARC domain ontologies and lexicons to a search engine
- a crawler that takes a set of 'seed' pages and conducts a 'similar pages' search on advanced search engines
The list of Web sites produced is then filtered.
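The three strategies above can be sketched as independent candidate generators whose merged output is de-duplicated and filtered. Everything below (function names, URLs, the relevance predicate) is a hypothetical illustration, not the CROSSMARC implementation:

```python
# Sketch: combine three crawling strategies and filter the result.
# The three source functions are placeholders returning canned URLs.

def hierarchy_crawl():
    # sites found in topic-based search-engine hierarchies
    return ["http://site-a.example", "http://site-b.example"]

def ontology_query_crawl(ontology_terms):
    # sites returned by querying a search engine with domain terms
    return ["http://site-b.example", "http://site-c.example"]

def similar_pages_crawl(seed_pages):
    # sites found via a 'similar pages' search on the seeds
    return ["http://site-d.example"]

def collect_candidate_sites(ontology_terms, seed_pages, is_relevant):
    seen, result = set(), []
    for url in (hierarchy_crawl()
                + ontology_query_crawl(ontology_terms)
                + similar_pages_crawl(seed_pages)):
        if url not in seen and is_relevant(url):   # de-duplicate + filter
            seen.add(url)
            result.append(url)
    return result

sites = collect_candidate_sites(["laptop"], ["http://seed.example"],
                                is_relevant=lambda url: "site-c" not in url)
print(sites)  # duplicate site-b kept once, site-c filtered out
```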

3 Web Pages Collection: crawler customization
- change the settings in the crawler configuration files
- experiment and evaluate to find the optimal settings for each version, as well as their optimal combination
- train the light spidering module that filters the crawler results

4 Web Pages Collection: Crawler Evaluation
- more than one experimentation cycle may be needed, depending on the domain and language
- our evaluation methodology provides a good way of comparing different initial settings of the crawler

Language   1st Domain Precision (%)   2nd Domain Precision (%)
English    45.2                       87.5
Italian    25.6                       41.7
Greek      26.0                       53.2
French     57.1                       30.8
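Precision here is the usual fraction of returned sites that are actually on-topic. A minimal sketch with made-up counts (the example site names are hypothetical, not the evaluation corpus):

```python
def precision(retrieved, relevant):
    """Percentage of retrieved sites that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return 100.0 * len(retrieved & relevant) / len(retrieved)

# e.g. 8 sites returned, 7 of them on-topic -> 87.5%
retrieved = [f"site{i}" for i in range(8)]
relevant = retrieved[:7]
print(precision(retrieved, relevant))  # 87.5
```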

5 Web Pages Collection: Web sites spider
- Site navigation: traverses a Web site, collecting information from each page visited and forwarding it to the "Page-Filtering" and "Link-Scoring" modules
- Page-Filtering: decides whether a page is interesting and should be stored
  - before storing a page, its language is identified
  - the page is also converted to XHTML
- Link-Scoring: validates the links to be followed; only links with a score above a certain threshold are followed
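The interplay of the three modules can be sketched as a queue-driven control loop: fetch a page, let page filtering decide whether to store it, and enqueue only links whose score clears the threshold. All helper functions here are assumptions passed in by the caller, not the actual CROSSMARC modules:

```python
# Sketch of the spider's control loop over one Web site.
from collections import deque

def spider(start_url, fetch, is_interesting, score_link, threshold=0.5):
    queue, visited, stored = deque([start_url]), set(), []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        page, links = fetch(url)      # page text + outgoing links
        if is_interesting(page):      # page-filtering decision; language
            stored.append(url)        # ID / XHTML conversion would run here
        for link in links:
            if score_link(link) >= threshold:   # link-scoring decision
                queue.append(link)
    return stored

# Tiny in-memory "site" for illustration
site = {
    "home": ("welcome to laptops", ["a", "b"]),
    "a": ("laptop prices", []),
    "b": ("contact us", []),
}
stored = spider("home", fetch=lambda u: site[u],
                is_interesting=lambda p: "laptop" in p,
                score_link=lambda l: 1.0 if l != "b" else 0.0)
print(stored)  # pages stored, in visit order
```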

6 Web Pages Collection: Web sites spider - Navigation
The following types of URLs are supported, in order to discover and extract more URLs in each Web page: frame links, text links, image links, image maps, JavaScript cases, and HTML forms.
Each URL is checked for whether it:
- redirects to another site
- points to a non-HTML file
- is already in the queue of visited URLs
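The three checks can be sketched as a single predicate over a candidate URL. This is a simplification: detecting a real redirect needs an HTTP request, so the off-site check below just compares hostnames; the extension list is an assumption for illustration:

```python
# Sketch of the per-URL checks: discard links that leave the site,
# point at a non-HTML file, or were already queued/visited.
from urllib.parse import urlparse

NON_HTML = (".pdf", ".jpg", ".png", ".gif", ".zip", ".exe")

def should_follow(url, site_host, seen):
    parsed = urlparse(url)
    if parsed.netloc and parsed.netloc != site_host:   # leads off-site
        return False
    if parsed.path.lower().endswith(NON_HTML):         # non-HTML file
        return False
    if url in seen:                                    # already queued/visited
        return False
    return True

seen = {"http://shop.example/index.html"}
print(should_follow("http://shop.example/laptops.html", "shop.example", seen))  # True
print(should_follow("http://other.example/page.html", "shop.example", seen))    # False
print(should_follow("http://shop.example/brochure.pdf", "shop.example", seen))  # False
```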

7 Web Pages Collection: Web sites spider - Page Filtering
Two approaches were investigated:
- Machine learning: the WebPageClassifier tool reads a corpus of positive and negative Web pages, translates it into a feature-vector format, and uses learning algorithms to construct the Web page classifier.
- Heuristics: the heuristics-based filter accepts the Web page as input, in the form of a token sequence, and compares each token to a list of regular expressions from the domain lexicon in use.

8 Web Pages Collection: Web sites spider - Link Scoring
Two approaches were investigated:
- Machine learning: the training system for link scoring takes as input a collection of domain-specific web sites, the positive web pages within these web sites, the domain ontology, and one or more domain lexicon files, from which it creates the training data set.
- Heuristics: the heuristics-based link scorer takes as input the link's text content as well as its context (left and right), parses the three strings looking for domain-relevant information based on a score table, and combines the scores of the three strings using a weighted function.
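The heuristic scorer can be sketched as scoring the link text and its two context strings against a score table, then combining the three scores with a weighted sum. The score table, weights, and the choice to weight the link text highest are all illustrative assumptions:

```python
# Sketch of the heuristics-based link scorer: score three strings
# (left context, link text, right context) and combine with weights.
SCORE_TABLE = {"laptop": 3.0, "notebook": 3.0, "price": 1.5, "offer": 1.0}

def string_score(text):
    return sum(SCORE_TABLE.get(w.lower(), 0.0) for w in text.split())

def score_link(left_context, link_text, right_context,
               weights=(0.2, 0.6, 0.2)):
    scores = (string_score(left_context),
              string_score(link_text),
              string_score(right_context))
    return sum(w * s for w, s in zip(weights, scores))

s = score_link("special offer on", "laptop models", "at a low price")
print(s)  # approximately 2.3 = 0.2*1.0 + 0.6*3.0 + 0.2*1.5
```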

9 Web Pages Collection: spider customization
- Use the same navigation mechanism
- Use the machine-learning-based "page filtering", which requires:
  - the domain ontology and lexicons
  - the creation of a representative training corpus (CROSSMARC provides the Corpus Formation tool)
  - the use of the WebPageClassifier tool to construct the domain-specific classifier
- Use the rule-based approach suggested for link scoring, which requires:
  - the specification of new settings in the configuration file of the link-scoring module
  - experimenting with each specification until the optimal setting is found

10 Corpus formation for the needs of page filtering
[Diagram] Positive pages and unidentified pages, together with the ontology and one or more lexicon(s), are fed into the Corpus formation tool, which produces similar-to-positive pages; manual classification of these yields the negative pages.

11 Corpus formation for the needs of link scoring
[Diagram] Training phase: starting from a site's homepage, the Page-Validating & Link-Grabbing Web Spider passes pages through page classifiers A and B; together with the manually classified positives, this yields automatically classified positives and conflicts, while the unscored links that are collected are handed to the Link Scorer, which outputs scored links.

12 Web Pages Collection: Web sites spider Evaluation
Page Filtering:

Language   1st Domain F-measure (%)   2nd Domain F-measure (%)
English    96.9                       83.2
Italian    93.7                       73.7
Greek      92.7                       87.9
French     96.9                       82.3

- we are able to identify with a high degree of confidence whether a page is interesting or not according to the domain
- results can be improved further: so far only ontology-based features are used, and their combination with statistically selected ones is a promising research direction
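The F-measure in the table above is the standard harmonic mean of precision and recall over the page-filtering decisions. A minimal sketch with made-up counts (not the counts behind the table):

```python
# F-measure (in %) from true-positive, false-positive and
# false-negative counts of the page-filtering decisions.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)

# e.g. 90 pages correctly kept, 5 wrongly kept, 5 wrongly dropped
print(round(f_measure(90, 5, 5), 1))  # 94.7
```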

13 Web Pages Collection: Web sites spider Evaluation
Link scoring:
- the results of both methods are rather poor
- an issue that could be investigated is the combination of the two methods to improve recall
- concluding, the task of scoring links without visiting them remains a very challenging one, and it is becoming more important in the general setting of topic-specific search engines and portals

14 Concluding Remarks
Crawler:
- applied in both domains of the project
- customization instructions are provided
- the tool and the corpora used in both domains and four languages will be available for research purposes
Spider:
- applied in three domains
- customization methodology and tools are provided
- the corpora collected for page filtering and link scoring will be available for research purposes

