CROSSMARC
Web Pages Collection: Crawling and Spidering Components
Vangelis Karkaletsis
Institute of Informatics & Telecommunications, NCSR “Demokritos”
Final Project Review, Luxembourg, October 31, 2003
Web Pages Collection: Focused Crawler
Identifies Web sites that are relevant to a particular domain. It combines:
– a crawler that exploits the topic-based Web site hierarchies used by various search engines
– a crawler that submits queries built from the CROSSMARC domain ontologies and lexicons to a search engine
– a crawler that takes a set of ‘seed’ pages and conducts a ‘similar pages’ search on advanced search engines
The list of Web sites produced is then filtered (see the sketch below).
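As an illustration of how the three strategies fit together, here is a minimal Python sketch; the function names, URLs, and relevance test are hypothetical stand-ins, not the CROSSMARC interfaces.

```python
# Sketch: merge candidate sites from the three crawling strategies, then
# filter the merged list. All names below are illustrative assumptions.

def focused_crawl(directory_sites, query_sites, similar_sites, is_relevant):
    """Merge candidates from the three crawlers and keep the relevant ones."""
    candidates = set(directory_sites) | set(query_sites) | set(similar_sites)
    return sorted(site for site in candidates if is_relevant(site))

# Trivial usage example with a keyword-based relevance test.
sites = focused_crawl(
    ["http://shop-a.example/laptops"],    # from topic-based hierarchies
    ["http://shop-b.example/notebooks"],  # from ontology/lexicon queries
    ["http://shop-a.example/laptops"],    # from a "similar pages" search
    lambda url: "laptop" in url or "notebook" in url,
)
print(sites)
```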
Web Pages Collection: Crawler Customization
– change the settings of the crawler configuration files
– experiment and evaluate to find the optimal settings for each version, as well as their optimal combination
– train the light spidering module that filters the crawler results
Web Pages Collection: Crawler Evaluation
– more than one experimentation cycle may be needed, depending on the domain and language
– our evaluation methodology provides a good way of comparing different initial settings of the crawler

Language   1st Domain Precision (%)   2nd Domain Precision (%)
English    45.2                       87.5
Italian    25.6                       41.7
Greek      26.0                       53.2
French     57.1                       30.8
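For reference, the precision reported above is the standard retrieval measure (a restatement, not a project-specific definition):

```latex
\mathrm{Precision} = \frac{|\,\text{relevant sites retrieved}\,|}{|\,\text{sites retrieved}\,|}
```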
Web Pages Collection: Web Sites Spider
Site navigation: traverses a Web site, collecting information from each page visited and forwarding it to the “Page Filtering” and “Link Scoring” modules.
Page filtering is responsible for deciding whether a page is interesting and should be stored or not:
– before a page is stored, its language is identified
– the page is also converted to XHTML
Link scoring validates the links to be followed: only links with a score above a certain threshold are followed. (A sketch of this navigate/filter/score loop follows below.)
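The loop below is a minimal Python sketch of that navigation cycle; the fetch, extract_links, page_filter, and link_scorer callables and the threshold value are assumptions for illustration, not the CROSSMARC implementation.

```python
from collections import deque

THRESHOLD = 0.5  # assumed minimum link score for a link to be followed

def spider(start_url, fetch, extract_links, page_filter, link_scorer):
    """Traverse one site: store interesting pages, follow well-scored links."""
    queue, visited, stored = deque([start_url]), set(), []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)          # returns None for unreachable pages
        if page is None:
            continue
        if page_filter(page):      # "is this page interesting?"
            stored.append(page)    # language ID + XHTML conversion go here
        for link in extract_links(page):
            if link not in visited and link_scorer(link) > THRESHOLD:
                queue.append(link)
    return stored
```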
Web Pages Collection: Web Sites Spider – Navigation
The following types of URLs are supported: frame links, text links, image links, image maps, JavaScript cases, and HTML forms, in order to discover and extract more URLs in the Web page.
Each URL is checked for whether it:
– redirects to another site
– points to a non-HTML file
– is already in the queue of visited URLs
(A sketch of these checks follows below.)
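A minimal Python sketch of those per-URL checks; the extension list and the redirect-resolution helper are assumptions for illustration.

```python
from urllib.parse import urlparse

# Assumed list of extensions treated as non-HTML content.
NON_HTML_EXTENSIONS = (".pdf", ".jpg", ".png", ".zip", ".exe")

def should_enqueue(url, site_host, visited, resolve_redirect):
    """Return True if the spider should add this URL to its queue."""
    final_url = resolve_redirect(url)            # follow any HTTP redirect
    parsed = urlparse(final_url)
    if parsed.netloc != site_host:               # redirects to another site
        return False
    if parsed.path.lower().endswith(NON_HTML_EXTENSIONS):
        return False                             # points to a non-HTML file
    if final_url in visited:                     # already visited/queued
        return False
    return True
```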
Web Pages Collection: Web Sites Spider – Page Filtering
Two approaches were investigated:
– Machine learning: the WebPageClassifier tool was developed; it reads a corpus of positive and negative Web pages, translates it into a feature-vector format, and uses learning algorithms to construct the Web page classifier.
– Heuristics: the heuristics-based filter accepts the Web page as input, in the form of a token sequence, and compares each token to a list of regular expressions from the domain lexicon in use. (A sketch of this variant follows below.)
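The snippet below sketches the heuristic variant in Python; the patterns and the acceptance threshold are hypothetical, standing in for the regular expressions of an actual domain lexicon.

```python
import re

# Assumed lexicon patterns for a laptop-offers domain (illustrative only).
LEXICON_PATTERNS = [re.compile(p, re.IGNORECASE)
                    for p in (r"laptop", r"notebook", r"\d+\s*GHz", r"\d+\s*GB")]
MIN_MATCHES = 3  # assumed threshold for accepting a page

def heuristic_page_filter(tokens):
    """Accept a page if enough tokens match the domain lexicon patterns."""
    matches = sum(1 for tok in tokens
                  if any(pat.search(tok) for pat in LEXICON_PATTERNS))
    return matches >= MIN_MATCHES
```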
Web Pages Collection: Web Sites Spider – Link Scoring
Two approaches were investigated:
– Machine learning: the training system for link scoring takes as input a collection of domain-specific Web sites, the positive Web pages within these Web sites, the domain ontology, and one or more domain lexicon files, from which it creates the training data set.
– Heuristics: the heuristics-based link scorer takes as input the link’s text content as well as its context (left and right), parses the three strings looking for domain-relevant information based on a score table, and combines the scores of the three strings using a weighted function. (A sketch of this scorer follows below.)
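A minimal Python sketch of the heuristic scorer; the score table entries and the weights are assumptions chosen for illustration, not the project’s actual values.

```python
# Assumed score table and combination weights (illustrative only).
SCORE_TABLE = {"laptop": 1.0, "notebook": 1.0, "price": 0.5, "offer": 0.5}
WEIGHTS = {"text": 0.6, "left": 0.2, "right": 0.2}  # anchor text weighted highest

def string_score(s):
    """Sum the scores of the domain-relevant words found in one string."""
    return sum(SCORE_TABLE.get(word.lower(), 0.0) for word in s.split())

def score_link(anchor_text, left_context, right_context):
    """Combine the three per-string scores with a weighted function."""
    return (WEIGHTS["text"] * string_score(anchor_text)
            + WEIGHTS["left"] * string_score(left_context)
            + WEIGHTS["right"] * string_score(right_context))
```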
Web Pages Collection: Spider Customization
– Use the same navigation mechanism.
– Use the machine-learning-based page filtering, which requires:
  – the domain ontology and lexicons
  – the creation of a representative training corpus (CROSSMARC provides the Corpus Formation tool)
  – the use of the WebPageClassifier tool to construct the domain-specific classifier
– Use the rule-based approach suggested for link scoring, which requires:
  – the specification of new settings in the configuration file of the link scoring module (an illustrative example follows below)
  – experimenting with each specification until the optimal setting is found
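The actual format of the link scoring configuration file is not shown in these slides; the snippet below is only a hypothetical Python illustration of the kind of settings one would tune during customization.

```python
# Hypothetical settings of the link scoring module (illustrative only;
# the real CROSSMARC configuration file format may differ).
LINK_SCORING_CONFIG = {
    "score_table": {"laptop": 1.0, "price": 0.5},      # domain-relevant terms
    "weights": {"text": 0.6, "left": 0.2, "right": 0.2},
    "threshold": 0.5,  # minimum score for a link to be followed
}
```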
Corpus Formation for the Needs of Page Filtering
[Diagram: the Corpus Formation tool takes positive pages and unidentified pages, together with the ontology and one or more lexicons, and produces similar-to-positive pages; these undergo manual classification into positive and negative pages.]
Corpus Formation for the Needs of Link Scoring
[Diagram, training phase: starting from a site’s homepage, the Page-Validating & Link-Grabbing Web Spider feeds pages to page classifiers A and B; manually classified positives are compared with automatically classified positives to surface conflicts, and the unscored links are passed to the Link Scorer, which outputs scored links.]
Web Pages Collection: Web Sites Spider Evaluation – Page Filtering

Language   1st Domain F-measure (%)   2nd Domain F-measure (%)
English    96.9                       83.2
Italian    93.7                       73.7
Greek      92.7                       87.9
French     96.9                       82.3

– we are able to identify with a high degree of confidence whether a page is interesting or not according to the domain
– results can be improved further:
  – so far, only ontology-based features are used
  – combining them with statistically selected features is a promising research direction
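For reference, the F-measure reported above is the standard harmonic mean of precision and recall (a restatement, not a project-specific definition):

```latex
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```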
Web Pages Collection: Web Sites Spider Evaluation – Link Scoring
– the results of both methods are rather poor
– an issue that could be investigated is the combination of the two methods to improve recall
– in conclusion, the task of scoring links without visiting them remains a very challenging one, and is becoming more important in the general setting of topic-specific search engines and portals
Concluding Remarks
Crawler:
– applied in both domains of the project
– customization instructions are provided
– the tool and the corpora used in both domains and four languages will be available for research purposes
Spider:
– applied in three domains
– customization methodology and tools are provided
– the corpora collected for page filtering and link scoring will be available for research purposes