CROSSMARC big picture
© NCSR, Frascati, July 18-19, 2002


Slide 1: CROSSMARC big picture
(Diagram of the overall pipeline: Focused Crawling locates domain-specific Web sites on the WEB; Domain-specific Spidering, guided by the Domain Ontology, collects XHTML pages into a Web page collection; NERC-FE (Multilingual NERC and Name Matching, Multilingual and Multimedia Fact Extraction) turns the annotated XHTML pages into XML pages; the extracted facts are inserted into the Products Database, which a User Interface serves to the end user.)

Slide 2: Focused Crawling
- Exploitation of standard search engines.
- Exploitation of a language identification module.
- Exploitation of the page filtering module of the web spidering tool.
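
The three modules above combine into a simple filter pipeline. A minimal sketch, assuming toy stand-ins for the search-engine results, the language identifier and the page-filtering module (none of CROSSMARC's actual components are shown in the slides):

```python
# Sketch of the focused-crawling step: candidate pages from a standard
# search engine are kept only if they pass a language check and the
# page filter. Both checks below are illustrative stand-ins.

def is_english(text):
    # Crude stopword-ratio language identification stand-in.
    stopwords = {"the", "and", "with", "for", "this"}
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in stopwords)
    return hits / len(words) > 0.05

def looks_like_product_page(text):
    # Stand-in for the web spidering tool's page-filtering module.
    return any(kw in text.lower() for kw in ("laptop", "notebook", "processor"))

def focused_crawl(search_results):
    """Keep only (url, text) results that pass both filters."""
    return [url for url, text in search_results
            if is_english(text) and looks_like_product_page(text)]

results = [
    ("http://example.com/offer",
     "Buy this laptop with the fastest processor and a big screen"),
    ("http://example.com/news", "Notizie del giorno dalla redazione"),
]
print(focused_crawl(results))  # ['http://example.com/offer']
```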

Slide 3: Domain-specific Spidering
(Diagram: Domain-specific Spidering driven by the Domain Ontology.)

Slide 4: Corpus formation (for the needs of page validation)
(Diagram: the Corpus formation tool, using the Ontology and one or more Lexicons, separates pages into positive, similar-to-positive and unidentified pages; manual classification then yields the negative pages.)

Slide 5: Corpus Formation Tool (CFT): Task methodology
Following the reviewers' suggestion:
- Two annotators should be involved in the manual classification of Web pages.
Laptop domain: identify and classify pages into one of four categories:
- Product offerings, Product descriptions, Product announcements, Other.
The first three categories constitute the positive pages.
Performance will be evaluated on each category separately.

Slide 6: CFT improvements over v.1
- CFT v.1 reads manually created feature definition files (FDFs).
- The new version reads the ontology XML file, plus one XML lexicon file for a monolingual corpus (more than one lexicon for a multilingual corpus).
- Changes in the ontology and the lexicons are needed to improve CFT performance:
  - addition of lexicon entries for feature and attribute descriptions;
  - in the "Convertible_numeric" data type, an attribute denoting whether the range between minvalue and maxvalue is continuous.
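
The ontology-reading step can be sketched with Python's standard XML parser. The element and attribute names below (feature, datatype, minvalue, maxvalue, and the requested continuous flag) are illustrative assumptions, not the real CROSSMARC ontology schema:

```python
# Sketch of loading feature definitions from an ontology XML file,
# including the "continuous" attribute the slide asks for on the
# Convertible_numeric data type. The schema here is hypothetical.
import xml.etree.ElementTree as ET

ONTOLOGY_XML = """
<ontology domain="laptop">
  <feature name="processor_speed" datatype="Convertible_numeric"
           minvalue="500" maxvalue="2000" continuous="true"/>
  <feature name="ram" datatype="Convertible_numeric"
           minvalue="64" maxvalue="1024" continuous="false"/>
</ontology>
"""

def load_features(xml_text):
    root = ET.fromstring(xml_text)
    features = {}
    for feat in root.findall("feature"):
        features[feat.get("name")] = {
            "datatype": feat.get("datatype"),
            "range": (float(feat.get("minvalue")), float(feat.get("maxvalue"))),
            # Does the minvalue..maxvalue range denote a continuum?
            "continuous": feat.get("continuous") == "true",
        }
    return features

feats = load_features(ONTOLOGY_XML)
print(feats["processor_speed"]["continuous"])  # True
```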

Slide 7: Domain-specific Spidering: Site Navigation
(Diagram: starting from a URL, the spider splits frames where present; otherwise it follows links of several kinds: forms (select lists, search boxes), image maps, JavaScript, text links, image links, text constants, and others.)

Slide 8: Domain-specific Spidering: Page Validation, use of machine learning (1)
Training a page classifier. A trained classifier can easily:
- make a standalone application (e.g. classify a single page);
- be integrated in a generic web spider.
(Diagram: the Page Classifier Builder takes positive pages, negative pages, the Ontology and one or more Lexicons, and outputs a trained page classifier.)

Slide 9: Domain-specific Spidering: Page Validation, use of machine learning (2)
Experimented with various classifiers:
- k-NN, Naïve Bayes, J4.8 (C4.5), SMO (SVMs), AdaBoostM1 (J4.8).
SMO and AdaBoost perform best; SMO is currently used.
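
The training/classification interface can be illustrated with a toy version of one of the algorithms listed above. The project used Weka implementations (SMO, J4.8, AdaBoostM1); this pure-Python Naïve Bayes over a bag-of-words page representation is only a sketch of the idea:

```python
# Toy multinomial Naive Bayes page classifier with Laplace smoothing.
# Pages are token lists; labels are 'pos' / 'neg'. Data is illustrative.
import math
from collections import Counter

def train_nb(pages, labels):
    counts = {"pos": Counter(), "neg": Counter()}
    docs = Counter(labels)
    for toks, lab in zip(pages, labels):
        counts[lab].update(toks)
    vocab = set(counts["pos"]) | set(counts["neg"])
    return counts, docs, vocab

def classify_nb(model, tokens):
    counts, docs, vocab = model
    total = sum(docs.values())
    best, best_lp = None, float("-inf")
    for lab in ("pos", "neg"):
        lp = math.log(docs[lab] / total)          # class prior
        denom = sum(counts[lab].values()) + len(vocab)
        for t in tokens:
            lp += math.log((counts[lab][t] + 1) / denom)  # smoothed likelihood
        if lp > best_lp:
            best, best_lp = lab, lp
    return best

model = train_nb(
    [["laptop", "price", "ghz"], ["cpu", "ram", "offer"], ["news", "weather"]],
    ["pos", "pos", "neg"],
)
print(classify_nb(model, ["laptop", "offer"]))  # pos
```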

Slide 10: Domain-specific Spidering: Link assessment (1)
Experimentation on several design parameters is performed.
- Input:
  - Link content: tokenization of the hyperlink's URL + possibly useful attributes (e.g. "alt") + child HTML elements.
  - Link context: close ancestor HTML elements.
- Feature selection:
  - Not based on the ontology; features are selected automatically.
  - Features are simple words.
  - Bag-of-words approach (separate features for content/context?).
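
The link representation above can be sketched as follows; the tokenizer and the content:/context: feature prefixes are illustrative assumptions:

```python
# Sketch of bag-of-words features for a hyperlink: URL tokens, useful
# attributes (e.g. "alt"), anchor text as "content"; close ancestor
# text as "context", kept separate via a prefix.
import re

def tokenize(text):
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def link_features(href, alt="", anchor_text="", ancestor_text=""):
    feats = set()
    feats.update("content:" + t for t in tokenize(href))
    feats.update("content:" + t for t in tokenize(alt))
    feats.update("content:" + t for t in tokenize(anchor_text))
    feats.update("context:" + t for t in tokenize(ancestor_text))
    return feats

f = link_features("http://shop.example.com/laptops/index.html",
                  alt="laptop offers", ancestor_text="Products menu")
print("content:laptops" in f, "context:menu" in f)  # True True
```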

Slide 11: Domain-specific Spidering: Link assessment (2)
- Scoring function: a link is scored according to its distance from a product page. Options: linear, discretized (e.g. "1", "2-4", "5-10", "more than 10").
- Learner for numeric prediction (or a classifier for the discrete scoring function): M5', regression (linear / locally weighted).
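
A sketch of the discretized scoring option, using non-overlapping distance bins:

```python
# Map a link's distance (in links) from the nearest product page to a
# discrete score class. Bin boundaries follow the slide's example,
# adjusted so the bins do not overlap.

def discretize_distance(d):
    if d <= 1:
        return "1"
    if d <= 4:
        return "2-4"
    if d <= 10:
        return "5-10"
    return "more than 10"

print([discretize_distance(d) for d in (1, 3, 7, 12)])
# ['1', '2-4', '5-10', 'more than 10']
```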

Slide 12: Link Assessment: Training a Hyperlink scorer (1)
- Training is based on a small number of Web sites (e.g. 5 sites).
- All positive pages of these sites should be located.
- Our approach: semi-automatic classification.
  - Requires two trained Web page classifiers, each trained by a different learning algorithm.
  - Each page is classified by both classifiers.
  - If both classifiers agree on a page, they are likely to be correct.
  - Conflicts are logged and resolved by a human.
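
The agreement-based scheme can be sketched directly; the two lambda classifiers below are toy stand-ins for the two trained Web page classifiers:

```python
# Semi-automatic labelling: accept a label when two independently
# trained classifiers agree, log the page as a conflict otherwise.

def semi_automatic_label(pages, classifier_a, classifier_b):
    accepted, conflicts = {}, []
    for page in pages:
        a, b = classifier_a(page), classifier_b(page)
        if a == b:
            accepted[page] = a      # both agree: likely correct
        else:
            conflicts.append(page)  # disagreement: resolved by a human
    return accepted, conflicts

clf_a = lambda p: "pos" if "laptop" in p else "neg"
clf_b = lambda p: "pos" if ("laptop" in p or "notebook" in p) else "neg"
accepted, conflicts = semi_automatic_label(
    ["laptop offer", "notebook deal", "weather report"], clf_a, clf_b)
print(accepted, conflicts)
```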

Slide 13: Link Assessment: Training a Hyperlink scorer (2)
(Diagram, training phase, for every site: starting from the homepage, a page-validating and link-grabbing Web spider feeds Page classifiers A and B; agreements become automatically classified positives, conflicts are classified manually, and the combined positives let the Link Scorer turn unscored links into scored links.)

Slide 14: Link Assessment: Training a Hyperlink scorer (3)
Feed the scored links from all sites to the Hyperlink Scoring Learner, which outputs the Hyperlink scorer.

Slide 15: Link Assessment: Training a Hyperlink scorer (4)
Cross-validate the Hyperlink Scoring Learner on the scored links: output n scorers for n-fold cross-validation (n = number of sites). A machine learning expert chooses the algorithm, parameters, etc.
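
With n equal to the number of sites, the cross-validation amounts to a leave-one-site-out split. A sketch, assuming the scored links are grouped by site:

```python
# Leave-one-site-out cross-validation splits: each fold holds out all
# links of one site for testing and trains on the remaining sites.

def site_folds(scored_links_by_site):
    """Yield (held_out_site, train_links, test_links), one fold per site."""
    sites = sorted(scored_links_by_site)
    for held_out in sites:
        train = [x for s in sites if s != held_out
                 for x in scored_links_by_site[s]]
        yield held_out, train, scored_links_by_site[held_out]

data = {"siteA": [1, 2], "siteB": [3], "siteC": [4, 5]}
for site, train, test in site_folds(data):
    print(site, len(train), len(test))
```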

Slide 16: Link Assessment: Evaluating a Hyperlink scorer
Draw the curve (#followed-links, #positives-found).
(Diagram: starting from a homepage, the Link Scoring Evaluation Spider uses the Hyperlink scorer to identify positive pages and collect statistics.)
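
The curve can be computed by following links in decreasing score order and counting positives found so far; the (score, is_positive) pairs below are toy data:

```python
# Evaluation curve for a hyperlink scorer: after each followed link
# (in decreasing score order), record how many positive pages have
# been found, giving the points (#followed-links, #positives-found).

def evaluation_curve(scored_links):
    """scored_links: list of (score, is_positive) pairs."""
    curve, found = [], 0
    for _, is_pos in sorted(scored_links, key=lambda x: -x[0]):
        found += int(is_pos)
        curve.append(found)
    return curve

links = [(0.9, True), (0.7, False), (0.6, True), (0.2, False)]
print(evaluation_curve(links))  # [1, 1, 2, 2]
```

A good scorer's curve rises early: most positives are found after following few links.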

Slide 17: Integrated Collection Tool
(Diagram: starting from a homepage, the Integrated Collection Tool combines the page classifier and the Hyperlink scorer to deliver positive Web pages.)

Slide 18: Corpus Collection (for the needs of NERC + FE)
The initial corpora for the 4 languages were not uniform:
- each site was responsible for its own corpus;
- different numbers of pages per corpus for every language;
- size differences between Training and Testing corpora;
- different methods of corpus collection were followed:
  - manual collection;
  - use of all the pages provided by the Web Spidering Tool;
  - selection of pages provided by the Web Spidering Tool.
Results: some corpora contained many pages from the same site, and some corpora had Testing pages from sites not present in the Training corpus while others did not.

Slide 19: Corpus Collection Methodology (1)
Domain-independent methodology:
- Random selection of at least 50 domain-relevant sites per language.
- Investigation of domain characteristics (e.g. presentation types) in relation to the IE tasks performed by the systems.
- Quantitative analysis of domain characteristics per language, yielding statistics.
- Selection of pages for each corpus according to the statistics for the corresponding language.
- The Training corpus is equal in size to the Testing corpus; their size is the same for all languages and agreed between the partners for each domain: 50 pages Training and 50 pages Testing for the 1st domain.
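
Selecting pages "according to the statistics" amounts to allocating the 50-page budget across page categories in proportion to the observed percentages. A sketch using largest-remainder rounding (the figures are the English percentages reported for the 1st domain):

```python
# Allocate a fixed page budget across categories proportionally to
# observed percentages, using largest-remainder rounding so the
# allocated counts sum exactly to the budget.

def allocate(budget, percentages):
    quotas = {c: budget * p / 100.0 for c, p in percentages.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = budget - sum(counts.values())
    # Hand the remaining pages to the largest fractional remainders.
    for c in sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True):
        if leftover == 0:
            break
        counts[c] += 1
        leftover -= 1
    return counts

english = {"A1": 38, "A2": 1, "A3": 4, "B1": 39, "B2": 3,
           "B3": 8, "B4": 1, "B5": 5, "B6": 0, "B7": 0}
plan = allocate(50, english)
print(sum(plan.values()))  # 50
```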

Slide 20: Corpus Collection Methodology (2)
Categories of Web pages containing Product Descriptions (1st Domain).

Slide 21: Corpus Collection Methodology (3)

              English   French   Hellenic   Italian
TOTAL SITES        52       50         54        50
A1                38%      42%        59%       45%
A2                 1%       2%         0%        0%
A3                 4%       6%         2%        2%
B1                39%      26%        20%       40%
B2                 3%      11%         9%        8%
B3                 8%       0%         3%        3%
B4                 1%       0%         3%        0%
B5                 5%       5%         0%        2%
B6                 0%       3%         3%        0%
B7                 0%       5%         2%        0%

Slide 22: Corpus Collection Methodology (4)
Domain-specific principles (agreed by the partners):
- The maximum number of pages from one site and type is fixed for all languages: 4 in the 1st domain.
- A subset of the pages in the Testing corpus must come from sites not represented in the Training corpus: no less than 30% of the Testing corpus in the 1st domain.
- All pages used in the CROSSMARC corpora must be saved with images and a name indicating page origin and characteristics.
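
The first two principles can be enforced mechanically. A sketch with a per-(site, type) cap and an unseen-site ratio check; the helper names and the page tuples are illustrative:

```python
# Enforce two corpus-collection principles: at most `limit` pages per
# (site, page type) pair, and a check on the fraction of Testing pages
# coming from sites absent from the Training corpus.
from collections import Counter

def cap_per_site_and_type(pages, limit=4):
    """pages: list of (site, page_type, url); drop pages over the cap."""
    seen, kept = Counter(), []
    for site, ptype, url in pages:
        if seen[(site, ptype)] < limit:
            seen[(site, ptype)] += 1
            kept.append((site, ptype, url))
    return kept

def unseen_site_ratio(test_pages, training_sites):
    unseen = sum(1 for site, _, _ in test_pages if site not in training_sites)
    return unseen / len(test_pages)

pages = [("s1", "A1", "u%d" % i) for i in range(6)] + [("s2", "B1", "u6")]
capped = cap_per_site_and_type(pages)
print(len(capped))  # 5 (four s1/A1 pages survive the cap, plus the s2 page)
```

A Testing corpus for the 1st domain would then be required to satisfy `unseen_site_ratio(test, train_sites) >= 0.30`.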

Slide 23: Corpus Collection Methodology (5)
New Training & Testing Corpora (1st Domain):
- New size for all languages: 50 pages Training and 50 pages Testing.
- Effort to keep as many of the old annotated pages as possible in the new corpora.
- Effort to include pages from as many sites as possible in each language.

