Published by Marilyn Brooks. Modified over 8 years ago.
1
WP1: Conversion of HTML Web Pages to XML format
CROSSMARC Seventh Meeting, Edinburgh, 6-7 March 2003
University of Rome “Tor Vergata”
2
WebXimmler: main components
- JTidy: a Java port of the popular Tidy HTML cleaning tool.
- Jakarta-ORO: a set of text-processing Java classes that provide Perl5-compatible regular expressions, plus utility classes for string substitutions, splits, filename filtering, etc.
- The WebXimmler main component, which preprocesses the web pages by:
  - detecting their encoding and converting the pages to UTF-8;
  - transforming all named entities into numeric entities, and all numeric entities into the UTF-8 characters they represent;
  - correcting possible problems that could invalidate the behaviour of the JTidy component.
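The two-step entity conversion described above can be sketched as follows. This is a minimal illustration, not the WebXimmler code: the class name `EntitySketch`, the six-entry table, and the decision to leave markup-significant characters (`<`, `>`, `&`, quotes) escaped are all assumptions made for the example. It uses `java.util.regex` rather than Jakarta-ORO so that it runs on a plain JDK (Java 9 or later).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntitySketch {
    // Hypothetical excerpt of a substitution table: entity name -> code point.
    // A real table would cover the full HTML entity set.
    private static final Map<String, Integer> NAMED = Map.of(
            "nbsp", 160, "agrave", 224, "egrave", 232,
            "eacute", 233, "alpha", 945, "euro", 8364);

    private static final Pattern NAMED_REF =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    // Step 1: named entities -> numeric entities. Names missing from the
    // table (including &amp; and &lt; here) are left untouched.
    static String namedToNumeric(String html) {
        Matcher m = NAMED_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer cp = NAMED.get(m.group(1));
            String repl = (cp == null) ? m.group() : "&#" + cp + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Step 2: numeric entities -> the characters they represent.
    // Assumption: references to markup-significant characters stay escaped
    // so that the decoded text does not introduce new markup.
    static String numericToChars(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String repl = m.group();
            try {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                boolean markup = cp == '<' || cp == '>' || cp == '&'
                        || cp == '"' || cp == '\'';
                if (Character.isValidCodePoint(cp) && !markup) {
                    repl = new String(Character.toChars(cp));
                }
            } catch (NumberFormatException e) {
                // Number too large to parse: keep the reference as written.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    static String decode(String html) {
        return numericToChars(namedToNumeric(html));
    }

    public static void main(String[] args) {
        System.out.println(decode("caf&eacute; &#x3B1; &#60;"));
    }
}
```

Going through numeric entities first means the second step needs no table at all: once every name has been mapped to a number, decoding is purely arithmetic.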
3
WebXimmler: preprocessing of the pages
Encoding extraction:
- Accepts a hint from the user (or from the CROSSMARC system), based on the provenance of the pages. Typical encodings for the four countries involved in CROSSMARC:
  - Italian, French and English pages: Latin-1 (ISO-8859-1) or cp-1252
  - Greek pages: cp-1253
- Finds, via a regular expression, any occurrence of the HTTP-EQUIV meta tag and of its “charset” value. If present, this value overrides the hint received.
- Pages are then converted to UTF-8 encoding.
- A table of substitutions is used to transform all the named entities into numeric entities; then all the numeric entities are converted into the UTF-8 characters they represent.
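The hint-plus-override logic for encoding extraction might look roughly like the sketch below. It is an illustration under stated assumptions, not the WebXimmler implementation: the class and method names are hypothetical, the regular expression is one plausible way to find the charset in an HTTP-EQUIV meta tag, and `java.util.regex` stands in for Jakarta-ORO. The hint is assumed to be a charset name the JVM supports.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSketch {
    // Matches e.g. <meta http-equiv="Content-Type"
    //                    content="text/html; charset=windows-1253">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*http-equiv[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in the page if there is one and the JVM
    // knows it; otherwise falls back to the caller's hint.
    static Charset detect(byte[] raw, String hint) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup can be
        // scanned before the real encoding is known.
        String probe = new String(raw, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(probe);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: ignore the declaration.
            }
        }
        return Charset.forName(hint);
    }

    // Decodes the page with the detected charset.
    static String decode(byte[] raw, String hint) {
        return new String(raw, detect(raw, hint));
    }

    // Re-encodes the page as UTF-8 bytes.
    static byte[] toUtf8(byte[] raw, String hint) {
        return decode(raw, hint).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head>"
                + "<body>\u00e8</body></html>").getBytes(StandardCharsets.UTF_8);
        // The declared charset wins over the Latin-1 hint.
        System.out.println(detect(page, "ISO-8859-1"));
    }
}
```

Note that the declared charset is not rewritten here; a page converted to UTF-8 would normally also need its meta tag updated or removed so that the declaration matches the new encoding.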
4
WebXimmler: preprocessing of the pages
Cleaning procedure:
- The typical dirtiness of an HTML web page is reported as a series of warnings and handled without problems.
- Serious inconsistencies in the original HTML files are reported as errors, and JTidy refuses to produce an output.
- Force-Output=yes: this option forces the JTidy component to output the page anyway. JTidy guesses the best representation of the original page while trying to maintain the desired output format.
5
WebXimmler: software architecture
- WebXimmler executable jar file, which comprises:
  - a Java encoding converter (we tried it successfully with almost any encoding);
  - the preprocessing operations on the files (described previously).
- A lib folder with two jar files:
  - jakarta-oro-2.0.6.jar
  - Tidy.jar
- A Corpus folder
- Some batch files that facilitate immediate corpus processing

USAGE: java -jar .\build\webximmler.jar -filter -encoding[“hint”] -xml input.htm output.xml
6
WebXimmler: preprocessing of the pages
Current development:
- We are trying to fix some problems that cause the JTidy component of WebXimmler to output incorrectly formatted XML.
- Main cause: SCRIPTS
  - Delete them?
  - Handle them in some way?
  - Comment out their content?
- Other minor issues are not handled correctly by JTidy; we are categorizing them.
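Two of the options raised for scripts can be sketched as a preprocessing pass that runs before JTidy. This is only an illustration of the alternatives, not a decision the project is known to have taken: the class `ScriptSketch` is hypothetical, and wrapping the script body in a CDATA section is shown as one possible way of "handling" scripts, chosen here because an XML comment cannot contain `--`, which is common in JavaScript.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptSketch {
    // Opening tag, body, closing tag of a <script> element.
    // Limitation: self-closing <script ... /> tags are not treated specially.
    private static final Pattern SCRIPT = Pattern.compile(
            "(<script\\b[^>]*>)(.*?)(</script\\s*>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Option 1: delete script elements entirely.
    static String deleteScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    // Option 2 (assumption): keep scripts but protect their content with
    // CDATA, so characters such as '<' and '&' cannot break the XML output.
    static String wrapScriptsInCdata(String html) {
        Matcher m = SCRIPT.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // "]]>" would end the CDATA section early, so split it in two.
            String body = m.group(2).replace("]]>", "]]]]><![CDATA[>");
            String repl = m.group(1) + "<![CDATA[" + body + "]]>" + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<p>a</p><script>if (a<b) x--;</script><p>b</p>";
        System.out.println(deleteScripts(page));
        System.out.println(wrapScriptsInCdata(page));
    }
}
```

Deleting is the simplest choice when the downstream components only need the page text; wrapping preserves the scripts at the cost of extra markup in the XML.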