WP1: Conversion of HTML Web Pages to XML format CROSSMARC Seventh Meeting Edinburgh 6-7 March 2003 University of Rome “Tor Vergata”

WebXimmler: main components
- JTidy: a Java port of the popular Tidy HTML cleaning tool.
- Jakarta-ORO: a set of text-processing Java classes that provide Perl5-compatible regular expressions, plus utility classes for string substitutions, splits, filename filtering, etc.
- The WebXimmler main component, which preprocesses the web pages by:
  - detecting their encoding and converting the pages to UTF-8;
  - transforming all the parametric (named) entities into numeric entities, and then all the numeric entities into UTF-8 characters;
  - correcting possible problems that may invalidate the behaviour of the JTidy component.
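The two-pass entity conversion listed above can be sketched as follows. This is only an illustration under stated assumptions, not the WebXimmler code: the class name `EntitySketch`, its method names and the four-entry table are hypothetical, and a real substitution table would cover the full HTML entity set. The sketch also assumes that numeric references to XML markup characters (such as `&#60;`) should stay escaped so that the output remains well-formed.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntitySketch {
    // Illustrative excerpt of a substitution table (assumption: the real table is far larger).
    private static final Map<String, Integer> NAMED = Map.of(
            "eacute", 0xE9, "agrave", 0xE0, "euro", 0x20AC, "alpha", 0x3B1);

    private static final Pattern NAMED_REF = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    // Pass 1: named entities found in the table become decimal numeric entities.
    static String namedToNumeric(String html) {
        Matcher m = NAMED_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer cp = NAMED.get(m.group(1));
            // Names missing from the table (including &amp; and &lt;) are left untouched.
            String repl = (cp == null) ? m.group() : "&#" + cp + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Pass 2: numeric entities become the characters they represent.
    static String numericToChars(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String repl = m.group();
            try {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                if (cp >= 0x20 && Character.isValidCodePoint(cp) && !isMarkup(cp)) {
                    repl = new String(Character.toChars(cp));
                }
            } catch (NumberFormatException e) {
                // Number too large to parse: keep the reference as it was.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Assumption: these characters must stay escaped to keep the XML well-formed.
    private static boolean isMarkup(int cp) {
        return cp == '<' || cp == '>' || cp == '&' || cp == '"' || cp == '\'';
    }

    static String decode(String html) {
        return numericToChars(namedToNumeric(html));
    }
}
```

Running the two passes separately mirrors the order given on the slide; once the text is held as a Java string, writing it out with a UTF-8 encoder produces the UTF-8 characters.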

WebXimmler: preprocessing of the pages
Encoding extraction:
- Accepts hints from the user (or from the CROSSMARC system), based on the provenance of the pages. Typical encodings for the four countries involved in CROSSMARC:
  - Italian, French and English pages: latin-1 (ISO ) or cp-1252
  - Greek pages: cp-1253
- Finds, via a regular expression, any occurrence of the meta tag HTTP-EQUIV and of its “charset” attribute. If present, the value of this attribute overrides the hint received.
- Pages are then converted to UTF-8 encoding.
- A table of substitutions helps transform all the parametric (named) entities into numeric entities; all the numeric entities are then converted into the UTF-8 characters they represent.
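A minimal sketch of the encoding extraction described above, using only the Java standard library. The class `EncodingSketch`, its method names and the regular expression are assumptions made for illustration; the actual WebXimmler pattern (built on Jakarta-ORO) is not shown in the slides. The sketch assumes the `http-equiv` attribute appears before `charset` inside the meta tag.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingSketch {
    // Hypothetical pattern: a meta tag carrying http-equiv and, later on, charset=...
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*http-equiv[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in the page, or the hint when none is declared.
    static Charset detect(byte[] raw, String hint) {
        // ISO-8859-1 maps every byte to one char, so the ASCII-only meta tag
        // can be scanned before the real encoding is known.
        String probe = new String(raw, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(probe);
        String name = m.find() ? m.group(1) : hint;
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Unknown or malformed charset name in the page: fall back to the hint.
            return Charset.forName(hint);
        }
    }

    // Decodes the page with the detected charset and re-encodes it as UTF-8.
    static byte[] toUtf8(byte[] raw, String hint) {
        return new String(raw, detect(raw, hint)).getBytes(StandardCharsets.UTF_8);
    }
}
```

A call such as `EncodingSketch.toUtf8(pageBytes, "windows-1253")` would apply the Greek hint unless the page declares its own charset. Here `windows-1253` is the Java charset name for cp-1253, whose availability depends on the runtime's installed charsets.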

WebXimmler: preprocessing of the pages
Cleaning procedure:
- The typical dirtiness of an HTML web page is reported as a series of warnings and handled without problems.
- Serious inconsistencies in the original HTML files are reported as errors, and JTidy refuses to give an output.
- Force-Output=yes: this option forces the JTidy component to output the page anyway. JTidy guesses the best representation of the original page while trying to maintain the desired output format.

WebXimmler: software architecture
- WebXimmler executable jar file, which comprises:
  - a Java encoding converter (we tried it successfully with almost any encoding);
  - the preprocessing operations on the files (previously described).
- A lib folder with two jar files:
  - jakarta-oro jar
  - Tidy.jar
- A Corpus folder
- Some batch files that facilitate immediate corpus processing

USAGE: java -jar .\build\webximmler.jar -filter -encoding[“hint”] -xml input.htm output.xml

WebXimmler: preprocessing of the pages
Current development:
- We are trying to fix some problems that cause the JTidy component of WebXimmler to output incorrect XML.
- Main causes:
  - SCRIPTS: Delete them? Handle them in some way? Comment out their content?
  - Other minor issues not handled correctly by JTidy; we are categorizing them.
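Two of the options raised for scripts can be sketched as a preprocessing pass that runs before tidying. This is not the approach the authors settled on (the slides leave the question open), and the class `ScriptSketch` and its methods are hypothetical. The regular expression is a simplification: it would be fooled by a literal `</script>` inside a script string, and the CDATA variant assumes the script body does not itself contain `]]>`.

```java
import java.util.regex.Pattern;

final class ScriptSketch {
    // Opening tag, body and closing tag of a script element, case-insensitive, across lines.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)(<script\\b[^>]*>)(.*?)(</script\\s*>)");

    // Option "delete them": drop every script element entirely.
    static String strip(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    // Option "handle them in some way": keep the element but wrap its body in a
    // CDATA section so that characters such as '<' and '&' cannot break the XML.
    static String wrapInCdata(String html) {
        return SCRIPT.matcher(html).replaceAll("$1<![CDATA[$2]]>$3");
    }
}
```

Stripping is the simpler choice when script content is irrelevant to later processing; wrapping preserves the content at the cost of the caveats noted above.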