1
2XML
Marko Tadić (marko.tadic@ffzg.hr)
Department of Linguistics, Faculty of Philosophy, University of Zagreb (www.ffzg.hr)
Tübingen, 2000-11-08
2
Human language technology
language resources
–corpora
–dictionaries
language tools
–language resource organizing and retrieval tools
–morphology
–syntax
–semantics
–...
3
Text availability for building corpora
written language
–flood of text in digital form
–“cheap” sources
spoken language
–difficulties in data collection
    problems of recording
    problems of transcription
    problems of spontaneity of speakers
    “expensive” source (typing)
both language varieties
–corpus as text in digital form
4
WWW as a text source
estimated number of words accessible through AltaVista, 2000-02 (source: Greg Grefenstette, XRCE, 2000-02)
automated conversion of texts to a standardized format is needed
5
Corpus encoding standards
pre-mark-up encoding
SGML (’80s to mid-’90s)
–Text Encoding Initiative (TEI)
–Corpus Encoding Standard (CES)
    Ide et al. (1996)
XML (last couple of years)
–XCES (XML version of CES)
    Ide, Bonhomme & Romary (2000)
6
Conversion to XML
2XML
–tool for conversion
–input formats
    HTML
    RTF
–output format
    XML
7
2XML 1
producer
–Institute of Linguistics, Faculty of Philosophy, University of Zagreb
programming
–Softleks d.o.o., Zagreb
platforms
–Windows 9x/ME/NT/2000
requirements
–Internet Explorer 5.* needed to run
8
2XML 2
principle: two-step conversion
1st step
–input: HTML or RTF
–output: intermediate “dirty” XML
2nd step
–input: “dirty” XML
–user-defined script applied to it
–output: XML document
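
The two-step principle can be illustrated with a minimal Python sketch. This is not the 2XML implementation: the `dirty` wrapper element, the `RULES` mapping standing in for a user-defined script, and all function names are assumptions made for illustration, and only HTML input is handled. Step 1 mirrors the HTML tag structure as well-formed XML; step 2 renames the elements listed in the rules and unwraps everything else, keeping its text.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # HTML tags with no end tag


def append_text(parent, text):
    """Append text to an element, after its last child if it has any."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


class DirtyXMLBuilder(HTMLParser):
    """Step 1: mirror the HTML tag structure as a well-formed XML tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("dirty")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        append_text(self.stack[-1], data)


def apply_script(dirty, rules, root_tag="text"):
    """Step 2: rename elements listed in rules, unwrap all others.

    Attributes are dropped in this sketch.
    """
    out = ET.Element(root_tag)

    def walk(src, dst):
        append_text(dst, src.text)
        for child in src:
            if child.tag in rules:
                walk(child, ET.SubElement(dst, rules[child.tag]))
            else:
                walk(child, dst)  # unwrap: keep the content, drop the tag
            append_text(dst, child.tail)

    walk(dirty, out)
    return out


def convert(html, rules):
    builder = DirtyXMLBuilder()
    builder.feed(html)
    builder.close()
    return apply_script(builder.root, rules)


# Hypothetical "user-defined script": HTML tag -> target XML tag.
RULES = {"h1": "head", "p": "p", "b": "hi", "i": "hi"}

html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
print(ET.tostring(convert(html, RULES), encoding="unicode"))
```

Keeping the intermediate tree separate from the rules means the same step-1 output can be reprocessed with a different rule set without reparsing the source.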
9
2XML Conversion: step 1
10
2XML Conversion: step 2
11
2XML user-defined script
12
2XML Goodies
goodies
–XML tree labeling
–XML text editing
–execute script on load
–batch processing: whole directory
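
Batch processing of a whole directory could look roughly like the following Python sketch. The function name `batch_convert`, its parameters, and the `*.html` pattern are assumptions for illustration; `convert` stands for whatever single-file conversion is applied and is passed in as a plain function from source text to XML text.

```python
from pathlib import Path


def batch_convert(src_dir, dst_dir, convert, pattern="*.html"):
    """Apply convert() to every matching file in src_dir and
    write the result to dst_dir under the same name with an .xml suffix."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob(pattern)):  # sorted for a stable order
        xml_text = convert(path.read_text(encoding="utf-8"))
        target = dst / (path.stem + ".xml")
        target.write_text(xml_text, encoding="utf-8")
        written.append(target)
    return written
```

A call such as `batch_convert("html_in", "xml_out", my_convert)` would then process every HTML file in `html_in`, where both directory names and `my_convert` are placeholders.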
13
2XML: tree labeling & editing
14
2XML Tokenizer
program which tokenizes XML files
output in two formats
–tokenized XML file
–tabbed file
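
As a rough illustration of the two output formats, the Python sketch below tokenizes the text content of an XML document and returns both a tokenized XML string and a tab-separated listing. The slides do not specify the actual formats, so the `<w>` element, the token pattern, and the token/frequency columns of the tabbed output are all assumptions.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed token pattern: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_xml(xml_text):
    """Return (tokenized XML string, tabbed 'token<TAB>count' listing)."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def make_tokens(text):
        elems = []
        for tok in TOKEN_RE.findall(text or ""):
            w = ET.Element("w")  # hypothetical token element
            w.text = tok
            counts[tok] += 1
            elems.append(w)
        return elems

    def walk(elem):
        children = list(elem)
        new_children = make_tokens(elem.text)
        elem.text = None
        for child in children:
            walk(child)
            tail, child.tail = child.tail, None
            new_children.append(child)
            new_children.extend(make_tokens(tail))
            elem.remove(child)
        elem.extend(new_children)  # original children interleaved with tokens

    walk(root)
    tabbed = "\n".join(f"{tok}\t{n}" for tok, n in sorted(counts.items()))
    return ET.tostring(root, encoding="unicode"), tabbed


xml_out, tab_out = tokenize_xml("<text><p>To be, or not to be.</p></text>")
print(xml_out)
print(tab_out)
```

Whitespace between tokens is discarded in this sketch, so the original spacing cannot be recovered from the tokenized file.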
15
2XML Tokenizer 2
16
Tokenizer output: dic file
17
Tokenizer output: XML file