2XML Marko Tadić Department of linguistics, Faculty of philosophy, University of Zagreb (www.ffzg.hr) Tübingen, 2000-11-08.


1 2XML Marko Tadić (marko.tadic@ffzg.hr) Department of linguistics, Faculty of philosophy, University of Zagreb (www.ffzg.hr) Tübingen, 2000-11-08

2 Human language technology
language resources
–corpora
–dictionaries
language tools
–language resource organizing and retrieval tools
–morphology
–syntax
–semantics
–...

3 Text availability for building corpora
written language
–flood of text in digital form
–“cheap” sources
spoken language
–difficulties in data collecting
  problems of recording
  problems of transcription
  problems of spontaneity of speakers
  “expensive” source (typing)
both language varieties
–corpus as text in digital form

4 WWW as a text source
estimate of the number of words accessible through AltaVista, 2000-02 (source: Greg Grefenstette, XRCE, 2000-02)
automated conversion of texts to a standardized format is needed

5 Corpus encoding standards
pre-mark-up encoding
SGML (’80s and mid-’90s)
–Text Encoding Initiative (TEI)
–Corpus Encoding Standard (CES)
  Ide et al. (1996)
XML (last couple of years)
–XCES (XML version of CES)
  Ide, Bonhomme & Romary (2000)

6 Conversion to XML
2XML
–tool for conversion
–input formats
  HTML
  RTF
–output format
  XML

7 2XML 1
producer
–Institute of linguistics, Faculty of Philosophy, University of Zagreb
programming
–Softleks d.o.o., Zagreb
platforms
–Windows 9x/ME/NT/2000
requirements
–Internet Explorer 5.* to run

8 2XML 2
principle: two-step conversion
1st step
–input: HTML or RTF
–output: intermediate “dirty” XML
2nd step
–input: “dirty” XML
–user-defined script applied to it
–output: XML document
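
The two-step principle above can be illustrated with a minimal Python sketch. This is not the 2XML implementation: the slides do not describe its script language, so the class DirtyXMLBuilder, the function apply_script, the RULES table and the target tag names (text, head, hi) are all hypothetical, and only HTML input is covered. Step 1 mirrors the HTML tag structure as well-formed "dirty" XML; step 2 applies a user-supplied rule table that renames tags and discards attributes.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag (partial list, for the sketch).
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class DirtyXMLBuilder(HTMLParser):
    """Step 1 (hypothetical): mirror HTML tags as a well-formed XML tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("dirty")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open tag;
        # stray end tags with no open partner are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text following a child element is that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def apply_script(dirty_root, rename):
    """Step 2 (hypothetical): rename tags per the rule table, drop attributes."""
    for elem in dirty_root.iter():
        elem.tag = rename.get(elem.tag, elem.tag)
        elem.attrib.clear()
    return dirty_root


def html_to_xml(html, rename):
    builder = DirtyXMLBuilder()
    builder.feed(html)
    builder.close()  # flushes any buffered trailing text
    return ET.tostring(apply_script(builder.root, rename), encoding="unicode")


# Assumed rule table standing in for a user-defined script.
RULES = {"dirty": "text", "h1": "head", "b": "hi"}

if __name__ == "__main__":
    print(html_to_xml('<h1 class="t">Naslov</h1><p>Prvi <b>odlomak</b>.<br>Kraj', RULES))
```

Keeping the rules in a separate table means the same step 1 output can be converted to different target schemes by swapping the table, which is presumably the point of making the second step user-defined.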

9 2XML Conversion: step 1

10 2XML Conversion: step 2

11 2XML user-defined script

12 2XML Goodies
goodies
–XML tree labeling
–XML text editing
–execute script on load
–batch processing: whole directory

13 2XML: tree labeling & editing

14 2XML Tokenizer
program which tokenizes XML files
output in two formats
–tokenized XML file
–tabbed file
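
As a rough illustration of the two output formats, the following Python sketch tokenizes the text of an XML document and returns both a tokenized XML string and a tab-separated listing. It is only a sketch under stated assumptions: the function name tokenize_xml, the <w> wrapper element, the token pattern and the column layout of the tabbed output (token, then enclosing element name) are hypothetical, since the slides do not specify them. For simplicity only leaf elements are tokenized; elements with mixed content are left untouched.

```python
import re
import xml.etree.ElementTree as ET

# Assumed token pattern: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_xml(xml_text, token_tag="w"):
    """Return (tokenized XML string, tab-separated token listing)."""
    root = ET.fromstring(xml_text)
    rows = []
    # Snapshot the elements first so newly added token elements are not revisited.
    for elem in list(root.iter()):
        if len(elem) > 0 or not elem.text:
            continue  # skip containers, mixed content and empty elements
        tokens = TOKEN_RE.findall(elem.text)
        elem.text = None
        for tok in tokens:
            ET.SubElement(elem, token_tag).text = tok
            rows.append((tok, elem.tag))
    tabbed = "\n".join(f"{tok}\t{tag}" for tok, tag in rows)
    return ET.tostring(root, encoding="unicode"), tabbed


if __name__ == "__main__":
    xml_out, tabbed = tokenize_xml("<text><head>Naslov</head><p>Prvi odlomak.</p></text>")
    print(xml_out)
    print(tabbed)
```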

15 2XML Tokenizer 2

16 Tokenizer output: dic file

17 Tokenizer output: XML file

18 2XML Marko Tadić (marko.tadic@ffzg.hr) Department of linguistics, Faculty of philosophy, University of Zagreb (www.ffzg.hr) Tübingen, 2000-11-08

