2XML Marko Tadić Department of linguistics, Faculty of philosophy, University of Zagreb ( Tübingen,
Human language technology
language resources
  – corpora
  – dictionaries
language tools
  – language resource organizing and retrieval tools
  – morphology
  – syntax
  – semantics
  – ...
Text availability for building corpora
written language
  – flood of text in digital form
  – “cheap” sources
spoken language
  – difficulties in data collection
      problems of recording
      problems of transcription
      problems of spontaneity of speakers
  – “expensive” source (typing)
both language varieties
  – corpus as text in digital form
WWW as a text source
estimate of the number of words accessible through AltaVista (source: Greg Grefenstette, XRCE, )
automated conversion of texts to a standardized format needed
Corpus encoding standards
pre-mark-up encoding
SGML (1980s to mid-1990s)
  – Text Encoding Initiative (TEI)
  – Corpus Encoding Standard (CES), Ide et al. (1996)
XML (last couple of years)
  – XCES (XML version of CES), Ide, Bonhomme & Romary (2000)
Conversion to XML
2XML
  – tool for conversion
  – input formats
      HTML
      RTF
  – output format
      XML
2XML 1
producer
  – Institute of Linguistics, Faculty of Philosophy, University of Zagreb
programming
  – Softleks d.o.o., Zagreb
platforms
  – Windows 9x/ME/NT/2000
requirements
  – Internet Explorer 5.* to run
2XML 2
principle: two-step conversion
1st step
  – input: HTML or RTF
  – output: intermediate “dirty” XML
2nd step
  – input: “dirty” XML
  – user-defined script applied to it
  – output: XML document
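
The two-step principle above can be illustrated with a minimal Python sketch. This is not 2XML's own code or script language: the names `DirtyXMLBuilder`, `html_to_dirty_xml`, `clean_script`, `convert` and the `RENAME` table are hypothetical, only HTML input is covered (no RTF), and the "user-defined script" is modeled simply as a Python function.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag (assumed subset).
VOID = {"br", "hr", "img", "meta", "link", "input"}

class DirtyXMLBuilder(HTMLParser):
    """Step 1: turn HTML into well-formed but still 'dirty' XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a closed child goes into its tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data

def html_to_dirty_xml(html):
    builder = DirtyXMLBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

# Step 2: a hypothetical user-defined script, here a plain function.
RENAME = {"h1": "head", "p": "p"}  # HTML tag -> target XML tag (assumed mapping)

def clean_script(dirty):
    text = ET.Element("text")
    body = ET.SubElement(text, "body")
    for elem in dirty.iter():
        if elem.tag in RENAME:
            out = ET.SubElement(body, RENAME[elem.tag])
            # Drop inline markup and normalize whitespace.
            out.text = " ".join("".join(elem.itertext()).split())
    return text

def convert(html, script):
    return ET.tostring(script(html_to_dirty_xml(html)), encoding="unicode")
```

Keeping the script separate from step 1 mirrors the design on the slide: the first step only has to produce well-formed XML, and all source-specific cleanup lives in a script that can be swapped per text source.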
2XML Conversion: step 1
2XML Conversion: step 2
2XML user-defined script
2XML Goodies
  – XML tree labeling
  – XML text editing
  – execute script on load
  – batch processing: whole directory
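
Batch processing of a whole directory could look roughly like the following sketch. It assumes a hypothetical `convert` callable that maps source text to XML text; the function name `batch_convert`, the suffix list and the choice to write each `.xml` file next to its source are illustrative assumptions, not 2XML's documented behavior.

```python
from pathlib import Path

def batch_convert(src_dir, convert, suffixes=(".html", ".htm")):
    """Convert every matching file in src_dir and write a .xml file beside it."""
    written = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            target = path.with_suffix(".xml")
            xml_text = convert(path.read_text(encoding="utf-8"))
            target.write_text(xml_text, encoding="utf-8")
            written.append(target)
    return written
```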
2XML: tree labeling & editing
2XML Tokenizer
program which tokenizes XML files
output in two formats
  – tokenized XML file
  – tabbed file
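
As a rough illustration of a tokenizer with these two outputs, the sketch below splits the text of each paragraph into tokens and returns both a tokenized XML string and a tab-separated listing. The element names `w` (word) and `c` (punctuation), the `p` unit, the regular expression and the column layout (token, type, paragraph number) are assumptions made for the example, not the actual 2XML Tokenizer formats.

```python
import re
import xml.etree.ElementTree as ET

# Words are runs of word characters; every other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize_xml(xml_text, unit="p"):
    """Return (tokenized XML string, tab-separated token listing)."""
    root = ET.fromstring(xml_text)
    rows = []
    for n, elem in enumerate(list(root.iter(unit)), start=1):
        tokens = TOKEN_RE.findall("".join(elem.itertext()))
        # Replace the element's content with one child element per token.
        for child in list(elem):
            elem.remove(child)
        elem.text = None
        for tok in tokens:
            tag = "w" if re.match(r"\w", tok) else "c"
            ET.SubElement(elem, tag).text = tok
            rows.append(f"{tok}\t{tag}\t{n}")
    return ET.tostring(root, encoding="unicode"), "\n".join(rows)
```

For example, calling `tokenize_xml("<text><body><p>Dobar dan, svijete!</p></body></text>")` yields five tokens in both outputs: three `w` elements and two `c` elements in the XML, and five rows in the tabbed listing.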
2XML Tokenizer 2
Tokenizer output: dic file
Tokenizer output: XML file