TeX2Star: A System for Converting TeX to OpenOffice, by Jeffrey Starr
Overview ● Why does conversion matter? ● Why has it not already been done? – Why is it difficult? ● Proposal: TeX->OpenOffice ● Proposal: TeX->DVI->OpenOffice ● Solution ● Unsolved problems
What is OpenOffice? ● Open Source office suite ● Based on StarOffice, currently owned by Sun Microsystems ● Cross-Platform ● XML based, standards driven ● Semantic-based format
What is TeX? ● Written by Donald E. Knuth ● Solution to declining standards in mathematical typography ● Heavily used in mathematics and physics ● Both a program and a programming language ● Presentation-based format
Why Bother to Convert? ● TeX is rare outside mathematical circles ● Conflicts with publishing software ● Does not fit within the current word processing model ● TeX's purpose is to produce journal-quality typography, not to facilitate editing of content.
Aside: Editable Output ● TeX has many presentation outputs: – DVI – PostScript – PDF – PNG – TIFF – Fax ● TeX has no direct editable outputs.
Solution: TeX->OpenOffice ● Why use the outputs? Read the original document. ● Perfect knowledge of content and (presentational) intent ● Write a program that reads TeX and outputs OpenOffice, instead of DVI
Problems with TeX->OpenOffice ● TeX is a large system – Eight years of development – Too large to reimplement in a semester ● Irregular ● Non-Balanced ● Many special cases
TeX is Irregular ● An irregular language is one in which the typical rules of processing are violated, for example an operator that acts on material both before and after it ● Irregular '\atop': (TeX) – {numerator \atop denominator} ● Regular '\frac': (LaTeX) – \frac{numerator}{denominator}
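A minimal sketch of why the infix form is harder to process: the material before '\atop' cannot be classified as a numerator until the operator itself has been read, so the whole group has to be buffered first. The token list and the function name `regularize` are hypothetical and not part of TeX2Star.

```python
# Infix commands that split the enclosing group into two operands.
INFIX = {"\\atop", "\\over", "\\choose"}

def regularize(group_tokens):
    """Turn the tokens of one {...} group into (operator, left, right).

    The left operand is only known once the infix operator has been seen,
    which is the irregularity described above. Groups without an infix
    operator come back unchanged as (None, tokens, []).
    """
    for i, tok in enumerate(group_tokens):
        if tok in INFIX:
            return (tok, group_tokens[:i], group_tokens[i + 1:])
    return (None, list(group_tokens), [])
```

For example, `regularize(["numerator", "\\atop", "denominator"])` yields a prefix-style triple comparable to the '\frac' form, with the operator first and both operands explicit.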
TeX is not balanced ● A language that is balanced will have an explicit beginning and end to each grouping ● Non-balanced font commands: (TeX) – \bf this is bold \rm this is normal, roman text ● Balanced font commands: (LaTeX) – \textbf{this is bold} this is back to normal
TeX has many special cases ● \par may either: – explicitly end a paragraph – do nothing (if in math mode) – do nothing (if in restricted horizontal mode) – tell TeX to build the current page ● \par is also irregular (acts on material already processed and in the reverse direction) and unbalanced (may or may not be preceded by \indent, a primitive to start a paragraph)
Solution: TeX->DVI->OpenOffice ● Let TeX deal with TeX ● Run TeX on the original text ● Read the resultant DVI output ● Process the DVI output to OpenOffice
Problem: Lack of semantic data ● DVI contains font definitions, text stream, and description of black boxes ● Fonts contain characters, but do not say what those characters are – Especially a problem with ligatures and kerning: “ﬀ” (one ligature glyph) vs. “ff” (two characters) – Also a problem with bold and italic text: bold and italic are their own fonts
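One way to picture the missing information is as a lookup table that the converter has to supply itself. The sketch below assumes the standard Computer Modern text fonts, where the f-ligatures occupy character codes 0x0B to 0x0F; the style table and the function name `decode` are illustrative assumptions, and the fallback to `chr()` is only valid for ASCII letters and digits.

```python
# f-ligature slots in the Computer Modern text fonts (OT1 layout).
OT1_LIGATURES = {0x0B: "ff", 0x0C: "fi", 0x0D: "fl", 0x0E: "ffi", 0x0F: "ffl"}

# Illustrative subset: the font name is the only hint that text is bold or italic.
FONT_STYLES = {
    "cmr10": frozenset(),
    "cmbx10": frozenset({"bold"}),
    "cmti10": frozenset({"italic"}),
}

def decode(font, codes):
    """Map a run of character codes in one font to (text, styles).

    Ligature codes are expanded back to the letters they stand for.
    Other codes fall back to chr(), which is an assumption that holds
    only for ASCII letters and digits.
    """
    text = "".join(OT1_LIGATURES.get(c, chr(c)) for c in codes)
    return text, FONT_STYLES.get(font, frozenset())
```

Under these assumptions, the four codes for "o", the ffi ligature, "c", and "e" set in cmbx10 decode to the six-letter word "office" marked as bold.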
Solution: Add Annotations ● Use interpositioning and the TeX primitive '\special' to send extra information to DVI file ● \special leaves comments that can be read later ● Reading the DVI with proper annotation allows the text to retain some level of semantic information ● Difference between knowing that the next character is smaller and raised versus knowing that the next character is a superscript
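A rough sketch of how annotations could be read back, assuming the DVI file has already been decoded into a list of events. The event tuples, the `tex2star:` prefix, and the `begin`/`end` vocabulary are all hypothetical; the slides do not specify the actual annotation syntax.

```python
PREFIX = "tex2star:"  # hypothetical marker placed in each \special string

def annotate(events):
    """Group characters into (label, text) runs using \\special annotations.

    events is a list of ("char", c) or ("special", string) tuples.
    Specials without the prefix are ignored. Characters outside any
    annotation are labeled "text".
    """
    stack = ["text"]
    runs = []
    for kind, value in events:
        if kind == "special" and value.startswith(PREFIX):
            action, _, name = value[len(PREFIX):].partition(" ")
            if action == "begin":
                stack.append(name)
            elif action == "end" and len(stack) > 1 and stack[-1] == name:
                stack.pop()
        elif kind == "char":
            if runs and runs[-1][0] == stack[-1]:
                # Extend the current run while the label is unchanged.
                runs[-1] = (stack[-1], runs[-1][1] + value)
            else:
                runs.append((stack[-1], value))
    return runs
```

With a `begin superscript` and `end superscript` pair around the "2" in "x2+y", the result is three runs: plain "x", superscript "2", and plain "+y". Without the annotations, the only evidence would be that the "2" is smaller and raised.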
Problem: Unbalanced Tags ● Some primitives are balanced, but many are not ● Tags may affect the document for an arbitrary span of text, or may be local to a paragraph or specific block of text
Solution: Balancing ● Algorithm: – Given: database of tags ● start tag, end tag, 'insert end tag' tags – Go through list of tags, find one that needs help balancing – Go forward along list, finding nearest tag that closes the previous tag, or end of document – Insert end of tag into the list of tags
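The balancing steps above can be sketched as follows. The tag names, the `RULES` database, and its contents are hypothetical; this is one reading of the algorithm on the slide, not the TeX2Star source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagRule:
    end: str            # closing tag to insert
    closers: frozenset  # the 'insert end tag' tags: seeing one implies closure

# Hypothetical tag database for two unbalanced font switches.
RULES = {
    "bold": TagRule("/bold", frozenset({"roman", "italic", "par"})),
    "italic": TagRule("/italic", frozenset({"roman", "bold", "par"})),
}

def balance(tags):
    """Return a copy of tags with missing end tags inserted."""
    out = list(tags)
    i = 0
    while i < len(out):
        rule = RULES.get(out[i])
        if rule is not None:
            # Go forward to the nearest explicit end tag, implicit closer,
            # or the end of the document.
            j = i + 1
            while j < len(out) and out[j] != rule.end and out[j] not in rule.closers:
                j += 1
            # Insert the end tag unless an explicit one was found.
            if j == len(out) or out[j] != rule.end:
                out.insert(j, rule.end)
        i += 1
    return out
```

Applied to the '\bf ... \rm ...' example, `balance(["bold", "text:a", "roman", "text:b"])` places "/bold" just before "roman", which gives the explicit beginning and end that a well-formed output needs.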
Post Document Editing ● Further balancing and insertion of tags may be necessary after first sweep through file ● Tables: – OpenOffice format requires number of columns to be specified – We don't know how many columns will be needed until after we read the entire table – Solution: After processing, go back and insert the needed information
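A small sketch of the go-back-and-insert step for tables, assuming the table arrives as a stream of cell and end-of-row events. The event names and the `header` record are hypothetical stand-ins for whatever the output format actually requires.

```python
def build_table(cell_stream):
    """Collect rows first, then fill in the column count afterwards.

    cell_stream yields ("cell", text) and ("row_end", None) events.
    Returns (header, rows), where header["columns"] is only known
    after the whole table has been read.
    """
    header = {"columns": None}  # placeholder written before the rows
    rows, current = [], []
    for kind, value in cell_stream:
        if kind == "cell":
            current.append(value)
        elif kind == "row_end":
            rows.append(current)
            current = []
    if current:  # final row without an explicit terminator
        rows.append(current)
    # Post-processing: go back and insert the needed information.
    header["columns"] = max((len(r) for r in rows), default=0)
    # Pad short rows so every row matches the declared width.
    rows = [r + [""] * (header["columns"] - len(r)) for r in rows]
    return header, rows
```

The placeholder makes the ordering problem explicit: the column count belongs at the start of the table in the output, but its value depends on the last row read.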
Unsolved Problems ● Footnotes: – Defined by position in the page – Automatic positioning conflicts with paragraph detection tool – Unable to discern between footnotes, extra paragraph, header, or footer ● Non-English alphabets
Conclusion ● Semantics of document are lost in TeX itself, so no hope of recovery ● Overt presentation can be recovered for editing ● Method works to translate an irregular, non-well-formed language into a regular, well-formed language (XML)