An exercise in conversion
Dirk
eHumanities
- the task
- the method
- the lessons
- the result
  ◦ demo
JapAM Descartes Correspondence
- ca. 700 letters
- 69,237 lines
- 600 formulas
- 4.2 MB (without the 311 pictures)
CKCC corpus Descartes
XML: Text Encoding Initiative (TEI)
~ 35,000 elements, of which
- 7,200 metadata
- 7,700 paragraphs
- 6,200 formulas
- 6,000 text-formattings
- 4,200 structure
- 2,900 page-breaks
- 538 images
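Element counts like the ones on this slide can be obtained by walking the XML tree and tallying tag names. The following is a minimal sketch, not the tool used for the talk; the sample document and the function name `count_elements` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature TEI-like sample; the real corpus is not reproduced here.
SAMPLE = """<TEI>
  <teiHeader><title>Letter 1</title></teiHeader>
  <text>
    <pb n="1"/>
    <p>First paragraph with <formula>x^2</formula>.</p>
    <p>Second paragraph.</p>
    <pb n="2"/>
    <p>Third paragraph with <hi rend="italic">emphasis</hi>.</p>
  </text>
</TEI>"""

def count_elements(xml_text):
    """Return a Counter mapping element name to number of occurrences."""
    root = ET.fromstring(xml_text)
    # Drop a namespace prefix such as '{http://www.tei-c.org/ns/1.0}' if present.
    return Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())

counts = count_elements(SAMPLE)
for name, n in counts.most_common():
    print(f"{n:4d}  {name}")
print("total:", sum(counts.values()))
```

Grouping the raw tag names into categories such as "metadata" or "structure" would need an extra mapping that depends on the encoding conventions of the corpus.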
- observation
- non-algorithmic changes
- consolidation
- proofs
use digital equipment:
- your text-editor
- your scripting language
- your regular expressions
replace   =(.*?)$   by   match 1   ???
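One way to answer the "???" is to try the replacement on a small input. The sketch below uses Python's `re` module rather than the Perl of the talk, and the function name `drop_first_equals` is made up for this example. The pattern starts at the leftmost `=`; because of the `$` anchor, the lazy group still has to run to the end of the line, so replacing the match by group 1 removes only that first `=`.

```python
import re

def drop_first_equals(line):
    # '=(.*?)$' starts at the leftmost '='; the lazy group must still reach
    # the end of the line because of '$', so it captures everything after it.
    return re.sub(r"=(.*?)$", r"\1", line)

print(drop_first_equals("a=b=c"))                # prints: ab=c
print(drop_first_equals("no equals sign here"))  # prints the line unchanged
```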
[Diagram: conversion process. Subtasks: ..., formulas, meta, closers, ... Stages: canonical, initial, corrected, improved, checked. Further steps: metadata, combining.]
convert.pl
- 100 KB of program code text
- = 25 densely typed pages
- = 3427 lines, of which 2175 real code lines
- Code/Input = 1/32
1/3 of the tasks need 2/3 of the code

task group (number of tasks)      share of code    share of run time
formulas (2)                      37 %             29 %
headers, openers, closers (3)     16 %              6 %
meta and images (3)               11 %             10 %

total run time (25): 40 sec
1. Unicode is your friend
2. Split into many subtasks
3. task = configuration + workflow
4. Count and check
5. Performance matters
6. Do not give up automation
(2a) that can be run separately
(2b) that can be reordered easily
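Lessons 2 to 4 can be illustrated together: small subtasks, a configuration that fixes their order, and a count of changes that is checked against an expectation. This is only a sketch under invented assumptions, not a port of convert.pl. The input conventions (`$...$` for formulas, `[p. 12]` for page breaks) and all names (`SUBTASKS`, `CONFIG`, `run`) are hypothetical.

```python
import re

# Each subtask maps text to (new_text, number_of_changes).
def mark_formulas(text):
    # Hypothetical input convention: formulas are written between $ signs.
    return re.subn(r"\$(.+?)\$", r"<formula>\1</formula>", text)

def mark_page_breaks(text):
    # Hypothetical input convention: page breaks look like [p. 12].
    return re.subn(r"\[p\. (\d+)\]", r'<pb n="\1"/>', text)

SUBTASKS = {"formulas": mark_formulas, "pagebreaks": mark_page_breaks}

# Configuration: the order of the subtasks plus the number of changes expected.
# Reordering the workflow means reordering this list, nothing else.
CONFIG = [("pagebreaks", 1), ("formulas", 2)]

def run(text, config, only=None):
    """Run the configured subtasks in order; 'only' selects a subset by name."""
    report = []
    for name, expected in config:
        if only is not None and name not in only:
            continue
        text, found = SUBTASKS[name](text)
        # Count and check: record whether the count matches the expectation.
        report.append((name, found, found == expected))
    return text, report

source = "[p. 3] Let $x$ be given, then $x^2$ follows."
result, report = run(source, CONFIG)
print(result)
for name, found, ok in report:
    print(f"{name}: {found} changes, {'ok' if ok else 'CHECK'}")
```

Calling `run(source, CONFIG, only={"formulas"})` executes a single subtask on its own, which corresponds to (2a).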
was 30+ seconds, is now 2.07 seconds
- many new subtasks based on same template (gain = 15 * 30 s = 7.5 min per run)
- many, many runs before everything is OK (gain = 100 * 7.5 min = 12.5 hours CPU-time)
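The gain figures on this slide follow from simple unit conversion. The short check below uses the rounded saving of 30 seconds per subtask, as the slide does; the variable names are chosen for this example only.

```python
seconds_saved_per_subtask = 30  # rounded, as on the slide (30+ s down to 2.07 s)
subtasks = 15                   # subtasks built on the same template
runs = 100                      # runs before everything is OK

minutes_per_run = subtasks * seconds_saved_per_subtask / 60
hours_total = runs * minutes_per_run / 60

print(minutes_per_run)  # 7.5 minutes per run
print(hours_total)      # 12.5 hours of CPU time
```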
we used a lot of expert knowledge, which has all been transferred to
- the source
- consolidated extra inputs
so the conversion is still repeatable and modifiable

[Diagram: source, corrections and hints feed the CKCC conversion program, which produces formulas, meta, closers and results.]
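One possible way to keep expert corrections outside the program, so that the conversion stays repeatable, is to store them as data and apply them on every run. The sketch below is an assumption about how such an extra input could work, not a description of the actual correction files. The tuple format `(line_number, old, new)`, the function name `apply_corrections` and the sample lines are all invented.

```python
def apply_corrections(lines, corrections):
    """Apply (line_number, old, new) corrections; return new lines and problems."""
    fixed = list(lines)
    problems = []
    for lineno, old, new in corrections:
        current = fixed[lineno - 1]
        if old in current:
            fixed[lineno - 1] = current.replace(old, new, 1)
        else:
            # The source changed or the correction is stale: report, do not guess.
            problems.append((lineno, old))
    return fixed, problems

# Made-up sample lines and corrections, for illustration only.
source_lines = ["Monsieur,", "Ie vous escris de Leyde.", "Vostre serviteur"]
corrections = [(2, "Ie", "Je"), (3, "serviter", "serviteur")]

fixed, problems = apply_corrections(source_lines, corrections)
print(fixed)
print("stale corrections:", problems)
```

Reporting a correction that no longer matches, instead of silently skipping it, keeps the extra inputs in step with the source when either of them is modified.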