Jan 9, 2004 Symposium on Best Practice LSA, Boston, MA
Slide 1: Resource Conversion
William Lewis, CSU Fresno
Slide 2: Preliminaries
- Eventually any resource will become obsolete
- Resource conversion is inevitable
- One should plan from the start for eventual conversion
- Encode your resource such that it
  - is migratable
  - is reusable
  - will endure
Slide 3: Best Practice?
Simons (this symposium) argues there are 3 relevant formats for encoding data:
1. Working form
2. Presentation form
3. Archival form
(1) is tied to particular software. (2) is generally generated from (1), but is itself often “semantically” sparse.
Slide 4: Best Practice!
Encoding a resource in archival form (3) ensures that
- the resource is reusable, facilitating interoperability
- the resource can be migrated to other formats (including presentation formats)
- the resource endures
Slide 5: Data Survivability
Converting to archival XML form provides for data reuse and ensures survivability:
Working Form -> Conversion Process -> Archival Form -> HTML, PDF, other XML form
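The last step of this flow, deriving a presentation form from the archival form, can be sketched roughly as follows. This is only an illustration using Python's standard library: the element names (entry, headword, pos, gloss), the sample record, and the function entry_to_html are assumptions made for the example, not a format defined in this talk.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical archival record; the element names are illustrative only.
ARCHIVAL_XML = """<entry id="e1">
  <headword>example</headword>
  <pos>n.</pos>
  <gloss>a sample entry</gloss>
</entry>"""


def entry_to_html(xml_text: str) -> str:
    """Render one archival entry as a simple HTML paragraph."""
    entry = ET.fromstring(xml_text)
    headword = entry.findtext("headword", default="")
    pos = entry.findtext("pos", default="")
    gloss = entry.findtext("gloss", default="")
    # Escape the text so that markup characters in the data survive as data.
    return (
        f'<p class="entry"><b>{escape(headword)}</b> '
        f"<i>{escape(pos)}</i> {escape(gloss)}</p>"
    )


html_text = entry_to_html(ARCHIVAL_XML)
```

Because the semantics live in the archival record, the HTML can be thrown away and regenerated at any time, while the reverse direction would have to guess which bold or italic span meant what.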
Slide 6: Data Conversion (“text”)
- Highly dependent on the flexibility of the working form and related software
- Converting from a proprietary, binary format is most difficult and is to be avoided
- Converting from plain text output is easiest, but might be avoided due to potential data loss
- Converting from an enriched text form (Unicode compliant) or XML-coded data is best, but may not always be possible
Slide 7: Data Conversion
Working Form -> Conversion Process -> Archival Form
Slide 8: Data Conversion
Working Form -> Intermediary Form (enriched text) -> Conversion Process (CP) -> Archival Form
Slide 9: Intermediary Conversion
Resources in -> Use
- Word Processor, Spreadsheet -> Print Function or “Save as…”
- Proprietary Flat File DB -> Print Function or other file conversion
- Relational DB -> Data Query (direct to XML?)
- XML or enriched text (inc. Shoebox) -> As is
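For the relational DB case, a data query that writes XML directly might look something like the sketch below. It uses an in-memory SQLite database purely as a stand-in for whatever database the resource actually lives in; the table name lexicon, its columns, the invented sample row, and the helper table_to_xml are all assumptions for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str, row_tag: str = "record") -> str:
    """Dump every row of a table as XML, one child element per column."""
    # The table name is interpolated into the SQL, so pass trusted names only.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        record = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            if value is None:
                continue  # skip empty fields rather than emit empty elements
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Stand-in database with one invented row (not real dictionary data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (headword TEXT, pos TEXT, gloss TEXT)")
conn.execute("INSERT INTO lexicon VALUES (?, ?, ?)", ("paŋa", "n", "invented gloss"))
xml_text = table_to_xml(conn, "lexicon")
```

Querying the database directly keeps the field boundaries that a print or "Save as…" export would flatten into running text.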
Slide 10: Intermediary Conversion
- Important: Ensure that conversion to Intermediary Form suffers no data loss, or that the data loss suffered is minimal
- Danger in “Save As” (and Print to file): data loss is possible
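One cheap way to spot this kind of loss is to compare the character inventory of the working form with that of the exported file. The sketch below is a minimal check under that assumption; the function name find_lost_characters and the sample string are invented for the example, and the "export" is simulated by forcing the text through Latin-1, which cannot represent IPA characters such as ŋ and ʃ.

```python
from collections import Counter


def find_lost_characters(original: str, exported: str) -> Counter:
    """Return the characters that occur less often in the export than in the original."""
    # Counter subtraction keeps only positive counts, i.e. characters that went missing.
    return Counter(original) - Counter(exported)


# Invented sample text containing IPA characters (not real dictionary data).
original = "paŋa ʃi"

# Simulate a lossy "Save as" into a legacy 8-bit encoding.
exported = original.encode("latin-1", errors="replace").decode("latin-1")

lost = find_lost_characters(original, exported)
for char, count in lost.items():
    print(f"lost {count} x U+{ord(char):04X} ({char})")
```

A check like this catches silently replaced characters, but not lost structure such as merged fields, so it complements a field-by-field comparison rather than replacing one.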
Slide 11: Final Conversion
Intermediary to Archival Form (Best Practice XML):
- Font/character transforms
- Macros or methods for enriching and aligning data elements
- Tables or “glossaries” defining how content and form should be interpreted
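A table-driven font/character transform can be sketched as below. The mapping is hypothetical: it assumes a legacy font in which the keystrokes N, S and ? were displayed as the IPA symbols ŋ, ʃ and ʔ. A real conversion would need the actual table for the font in question.

```python
# Hypothetical legacy-font table: keystroke in the legacy font -> Unicode IPA character.
LEGACY_TO_UNICODE = {
    "N": "ŋ",  # velar nasal
    "S": "ʃ",  # voiceless postalveolar fricative
    "?": "ʔ",  # glottal stop
}

# str.maketrans accepts a dict whose keys are single characters.
_TRANSLATION = str.maketrans(LEGACY_TO_UNICODE)


def to_unicode_ipa(legacy_text: str) -> str:
    """Replace every legacy-font keystroke with its Unicode IPA equivalent."""
    return legacy_text.translate(_TRANSLATION)


converted = to_unicode_ipa("paNa Si?")
```

Such a transform should only be applied to fields known to be set in the legacy font; applied to an English gloss, this table would also rewrite ordinary capital letters and question marks.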
Slide 12: Data Conversion: Case Study
Converting the Hopi Dictionary (Hill et al. 1998) from working form (legacy format)
Purpose:
- Build software to extract relevant data from working form
- Generate reusable archival format
  - For dissemination on the Web
  - For use by others
  - To preserve data should DB software become unusable
Slide 13: Hopi Dictionary
[Figure: example entry from the Hopi Dictionary]
Slide 14: Hopi Dictionary Conversion
Until now:
- Generated text file from DB
- Manually converted IPA fonts in MS Word
- Generated PDFs for dissemination
Slide 15: Hopi Dictionary Conversion
New process:
- Convert DB format to “enriched” text
- Software transforms for fonts from text format (Unicode-compliant IPA)
- Identify the grammatical concepts used in entries, linked to GOLD (Farrar & Langendoen, this symposium)
- Generate XML, structured using modified EMELD IGT format (Bow, Hughes & Bird 2003)
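The overall shape of such a pipeline can be sketched as follows. Everything concrete here is an assumption made for illustration: the enriched text is taken to use Shoebox-style backslash markers (\lx, \ps, \ge), the output element names are invented rather than taken from the EMELD format, the concept labels are placeholders rather than actual GOLD identifiers, and the sample record is not a real dictionary entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from field markers to XML element names.
MARKER_TO_ELEMENT = {"lx": "headword", "ps": "pos", "ge": "gloss"}

# Placeholder concept labels standing in for links into an ontology such as GOLD.
POS_TO_CONCEPT = {"n": "Noun", "v": "Verb"}


def parse_record(text: str) -> dict:
    """Split a backslash-marked record into a {marker: value} dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # ignore lines that carry no field marker
        marker, _, value = line[1:].partition(" ")
        fields[marker] = value.strip()
    return fields


def record_to_xml(fields: dict) -> str:
    """Build an XML entry, tagging the part of speech with its concept label."""
    entry = ET.Element("entry")
    for marker, tag in MARKER_TO_ELEMENT.items():
        if marker not in fields:
            continue
        child = ET.SubElement(entry, tag)
        child.text = fields[marker]
        if marker == "ps" and fields[marker] in POS_TO_CONCEPT:
            child.set("concept", POS_TO_CONCEPT[fields[marker]])
    return ET.tostring(entry, encoding="unicode")


# Invented sample record (not real dictionary data).
RECORD = r"""
\lx paŋa
\ps n
\ge invented gloss
"""

xml_entry = record_to_xml(parse_record(RECORD))
```

Keeping the marker and concept mappings in tables, rather than in the code, means the same script can be rerun when the term inventory or the target format changes.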
Slide 16: Archival Hopi Dictionary Record
Slide 17: Recipe for Resource Conversion
- Choose a data format that is easily archived
  - where the software provides for data migration, or
  - the data format itself is easily converted
- Use existing software to bring you as close to Archival Form as possible (Intermediary Form)
- Clearly identify
  - Content and structural semantics (“terms”)
  - Fonts used (and transforms)
  - Data alignment
- Construct transforms/macros/software to convert to Archival Form