Alexander Sterin, Dmitri Nikolaev, RIHMI-WDC


1 Alexander Sterin, Dmitri Nikolaev, RIHMI-WDC
Preparing Historical Upper Air Data Sets Based on Hardcopy Documents: RIHMI Experience. Alexander Sterin, Dmitri Nikolaev, RIHMI-WDC, 6 Korolyov str., Obninsk, Kaluga region, Russian Federation. mail:

2 Preparing Historical Upper Air Data Sets Based on Hardcopy Documents: RIHMI Experience
Alexander Sterin, Dmitri Nikolaev, RIHMI-WDC, Russia
ABSTRACT
Digitizing hardcopy data requires case-by-case decisions that depend on the features and specifics of each hardcopy document: the digitizing method (manual or OCR), variable transformations, data management operations, quality checks and, finally, output formats. Taken together, these decisions form technologies, although separate elements of these technologies may be of different origin and may simply be different tools. The paper describes solutions for digitizing hardcopy data at RIHMI-WDC (Obninsk, Russia). The elements of the technologies include OCR (for printed documents, where OCR is acceptable) or manual filling of MS ACCESS tables (where OCR is not acceptable). For the OCR technology, data are collected in MS EXCEL and given a primary check based on the conditional formatting option of MS EXCEL. Further steps include import of the EXCEL or ACCESS tables into SAS System software and numerous data transformation and processing operations. For further quality checking, statistics are calculated and the data are visualized. Each suspect value discovered at this stage is re-checked against the primary hardcopy documents, including the corrigendum lists of each publication. The processing is then repeated after the values are corrected. Finally, the data are output in the formats most acceptable for ERA CLIM. The paper contains a detailed description of each step of the technologies. Very simple solutions, such as those based on OCR (ABBYY FineReader), MS EXCEL and MS ACCESS, are shown to be effective as elements of the constructed technologies.
1/1/2019 ERA CLIM ASSEMBLY READING 2013


4 VARIETY OF SOURCES: Upper Air Yearbook Part 1: all types on same page
For Slutsk (Pavlovsk) near Saint-Petersburg, the yearbook contains: radiosonde observations, kite observations, pilot balloon observations.

5 STEPS OF THE TECHNOLOGY: PRE-QC OF DATA

6 STEPS OF THE TECHNOLOGY: PRE-QC OF DATA: CONDITIONAL FORMATTING HELPS TO DETECT TYPICAL ERRORS

7 STEPS OF THE TECHNOLOGY: PRE-QC OF DATA
Different sources of errors: errors in original books, errors from OCR, errors from manual digitizing.
Errors in original books (published in s) appeared at publishing or were caused by paper quality. Usual QC procedures produce quality flags.
Errors from OCR processing: unlike usual QC procedures, which produce quality flags, here it is possible to go back to the original publication right away and correct the errors!

8 STEPS OF THE TECHNOLOGY: TYPICAL ERRORS IN OCR
Missing decimal comma (dot or space instead) or spurious symbols: non-numeric cell value
Loss of the minus sign (-)
Loss of the decimal comma: value 10 or 100 times bigger than the correct one
Incorrect digit (4 as 1, 0 as 8, 7 as 2, and vice versa, etc.)
Many other typical errors…
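The first and third error types above can be caught cell by cell. The following is a minimal Python sketch of such a check, not the EXCEL conditional formatting rules actually used at RIHMI-WDC. The function name `check_cell` and the plausibility range (temperature in degrees Celsius) are assumptions made for illustration.

```python
import re

# Hypothetical plausibility range for upper-air temperature, deg C (an assumption).
T_MIN, T_MAX = -90.0, 50.0

# A valid cell: optional minus, digits, optional decimal comma followed by digits.
NUMERIC = re.compile(r"^-?\d+(,\d+)?$")

def check_cell(text, lo=T_MIN, hi=T_MAX):
    """Return 'ok' or a short label describing the suspected OCR error."""
    cell = text.strip()
    if not NUMERIC.match(cell):
        # Dot or space instead of the comma, or stray symbols.
        return "non-numeric"
    value = float(cell.replace(",", "."))
    if lo <= value <= hi:
        return "ok"
    # A lost decimal comma makes the value 10 or 100 times too large.
    for factor in (10, 100):
        if lo <= value / factor <= hi:
            return f"possible lost comma (x{factor})"
    return "out of range"

print(check_cell("-12,5"))  # a well-formed value
print(check_cell("12.5"))   # dot instead of comma
print(check_cell("-125"))   # comma lost
```

A lost minus sign or a single wrong digit usually leaves the value inside the plausible range, so a per-cell rule like this cannot catch it. Those errors need comparison with neighbouring values or the later statistical checks.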

9 STEPS OF THE TECHNOLOGY: GENERAL PROCESS FOR OCR
OCR by FR10
XLS sheet: TABLE OF DATA
PRE-QC1 by EXCEL conditional formatting
IMPORT of XLS to SAS
PROCESSING BY SAS: data variables transformed and generated if needed; data management operations; tables of statistics and statistical graphs (boxplots)
PROCESSING BY SAS: OUTPUT OF DATA TO TXT FORMAT
Export of tables to ORACLE
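The statistics-and-boxplots step is what exposes suspect values that pass the per-cell pre-QC. As a rough illustration in Python (the original processing was done in SAS, and the exact statistics used are not given in the slides), the usual boxplot whisker rule can be sketched as follows. The function name `boxplot_outliers` and the multiplier 1.5 are assumptions.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return (index, value) pairs lying outside the boxplot whiskers.

    Uses the common rule: below Q1 - k*IQR or above Q3 + k*IQR.
    """
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Invented example series; the fifth value mimics a lost minus sign.
temps = [-12.5, -13.0, -11.8, -12.9, 12.5, -12.2, -13.4, -12.0]
print(boxplot_outliers(temps))
```

Each value flagged this way is only a candidate for re-checking against the hardcopy page and its corrigendum list. It is not corrected automatically.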

10 STEPS OF THE TECHNOLOGY: GENERAL PROCESS FOR MANUAL DIGITIZING
CONSTRUCTING TABLE IN MS ACCESS AND MANUAL DIGITIZING
MS ACCESS TABLE OF DATA
IMPORT of MDB TABLE to SAS
PROCESSING BY SAS: data variables transformed and generated if needed; data management operations; tables of statistics and statistical graphs (boxplots)
PROCESSING BY SAS: OUTPUT OF DATA TO TXT FORMAT
Export of tables to ORACLE

11 STEPS OF THE TECHNOLOGY: STATISTICS AND REPORTING

12 STEPS OF THE TECHNOLOGY: Post-processing of dates and times of observations
When the datasets are created, they contain observation dates and times as “book values” (these values need to be preserved!).
“Book values” of date and time vary between books and between parts of books; they can be UTC (Greenwich), Moscow zone time, local zone time, etc...
Conversion to UTC needs to be done after digitizing but before output of the final data sets.
UTC date and time should be contained in the final data sets in parallel with the “book values”.
Considering the specifics of the situation (vast territory, variety of date/time units, etc.), this is not a simple problem! In marginal cases, the month and even the year could change.
RIHMI has had a negative experience with simplified corrections of date and time: a lesson must be learned from this!

13 STEPS OF THE TECHNOLOGY: Post-processing of dates and times of observations
Based on the texts of book introductory paragraphs, comments, table descriptions, notes, and so on, identify case by case which “book time” is used in each situation.
Next, identify the shift between book time and UTC for the observation period considered. Given the USSR time zone maps of old periods, this is not a simple problem, but anyway...

14 STEPS OF THE TECHNOLOGY: Post-processing of dates and times of observations
Which “book time” is used in each situation? What is the shift between “book time” and UTC in each situation?
Programming:
First, using the “book values” of year, month, day, hour, minutes, apply special transforming functions to compose a BOOK DATETIME variable in a special computer format (SAS in our case).
Second, using special shifting functions for variables in that datetime format, produce a new UTC DATETIME variable in the same special computer format (SAS in our case).
Third, apply special transformation functions to the UTC DATETIME variable to decompose it and create the modified UTC values of year, month, day, hour, minutes.
Looks complicated, but is correct!
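The three programming steps above (compose, shift, decompose) can be sketched in Python rather than SAS. This is an illustration only, not the authors' code. The function name `book_to_utc` and the offset used in the example are hypothetical; the real offset has to be established case by case from the book and the time zone rules of the period.

```python
from datetime import datetime, timedelta

def book_to_utc(year, month, day, hour, minute, utc_offset_hours):
    """Convert "book" date/time fields to UTC fields, keeping both.

    utc_offset_hours is the amount by which book time runs ahead of UTC;
    it must be supplied by the person who identified the book time.
    """
    # Step 1: compose a single datetime value from the book fields.
    book_dt = datetime(year, month, day, hour, minute)
    # Step 2: shift the datetime value as a whole, never field by field.
    utc_dt = book_dt - timedelta(hours=utc_offset_hours)
    # Step 3: decompose the shifted value into separate UTC fields.
    return {
        "book": (year, month, day, hour, minute),
        "utc": (utc_dt.year, utc_dt.month, utc_dt.day, utc_dt.hour, utc_dt.minute),
    }

# Hypothetical case: book time 3 hours ahead of UTC, observation at 01:30 on 1 January.
print(book_to_utc(1958, 1, 1, 1, 30, 3))
```

In this example the UTC day, month and year all differ from the book values, which is the marginal case that a simplified correction of the hour field alone would get wrong.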

15 STEPS OF THE TECHNOLOGY: Compiling outputs
Outputs are combined into bundles. Typical content of a bundle:
Readme document
Scanned primary materials in PDF format (pages as images)
Digitized inputs after pre-QC (EXCEL sheets with colored conditional formatting, ACCESS MDB). Errors detected by the subsequent statistical calculations and plotting are also corrected.
Computer program scripts for all processing, transformations, statistical calculations and graphics, and output into txt formats
PDF files containing the results of statistical calculations and graphics
Data sets in the format of the processing software (SAS in our case), after all processing and transformations
Output data in txt format, ready to be imported by the WP1 database and by the Feedback Archive
Document (in DOC format) describing the txt format of the output data

16 Conclusions and lessons
Transfer from paper to PDF images is an undoubted plus!
Before planning the digitizing of old data, one must be sure to have enough reliable and checked information to put each digitized value in its proper place in 4D time-space coordinates! It must be taken into account that the understanding of the problem was different 70 years ago!
Within ERA CLIM, we prepared new, value-added, goal-oriented data sets rather than simply digitizing.
Two different technologies were implemented: one for OCR sources and one for manual digitizing. Many elements of both technologies are identical.
Reduced skepticism (increased optimism!) about OCR is one of the pluses and one of the lessons from ERA CLIM.

17 Conclusions and lessons
The pre-QC solutions are specific, because the kinds of errors in the digitizing process are specific.
Some elements of the technologies are based on very simple solutions (e.g. conditional formatting in EXCEL). But these solutions are effective!
Other elements of the technologies are more complicated than might be expected (e.g. time transformations). But this is needed to resolve specific problems correctly!
How universal is this experience? Depending on the specifics of the hardcopy publications, the technologies or their elements could be used in the future.

18 Questions, comments? Thank you!

