1
Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection
Bruce Robertson, Mount Allison University
2
ἀλήθεια
truth
Ἀλήθεια
‘Breathing’ marks on vowels at the beginning of a word
Accents possible on all vowels
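The breathing mark and accent in ἀλήθεια can be made explicit with Unicode decomposition. The following is a minimal sketch using Python's standard unicodedata module; the helper name combining_marks is hypothetical and is not part of the toolchain described in this talk.

```python
import unicodedata

def combining_marks(word):
    """Return (base letter, [names of combining marks]) pairs for a word.

    Hypothetical helper for illustration only.
    """
    result = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and result:
            # Attach the mark to the most recent base letter.
            result[-1][1].append(unicodedata.name(ch))
        else:
            result.append((ch, []))
    return result

# The word from the slide, written with explicit code points
# to avoid any encoding ambiguity.
WORD = "\u1f00\u03bb\u03ae\u03b8\u03b5\u03b9\u03b1"

for base, marks in combining_marks(WORD):
    print(base, marks)
```

Each accented or aspirated vowel decomposes into a base letter plus one or more combining marks, which is one way to see why an OCR classifier has to choose between treating such glyphs as single composite characters or as separate parts.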
3
Diversity of Greek Fonts in the 19th C.
4
Other Examples
5
Greek OCR With Gamera
Dalitz and Brandt provide an experimental framework
– I added splitting, grouping, SQL output, etc.
Teams of undergraduates making multiple classifiers
– Based on families of fonts
– Comparing strategies of composite characters, splitting, etc.
– Must also train for Latin scripts used
Not yet working on post-processing
6
Good Results
8
Systematic Approach to Automated Greek OCR
Remove the curator from the loop
– especially important for journals, monographs, etc.
– Assign classifier by computational means
Using:
– Federico Boschetti’s ground-truth-less Greek text evaluator
– Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network
9
Process
160 Greek-heavy texts chosen
Of these, random samples of 10 pages were taken
Each sample was processed with each of the 20 classifiers made this summer
The results were evaluated and given a ‘Boschetti score’ from 0 to 1
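The selection step above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the project: score_page is a hypothetical stand-in for the Boschetti evaluator, the function names are invented here, and the workload figure assumes one 10-page sample per text.

```python
import random
from statistics import mean

def sample_pages(page_ids, k=10, seed=0):
    """Draw up to k distinct pages at random from one text."""
    rng = random.Random(seed)
    pages = list(page_ids)
    return rng.sample(pages, min(k, len(pages)))

def best_classifier(pages, classifiers, score_page):
    """Pick the classifier with the highest mean score over the sample.

    score_page(classifier, page) is a hypothetical callable that returns
    a score between 0 and 1 for one OCR'd page.
    """
    means = {c: mean(score_page(c, p) for p in pages) for c in classifiers}
    best = max(means, key=means.get)
    return best, means[best]

# Workload implied by the figures on the slide, assuming one
# 10-page sample per text (an assumption, not stated in the talk).
TEXTS, PAGES_PER_SAMPLE, CLASSIFIERS = 160, 10, 20
total_runs = TEXTS * PAGES_PER_SAMPLE * CLASSIFIERS
print(total_runs)
```

Under that assumption the job amounts to tens of thousands of independent page-level OCR runs, which is the kind of workload that suits a parallel computing network.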
14
Google/ABBYY Line Splitting
15
Gamera’s Text Line Finding (bbox_merging)
16
Replaced with runlength_smearing
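The general idea behind run-length smearing can be illustrated in a few lines. The sketch below covers only the horizontal pass on a binary image stored as lists of 0 (white) and 1 (black); it is not Gamera's implementation, and the names smear_row, smear_horizontal and max_gap are hypothetical.

```python
def smear_row(row, max_gap):
    """Fill white gaps of at most max_gap pixels that lie between black pixels."""
    out = list(row)
    last_black = None
    for i, px in enumerate(row):
        if px:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= max_gap:
                    # Short gap between two black pixels: smear it black.
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out

def smear_horizontal(image, max_gap):
    """Apply horizontal smearing to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]

page = [
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0, 0],
]
for row in smear_horizontal(page, max_gap=2):
    print(row)
```

Smearing joins neighbouring glyphs into larger blobs, so the choice of max_gap decides whether characters merge into words and lines or stay separate.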
17
Two-step processing
18
Future Work
Combining and re-optimizing classifiers?
Assign classifier based on Latin text
– Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output?
Align with Google’s output, and provide Google with corrected Greek
Implement line-splitting from other OCR engines
Discover badly OCR’d Greek in others’ output
Implement OCR correction frameworks described here
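The imprint check proposed under Future Work could look something like the sketch below. The keyword list comes from the slide; the classifier name "oxford_family" and the function guess_classifier are hypothetical placeholders, since the talk does not name its classifiers.

```python
# Imprint keywords from the slide, mapped to a hypothetical classifier name.
IMPRINT_TO_CLASSIFIER = {
    "oxford": "oxford_family",
    "clarendon": "oxford_family",
    "oxonii": "oxford_family",
}

def guess_classifier(first_pages_text, default=None):
    """Return a classifier name if an imprint keyword occurs in the text.

    first_pages_text is the Latin-script OCR output of a book's first pages.
    """
    lowered = first_pages_text.lower()
    for keyword, classifier in IMPRINT_TO_CLASSIFIER.items():
        if keyword in lowered:
            return classifier
    return default

print(guess_classifier("OXONII: E Typographeo Clarendoniano"))
print(guess_classifier("Lipsiae, in aedibus B. G. Teubneri"))
```

A plain substring match is deliberately crude; noisy OCR of the title page would likely call for fuzzy matching, which is left out here.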
19
Common Problems
Assessments of pre-processing strategies and tools
Schemas for page description
20
Thanks
Colleagues in Dynamic Variorum Editions:
– Greg Crane at Perseus / Tufts
– Brian Fuchs at Imperial College
Federico Boschetti
AceNet, especially tech. support of Sergiy Khan