Published by Casey Bares. Modified over 9 years ago.
1
Extracting Names Using Layout Clues in Genealogical Books
Aaron Stewart
David W. Embley
March 20, 2010
2
Overview
Problem
Current solutions
Our solution
–Preprocessing (briefly, images only)
–Pattern approach
Future work
3
Problem
4
Process
5
Finding Names
Name recognition in genealogical texts
Focus: Lists, Directories
6
Finding Names
It’s easy for us to spot names…
But how does a computer do it?
Which side was easier?
7
Finding Names
Stanford Named Entity Recognizer
Apache UIMA Framework
CRF (conditional random fields)
MEMM (maximum-entropy Markov models)
Natural Language Processing
?
8
BYU OntoES
Ontology Extraction System
Dictionary
Regular Expressions
9
Part 1: Preprocessing
10
Ancestry.com Data
Word text
Word bounding boxes
Genres:
–Genealogical Books
–City Directories
–Yearbooks
–Newspapers
11
Ancestry.com Data
Inconsistent punctuation
–Commas and periods
–Present in some books, absent in others
Word ordering issue
–Only some books are affected
–Bug in OCR/layout analysis
12
Word Order
The Standing Committee -The Rev William Berrian D ident ; the Rev John McVickar D D D the Rev Pres- I Haioht D D Rev Samuel R Johnson Hoffman Secretary ; the the Hon Gulian C Verplanck D Benjamin D the Hon Mur- ray M Esq Floyd Smith Gouverneur Ogden The to the Esq General Convention -The Rev Edward Y bee D Deputies D the Rev William D Rev Francis Hig- L Hawks D D LL D the Creighton Rev D Vinton the D Hon Murray Hoffman the Hon John A Francis Dix Hon D the Luther Bradish the Hon Nathaniel S Benton
13
Word Order - Corrected
The Standing Committee -The Rev William Berrian D D Pres- ident ; the Rev John McVickar D D the Rev Benjamin I Haioht D D Secretary ; the Rev Samuel R Johnson D D the Hon Mur- ray Hoffman the Hon Gulian C Verplanck Gouverneur M Ogden Esq Floyd Smith Esq The Deputies to the General Convention -The Rev Edward Y Hig- bee D D the Rev William Creighton D D the Rev Francis L Hawks D D LL D the Rev Francis Vinton D D the Hon Murray Hoffman the Hon John A Dix the Hon Luther Bradish the Hon Nathaniel S Benton
14
Word Order
Notice the imaginary green line…
Some tokens extend below it
–These are pushed down to the next line!
–This is a bug
–Clearly, we can do better
The Standing Committee -The Rev William Berrian D D Pres- ident ; the Rev John McVickar D D D the Rev Pres- I Haioht
15
DEG/Ancestry OCR Reformatting
TLP original reordering code
Page separator
Line segment identifier
Line ordering
RANSAC margin finder
16
Page Separator
Looks for any place where a vertical line can cleanly separate the text
Not robust to skew
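The idea of a clean vertical separator can be sketched as follows. This is a minimal illustration, not the code from the talk: the `(left, top, right, bottom)` box layout, the function names, and the `min_width` threshold are all assumptions made for the example. It merges the horizontal extents of the word boxes and reports any x-range that no box touches.

```python
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def vertical_gaps(boxes: List[Box], min_width: float = 10.0) -> List[Tuple[float, float]]:
    """Return x-ranges (start, end) that no word box overlaps."""
    intervals = sorted((left, right) for left, _, right, _ in boxes)
    gaps: List[Tuple[float, float]] = []
    if not intervals:
        return gaps
    covered_to = intervals[0][1]  # rightmost x covered so far
    for left, right in intervals[1:]:
        if left - covered_to >= min_width:
            gaps.append((covered_to, left))
        covered_to = max(covered_to, right)
    return gaps


def split_pages(boxes: List[Box], min_width: float = 10.0) -> List[List[Box]]:
    """Split at the middle of the widest clean gap, if there is one."""
    gaps = vertical_gaps(boxes, min_width)
    if not gaps:
        return [boxes]
    start, end = max(gaps, key=lambda g: g[1] - g[0])
    cut = (start + end) / 2
    return [[b for b in boxes if b[2] <= cut],
            [b for b in boxes if b[0] >= cut]]
```

A single box that straddles the gutter, as can happen on a skewed scan, removes the gap entirely, which is one way to read the "not robust to skew" caveat.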
17
Page Separator
18
Line Segment Identifier
Combines words within about 2 spaces
Handles skew reasonably well
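One way to combine words that sit within about two spaces of each other is sketched below. The `Word` class, the `space_width` parameter, and the greedy left-to-right strategy are assumptions for illustration; the talk does not say how the space width is estimated or how candidates are matched.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical OCR word: text plus its bounding box."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def vertically_overlaps(a: Word, b: Word) -> bool:
    """True when the two boxes share some vertical extent."""
    return min(a.bottom, b.bottom) - max(a.top, b.top) > 0


def line_segments(words: List[Word], space_width: float,
                  max_spaces: float = 2.0) -> List[List[Word]]:
    """Greedily chain words whose horizontal gap is at most max_spaces spaces."""
    segments: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.left):
        for seg in segments:
            last = seg[-1]
            gap = word.left - last.right
            if vertically_overlaps(last, word) and gap <= max_spaces * space_width:
                seg.append(word)
                break
        else:
            # No existing segment is close enough: start a new one.
            segments.append([word])
    return segments
```

Because each word is compared only with the last word of a segment, a line may drift up or down gradually, which gives this sketch some tolerance for skew.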
19
Line Segment Identifier
20
Line Ordering
Works well in most cases
Excessive skew or overlap is harder
21
RANSAC Margin Finder
RANSAC: Random Sample Consensus
Finds a line in the presence of noise
Effective for finding left-aligned margins, tab stops, table columns
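A generic RANSAC line fit over the left edges of line segments might look like the sketch below. It is a textbook version under stated assumptions, not the presenters' implementation: the point format, `iterations`, `tolerance`, and the fixed `seed` are illustrative choices. The margin is modeled as x = slope * y + intercept, since a margin is close to vertical.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # hypothetical (x, y) of a line segment's left edge


def ransac_margin(points: List[Point], iterations: int = 200,
                  tolerance: float = 2.0, seed: int = 0
                  ) -> Tuple[Optional[Tuple[float, float]], List[Point]]:
    """Fit x = slope * y + intercept to the largest consistent subset of points."""
    rng = random.Random(seed)
    best_line: Optional[Tuple[float, float]] = None
    best_inliers: List[Point] = []
    if len(points) < 2:
        return best_line, best_inliers
    for _ in range(iterations):
        # Hypothesize a line from two randomly chosen points.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if y1 == y2:
            continue  # horizontal pair cannot define a margin
        slope = (x2 - x1) / (y2 - y1)
        intercept = x1 - slope * y1
        # Consensus set: points within `tolerance` horizontally of the line.
        inliers = [(x, y) for x, y in points
                   if abs(x - (slope * y + intercept)) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = (slope, intercept), inliers
    return best_line, best_inliers
```

Indented lines, centered headings, and stray OCR boxes simply fall outside the consensus set, so they do not pull the fitted margin the way they would in a least-squares fit over all points.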
22
RANSAC Margin Finder
23
Margin Finder – Future Work
Key: Left, Center, Right
24
Margin Finder – Future Work
Line Wrap?
25
Margin Finder – Future Work
ABBYY FineReader handles
–Paragraphs
–Newspaper columns
But has trouble with
–Hanging indents
–Outline indentation (possibly)
26
Part 2: Pattern Finding
27
Pattern Finding
1. Apply baseline name extractor (OntoES)
2. Apply margin finder and insert markers
3. Find left and right context for each name
4. Apply common contexts to extract more names
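Steps 3 and 4 can be sketched as below. Everything here is an assumption made for illustration: the `<M1>` margin marker, the choice of a single token on each side as the context, the `max_len` limit, and the toy directory line are not taken from the talk. The baseline names stand in for the output of step 1.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int]  # [start, end) token indexes of a name


def contexts(tokens: List[str], names: List[Span]) -> Counter:
    """Count (left token, right token) pairs around the known names."""
    counts: Counter = Counter()
    for start, end in names:
        if start > 0 and end < len(tokens):
            counts[(tokens[start - 1], tokens[end])] += 1
    return counts


def apply_context(tokens: List[str], left: str, right: str,
                  max_len: int = 4) -> List[Span]:
    """Find spans of 1..max_len tokens enclosed by the left and right context."""
    found: List[Span] = []
    for i, tok in enumerate(tokens):
        if tok != left:
            continue
        last_end = min(i + 1 + max_len, len(tokens) - 1)
        for end in range(i + 2, last_end + 1):
            if tokens[end] == right:
                found.append((i + 1, end))
                break
    return found


# Toy directory text with a hypothetical margin marker "<M1>".
tokens = ("<M1> Adams John , carpenter <M1> Baker Mary , teacher "
          "<M1> Zabriskie Orrin , clerk").split()
baseline = [(1, 3), (6, 8)]  # names the baseline extractor is assumed to find
(left, right), _ = contexts(tokens, baseline).most_common(1)[0]
extracted = apply_context(tokens, left, right)
```

In this toy case the most common context is a margin marker on the left and a comma on the right, and applying it also picks up the third entry that the baseline missed.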
28
Pattern Finding
1. Apply baseline name extractor (OntoES)
29
Pattern Finding
2. Apply margin finder and insert markers
(Figure labels: LEVEL 1, LEVEL 2)
30
Pattern Finding
3. Find left and right context for each name
(Figure labels: LEVEL 1, LEVEL 2)
31
Pattern Finding
4. Apply common context patterns to extract more names
(Figure labels: LEVEL 1, LEVEL 2)
32
Pattern Finding – Sample Results
Baseline Results
–Precision: 40%
–Recall: 31.25%
–F1: 35.09%
Results of Most Salient Pattern
–Precision: 51.52%
–Recall: 53.12%
–F1: 52.31%
Not all results are this good!
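The F1 figures follow from the reported precision and recall through the standard harmonic-mean formula, F1 = 2PR / (P + R). The short check below uses only the percentages shown on the slide; the function name is arbitrary.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline_f1 = round(100 * f1_score(0.40, 0.3125), 2)    # 35.09
pattern_f1 = round(100 * f1_score(0.5152, 0.5312), 2)   # 52.31
```

Both values agree with the slide: the pattern raises F1 by about 17 percentage points over the baseline on this sample.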
33
Output
Java Advanced Imaging library
–JPEG 2000
–TIF
Bounding Boxes
34
Contributions
Most of our work is just work
–Practical
–Not novel
Possible exception: RANSAC for Margins
–Current research topic (2009)
–Ray Smith / Tesseract
Possible exception: L/R patterns
35
Challenges
Evaluation
–More aligned data
–Annotation tool
Other books
–Centered and right-aligned text
–Knowing when to apply patterns
36
Challenges (continued)
Additional Patterns
–Not just city directories (too trivial?)
–Include other books
Extent
Sanity check on non-pattern books
37
Possible Approach
Publish the work as it is?...
Add centering and right alignment
Add another book…
38
My Preferred Approach
Build a useful interactive tool
Add features incrementally
When my friends say “wow!”, it’s time to publish.
39
Future Challenges
BYU Digital Collections
–Not searchable for names
–Needs further processing
Etc.
40
Work to Do…
Organize data into a collection
Index it
Provide a search interface