Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010.


1 Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010

2 Overview
Problem
Current solutions
Our solution
–Preprocessing (briefly, images only)
–Pattern approach
Future work

3 Problem

4 Process

5 Finding Names Name recognition in genealogical texts Focus: Lists, Directories

6 Finding Names It’s easy for us to spot names… But how does a computer do it? Which side was easier?

7 Finding Names Stanford Named Entity Recognizer Apache UIMA Framework CRF MEMM Natural Language Processing ?

8 BYU OntoES Ontology Extraction System Dictionary Regular Expressions

9 Part 1: Preprocessing

10 Ancestry.com Data
Word text
Word bounding boxes
Genres:
–Genealogical Books
–City Directories
–Yearbooks
–Newspapers

11 Ancestry.com Data
Inconsistent punctuation
–Commas and periods
–Present in some books, absent in others
Word ordering issue
–Only some books are affected
–Bug in OCR/layout analysis

12 Word Order The Standing Committee -The Rev William Berrian D ident ; the Rev John McVickar D D D the Rev Pres- I Haioht D D Rev Samuel R Johnson Hoffman Secretary ; the the Hon Gulian C Verplanck D Benjamin D the Hon Mur- ray M Esq Floyd Smith Gouverneur Ogden The to the Esq General Convention -The Rev Edward Y bee D Deputies D the Rev William D Rev Francis Hig- L Hawks D D LL D the Creighton Rev D Vinton the D Hon Murray Hoffman the Hon John A Francis Dix Hon D the Luther Bradish the Hon Nathaniel S Benton

13 Word Order - Corrected The Standing Committee -The Rev William Berrian D D Pres- ident ; the Rev John McVickar D D the Rev Benjamin I Haioht D D Secretary ; the Rev Samuel R Johnson D D the Hon Mur- ray Hoffman the Hon Gulian C Verplanck Gouverneur M Ogden Esq Floyd Smith Esq The Deputies to the General Convention -The Rev Edward Y Hig- bee D D the Rev William Creighton D D the Rev Francis L Hawks D D LL D the Rev Francis Vinton D D the Hon Murray Hoffman the Hon John A Dix the Hon Luther Bradish the Hon Nathaniel S Benton

14 Word Order
Notice the imaginary green line…
Some tokens extend below it:
–These are pushed down to the next line!
–This is a bug
–Clearly, we can do better
The Standing Committee -The Rev William Berrian D D Pres- ident ; the Rev John McVickar D D D the Rev Pres- I Haioht
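One possible way to avoid the pushed-down tokens, sketched here as an assumption rather than as the method the slides used, is to assign each word to a text line by the vertical center of its bounding box instead of its bottom edge. The `(text, left, top, right, bottom)` tuple layout, the `line_height` parameter, and the coordinates are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical word record: (text, left, top, right, bottom), y grows downward.
Word = Tuple[str, float, float, float, float]


def group_into_lines(words: List[Word], line_height: float) -> List[List[str]]:
    """Group words into text lines by vertical center, then sort left to right."""
    lines = []  # each entry: {"center": float, "words": [Word, ...]}
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        center = (word[2] + word[4]) / 2
        # A word joins the current line if its center is within half a line height.
        if lines and abs(center - lines[-1]["center"]) <= line_height / 2:
            lines[-1]["words"].append(word)
        else:
            lines.append({"center": center, "words": [word]})
    return [[w[0] for w in sorted(line["words"], key=lambda w: w[1])]
            for line in lines]


# "Pres-" reaches slightly below its neighbours but stays on the first line.
sample = [
    ("Berrian", 0, 0, 50, 10),
    ("D", 55, 0, 62, 10),
    ("Pres-", 70, 0, 100, 13),
    ("ident", 0, 20, 30, 30),
    (";", 35, 22, 38, 33),
]
grouped = group_into_lines(sample, line_height=20)
```

Using the center makes the grouping tolerant of tokens whose boxes extend a few pixels below the line, which is the failure the slide describes.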

15 DEG/Ancestry OCR Reformatting
TLP original reordering code
Page separator
Line segment identifier
Line ordering
RANSAC margin finder

16 Page Separator
Looks for any place where a vertical line can cleanly separate the text
Not robust to skew
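A minimal sketch of the page-separator idea, assuming word bounding boxes are available as `(left, top, right, bottom)` tuples: project every box onto the x axis and look for the widest gap that no box crosses. The function name, the `min_gap` threshold, and the coordinates are illustrative assumptions, not taken from the original code.

```python
from typing import List, Optional, Tuple

# Hypothetical box layout: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def find_page_separator(boxes: List[Box], min_gap: float = 20.0) -> Optional[float]:
    """Return the x coordinate of the widest clean vertical gap, or None."""
    spans = sorted((left, right) for left, _, right, _ in boxes)
    if not spans:
        return None
    best_gap, best_x = 0.0, None
    covered_right = spans[0][1]  # rightmost x covered by boxes seen so far
    for left, right in spans[1:]:
        gap = left - covered_right
        if gap >= min_gap and gap > best_gap:
            best_gap, best_x = gap, covered_right + gap / 2.0
        covered_right = max(covered_right, right)
    return best_x


two_pages = [(10, 0, 100, 10), (20, 20, 180, 30),
             (260, 0, 400, 10), (270, 20, 380, 30)]
separator_x = find_page_separator(two_pages)
```

Because the gap must be clear along a perfectly vertical line, a skewed scan can close it entirely, which matches the slide's note that the approach is not robust to skew.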

17 Page Separator

18 Line Segment Identifier
Combines words within about 2 spaces
Handles skew reasonably well
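The merging rule can be sketched as follows, assuming each word carries a bounding box and that an estimate of the width of one space (`space_width`, a hypothetical parameter) is known. Each word is compared only with the last word of a segment, so a gradual vertical drift from skew does not break the chain. This is an illustration of the idea, not the original implementation.

```python
from typing import List, Tuple

# Hypothetical word record: (text, left, top, right, bottom).
Word = Tuple[str, float, float, float, float]


def build_line_segments(words: List[Word], space_width: float,
                        max_spaces: float = 2.0) -> List[List[Word]]:
    """Chain words whose horizontal gap is at most about max_spaces spaces."""
    segments: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[1]):  # left to right
        _, left, top, _, bottom = word
        for segment in segments:
            _, _, prev_top, prev_right, prev_bottom = segment[-1]
            gap = left - prev_right
            # Require some vertical overlap with the previous word only.
            overlaps = min(bottom, prev_bottom) - max(top, prev_top) > 0
            if overlaps and -space_width <= gap <= max_spaces * space_width:
                segment.append(word)
                break
        else:
            segments.append([word])
    return segments


sample_words = [
    ("The", 0, 0, 30, 10), ("Rev", 36, 1, 66, 11), ("William", 72, 2, 130, 12),
    ("Esq", 300, 0, 330, 10), ("Floyd", 0, 20, 40, 30),
]
segments = build_line_segments(sample_words, space_width=6)
segment_texts = [[w[0] for w in seg] for seg in segments]
```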

19 Line Segment Identifier

20 Line Ordering
Works well in most cases
Excessive skew or overlap is harder

21 RANSAC Margin Finder
Random Sample Consensus (RANSAC)
Finds a line in the presence of noise
Effective for finding left-aligned margins, tab stops, table columns
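A minimal RANSAC sketch for a left margin, under the assumption that the input is the `(x, y)` position of each line segment's left edge. Since a margin is close to vertical, the line is modelled as `x = slope * y + intercept`. The tolerance, iteration count, seed, and sample data are illustrative choices and are not taken from the slides.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) of a line segment's left edge (assumed input)
Model = Tuple[float, float]  # (slope, intercept) for x = slope * y + intercept


def ransac_margin(points: List[Point], tolerance: float = 3.0,
                  iterations: int = 200,
                  seed: int = 0) -> Tuple[Optional[Model], List[Point]]:
    """Fit a near-vertical line to the largest consistent subset of points."""
    if len(points) < 2:
        return None, []
    rng = random.Random(seed)
    best_model: Optional[Model] = None
    best_inliers: List[Point] = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if y1 == y2:
            continue  # two points on the same row cannot define a margin
        slope = (x2 - x1) / (y2 - y1)
        intercept = x1 - slope * y1
        inliers = [(x, y) for x, y in points
                   if abs(x - (slope * y + intercept)) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers


# Ten left edges on a slightly skewed margin, plus three indented outliers.
margin_points = [(50 + 0.01 * y, float(y)) for y in range(0, 200, 20)]
outliers = [(80.0, 30.0), (120.0, 90.0), (95.0, 150.0)]
model, inliers = ransac_margin(margin_points + outliers)
```

The indented lines never attract enough support to win, so the margin is recovered even though they are part of the input; running the same search again on the leftover points is one way tab stops or further columns might be found.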

22 RANSAC Margin Finder

23 Margin Finder – Future Work Left Center Right Key

24 Margin Finder – Future Work Line Wrap?

25 Margin Finder – Future Work
ABBYY FineReader handles:
–Paragraphs
–Newspaper columns
But has trouble with:
–Hanging indents
–Outline indentation (possibly)

26 Part 2: Pattern Finding

27 Pattern Finding
1. Apply baseline name extractor (OntoES)
2. Apply margin finder and insert markers
3. Find left and right context for each name
4. Apply common contexts to extract more names

28 Pattern Finding 1. Apply baseline name extractor (OntoES)

29 Pattern Finding 2. Apply margin finder and insert markers (figure labels: LEVEL 1, LEVEL 2)

30 Pattern Finding 3. Find left and right context for each name (figure labels: LEVEL 1, LEVEL 2)

31 Pattern Finding 4. Apply common context patterns to extract more names (figure labels: LEVEL 1, LEVEL 2)
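Steps 3 and 4 can be illustrated with a small sketch. It assumes the text has already been tokenized, that the margin finder has inserted marker tokens (written here as the hypothetical `<L1>` and `<L2>`), and that the baseline extractor returns names as `(start, end)` token spans. The function names, thresholds, and sample directory lines are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int]     # token span [start, end) of a name
Context = Tuple[str, str]  # (token before the name, token after the name)


def learn_contexts(tokens: List[str], name_spans: List[Span],
                   min_count: int = 2) -> List[Context]:
    """Step 3: collect left/right contexts shared by several baseline names."""
    counts: Counter = Counter()
    for start, end in name_spans:
        if start > 0 and end < len(tokens):
            counts[(tokens[start - 1], tokens[end])] += 1
    return [ctx for ctx, n in counts.items() if n >= min_count]


def apply_contexts(tokens: List[str], contexts: List[Context],
                   max_len: int = 4) -> List[Span]:
    """Step 4: extract every short span enclosed by a learned context."""
    found = set()
    for left, right in contexts:
        for i, tok in enumerate(tokens):
            if tok != left:
                continue
            # Candidate names are 1 to max_len tokens long.
            for j in range(i + 2, min(i + 2 + max_len, len(tokens))):
                if tokens[j] == right:
                    found.add((i + 1, j))
                    break
    return sorted(found)


tokens = ("<L1> Abbott James <L2> carpenter <L1> Baker Mary <L2> widow "
          "<L1> Zwick Otto <L2> mason").split()
baseline_spans = [(1, 3), (6, 8)]  # suppose the baseline misses "Zwick Otto"
contexts = learn_contexts(tokens, baseline_spans)
names = [" ".join(tokens[s:e]) for s, e in apply_contexts(tokens, contexts)]
```

In this toy input the context shared by the two baseline hits also encloses the name the baseline missed, which is the kind of recall gain the pattern step aims for.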

32 Pattern Finding – Sample Results
Baseline Results
–Precision: 40%
–Recall: 31.25%
–F1: 35.09%
Results of Most Salient Pattern
–Precision: 51.52%
–Recall: 53.12%
–F1: 52.31%
Not all results are this good!
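The reported F1 values are consistent with the standard definition, the harmonic mean of precision and recall. A short check, with `f1_score` as an illustrative helper name:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline_f1 = round(f1_score(40.0, 31.25), 2)   # baseline extractor
pattern_f1 = round(f1_score(51.52, 53.12), 2)   # most salient pattern
```

Rounded to two decimals these give 35.09 and 52.31, matching the slide.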

33 Output
Java Advanced Imaging library
–JPEG 2000
–TIF
Bounding Boxes

34 Contributions
Most of our work is just work
–Practical
–Not novel
Possible exception: RANSAC for Margins
–Current research topic (2009)
–Ray Smith / Tesseract
Possible exception: L/R patterns

35 Challenges
Evaluation
–More aligned data
–Annotation tool
Other books
–Centered and right-aligned text
–Knowing when to apply patterns

36 Challenges (continued)
Additional Patterns
–Not just city directories (too trivial?)
–Include other books
Extent
Sanity check on non-pattern books

37 Possible Approach Publish the work as it is?... Add centering and right alignment Add another book…

38 My Preferred Approach Build a useful interactive tool Add features incrementally When my friends say “wow!”, it’s time to publish.

39 Future Challenges BYU Digital Collections –Not searchable for names –Needs further processing Etc.

40 Work to Do… Organize data into a collection Index it Provide a search interface

