Scanned Books: Annotator Training. Project Overview Untapped sources – 100,000+ scanned/OCRed books – Problem: cost-effective extraction Extraction tools.

Presentation transcript:

Scanned Books: Annotator Training

Project Overview
Untapped sources
  – 100,000+ scanned/OCRed books
  – Problem: cost-effective extraction
Extraction tools
  – Read and do form-fill type-in
  – Form-fill by clicking
      • Copy/paste & correction
      • Family tree construction by inference
  – Synergistic
      • Automated form-fill with user correction
      • Manual specification of rules (FROntIER)
      • Machine-learned extraction rules
          – Discover author-specified patterns (ListReader)
          – Parse sentences & match concepts (OntoSoar)
          – Learn from observing users work (GreenFIE-HD)
Ground truthing

Read and Do Form-fill Type-in

Form-fill: Click-only

Synergistic: Automatic Form-fill with Human Confirmation/Correction

Demo
Annotator Framework
  – Session initialization/save/termination
  – Page mode/magnification/navigation
Form Fill-in
  – Person
  – Couple (Marriages)
  – FamilyGroup (Parents with Children)

Session Initialization/Save/Termination
  – Navigate to “dithers.cs.byu.edu/bookannotator” and log in: You will be given several username-password pairs. Each is associated with an annotation job to do.
  – Select page to annotate.
  – Save / Continue to Next Page: When you have finished all assigned pages and have saved your work, you need do nothing more. To start another job, navigate to “dithers.cs.byu.edu/fhannotator”.

Page Mode/Magnification/Navigation
  – go to previous page, next page
  – magnify: zoom in and out
  – mode
  – bounding box
  – scroll bars

Rules and Hints for All Forms
Rules
1. Record only typeset information (nothing written by hand).
2. Do not fix errors (OCR, typesetting, misspellings, …).
3. Close up words with end-of-line hyphens unless the hyphen is “real.”
4. For items that cross page boundaries, extract complete records with the first page. (If no previous/subsequent page, extract partial records.)
5. Use click, Alt-click, or mouse-drag-select-and-click only (to the extent possible; it should always be possible).
Hints
1. For click and Alt-click, hold down Ctrl to add tokens to a field. (Sometimes a click doesn’t “take”; watch for character bounding boxes & click again.)
2. To edit a field value, click on the field and edit; then press Esc when done.
3. The field focus changes automatically; to change it manually, use Tab to go forward and Shift-Tab to go backward, or just click on the field.
4. Be familiar with all Actions and use Keyboard Shortcuts.

Record only typeset information (nothing written by hand).
  – This, not that

Do not fix errors (OCR, typesetting, misspellings, …).

Close up words with end-of-line hyphens unless the hyphen is “real.”
  – Click on “McKen-” or on “zie” properly extracts all of “McKenzie”.
  – Click on “Latter-” or “day” in “Latter- day Saints” likewise closes up the word and yields “Latterday”, but Alt-click yields “Latter-day”. Use Alt-click to retain the “real” hyphen.
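The difference between the two click behaviors can be sketched in a few lines of Python. This is only an illustration of the rule, not the annotator's actual code; the function name `join_tokens` and the `keep_hyphen` flag (standing in for Alt-click) are assumptions made for this example.

```python
def join_tokens(first: str, second: str, keep_hyphen: bool = False) -> str:
    """Join two tokens that were split by an end-of-line hyphen.

    keep_hyphen=False models a plain click (hyphen dropped);
    keep_hyphen=True models Alt-click (the "real" hyphen is kept).
    """
    if not first.endswith("-"):
        raise ValueError("first token must end with a hyphen")
    stem = first[:-1]  # token without its trailing hyphen
    return f"{stem}-{second}" if keep_hyphen else stem + second


print(join_tokens("McKen-", "zie"))                     # McKenzie
print(join_tokens("Latter-", "day"))                    # Latterday
print(join_tokens("Latter-", "day", keep_hyphen=True))  # Latter-day
```

Deciding whether a hyphen is “real” still takes a human reader; the sketch only shows what each choice produces.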

For items that cross page boundaries, extract complete records with the first page. (If no previous/subsequent page, extract partial records.)
  – page 1
  – page 2
  – record together with first page (page 418)

Rules and Hints for Person Form
Rules
1. Extract only names that have either associated birth or death information.
2. Get full name, including any punctuation, title(s) and suffix, but not non-name components associated with the name such as possessives (e.g., ’s).
3. Extract names as written (e.g., not implied surnames or maiden names).
4. Get full date and place names, including punctuation.
5. Do not extract implied dates and place names (e.g., not birth date when only age and death date appear, and not place names unless explicitly stated as birth or death places).
6. Resolve each pronoun or name designator (e.g., “the family patriarch”) that links to birth or death information to the nearest preceding name to which it refers.
Hints
1. For names and dates with punctuation, use Alt-click.
2. Use Ctrl-click to append name parts.
3. The Keyboard Shortcut “a” to add a record may be useful.

Extract only names that have either associated birth or death information.
  – not these names, since no birth or death information is associated with them
  – this place but not these places
  – this date but not this date
Note that although Mary Augusta Andrus and Mrs. Lathrop are the same person, the birth and death information should be linked to the name in the sentence or phrase in which it appears. …

Get full name, including any punctuation, title(s) and suffix. Omit possessives.
  – Isaac Steel, Sr.
  – Azubah Tully
  – Joel M. Gloyd
  – Chief Justice Waite
  – David Vance CALL
  – Rex: omit the surname, “Call” (not written with the name “Rex”)
  – Arta (Shippee) Call: include the parentheses
  – Jolayne Lois SILLITO
  – omit the apostrophe “s”
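As a rough illustration of the "omit possessives" rule, the sketch below removes a trailing apostrophe or apostrophe-"s" while leaving titles, suffixes, and parentheses untouched. The helper name `strip_possessive` and the possessive input string are hypothetical; the annotator tool itself works by clicking tokens, not by running code like this.

```python
import re

# Trailing straight or curly apostrophe, optionally followed by "s".
POSSESSIVE = re.compile(r"['’]s?$")


def strip_possessive(name: str) -> str:
    """Return the name as written, minus a trailing possessive marker."""
    return POSSESSIVE.sub("", name.strip())


print(strip_possessive("Chief Justice Waite’s"))  # Chief Justice Waite
print(strip_possessive("Isaac Steel, Sr."))       # Isaac Steel, Sr.
print(strip_possessive("Arta (Shippee) Call"))    # Arta (Shippee) Call
```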

Omit non-name components.
  – not embedded reference markers
  – not names used for internal designators
  – not paragraph headers

Extract names as written (e.g., not implied surnames or maiden names).
  – not “Abigail Huntington Lathrop McKenzie”
  – not “Mary Ely McKenzie”
  – not “Gerard Lathrop McKenzie”
  – just the names as written

Get full date and place names, including punctuation.
  – include date modifiers
  – not date modifiers, not date explanations (do not include)
  – days of the week (do not include)
  – punctuation part of date (include)
  – punctuation not part of date (exclude)
  – punctuation part of place (include)
  – punctuation not part of place (exclude)

Resolve each pronoun or name designator that links to birth or death information to the nearest preceding name to which it refers.

Rules and Hints for Couples Form
Rules
1. Record all couples as marriages, both stated and implied (e.g., if A is mentioned as the son of B and C, then record B and C as being married).
2. Record marriages with respect to a person. Either spouse may be the primary person.
3. Make a person with multiple marriages be the primary person and list each spouse with the primary person.
4. Extract names as specified for the person form: full name including punctuation, but only the name as written, not including implied maiden names and surnames.
5. Resolve each pronoun or name designator (e.g., “his widow”) that links to marriage information to the nearest preceding name to which it refers.
Hints
1. For multiple marriages, count the number of additional spouses and create additional nested records with a number key: 1 to add one more spouse, 2 to add two additional spouses, etc.
2. Since the primary spouse can be either the husband or the wife, record names in the order they appear in the document.

Record all marriages, both stated and implied.
  – stated
  – implied
  – names, as written (here, the maiden name only; the implied married name is not included, e.g., “Mary Ely”, not “Mary Ely Lathrop”)

Make a person with multiple marriages be the primary person and list each spouse with the person.
  – Christopher with three marriages
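One way to picture the resulting record is as a primary person with a nested list of spouses. The sketch below is an assumption-laden illustration, not the tool's data model: `group_marriages` and `extra_spouse_key` are hypothetical helpers, and the spouse names are borrowed from the parentage example later in the deck, so they may not match the page shown on this slide.

```python
from collections import defaultdict


def group_marriages(couples):
    """Group (primary, spouse) pairs so each primary person lists every spouse."""
    grouped = defaultdict(list)
    for primary, spouse in couples:
        grouped[primary].append(spouse)
    return dict(grouped)


def extra_spouse_key(spouses):
    """Number key to press: one spouse field already exists, so add the rest."""
    return len(spouses) - 1


# Illustrative input only; names in document order.
couples = [("Christopher", "Eve"), ("Christopher", "Esther"), ("Christopher", "Mary")]
records = group_marriages(couples)
print(records)                                   # {'Christopher': ['Eve', 'Esther', 'Mary']}
print(extra_spouse_key(records["Christopher"]))  # 2
```

With three marriages, two additional spouse fields are needed, which matches the hint about pressing 2 to add two additional spouses.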

Resolve each pronoun and person designator that links to marriage information to the nearest preceding name of the person to which it refers.
In this example, pronoun references to spouses are easily resolved, but the resolution of the person designator “his widow” as the spouse of Jonathan Squires requires a deeper understanding of the text.

Rules and Hints for Children Forms
Rules
1. Parents may be specified in either order: father first or mother first.
2. Parentage can sometimes be complex, especially with multiple marriages and blended families. Writers are usually clear, but read carefully to correctly determine parentage.
3. Record families that extend across page boundaries with the first page.
4. Sometimes the same surname appears for every child. Be sure to properly include each separate surname with each separate name.
5. Resolve each pronoun or name designator that links to parent-child information to the nearest preceding name to which it refers.
Hints
1. When the focus is on a nested list field, a number key, n, adds n more blank fields to the list. Count the number of children and add the right number of fields first, then fill them in (e.g., if there are 5 children, enter 4 to add 4 more fields for the children; for 24 children, enter 9, then 9 again, and finally 5).
2. Since the parents can be in either order, record names in the order they appear in the document.
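The arithmetic behind Hint 1 can be checked with a short sketch. It assumes, as the 5-children example implies, that the list starts with one blank field and that a single key press can add at most 9 fields; the function name `field_keypresses` is made up for this illustration.

```python
def field_keypresses(child_count: int) -> list[int]:
    """Return the number keys to press so the list has one field per child.

    Assumes one blank field already exists and each key adds 1 to 9 fields.
    """
    if child_count < 1:
        raise ValueError("child_count must be at least 1")
    remaining = child_count - 1  # one field is already there
    presses = []
    while remaining > 0:
        step = min(9, remaining)  # a single digit key adds at most 9 fields
        presses.append(step)
        remaining -= step
    return presses


print(field_keypresses(5))   # [4]
print(field_keypresses(24))  # [9, 9, 5]
```

Both results agree with the examples in the hint: 4 for five children, and 9, 9, 5 for twenty-four.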

Don’t forget children not explicitly marked as “children”.

Determine correct parentage.
Note that Elizabeth died in 1871 and could not have been Francis’s mother.

Determine correct parentage.
Eve cannot be the mother of either of Christopher’s children since she died before they were born. Esther was Christopher’s wife at the time both children were born, so she is the likely mother. Mary became Christopher’s wife in 1798, after both children were born.
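The date reasoning above can be sketched as an interval check: a wife is a likely mother only if the child was born while that marriage was in effect. Everything except the year 1798 is a hypothetical placeholder chosen to be consistent with the slide's narrative; the `Marriage` class and `likely_mothers` function are likewise assumptions for illustration, and an annotator should always rely on what the text actually says.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Marriage:
    spouse: str
    start_year: int
    end_year: Optional[int] = None  # None means no known end


def likely_mothers(marriages, birth_year):
    """Return spouses whose marriage covers the child's birth year."""
    return [
        m.spouse
        for m in marriages
        if m.start_year <= birth_year
        and (m.end_year is None or birth_year <= m.end_year)
    ]


marriages = [
    Marriage("Eve", 1780, 1785),     # hypothetical years: died before the children were born
    Marriage("Esther", 1787, 1797),  # hypothetical years: wife when both children were born
    Marriage("Mary", 1798),          # 1798 is stated on the slide
]

# Hypothetical birth years for the two children.
print(likely_mothers(marriages, 1790))  # ['Esther']
print(likely_mothers(marriages, 1794))  # ['Esther']
```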

Record children with families that extend across page boundaries with the first page.
  – record Christopher with parents on a previous page
  – record children on a next page with this page
  – no children, but don’t forget the “dau of” child
  – record all six children in these two families, not forgetting the two “son of” children

Be sure to properly include each separate surname with each separate name.
  – For “Michael Lawrence KIRCHGESSNER”, click here, here, and here.
  – For “Deborah Joan KIRCHGESSNER”, click here, here, and here.

Resolve each pronoun or name designator that links to parent-child information to the nearest preceding name to which it refers.
An understanding of the text (e.g., “by whom she had one son”) is sometimes required to link children to parents.

Good Luck! (our ancestors are waiting)