Reconstructing historical populations from genealogical data An overview of methods used for aggregating data from GEDCOM files Corry Gellatly Department.

Presentation transcript:

Reconstructing historical populations from genealogical data
An overview of methods used for aggregating data from GEDCOM files
Corry Gellatly
Department of History and Art History, Utrecht University
Workshop on Population Reconstruction, IISH Amsterdam, February 2014

1. Overview
- Why build a large genealogical database by aggregating hundreds of genealogical data (GEDCOM) files?
- Research increasingly requires big data, to:
  - Understand large-scale population dynamics
    - between regions
    - over time
    - social, biological, cultural and economic aspects
  - Detect rare or ‘small’ effects
    - epidemiology (disease and intervention)
    - inheritance (genetics)
    - comparative life histories

2. GEDCOM files
- Why use GEDCOM files for population reconstruction?
- Pros
  - a standard file structure for representing information about familial relationships and life events
  - most popular format for storage and exchange of genealogical data
  - used internationally and widely available online
- Cons
  - it is a highly flexible format that allows users to enter wildly incorrect information (if they wish to)
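To make the file structure concrete, here is a minimal sketch of reading GEDCOM lines in Python. It relies only on the general line layout of GEDCOM (a level number, an optional `@xref@` identifier, a tag, and an optional value). The sample data, the function name `parse_gedcom` and the `BIRT.DATE` key convention are assumptions made for illustration, not part of the presented workflow.

```python
import re

# One GEDCOM line: level, optional @xref@ id, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?:\s(.*))?$")

# Hypothetical sample data, invented for this sketch.
SAMPLE = """\
0 @I1@ INDI
1 NAME Jan /Jansen/
1 SEX M
1 BIRT
2 DATE 12 MAR 1875
0 @I2@ INDI
1 NAME Maria /de Vries/
1 SEX F
0 @F1@ FAM
1 WIFE @I2@
1 CHIL @I1@
"""


def parse_gedcom(text):
    """Group GEDCOM lines into records keyed by cross-reference id."""
    records = {}
    current = None      # record being filled, or None (e.g. header lines)
    parent_tag = None   # most recent level-1 tag, used to label sub-lines
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if not match:
            continue  # this sketch simply skips blank or malformed lines
        level, xref, tag, value = match.groups()
        level = int(level)
        value = (value or "").strip()
        if level == 0:
            # Only level-0 lines with an @xref@ start a record we keep.
            current = {"type": tag, "fields": []} if xref else None
            if xref:
                records[xref] = current
            parent_tag = None
        elif current is not None:
            if level == 1:
                parent_tag = tag
                current["fields"].append((tag, value))
            else:
                # Deeper lines are labelled with their level-1 parent.
                current["fields"].append((f"{parent_tag or '?'}.{tag}", value))
    return records


records = parse_gedcom(SAMPLE)
for xref, record in records.items():
    print(xref, record["type"], record["fields"])
```

Because the format accepts almost any value after a tag, a reader like this one will happily load implausible dates and relationships, which is why the later screening and cleaning steps are needed.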

3. Data aggregation
- Single GEDCOM files typically contain only a few hundred individuals, so we import hundreds of files into a single genealogical database
- There are broadly 3 steps between the import of files and the output of usable research datasets:
  1. Screening (to reject poor quality files)
  2. Data cleaning
  3. Linkage / de-duplication

4. Screening
- Screening is carried out for various errors, e.g.
  - low mean number of offspring per family
  - individuals younger than 0 or very old (>110)
  - impossible relationships (due to age difference between individuals)
  - individuals occurring as different sexes
  - missing individuals
- If errors are detected, then the file is either:
  - removed (in the case of obvious errors)
  - retained for further checking (in the case of ambiguous errors), e.g. where individuals have more than two parents; this can be due to adoption or incorrect family links between individuals
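The screening checks listed above could be sketched roughly as follows. This is an illustrative outline only: the record layout (dicts with `id`, `sex`, `yob`, `yod`, and families with `husband`, `wife`, `children`), the function name `screen_file`, and the cut-offs `MIN_MEAN_OFFSPRING` and `MIN_PARENT_AGE` are assumptions. Only the 110-year limit comes from the slide, and treating a low mean number of offspring as an "obvious" error is also an assumption.

```python
from collections import defaultdict

MAX_AGE = 110             # from the slide: "very old (>110)"
MIN_MEAN_OFFSPRING = 1.0  # hypothetical cut-off
MIN_PARENT_AGE = 12       # hypothetical cut-off


def screen_file(individuals, families):
    """Classify one file as 'reject', 'review' or 'accept'."""
    obvious, ambiguous = [], []
    by_id = {person["id"]: person for person in individuals}

    # Individuals younger than 0 or very old at death.
    for person in individuals:
        yob, yod = person.get("yob"), person.get("yod")
        if yob is not None and yod is not None and not 0 <= yod - yob <= MAX_AGE:
            obvious.append(f"{person['id']}: age at death {yod - yob}")

    role_sexes = defaultdict(set)  # sexes implied by spouse roles
    parents_of = defaultdict(set)  # child id -> parent ids
    for fam in families:
        for parent_id, sex in ((fam.get("husband"), "M"), (fam.get("wife"), "F")):
            if parent_id is None:
                continue
            role_sexes[parent_id].add(sex)
            for child_id in fam["children"]:
                parents_of[child_id].add(parent_id)
                parent, child = by_id.get(parent_id), by_id.get(child_id)
                if parent is None or child is None:
                    obvious.append(f"{fam['id']}: missing individual")
                elif None not in (parent.get("yob"), child.get("yob")):
                    # Impossible relationship: parent too young at the birth.
                    if child["yob"] - parent["yob"] < MIN_PARENT_AGE:
                        obvious.append(f"{child_id}: impossible parent {parent_id}")

    # Individuals occurring as different sexes.
    for person_id, sexes in role_sexes.items():
        recorded = by_id.get(person_id, {}).get("sex")
        if len(sexes | ({recorded} if recorded else set())) > 1:
            obvious.append(f"{person_id}: occurs as different sexes")

    # More than two parents may be adoption or a wrong family link.
    for child_id, parents in parents_of.items():
        if len(parents) > 2:
            ambiguous.append(f"{child_id}: {len(parents)} parents")

    if families:
        mean_offspring = sum(len(f["children"]) for f in families) / len(families)
        if mean_offspring < MIN_MEAN_OFFSPRING:
            obvious.append(f"mean offspring per family {mean_offspring:.2f}")

    if obvious:
        return "reject", obvious
    if ambiguous:
        return "review", ambiguous
    return "accept", []
```

Returning the list of problems alongside the verdict keeps the "retained for further checking" files easy to inspect by hand.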

5. Cleaning
- Example: date errors
- If DOB is 1857
  - Born to 10 year old mother?
  - 17 years older than wife?
  - First of 5 children born at the age of 39?
- If DOB is actually 1875
  - Born to 28 year old mother?
  - 1 year younger than wife?
  - First of 5 children born at the age of 21?
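The date example above can be turned into a small plausibility check: generate the years obtained by swapping adjacent digits and keep the one that conflicts least with the relatives' dates. The sketch below is a hypothetical illustration, not the cleaning procedure used in the project. The age limits are assumed values, and the relatives' birth years (mother 1847, spouse 1874, first child 1896) are back-calculated from the ages on the slide.

```python
def transposition_candidates(year):
    """Years obtained by swapping two adjacent digits of `year`."""
    digits = str(year)
    candidates = set()
    for i in range(len(digits) - 1):
        swapped = digits[:i] + digits[i + 1] + digits[i] + digits[i + 2:]
        if swapped != digits:
            candidates.add(int(swapped))
    return sorted(candidates)


def implausibilities(yob, mother_yob, spouse_yob, first_child_yob):
    """List the relationships that look implausible for a given birth year.

    All thresholds below are assumptions chosen for this sketch.
    """
    problems = []
    if not 15 <= yob - mother_yob <= 50:
        problems.append("mother's age at birth")
    if abs(yob - spouse_yob) > 15:
        problems.append("age gap with spouse")
    if not 15 <= first_child_yob - yob <= 60:
        problems.append("age at first child")
    return problems


def most_plausible_year(recorded, **relatives):
    """Pick the recorded year or a digit transposition with fewest problems."""
    options = [recorded] + transposition_candidates(recorded)
    return min(options, key=lambda y: len(implausibilities(y, **relatives)))


relatives = {"mother_yob": 1847, "spouse_yob": 1874, "first_child_yob": 1896}
print(implausibilities(1857, **relatives))
print(most_plausible_year(1857, **relatives))
```

A check like this only proposes a correction; whether to apply it automatically or flag the record for review is a separate decision.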

6. Dataset extraction
- Definition of datasets is driven by research questions:
  - which timespan?
  - which region?
  - do we need complete families?
  - do we need dates of birth, death, marriage?
- The identification of links between genealogies (or removal of duplicate individuals) is done during the process of dataset extraction

7. Linkage, de-duplication
- Linkage fields
  - Day of birth, marriage or death (DOB, DOM, DOD)
  - Year of birth, marriage or death (YOB, YOM, YOD)
  - Surname
  - Given names
  - Sex
- Problems
  - YOB, YOM, YOD more common than DOB, DOM, DOD (particularly in older data) but less unique to each individual
  - High inconsistency in recording of given names
    - Middle names included or excluded
    - Middle names used instead of first names
    - Abbreviated names
    - Nicknames (sometimes in brackets)

8. Linkage, de-duplication
- Trade-off between data coverage and quality
  - Surname, given name, DOB
    - Low risk of false linkages, but high risk of missing linkages (due to problems with given names) and low data coverage
  - Surname, DOB
    - Low risk of false linkages, but low data coverage
  - Surname, YOB
    - High risk of false linkages, but high data coverage
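One way to see this trade-off on a concrete dataset is to measure, for each candidate key, how many records carry all of its fields (coverage) and how many of those records have a key shared with no other record (uniqueness). The sketch below assumes records are plain dicts with `None` for missing values. The function name and the four sample records are invented for illustration.

```python
from collections import Counter


def coverage_and_uniqueness(records, fields):
    """Return (% of records with all fields, % of those with a unique key)."""
    if not records:
        return 0.0, 0.0
    keys = [tuple(rec.get(field) for field in fields) for rec in records]
    complete = [key for key in keys if None not in key]
    if not complete:
        return 0.0, 0.0
    counts = Counter(complete)
    unique = sum(1 for key in complete if counts[key] == 1)
    coverage = 100 * len(complete) / len(records)
    uniqueness = 100 * unique / len(complete)
    return coverage, uniqueness


# Hypothetical records: DOB is often missing in older data, YOB rarely is.
records = [
    {"surname": "Jansen", "given": "Jan", "dob": "1650-03-12", "yob": 1650},
    {"surname": "Jansen", "given": "Johannes", "dob": None, "yob": 1650},
    {"surname": "Jansen", "given": "Pieter", "dob": None, "yob": 1650},
    {"surname": "Bakker", "given": "Anna", "dob": "1662-07-01", "yob": 1662},
]

for fields in (("surname", "given", "dob"), ("surname", "dob"), ("surname", "yob")):
    print(fields, coverage_and_uniqueness(records, fields))
```

In this toy data the DOB-based keys cover half the records but are all unique, while surname plus YOB covers every record but is unique for only one of them, which mirrors the pattern described on the slide.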

9. Group-linking method
- Individuals are identifiable by those they are related to
- This principle is being applied to the problem of genealogical data, in which many records have YOB but not DOB, and given names are somewhat unreliable for linking
- Group-linking string

10. Group-link test
- Test with single GEDCOM file containing no duplicates
- 2,082 individuals; 971 marriages; 681 conceptive relationships; 1,913 conceptions

11. Group-link test
- Percentage data coverage × percentage of unique records within that data (÷ 100) gives an estimate of linkage power
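Written as a formula, the estimate of linkage power reads as follows. The symbols C (percentage data coverage) and U (percentage of unique records within the covered data) are introduced here only for illustration.

```latex
\[
\text{linkage power} = \frac{C \times U}{100}
\]
```

For instance, with hypothetical values C = 80 and U = 90, the estimate is 80 × 90 ÷ 100 = 72, meaning that 72% of all records carry a key that is both present and unique.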

12. Missing data
- What about missing information?
- The information on the siblings of these individuals is probably missing. Why? Because they appeared at marriage
- This data is left-censored, because these individuals appeared in the data after the event we are measuring (i.e. number and sex of siblings).

13. Missing data
- Depending on what type of links we are trying to find, we may want to break up the string
  - String to link individuals based on their siblings
  - String to link individuals based on their marriages and children
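The transcript does not preserve the actual layout of the group-linking string, so the following is only a sketch of the idea under an assumed encoding: the individual's sex, followed by sorted sex-plus-YOB tokens for siblings and children and the years of marriage. The separators, field names and function names are all hypothetical. `None` is used for unknown relatives (for example, siblings of someone who first appears at marriage), which is different from an empty list meaning "known to have none".

```python
def relatives_part(relatives):
    """Encode (sex, yob) pairs as sorted tokens, e.g. 'F1652M1650'."""
    if relatives is None:
        return None  # unknown, not the same as "no relatives"
    return "".join(sorted(f"{sex}{yob}" for sex, yob in relatives))


def group_link_strings(person):
    """Build the full string and the two partial strings for one person."""
    sex = person["sex"]
    siblings = relatives_part(person.get("siblings"))
    children = relatives_part(person.get("children"))
    marriages = person.get("marriages")
    married = None if marriages is None else "".join(str(y) for y in sorted(marriages))

    # Partial string based on siblings only.
    sibling_string = None if siblings is None else f"{sex}|S:{siblings}"
    # Partial string based on marriages and children only.
    family_string = (
        None if None in (married, children) else f"{sex}|M:{married}|C:{children}"
    )
    # Full string needs every component to be known.
    full_string = (
        None
        if None in (siblings, married, children)
        else f"{sex}|S:{siblings}|M:{married}|C:{children}"
    )
    return {
        "full": full_string,
        "siblings": sibling_string,
        "marriage_children": family_string,
    }


# Hypothetical individual who enters the data at marriage: siblings unknown.
bride = {
    "sex": "F",
    "siblings": None,
    "marriages": [1675],
    "children": [("M", 1676), ("F", 1678)],
}
print(group_link_strings(bride))
```

Breaking the string up this way lets a record with unknown siblings still be linked on its marriage and children, instead of being dropped because the full string cannot be built.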

14. Record de-duplication ( )
- De-duplication of 17th century records from the genealogical database
- Febrl program (Freely Extensible Biomedical Record Linkage)
- 17,488 records with Surname and YOB
- Indexes
  - Surname > YOB
  - Surname > Group-link string 2 (sex + siblings)
  - Surname > Group-link string 3 (sex + marriages + offspring)
- Comparison function
  - Winkler
- Classifier
  - KMeans
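The role of an index such as "Surname > YOB" is to limit which record pairs are compared at all. The sketch below illustrates that blocking idea in plain Python. It does not use Febrl or reproduce its API, and the function name, record layout and sample records are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_fields):
    """Return id pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if None in key:
            continue  # records missing a key field cannot be indexed
        blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


# Hypothetical records.
records = [
    {"id": 1, "surname": "Jansen", "yob": 1650},
    {"id": 2, "surname": "Jansen", "yob": 1650},
    {"id": 3, "surname": "Jansen", "yob": 1651},
    {"id": 4, "surname": "Bakker", "yob": 1650},
]

print(candidate_pairs(records, ("surname", "yob")))
```

With four records there are six possible pairs, but the index leaves only the pairs that agree on surname and YOB to be scored by the comparison function and classifier.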

15. Record de-duplication ( )
- Results

16. Record de-duplication ( )
- Results
- Examples of matches in highest weight category (1,914 matches)

17. Record de-duplication ( )
- Results
- Examples of matches in lower weight category (10,434 matches)

17. Further work
- Record linkage
  - Refine a method of probabilistic data matching that can identify linkages where:
    - typos or name variations occur
    - possible date typos exist
    - there are missing persons in the family structure
- Group-linking algorithm
  - Using the group-linking string as a starting point, then checking for the existence of birth, marriage and death dates of relatives (where these exist) and performing matches on these variables
  - Inherently based on probabilistic matching

18. Acknowledgements
- Netherlands Organisation for Scientific Research (NWO)
  - Project number : “Nature or nurture? A search for the institutional and biological determinants of life expectancy in Europe during the early modern period”
- Colleagues at Utrecht University
  - Tine De Moor
  - Institutions for Collective Action team: