Big Data and Programming (History 9808A) 27 October 2014.

Today’s Agenda
• Proposals
• How are we with the due date?
• A Short Introduction to Big Data
• A Big Data Project: People In Motion

Data Deluge
• Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
• Library of Congress = 200 terabytes
• “Transferring ‘Libraries of Congress’ of Data”
• IP traffic is around 667 exabytes
• It’s a deluge...
• “Big Data”: too large for current software to handle
• Don’t be intimidated
• Not all DH sources (yet)
• Instructive video: David McCandless, “The Beauty of Data Visualization”

Big Data for History
• Tools for journalists, lit scholars and others
• Where does history fit in?
• “Digital history does not offer truths, but only a new way of interpreting and understanding traces of the past.” (S. Graham, I. Milligan, & S. Weingart)
• Blog Leaders
• Taryn: “…we have to have a better understanding of how programming works so we can at least engage with Computer Scientists to help develop the complex systems required…”
• Tamar: The Strange Case of Belgium/Ancestry.com
• Nick K.: The Case of the Missing API

New approach: Crowdsourcing
• An “online, distributed problem-solving and production model.”
• Examples:
• Wikipedia
• reCAPTCHA
• Luis von Ahn
• Others...
• Transcribe Bentham
• Census transcription

A Database for Your Project?
• Think about how you might use a database
• but perhaps not too big!
• Databases can be very small and still be DH-worthy
• Are there public docs out there that you can digest?
• Google Refine
• Incorporate a search function into your website?
• Resources
• MS Excel (spreadsheet)
• MS Access (relational database)
• Google Refine
• Cleaning data

People in Motion: Longitudinal Data from the Canadian Census
A Big Data Project at the University of Guelph

What we are working towards
• ‘Unbiased’ links connecting individuals/households over several census years
• A comprehensive infrastructure of longitudinal data
• Censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916; US 1880, US 1900

Stage 1: 1871 to 1881
• Automatic linking: 100% of 1871 Census to 100% of 1881 Census
• 4,277,807 records; 3,601,663 records
• Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

Teaching a Computer to Be a Genealogist
• Training with existing manually created (true) links
• Ontario Industrial Proprietors – 8,429 links
• Logan Township – 1,760 links
• St. James Church, Toronto – 232 links
• Quebec City Boys – 1,403 links
• Bias concerns
• Think of any?
(Map labels: Logan Twp, Guelph)

Attributes for Automatic Linking
• Last Name – string
• First Name – string
• Gender – binary
• Birthplace – code
• Age – number
• Marital status – single, married, divorced, widowed, unknown
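One way to picture these six attributes is as a small typed record. The sketch below is illustrative only: the class and field names, the "M"/"F" encoding for gender, and the integer birthplace code are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class MaritalStatus(Enum):
    # The five categories listed on the slide.
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    last_name: str        # string
    first_name: str       # string
    gender: str           # binary; "M"/"F" is a hypothetical encoding
    birthplace: int       # code; the numeric scheme here is made up
    age: float            # number, in years
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN


# An invented example record, not real census data.
record = CensusRecord("Smith", "John", "M", 15, 34, MaritalStatus.MARRIED)
print(record)
```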

Automatic Linkage
• The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
• The system:

Data Cleaning and Standardization
• Cleaning
• Names – remove non-alphanumeric characters; remove titles
• Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
• All attributes – deal with English/French notations (e.g. days/jours, married/mariee)
• Standardization
• Birthplace codes and granularity
• Marital status
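The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the title list, the unit table, the marital-status vocabulary, and all function names are assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical lookup tables; the project's real rules are not given in the slides.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}
UNITS_PER_YEAR = {
    "year": 1, "years": 1, "an": 1, "ans": 1,
    "month": 12, "months": 12, "mois": 12,
    "day": 365, "days": 365, "jour": 365, "jours": 365,
}
MARITAL_STATUS = {
    "single": "single", "celibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
    "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
}


def strip_accents(text):
    """Turn 'mariée' into 'mariee' so English and French spellings share keys."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw):
    """Drop punctuation and titles, then upper-case what is left."""
    text = re.sub(r"[^\w\s]|_", "", raw)
    words = [w for w in text.split() if w.lower() not in TITLES]
    return " ".join(words).upper()


def clean_age(raw):
    """Return the age in years, or None when the value cannot be parsed."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", strip_accents(raw.strip().lower()))
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2)
    if unit == "":
        return float(number)
    if unit not in UNITS_PER_YEAR:
        return None
    return number / UNITS_PER_YEAR[unit]


def standardize_marital_status(raw):
    """Map English or French notations onto one vocabulary."""
    return MARITAL_STATUS.get(strip_accents(raw.strip().lower()), "unknown")


print(clean_name("Dr. John O'Neil"))         # JOHN ONEIL
print(clean_age("3 months"))                 # 0.25
print(standardize_marital_status("Mariée"))  # married
```

Values that cannot be parsed come back as None or "unknown" rather than a guess, so they can be reviewed by hand.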

Computational Expense
• Very expensive to compare all the possible pairs of records
• Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
• Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = about 81 days, or about 40.5 days per attribute compared. (Big Data)
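The run-time estimate is easy to check by hand or in code. The sketch below uses only the figures quoted on the slide; the function and variable names are invented for the example. Evaluated as written, the expression gives roughly 81 days for two attributes, which is about 40.5 days for each attribute compared.

```python
def runtime_days(n_left, n_right, n_attributes, comparisons_per_second):
    """Days needed to compare every record on one side with every record on the other."""
    comparisons = n_left * n_right * n_attributes
    seconds = comparisons / comparisons_per_second
    return seconds / 60 / 60 / 24


records_1871 = 3_500_000   # rounded figure from the slide
records_1881 = 4_000_000   # rounded figure from the slide
speed = 4_000_000          # comparisons per second, from the slide

print(round(runtime_days(records_1871, records_1881, 2, speed), 1))  # 81.0
print(round(runtime_days(records_1871, records_1881, 1, speed), 1))  # 40.5
```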

Managing Computational Expense
• Blocking
• By first letter of last name
• By birthplace
• Using HPC
• Running the system on multiple processors in parallel
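Blocking means comparing only records that share a cheap key, instead of every possible pair. A minimal sketch follows, assuming the two blocking criteria are combined into a single key (the slides do not say whether they are combined or applied separately). The records are invented, and the dictionary layout and function names are hypothetical.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(left, right):
    """Yield only the pairs whose records fall in the same block."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[block_key(rec)].append(rec)
    for rec in left:
        for other in blocks.get(block_key(rec), []):
            yield rec, other


# Invented sample data, not real census records.
census_1871 = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]
census_1881 = [
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Scott", "birthplace": "QC"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]

pairs = list(candidate_pairs(census_1871, census_1881))
print(len(pairs), "candidate pairs instead of", len(census_1871) * len(census_1881))
```

Because each block is independent of the others, blocks are also a natural unit of work to hand out to separate processors.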

Record Comparison
• Comparing strings
• String measures: first letter, “edit distance”, sound
• Age: +/- 2 years
• Required exact matches: gender, birthplace
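The comparison rules can be illustrated with a simple rule-based check. This is a stand-in for discussion, not the trained linking system described earlier. The name-distance threshold is an assumption, as is the reading of "+/- 2 years" as a tolerance around the ten years between the two censuses, and the phonetic ("sound") measure is left out for brevity.

```python
def edit_distance(a, b):
    """Levenshtein distance: the fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_candidate_match(rec_1871, rec_1881, max_name_distance=2,
                       age_tolerance=2, years_between=10):
    """Apply the slide's rules; the default thresholds are hypothetical."""
    # Required exact matches.
    if rec_1871["gender"] != rec_1881["gender"]:
        return False
    if rec_1871["birthplace"] != rec_1881["birthplace"]:
        return False
    # Age should have advanced by about ten years, give or take two.
    expected_age = rec_1871["age"] + years_between
    if abs(rec_1881["age"] - expected_age) > age_tolerance:
        return False
    # Names: same first letter and a small edit distance.
    for field in ("last_name", "first_name"):
        a, b = rec_1871[field].upper(), rec_1881[field].upper()
        if a[:1] != b[:1] or edit_distance(a, b) > max_name_distance:
            return False
    return True


# Invented records for illustration.
john_1871 = {"last_name": "Smith", "first_name": "John",
             "gender": "M", "birthplace": "ON", "age": 24}
john_1881 = {"last_name": "Smyth", "first_name": "John",
             "gender": "M", "birthplace": "ON", "age": 35}

print(edit_distance("SMITH", "SMYTH"))           # 1
print(is_candidate_match(john_1871, john_1881))  # True
```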

Linkage Results
• 1871-81-91-1901
• Over 500,000 links…
• About 20%

Coding Playtime
• W3C tutorials
• The Programming Historian: http://programminghistorian.org/
• Codecademy: http://www.codecademy.com/learn
