Published by Roxanne Carter. Modified over 9 years ago.
1
Big Data and Programming 4 February 2015
2
Today’s Agenda
A Short Introduction to Big Data
A Big Data Project: People In Motion
Next week:
  Meet Monday here at 2:30 for ca. 60-75 minutes
  Meet Wednesday ca. 2:30-4:30 in Library 034a (north stairs, go to basement)
3
Data Deluge
Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
Library of Congress = 200 terabytes
“Transferring ‘Libraries of Congress’ of Data”
IP traffic is around 667 exabytes
It’s a deluge...
“Big Data”: too large for current software to handle
Don’t be intimidated
Not all DH sources are Big Data (yet)
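To put the two figures on this slide side by side, here is a minimal sketch that converts both to bytes and expresses the IP traffic in "Libraries of Congress". It assumes decimal (SI) units, where each step up is a factor of 1000; the names `UNITS` and `to_bytes` are illustrative only.

```python
# Decimal (SI) byte units in increasing order, as listed on the slide.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]


def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes (assumes 1 kB = 1000 bytes)."""
    return value * 1000 ** UNITS.index(unit)


library_of_congress = to_bytes(200, "terabyte")  # figure from the slide
ip_traffic = to_bytes(667, "exabyte")            # figure from the slide

libraries = ip_traffic / library_of_congress
print(f"{libraries:,.0f} Libraries of Congress")  # prints "3,335,000 Libraries of Congress"
```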
4
Big Data for History
Tools for journalists, literature scholars and others
Where does history fit in?
Graham, Milligan, & Weingart: “Will Big Data have a revolutionary impact on the epistemological foundation of history?”
Will it get us closer to the past?
Networks: a whole world of fun!
Visualization is also a whole new world
See: David McCandless, “The Beauty of Data Visualization”
What does it tell us?
5
New approaches: Crowdsourcing
An “online, distributed problem-solving and production model.”
Examples:
  Wikipedia
  reCAPTCHA (Luis von Ahn)
  Others...
8
A Database for Your Project?
Think about how you might use a database, but perhaps not too big!
Databases can be very small and still be DH-worthy
Are there public docs out there that you can digest?
Resources:
  Programming Historian
  MS Excel (spreadsheet), Access (relational database), Google Refine
9
People in Motion: Longitudinal Data from the Canadian Census A Big Data Project at the University of Guelph
10
What we are working towards
A comprehensive infrastructure of longitudinal data
‘Unbiased’ links connecting individuals/households over several census years
Censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916; US 1880, US 1900
11
Stage 1: 1871 to 1881
Automatic linking of 100% of the 1871 Census (3,601,663 records) to 100% of the 1881 Census (4,277,807 records)
Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta
12
Teaching a Computer to Be a Genealogist
Training with existing manually created (true) links:
  Ontario Industrial Proprietors – 8429 links
  Logan Township – 1760 links
  St. James Church, Toronto – 232 links
  Quebec City Boys – 1403 links
Bias concerns: think of any?
(Map: Logan Twp, Guelph)
13
Attributes for Automatic Linking
Last Name – string
First Name – string
Gender – binary
Birthplace – code
Age – number
Marital status – single, married, divorced, widowed, unknown
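One way to picture the attribute list above is as a small record type. This is only a sketch: the class and field names (`CensusRecord`, `birthplace_code`, and so on) are hypothetical, the slides do not say how gender or birthplace codes are actually stored, and the sample values are made up.

```python
from dataclasses import dataclass
from enum import Enum


class MaritalStatus(Enum):
    """The five marital-status values listed on the slide."""
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    """One person in one census year (field names are illustrative)."""
    last_name: str                 # string
    first_name: str                # string
    gender: str                    # binary; assumed here to be "M" or "F"
    birthplace_code: int           # code; an integer code is an assumption
    age: int                       # number
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN


# Made-up example record, not taken from the census data.
example = CensusRecord("Smith", "John", "M", 1, 34, MaritalStatus.MARRIED)
print(example)
```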
14
Automatic Linkage
The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
The system:
15
Data Cleaning and Standardization
Cleaning:
  Names – remove non-alphanumeric characters; remove titles
  Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  All attributes – deal with English/French notations (e.g. days/jours, married/mariee)
Standardization:
  Birthplace codes and granularity
  Marital status
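The cleaning steps above could look roughly like the following. This is a sketch under stated assumptions, not the project's actual code: the title list, the unit table, the French spellings, and the choice to express ages under one year as a fraction of a year are all illustrative guesses.

```python
import re
import unicodedata

# Hypothetical title list; the slides do not say which titles were removed.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}

# How many of each unit make up a year, in English and French (not exhaustive).
UNITS_PER_YEAR = {"month": 12, "months": 12, "mois": 12,
                  "day": 365, "days": 365, "jour": 365, "jours": 365}

# English/French marital-status spellings mapped to one standard label.
MARITAL_STATUS = {"single": "single", "celibataire": "single",
                  "married": "married", "marie": "married", "mariee": "married",
                  "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
                  "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced"}


def strip_accents(text):
    """Drop combining accent marks, so 'mariée' becomes 'mariee'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw):
    """Remove non-alphanumeric characters, then remove title words."""
    alnum_only = re.sub(r"[^\w\s]|_", "", raw)
    words = [w for w in alnum_only.split() if w.lower() not in TITLES]
    return " ".join(words)


def parse_age(raw):
    """Return the age in years, or None if the text is not understood."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", raw)
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2).lower()
    if unit == "":
        return float(number)                  # plain number: already in years
    if unit in UNITS_PER_YEAR:
        return number / UNITS_PER_YEAR[unit]  # e.g. "3 months" is 3/12 of a year
    return None


def normalize_marital(raw):
    """Map an English or French spelling to a standard label."""
    return MARITAL_STATUS.get(strip_accents(raw).strip().lower(), "unknown")


print(clean_name("Dr. John O'Brien"))   # prints "John OBrien"
print(parse_age("3 months"))            # prints 0.25
print(normalize_marital("Mariée"))      # prints "married"
```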
16
Computational Expense
Very expensive to compare all the possible pairs of records
Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = about 81 days, or 40.5 days per attribute compared. (Big Data)
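The run-time estimate can be checked with a few lines of arithmetic. Using the slide's own inputs, a single attribute takes about 40.5 days, so comparing two attributes at the same rate comes to about 81 days. Variable names are illustrative.

```python
records_1871 = 3_500_000            # approximate size of the 1871 census
records_1881 = 4_000_000            # approximate size of the 1881 census
attributes_compared = 2
comparisons_per_second = 4_000_000  # rate assumed on the slide

SECONDS_PER_DAY = 60 * 60 * 24

record_pairs = records_1871 * records_1881  # every 1871 record against every 1881 record
days_per_attribute = record_pairs / comparisons_per_second / SECONDS_PER_DAY
days_total = days_per_attribute * attributes_compared

print(f"{record_pairs:,} record pairs")               # prints "14,000,000,000,000 record pairs"
print(f"{days_per_attribute:.1f} days per attribute")  # prints "40.5 days per attribute"
print(f"{days_total:.1f} days for both attributes")    # prints "81.0 days for both attributes"
```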
17
Managing Computational Expense
Blocking:
  By first letter of last name
  By birthplace
Using HPC:
  Running the system on multiple processors in parallel
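Blocking means only comparing records that share a cheap-to-compute key, instead of every possible pair. The sketch below assumes the two blocking criteria are combined into one key (first letter of last name plus birthplace); the slides do not say whether they were used together or separately. The sample records and field names are made up.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(records_a, records_b):
    """Yield only the pairs whose records fall in the same block."""
    blocks = defaultdict(list)
    for rec_b in records_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in records_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Made-up sample data, not taken from the census.
census_1871 = [{"last_name": "Carter", "birthplace": "ON"},
               {"last_name": "Tremblay", "birthplace": "QC"}]
census_1881 = [{"last_name": "Carter", "birthplace": "ON"},
               {"last_name": "Clark", "birthplace": "ON"},
               {"last_name": "Carter", "birthplace": "QC"},
               {"last_name": "Tremblay", "birthplace": "QC"}]

pairs = list(candidate_pairs(census_1871, census_1881))
all_pairs = len(census_1871) * len(census_1881)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")  # prints "3 candidate pairs instead of 8"
```

Because each block is independent of the others, blocks can also be handed to separate processors, which is where running in parallel on HPC fits in.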
18
Record Comparison
Comparing strings:
  String measures: first letter, “edit distance”, sound
Age: +/- 2 years
Required exact matches:
  Gender
  Birthplace
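The comparison rules above can be sketched as a simple yes/no check. This is a simplified illustration, not the project's system: it uses only edit distance for names (no sound-based measure), the name threshold of 2 is a made-up parameter, and "age +/- 2 years" is read here as "the age gap may differ from the years between censuses by at most 2", which is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def is_candidate_link(old, new, years_apart=10, max_name_distance=2, age_tolerance=2):
    """Apply the exact-match, age and name rules to one pair of records."""
    # Gender and birthplace must match exactly.
    if old["gender"] != new["gender"] or old["birthplace"] != new["birthplace"]:
        return False
    # The person should have aged by about `years_apart` years.
    if abs((new["age"] - old["age"]) - years_apart) > age_tolerance:
        return False
    # Names may differ by a few characters (spelling variants, transcription errors).
    return all(
        edit_distance(old[field].lower(), new[field].lower()) <= max_name_distance
        for field in ("first_name", "last_name")
    )


# Made-up records for illustration.
person_1871 = {"first_name": "John", "last_name": "Macdonald",
               "gender": "M", "birthplace": "ON", "age": 30}
person_1881 = {"first_name": "Jon", "last_name": "McDonald",
               "gender": "M", "birthplace": "ON", "age": 41}

print(edit_distance("kitten", "sitting"))           # prints 3
print(is_candidate_link(person_1871, person_1881))  # prints True
```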
19
Linkage Results 1871-81-91-1901
Over 500,000 links…
About 20%
20
Coding Workshop
Go to http://www.codecademy.com/learn
Scroll down to “Goals”
Pick one of the three activities:
  Animate your Name
  About You
  Sun, Earth and Code
After 30 minutes, be prepared to present!