HIST*4170 Data: Big and Small 29 January 2013
Today’s Agenda
–Blog Updates
–A Short Introduction to Databases
–A Big Data Project: People in Motion
–Special Guest: Dr. Rebecca Lenihan
Blog Highlights
–Ambition: consider scalability; consider source availability – is there a local advantage?
–Keep your eye on the academic value: what do you want to teach? What do you want to learn?
–Themes: war, sport, family, mapping
–Intellectual property/privacy
–Resources: Google SketchUp, to make 3D buildings
Data Deluge
–Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
–Library of Congress ≈ 200 terabytes; we now speak of transferring “Libraries of Congress” of data
–Global IP traffic is around 667 exabytes
–It’s a deluge: Ian Milligan, “Preparing for the Infinite Archive: Social Historians and the Looming Digital Deluge” (Mar 23, Tri-U history conference)
–“Big Data”: data too large for current software to handle
–Don’t be intimidated – not all DH sources are this big (yet)
Introduction to Databases
–Database: a system that allows for the efficient storage and retrieval of information
–We associate databases with computers, which changed them a lot
–Two problems: organization and efficient retrieval
–Organization requires a data structure; efficient retrieval requires good algorithms
–Potential for the humanities: new problems, new questions, visualization, and databases as objects worthy of study and reflection
Database Design
–The purpose of a database is to store information about a particular domain and to allow one to ask questions about the state of that domain
–Relational databases are more efficient because they store information separately: attributes and relationships
–The Quamen reading is a nice introduction
–Not as complicated as you might think, but following the rules is important
–We will apply...
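To make the idea concrete, here is a minimal sketch of a relational design using Python's built-in sqlite3 module. The table layout, column names, and sample data are illustrative, not from the course materials: birthplaces are stored once in their own table and referenced by id, and a join answers a question about the state of the domain.

```python
import sqlite3

# In-memory database with two related tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE birthplace (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    last_name TEXT,
    birthplace_id INTEGER REFERENCES birthplace(id))""")
conn.execute("INSERT INTO birthplace VALUES (1, 'Ontario')")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [(1, 'Wilson', 1), (2, 'Tremblay', 1)])

# Ask a question about the domain: who was born in Ontario?
rows = conn.execute("""SELECT person.last_name FROM person
                       JOIN birthplace ON person.birthplace_id = birthplace.id
                       WHERE birthplace.name = 'Ontario'
                       ORDER BY person.id""").fetchall()
print([r[0] for r in rows])  # ['Wilson', 'Tremblay']
```

Storing 'Ontario' once and pointing at it is exactly the "store information separately" efficiency the slide mentions: correcting a birthplace spelling means updating one row, not every person record.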
New Approach: Crowdsourcing
–An “online, distributed problem-solving and production model.” Daren C. Brabham (2008), “Crowdsourcing as a Model for Problem Solving: An Introduction and Cases,” Convergence: The International Journal of Research into New Media Technologies 14 (1): 75–90
–Cited in Wikipedia, where “Anyone with Internet access can write and make changes to Wikipedia articles...”
–Other examples: reCAPTCHA (Luis von Ahn); others... Google?
There are limitations...
–Organization
–Quality Control
–Selection
A Database for Your Project?
–Think about how you might use a database – but perhaps not too big! Databases can be very small and still be DH-worthy
–Are there public documents out there that you can digest?
–Incorporate a search function into your website?
–Resources: MS Excel (spreadsheet); MS Access (relational database); Google Refine (cleaning data)
Assignment for Next Week
–Reading: TBD (3D guns?)
–Help someone else out with their project: read their blog; comment and provide detailed feedback
–Find a collaborator?
People in Motion: Creating Longitudinal Data from Canadian Historical Census
What We Are Working Towards
–‘Unbiased’ links connecting individuals/households over several census years
–A comprehensive infrastructure of longitudinal data
–Censuses: Canada 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916; US 1880, 1900
Current Work
–Automatic linking between 100% of the 1871 Census (3,601,663 records) and 100% of the 1881 Census (4,277,807 records)
–Partners and collaborators: FamilySearch (The Church of Jesus Christ of Latter-day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta
Existing (True) Links
–Ontario Industrial Proprietors – 8,429 links
–Logan Township – 1,760 links
–St. James Church, Toronto – 232 links
–Quebec City Boys – 1,403 links
–Bias concerns: family context; others?
Attributes for Automatic Linking
–Last name – string
–First name – string
–Gender – binary
–Birthplace – code
–Age – number
–Marital status – single, married, divorced, widowed, unknown
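The attribute list above maps naturally onto a typed record. The sketch below is illustrative: the field names, the sample values, and the choice of encodings (e.g. an integer birthplace code) are assumptions, since the project's actual encodings are not given on the slide.

```python
from typing import NamedTuple

class CensusRecord(NamedTuple):
    """One census entry, typed as on the slide (illustrative encodings)."""
    last_name: str
    first_name: str
    gender: bool        # binary; the project's actual coding is not specified
    birthplace: int     # coded value
    age: int            # whole years
    marital_status: str # 'single' | 'married' | 'divorced' | 'widowed' | 'unknown'

r = CensusRecord("Tremblay", "Marie", True, 124, 34, "married")
print(r.last_name, r.age)
```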
Automatic Linkage
The challenges:
–1) Identify the same person across censuses
–2) Deal with attribute characteristics
–3) Manage computational expense
The system is described on the following slides.
Data Cleaning and Standardization
Cleaning
–Names – remove non-alphanumeric characters; remove titles
–Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
–All attributes – deal with English/French notations (e.g. days/jours, married/mariée)
Standardization
–Birthplace codes and granularity
–Marital status
Computational Expense
–Very expensive to compare all the possible pairs of records
–Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
–Run-time estimate: (3.5M × 4M record pairs, two attributes per pair) / (4M pair comparisons per second) / 60 (s/min) / 60 (min/h) / 24 (h/day) ≈ 40.5 days. (Big Data)
Managing Computational Expense
Blocking
–By first letter of last name
–By birthplace
Using HPC
–Running the system on multiple processors in parallel
Record Comparison
Comparing strings
–Jaro-Winkler
–Edit distance
–Double Metaphone
Age
–± 2 years
Exact matches
–Gender
–Birthplace
Linkage Results
Province         Linkage Rate (%)
New Brunswick    24.45
Nova Scotia      21.50
Ontario          18.36
Quebec           17.45