HIST*4170 Data: Big and Small 29 January 2013. Today’s Agenda Blog Updates A Short Introduction to Databases A Big Data Project: People In Motion Special.

Presentation transcript:

HIST*4170 Data: Big and Small 29 January 2013

Today’s Agenda
Blog Updates
A Short Introduction to Databases
A Big Data Project: People In Motion
Special Guest: Dr. Rebecca Lenihan

Blog Highlights
Ambition
Consider scalability
Consider source availability: is there a local advantage?
Keep your eye on the academic value
What do you want to teach? Learn?
Themes: war, sport, family, mapping
Intellectual property/privacy
Resources: Google SketchUp (to make 3D buildings)

Data Deluge
Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
Library of Congress = 200 terabytes
“Transferring ‘Libraries of Congress’ of Data”
IP traffic is around 667 exabytes
It’s a deluge...
Ian Milligan, “Preparing for the Infinite Archive: Social Historians and the Looming Digital Deluge” (Mar 23, Tri-U history conference)
“Big Data”: too large for current software to handle
Don’t be intimidated
Not all DH sources (yet)
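To put the two figures on this slide on the same scale, here is a small sketch of the unit arithmetic. It assumes decimal (SI) prefixes, where each step up is a factor of 1000; the helper name `to_bytes` is invented for illustration.

```python
# Decimal (SI) prefixes: each step up the list is a factor of 1000.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]


def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes (decimal prefixes assumed)."""
    return value * 1000 ** UNITS.index(unit)


library_of_congress = to_bytes(200, "terabyte")  # figure quoted on the slide
ip_traffic = to_bytes(667, "exabyte")            # figure quoted on the slide

# How many "Libraries of Congress" fit into the quoted IP traffic figure?
libraries = ip_traffic / library_of_congress
print(f"{libraries:,.0f} Libraries of Congress")
```

With the slide's numbers, the quoted IP traffic works out to a little over three million Libraries of Congress.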

Introduction to Databases
Database: a system that allows for the efficient storage and retrieval of information
We associate with... Computers changed a lot
Problems: organization and efficient retrieval
Organization = requires data structure
Efficient retrieval = requires algorithms
Potential for the humanities? ...new problems, questions, visualization, and objects worthy of study and reflection.

Database Design
The purpose of a database is to store information about a particular domain and to allow one to ask questions about the state of that domain.
Relational databases are more efficient because they store information separately
– Attributes
– Relationships
Quamen reading is a nice introduction
Not as complicated as you might think, but following rules is important
We will apply...
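As a rough illustration of storing information separately and then asking a question of it, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and the sample person are all invented for this example and do not come from the course materials.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables: people are stored once, and each census appearance
# refers back to a person through person_id (the relationship).
conn.executescript("""
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    last_name  TEXT,
    first_name TEXT
);
CREATE TABLE census_entry (
    entry_id    INTEGER PRIMARY KEY,
    person_id   INTEGER REFERENCES person(person_id),
    census_year INTEGER,
    age         INTEGER
);
""")

# Invented sample data.
conn.execute("INSERT INTO person VALUES (1, 'Smith', 'Mary')")
conn.executemany(
    "INSERT INTO census_entry VALUES (?, ?, ?, ?)",
    [(1, 1, 1871, 24), (2, 1, 1881, 34)],
)

# Ask a question: in which censuses does each person appear, and at what age?
rows = conn.execute("""
    SELECT p.first_name, p.last_name, c.census_year, c.age
    FROM person AS p
    JOIN census_entry AS c ON c.person_id = p.person_id
    ORDER BY c.census_year
""").fetchall()

for row in rows:
    print(row)
```

The person's name is stored only once; each census appearance is a separate row that points back to it.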

New approach: Crowdsourcing
An “online, distributed problem-solving and production model.”
Daren C. Brabham (2008), “Crowdsourcing as a Model for Problem Solving: An Introduction and Cases”, Convergence: The International Journal of Research into New Media Technologies 14 (1): 75–90
Cited in Wikipedia, where “Anyone with Internet access can write and make changes to Wikipedia articles...”
reCAPTCHA (Luis von Ahn)
Others... Google?

There are limitations...
Organization
Quality Control
Selection

A Database for Your Project?
Think about how you might use a database, but perhaps not too big!
Databases can be very small and still be DH-worthy
Are there public docs out there that you can digest?
Google Refine
Incorporate a search function into your website?
Resources
– MS Excel (spreadsheet)
– MS Access (relational database)
– Google Refine (cleaning data)

Assignment for Next Week
Reading: TBD (3D guns?)
Help someone else out with their project
– Read their blog
– Comment and provide detailed feedback
Find a collaborator?

People in Motion: Creating Longitudinal Data from Canadian Historical Census

‘Unbiased’ links connecting individuals/households over several census years
A comprehensive infrastructure of longitudinal data
What we are working towards
Censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916
US censuses: 1880, 1900

Current Work
Automatic Linking
100% of 1871 Census
100% of 1881 Census
4,277,807 records
3,601,663 records
Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

Existing (True) Links
Ontario Industrial Proprietors – 8429 links
Logan Township – 1760 links
St. James Church, Toronto – 232 links
Quebec City Boys – 1403 links
Bias concerns
– family context
– others?
Figure labels: Logan Twp, Guelph

Attributes for Automatic Linking
Last Name – string
First Name – string
Gender – binary
Birthplace – code
Age – number
Marital status – single, married, divorced, widowed, unknown
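One way to picture these attributes as a data structure is the sketch below. The class and field names, the "M"/"F" gender encoding, the string birthplace code, and the sample person are all assumptions made for illustration; the slides do not say how the project stores its records.

```python
from dataclasses import dataclass
from enum import Enum


class MaritalStatus(Enum):
    """The five marital status values listed on the slide."""
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    last_name: str    # string
    first_name: str   # string
    gender: str       # binary on the slide; "M"/"F" is an assumed encoding
    birthplace: str   # a code on the slide; the code format here is assumed
    age: float        # number (fractions of a year allow for infants)
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN


# Invented example record.
record = CensusRecord("Smith", "Mary", "F", "ON", 24)
print(record)
```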

Automatic Linkage
The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
The system:

Data Cleaning and Standardization
Cleaning
– Names: remove non-alphanumeric characters; remove titles
– Age: transform non-numerical representations to corresponding numbers (e.g. 3 months)
– All attributes: deal with English/French notations (e.g. days/jours, married/mariée)
Standardization
– Birthplace codes and granularity
– Marital status
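The cleaning steps above might look something like the following sketch. The list of titles, the French unit words, and the choice to express ages as fractions of a year are assumptions made for illustration; the project's actual rules are not given in the slides.

```python
import re

# Assumed, illustrative lookups; the project's real lists are not in the slides.
TITLES = {"mr", "mrs", "miss", "dr", "rev"}
FRENCH_UNITS = {"jours": "days", "mois": "months", "ans": "years"}
UNITS_PER_YEAR = {"days": 365, "months": 12, "years": 1}


def clean_name(raw):
    """Remove non-alphanumeric characters, then drop titles."""
    # \w is Unicode-aware, so accented letters such as "é" are kept.
    stripped = re.sub(r"[^\w\s]|_", "", raw)
    words = [w for w in stripped.split() if w.lower() not in TITLES]
    return " ".join(words)


def clean_age(raw):
    """Turn entries such as '3 months' or '6 mois' into an age in years.

    Returns None when the entry cannot be interpreted.
    """
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", raw.strip().lower())
    if match is None:
        return None
    number = int(match.group(1))
    unit = match.group(2) or "years"      # a bare number is read as years
    unit = FRENCH_UNITS.get(unit, unit)   # map French notation to English
    if unit not in UNITS_PER_YEAR:
        return None
    return number / UNITS_PER_YEAR[unit]


print(clean_name("Dr. John O'Brien"))
print(clean_age("3 months"), clean_age("6 mois"), clean_age("34"))
```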

Computational Expense
Very expensive to compare all the possible pairs of records
Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
Run-time estimate: ((3.5M × 4M) record pairs × 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = 40.5 days
(Big Data)
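The run-time estimate can be checked with a few lines of arithmetic. In the sketch below the function name and parameters are invented for illustration and the inputs are the rounded figures from the slide. Evaluated as written, with 2 attributes per pair, the expression comes to about 81 days; the slide's 40.5 days matches the same expression with one attribute comparison per pair. Which reading was intended is not clear from the slide.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def runtime_days(records_a, records_b, attributes, comparisons_per_second):
    """Days needed to compare every record in A with every record in B."""
    comparisons = records_a * records_b * attributes
    return comparisons / comparisons_per_second / SECONDS_PER_DAY


# Rounded figures from the slide: 3.5M records (1871), 4M records (1881),
# and 4M comparisons per second.
two_attributes = runtime_days(3.5e6, 4.0e6, 2, 4.0e6)
one_attribute = runtime_days(3.5e6, 4.0e6, 1, 4.0e6)

print(f"2 attributes per pair: {two_attributes:.1f} days")
print(f"1 attribute per pair:  {one_attribute:.1f} days")
```

Either way the point of the slide stands: comparing every possible pair takes weeks to months on a single processor.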

Managing Computational Expense
Blocking
– By first letter of last name
– By birthplace
Using HPC
– Running the system on multiple processors in parallel
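Blocking can be sketched as grouping records by a key and only comparing records that share it. The example below combines both blocking criteria from the slide into a single key, which is an assumption; the project may apply them separately. The function names, field names, and sample records are invented for illustration.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(file_a, file_b):
    """Yield only the pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for rec_b in file_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in file_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Invented sample records standing in for two census files.
census_a = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]
census_b = [
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Stewart", "birthplace": "NS"},
    {"last_name": "Tremblay", "birthplace": "QC"},
    {"last_name": "Brown", "birthplace": "ON"},
]

pairs = list(candidate_pairs(census_a, census_b))
print(len(pairs), "candidate pairs instead of", len(census_a) * len(census_b))
```

Because each block can be compared without reference to any other, the blocks are also a natural unit of work to hand to separate processors.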

Record Comparison
Comparing strings
– Jaro-Winkler
– Edit distance
– Double Metaphone
Age
– +/- 2 years
Exact matches
– Gender
– Birthplace
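The comparisons on this slide can be sketched in plain Python as follows. Edit distance and Jaro-Winkler are written out in their textbook forms; Double Metaphone is left out because it is too long for a short example. Applying the age tolerance to the expected ten-year gap between the 1871 and 1881 censuses is an assumption, as are the function names, field names, and sample records.

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (c1 != c2)))   # substitution
        previous = current
    return previous[-1]


def jaro(s1, s2):
    """Jaro similarity between 0.0 (no similarity) and 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def compare(rec_a, rec_b, years_between=10, age_tolerance=2):
    """Compare two records attribute by attribute (assumed field names)."""
    age_gap = rec_b["age"] - rec_a["age"]
    return {
        "last_name_jw": jaro_winkler(rec_a["last_name"], rec_b["last_name"]),
        "last_name_edit": edit_distance(rec_a["last_name"], rec_b["last_name"]),
        "first_name_jw": jaro_winkler(rec_a["first_name"], rec_b["first_name"]),
        "age_ok": abs(age_gap - years_between) <= age_tolerance,
        "gender_match": rec_a["gender"] == rec_b["gender"],
        "birthplace_match": rec_a["birthplace"] == rec_b["birthplace"],
    }


# Invented sample records.
rec_1871 = {"last_name": "MacDonald", "first_name": "John",
            "gender": "M", "birthplace": "ON", "age": 24}
rec_1881 = {"last_name": "McDonald", "first_name": "John",
            "gender": "M", "birthplace": "ON", "age": 35}
features = compare(rec_1871, rec_1881)
print(features)
```

The function returns the individual comparison results rather than a yes/no decision, since the slides do not say how the project combines them into a link.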

Linkage Results
Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45