Big Data and Programming (History 9808A) 27 October 2014.

Today’s Agenda
- Proposals
  - How are we with the due date?
- A Short Introduction to Big Data
- A Big Data Project: People In Motion

Data Deluge
- Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
- Library of Congress = 200 terabytes
  - “Transferring ‘Libraries of Congress’ of Data”
- IP traffic is around 667 exabytes
- It’s a deluge...
- “Big Data”
  - too large for current software to handle
- Don’t be intimidated
- Not all DH sources are Big Data (yet)
- Instructive video: David McCandless, “The Beauty of Data Visualization”
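
To give a feel for the unit ladder above, here is a minimal Python sketch that converts the two figures quoted on the slide (200 terabytes and 667 exabytes) into bytes and compares them. It assumes decimal (SI) units, where each step up is a factor of 1,000; the names `UNITS` and `to_bytes` are made up for this illustration.

```python
# Decimal (SI) byte units: each step up the ladder is a factor of 1,000.
# (Assumption: the slide does not say whether decimal or binary units are meant.)
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]


def to_bytes(value, unit):
    """Convert a quantity in the named unit to a number of bytes."""
    return value * 1000 ** UNITS.index(unit)


library_of_congress = to_bytes(200, "terabyte")  # figure quoted on the slide
ip_traffic = to_bytes(667, "exabyte")            # figure quoted on the slide

# How many "Libraries of Congress" fit into the quoted IP traffic figure?
libraries = ip_traffic // library_of_congress
print(f"{libraries:,} Libraries of Congress")
```

With these two figures the ratio works out to a little over three million "Libraries of Congress".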

Big Data for History
- Tools for journalists, lit scholars and others
- Where does history fit in?
- “Digital history does not offer truths, but only a new way of interpreting and understanding traces of the past.” (S. Graham, I. Milligan, & S. Weingart)
- Blog Leaders
  - Taryn: “…we have to have a better understanding of how programming works so we can at least engage with Computer Scientists to help develop the complex systems required…”
  - Tamar: The Strange Case of Belgium/Ancestry.com
  - Nick K.: The Case of the Missing API

New approach: Crowdsourcing
- An “online, distributed problem-solving and production model.”
- Examples:
  - Wikipedia
  - reCAPTCHA (Luis von Ahn)
  - Others...
    - Transcribe Bentham
    - Census transcription

A Database for Your Project?
- Think about how you might use a database
  - but perhaps not too big!
  - Databases can be very small and still be DH-worthy
- Are there public docs out there that you can digest?
  - Google Refine
- Incorporate a search function into your website?
- Resources
  - MS Excel (spreadsheet)
  - MS Access (relational database)
  - Google Refine (cleaning data)

People in Motion: Longitudinal Data from the Canadian Census
A Big Data Project at the University of Guelph

What we are working towards
- A comprehensive infrastructure of longitudinal data
- ‘Unbiased’ links connecting individuals/households over several census years
- Censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916; US 1880, US 1900

Stage 1: 1871 to 1881
- Automatic Linking: 100% of 1871 Census, 100% of 1881 Census
- 4,277,807 records; 3,601,663 records
- Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

Teaching a Computer to be a Genealogist
- Training with existing manually-created (True) links
  - Ontario Industrial Proprietors – 8429 links
  - Logan Township – 1760 links
  - St. James Church, Toronto – 232 links
  - Quebec City Boys – 1403 links
- Bias concerns
  - Think of any?
[Map labels: Logan Twp, Guelph]

Attributes for Automatic Linking
- Last Name – string
- First Name – string
- Gender – binary
- Birthplace – code
- Age – number
- Marital status – single, married, divorced, widowed, unknown
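
One way to picture these six attributes is as a small record type. The sketch below is only an illustration: the class and field names, the `"M"`/`"F"` gender encoding, and the sample birthplace code are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class MaritalStatus(Enum):
    """The five marital-status values listed on the slide."""
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    """One person in one census (hypothetical layout)."""
    last_name: str        # string
    first_name: str       # string
    gender: str           # binary; encoded here as "M" or "F" (assumption)
    birthplace: str       # a code; "ON" below is a made-up example code
    age: float            # number; fractional for infants
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN


# A made-up example record.
person = CensusRecord("Smith", "John", "M", "ON", 30, MaritalStatus.MARRIED)
print(person)
```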

Automatic Linkage
- The challenges:
  1) Identify the same person
  2) Deal with attribute characteristics
  3) Manage computational expense
- The system: (diagram not reproduced in the transcript)

Data Cleaning and Standardization
- Cleaning
  - Names – remove non-alphanumeric characters; remove titles
  - Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  - All attributes – deal with English/French notations (e.g. days/jours, married/mariee)
- Standardization
  - Birthplace codes and granularity
  - Marital status
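
The cleaning steps above could look roughly like the following Python sketch. It is not the project's code: the title list, the English/French lookup tables, and the choice to express ages such as "3 months" as a fraction of a year are all assumptions made for illustration.

```python
import re

# Illustrative lookup tables (assumptions, deliberately incomplete).
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}
UNITS_PER_YEAR = {"day": 365, "days": 365, "jour": 365, "jours": 365,
                  "week": 52, "weeks": 52, "semaine": 52, "semaines": 52,
                  "month": 12, "months": 12, "mois": 12}
MARITAL = {"single": "single", "celibataire": "single",
           "married": "married", "marie": "married", "mariee": "married",
           "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
           "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced"}


def clean_name(raw):
    """Drop non-alphanumeric characters and titles; lower-case the rest."""
    kept = "".join(ch for ch in raw if ch.isalnum() or ch.isspace())
    words = kept.lower().split()
    return " ".join(w for w in words if w not in TITLES)


def clean_age(raw):
    """Return the age in years, or None if the text cannot be interpreted."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-z]*)", raw.strip().lower())
    if match is None:
        return None
    number, unit = float(match.group(1)), match.group(2)
    if unit == "":
        return number                      # plain number: already in years
    per_year = UNITS_PER_YEAR.get(unit)
    return None if per_year is None else number / per_year


def standardize_marital(raw):
    """Map English or French notations onto one vocabulary."""
    return MARITAL.get(raw.strip().lower(), "unknown")


print(clean_name("Mr. John O'Brien"))
print(clean_age("3 months"), clean_age("3 mois"))
print(standardize_marital("Mariee"))
```

Mapping anything unrecognized to `None` or `"unknown"` rather than guessing keeps bad transcriptions visible for later review.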

Computational Expense
- Very expensive to compare all the possible pairs of records
- Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
- Run-time estimate: ((3.5M × 4M) record pairs × 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 81 days. (Big Data)
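
The run-time estimate is easy to check by evaluating the expression directly. The sketch below uses the round figures from the slide; the function name `runtime_days` is made up for this illustration. With two attribute comparisons per record pair the expression comes to roughly 81 days, and with a single comparison per pair to roughly 40.5 days.

```python
def runtime_days(records_a, records_b, attributes, comparisons_per_second):
    """Days needed to compare every record in A with every record in B."""
    comparisons = records_a * records_b * attributes
    seconds = comparisons / comparisons_per_second
    return seconds / 60 / 60 / 24   # sec -> min -> hours -> days


# Round figures from the slide: 3.5M and 4M records, 4M comparisons per second.
two_attributes = runtime_days(3_500_000, 4_000_000, 2, 4_000_000)
one_attribute = runtime_days(3_500_000, 4_000_000, 1, 4_000_000)

print(f"2 attributes: {two_attributes:.1f} days")
print(f"1 attribute:  {one_attribute:.1f} days")
```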

Managing Computational Expense
- Blocking
  - By first letter of last name
  - By birthplace
- Using HPC
  - Running the system on multiple processors in parallel
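
Blocking can be sketched in a few lines of Python: records are grouped by a key, and only records that share a key are ever compared. The sketch assumes the two blocking criteria are combined into one key (first letter of last name plus birthplace); the dictionary layout and the sample records are invented for illustration.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of the last name plus birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(census_a, census_b):
    """Yield only the pairs whose records fall into the same block."""
    blocks = defaultdict(list)
    for rec_b in census_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in census_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Made-up toy data: 2 x 4 = 8 possible pairs without blocking.
census_1871 = [{"last_name": "Smith", "birthplace": "ON"},
               {"last_name": "Tremblay", "birthplace": "QC"}]
census_1881 = [{"last_name": "Smyth", "birthplace": "ON"},
               {"last_name": "Scott", "birthplace": "QC"},
               {"last_name": "Taylor", "birthplace": "QC"},
               {"last_name": "Brown", "birthplace": "ON"}]

pairs = list(candidate_pairs(census_1871, census_1881))
print(len(pairs), "candidate pairs instead of",
      len(census_1871) * len(census_1881))
```

The trade-off is that a true match whose blocking attribute was recorded differently in the two censuses (for example a misspelled first letter) will never be compared at all.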

Record Comparison
- Comparing strings
  - String measures: first letter, “edit distance”, sound
- Age
  - +/- 2 years
- Required exact matches
  - Gender
  - Birthplace
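
The comparison rules above can be sketched as follows. Several things here are assumptions rather than the project's method: Levenshtein distance stands in for "edit distance", Soundex stands in for the "sound" measure, the thresholds are arbitrary, the way the name measures are combined is a guess, and the age rule assumes a person should be about ten years older in the later census, give or take two years.

```python
SOUNDEX_CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def soundex(name):
    """Four-character Soundex code, e.g. 'Smith' and 'Smyth' both give S530."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last_digit = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != last_digit:
            code += digit
        if ch not in "hw":          # h and w do not separate repeated codes
            last_digit = digit
    return (code + "000")[:4]


def names_similar(a, b, max_distance=2):
    """Same first letter, and close in spelling or in sound."""
    a, b = a.lower(), b.lower()
    if a[:1] != b[:1]:
        return False
    return edit_distance(a, b) <= max_distance or soundex(a) == soundex(b)


def is_candidate_match(earlier, later, years_between=10, age_tolerance=2):
    """Apply exact-match, age, and name rules to a pair of records."""
    if earlier["gender"] != later["gender"]:
        return False
    if earlier["birthplace"] != later["birthplace"]:
        return False
    if abs(later["age"] - (earlier["age"] + years_between)) > age_tolerance:
        return False
    return (names_similar(earlier["last_name"], later["last_name"])
            and names_similar(earlier["first_name"], later["first_name"]))


# Made-up example records.
rec_1871 = {"first_name": "John", "last_name": "Smith",
            "gender": "M", "birthplace": "ON", "age": 30}
rec_1881 = {"first_name": "Jon", "last_name": "Smyth",
            "gender": "M", "birthplace": "ON", "age": 41}
print(is_candidate_match(rec_1871, rec_1881))
```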

Linkage Results
- Over 500,000 links…
- About 20%

Coding Playtime
- W3C tutorials
- The Programming Historian
- Codecademy