Big Data and Programming, 4 February 2015


Big Data and Programming 4 February 2015

Today’s Agenda
• A Short Introduction to Big Data
• A Big Data Project: People In Motion
• Next week
  • Meet Monday here at 2:30 for ca minutes
  • Meet Wednesday ca. 2:30-4:30 in Library 034a (north stairs, go to basement)

Data Deluge
• Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
• Library of Congress = 200 terabytes
  • “Transferring ‘Libraries of Congress’ of Data”
• IP traffic is around 667 exabytes
• It’s a deluge...
• “Big Data”
  • too large for current software to handle
• Don’t be intimidated
• Not all DH sources (yet)

Big Data for History
• Tools for journalists, literature scholars and others
• Where does history fit in?
  • Graham, Milligan, & Weingart
  • “Will Big Data have a revolutionary impact on the epistemological foundation of history?”
  • Will it get us closer to the past?
• Networks
  • A whole world of fun!
• Visualization is also a whole new world
  • See: David McCandless, “The Beauty of Data Visualization”
  • What does it tell us?

New approaches: Crowdsourcing
• An “online, distributed problem-solving and production model.”
• Examples:
  • Wikipedia
  • reCAPTCHA
  • Luis von Ahn
  • Others...

A Database for Your Project?
• Think about how you might use a database
  • but perhaps not too big!
  • Databases can be very small and still be DH-worthy
• Are there public docs out there that you can digest?
• Resources:
  • Programming Historian
  • MS Excel (spreadsheet), Access (relational database), Google Refine

People in Motion: Longitudinal Data from the Canadian Census A Big Data Project at the University of Guelph

What we are working towards
• ‘Unbiased’ links connecting individuals/households over several census years
• A comprehensive infrastructure of longitudinal data
• Censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916; US 1880, US 1900

Stage 1: 1871 to
• Automatic Linking
  • 100% of 1871 Census
  • 100% of 1881 Census
  • 4,277,807 records; 3,601,663 records
• Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

Teaching a Computer to be a Genealogist
• Training with existing manually-created (True) links
  • Ontario Industrial Proprietors – 8429 links
  • Logan Township – 1760 links
  • St. James Church, Toronto – 232 links
  • Quebec City Boys – 1403 links
• Bias concerns
  • Think of any?
[Map labels: Logan Twp, Guelph]

Attributes for Automatic Linking
• Last Name – string
• First Name – string
• Gender – binary
• Birthplace – code
• Age – number
• Marital status – single, married, divorced, widowed, unknown
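One way to picture the attributes above is as a small typed record. This is only a sketch: the class and field names, the "M"/"F" gender encoding, and the integer birthplace code are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class MaritalStatus(Enum):
    """The five marital-status values listed on the slide."""
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    """Hypothetical container for the six linking attributes."""
    last_name: str            # string
    first_name: str           # string
    gender: str               # binary; "M" or "F" is an assumed encoding
    birthplace_code: int      # code; the integer type is an assumption
    age: float                # number (a float allows ages under one year)
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN


# Made-up example record, not real census data.
example = CensusRecord("Smith", "John", "M", 15, 34, MaritalStatus.MARRIED)
print(example)
```

Marking the record as frozen keeps it immutable, which suits data that is read many times during comparison but should never be altered in place.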

Automatic Linkage
• The challenges:
  1) Identify the same person
  2) Deal with attribute characteristics
  3) Manage computational expense
• The system:

Data Cleaning and Standardization
• Cleaning
  • Names – remove non-alphanumeric characters; remove titles
  • Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  • All attributes – deal with English/French notations (e.g. days/jours, married/mariee)
• Standardization
  • Birthplace codes and granularity
  • Marital status
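The cleaning steps above might look roughly like this in Python. It is a minimal sketch, not the project's code: the title list, the French-to-English table, the unit conversions, and all function names are hypothetical, and real census data would need far larger lookup tables.

```python
# Hypothetical lookup tables; a real system would need many more entries.
TITLES = {"mr", "mrs", "miss", "dr", "rev"}
FRENCH_TO_ENGLISH = {
    "jours": "days",
    "mois": "months",
    "ans": "years",
    "mariee": "married",
}
UNIT_IN_YEARS = {"days": 1 / 365, "weeks": 1 / 52, "months": 1 / 12, "years": 1.0}


def clean_name(raw: str) -> str:
    """Lowercase a name, drop non-alphanumeric characters, remove titles."""
    kept = "".join(ch for ch in raw.lower() if ch.isalnum() or ch.isspace())
    return " ".join(word for word in kept.split() if word not in TITLES)


def normalize_term(term: str) -> str:
    """Map a French notation to its English equivalent when one is known."""
    term = term.strip().lower()
    return FRENCH_TO_ENGLISH.get(term, term)


def parse_age(raw: str) -> float:
    """Convert an age such as '34' or '3 months' into years."""
    parts = raw.strip().lower().split()
    if len(parts) == 1:
        return float(parts[0])
    number, unit = parts
    if not unit.endswith("s"):  # treat 'month' like 'months'
        unit += "s"
    return float(number) * UNIT_IN_YEARS[normalize_term(unit)]


print(clean_name("Mr. John O'Brien"))
print(parse_age("3 months"), parse_age("3 mois"))
print(normalize_term("Mariee"))
```

Using `str.isalnum` instead of an ASCII-only pattern keeps accented letters, which matters for French names.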

Computational Expense
• Very expensive to compare all the possible pairs of records
• Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
• Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = about 81 days, or 40.5 days per attribute compared. (Big Data)
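The run-time estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures stated on the slide; the function and variable names are made up for illustration. With one attribute per pair the formula gives about 40.5 days, and with two attributes about 81 days.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def runtime_days(n_left: int, n_right: int, attributes: int, rate: float) -> float:
    """Days needed to compare every record pair on the given number of attributes."""
    pairs = n_left * n_right
    return pairs * attributes / rate / SECONDS_PER_DAY


records_1871 = 3_500_000
records_1881 = 4_000_000
comparisons_per_second = 4_000_000

one_attribute = runtime_days(records_1871, records_1881, 1, comparisons_per_second)
two_attributes = runtime_days(records_1871, records_1881, 2, comparisons_per_second)

print(round(one_attribute, 1))   # 40.5
print(round(two_attributes, 1))  # 81.0
```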

Managing Computational Expense
• Blocking
  • By first letter of last name
  • By birthplace
• Using HPC
  • Running the system on multiple processors in parallel
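Blocking can be sketched as grouping records by a key and comparing only records that share it. The example below assumes both blocking criteria are combined into a single key (the slide does not say whether they are used together or separately), and the record layout, function names, and sample data are all made up.

```python
from collections import defaultdict


def blocking_key(record: dict) -> tuple:
    """Assumed key: first letter of the last name plus the birthplace."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(left: list, right: list):
    """Yield only the pairs whose records fall into the same block."""
    blocks = defaultdict(list)
    for record in right:
        blocks[blocking_key(record)].append(record)
    for record in left:
        for other in blocks.get(blocking_key(record), []):
            yield record, other


# Tiny invented samples standing in for the two censuses.
census_1871 = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]
census_1881 = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Stone", "birthplace": "QC"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]

pairs = list(candidate_pairs(census_1871, census_1881))
all_pairs = len(census_1871) * len(census_1881)
print(f"{len(pairs)} of {all_pairs} possible pairs are compared")
```

Because no pair crosses a block boundary, each block can be processed independently, which is what makes it practical to spread the work across multiple processors.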

Record Comparison
• Comparing Strings
  • String measures:
    • First letter, “edit distance”, sound
• Age
  • +/- 2 years
• Required exact matches
  • Gender
  • Birthplace
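The comparison rules above can be illustrated with a standard Levenshtein edit distance and a simple pass/fail check. This is a sketch under stated assumptions: the 0.75 name-similarity threshold, the reading of "+/- 2 years" as a tolerance around the expected age ten years later, and the record layout are all hypothetical. Sound-based (phonetic) measures are not shown.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Scale edit distance to a 0..1 similarity (1.0 means identical)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def is_candidate_link(earlier: dict, later: dict, years_between: int = 10,
                      age_tolerance: float = 2, min_similarity: float = 0.75) -> bool:
    """Apply exact-match rules, the age window, then the name comparison."""
    if earlier["gender"] != later["gender"]:
        return False
    if earlier["birthplace"] != later["birthplace"]:
        return False
    expected_age = earlier["age"] + years_between
    if abs(later["age"] - expected_age) > age_tolerance:
        return False
    return (name_similarity(earlier["last_name"], later["last_name"]) >= min_similarity
            and name_similarity(earlier["first_name"], later["first_name"]) >= min_similarity)


# Invented records for illustration only.
person_1871 = {"first_name": "John", "last_name": "Smith",
               "gender": "M", "birthplace": "ON", "age": 24}
person_1881 = {"first_name": "John", "last_name": "Smyth",
               "gender": "M", "birthplace": "ON", "age": 35}

print(edit_distance("smith", "smyth"))
print(is_candidate_link(person_1871, person_1881))
```

Checking the cheap exact-match attributes first means the more expensive string comparison runs only on pairs that could still be a link.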

Linkage Results
• Over 500,000 links…
• About 20%

Coding Workshop
• Go to
• Scroll down to “Goals”
• Pick one of the three activities
  • Animate your Name
  • About You
  • Sun, Earth and Code
• After 30 minutes, be prepared to present!