Lawrence Snyder, University of Washington, Seattle. © Lawrence Snyder 2004.



• “Big data” refers to the analysis of any large corpus of data to extract information, usually by means of statistical techniques
• We’ve already seen big data in action
  • In Target’s analysis of purchasing behavior to determine if a woman is pregnant
  • In Google’s PageRank, their estimate of how significant a page is
• Today we discuss big data … but there is a limit to how big it can get in a lecture!
12/3/2015 © Larry Snyder, CSE

• Data is everywhere:
  • All Facebook users constitute a data archive that Facebook analyzes continually
  • Google has crawled the WWW for years … their data is truly big!
  • Census Bureau
  • UW’s student database
  • Every company’s consumer records
  • Governmental DBs like licensing, tax revenues, etc.
  • …

• Data that is regularly gathered for some purpose typically contains a lot more information than the recorded numbers reveal ... and it’s often easy to get

• The data used for our “heat map” of birthdays came from birth certificate records
CDC birth data for the years
Processing: add by day, sort descending, plot
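The processing steps on this slide (add by day, sort descending) can be sketched in a few lines of Python. This is only an illustration: the `(year, month, day, count)` record layout and the sample numbers are assumptions made up for the example, not the CDC file format, and the plotting step is left out.

```python
from collections import Counter

def births_by_day(records):
    """Sum birth counts per (month, day) and rank days from most to least common.

    `records` is an iterable of (year, month, day, count) tuples. This layout
    is an assumption for illustration, not the CDC file format.
    """
    totals = Counter()
    for _year, month, day, count in records:
        totals[(month, day)] += count
    # Sort descending by total; ties fall back to calendar order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Made-up numbers, only to exercise the function.
sample = [
    (2001, 9, 16, 120), (2002, 9, 16, 130),
    (2001, 12, 25, 60), (2002, 12, 25, 65),
    (2001, 2, 14, 110), (2002, 2, 14, 115),
]
ranking = births_by_day(sample)
```

The resulting `ranking` list is what would be handed to a plotting tool or a spreadsheet to color the heat map.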

• Preference for birthdays …
• People love Valentine’s Day, and hate Halloween
• Compute average for each day around date, plot
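One way to read "compute average for each day around date" is to compare a day's count with the mean of its neighbors. The sketch below assumes that reading; the window size, the function names, and the sample counts are all hypothetical.

```python
def neighbor_average(counts, index, window=3):
    """Mean of up to `window` days on each side of `index`, excluding the day itself."""
    lo = max(0, index - window)
    hi = min(len(counts), index + window + 1)
    neighbors = counts[lo:index] + counts[index + 1:hi]
    return sum(neighbors) / len(neighbors)

def relative_preference(counts, index, window=3):
    """Ratio of a day's births to the average of nearby days (1.0 means typical)."""
    return counts[index] / neighbor_average(counts, index, window)

# Made-up daily birth counts; index 4 plays the role of the "popular" day.
daily = [100, 102, 98, 100, 110, 101, 99, 100, 100]
ratio = relative_preference(daily, 4)
```

A ratio above 1.0 suggests the day is preferred relative to its neighbors, and a ratio below 1.0 suggests it is avoided.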

• Plot raw data

• One thing you can do is figure out how often certain letters occur … good for “wink-comm”

• Counting the frequency of letters is more technically called “computing a 1-gram”
• More generally, computing an n-gram means counting up the frequency of sequences (of pretty much anything digital) of length n
• The 2-grams of this DNA: CGTTGACAACGT are: CG, GT, TT, TG, GA, AC, CA, AA, AC, CG, GT … so CG, GT & AC occur twice, others just once
• The 2-grams of words in “To be or not to be” are: to-be, be-or, or-not, not-to, to-be
• Etc.
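A minimal Python sketch of n-gram counting, applied to the DNA string and the phrase from this slide. The helper name `ngrams` is made up for the example.

```python
from collections import Counter

def ngrams(items, n):
    """Return every run of n consecutive items, in order, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Letter 2-grams of the DNA string from the slide.
dna_counts = Counter("".join(g) for g in ngrams("CGTTGACAACGT", 2))

# Word 2-grams of the phrase from the slide, lowercased and split on blanks.
words = "To be or not to be".lower().split()
word_counts = Counter("-".join(g) for g in ngrams(words, 2))
```

The same function works for letters, words, or any other sequence, which is the sense in which n-grams apply to "pretty much anything digital".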

• Google has a lot of text, and has compiled the n-grams for tokens (i.e. words, non-blank letter sequences followed by punctuation or blank)
  • Number of tokens: 1,024,908,267,229
  • Number of sentences: 95,119,665,584
  • Number of unigrams: 13,588,391
  • Number of bigrams: 314,843,401
  • Number of trigrams: 977,069,902
  • Number of fourgrams: 1,313,818,354
  • Number of fivegrams: 1,176,470,663

• Spelling correction software: Using an n-gram of letters, what’s wrong with “thniking”?
• Optical Character Recognition … if you have figured out “to be or not to” you might use word 2-grams starting with “to”
• Google will just show you cool plots …
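To make the spelling example concrete, here is a toy sketch that flags letter 2-grams never seen in a reference word list. The tiny vocabulary is made up for illustration; real spelling correctors rely on frequencies from large corpora rather than a simple seen/unseen test.

```python
def letter_bigrams(word):
    """All pairs of adjacent letters in a word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def unseen_bigrams(word, vocabulary):
    """Letter 2-grams of `word` that occur in no word of `vocabulary`."""
    known = {bg for w in vocabulary for bg in letter_bigrams(w)}
    return [bg for bg in letter_bigrams(word) if bg not in known]

# Hypothetical mini vocabulary, chosen only to exercise the function.
vocabulary = ["thinking", "think", "nick", "unit", "bike", "skip"]
suspicious = unseen_bigrams("thniking", vocabulary)
```

With this vocabulary the only flagged pair in “thniking” is “hn”, which points at the transposed letters.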


• CheapOair analyzed ticket prices for 4 million airline trips in 2013, starting from 320 days before the flight
• Fifty-four days before takeoff is, on average, when domestic airline tickets are at their absolute lowest price.
• Prime booking window: days before your trip … usually within $10 of best price


• Revolutionary War Boston club membership as relayed in Fischer’s book
• Find it in the appendix … type it in
• Analysis by Kieran Healy /archives/2013/06/09/using-metadata-to-find-paul-revere/

• A 254 x 7 table: colonist x organization
• Organize such data with spreadsheet software

• A 254 x 7 table: colonist x organization
Metadata

• Now, transpose the table: make columns rows, and rows columns
• A 7 x 254 table: organization x colonist

• Multiplying the 254 x 7 table by its 7 x 254 transpose produces a 254 x 254 table that shows, for any pair of people (one in the row and one in the column), how many associations they have in common!
• The non-zero entries indicate pairs that might be collaborators
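A small sketch of the transpose-and-multiply step, using a made-up 3 x 2 membership table in place of the 254 x 7 one. The helper names and the data are hypothetical, and plain Python lists stand in for a spreadsheet or a matrix library.

```python
def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    """Ordinary matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]

# Rows are people, columns are organizations; 1 means "is a member".
membership = [
    [1, 1],  # person A belongs to both organizations
    [1, 0],  # person B belongs to the first only
    [0, 1],  # person C belongs to the second only
]

# person x person: entry [i][j] counts organizations shared by persons i and j.
person_by_person = matmul(membership, transpose(membership))
```

The diagonal entries count how many organizations each person belongs to, and the off-diagonal entries count shared memberships. Here A shares one organization with B and one with C, while B and C share none.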

• Multiplying in the other order (transpose times table) produces an organization x organization table saying how many members each pair (row, column) has in common
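The organization x organization counts can also be computed directly from membership rosters, which may be easier to follow than the matrix form. The sketch below assumes each organization is stored as a set of member names; the club and member names are invented for the example.

```python
from itertools import combinations

def shared_members(rosters):
    """For every pair of organizations, count the members they have in common."""
    return {
        (a, b): len(rosters[a] & rosters[b])
        for a, b in combinations(sorted(rosters), 2)
    }

# Hypothetical rosters, not the historical data.
rosters = {
    "ClubX": {"A", "B"},
    "ClubY": {"A", "C"},
    "ClubZ": {"D"},
}
weights = shared_members(rosters)
```

Each count is the off-diagonal entry of the organization x organization table for that pair, and it can be used as an edge weight when drawing the organizations as a graph.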

 12/3/2015© Larry Snyder, CSE21 Edge weight (line thickness) is number of common members


• A person of suspicion!
• Used membership metadata, performed normal analysis on it, identified key player

• How likely is it that, in the graph of who’s connected to whom, a shortest path goes through a specific person? This is a measure of connectedness (betweenness centrality)
• Paul Revere is on 3839 shortest paths in the graph … he may be a guy to keep tabs on!
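For readers who want to see how such a score can be computed, here is a sketch of betweenness centrality for an unweighted, undirected graph using Brandes' algorithm. It is not necessarily the tool or the exact scoring used in the analysis cited in the lecture; the graph and node names are made up.

```python
from collections import deque

def betweenness(graph):
    """Betweenness centrality via Brandes' algorithm.

    `graph` maps each node to its neighbors and must be symmetric
    (undirected). When several shortest paths tie, each node gets a
    fractional share of the credit.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        paths = {v: 0 for v in graph}    # number of shortest paths from source
        dist = {v: -1 for v in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                     # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = {v: 0.0 for v in graph}
        for w in reversed(order):        # accumulate from the farthest nodes back
            for v in preds[w]:
                credit[v] += paths[v] / paths[w] * (1 + credit[w])
            if w != source:
                score[w] += credit[w]
    # Each undirected pair was counted from both endpoints.
    return {v: total / 2 for v, total in score.items()}

# Hypothetical star-shaped network: everyone is linked only through "hub".
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = betweenness(star)
```

In the star example every shortest path between two outer nodes passes through "hub", so it gets the highest score while the outer nodes score zero.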

• Data collections are everywhere
• Analyzing them can reveal amazing facts
• Forms of analysis
  • Many techniques reveal interesting results with very primitive tools
  • We saw sorting, plotting, averaging, matrix product, centrality measures
  • Statistical software already exists
• Mostly, the information can be “anonymized”

How can big data be useful to you?