Lecture 6: Data Quality and Pandas

CSE 482: Big Data Analysis Lecture 6: Data Quality and Pandas

Outline
So far, we have discussed:
  How to collect data?
  How to store data in a database?
This lecture:
  Problems with “raw” data
  Introduction to Pandas

Real Data are often Imperfect
Garbage In, Garbage Out

Data Quality Issues
What are the data quality issues and how do they arise in your data?
  Noise
  Outliers
  Missing values
  Duplicate data
How do these issues affect data analysis?
How should we address the issues?

Noise
Noise refers to incorrect or modified values of the data
E.g., distortion of a person’s voice when talking on a poor phone connection
[Figure: original audio versus noisy audio]

Example 1: Social Media Data
Using Twitter data for disaster management
Web site shows some of the tweets containing the word “wildfire”
Noise due to:
  typos, abbreviations, etc.
  data collection errors (e.g., a tweet about a song called Wildfire)

Example 2: Lake Nutrients Data
Measurements of total phosphorus in lakes from 17 states in the United States
Noise due to:
  Human entry mistakes
  Errors when combining data from multiple sources (different scales of measurement, incorrect data columns, etc.)

How To Deal with Noise?
Depends on the type of data
  For time series (e.g., audio, sensors, finance, etc.), low-pass filters are often used
  For document data, can use software (e.g., spell checkers, abbreviation expansion, etc.) or a lookup dictionary
Deal with noise in your analysis
  E.g., use probability models to capture uncertainties (noise) in the data
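As a rough illustration of low-pass filtering for a time series, the sketch below smooths a noisy signal with a centered moving average, which is one of the simplest low-pass filters. The signal, the noise level, and the window size are made-up assumptions for illustration, not values from the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical signal: a slow sine wave plus random high-frequency noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
clean = pd.Series(np.sin(t))
noisy = clean + rng.normal(scale=0.3, size=t.size)

# A centered moving average acts as a simple low-pass filter:
# it averages out rapid fluctuations and keeps the slow trend.
smoothed = noisy.rolling(window=15, center=True, min_periods=1).mean()

# Mean absolute error against the clean signal, before and after smoothing.
error_before = (noisy - clean).abs().mean()
error_after = (smoothed - clean).abs().mean()
print(f"error before: {error_before:.3f}, after: {error_after:.3f}")
```

A wider window removes more noise but also blurs genuine fast changes, so the window size is a trade-off that depends on the data.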

Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
Outliers may inadvertently increase the correlation of the data
[Figure: scatter plots annotated with their correlation values; one plot has the outlier marked]
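The effect of an outlier on correlation can be sketched with a tiny example. The numbers below are invented for illustration: five points with no linear relationship, then the same points plus one extreme value.

```python
import pandas as pd

# Hypothetical data: five points with no linear relationship.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 3.0, 1.0, 2.0])
r_without = x.corr(y)  # Pearson correlation

# Add one extreme point far from the rest.
x_out = pd.concat([x, pd.Series([20.0])], ignore_index=True)
y_out = pd.concat([y, pd.Series([20.0])], ignore_index=True)
r_with = x_out.corr(y_out)

print(f"without outlier: {r_without:.2f}")
print(f"with outlier:    {r_with:.2f}")
```

In this sketch a single point moves the correlation from essentially zero to close to one, which is why it is worth checking for outliers before interpreting a correlation.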

Outliers versus Noise
Outliers are legitimate data values
Unlike noise, they may represent interesting events or characteristics of the data
We will discuss techniques for detecting outliers in more detail later in the semester

Missing Values
Reasons for missing values
  Information is not collected (e.g., people decline to give their age and weight)
  Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values
  Eliminate data objects with missing values
  Estimate the missing values (imputation methods)
  Ignore the missing values during analysis
  Replace with all possible values (weighted by their probabilities)
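The first three handling strategies might look like the following in pandas. This is a minimal sketch on an invented table; mean imputation is only one of many possible imputation methods, and the fourth strategy (replacing with all possible values weighted by probability) is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; NaN marks a value that was not collected.
people = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "weight": [70.0, 82.0, np.nan, 64.0],
})

# 1. Eliminate data objects (rows) that contain any missing value.
dropped = people.dropna()

# 2. Estimate the missing values; here, a simple mean imputation per column.
imputed = people.fillna(people.mean())

# 3. Ignore missing values during analysis; mean() skips NaN by default.
mean_age = people["age"].mean()

print(dropped)
print(imputed)
print(mean_age)
```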

Duplicate Data
Data set may include data objects that are duplicates
Major problem when merging data from multiple sources
Examples:
  Product table: [figure]
  Paper citations:
    L. Breiman, L. Friedman, and P. Stone, (1984) Classication and Regression. Wadsworth, Belmont, CA.
    Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole, 1984.
Deduplication/Entity Resolution
  Process of dealing with duplicate data issues

Deduplication Methods
Distance-based (for field matching)
  Example: edit distance (number of inserts, deletes, and substitutions needed to convert one string into another)
  Other measures: affine gap, Smith-Waterman, etc.
Clustering-based
  Extends distance/similarity-based methods beyond pairwise comparison
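Edit distance as described above can be computed with the standard dynamic-programming recurrence. The function below is a plain-Python sketch (the name `edit_distance` is chosen here for illustration), applied to a textbook pair of words and to the two versions of the book title from the citation example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, and
    substitutions needed to turn string a into string b."""
    # prev[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
print(edit_distance("Classication and Regression",
                    "Classification and Regression Trees"))
```

In field matching, a small distance relative to the string lengths suggests that two fields may refer to the same entity; the threshold to use is a judgment call that depends on the data.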

Python Pandas
An important Python library for data analysis
Primary data structures
  Series: a one-dimensional array-like object
  DataFrame: a tabular, spreadsheet-like data structure containing a collection of ordered columns, each of which can be a different type (numeric, string, boolean, etc.)
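A minimal sketch of the two data structures, using invented values (the names, dates, and prices below are not from the lecture):

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
prices = pd.Series([31.5, 32.4, 33.1],
                   index=["1/3/2017", "1/4/2017", "1/5/2017"])
print(prices.index.tolist())
print(prices.values)

# A DataFrame: ordered columns that may each hold a different type.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],    # string
    "Age": [23, 35, 41],               # integer
    "Student": [True, False, False],   # boolean
})
print(people.dtypes)
```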

Series
[Figure: a Series with its index and values labeled]

Series

Series
Size of vector (number of elements)
Size of matrix (number of rows and columns)
Find the number of days on which the stock price is higher than 32
Count the number of non-NA and non-NULL values
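The operations listed on this slide could be written as follows. This is a sketch on invented prices; the attribute and method choices (`size`, `shape`, a boolean comparison with `sum`, and `count`) are an assumption about which pandas calls correspond to each description.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices; one value is missing.
prices = pd.Series(
    [31.5, 32.4, np.nan, 33.1, 31.9],
    index=["1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017", "1/9/2017"],
)

print(prices.size)            # number of elements, including missing ones
print(prices.shape)           # dimensions as a tuple; one entry for a Series
print((prices > 32).sum())    # number of days with a price higher than 32
print(prices.count())         # number of non-missing values
```

The comparison `prices > 32` yields a boolean Series, and summing it counts the `True` entries; a missing value compares as `False`.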

Series
Create a new Series that contains the changes in the stock price
Check whether the stock price change is more than $0.50
Count the number of days on which the stock price changed by more than $0.50
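One way to express these three steps is with `diff()`. The prices are invented, and treating a "change" as a move in either direction (absolute value) is an assumption made here.

```python
import pandas as pd

prices = pd.Series(
    [31.5, 32.4, 32.2, 33.1, 32.9],
    index=["1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017", "1/9/2017"],
)

# Day-over-day change; the first entry is NaN because it has no previous day.
change = prices.diff()

# Boolean Series: did the price move by more than $0.50 in either direction?
big_move = change.abs() > 0.50

# True counts as 1, so the sum is the number of such days.
print(big_move.sum())
```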

Series: Plotting and Aggregates

Series: Identifying Outliers

Assuming the data come from a normal (Gaussian) probability distribution
The further a value is from the center, the more likely it is to be an outlier
Calculate the Z-score of the variable X:
$$Z = \frac{X - \mathrm{mean}(X)}{\mathrm{std\_dev}(X)}$$
The Z-score tells us how far we are from the center of the distribution of X, measured in standard deviations

Find the days on which the stock price is more than 2 standard deviations away from its mean value
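A sketch of the Z-score rule on invented prices. Note that `std()` in pandas computes the sample standard deviation (`ddof=1`) by default; whether the lecture used the sample or population version is not stated.

```python
import pandas as pd

# Hypothetical dates and prices; the last price is deliberately extreme.
dates = ["1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017", "1/9/2017",
         "1/10/2017", "1/11/2017", "1/12/2017", "1/13/2017", "1/17/2017"]
prices = pd.Series(
    [30.0, 31.0, 30.0, 29.0, 30.0, 31.0, 29.0, 30.0, 30.0, 40.0],
    index=dates,
)

# Z-score of every price relative to the mean and standard deviation.
z = (prices - prices.mean()) / prices.std()

# Keep the days whose price is more than 2 standard deviations from the mean.
outlier_days = prices[z.abs() > 2]
print(outlier_days)
```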

Series: Dealing with Missing Values
Set the price for 1/12/2017 to a missing value
Count the number of non-missing values
Check whether a value is missing

Series: Dealing with Missing Values
Discard data with missing values
For a complete list of methods available for Pandas Series, see the pandas API documentation
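The missing-value steps for a Series might be written as below. The prices are invented, and the index uses plain date strings for simplicity, which is an assumption about how the original data were indexed.

```python
import numpy as np
import pandas as pd

prices = pd.Series(
    [32.1, 32.4, 31.9, 32.8],
    index=["1/10/2017", "1/11/2017", "1/12/2017", "1/13/2017"],
)

prices["1/12/2017"] = np.nan   # set the price for 1/12/2017 to a missing value
print(prices.count())          # number of non-missing values
print(prices.isna())           # True where the value is missing
cleaned = prices.dropna()      # discard entries with missing values
print(cleaned)
```

`dropna()` returns a new Series and leaves `prices` unchanged.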

DataFrame
You can consider the data frame as a dictionary that contains 3 Series: Age, Height, and Name

DataFrame: Selection and Indexing
The Age column is a Series object, indexed by the row index
Each row is a Series object, with values indexed by column name
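A small sketch of the dictionary-of-Series view and of column and row selection. The column names follow the slides; the values are invented.

```python
import pandas as pd

# Hypothetical values; the column names follow the slides.
people = pd.DataFrame({
    "Age": pd.Series([23, 35, 41]),
    "Height": pd.Series([5.5, 6.1, 5.8]),   # in feet
    "Name": pd.Series(["Ann", "Bob", "Cara"]),
})

age = people["Age"]    # a column is a Series indexed by the row index
row = people.loc[1]    # a row is a Series indexed by column name
print(type(age).__name__, age.index.tolist())
print(type(row).__name__, row.index.tolist())
print(row["Name"])
```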

[Figure: a DataFrame with its column index, row index, and values labeled]
Size of data matrix (number of elements)

DataFrame: Sizing and Transposing
Transpose (T) operation: Flip the rows and columns of the data frame
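Sizing and transposing can be sketched on a small invented table with four rows and three columns:

```python
import pandas as pd

people = pd.DataFrame({
    "Age": [23, 35, 41, 29],
    "Height": [5.5, 6.1, 5.8, 5.2],
    "Name": ["Ann", "Bob", "Cara", "Dan"],
})

print(people.shape)    # (number of rows, number of columns)
print(people.size)     # total number of elements: rows times columns
flipped = people.T     # rows become columns and columns become rows
print(flipped.shape)
print(flipped)
```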

DataFrame: Transformation
Converting height from feet into meters
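A transformation like this is a single vectorized expression on the column. The sketch below stores the result in a new column, `Height_m` (a name chosen here for illustration), rather than overwriting the original; which of the two the lecture did is not known.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Ann", "Bob"], "Height": [5.0, 6.0]})  # feet

FEET_TO_METERS = 0.3048  # one international foot is exactly 0.3048 m
people["Height_m"] = people["Height"] * FEET_TO_METERS
print(people)
```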

DataFrame: Aggregate Function

DataFrame: Group By

DataFrame: Describe

DataFrame: Sorting

DataFrame: Standardization
Convert each numeric attribute into a Z-score
We can use the standardized values to detect outliers
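Standardizing every numeric column at once might look like this. It is a sketch on invented data; `select_dtypes` is used here to skip the non-numeric `Name` column, which is one possible way to restrict the computation.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dan"],
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Height": [5.0, 5.5, 6.0, 6.5],
})

# Standardize only the numeric columns; the arithmetic applies column by column.
numeric = people.select_dtypes(include="number")
standardized = (numeric - numeric.mean()) / numeric.std()
print(standardized)
```

After standardization each column has mean 0 and standard deviation 1, so the same threshold on the absolute value can be applied to every attribute.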

DataFrame: Missing Values

DataFrame: Duplicate Detection
Create a duplicate row
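A sketch of creating and then detecting a duplicate row, on invented data:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Age": [23, 35, 41],
})

# Create a duplicate row by appending a copy of the first row.
with_dup = pd.concat([people, people.iloc[[0]]], ignore_index=True)

# duplicated() marks every repeat of an earlier row as True.
print(with_dup.duplicated())

# drop_duplicates() keeps the first occurrence of each row.
deduped = with_dup.drop_duplicates()
print(deduped)
```

These methods only catch exact matches; near-duplicates such as the two citation strings shown earlier need a similarity measure instead.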

Example: 2016 Campaign Donation
Data are available from the Federal Election Commission website
In this example, we focus on 2016 Michigan data

Example

Example
This will iterate through the columns and display their names, number of missing values, and types of values
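A loop of this kind could be sketched as follows. The table is a tiny stand-in with hypothetical column names and values, not the actual Federal Election Commission schema, and reporting the pandas `dtype` is an assumption about what "types of values" meant.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the donation table; column names are hypothetical.
donations = pd.DataFrame({
    "contributor": ["A. Smith", "B. Jones", None],
    "city": ["LANSING", None, "DETROIT"],
    "amount": [250.0, -100.0, np.nan],
})

# Report the name, number of missing values, and dtype of each column.
for name in donations.columns:
    column = donations[name]
    n_missing = column.isna().sum()
    print(f"{name}: {n_missing} missing, dtype={column.dtype}")
```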

Example
Some campaign contribution amounts are negative!
These correspond to refunds
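Negative amounts can be separated out with a boolean filter. The column name `amount` and the values are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the donation table.
donations = pd.DataFrame({
    "contributor": ["A. Smith", "B. Jones", "C. Lee"],
    "amount": [250.0, -100.0, 50.0],
})

refunds = donations[donations["amount"] < 0]     # negative amounts (refunds)
positive = donations[donations["amount"] > 0]    # contributions only
print(len(refunds), positive["amount"].sum())
```

Whether refunds should be dropped or netted against contributions depends on the question being asked of the data.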

Example

Summary
In this lecture, we reviewed:
  Data quality issues
  Python pandas library
Next lecture:
  Data preprocessing
