Lecture 6: Data Quality and Pandas CSE 482: Big Data Analysis
Outline So far, we have discussed: How to collect data? How to store data in a database? This lecture: Problems with “raw” data Introduction to Pandas
Real Data are often Imperfect Garbage In, Garbage Out Image source: http://www.lovemytool.com/.a/6a00e008d957708834013484842699970c-pi
Data Quality Issues What are the data quality issues and how do they arise in your data? Noise Outliers Missing values Duplicate data How do these issues affect data analysis? How should we address the issues?
Noise Noise refers to incorrect/modified values of the data E.g., distortion of a person’s voice when talking on a poor phone Noisy Audio Original audio Audio source: http://www.nasa.gov/mission_pages/apollo/apollo11_audio.html
Example 1: Social Media Data Using Twitter data for disaster management Web site shows some of the tweets containing the word “wildfire” Noise due to typos, abbreviations, etc. Data collection error (e.g., tweet about a song called Wildfire)
Example 2: Lake Nutrients Data Measurements of total phosphorus in lakes from 17 states in the United States Noise due to human entry mistakes Errors when combining data from multiple sources (different scales of measurements, incorrect data columns, etc.)
How To Deal with Noise? Depends on type of data For time series (e.g., audio, sensors, finance, etc.), low pass filters are often used For document data, can use software (e.g., spell checkers, abbreviation expansion, etc.) or a lookup dictionary Deal with noise in your analysis E.g., use probability models to capture uncertainties (noise) in data
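As one illustration of the low pass filter idea for time series, a moving average is among the simplest such filters: each output value is the mean of a short window of neighboring readings, which damps rapid fluctuations. The sketch below is only an assumption about how this might be done; the function name `moving_average`, the window size, and the readings are made up for illustration.

```python
def moving_average(values, window=3):
    """Smooth a sequence by averaging each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]
        smoothed.append(sum(chunk) / window)
    return smoothed


# Hypothetical noisy sensor readings that jitter around 10.
noisy = [10.0, 10.6, 9.5, 10.4, 9.6, 10.5, 9.4, 10.3]
smooth = moving_average(noisy, window=3)

print("spread before:", max(noisy) - min(noisy))
print("spread after: ", max(smooth) - min(smooth))
```

A wider window smooths more aggressively but also blurs genuine short-lived changes, so the window size is a trade-off rather than a fixed rule.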
Outliers Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set Correlation = 0.6278 Outlier Outliers may inadvertently increase correlation of data Correlation = 0.0628
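The effect of a single outlier on correlation can be sketched with a small made-up data set. The numbers below are hypothetical and are not the data behind the correlation values quoted on the slide; the example uses the pandas `Series.corr` method, which computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical data: y shows no clear linear trend in x.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = pd.Series([2.0, 1.0, 2.0, 1.0, 2.0, 1.0])
r_without = x.corr(y)

# Append one extreme point that is far from the rest in both x and y.
x_out = pd.concat([x, pd.Series([20.0])], ignore_index=True)
y_out = pd.concat([y, pd.Series([20.0])], ignore_index=True)
r_with = x_out.corr(y_out)

print("correlation without outlier:", round(r_without, 4))
print("correlation with outlier:   ", round(r_with, 4))
```

The single extreme point dominates both means and both variances, so the correlation coefficient ends up describing the outlier more than the bulk of the data.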
Outliers versus Noise Outliers are legitimate data values Unlike noise, they may represent interesting events or characteristics of the data We will discuss techniques for detecting outliers later in the semester
Missing Values Reasons for missing values Information is not collected (e.g., people decline to give their age and weight) Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) Handling missing values Eliminate data objects with missing values Estimate the missing values (imputation methods) Ignore the missing value during analysis Replace with all possible values (weighted by their probabilities)
Duplicate Data Data set may include data objects that are duplicates Major problem when merging data from multiple sources Examples: Product table: Paper citations: L. Breiman, L. Friedman, and P. Stone, (1984). Classication and Regression. Wadsworth, Belmont, CA. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole, 1984. Deduplication/Entity Resolution Process of dealing with duplicate data issues
Deduplication Methods Distance-based (for field matching) Example: edit distance (# of inserts, deletes and substitutions needed to convert one string into another) Other measures: affine gap, Smith-Waterman, etc Clustering-based Extends distance/similarity-based beyond pairwise comparison
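The edit distance described above can be computed with the standard dynamic programming recurrence (often called Levenshtein distance). This is a minimal sketch rather than a production field matcher; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of inserts, deletes, and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
print(edit_distance("Classication", "Classification"))
```

A small distance between two field values (for example, two spellings of the same book title) suggests they may refer to the same entity, although the threshold for calling them duplicates has to be chosen per application.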
Python Pandas An important Python library for data analysis Primary data structures Series A one-dimensional array-like object DataFrame A tabular, spreadsheet-like data structure containing a collection of ordered columns, each of which can be a different type (numeric, string, boolean, etc)
Series Value Index
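A Series pairs each value with an index label. The sketch below builds one from made-up daily stock prices; the dates and prices are hypothetical and are used only to show the `values` and `index` attributes.

```python
import pandas as pd

# Hypothetical closing prices, labeled by date.
prices = pd.Series(
    [31.5, 31.8, 33.1, 32.9, 32.0],
    index=["1/9/2017", "1/10/2017", "1/11/2017", "1/12/2017", "1/13/2017"],
)

print(prices.values)        # the underlying array of values
print(prices.index)         # the labels attached to those values
print(prices["1/11/2017"])  # look up a value by its index label
```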
Series
Series
Series Size of vector (number of elements) Size of matrix (number of rows & columns) Find number of days in which stock price is higher than 32 Count number of non-NA and non-NULL values
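These size and counting operations might look as follows on a small Series of made-up prices (the data are hypothetical). Because a Series is one-dimensional, its `shape` is a tuple with a single entry.

```python
import pandas as pd

# Hypothetical closing prices, labeled by date.
prices = pd.Series(
    [31.5, 31.8, 33.1, 32.9, 32.0],
    index=["1/9/2017", "1/10/2017", "1/11/2017", "1/12/2017", "1/13/2017"],
)

n_elements = prices.size              # number of elements
dims = prices.shape                   # dimensions, here a 1-tuple
days_above_32 = (prices > 32).sum()   # True counts as 1 when summed
n_present = prices.count()            # number of non-missing values

print(n_elements, dims, days_above_32, n_present)
```

The comparison `prices > 32` produces a boolean Series, so summing it counts the days that satisfy the condition.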
Series Create a new Series that contains changes in the stock price Check whether stock price change is more than $0.50 Count the number of days in which stock price has changed by > $0.50
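One way to sketch these three steps uses `Series.diff`, which subtracts the previous element from each element. The prices are made up, and treating "changed by more than $0.50" as a change in either direction (an absolute value) is an assumption of this example.

```python
import pandas as pd

# Hypothetical closing prices, labeled by date.
prices = pd.Series(
    [31.5, 31.8, 33.1, 32.9, 32.0],
    index=["1/9/2017", "1/10/2017", "1/11/2017", "1/12/2017", "1/13/2017"],
)

changes = prices.diff()            # day-over-day change; first entry is NaN
big_move = changes.abs() > 0.50    # assumption: up or down moves both count
n_big_moves = big_move.sum()       # number of days with a large change

print(changes)
print(big_move)
print(n_big_moves)
```

The first day has no previous price, so its change is NaN, and the comparison treats it as False.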
Series: Plotting and Aggregates
Series: Identifying Outliers
Series: Identifying Outliers Assuming the data comes from a normal (Gaussian) probability distribution, the further a value is from the center, the more likely it is an outlier Calculate the Z-score of the variable X: $Z = \frac{X - \mathrm{mean}(X)}{\mathrm{std\_dev}(X)}$ The Z-score tells us how far we are from the center of the distribution for X, measured in standard deviations
Series: Identifying Outliers Find the days in which stock price is more than 2 standard deviations away from their mean value
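A minimal sketch of this Z-score filter is shown below. The prices are made up, with one deliberately extreme value, and pandas' `std` uses the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical prices: nine ordinary days and one extreme day.
prices = pd.Series([32.0, 31.8, 32.1, 31.9, 32.2, 32.0, 31.7, 32.3, 32.0, 40.0])

# Z-score: distance from the mean in units of standard deviation.
z = (prices - prices.mean()) / prices.std()

# Keep only the days whose price is more than 2 standard deviations away.
outliers = prices[z.abs() > 2]

print(outliers)
```

With very few data points no value can reach a Z-score of 2, because a single extreme value also inflates the standard deviation, which is why this example uses ten observations.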
Series: Dealing with Missing Values Set price for 1/12/2017 to missing value Count number of non-missing values Check whether value is missing
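These steps can be sketched as follows, using made-up prices and NumPy's `nan` as the missing-value marker.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, labeled by date.
prices = pd.Series(
    [31.5, 31.8, 33.1, 32.9, 32.0],
    index=["1/9/2017", "1/10/2017", "1/11/2017", "1/12/2017", "1/13/2017"],
)

prices["1/12/2017"] = np.nan    # mark the price for 1/12/2017 as missing
n_present = prices.count()      # count() skips missing values
is_missing = prices.isnull()    # True where the value is missing

print(n_present)
print(is_missing)
```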
Series: Dealing with Missing Values Discard data with missing values For a complete list of methods available for Pandas Series, go to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
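Discarding entries with missing values might look like the sketch below, which uses `Series.dropna` on made-up prices. Note that `dropna` returns a new Series and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices with one missing entry.
prices = pd.Series(
    [31.5, 31.8, 33.1, np.nan, 32.0],
    index=["1/9/2017", "1/10/2017", "1/11/2017", "1/12/2017", "1/13/2017"],
)

cleaned = prices.dropna()   # new Series without the missing entry

print(len(prices), len(cleaned))
print(cleaned)
```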
DataFrame You can consider the data frame as a dictionary that contains 3 Series: Age, Height, and Name
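The dictionary view of a data frame can be sketched directly: passing a dictionary of Series to the `DataFrame` constructor produces one column per key. The people and their values below are made up for illustration.

```python
import pandas as pd

# Each key becomes a column; each Series supplies that column's values.
people = pd.DataFrame({
    "Age": pd.Series([23, 35, 41]),
    "Height": pd.Series([5.5, 6.0, 5.8]),   # hypothetical heights in feet
    "Name": pd.Series(["Ann", "Bob", "Cara"]),
})

print(people)
print(people.dtypes)   # each column keeps its own type
```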
DataFrame: Selection and Indexing The Age column is a Series object, Indexed by the row index Each row is a Series object, Values are indexed by column name
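Both kinds of selection can be sketched on a small made-up data frame: selecting a column by name gives a Series indexed by row label, and selecting a row with `loc` gives a Series indexed by column name.

```python
import pandas as pd

# Hypothetical data frame with three columns.
people = pd.DataFrame({
    "Age": [23, 35, 41],
    "Height": [5.5, 6.0, 5.8],
    "Name": ["Ann", "Bob", "Cara"],
})

age = people["Age"]    # a column: Series indexed by the row index
row = people.loc[1]    # a row: Series indexed by column name

print(age)
print(row)
print(row["Name"])
```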
DataFrame: Selection and Indexing
DataFrame: Selection and Indexing Column index Row index Values Size of data matrix (number of elements)
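The attributes behind these labels can be sketched on a small made-up data frame: `columns` and `index` hold the column and row labels, `values` exposes the underlying array, and `size` is the total number of elements.

```python
import pandas as pd

# Hypothetical data frame with three rows and three columns.
people = pd.DataFrame({
    "Age": [23, 35, 41],
    "Height": [5.5, 6.0, 5.8],
    "Name": ["Ann", "Bob", "Cara"],
})

print(people.columns)   # column index
print(people.index)     # row index
print(people.values)    # values as a 2-D array
print(people.size)      # rows times columns
```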
DataFrame: Sizing and Transposing Transpose (T) operation: Flip the rows and columns of the data frame
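A short sketch of sizing and transposing, using made-up data with four rows and three columns so that the flip is visible in the shape:

```python
import pandas as pd

# Hypothetical data frame: 4 rows, 3 columns.
people = pd.DataFrame({
    "Age": [23, 35, 41, 29],
    "Height": [5.5, 6.0, 5.8, 5.2],
    "Name": ["Ann", "Bob", "Cara", "Dev"],
})

flipped = people.T   # rows become columns and columns become rows

print(people.shape)    # (rows, columns) of the original
print(flipped.shape)   # (rows, columns) after transposing
print(flipped)
```

After transposing, the former column names become the row index.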
DataFrame: Transformation Converting height from feet into meters
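The unit conversion can be sketched as an arithmetic operation on a whole column. The heights are made up; the factor 0.3048 is the number of meters in one foot.

```python
import pandas as pd

FEET_TO_METERS = 0.3048

# Hypothetical heights recorded in feet.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Height": [5.5, 6.0, 5.8],
})

# Multiplying a column by a scalar applies the operation to every row.
people["Height"] = people["Height"] * FEET_TO_METERS

print(people)
```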
DataFrame: Aggregate Function
DataFrame: Group By
DataFrame: Describe
DataFrame: Sorting
DataFrame: Standardization Convert each numeric attribute into a Z-score (subtract the column mean, then divide by the column standard deviation) We can use the standardized values to detect outliers
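Standardizing every numeric column at once might look like the sketch below. The data are made up, and `select_dtypes` is used here to skip the non-numeric Name column; pandas' `std` computes the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical data frame with two numeric columns and one text column.
people = pd.DataFrame({
    "Age": [23, 35, 41, 29],
    "Height": [5.5, 6.0, 5.8, 5.2],
    "Name": ["Ann", "Bob", "Cara", "Dev"],
})

numeric = people.select_dtypes(include="number")

# Column-wise Z-score: mean() and std() are computed per column and
# aligned with the columns of `numeric` during the arithmetic.
standardized = (numeric - numeric.mean()) / numeric.std()

print(standardized)
```

Each standardized column has mean 0 and standard deviation 1, so a threshold such as an absolute value above 2 means the same thing for every attribute.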
DataFrame: Missing Values
DataFrame: Missing Values
DataFrame: Missing Values
DataFrame: Duplicate Detection Create a duplicate row
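A sketch of creating and then detecting a duplicate row is shown below, using made-up data. `duplicated` flags rows that repeat an earlier row exactly, and `drop_duplicates` removes them.

```python
import pandas as pd

# Hypothetical data frame with three distinct rows.
people = pd.DataFrame({
    "Age": [23, 35, 41],
    "Height": [5.5, 6.0, 5.8],
    "Name": ["Ann", "Bob", "Cara"],
})

# Create a duplicate by appending a copy of the first row.
with_dup = pd.concat([people, people.iloc[[0]]], ignore_index=True)

dup_mask = with_dup.duplicated()       # True for the repeated row
deduped = with_dup.drop_duplicates()   # keeps the first occurrence

print(dup_mask)
print(deduped)
```

This only catches exact duplicates; near duplicates such as differently formatted citations need a similarity measure instead.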
Example: 2016 Campaign Donation Data are available from the Federal Election Commission website http://www.fec.gov/disclosurep/PDownload.do In this example, we focus on 2016 Michigan data
Example
Example
Example This will iterate through the columns and display their names, # missing values, types of values
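A loop of this kind might be sketched as follows. The data frame `donations` and its column names are hypothetical stand-ins, not the actual columns of the Federal Election Commission file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the donation data.
donations = pd.DataFrame({
    "donor": ["A. Smith", None, "C. Lee"],
    "amount": [50.0, -25.0, np.nan],
})

summary = {}
for col in donations.columns:
    n_missing = int(donations[col].isnull().sum())
    col_type = donations[col].dtype
    summary[col] = n_missing
    print(col, "missing:", n_missing, "type:", col_type)
```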
Example Some campaign contribution amounts are negative valued! This corresponds to refunds
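Negative contribution amounts can be isolated with a boolean filter. The sketch below uses a hypothetical data frame and a hypothetical column name `amount`; the actual column name in the Federal Election Commission file may differ.

```python
import pandas as pd

# Hypothetical stand-in for the donation data.
donations = pd.DataFrame({
    "donor": ["A. Smith", "B. Jones", "C. Lee"],
    "amount": [50.0, -25.0, 100.0],
})

refunds = donations[donations["amount"] < 0]   # rows with negative amounts
n_refunds = len(refunds)

print(refunds)
print(n_refunds)
```

Whether to keep, drop, or net out such rows depends on the question being asked, so it is worth inspecting them before aggregating amounts.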
Example
Summary In this lecture, we reviewed: Data quality issues Python pandas library Next lecture: Data preprocessing