1
Lecture 6: Data Quality and Pandas
CSE 482: Big Data Analysis
2
Outline
So far, we have discussed:
  How to collect data
  How to store data in a database
This lecture:
  Problems with “raw” data
  Introduction to Pandas
3
Real Data are often Imperfect
Garbage In, Garbage Out
4
Data Quality Issues
What are the data quality issues, and how do they arise in your data?
  Noise
  Outliers
  Missing values
  Duplicate data
How do these issues affect data analysis?
How should we address the issues?
5
Noise
Noise refers to incorrect or modified values in the data
E.g., distortion of a person’s voice when talking on a poor phone connection
[Audio clips: original audio and noisy audio]
6
Example 1: Social Media Data
Using Twitter data for disaster management
Web site shows some of the tweets containing the word “wildfire”
Noise due to
  typos, abbreviations, etc.
  data collection errors (e.g., a tweet about a song called Wildfire)
7
Example 2: Lake Nutrients Data
Measurements of total phosphorus in lakes from 17 states in the United States
Noise due to
  Human entry mistakes
  Errors when combining data from multiple sources (different scales of measurement, incorrect data columns, etc.)
8
How To Deal with Noise?
Depends on the type of data
  For time series (e.g., audio, sensors, finance, etc.), low-pass filters are often used
  For document data, can use software (e.g., spell checkers, abbreviation expansion, etc.) or a lookup dictionary
Deal with noise in your analysis
  E.g., use probability models to capture uncertainties (noise) in the data
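As a rough illustration of the low-pass filtering idea for time series, the sketch below smooths a noisy signal with a centered moving average. The signal, the noise level, and the window size of 9 are all made-up assumptions chosen for the example; a moving average is only one simple kind of low-pass filter.

```python
import numpy as np
import pandas as pd

# Hypothetical noisy signal: a slow sine wave plus random noise.
rng = np.random.default_rng(0)
t = np.arange(200)
clean = np.sin(2 * np.pi * t / 50)
noisy = pd.Series(clean + rng.normal(scale=0.3, size=t.size))

# A centered moving average acts as a simple low-pass filter:
# it averages out fast fluctuations and keeps the slow trend.
smoothed = noisy.rolling(window=9, center=True, min_periods=1).mean()

# Compare how far each series is from the clean signal on average.
err_noisy = np.abs(noisy - clean).mean()
err_smoothed = np.abs(smoothed - clean).mean()
print(err_noisy, err_smoothed)
```

A larger window removes more noise but also flattens real variation in the signal, so the window size is a trade-off.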
9
Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
Outliers may inadvertently increase the correlation of the data
[Figure: scatter plots comparing the correlation with and without an outlier]
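The effect of an outlier on correlation can be sketched with a small made-up data set. The values below are hypothetical and are not the data from the slide's figure; the point is only that one extreme point can turn a weak correlation into a strong one.

```python
import pandas as pd

# Hypothetical data with only a weak linear relationship.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
y = pd.Series([2, 1, 4, 3, 2, 4, 1, 3])
r_without = x.corr(y)  # Pearson correlation by default

# Add a single extreme point far from the rest of the data.
x_out = pd.concat([x, pd.Series([50])], ignore_index=True)
y_out = pd.concat([y, pd.Series([50])], ignore_index=True)
r_with = x_out.corr(y_out)

print(round(r_without, 2), round(r_with, 2))
```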
10
Outliers versus Noise Outliers are legitimate data values
Unlike noise, they may represent interesting events or characteristics of the data
We will discuss techniques for detecting outliers in more detail later in the semester
11
Missing Values
Reasons for missing values
  Information is not collected (e.g., people decline to give their age and weight)
  Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values
  Eliminate data objects with missing values
  Estimate the missing values (imputation methods)
  Ignore the missing value during analysis
  Replace with all possible values (weighted by their probabilities)
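The first three handling strategies can be sketched in pandas as follows. The column names and values are hypothetical, and mean imputation is just one of many imputation methods. The fourth strategy (replacing with all possible values) is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing entries (NaN).
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "weight": [70.0, 82.5, np.nan, 64.0],
})

# 1. Eliminate data objects (rows) that have any missing value.
dropped = df.dropna()

# 2. Estimate the missing values, here with the column mean.
imputed = df.fillna(df.mean())

# 3. Ignore missing values during analysis: mean() skips NaN by default.
mean_age = df["age"].mean()

print(dropped)
print(imputed)
print(mean_age)
```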
12
Duplicate Data
Data set may include data objects that are duplicates
  Major problem when merging data from multiple sources
Examples:
  Product table: [table image]
  Paper citations:
    L. Breiman, L. Friedman, and P. Stone, (1984) Classication and Regression. Wadsworth, Belmont, CA.
    Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole, 1984.
Deduplication/Entity Resolution
  Process of dealing with duplicate data issues
13
Deduplication Methods
Distance-based (for field matching)
  Example: edit distance (number of inserts, deletes, and substitutions needed to convert one string into another)
  Other measures: affine gap, Smith-Waterman, etc.
Clustering-based
  Extends distance/similarity-based methods beyond pairwise comparison
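Edit distance can be computed with a standard dynamic-programming recurrence. The function below is a minimal sketch written for illustration (the name `edit_distance` is our own, not from the lecture); it counts unit-cost inserts, deletes, and substitutions.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of inserts, deletes, and substitutions to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletes
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# Two spellings of a title word that differ by two missing letters.
print(edit_distance("Classication", "Classification"))
```

A small distance relative to the string length suggests that two fields may refer to the same entity.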
14
Python Pandas
An important Python library for data analysis
Primary data structures
  Series: a one-dimensional array-like object
  DataFrame: a tabular, spreadsheet-like data structure containing a collection of ordered columns, each of which can be a different type (numeric, string, boolean, etc.)
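A minimal sketch of the two data structures, using made-up values and labels:

```python
import pandas as pd

# Series: one-dimensional values, each with an index label.
s = pd.Series([3.5, 1.2, 4.8], index=["a", "b", "c"])

# DataFrame: ordered columns, each of which can have a different type.
df = pd.DataFrame({
    "count": [1, 2, 3],          # numeric
    "label": ["x", "y", "z"],    # string
    "flag": [True, False, True]  # boolean
})

print(s)
print(df.dtypes)
```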
15
Series
[Figure labels: Index, Value]
16
Series
17
Series
18
Series
Size of vector (number of elements)
Size of matrix (number of rows & columns)
Find the number of days in which the stock price is higher than 32
Count the number of non-NA and non-NULL values
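These operations might look like the following sketch. The stock prices and dates are hypothetical stand-ins, since the slide's data is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical daily stock prices; one value is missing.
dates = pd.date_range("2017-01-09", periods=5, freq="D")
price = pd.Series([31.5, 32.4, np.nan, 33.1, 31.9], index=dates)

n_elements = price.size            # number of elements, including missing
dims = price.shape                 # dimensions; a Series has only one
days_above_32 = (price > 32).sum() # True counts as 1; NaN > 32 is False
n_valid = price.count()            # number of non-missing values

print(n_elements, dims, days_above_32, n_valid)
```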
19
Series
Create a new Series that contains changes in the stock price
Check whether the stock price change is more than $0.50
Count the number of days in which the stock price has changed by > $0.50
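One way to sketch these steps uses `diff()` on a hypothetical price Series. The prices are made up, and treating a "change" as the absolute day-to-day difference is an assumption of this example.

```python
import pandas as pd

# Hypothetical daily stock prices.
dates = pd.date_range("2017-01-09", periods=5, freq="D")
price = pd.Series([31.5, 32.4, 32.2, 33.1, 31.9], index=dates)

# Day-to-day change; the first entry is NaN (no previous day).
change = price.diff()

# Boolean Series: did the price move by more than $0.50 in either direction?
big_move = change.abs() > 0.50

# Count the days with a large move (True counts as 1).
n_big = big_move.sum()

print(change)
print(n_big)
```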
20
Series: Plotting and Aggregates
21
Series: Identifying Outliers
22
Series: Identifying Outliers
Assuming the data comes from a normal (Gaussian) probability distribution
  The further a value is from the center, the more likely it is an outlier
Calculate the Z-score of the variable X
  $Z = \frac{X - \mathrm{mean}(X)}{\mathrm{std\_dev}(X)}$
The Z-score tells us how far we are from the center of the distribution for X, in units of standard deviations
23
Series: Identifying Outliers
Find the days in which the stock price is more than 2 standard deviations away from its mean value
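A sketch of Z-score based outlier detection on a hypothetical price Series follows. The prices are made up for illustration, and `std()` in pandas uses the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical daily stock prices with one unusually high value.
dates = pd.date_range("2017-01-02", periods=10, freq="D")
price = pd.Series([30, 31, 30, 32, 31, 30, 31, 45, 30, 31],
                  index=dates, dtype=float)

# Z-score: distance from the mean in units of standard deviation.
z = (price - price.mean()) / price.std()

# Keep only the days more than 2 standard deviations from the mean.
outliers = price[z.abs() > 2]

print(outliers)
```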
24
Series: Dealing with Missing Values
Set price for 1/12/2017 to missing value
Count number of non-missing values
Check whether value is missing
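These three steps could be sketched as below. The prices are hypothetical; only the date 1/12/2017 comes from the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical daily stock prices around 1/12/2017.
dates = pd.date_range("2017-01-10", periods=4, freq="D")
price = pd.Series([31.5, 32.4, 32.2, 33.1], index=dates)

# Set the price for 1/12/2017 to a missing value (NaN).
price.loc[pd.Timestamp("2017-01-12")] = np.nan

# Count the number of non-missing values.
n_valid = price.count()

# Check, element by element, whether each value is missing.
is_missing = price.isnull()

print(n_valid)
print(is_missing)
```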
25
Series: Dealing with Missing Values
Discard data with missing values
For a complete list of methods available for Pandas Series, go to
26
DataFrame
You can consider the data frame as a dictionary that contains 3 Series: Age, Height, and Name
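The dictionary-of-Series view can be sketched directly. The column names Age, Height, and Name come from the slide; the values and row labels are made up.

```python
import pandas as pd

# Three hypothetical Series that share the same row index.
rows = ["r1", "r2", "r3"]
age = pd.Series([23, 35, 41], index=rows)
height = pd.Series([5.5, 6.0, 5.8], index=rows)
name = pd.Series(["Ann", "Bob", "Cy"], index=rows)

# Build the DataFrame from a dictionary of column name -> Series.
df = pd.DataFrame({"Age": age, "Height": height, "Name": name})

print(df)
```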
27
DataFrame: Selection and Indexing
The Age column is a Series object, indexed by the row index
Each row is a Series object; its values are indexed by column name
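A short sketch of both kinds of selection, using hypothetical values and row labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [23, 35, 41], "Height": [5.5, 6.0, 5.8], "Name": ["Ann", "Bob", "Cy"]},
    index=["r1", "r2", "r3"],
)

# Selecting a column gives a Series indexed by the row index.
ages = df["Age"]

# Selecting a row by label gives a Series indexed by column name.
row = df.loc["r2"]

print(type(ages), list(ages.index))
print(type(row), list(row.index))
```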
28
DataFrame: Selection and Indexing
29
DataFrame: Selection and Indexing
Column index
Row index
Values
Size of data matrix (number of elements)
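The attributes that correspond to these labels can be sketched as follows, on a hypothetical data frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [23, 35, 41], "Height": [5.5, 6.0, 5.8], "Name": ["Ann", "Bob", "Cy"]},
    index=["r1", "r2", "r3"],
)

col_index = df.columns  # column index
row_index = df.index    # row index
values = df.values      # underlying values as a NumPy array
n_elements = df.size    # number of elements (rows x columns)

print(list(col_index), list(row_index), n_elements)
print(values)
```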
30
DataFrame: Sizing and Transposing
Transpose (T) operation: Flip the rows and columns of the data frame
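A small sketch of sizing and transposing, with made-up data chosen to have 4 rows and 3 columns so that the flip is visible in the shape:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [23, 35, 41, 29],
    "Height": [5.5, 6.0, 5.8, 5.2],
    "Name": ["Ann", "Bob", "Cy", "Dee"],
})

shape = df.shape    # (number of rows, number of columns)
flipped = df.T      # rows become columns and columns become rows

print(shape, flipped.shape)
print(flipped)
```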
31
DataFrame: Transformation
Converting height from feet into meters
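The conversion can be sketched as a column-wise multiplication, using the fact that 1 foot is exactly 0.3048 meters. The data and the new column name `Height_m` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Height": [5.5, 6.0, 5.8],  # in feet
})

FEET_TO_METERS = 0.3048

# Arithmetic on a column is applied to every element at once.
df["Height_m"] = df["Height"] * FEET_TO_METERS

print(df)
```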
32
DataFrame: Aggregate Function
33
DataFrame: Group By
34
DataFrame: Describe
35
DataFrame: Sorting
36
DataFrame: Standardization
Convert each numeric attribute into a Z-score
We can use the standardized values to detect outliers
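Standardizing every numeric column at once might look like the sketch below. The data is hypothetical, and selecting columns with `select_dtypes` is one possible way to skip non-numeric attributes such as names.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [23, 35, 41, 29],
    "Height": [5.5, 6.0, 5.8, 5.2],
    "Name": ["Ann", "Bob", "Cy", "Dee"],
})

# Keep only the numeric attributes.
numeric = df.select_dtypes(include="number")

# Column-wise Z-score: subtract each column's mean, divide by its std.
z = (numeric - numeric.mean()) / numeric.std()

print(z)
```

After standardization each column has mean 0 and standard deviation 1, so a threshold such as |Z| > 2 can be applied uniformly across attributes.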
37
DataFrame: Missing Values
38
DataFrame: Missing Values
39
DataFrame: Missing Values
40
DataFrame: Duplicate Detection
Create a duplicate row
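A sketch of creating a duplicate row and then detecting and removing it, using a hypothetical data frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [23, 35, 41],
    "Height": [5.5, 6.0, 5.8],
    "Name": ["Ann", "Bob", "Cy"],
})

# Create a duplicate row by appending a copy of the first row.
df_dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# duplicated() marks rows that repeat an earlier row.
mask = df_dup.duplicated()

# drop_duplicates() keeps the first occurrence of each row.
deduped = df_dup.drop_duplicates()

print(mask)
print(deduped)
```

Note that `duplicated()` only finds exact matches; near-duplicates such as differently formatted citations need a similarity measure like edit distance.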
41
Example: 2016 Campaign Donation
Data are available from the Federal Election Commission website
In this example, we focus on 2016 Michigan data
42
Example
43
Example
44
Example
This will iterate through the columns and display their names, number of missing values, and types of values
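A loop of this kind could be sketched as follows. The small data frame and its column names (`contributor`, `amount`) are hypothetical stand-ins, not the actual campaign donation file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the donation data.
df = pd.DataFrame({
    "contributor": ["A", None, "C"],
    "amount": [50.0, -25.0, np.nan],
})

summary = []
for col in df.columns:
    n_missing = int(df[col].isnull().sum())  # missing values in this column
    col_type = str(df[col].dtype)            # type of the column's values
    summary.append((col, n_missing, col_type))
    print(col, n_missing, col_type)
```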
45
Example
Some campaign contribution amounts are negative!
These correspond to refunds
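Negative amounts can be isolated with a boolean filter, as in this sketch. The data and the column name `amount` are hypothetical and do not reflect the real file's schema.

```python
import pandas as pd

# Hypothetical stand-in for the donation data.
df = pd.DataFrame({
    "contributor": ["A", "B", "C", "D"],
    "amount": [50.0, -25.0, 100.0, -10.0],
})

# Rows with a negative amount (interpreted as refunds).
refunds = df[df["amount"] < 0]

# Rows with a positive amount (actual contributions).
contributions = df[df["amount"] > 0]

print(refunds)
print(contributions["amount"].sum())
```

Whether to drop refunds or net them against contributions depends on the question being asked of the data.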
46
Example
47
Summary
In this lecture, we reviewed:
  Data quality issues
  Python pandas library
Next lecture:
  Data preprocessing