Lecture 6: Data Quality and Pandas


1 Lecture 6: Data Quality and Pandas
CSE 482: Big Data Analysis

2 Outline
So far, we have discussed: How to collect data? How to store data in a database?
This lecture: Problems with “raw” data; Introduction to Pandas

3 Real Data are often Imperfect
Garbage In, Garbage Out

4 Data Quality Issues
What are the data quality issues, and how do they arise in your data? Noise; Outliers; Missing values; Duplicate data
How do these issues affect data analysis?
How should we address the issues?

5 Noise
Noise refers to incorrect or modified values of the data
E.g., distortion of a person’s voice when talking on a poor phone connection
[Audio clips: original audio and noisy audio]

6 Example 1: Social Media Data
Using Twitter data for disaster management
The web site shows some of the tweets containing the word “wildfire”
Noise is due to typos, abbreviations, etc., and to data collection errors (e.g., a tweet about a song called Wildfire)

7 Example 2: Lake Nutrients Data
Measurements of total phosphorus in lakes from 17 states in the United States
Noise is due to human entry mistakes and to errors when combining data from multiple sources (different scales of measurement, incorrect data columns, etc.)

8 How To Deal with Noise?
It depends on the type of data
For time series (e.g., audio, sensors, finance, etc.), low-pass filters are often used
For document data, one can use software (e.g., spell checkers, abbreviation expansion, etc.) or a lookup dictionary
Alternatively, deal with the noise in your analysis, e.g., use probability models to capture the uncertainties (noise) in the data
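As a rough illustration of the low-pass filtering idea, a centered moving average is one of the simplest such filters. The sketch below uses made-up sensor readings, and the window size of 3 is an arbitrary assumption rather than a recommendation from the lecture.

```python
import pandas as pd

# Hypothetical noisy sensor readings (made-up values); 30.0 is a noise spike.
noisy = pd.Series([10.0, 12.0, 9.0, 11.0, 30.0, 10.0, 11.0, 9.0])

# A centered 3-point moving average acts as a simple low-pass filter:
# each value is replaced by the mean of itself and its two neighbours.
smoothed = noisy.rolling(window=3, center=True).mean()

# The first and last entries are NaN because their windows are incomplete.
print(smoothed)
```

A wider window smooths more aggressively but also blurs genuine rapid changes in the signal.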

9 Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
Outliers may inadvertently increase the correlation of the data
[Figure: scatter plots with and without an outlier, each labeled with its correlation]
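The effect of a single outlier on correlation can be checked numerically. The following is a small sketch with made-up points (not the data behind the slide's figure): five weakly related points, then the same points plus one extreme value.

```python
import pandas as pd

# Five made-up points with a weak (slightly negative) relationship.
x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = pd.Series([3, 1, 2, 3, 1], dtype=float)
corr_without = x.corr(y)  # Pearson correlation

# Append one extreme point (20, 20) to both variables.
x_out = pd.concat([x, pd.Series([20.0])], ignore_index=True)
y_out = pd.concat([y, pd.Series([20.0])], ignore_index=True)
corr_with = x_out.corr(y_out)

print(corr_without, corr_with)
```

With these values the correlation moves from negative to strongly positive once the extreme point is included.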

10 Outliers versus Noise
Outliers are legitimate data values
Unlike noise, they may represent interesting events or characteristics of the data
We will discuss techniques for detecting outliers later in the semester

11 Missing Values
Reasons for missing values:
Information is not collected (e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values:
Eliminate data objects with missing values
Estimate the missing values (imputation methods)
Ignore the missing values during analysis
Replace with all possible values (weighted by their probabilities)

12 Duplicate Data
A data set may include data objects that are duplicates
This is a major problem when merging data from multiple sources
Examples:
Product table (shown on the slide)
Paper citations:
L. Breiman, L. Friedman, and P. Stone, (1984) Classication and Regression. Wadsworth, Belmont, CA.
Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole, 1984.
Deduplication/Entity Resolution: the process of dealing with duplicate data issues

13 Deduplication Methods
Distance-based (for field matching)
Example: edit distance (number of insertions, deletions, and substitutions needed to convert one string into another)
Other measures: affine gap, Smith-Waterman, etc.
Clustering-based
Extends distance/similarity-based matching beyond pairwise comparison
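To make the edit distance concrete, here is a minimal dynamic-programming sketch of the standard Levenshtein distance. The function name `edit_distance` is chosen for illustration; the lecture does not prescribe a particular implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# The misspelled title differs from the correct one by two missing letters.
print(edit_distance("Classication", "Classification"))
```

For field matching, a small distance relative to the string length suggests that two fields may refer to the same entity; the threshold to use is a judgment call that depends on the data.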

14 Python Pandas
An important Python library for data analysis
Primary data structures:
Series: a one-dimensional array-like object
DataFrame: a tabular, spreadsheet-like data structure containing a collection of ordered columns, each of which can be a different type (numeric, string, boolean, etc.)
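A minimal sketch of a Series, assuming a hypothetical set of daily stock prices indexed by date strings (the values are made up):

```python
import pandas as pd

# Hypothetical daily closing prices; the index labels are date strings.
prices = pd.Series([31.5, 32.1, 32.8, 31.9],
                   index=["1/9/2017", "1/10/2017", "1/11/2017", "1/12/2017"])

print(prices.values)        # the underlying one-dimensional array of values
print(prices.index)         # the index labels
print(prices["1/10/2017"])  # look up a value by its index label
```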

15 Series
[Figure: a Series with its values and its index labeled]

16 Series

17 Series

18 Series
Size of vector (number of elements)
Size of matrix (number of rows & columns)
Find the number of days on which the stock price is higher than 32
Count the number of non-NA and non-NULL values
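These operations plausibly correspond to the following pandas calls. This is a sketch on made-up prices, with one missing value included to show the difference between `size` and `count()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stock prices with one missing entry.
prices = pd.Series([31.5, 32.1, np.nan, 32.8, 31.9])

n_elements = prices.size           # number of elements, missing ones included
dims = prices.shape                # tuple of dimensions; one entry for a Series
n_above_32 = (prices > 32).sum()   # comparisons with NaN evaluate to False
n_present = prices.count()         # number of non-missing values

print(n_elements, dims, n_above_32, n_present)
```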

19 Series
Create a new Series that contains the changes in the stock price
Check whether the stock price change is more than $0.50
Count the number of days on which the stock price has changed by more than $0.50
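One way to express these three steps is sketched below on made-up prices. It assumes that "changed by" means the absolute day-to-day change, so drops count as well as rises; drop the `abs()` call if only increases are of interest.

```python
import pandas as pd

# Hypothetical daily prices.
prices = pd.Series([31.5, 32.1, 32.8, 31.9, 32.0])

# Day-to-day change; the first entry is NaN because it has no previous day.
change = prices.diff()

# Boolean Series: True where the price moved by more than $0.50 in either direction.
big_move = change.abs() > 0.50

# Summing a boolean Series counts the True entries.
n_big_moves = big_move.sum()
print(n_big_moves)
```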

20 Series: Plotting and Aggregates

21 Series: Identifying Outliers

22 Series: Identifying Outliers
Assume the data come from a normal (Gaussian) probability distribution
The further a value is from the center, the more likely it is to be an outlier
Calculate the Z-score of the variable X:
$$Z = \frac{X - \mathrm{mean}(X)}{\mathrm{std\_dev}(X)}$$
The Z-score tells us how far we are from the center of the distribution of X

23 Series: Identifying Outliers
Find the days on which the stock price is more than 2 standard deviations away from its mean value
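A minimal sketch of this check, using made-up prices in which the last value is unusually high. Note that `Series.std()` uses the sample standard deviation (divides by n - 1) by default.

```python
import pandas as pd

# Hypothetical prices; the final value is far from the rest.
prices = pd.Series([30, 31, 30, 32, 31, 30, 31, 32, 30, 45], dtype=float)

# Z-score: distance from the mean in units of standard deviation.
z = (prices - prices.mean()) / prices.std()

# Keep only the entries more than 2 standard deviations from the mean.
outliers = prices[z.abs() > 2]
print(outliers)
```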

24 Series: Dealing with Missing Values
Set the price for 1/12/2017 to a missing value
Count the number of non-missing values
Check whether each value is missing
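These steps could look as follows. The prices and the other dates are made up; only the 1/12/2017 label comes from the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical prices indexed by date strings.
prices = pd.Series([31.5, 32.1, 32.8, 31.9],
                   index=["1/9/2017", "1/10/2017", "1/11/2017", "1/12/2017"])

prices["1/12/2017"] = np.nan   # mark this day's price as missing
n_present = prices.count()     # count() skips missing values
is_missing = prices.isnull()   # boolean Series, True where the value is missing

print(n_present)
print(is_missing)
```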

25 Series: Dealing with Missing Values
Discard data with missing values
A complete list of the methods available for a Pandas Series is given in the pandas documentation
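Discarding missing entries is typically done with `dropna()`. A short sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing entry.
prices = pd.Series([31.5, np.nan, 32.8])

# dropna() returns a new Series without the missing entries;
# the original Series is left unchanged and the surviving index labels are kept.
cleaned = prices.dropna()
print(cleaned)
```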

26 DataFrame
You can consider the data frame as a dictionary that contains 3 Series: Age, Height, and Name
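A sketch of building such a data frame from a dictionary. The column names follow the slide, while the rows are made up.

```python
import pandas as pd

# Each dictionary key becomes a column; the values are hypothetical.
people = pd.DataFrame({
    "Age": [25, 32, 41],
    "Height": [5.5, 6.0, 5.8],   # assumed to be in feet
    "Name": ["Ann", "Bob", "Cho"],
})

print(people)
print(type(people["Age"]))  # each column is itself a Series
```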

27 DataFrame: Selection and Indexing
The Age column is a Series object, indexed by the row index
Each row is a Series object whose values are indexed by column name
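Both views can be inspected directly. The sketch below uses a made-up frame with the slide's column names and the default integer row index.

```python
import pandas as pd

people = pd.DataFrame({
    "Age": [25, 32, 41],
    "Height": [5.5, 6.0, 5.8],
    "Name": ["Ann", "Bob", "Cho"],
})  # hypothetical rows

ages = people["Age"]   # a column: Series indexed by the row index
row = people.iloc[1]   # a row (by position): Series indexed by column name

print(ages.index)      # row labels 0, 1, 2
print(row.index)       # column names
print(row["Name"])
```

`iloc` selects by position; `loc` would select by row label instead, which matters once the row index is not the default 0, 1, 2, ...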

28 DataFrame: Selection and Indexing

29 DataFrame: Selection and Indexing
Column index, row index, and values of the data frame
Size of the data matrix (number of elements)
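The attributes that expose these pieces are sketched below on a made-up frame.

```python
import pandas as pd

people = pd.DataFrame({
    "Age": [25, 32, 41],
    "Height": [5.5, 6.0, 5.8],
    "Name": ["Ann", "Bob", "Cho"],
})  # hypothetical rows

col_index = people.columns   # column index
row_index = people.index     # row index
values = people.values       # the data as a 2-D NumPy array
n_elements = people.size     # rows times columns

print(col_index, row_index, n_elements)
print(values)
```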

30 DataFrame: Sizing and Transposing
Transpose (T) operation: Flip the rows and columns of the data frame
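A short sketch of the shape before and after transposing, using a made-up frame with three rows and two columns:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 32, 41], "Height": [5.5, 6.0, 5.8]})  # hypothetical

print(df.shape)      # (number of rows, number of columns)
transposed = df.T    # the former columns become rows and vice versa
print(transposed.shape)
print(transposed)
```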

31 DataFrame: Transformation
Converting height from feet into meters
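A column can be transformed with ordinary arithmetic, which is applied element by element. The sketch assumes a `Height` column in feet with made-up values, and uses the exact conversion 1 foot = 0.3048 m.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Ann", "Bob"], "Height": [5.5, 6.0]})  # hypothetical, feet

FEET_TO_METERS = 0.3048  # exact by definition of the international foot

# Multiplying a column by a scalar multiplies every element.
people["Height"] = people["Height"] * FEET_TO_METERS
print(people)
```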

32 DataFrame: Aggregate Function

33 DataFrame: Group By

34 DataFrame: Describe

35 DataFrame: Sorting

36 DataFrame: Standardization
Convert each numeric attribute into a Z-score
We can use the standardized values to detect outliers
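Standardizing all numeric columns at once can be written as a single expression, because pandas aligns the per-column means and standard deviations with the columns of the frame. A sketch on made-up numeric data:

```python
import pandas as pd

# Hypothetical frame containing only numeric columns.
df = pd.DataFrame({"Age": [20, 30, 40], "Height": [1.6, 1.7, 1.8]})

# Subtract each column's mean and divide by its (sample) standard deviation.
z = (df - df.mean()) / df.std()
print(z)
```

If the frame also contains non-numeric columns such as names, select the numeric ones first, for example with `df.select_dtypes("number")`.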

37 DataFrame: Missing Values

38 DataFrame: Missing Values

39 DataFrame: Missing Values

40 DataFrame: Duplicate Detection
Create a duplicate row
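A possible way to create a duplicate row and then detect and remove it is sketched below. The frame is made up, and appending via `pd.concat` is just one of several ways to add a row.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [25, 32]})  # hypothetical

# Create a duplicate by appending a copy of the first row.
with_dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# duplicated() marks rows that repeat an earlier row.
dup_mask = with_dup.duplicated()
print(dup_mask)

# drop_duplicates() keeps the first occurrence of each row.
deduped = with_dup.drop_duplicates()
print(deduped)
```

This only catches exact duplicates; near-duplicates such as the two citations on the Duplicate Data slide need a similarity measure instead.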

41 Example: 2016 Campaign Donations
Data are available from the Federal Election Commission website
In this example, we focus on the 2016 Michigan data

42 Example

43 Example

44 Example
This will iterate through the columns and display, for each one, its name, its number of missing values, and its type of values
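A loop of this kind might look like the sketch below. The tiny frame and its column names (`donor`, `amount`) are placeholders, not the actual columns of the Federal Election Commission file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the donation data.
df = pd.DataFrame({"donor": ["A", None, "C"],
                   "amount": [50.0, -25.0, np.nan]})

summary = []
for col in df.columns:
    n_missing = int(df[col].isnull().sum())  # number of missing values in this column
    dtype = str(df[col].dtype)               # type of the values stored in this column
    summary.append((col, n_missing, dtype))
    print(col, n_missing, dtype)
```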

45 Example
Some campaign contribution amounts are negative!
These correspond to refunds
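Negative amounts can be isolated with a boolean filter. In this sketch the column name `amount` and the rows are hypothetical; substitute the name of the amount column in the actual file.

```python
import pandas as pd

# Hypothetical donations; a negative amount represents a refund.
donations = pd.DataFrame({"donor": ["A", "B", "C"],
                          "amount": [100.0, -50.0, 25.0]})

# Keep only the rows whose amount is below zero.
refunds = donations[donations["amount"] < 0]
print(refunds)
```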

46 Example

47 Summary
In this lecture, we reviewed: Data quality issues; Python pandas library
Next lecture: Data preprocessing

