Lecture 6: Data Quality and Pandas

Presentation transcript:

CSE 482: Big Data Analysis Lecture 6: Data Quality and Pandas

Outline So far, we have discussed: How to collect data? How to store data in a database? This lecture: Problems with “raw” data; Introduction to Pandas

Real Data are often Imperfect Garbage In, Garbage Out Image source: http://www.lovemytool.com/.a/6a00e008d957708834013484842699970c-pi

Data Quality Issues What are the data quality issues and how do they arise in your data? Noise Outliers Missing values Duplicate data How do these issues affect data analysis? How should we address the issues?

Noise Noise refers to incorrect/modified values of the data. E.g., distortion of a person’s voice when talking on a poor phone (noisy audio versus original audio). Audio source: http://www.nasa.gov/mission_pages/apollo/apollo11_audio.html

Example 1: Social Media Data Using Twitter data for disaster management. Web site shows some of the tweets containing the word “wildfire”. Noise due to: typos, abbreviations, etc.; data collection errors (e.g., a tweet about a song called Wildfire)

Example 2: Lake Nutrients Data Measurements of total phosphorus in lakes from 17 states in the United States. Noise due to: human entry mistakes; errors when combining data from multiple sources (different scales of measurement, incorrect data columns, etc.)

How To Deal with Noise? Depends on the type of data. For time series (e.g., audio, sensors, finance, etc.), low-pass filters are often used. For document data, can use software (e.g., spell checkers, abbreviation expansion, etc.) or a lookup dictionary. Deal with noise in your analysis, e.g., use probability models to capture uncertainties (noise) in the data

Outliers Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set. Outliers may inadvertently increase the correlation of the data (figure: correlation = 0.6278 in the plot with the labeled outlier, versus correlation = 0.0628)

Outliers versus Noise Outliers are legitimate data values. Unlike noise, they may represent interesting events or characteristics of the data. We will discuss techniques for detecting outliers in more detail later in the semester

Missing Values Reasons for missing values: Information is not collected (e.g., people decline to give their age and weight); Attributes may not be applicable to all cases (e.g., annual income is not applicable to children). Handling missing values: Eliminate data objects with missing values; Estimate the missing values (imputation methods); Ignore the missing value during analysis; Replace with all possible values (weighted by their probabilities)

Duplicate Data Data set may include data objects that are duplicates Major problem when merging data from multiple sources Examples: Product table: Paper citations: L. Breiman, L. Friedman, and P. Stone, (1984). Classication and Regression. Wadsworth, Belmont, CA. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole, 1984. Deduplication/Entity Resolution Process of dealing with duplicate data issues

Deduplication Methods Distance-based (for field matching) Example: edit distance (# of inserts, deletes and substitutions needed to convert one string into another) Other measures: affine gap, Smith-Waterman, etc Clustering-based Extends distance/similarity-based beyond pairwise comparison
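
The edit distance mentioned above can be sketched in a few lines of Python. This is a minimal illustration of the standard Levenshtein dynamic program, not code from the lecture; the function name `edit_distance` and the test strings are chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, and
    substitutions needed to convert string a into string b."""
    # prev[j] = distance between the current prefix of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of a into "" takes i deletes
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


d1 = edit_distance("kitten", "sitting")
d2 = edit_distance("Classication", "Classification")
print(d1, d2)
```

A small distance relative to the string length suggests that two field values may refer to the same entity; the threshold to use is a judgment call that depends on the data.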

Python Pandas An important Python library for data analysis Primary data structures Series A one-dimensional array-like object DataFrame A tabular, spreadsheet-like data structure containing a collection of ordered columns, each of which can be a different type (numeric, string, boolean, etc)
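
As a minimal sketch of the Series structure described above, the snippet below builds a small Series of stock prices. The prices and dates are made-up values for illustration, not the data used on the slides.

```python
import pandas as pd

# Hypothetical daily closing prices, indexed by date string
prices = pd.Series([31.5, 32.4, 33.1],
                   index=["1/10/2017", "1/11/2017", "1/12/2017"])

values = prices.values           # the underlying array of values
index = prices.index             # the index labels
one_price = prices["1/11/2017"]  # label-based lookup of a single value

print(prices)
```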

Series Value Index

Series

Series

Series Size of vector (number of elements) Size of matrix (number of rows & columns) Find number of days in which stock price is higher than 32 Count number of non-NA and non-NULL values
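
The operations listed on this slide might look as follows. This is a sketch on made-up prices (one of them missing), since the slide's own code was not preserved in the transcript.

```python
import numpy as np
import pandas as pd

# Hypothetical stock prices; one value is missing
prices = pd.Series([31.5, 32.4, np.nan, 33.1, 30.9])

n_elements = prices.size                   # size of vector (number of elements)
shape = prices.shape                       # tuple of dimensions; one entry for a Series
days_above_32 = int((prices > 32).sum())   # comparisons with NaN evaluate to False
n_valid = int(prices.count())              # number of non-missing values

print(n_elements, shape, days_above_32, n_valid)
```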

Series Create a new Series that contains changes in the stock price Check whether stock price change is more than $0.50 Count the number of days in which stock price has changed by > $0.50
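
One possible way to carry out these three steps is sketched below on made-up prices. It assumes that "changed by > $0.50" refers to the absolute change (up or down); drop the `abs()` call if only increases should count.

```python
import pandas as pd

# Hypothetical stock prices on consecutive days
prices = pd.Series([31.5, 32.4, 32.6, 31.8, 32.0])

change = prices.diff()           # day-to-day change; the first entry is NaN
big_move = change.abs() > 0.5    # assumption: absolute change, either direction
n_big = int(big_move.sum())      # number of days with a change larger than $0.50

print(n_big)
```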

Series: Plotting and Aggregates

Series: Identifying Outliers

Series: Identifying Outliers Assuming the data come from a normal (Gaussian) probability distribution, the further a value is from the center, the more likely it is an outlier. Calculate the Z-score of the variable X: $Z = \frac{X - \mathrm{mean}(X)}{\mathrm{std\_dev}(X)}$. The Z-score tells us how far we are from the center of the distribution for X, in units of standard deviations

Series: Identifying Outliers Find the days in which stock price is more than 2 standard deviations away from their mean value
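
A minimal sketch of this check, using made-up prices in which the last value is far from the rest. Note that pandas' `std()` computes the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical prices; the last value is unusually high
prices = pd.Series([30.0, 31.0, 29.5, 30.5, 30.0, 31.0, 29.0, 30.5, 30.0, 45.0])

# Z-score: distance from the mean in units of standard deviations
z = (prices - prices.mean()) / prices.std()

# Keep the days whose price is more than 2 standard deviations from the mean
outliers = prices[z.abs() > 2]
print(outliers)
```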

Series: Dealing with Missing Values Set price for 1/12/2017 to missing value Count number of non-missing values Check whether value is missing
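
These steps can be sketched as follows; the prices are made up, and only the date 1/12/2017 is taken from the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical prices indexed by date
prices = pd.Series([31.5, 32.4, 33.1],
                   index=["1/10/2017", "1/11/2017", "1/12/2017"])

prices["1/12/2017"] = np.nan    # set the price for 1/12/2017 to a missing value
n_valid = int(prices.count())   # count the number of non-missing values
is_missing = prices.isnull()    # boolean Series: True where the value is missing

print(n_valid)
print(is_missing)
```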

Series: Dealing with Missing Values Discard data with missing values For a complete list of methods available for Pandas Series, go to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
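
Discarding missing values is typically done with `dropna()`. A short sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value
prices = pd.Series([31.5, np.nan, 33.1],
                   index=["1/10/2017", "1/11/2017", "1/12/2017"])

cleaned = prices.dropna()   # returns a new Series; the original is left unchanged
print(cleaned)
```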

DataFrame You can consider the data frame as a dictionary that contains 3 Series: Age, Height, and Name
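
The dictionary view can be made concrete by building a data frame from three Series. The column names come from the slide; the people and their values are invented for illustration, and heights are assumed to be in feet.

```python
import pandas as pd

# Three Series, one per column; the values are made up
age = pd.Series([34, 27, 45])
height = pd.Series([5.4, 6.1, 5.8])       # assumed to be in feet
name = pd.Series(["Ann", "Bob", "Cho"])

# A DataFrame behaves like a dictionary mapping column names to Series
people = pd.DataFrame({"Age": age, "Height": height, "Name": name})
print(people)
```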

DataFrame: Selection and Indexing The Age column is a Series object, Indexed by the row index Each row is a Series object, Values are indexed by column name
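
A short sketch of both kinds of selection, on a made-up data frame with the same three columns:

```python
import pandas as pd

# Hypothetical data frame
people = pd.DataFrame({"Age": [34, 27, 45],
                       "Height": [5.4, 6.1, 5.8],
                       "Name": ["Ann", "Bob", "Cho"]})

age = people["Age"]          # a column is a Series indexed by the row index
first_row = people.iloc[0]   # a row is a Series indexed by the column names

print(age.index.tolist())        # row labels
print(first_row.index.tolist())  # column names
```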

DataFrame: Selection and Indexing

DataFrame: Selection and Indexing Column index; Values; Row index; Size of data matrix (number of elements)

DataFrame: Sizing and Transposing Transpose (T) operation: Flip the rows and columns of the data frame
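
The index, sizing, and transpose attributes can be sketched together. The data frame and its row labels below are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with 3 rows and 2 columns
people = pd.DataFrame({"Age": [34, 27, 45],
                       "Height": [5.4, 6.1, 5.8]},
                      index=["Ann", "Bob", "Cho"])

cols = list(people.columns)   # column index
rows = list(people.index)     # row index
n_elements = people.size      # number of elements (rows * columns)
shape = people.shape          # (number of rows, number of columns)
flipped = people.T            # transpose: rows become columns and vice versa

print(shape, flipped.shape)
```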

DataFrame: Transformation Converting height from feet into meters
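
A minimal sketch of such a transformation, assuming a `Height` column in feet and using the exact conversion 1 ft = 0.3048 m. The new column name `Height_m` is chosen here for illustration.

```python
import pandas as pd

FEET_TO_METERS = 0.3048  # exact definition of the international foot

# Hypothetical heights in feet
people = pd.DataFrame({"Name": ["Ann", "Bob"], "Height": [6.0, 5.5]})

# Arithmetic on a column is applied element-wise
people["Height_m"] = people["Height"] * FEET_TO_METERS
print(people)
```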

DataFrame: Aggregate Function

DataFrame: Group By

DataFrame: Describe

DataFrame: Sorting

DataFrame: Standardization Convert each numeric attribute into Z-score: We can use the standardized values to detect outliers
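
Standardizing every numeric column might be sketched as below. The data are made up, and `select_dtypes` is used here only to skip the non-numeric `Name` column.

```python
import pandas as pd

# Hypothetical data frame with one non-numeric column
people = pd.DataFrame({"Name": ["Ann", "Bob", "Cho", "Dee"],
                       "Age": [34, 27, 45, 30],
                       "Height": [5.4, 6.1, 5.8, 5.5]})

numeric = people.select_dtypes(include="number")

# Column-wise Z-score: subtract each column's mean, divide by its standard deviation
standardized = (numeric - numeric.mean()) / numeric.std()
print(standardized)
```

After this step each numeric column has mean 0 and standard deviation 1, so a rule such as "absolute value above 2" applies to every column on the same scale.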

DataFrame: Missing Values

DataFrame: Missing Values

DataFrame: Missing Values

DataFrame: Duplicate Detection Create a duplicate row
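
One way to create a duplicate row and then detect it is sketched here on made-up data. Appending the row with `pd.concat` is a choice made for this sketch, not necessarily what the slide did.

```python
import pandas as pd

# Hypothetical data frame
people = pd.DataFrame({"Name": ["Ann", "Bob", "Cho"],
                       "Age": [34, 27, 45]})

# Create a duplicate row by appending a copy of the first row
with_dup = pd.concat([people, people.iloc[[0]]], ignore_index=True)

dup_mask = with_dup.duplicated()       # True for rows that repeat an earlier row
deduped = with_dup.drop_duplicates()   # keeps the first occurrence of each row

print(dup_mask.tolist())
```

Note that `duplicated()` only finds exact matches; near-duplicates such as the two citations shown earlier need the distance-based or clustering-based methods.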

Example: 2016 Campaign Donation Data are available from the Federal Election Commission website http://www.fec.gov/disclosurep/PDownload.do In this example, we focus on 2016 Michigan data

Example

Example

Example This will iterate through the columns and display their names, # missing values, types of values
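
The loop described here was not preserved in the transcript. A sketch of what such a loop could look like is given below, on a tiny made-up data frame; the column names `contributor` and `amount` are placeholders, not the actual FEC column names.

```python
import pandas as pd

# Hypothetical stand-in for the donation data
donations = pd.DataFrame({"contributor": ["A", "B", None],
                          "amount": [50.0, -25.0, 100.0]})

summary = []
for col in donations.columns:
    n_missing = int(donations[col].isnull().sum())   # number of missing values
    dtype = str(donations[col].dtype)                # type of the values
    summary.append((col, n_missing, dtype))
    print(col, n_missing, dtype)
```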

Example Some campaign contribution amounts are negative valued! This corresponds to refunds
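
The negative amounts can be isolated with a boolean filter. The sketch below uses made-up values and a placeholder column name `amount`; the real file uses its own column names.

```python
import pandas as pd

# Hypothetical contribution amounts; negative values represent refunds
donations = pd.DataFrame({"amount": [50.0, -25.0, 100.0, -10.0]})

refunds = donations[donations["amount"] < 0]   # rows with negative amounts
n_refunds = len(refunds)
total_refunded = refunds["amount"].sum()

print(n_refunds, total_refunded)
```

Whether to drop these rows or keep them depends on the question being asked: totals per candidate should include refunds, while counts of donations probably should not.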

Example

Summary In this lecture, we reviewed: Data quality issues; Python pandas library. Next lecture: Data preprocessing