Preparing your Data using Python Samuel G. Mori, CISA Managing Partner, Analytics & Advisory Services Spyrion LLC April 12, 2018
Background Samuel G. Mori, CISA, Six Sigma Green Belt Managing Partner, Analytics and Advisory Services Software Quality Assurance, Internal/External Audit, Business Intelligence and Reporting, Advisory Services (GRC and Analytics) Subject matter expertise within commercial, manufacturing, healthcare, biomedical and entertainment sectors B.S. Cognitive Science – Human Computer Interaction (UC San Diego) M.S. Accountancy – Accounting Information Systems (San Diego State) M.S. Data Science – Analytics & Modeling (Northwestern)
Agenda Learning Objectives Why should I prepare my data? What types of data might I encounter? How can Python help me?
Learning Objectives Understand the importance of preparing your data for analysis Understand different types of data formats you may encounter Understand what Python is and why you should use it Understand strategies and techniques for importing, preparing, and saving your data using Python
Why should I prepare my data? Garbage in, garbage out Reduce errors Remove duplicate records Fix missing values Correct range values Fix formatting (e.g., date, text, number)
Experience Check How many people have experience with Python? What types of data formats do you use in your organizations? CSV, Excel, PDF, JSON, XML, SQL databases, etc. What types of tools do you use? Excel, ACL, IDEA, SQL Server, Python, R, SAS, Cognos, etc.
What types of data formats might I encounter? Comma Separated Value (CSV) Excel JavaScript Object Notation (JSON) Structured Query Language (SQL) And more! Python can help with these!
CSV Example SFO Airport Survey Results
Excel Example SFO Airport Survey Results
JSON Examples Trip Advisor JSON file Yelp JSON file
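A minimal sketch of pulling JSON records into a dataframe with pandas. The records and field names here are invented for illustration and are not taken from the Trip Advisor or Yelp files; it also assumes one JSON object per line, which may not match every file.

```python
import io
import pandas as pd

# Hypothetical review records, one JSON object per line.
raw = io.StringIO(
    '{"business": "Cafe A", "stars": 4, "text": "Good coffee"}\n'
    '{"business": "Cafe B", "stars": 2, "text": "Slow service"}\n'
)

# lines=True tells pandas that each line is a separate JSON record.
reviews = pd.read_json(raw, lines=True)
print(reviews)
```

With a real file, the file path would take the place of the in-memory text. Nested JSON usually needs extra flattening before it fits a table.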
SQL Example Sample Customer Data
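As a rough illustration of reading SQL data with pandas, the sketch below builds a small in-memory SQLite table and queries it. The table name, columns, and rows are hypothetical stand-ins for the sample customer data; a real database would need its own connection details.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database with hypothetical customer rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", "San Diego"), (2, "Ben", "Chicago")],
)
conn.commit()

# Run a query and load the result set straight into a dataframe.
customers = pd.read_sql_query("SELECT id, name, city FROM customers", conn)
conn.close()
print(customers)
```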
What is Python? Definition Object-oriented, high-level programming language Used as a scripting or glue language to connect existing components together Simple, easy-to-learn syntax that emphasizes readability Supports modules and packages The Python interpreter and the extensive standard library are FREE!
What is Python? (cont.) Key Python Package: Pandas Open source library that allows you to work with CSV, Excel, JSON, and SQL database files, pull them into tables (called dataframes), and perform various data analysis techniques.
Coding Basics Some basic Python syntax to keep in mind:
Declaring a variable (the variable name always goes to the left of the equals sign)
File names are text strings (use a matching pair of straight quotes, either " " or ' ')
    dataframe = pd.read_excel('file_name.xlsx', 'sheet_name')
Or
    file_name = 'file_name.xlsx'
    sheet_name = 'sheet_name'
    dataframe = pd.read_excel(file_name, sheet_name)
Coding Basics (cont.) Some basic Python syntax to keep in mind:
Using library packages
    import pandas as pd  # load the pandas library and give it the short alias 'pd'
    dataframe = pd.read_excel('file_name.xlsx', 'sheet_name')
Or
    import pandas  # no alias, so the full library name is used
    dataframe = pandas.read_excel('file_name.xlsx', 'sheet_name')
Case Study SFO Airport Customer Survey Data – Excel & CSV files
Importing the Data How do I import an Excel file?
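One way to import an Excel worksheet is sketched below. The helper name `load_survey_excel` and the file and sheet names in the example call are hypothetical; reading .xlsx files also assumes an Excel engine such as openpyxl is installed alongside pandas.

```python
import pandas as pd


def load_survey_excel(path, sheet_name=0):
    """Read one worksheet of an Excel workbook into a dataframe."""
    # sheet_name accepts a sheet label or a zero-based position;
    # 0 means the first sheet in the workbook.
    return pd.read_excel(path, sheet_name=sheet_name)


# Example call, with a hypothetical file and sheet name:
# survey = load_survey_excel("sfo_survey.xlsx", sheet_name="Data")
# survey.head()  # shows the first five rows
```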
Data Characteristics What columns do we have?
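Listing the column names might look like the following. The small dataframe is a made-up stand-in for the survey data, so the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; real column names will differ.
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "airline": ["United", "Delta", "United", "Alaska"],
    "purpose": [1, 2, 1, 3],
    "rating": [5, 4, 3, 5],
})

# .columns holds the column labels; list() makes them easier to read.
column_names = list(survey.columns)
print(column_names)
```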
Data Characteristics What if I just want a subset of these columns?
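Selecting a subset of columns can be sketched as follows, again using invented column names rather than the real survey fields.

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "airline": ["United", "Delta", "United", "Alaska"],
    "rating": [5, 4, 3, 5],
})

# Passing a list of names inside the brackets returns only those columns.
subset = survey[["respondent_id", "rating"]]
print(subset)
```

The double brackets matter: the outer pair selects from the dataframe and the inner pair is the list of column names.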
Data Characteristics What columns do I have and what are their data types?
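A short sketch of checking column data types, on a hypothetical sample. How text columns are reported depends on the pandas version, so the exact labels printed may vary.

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "airline": ["United", "Delta", "Alaska"],
    "rating": [5, 4, 3],
})

# .dtypes lists each column alongside its data type.
print(survey.dtypes)

# .info() prints the column names, non-null counts, and types in one summary.
survey.info()
```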
Data Characteristics How many columns and records do I have? Can I do a count of different values within a column?
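Counting records, columns, and distinct values could look like this; the data is illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "airline": ["United", "Delta", "United", "Alaska"],
    "purpose": [1, 2, 1, 3],
    "rating": [5, 4, 3, 5],
})

# .shape is a (rows, columns) pair.
n_rows, n_cols = survey.shape
print(n_rows, n_cols)

# value_counts() tallies how often each distinct value appears in a column.
airline_counts = survey["airline"].value_counts()
print(airline_counts)
```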
Modifying Data Values Let's look at the data dictionary How do I replace values to make them meaningful?
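One possible approach to replacing coded values is to turn the data dictionary into a Python dictionary and map it onto the column. The codes and labels below are invented for illustration and do not come from the actual SFO data dictionary.

```python
import pandas as pd

# Hypothetical coded column; real codes come from the data dictionary.
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "purpose": [1, 2, 1, 3],
})

# Assumed code-to-label lookup, written out by hand from a data dictionary.
purpose_labels = {1: "Business", 2: "Leisure", 3: "Other"}

# map() swaps each code for its label; codes missing from the lookup become NaN.
survey["purpose"] = survey["purpose"].map(purpose_labels)
print(survey)
```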
Saving to Excel How do I save this new file? What does my file look like?
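Saving a dataframe back to Excel might be done as sketched here. The helper name `save_survey_excel` and the file name in the example call are hypothetical.

```python
import pandas as pd


def save_survey_excel(df, path, sheet_name="Sheet1"):
    """Write a dataframe to an Excel workbook without the row index."""
    # index=False keeps the 0, 1, 2, ... row labels out of the file.
    # Writing .xlsx files relies on an Excel engine such as openpyxl.
    df.to_excel(path, sheet_name=sheet_name, index=False)


# Example call, with a hypothetical output file name:
# save_survey_excel(survey, "survey_clean.xlsx", sheet_name="Cleaned")
```

To see what the saved file looks like, open it in Excel or read it back with `pd.read_excel`.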
Importing the Data How do I import a CSV file? What is NaN?
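A sketch of importing CSV data and spotting NaN values follows. NaN ("not a number") is the marker pandas uses for a missing value, such as an empty cell. The CSV text is a hypothetical sample held in memory so the example runs without a file.

```python
import io
import pandas as pd

# Hypothetical CSV content with two empty cells.
csv_text = io.StringIO(
    "respondent_id,airline,rating\n"
    "1,United,5\n"
    "2,,4\n"
    "3,Delta,\n"
)

# read_csv accepts a file path or, as here, an in-memory text buffer.
survey = pd.read_csv(csv_text)
print(survey)

# isna() flags missing cells; sum() counts them per column.
missing_per_column = survey.isna().sum()
print(missing_per_column)
```

A numeric column that contains NaN is stored as floating point, which is why the ratings print with decimals.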
Fixing Error Values How do I fix NaN values?
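Filling NaN values could be handled as below. The replacement values ("Unknown" and 0) are assumptions; the right choice depends on what a blank means in the data dictionary.

```python
import pandas as pd

# Hypothetical sample with one missing value in each column.
survey = pd.DataFrame({
    "airline": ["United", None, "Delta"],
    "rating": [5.0, 4.0, None],
})

# A dictionary passed to fillna() sets a separate fill value per column.
survey = survey.fillna({"airline": "Unknown", "rating": 0})
print(survey)
```

If rows with missing values should be discarded instead, `survey.dropna()` removes them.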
Adding Custom Columns What if I want to add the Year in a column?
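Two possible ways to add a year column are sketched here: assigning a constant, or deriving it from a date column. The column names, dates, and year are hypothetical, and the original case study may have used either approach.

```python
import pandas as pd

# Hypothetical sample; "survey_date" is an assumed column name.
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "survey_date": ["2017-05-01", "2017-05-02", "2017-05-02"],
})

# Option 1: the whole file covers one year, so assign the same value to every row.
survey["year"] = 2017

# Option 2: convert the text dates to datetimes and extract the year.
survey["year_from_date"] = pd.to_datetime(survey["survey_date"]).dt.year
print(survey)
```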
Identifying Value Ranges How do I look at the data value ranges for multiple columns?
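Checking value ranges across several columns at once might look like this, using made-up numeric columns.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative only.
survey = pd.DataFrame({
    "airline": ["United", "Delta", "United", "Alaska"],
    "rating": [5, 4, 3, 5],
    "wait_minutes": [10, 25, 5, 40],
})

numeric_columns = ["rating", "wait_minutes"]

# agg() with a list of function names returns one row per statistic.
ranges = survey[numeric_columns].agg(["min", "max"])
print(ranges)

# describe() adds count, mean, standard deviation, and quartiles.
summary = survey[numeric_columns].describe()
print(summary)
```

Values outside the range allowed by the data dictionary show up quickly in the min and max rows.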
Saving to CSV How do I save this new file? What does my file look like?
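A minimal sketch of saving to CSV and inspecting the result. The output file name is hypothetical, and the file is written to the system temporary folder so the example leaves the working directory untouched.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned data.
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "airline": ["United", "Delta", "Alaska"],
    "rating": [5, 4, 3],
})

# Assumed output location.
out_path = os.path.join(tempfile.gettempdir(), "survey_clean.csv")

# index=False keeps the row labels out of the file.
survey.to_csv(out_path, index=False)

# Print the raw text to see what the saved file looks like.
with open(out_path) as f:
    print(f.read())
```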
Appendix
Additional Information Python Development Environments Enthought Canopy https://www.enthought.com/product/canopy/ Anaconda/Spyder https://www.anaconda.com/download/ Python Libraries Pandas http://pandas.pydata.org/
Questions?