Data visualization and manipulation with OpenOffice Calc and Microsoft Excel
Agenda: About Excel/Calc; Spreadsheets; What is the structure of spreadsheets?; What can I do with them?; Advantages/Disadvantages; Key Features; How does it fit in with other tools?; Example: Import and analyze the Framingham data set
Consider the tasks you will need to perform: Data collection and editing? Data transformation/calculations? Basic vs. advanced statistics? Making figures? A single platform may not be enough, but use as few as possible.
Comparison table. Columns: Software | Mac | PC | Data entry/Forms | Data editing | Transform data | Basic stats | Adv. stats | Figures. Rows: REDCap (web; +++ - +), Epi Info (X; ++), MS Office/OpenOffice, R/RStudio
Basics about Excel/Calc: Microsoft Excel is a standard part of MS Office. Calc (https://www.openoffice.org) is similar to, and compatible with, Microsoft Excel. OpenOffice vs. Excel: OpenOffice is free (including upgrades) vs. $90-300 for Excel; OpenOffice allows unlimited licenses; OpenOffice supports more file formats; Excel is more widely used; Excel has more direct customer support.
Spreadsheets: Can inspect, modify, and analyze data in table form. Can convert data between different formats (.xls, .csv, .txt, .tab, and more). Intuitive interface. Little to no programming for most tasks. Can be used to enter data, though this is not recommended: security is poor, and it is easy to lose track and end up with conflicting data.
Basic spreadsheet
Spreadsheet structure: Sheet = rows (1 to …) and columns (A to …). A cell is identified by its column letter and row number, e.g. A1, E5, F10. Can specify a range of cells (A1:A5, A2:C10). Can use cell locations to perform calculations, e.g. enter =A5+B5 in cell C5. Multiple sheets together make a workbook.
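To make the cell and range notation above concrete, here is a minimal Python sketch that turns a reference such as E5 into row and column numbers and counts the cells in a range. The function names parse_cell and range_size are made up for this illustration and are not part of Excel or Calc.

```python
import re

def parse_cell(ref):
    """Convert a cell reference such as 'E5' to (row, column), both starting at 1."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for ch in letters.upper():
        # Column letters behave like base-26 digits: A=1 ... Z=26, AA=27
        column = column * 26 + (ord(ch) - ord("A") + 1)
    return int(digits), column

def range_size(ref):
    """Return (rows, columns) covered by a range such as 'A2:C10'."""
    start, end = ref.split(":")
    r1, c1 = parse_cell(start)
    r2, c2 = parse_cell(end)
    return abs(r2 - r1) + 1, abs(c2 - c1) + 1

print(parse_cell("E5"))      # (5, 5)
print(range_size("A2:C10"))  # (9, 3)
```

The range A2:C10 therefore spans 9 rows and 3 columns, i.e. 27 cells.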
What can I do with spreadsheets? Select, copy, and paste rows or columns between data sets. Visual filtering and sorting. Link tables together using point-and-click. Create expressions and formulas using drag-and-drop with visual selection of the desired cells. Can write more complex expressions, but with less flexibility than other programs.
What statistics can I do with Excel? Descriptive statistics: mean/median/mode, distributions. Tables (including cross-tables). Charts/figures. Comparisons: chi-square, t-test.
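As an illustration of the chi-square comparison listed above, the following Python sketch computes the Pearson chi-square statistic for a 2x2 cross-table. The counts are invented for the example (not Framingham results), no continuity correction is applied, and the p-value shortcut used here holds only for 1 degree of freedom.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Made-up counts: rows = exposure groups, columns = outcome yes/no
counts = [[20, 30], [30, 20]]
statistic, p_value = chi_square_2x2(counts)
print(round(statistic, 3), round(p_value, 4))
```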
Why not use Excel as a database? Data integrity is limited: limited field validation; multiple file versions (can't tell which data is correct!); improper sorting can lead to data issues. Limited rows and columns (about 1 million rows; 1,024 columns in Calc, 16,384 in Excel), which matters for translational data (e.g. genomics). Limited security: access (anyone can see your data); auditing (you can't tell who accessed or edited); privileges (no way to limit what people can do to data).
Key features: It has a simple interface. Rows and columns are easily manipulated. It is user friendly, with no programming needed for basic tasks. It is compatible with other data analysis applications. Its statistical analysis capacity is limited for large datasets.
Data Analysis Toolpak - PC
Data Analysis Toolpak - Mac
Importing data - Excel DEMO
Importing data - Calc DEMO
Filtering: The Filter function lets you show the data you want and hide the rest of the spreadsheet. AutoFilter offers three types of filter: by list values, by format, and by criteria. A drop-down arrow means that filtering is enabled but not applied. A Filter button means that a filter is applied.
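For comparison with the point-and-click filter, here is a minimal Python sketch of filtering by criteria: rows that match are shown, and the rest are hidden rather than deleted. The rows and the filter_rows helper are made up for this illustration.

```python
# Made-up rows standing in for a few spreadsheet records
rows = [
    {"id": 1, "systolic_bp": 118, "hypertension": 0},
    {"id": 2, "systolic_bp": 152, "hypertension": 1},
    {"id": 3, "systolic_bp": 141, "hypertension": 1},
    {"id": 4, "systolic_bp": 125, "hypertension": 0},
]

def filter_rows(rows, criterion):
    """Return only the rows for which criterion(row) is true; the input is unchanged."""
    return [row for row in rows if criterion(row)]

# Criteria filter: keep rows with systolic BP of at least 140
high_bp = filter_rows(rows, lambda row: row["systolic_bp"] >= 140)

# List-value filter: keep rows whose hypertension value is in a chosen set
no_history = filter_rows(rows, lambda row: row["hypertension"] in {0})

print([row["id"] for row in high_bp])     # [2, 3]
print([row["id"] for row in no_history])  # [1, 4]
```

As in a spreadsheet filter, the original rows are left intact and only the view changes.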
Filtering - Excel
Filtering - Calc
Descriptive statistics: Describe your data using sample statistics, which can then be used to infer population parameters. Examples: measures of location such as mean, median, and mode (quantitative data) or percentiles and proportions (qualitative data); measures of dispersion and peakedness such as standard deviation, variance, standard error, skewness, and kurtosis. Demonstrations using Framingham data.
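The measures of location and dispersion named above can be sketched with Python's standard statistics module. The values are made up for the example (not taken from the Framingham data), and the standard error is computed as the sample standard deviation divided by the square root of n.

```python
import math
import statistics

# Made-up systolic BP values for illustration only
values = [110, 120, 120, 130, 140]

mean = statistics.mean(values)          # measure of location
median = statistics.median(values)      # middle value
mode = statistics.mode(values)          # most frequent value
variance = statistics.variance(values)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)      # sample standard deviation
std_error = std_dev / math.sqrt(len(values))  # standard error of the mean

print(mean, median, mode, variance)
print(round(std_dev, 2), round(std_error, 2))
```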
Rank and percentile
Derived variables - Excel
Derived variables - Calc
Creating figures - Excel
Creating Figures - Calc
Generating pie charts - Excel
Generating pie charts - Calc
Generating histograms
Generating tables - Excel
Generating tables - Calc
One-sided t-test
Guided Practice: Framingham. Import the Framingham practice data CSV file from REDCap. Find mean systolic and diastolic BP (systolic_bp and diastolic_bp) as sum / number of measurements. Make a derived variable: systolic (systolic_bp) minus diastolic blood pressure (diastolic_bp). Make a scatter plot of weight vs. systolic BP. DEMO
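The mean and derived-variable steps of the guided practice can be sketched in Python as a cross-check on the spreadsheet result. The column names systolic_bp and diastolic_bp come from the exercise; the three data rows and the name bp_difference are invented here, and a real run would read the exported CSV file instead of the inline text.

```python
import csv
import io

# Inline stand-in for the exported CSV file (values are made up)
csv_text = """systolic_bp,diastolic_bp
120,80
140,90
130,85
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

def column_mean(rows, name):
    """Mean = sum / number of measurements, skipping blank cells."""
    values = [float(row[name]) for row in rows if row[name] != ""]
    return sum(values) / len(values)

mean_sys = column_mean(rows, "systolic_bp")
mean_dia = column_mean(rows, "diastolic_bp")

# Derived variable: systolic minus diastolic, one value per row
for row in rows:
    row["bp_difference"] = float(row["systolic_bp"]) - float(row["diastolic_bp"])

print(mean_sys, mean_dia)
print([row["bp_difference"] for row in rows])
```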
Summary: Calc/Excel is one of the most familiar and intuitive ways to interact with data. It should NOT be used as a primary data collection tool if possible. Use it to look for missing data, perform calculations, filter, sort, and make simple plots. It is possible to make (pivot) tables, but this takes some practice. You generally need another program for more advanced statistics.
On your own practice: Determine mean diastolic blood pressure (diastolic_bp) overall and according to any history of hypertension (hypertension). Create a scatter plot of ounces of hemoglobin (hgb) vs. diastolic BP (diastolic_bp). Create a table counting the number of patients according to hypertension and education level.
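The grouped mean and the counting table in this exercise correspond to what a pivot table does. A minimal Python sketch follows, assuming the education level is stored in a column called education (a guess; only hypertension and diastolic_bp are named in the exercise) and using made-up patient rows.

```python
from collections import Counter, defaultdict

# Made-up patients; "education" is an assumed column name
patients = [
    {"hypertension": 0, "education": 1, "diastolic_bp": 80},
    {"hypertension": 0, "education": 2, "diastolic_bp": 70},
    {"hypertension": 1, "education": 1, "diastolic_bp": 90},
    {"hypertension": 1, "education": 1, "diastolic_bp": 100},
    {"hypertension": 1, "education": 2, "diastolic_bp": 95},
]

# Overall mean diastolic BP
overall_mean = sum(p["diastolic_bp"] for p in patients) / len(patients)

# Mean diastolic BP within each hypertension group
groups = defaultdict(list)
for p in patients:
    groups[p["hypertension"]].append(p["diastolic_bp"])
group_means = {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Cross-table: number of patients per (hypertension, education) combination
cross_table = Counter((p["hypertension"], p["education"]) for p in patients)

print(overall_mean, group_means)
for (hypertension, education), count in sorted(cross_table.items()):
    print(hypertension, education, count)
```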