1
Data Science with Python Pandas
Andrea Bizzego WebValley 2015
2
Overview
Series
DataFrame
Pandas for Time Series
Merging, Joining, Concatenate
Importing data
A simple example

    the python commands will be written here    # this is a comment
3
Set it up!
Open a Terminal
Start ipython notebook
Open the ipython notebook web-page (localhost:8888)
Open ‘tutorial_pandas.ipynb’

    $ ipython notebook

Shift-tab for info about the function (2 times for help)
4
Pandas library
The Pandas library provides useful functions to:
  Represent and manage data structures
  Ease the data processing
  With built-in functions to manage (Time) Series
It uses numpy, scipy, matplotlib functions
Manual: PDF, ONLINE

    import pandas as pd     # to import the pandas library
    pd.__version__          # get the version of the library (0.16)
5
Series: data structure
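Unidimensional data structure
Indexing:
  automatic
  manual
  ! index labels are not necessarily unique !

    data = [1,2,3,4,5]
    s = pd.Series(data)
    s
    s.index
    s = pd.Series(data, index = ['a','b','c','d','d'])
    s['d']          # the repeated label 'd' returns both matching elements
    s[[4]]          # an integer on a non-integer index is read as a position in pandas 0.16; s.iloc[[4]] is the explicit form
    # try with: s = pd.Series(data, index = [1,2,3,4,4])
    s.index = [1,2,3,4,5]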
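The indexing behaviour described on this slide can be sketched as follows. This is a minimal illustration, not part of the original notebook; the variable names (`s_auto`, `s_dup`, `dup_hit`, ...) are made up for the example.

```python
import pandas as pd

data = [1, 2, 3, 4, 5]

# Automatic index: integer positions 0..4
s_auto = pd.Series(data)

# Manual index in which the label 'd' appears twice
s_dup = pd.Series(data, index=['a', 'b', 'c', 'd', 'd'])

unique_hit = s_dup['a']       # a unique label returns a single value
dup_hit = s_dup['d']          # a repeated label returns a Series with both rows
last_by_pos = s_dup.iloc[4]   # position-based access is never ambiguous

print(s_dup.index.is_unique)  # False
print(dup_hit.tolist())       # [4, 5]
```

Because a label lookup may return either a scalar or a Series, checking `index.is_unique` before relying on label access is a cheap safeguard.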
6
Series: basic operations
Mathematically, Series are vectors
Compatible with numpy functions
Some basic functions available as pandas methods
Plotting (based on matplotlib)

    import numpy as np      # import numpy to get some mathematical functions
    random_data = np.random.uniform(size=10)
    s = pd.Series(random_data)
    s+1                     # try other mathematical functions: **2, *2, np.exp(s), …
    s.apply(np.log)
    s.mean()                # try other built-in functions. Use 'tab' to discover …
    s.plot()                # try changing the indexes of s
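A short sketch of the vector-like behaviour listed above. The seeded random generator is an assumption added for repeatability; the slide uses `np.random.uniform` directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)       # seeded generator (assumption, for repeatable values)
s = pd.Series(rng.uniform(size=10))

shifted = s + 1                      # element-wise arithmetic with a scalar
squared = s ** 2                     # element-wise power
logged = np.log(s)                   # a numpy function applied to a Series returns a Series
logged_apply = s.apply(np.log)       # the same computation through Series.apply

print(type(logged).__name__)         # Series
print(np.allclose(logged, logged_apply))
```

Calling the numpy function directly and going through `apply` give the same values here; the direct call is usually the simpler choice for functions numpy already vectorizes.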
7
DataFrame: data structure
Bidimensional data structure
A dictionary of Series, with shared index → each column is a Series
Both rows and columns are indexed (labels need not be unique)

    s1 = pd.Series([1,2,3,4,5], index = list('abcde'))
    data = {'one':s1**s1, 'two':s1+1}
    df = pd.DataFrame(data)
    df.columns
    df.index                # index, columns: assign name (if not existing), or select
    s2 = pd.Series([1,2,3,4,10], index = list('edcbh'))
    df['three'] = s2        # try changing s2 indexes
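The shared index matters when a new column is assigned: values are matched by label, not by position. A minimal sketch follows; column `'one'` uses `s1 ** 2` here purely as an illustrative choice.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
df = pd.DataFrame({'one': s1 ** 2, 'two': s1 + 1})

# s2 has the labels in a different order, lacks 'a', and has an extra 'h'
s2 = pd.Series([1, 2, 3, 4, 10], index=list('edcbh'))
df['three'] = s2

# Rows are matched on labels: 'a' has no partner and becomes NaN,
# while 'h' is not in df.index and is silently dropped.
print(df)
```

This is why changing the labels of `s2` changes which rows end up filled.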
8
DataFrame: accessing values - 1
keep calm
select columns and rows to obtain Series
query function to select rows

    data = np.random.randn(5,2)
    df = pd.DataFrame(data, index = list('abcde'), columns = ['one','two'])
    col = df.one
    row = df.xs('b')
    # type(col) and type(row) are both Series, which you already know how to manage ...
    df.query('one > 0')
    df.index = [1,2,3,4,5]
    df.query('1 < index < 4')
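As a sketch of what `query` does, the example below compares it with the equivalent boolean mask. The seeded generator and the names `by_query`, `by_mask`, and `middle` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)   # seeded generator (assumption, for repeatable values)
df = pd.DataFrame(rng.standard_normal((5, 2)),
                  index=list('abcde'), columns=['one', 'two'])

col = df['one']        # a column is a Series
row = df.loc['b']      # a row is a Series too
row_xs = df.xs('b')    # same row, using the method from the slide

by_query = df.query('one > 0')   # rows where column 'one' is positive
by_mask = df[df['one'] > 0]      # the same selection with a boolean mask

df.index = [1, 2, 3, 4, 5]
middle = df.query('1 < index < 4')   # 'index' refers to the row index

print(list(middle.index))            # [2, 3]
```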
9
DataFrame: accessing values - 2
… madness continues
ix: access by index; works on rows AND on columns
iloc: access by position
you can extract Series
! define a strategy, and be careful with indexes !

    data = np.random.randn(5,2)
    df = pd.DataFrame(data, index = list('abcde'), columns = ['one','two'])
    df.ix['a']              # try df.ix[['a', 'b'], 'one'], types
    df.iloc[1,1]            # try df.iloc[1:,1], types?
    df.ix[1:, 'one']        # works as well...
    # .ix was removed in pandas 1.0; use .loc for labels and .iloc for positions
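Since `.ix` is gone from current pandas, the same selections can be sketched with `.loc` (labels) and `.iloc` (positions). This is one possible translation of the slide's examples, not the original notebook code; the result names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)   # seeded generator (assumption, for repeatable values)
df = pd.DataFrame(rng.standard_normal((5, 2)),
                  index=list('abcde'), columns=['one', 'two'])

row_a = df.loc['a']                  # one row by label: a Series
two_rows = df.loc[['a', 'b'], 'one'] # two rows, one column: a Series
cell = df.iloc[1, 1]                 # one cell by position: a scalar
tail_pos = df.iloc[1:, 1]            # rows 1.. of column 1: a Series
tail_lab = df.loc['b':, 'two']       # the same selection by label

print(type(row_a).__name__, type(cell).__name__)
```

Keeping labels with `.loc` and positions with `.iloc` avoids the ambiguity the slide warns about.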
10
DataFrame: basic operations
DataFrames can be considered as matrices
Compatible with numpy functions
Some basic functions available as pandas methods
  axis = 0: column-wise
  axis = 1: row-wise
df.apply() function
Plotting (based on matplotlib)

    df_copy = df            # it is a link (a second name for the same object)! Use df_copy = df.copy()
    df * df
    np.exp(df)
    df.mean()               # try df.mean(axis = 1); try type(df.mean())
    df.apply(np.mean)
    df.plot()               # try df.transpose().plot()
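The difference between a link and a copy, and the meaning of `axis`, can be checked on a small frame. The values below are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [10.0, 20.0, 30.0]})

alias = df               # a second name for the same object
independent = df.copy()  # a separate object with its own data

df.loc[0, 'one'] = 100.0 # modify the original

print(alias.loc[0, 'one'])        # 100.0: the alias sees the change
print(independent.loc[0, 'one'])  # 1.0: the copy does not

col_means = df.mean()             # axis=0 (default): one value per column
row_means = df.mean(axis=1)       # axis=1: one value per row
```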
11
Pandas for Time Series
Used in financial data analysis; we will use it for signals
TimeSeries: a Series whose index is a timestamp
Pandas functions for Time Series
Useful to select a portion of signal (windowing)
query method: not available on Series → convert to a DataFrame

    times = np.arange(0, 60, 0.5)
    data = np.random.randn(len(times))
    ts = pd.Series(data, index = times)
    ts.plot()
    epoch = ts[(ts.index > 10) & (ts.index <=20)]
    epoch.plot()
    ts_df = pd.DataFrame(ts)
    ts_df.query('10 < index <=20')
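A sketch of the windowing step, showing that the boolean mask on the Series and `query` on the converted DataFrame select the same samples. The column name `'signal'` and the seeded generator are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)           # seeded generator (assumption)
times = np.arange(0, 60, 0.5)            # one sample every 0.5 s for 60 s
ts = pd.Series(rng.standard_normal(len(times)), index=times)

# Window (10, 20] selected with a boolean mask on the index
epoch = ts[(ts.index > 10) & (ts.index <= 20)]

# Same window through query, after converting the Series to a DataFrame
epoch_df = ts.to_frame('signal').query('10 < index <= 20')

print(len(epoch), epoch.index.min(), epoch.index.max())
```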
12
Few notes about Timestamps
Absolute timestamps VS relative timestamps
  An absolute timestamp is important for synchronization
Unix timestamps VS date/time representation (converter)
  Unix timestamp: the reference for signal processing; counts seconds from 1970, 1st January, 00:00:00 UTC
  date/time: easier to understand
  Unix timestamp: easier to select/manage
Pandas functions to manage Timestamps

    import datetime
    import time
    now_dt = datetime.datetime.now()    # datetime object
    now_str = time.ctime()              # human-readable string
    now_ut = time.time()                # Unix timestamp (seconds)
    # find out how to convert datetime <--> timestamp
    ts.index = ts.index + now_ut
    ts.index = pd.to_datetime(ts.index, unit = 's')
    ts[(ts.index > -write date time here-)]
    ts.plot()
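One way to do the datetime <--> timestamp conversion is sketched below. The date and the start time `t0` are hypothetical values chosen for the illustration, and UTC is used throughout to keep the conversion unambiguous.

```python
import datetime
import pandas as pd

utc = datetime.timezone.utc

# datetime -> Unix timestamp -> datetime (hypothetical date)
dt = datetime.datetime(2015, 6, 22, 12, 0, 0, tzinfo=utc)
ut = dt.timestamp()                                   # seconds since 1970-01-01 00:00:00 UTC
back = datetime.datetime.fromtimestamp(ut, tz=utc)    # round trip

# Relative index (seconds from the start of a recording) -> absolute datetimes
t0 = 1_000_000_000                                    # hypothetical recording start, Unix seconds
relative = pd.Index([0.0, 0.5, 1.0])
ts = pd.Series([10, 20, 30], index=pd.to_datetime(relative + t0, unit='s'))

# Selection with a readable date/time instead of a raw number
later = ts[ts.index > pd.Timestamp('2001-09-09 01:46:40')]
print(later)
```

`pd.to_datetime(..., unit='s')` returns timezone-naive values that are interpreted as UTC.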
13
Merge, Join, Concatenate
Simple examples (concatenate, append)
SQL-like functions (join, merge)
Refer to chapter 17 of the Pandas Manual
Cookbooks

    df1 = pd.DataFrame(np.random.randn(6, 3), columns=['A', 'B', 'C'])
    df2 = pd.DataFrame(np.random.randn(6, 3), columns=['D', 'E', 'F'])
    df3 = df1.copy()
    df = pd.concat([df1, df2])
    df = df1.append(df2)    # try df = df1.append(df3)
                            # try df = df1.append(df3, ignore_index = True)
    # DataFrame.append was removed in pandas 2.0; pd.concat([df1, df2]) is the equivalent
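The effect of each exercise can be sketched with `pd.concat`, which covers what `append` did. The result names (`stacked`, `side`, `dup`, `fresh`, `joined`) are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)   # seeded generator (assumption)
df1 = pd.DataFrame(rng.standard_normal((6, 3)), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(rng.standard_normal((6, 3)), columns=['D', 'E', 'F'])
df3 = df1.copy()

stacked = pd.concat([df1, df2])                   # rows stacked; missing columns filled with NaN
side = pd.concat([df1, df2], axis=1)              # columns placed side by side
dup = pd.concat([df1, df3])                       # index labels 0..5 appear twice
fresh = pd.concat([df1, df3], ignore_index=True)  # index rebuilt as 0..11
joined = df1.join(df2)                            # SQL-like join on the index

print(stacked.shape, side.shape, dup.index.is_unique)
```

Stacking frames with different columns produces a 12 x 6 frame that is half NaN, which is rarely what is wanted; `axis=1` or `join` is the usual choice when the rows describe the same observations.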
14
Importing data
General form:

    data_df = pd.read_table(FILE, sep = ',', skiprows = 5, header = 1, usecols = [0,1,3], index_col = 0, nrows=10)
    # the slide passed header = True; current pandas rejects booleans here, so the equivalent integer 1 is used

    FILE = '/path/to/sample_datafile.txt'
    data_df = pd.read_table(...)    # try header = 0, names = ['col1','col2', 'col3'] and adjust skiprows
                                    # try nrows=None
    data_df.plot()
    data = pd.read_table(FILE, sep = ',', skiprows=[0,1,2,3,4,5,7], header=2, index_col=0)    # empirical solution
    data.plot()
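The import parameters can be tried without a file on disk by reading from an in-memory string. The file content below is entirely hypothetical, and `pd.read_csv` is used in place of `pd.read_table`; with `sep=','` the parameters shown behave the same way.

```python
import io
import pandas as pd

# Hypothetical file: two comment lines, a header row, then the data
raw = """# sample recording
# device: demo
time,x,y,z
0.0,1,10,100
0.5,2,20,200
1.0,3,30,300
1.5,4,40,400
"""

data_df = pd.read_csv(
    io.StringIO(raw),
    sep=',',
    skiprows=2,          # skip the two comment lines
    header=0,            # first remaining line holds the column names
    usecols=[0, 1, 3],   # keep time, x and z
    index_col=0,         # use the first kept column (time) as index
    nrows=3,             # read only the first three data rows
)

print(data_df)
```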
15
Simple feature extraction example
    import pandas as pd

    WINLEN =                                   # length of window
    WINSTEP =                                  # shifting step

    data = pd.read_table(..., usecols=[0,1])   # import data
    t_start = data.index[0]                    # start first window
    t_end = t_start + WINLEN                   # end first window
    feat_df = pd.DataFrame()                   # initialize features df
    while (t_end < data.index[-1]):            # cycle
        data_curr = data.query(str(t_start)+'<=index<'+str(t_end))    # extract portion of the signal
        mean_ = data_curr.mean()[0]            # extract mean; why [0]?
        sd_ = data_curr.std()[0]               # extract …
        feat_row = pd.DataFrame({'mean':mean_, 'sd':sd_}, index=[t_start])    # merge features
        feat_df = pd.concat([feat_df, feat_row])    # append to features df (DataFrame.append was removed in pandas 2.0)
        t_start = t_start + WINSTEP            # shift window start
        t_end = t_start + WINLEN               # shift window end, keeping the window length fixed
    feat_df.plot()
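A runnable sketch of the same sliding-window idea on a synthetic signal. The window length, step, signal, and column name are all assumptions, since the slide leaves them open. Rows are collected in a list and turned into a DataFrame once, which avoids growing a DataFrame inside the loop.

```python
import numpy as np
import pandas as pd

WINLEN = 5.0     # window length in seconds (assumed value)
WINSTEP = 2.5    # shifting step in seconds (assumed value)

# Synthetic signal standing in for the imported data file
rng = np.random.default_rng(0)
times = np.arange(0, 60, 0.5)
data = pd.DataFrame({'signal': rng.standard_normal(len(times))}, index=times)

rows = []
t_start = data.index[0]
t_end = t_start + WINLEN
while t_end < data.index[-1]:
    # Samples with t_start <= t < t_end
    window = data[(data.index >= t_start) & (data.index < t_end)]
    rows.append({'t_start': t_start,
                 'mean': window['signal'].mean(),
                 'sd': window['signal'].std()})
    t_start = t_start + WINSTEP
    t_end = t_start + WINLEN      # window length stays constant

feat_df = pd.DataFrame(rows).set_index('t_start')
print(feat_df.head())
```

Selecting the column by name before calling `mean()` returns a scalar directly. On the slide, `mean()` is called on the whole DataFrame and returns one value per column, which is why `[0]` is needed there.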