Handeling Data with Pandas
The major input to any scientific computations are data (for calibration purposes, further computations, etc.). Several difficulties comes to mind in this direction:
- One dimensional: basically, long sequences of data such as time series that needs vertical efficient access
- Multi dimensional: data that are large in dimension (such as matrices, tensor) that need to be accessed horizontaly for efficient computations
- Heterogeneity: handeling data of different nature at the same time (numbers, vectors, strings, time, etc.)
Furthermore, before even organized those data along these dimensions, data have to be
- collected: how? where to store? which format? how to read?
- cleaned: missing data? mesurement errors?
- converted: strings of dates to datetime? flags to binary values?
- pre-processed: up/down sampling? Partial selection? etc.
We saw that numpy provides efficient single objects (modulo memory size) to handle the dimensionality and types (numerical, strings, etc.). However, it is quite convoluted to handle heterogeneity as a numpy array shall only contain fixed typed data.
Historically, databases, which we will address later, are the answer to these questions.
However, based on numpy, the library pandas
provides a very powerfull and handy answer to most of the points raised here above.
The Pandas library is mature and deep with a lot of functionalities.
As numpy it is imported as follows with the usual nickname pd
Pandas is based on two major components:
- Series: One dimensional labelled array or data with the following components:
data
: Numpy array (one dimensional usually)index
: labels of any type for each rowname
: denomination of the series
- Dataframes: Collection of series of different types with common index (tabular format)
Pandas Series
A pandas
series is a sequence of data
(a numpy array) together with an explicit index
as well as a name
.
import pandas as pd
import numpy as np
s = pd.Series(data = ["NY", "NY" , "London", np.nan, "SG"])
display(s)
In this case we just provide the data
part of the series, the index
is automatically integers and the name
is None.
To specify the labels (or index) of the series as well as the name we proceed as follows
pop = pd.Series(
data = [25, 22 , 8, 5],
index = ['SH', 'BJ', 'London', 'SG'],
name = 'Population'
)
display(pop)
Pandas Dataframe
A dataframe is a tabular collection of series with a common index
# Create a plain dataframe with 3 columns and 10 rows
# filled with random numbers
# generate 10x3 random data
data = np.random.rand(10, 3)
print(data)
# create the dataframe
df = pd.DataFrame(
data = data
)
display(df)
Since each column is a pandas series, the name of the column is the corresponding name of the series. As for the index, it is common for each column.
# generate a 10*3 random set of data
data = np.random.rand(10, 3)
print("Data:")
print(data)
# provides a list of names
cols = ['Tencent', 'Alibaba', 'Baidu']
print("Columns")
print(cols)
# Create a datetime index of size 10
idx = pd.date_range(start = '2024-04-01', periods = 10)
print("Index")
print(idx)
# create the dataframe
df = pd.DataFrame(
data = data,
index = idx,
columns = cols
)
display(df)
Short Infos about the Dataframe
Several options allows to inspect rapidly the nature of the Dataframe/Series
head
/tail
: shows the first/last rows of the dataframeindex
/columns
: show the index/columnsinfo
: show generic infos from the dataframedescribe
: returns basic statistics about the dataframeplot
: allows to visualize data from the dataframe
# showing the first lines (by default 5)
display(df.head())
# showing the last 7 lines
display(df.tail(7))
# show the index
display(df.index)
# shows the columns
display(df.columns)
# show the info from the dataframe
display(df.info())
# discplay a summary statistics about the dataframe
display(df.describe())
A more easier way to explore a dataframe is to plot the data available.
Note
By default pandas
will use the ploting library matplotlib
to show data, it is however not particulary nice.
Here we make use of plotly
as the library to plot data
# import plotly and tell pandas to use it as backend
import plotly.graph_objs as go
pd.options.plotting.backend = "plotly"
# generate a large dataframe
idx = pd.date_range(start = '2014-04-08', end = '2024-04-18', freq = 'B')
N = len(idx)
cols = ['Tencent', 'Baidu', 'Alibaba']
# generate random data between -0.05 and 0.05
data = 0.05 - np.random.rand(N, 3) / 10
# take the cumulative product of 1+data along the time axis
data = (1+data).cumprod(axis = 0)
df = pd.DataFrame(
data = data,
index = dates,
columns=['Tencent', 'Alibaba', 'Baidu']
)
# Plot each columns
df.plot()
# plot a specific column
df['Tencent'].plot()
# plot histogram of each
df.plot(kind = 'histogram')
Selecting Data
Selecting Columns
A collection of series, selecting specific columns of a dataframe is done as follows
# get the column Tencent (which is a series)
display(df['Tencent'])
# get a list of columns (hence a dataframe)
display(df[['Tencent', 'Baidu']])
Selecting Rows
Two main methods for the selection in terms of rows
iloc
: positional, integer wize, location (asnumpy
)loc
: in terms of labels
As for iloc
is works basically as numpy
# getting the 3rd row
display(df.iloc[3])
# getting the last row
display(df.iloc[-1])
# getting the 3rd to the 5th row
display(df.iloc[2:6])
As for loc
, the selection is done in terms of label or slice
# show a specific row for 2024-04-03
display(df.loc['2024-04-03'])
# show a range (note that the labels starts from 2024-04-01)
display(df.loc['2024-03-15': '2024-04-10'])
# show everything after '2024-04-07'
display(df.loc['2024-04-07':])
# the output of a range is a dataframe, we can select a subsection of columns
df.loc['2024-04-03':'2024-04-18'][['Tencent', 'Alibaba']]
Read and Write Data
Generating random dataframe or defining per hand the data has limited use, since our goal is to handle large sets of data of heterogeneous type. Data are usually available in different format
- file data: (among others)
csv
: text files comma (or tab) separated values files. Most basic, most common. Extension is usually.csv
or.tvs
.excel
: excel files. More advanced files with particular structure (several spreadsheets for instance). Extension is usually.xls
or.xlsx
pickle
: python data storage. Rarely used as it is not that efficient and dependent on the version of python used to save or load. Contains however all the information of the dataframe. Extension is.pkl
hdf
: cross platform storage of data and structure for large datasets. Used quite often for bio data or physics. Extension is.hdf
orhd5
.- ...
- databases: Either online or offline. We will see that later.
To read specific data files stored in folder ./data/
where ./
is where your python script is located and running, loading data is done as follows:
# read csv file
df = pd.read_csv('./data/my_dataset.csv')
# read tab separated values file
df = pd.read_csv('./data/my_dataset.tvs', sep = '\t')
# read (the first sheet of) an excel file
df = pd.read_excel('./data/my_dataset.xls')
# read the third sheet of an excel file
df = pd.read_excel('./data/my_dataset.xls', sheet_name = 2)
# read the sheet of an excel file with sheet name 'sheet05'
df = pd.read_excel('./data/my_dataset.xls', sheet_name = 'sheet05')
csv
).
# save the new dataframe into a new file `my_data_20240428.csv` in the subfolder ./data/
df.to_csv('./data/my_data_20240428.csv')
DUMP EXCEL
Excel has been used for many years as the prefered hybrid solution from your uncle to handle tabular data with point and click. However, firstly, the format is not open source, complex, buggy, not able to handle large datasets, slow, and usually not platform compatible. Secondly, you are younger and more knowladgable about tech than your uncle: It is a bad idea in 2024 to handle data in this format, so forget about excel alltogether as a format and platform.
Organizing and Cleaning Data
When you read a dataset into a dataframe, pandas tries to guess as well as it can the structure and type (detecting if there is a header for the column names, or the type of each column). However, most of the time, you have to proceed first first with the following operations
- Convert each column to the correct format
- Change the column names
- Set an index (add a new one or set one existing column as an index)
For this section we will use the dataset csi_short.csv
as well as movies.xls
and suppose that it is stored into the folder ./data/csi_short.csv
and ./data/movies.xls
.
You can download the datasets from here if you don't have them already.
We start with csi_short
# load the dataset
dfcsi = pd.read_csv('./data/csi_short.csv')
# Check content
display(dfcsi.head())
# Check datatype of the columns
display(dfcsi.dtypes)
The data set consists of 5 daily stock prices.
- We convert the column of dates to a datetime format
- We rename the column
Date
todate
- We set the newly renamed column
date
as an index - We rename the stocks columns to
SXX
whereXX
stands for01
,02
, ...,15
, ...
# conversion to a datetime format of the column date
dfcsi['Date'] = pd.to_datetime(dfcsi['Date'])
# Rename the column `Date` to `date` using dictionarry assignment
dfcsi = dfcsi.rename(columns = {'Date': 'date'})
# Set the colum date as index
dfcsi = dfcsi.set_index('date')
# rename the stock columns from their names to SXX (we have N stocks where N is the number of columns)
COLSNAMES = [f"S{i:02}" for i in range(1,len(dfcsi.columns)+1)]
print(COLSNAMES)
# Assignment from the column names as array
dfcsi.columns = COLSNAMES
display(dfcsi)
Let us handle movies.xls
# load the excel data
dfmovies = pd.read_excel('./data/movies.xls')
# show the basic infos
display(dfmovies.info())
while looking at the infos you will see that pandas overdone it in terms of conversion of columns
year
in the original file is a string which it converted into an int. We want a dateduration
is in minutes which is converted into an int. We want a timedelta.Aspect ratio
is given as 16/3, 16/9... in the original file which has been converted to a float. We want a string.Facebook Likes - Actor 3
as well asReviews by Crtiics
unlike the other columns is not converted to an int. We want all Facebook and revies by users likes to be converted to int.Reviews by Crtiics
is mispelled and shoud be renamed toReviews by Critics
NaN
values and int
If you rapidly check, you will see that you can not plainly convert some float columns you know are integers such as Facebook Likes - Actor 3
since there are some nan values.
In this case, the column shall be casted to the large int datatype of pandas pd.Int64Dtype()
You can naturally handle these columns one by one, however, you can specify pandas while loading the csv file the dtype you want and then convert the columns that are necessary.
# Set the dtypes as dictionarry before loading the csv
DTYPES = {
"Title": str,
"Year": str, # We will convert the year into datetime object so as string it is easier
"Genres": str,
"Language": str,
"Country": str,
"Content Rating": str,
"Duration": pd.Int64Dtype(), # the duration is in minutes, we will convert it in timedelta
"Aspect Ratio": str, # The aspect ratio is of the form 16/9
"Budget": pd.Int64Dtype(),
"Gross Earnings": pd.Int64Dtype(),
"Director": str,
"Actor 1": str,
"Actor 2": str,
"Actor 3": str,
'Facebook Likes - Director': pd.Int64Dtype(),
'Facebook Likes - Actor 1': pd.Int64Dtype(),
'Facebook Likes - Actor 2': pd.Int64Dtype(),
'Facebook Likes - Actor 3': pd.Int64Dtype(),
'Facebook Likes - cast Total': pd.Int64Dtype(),
'Facebook likes - Movie': pd.Int64Dtype(),
'Facenumber in posters': pd.Int64Dtype(),
'User Votes': pd.Int64Dtype(),
'Reviews by Users': pd.Int64Dtype(),
'Reviews by Crtiics': pd.Int64Dtype(),
'IMDB Score': float,
}
# reload the data by specifying the dictionarry
dfmovies = pd.read_excel("./data/movies.xls", dtype=DTYPES)
# rename the column Reviews by Crtiics to Reviews by Critics
dfmovies = dfmovies.rename(columns={"Reviews by Crtiics": "Reviews by Critics"})
# Convert the column year to datetime (string %Y)
dfmovies["Year"] = pd.to_datetime(dfmovies["Year"], format="%Y")
# Convert the column duration from number of minutes to a timedelta
dfmovies["Duration"] = pd.to_timedelta(dfmovies["Duration"], unit="m")
# show the info
dfmovies.info()