Data Manipulations
These manipulations are usually akin to database functionalities. For simplicity we separate them into two parts:
- concat, merge, pivot: combine data from different dataframes or modify the organization of the current dataframe. Usually this does not involve any computations.
- groupby: aggregation of data (partitioning) to perform further computations.
Combining/Reorganizing data
We handle here some of the most used functionalities. For an overall presentation, we refer to the pandas tutorial Merge, join, concatenate and compare.
Concatenation
Given a list of dataframes [df1, ..., dfN], concatenation returns a single dataframe with df1, ..., dfN stacked either vertically (axis=0) or horizontally (axis=1).
Clearly, in order to get a meaningful concatenation, the columns or index among the dataframes to be concatenated should share common elements.
import pandas as pd
data1 = [
[1, 'x'],
[2, 'y'],
[3, 'z']
]
data2 = [
[2, 'a'],
[3, 'b'],
[4, 'z']
]
# two dataframes with common columns
df1 = pd.DataFrame(
data = data1,
columns = ['A', 'B'],
index = [0, 1, 2]
)
df2 = pd.DataFrame(
data = data2,
columns = ['A', 'B'],
index = [3, 4, 5]
)
# concatenate vertically (common columns, distinct index)
df = pd.concat([df1, df2], axis = 0)
display(df1)
display(df2)
print("vertical concatenation")
display(df) #
# two dataframes with a common index (distinct columns)
df1 = pd.DataFrame(
data = data1,
columns = ['A', 'B'],
index = [0, 1, 2]
)
df2 = pd.DataFrame(
data = data2,
columns = ['C', 'D'],
index = [0, 1, 2]
)
# concatenate horizontally (common index, distinct columns)
df = pd.concat([df1, df2], axis = 1)
display(df1)
display(df2)
print("horizontal concatenation")
display(df) #
When to use concat, and when not to?
Even if concat handles mismatches in index and columns (see the sketch below), I strongly advise limiting the use of concat to the following situations:
- Vertical concatenation (axis=0):
  - the index of each dataframe is disjoint from the others
  - the columns of each dataframe are the same
- Horizontal concatenation (axis=1):
  - the index of each dataframe is the same
  - the columns of each dataframe are disjoint from the others
For other situations, the merge functionality is better, though it handles only two dataframes at a time.
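As a small sketch of this behaviour (the dataframes dfa and dfb below are made up for the illustration): with only partially matching columns, the vertical concatenation still succeeds but fills the non-shared columns with NaN.
# two small dataframes that only share the column 'A'
dfa = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
dfb = pd.DataFrame({'A': [3, 4], 'C': ['u', 'v']})
# concat still works, but 'B' and 'C' are filled with NaN where they are missing
display(pd.concat([dfa, dfb], axis=0))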
Merging
Given two dataframes left_df and right_df, the merge operation provides a SQL type of merging between the two.
Though it can be done on indexed dataframes, it is easier on non-indexed ones (use reset_index() beforehand to get the dataframes to be merged correctly).
The basic operation is pd.merge(left_df, right_df, on = ..., how = ...)
where
- on: the common column name (or a list thereof) in the two dataframes.
- how: the method by which the two dataframes are combined according to on='col', where 'col' is the common column name in each dataframe:
  - left: use the values of col from the left dataframe. That is, for each row in the left dataframe, concatenate with the row from the right dataframe if the col value is identical, otherwise fill the content with NaN.
  - right: use the values of col from the right dataframe (same as swapping left and right in the previous point).
  - inner: use the intersection of the values of col in left and right.
  - outer: use the union of the values of col in left and right.
data1 = [
[1, 'x'],
[2, 'y'],
[3, 'z']
]
data2 = [
[2, 23.5],
[3, 11.2],
[4, 6.0]
]
# two dataframes without a specified index
# they share the same column name `ints` #
left_df = pd.DataFrame(
data = data1,
columns = ['ints', 'letters']
)
right_df = pd.DataFrame(
data = data2,
columns = ['ints', 'floats']
)
# left/right/inner/outer merging
l_join = pd.merge(left_df, right_df, on='ints', how = 'left')
r_join = pd.merge(left_df, right_df, on='ints', how = 'right')
i_join = pd.merge(left_df, right_df, on='ints', how = 'inner')
o_join = pd.merge(left_df, right_df, on='ints', how = 'outer')
display(l_join) #
display(r_join) #
display(i_join) #
display(o_join) #
There are many more options:
- on can be a list of columns, in which case each tuple of values from these columns is the key for the merging operation.
- If the names of the key columns in the two dataframes are different, you can use left_on='some_col' with right_on='other_col'.
- If you want to merge on the index rather than on some column, you can use left_index=True and/or right_index=True.
- Columns outside of the merge condition that share the same name will be appended with the suffixes _x and _y in the resulting dataframe.
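As a small sketch of the left_on/right_on and suffix behaviour (the dataframes sales and stock and their columns are invented for the illustration):
# two dataframes whose key columns have different names
sales = pd.DataFrame({'product_id': [1, 2, 3], 'price': [10.0, 20.0, 30.0]})
stock = pd.DataFrame({'item_id': [2, 3, 4], 'price': [9.5, 19.5, 29.5]})
# left_on/right_on give the key column in each dataframe;
# the shared non-key column 'price' receives the suffixes _x and _y
merged = pd.merge(sales, stock, left_on='product_id', right_on='item_id', how='inner')
display(merged)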
Pivot
The pivot function allows transforming a dataframe along a new index and columns.
As an illustration, let us consider the following simple dataframe without a specific index.
| Item | CType | USD | EUR |
|---|---|---|---|
| Item0 | Gold | 1.2$ | 1€ |
| Item0 | Bronze | 2.4$ | 2€ |
| Item1 | Gold | 3.6$ | 3€ |
| Item1 | Silver | 4.8$ | 4€ |
import numpy as np
# declare the data
data = np.array(
[
["Item0", "Item0", "Item1", "Item1"],
["Gold", "Bronze", "Gold", "Silver"],
["1.2$", "2.4$", "3.6$", "4.8$"],
["1€", "2€", "3€", "4€"],
]
)
# create the dataframe
df = pd.DataFrame(data=data)
# the dataframe is not in the right direction, we transpose
df = df.T
# set the column names
df.columns = ["Item", "CType", "USD", "EUR"]
display(df)
We want to pivot the table to get Item as index and CType as columns, with the values in $ as data.
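A minimal sketch of this first pivot, reusing the dataframe df built above (the variable name pivoted_usd is ours):
# Item as index, CType as columns, the USD column as values
pivoted_usd = df.pivot(index='Item', columns='CType', values='USD')
display(pivoted_usd)
Combinations of Item and CType that do not appear in the original dataframe (e.g. Item0 with Silver) show up as NaN.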
We then want to pivot the table to get Item as index and CType as columns, but with all the values ($ and €).
Clearly this is not possible in a plain tabular format, so pivot will create a multilevel index for the columns.
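A minimal sketch of this second pivot (again, the variable name is ours):
# without the `values` argument, all remaining columns (USD and EUR) are kept,
# which yields a multilevel column index (value column, CType)
pivoted_all = df.pivot(index='Item', columns='CType')
display(pivoted_all)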
Groupby
The groupby functionality is another SQL type of functionality to perform computations based on a partitioning of the data.
The functionality can be described in three steps:
- Partition: split the dataframe into a partition of sub-dataframes
- Apply: apply a function on each of these partitions
- Combine: return the results of the function on each element of this partition
import pandas as pd
data = [
["Math", "Master", 87],
["Math", "Master", 76],
["CS", "PhD", 91],
["Physics", "Master", 84],
["Math", "PhD", 96],
["CS", "Master", 72],
["CS", "Master", 81],
["Math", "PhD", 98],
["Physics", "PhD", 87],
["Physics", "PhD", 85],
["Physics", "Master", 45],
]
cols = ["Dept", "Level", "Grade"]
df = pd.DataFrame(data = data, columns = cols)
display(df) #
# Computation of average grade
# per dept
avg_dept = df.groupby('Dept')['Grade'].mean()
# per level
avg_level = df.groupby('Level')['Grade'].mean()
# per dept and level (multiindex dept/level)
avg_dept_level = df.groupby(['Dept', 'Level'])['Grade'].mean()
display(avg_dept) #
display(avg_level) #
display(avg_dept_level) #
In this simple example, we performed the full chain: partition (groupby), apply (mean on the column 'Grade'), and combine (the returned series of averages).
The groupby itself performs the partition; it returns an iterable running through each sub-dataframe of that partition.
dept_grp = df.groupby('Dept')
for name, subdf in dept_grp:
    print(f"Name:\t{name}")
    print("with sub dataframe as content:")
    print(subdf)
In the main example we just applied one standard function on one of the columns of the groupby. It is possible to apply
- user defined functions (do not reinvent the wheel: if the function is already defined in pandas, use it)
- several functions
- a single function that treats everything.
# take a series of values and return the max - min
def maxmin(x):
    return x.max() - x.min()

# .agg on a single column returns a series; rename it to 'Maxmin'
result1 = df.groupby('Dept')['Grade'].agg(maxmin).rename('Maxmin')
display(result1) #
# apply different functions to different columns and return the result
result2 = df.groupby('Dept').agg(
Counting = ('Level', 'count'),
Average = ('Grade', 'mean'),
Std = ('Grade', 'std'),
Maxmin = ('Grade', maxmin)
)
print(result2) #
# define a single function that will be applied to each sub-dataframe
def treatment(dftmp):
    # the function receives each sub-dataframe as input
    count = len(dftmp)  # number of rows in the sub-dataframe
    mean = dftmp['Grade'].mean()
    std = dftmp['Grade'].std()
    mixing_twocols = dftmp['Level'].iloc[0] + ' with grade ' + dftmp['Grade'].astype(str).iloc[-1]
    # we return a pandas series with the results
    return pd.Series(
        {
            'Count': count,
            'Mean': mean,
            'Std': std,
            'Some stupid things': mixing_twocols
        }
    )
result3 = df.groupby('Dept').apply(treatment)
print(result3) #