Skip to content

Data Manipulations

Those manipulations are usually akin to database functionalities. For simplicity we separate into two parts

  • concat, merge, pivot: combine data from different dataframes or modify the structure of the organization of the current dataframe. Usually does not involve any computations.
  • groupby: Aggregation of data (partitioning) to perform further computations.

Combining/Reorganizing data

We handle here some most used functionalities. For an overall presentation, we refer to the pandas tutorial Merge, join, concatenate and compare.

Concatenation

Given a list of dataframe [df1, ..., dfN], concatenation intends to return a single dataframe with df1,..., dfN stacked either vertically (axis = 0) or horizontally (axis = 1). Clearly, in order to get a meaningful concatenation, the columns or index among the dataframes to be concatenated should share common elements.

import pandas as pd

data1 = [
    [1, 'x'],
    [2, 'y'],
    [3, 'z']
]
data2 = [
    [2, 'a'],
    [3, 'b'],
    [4, 'z']
]

# two dataframes with common columns
df1 = pd.DataFrame(
    data = data1,
    columns = ['A', 'B'],
    index = [0, 1, 2]
)
df2 = pd.DataFrame(
    data = data2,
    columns = ['A', 'B'],
    index = [3, 4, 5]
)

# concatenate vertically (common columns, distinct index)
df = pd.concat([df1, df2], axis = 0)
display(df1)
display(df2)
print("vertical concatenation")
display(df)     # 

# two dataframes with common columns
df1 = pd.DataFrame(
    data = data1,
    columns = ['A', 'B'],
    index = [0, 1, 2]
)

df2 = pd.DataFrame(
    data = data2,
    columns = ['C', 'D'],
    index = [0, 1, 2]
)

# concatenate horizontally (common index, distinct columns)
df = pd.concat([df1, df2], axis = 1)
display(df1)
display(df2)
print("horizontal concatenation")
display(df)     # 

When to concat and when better not?

Even if concat handles mismatched in index and columns, I strongly advise to limit the use concat in the following situations

  • Vertical concatenation (axis=0):
    • the index of each dataframe are disjoint
    • the columns of each dataframe are the same
  • horizontal concatenation (axis=1):
    • the index of each dataframe are the same
    • the columns of each dataframe are disjoint

For other situations, the merge functionality is better though it handles only two dataframes at the same time.

Merging

Given two dataframes left_df and right_df, the merge operation will provide a SQL type of merging between these two.

Though it can be done on indexed dataframes, it is easier to get on non indexed ones (use the reset_index() to get the dataframes to be merged correctly).

The basic operation is pd.merge(left_df, right_df, on = ..., how = ...)

Where

  • on: is the common column name (or a list thereof) in the two dataframes.
  • how: is the method how the two dataframes are combined according to on='col' where 'col' is the common column name in each dataframe.

    • left: use the left dataframe col. That is, for each row in the left dataframe, it concatenate with the row from the right dataframe if the col value is identical, otherwize it fill the content with nan.
    • right: use the right dataframe col (same as swapping left and right) in the previous point.
    • inner: Use the intersection of values in left and right col.
    • outer: use the union of the values in left and right col
data1 = [
    [1, 'x'],
    [2, 'y'],
    [3, 'z']
]
data2 = [
    [2, 23.5],
    [3, 11.2],
    [4, 6.0]
]

# two dataframe without specified index
# they share the sam column name `ints` #
left_df = pd.DataFrame(
    data = data1,
    columns = ['ints', 'letters']
)
right_df = pd.DataFrame(
    data = data2,
    columns = ['ints', 'floats']
)

# left/right/inner/outer merging
l_join = pd.merge(left_df, right_df, on='ints', how = 'left')
r_join = pd.merge(left_df, right_df, on='ints', how = 'right')
i_join = pd.merge(left_df, right_df, on='ints', how = 'inner')
o_join = pd.merge(left_df, right_df, on='ints', how = 'outer')


display(l_join)  # 
display(r_join)  # 
display(i_join)  # 
display(o_join)  # 

There many more options

  • the on can be a list of columns can be passed, on which each tuple of this list is the key for the merging operation.
  • If the names of the columns in the two dataframes are different you can use left_on = 'some_col with right_on='other_col'
  • If you want to merge on index rather than some col, you can use left_index = True and/or right_index = True
  • Columns outside of the merge condition that share the same name will be appended with a suffix _x and _y in the resulting dataframe.

Pivot

The function pivot allows to transform a dataframe along a new column and index. As illustration, let us consider the simple dataframe without specific index.

Item CType USD EUR
Item0 Gold 1.2$ 1€
Item0 Bronze 2.4$ 2€
Item1 Gold 3.6$ 3€
Item1 Silver 4.8$ 4€
# declare the data
data = np.array(
    [
        ["Item0", "Item0", "Item1", "Item1"],
        ["Gold", "Bronze", "Gold", "Silver"],
        ["1.2$", "2.4$", "3.6$", "4.8$"],
        ["1€", "2€", "3€", "4€"],
    ]
)
# create the dataframe
df = pd.DataFrame(data=data)
# the dataframe is not in the right direction, we transpose
df = df.T
# set the column names
df.columns = ["Item", "CType", "USD", "EU"]
display(df)

We want to pivot the table to get Item as index and CType as columns with values in $ as data.

p = df.pivot(index="Item", columns="CType", values="USD")

display(p) # 

We want to pivot the table to get Item as index and CType as columns but with all the values ($ and E). Clearly this is not possible in a plain tabular format so that it will create a multilevel index for the columns

p = df.pivot(index="Item", columns="CType")

display(p) # 

Groupby

The groupby functionality is another SQL type functionality to perform computation based on a partitioning of the data.

The functionality can be entailed into three steps

  • Partition: split the dataframe into a partition of sub dataframes
  • Apply: Apply a function on each of these partitions
  • Combine: return the results of the function on each element of this partition
import pandas as pd

data = [
    ["Math", "Master", 87],
    ["Math", "Master", 76],
    ["CS", "PhD", 91],
    ["Physics", "Master", 84],
    ["Math", "PhD", 96],
    ["CS", "Master", 72],
    ["CS", "Master", 81],
    ["Math", "PhD", 98],
    ["Physics", "PhD", 87],
    ["Physics", "PhD", 85],
    ["Physics", "Master", 45],
]
cols = ["Dept", "Level", "Grade"]
df = pd.DataFrame(data = data, columns = cols)
display(df) # 

# Computation of average grade
# per dept
avg_dept = df.groupby('Dept')['Grade'].mean()
# per level
avg_level = df.groupby('Level')['Grade'].mean()
# per dept and level (multiindex dept/level)
avg_dept_level = df.groupby(['Dept', 'Level'])['Grade'].mean()

display(avg_dept)       # 
display(avg_level)      # 
display(avg_dept_level) # 

In this simple example, we performed the full chain partition (groupby), apply (mean to column 'Grade').

The groupby performs the partition, it returns an iterator running through each sub dataframe from that partition.

dept_grp = df.groupby('Dept')

for name, subdf in dept_grp:
    print(f"Name:\t{name}")
    print("with sub dataframe as content:")
    print(subdf)

In the main example we just performed one standard function one one of the columns of the groupby. It is possible to apply

  • user defined functions (do not invent the wheel, if the function is already defined in pandas use it)
  • several functions
  • a single function that treats everything.
# take a numpy array and return the max - min
def maxmin(x):
    return x.max() - x.min()

result1 = df.groupby('Dept')['Grade'].agg(maxmin).rename(columns = {'Grade': 'Maxmin'})
display(result1)        # 

# apply different functions to different columns and return the result
result2 = df.groupby('Dept').agg(
    Counting = ('Level', 'count'),
    Average = ('Grade', 'mean'),
    Std = ('Grade', 'std'),
    Maxmin = ('Grade', maxmin)
)
print(result2)          # 

# define a single function that will apply to each subdataframes

def treatment(dftmp):
    # the Function will get as input each subdataframes
    count = dftmp.size
    mean = dftmp['Grade'].mean()
    std = dftmp['Grade'].std()
    mixing_twocols = dftmp['Level'].iloc[0] + ' with grade ' + dftmp['Grade'].astype(str).iloc[-1]
    # we return a pandas series with the results
    return pd.Series(
        {
            'Count': count,
            'Mean': mean,
            'Std': std,
            'Some stupid things': mixing_twocols
        }
    )
result3 = df.groupby('Dept').apply(treatment)
print(result3)          #