Temp Expt :
I'm new to Python and pandas. I have a sample dataset like this:
PRODUCT REGION COUNTRY MEASURE Month_ID QTY
P1 West UK M1 Mon_1 200
P1 West UK M2 Mon_1 150
P1 East JAPAN M1 Mon_1 100
P1 East JAPAN M2 Mon_1 100
P1 West UK M1 Mon_2 300
P1 West UK M2 Mon_2 450
P1 East JAPAN M1 Mon_2 500
P1 East JAPAN M2 Mon_2 600
I want the data as below:
PRODUCT REGION COUNTRY MEASURE Month_ID QTY
P1 West UK M1 Mon_1 200
P1 West UK M2 Mon_1 150
P1 West UK NEW_M Mon_1 350
P1 East JAPAN M1 Mon_1 100
P1 East JAPAN M2 Mon_1 100
P1 East JAPAN NEW_M Mon_1 200
P1 West UK M1 Mon_2 300
P1 West UK M2 Mon_2 450
P1 West UK NEW_M Mon_2 750
P1 East JAPAN M1 Mon_2 500
P1 East JAPAN M2 Mon_2 600
P1 East JAPAN NEW_M Mon_2 1100
I want to group by the columns (PRODUCT, REGION, COUNTRY, Month_ID) with SUM(QTY), and add a new row after each group with the MEASURE column set to NEW_M.
jezrael :
You can create a new DataFrame by aggregating with sum. To get the sorting right, assign it the index of the last duplicated row of each group with DataFrame.set_index, so that after concat, DataFrame.sort_index places each new row directly after its group:
cols = ['PRODUCT', 'REGION', 'COUNTRY', 'Month_ID']
# index labels of the duplicated (here: last) row of each group
idx = df.index[df.duplicated(cols)]
df1 = (df.groupby(cols, as_index=False, sort=False)['QTY']
         .sum()
         .assign(MEASURE='NEW_M')
         .set_index(idx))
# stable sort keeps each original row ahead of the new row sharing its index
df = (pd.concat([df, df1], sort=False)
        .sort_index(kind='mergesort')
        .reset_index(drop=True))
print(df)
PRODUCT REGION COUNTRY MEASURE Month_ID QTY
0 P1 West UK M1 Mon_1 200
1 P1 West UK M2 Mon_1 150
2 P1 West UK NEW_M Mon_1 350
3 P1 East JAPAN M1 Mon_1 100
4 P1 East JAPAN M2 Mon_1 100
5 P1 East JAPAN NEW_M Mon_1 200
6 P1 West UK M1 Mon_2 300
7 P1 West UK M2 Mon_2 450
8 P1 West UK NEW_M Mon_2 750
9 P1 East JAPAN M1 Mon_2 500
10 P1 East JAPAN M2 Mon_2 600
11 P1 East JAPAN NEW_M Mon_2 1100
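The index trick above works for the sample because every group has exactly two rows, so df.duplicated(cols) flags one row per group. As a rough sketch for groups of any size, the position of the last row of each group can be aggregated alongside the sum. The function name add_group_totals, the helper column _pos, and the parameter names are made up for illustration, and named aggregation assumes pandas 0.25 or newer:

```python
import pandas as pd

def add_group_totals(df, cols, value='QTY', label_col='MEASURE', label='NEW_M'):
    """Append one total row after the last row of each group (hypothetical helper)."""
    tmp = df.reset_index(drop=True)
    totals = (tmp.assign(_pos=range(len(tmp)))          # remember row positions
                 .groupby(cols, as_index=False, sort=False)
                 .agg(**{value: (value, 'sum'),         # group total
                         '_pos': ('_pos', 'max')})      # position of the group's last row
                 .assign(**{label_col: label})
                 .set_index('_pos'))
    # stable sort keeps the original row ahead of the total sharing its index
    return (pd.concat([tmp, totals], sort=False)
              .sort_index(kind='mergesort')
              .reset_index(drop=True))

# sample data from the question
sample = pd.DataFrame({
    'PRODUCT': ['P1'] * 8,
    'REGION': ['West', 'West', 'East', 'East'] * 2,
    'COUNTRY': ['UK', 'UK', 'JAPAN', 'JAPAN'] * 2,
    'MEASURE': ['M1', 'M2'] * 4,
    'Month_ID': ['Mon_1'] * 4 + ['Mon_2'] * 4,
    'QTY': [200, 150, 100, 100, 300, 450, 500, 600],
})

out = add_group_totals(sample, ['PRODUCT', 'REGION', 'COUNTRY', 'Month_ID'])
print(out)
```

On the sample data this should give the same 12 rows as the output above; with three or more rows per group the total lands after the group's last row.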
EDIT: To subtract instead, a small trick is used: the QTY values in rows where MEASURE is M2 are multiplied by -1, so the aggregated sum gives the difference M1 - M2:
# if only `M1` and `M2` rows are needed
df = df[df['MEASURE'].isin(['M1', 'M2'])]
cols = ['PRODUCT', 'REGION', 'COUNTRY', 'Month_ID']
idx = df.index[df.duplicated(cols)]
# negate QTY in the M2 rows so that the sum becomes M1 - M2
df1 = (df.assign(QTY=df['QTY'].mask(df['MEASURE'].eq('M2'), df['QTY'] * -1))
         .groupby(cols, as_index=False, sort=False)['QTY']
         .sum()
         .assign(MEASURE='NEW_M')
         .set_index(idx))
df2 = (pd.concat([df, df1], sort=False)
         .sort_index(kind='mergesort')
         .reset_index(drop=True))
print(df2)
PRODUCT REGION COUNTRY MEASURE Month_ID QTY
0 P1 West UK M1 Mon_1 200
1 P1 West UK M2 Mon_1 150
2 P1 West UK NEW_M Mon_1 50
3 P1 East JAPAN M1 Mon_1 100
4 P1 East JAPAN M2 Mon_1 100
5 P1 East JAPAN NEW_M Mon_1 0
6 P1 West UK M1 Mon_2 300
7 P1 West UK M2 Mon_2 450
8 P1 West UK NEW_M Mon_2 -150
9 P1 East JAPAN M1 Mon_2 500
10 P1 East JAPAN M2 Mon_2 600
11 P1 East JAPAN NEW_M Mon_2 -100
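To see the sign trick in isolation, here is a minimal sketch that maps each MEASURE to a sign and sums the signed quantities per group. It is only a cross-check of the differences shown above, not a replacement for the answer's code; the names signs, signed and diff are made up for illustration:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'PRODUCT': ['P1'] * 8,
    'REGION': ['West', 'West', 'East', 'East'] * 2,
    'COUNTRY': ['UK', 'UK', 'JAPAN', 'JAPAN'] * 2,
    'MEASURE': ['M1', 'M2'] * 4,
    'Month_ID': ['Mon_1'] * 4 + ['Mon_2'] * 4,
    'QTY': [200, 150, 100, 100, 300, 450, 500, 600],
})
cols = ['PRODUCT', 'REGION', 'COUNTRY', 'Month_ID']

# +1 for M1 and -1 for M2, so the signed sum per group is M1 - M2
signs = df['MEASURE'].map({'M1': 1, 'M2': -1})
signed = df['QTY'] * signs

# group the signed values by the key columns, keeping first-appearance order
diff = signed.groupby([df[c] for c in cols], sort=False).sum()
print(diff)
```

A mapping like this extends to more measures (for example a hypothetical M3 with its own sign) without chaining several mask calls.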