Reversed cumulative sum of a column in pandas.DataFrame

谷光誉

2023-03-14

问题内容：

I’ve got a pandas DataFrame with a boolean column sorted by another column and
need to calculate reverse cumulative sum of the boolean column, that is,
amount of true values from current row to bottom.

Example

In [13]: df = pd.DataFrame({'A': [True] * 3 + [False] * 5, 'B': np.random.rand(8) })

In [15]: df = df.sort_values('B')

In [16]: df
Out[16]:
       A         B
6  False  0.037710
2   True  0.315414
4  False  0.332480
7  False  0.445505
3  False  0.580156
1   True  0.741551
5  False  0.796944
0   True  0.817563

I need something that will give me a new column with values

That is, for each row it should contain amount of True values on this row and
rows below.

I’ve tried various methods using .iloc[::-1] but result is not that is
desired.

It looks like I’m missing some obvious bit of information. I’ve starting using
Pandas only yesterday.

问题答案：

Reverse column A, take the cumsum, then reverse again:

df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]

import pandas as pd
df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True],
     'B': [0.03771, 0.315414, 0.33248, 0.445505, 0.580156, 0.741551, 0.796944, 0.817563],},
     index=[6, 2, 4, 7, 3, 1, 5, 0])
df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]
print(df)

yields

       A         B  C
6  False  0.037710  3
2   True  0.315414  3
4  False  0.332480  2
7  False  0.445505  2
3  False  0.580156  2
1   True  0.741551  2
5  False  0.796944  1
0   True  0.817563  1

Alternatively, you could count the number of Trues in column A and
subtract the (shifted) cumsum:

In [113]: df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
Out[113]: 
6    3
2    3
4    2
7    2
3    2
1    2
5    1
0    1
Name: A, dtype: object

But this is significantly slower. Using IPython to
perform the benchmark:

In [116]: df = pd.DataFrame({'A':np.random.randint(2, size=10**5).astype(bool)})

In [117]: %timeit df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
10 loops, best of 3: 19.8 ms per loop

In [118]: %timeit df.loc[::-1, 'A'].cumsum()[::-1]
1000 loops, best of 3: 701 µs per loop

Reversed cumulative sum of a column in pandas.DataFrame

相关阅读

相关文章

相关问答