当前位置: 首页 > 面试题库 >

如何对pandas进行多索引分组?

子车高歌
2023-03-14
问题内容

以下是我的数据框。我进行了一些转换以创建类别列,并删除了其所属的原始列。现在,我需要进行分组,以除去公母,Love并且Fashion可以通过groupby总和来汇总。

df.colunms = array([category, clicks, revenue, date, impressions, size], dtype=object)
df.values=
[[Love 0 0.36823 2013-11-04 380 300x250]
 [Love 183 474.81522 2013-11-04 374242 300x250]
 [Fashion 0 0.19434 2013-11-04 197 300x250]
 [Fashion 9 18.26422 2013-11-04 13363 300x250]]

这是我创建数据框时创建的索引

print df.index
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])

我假设我想删除索引,并创建日期和类别,multiindex然后groupby对指标进行求和。如何在熊猫数据框中执行此操作?

df.head(15).to_dict()= {'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}}

在Ubuntu 12.04上,Python为2.7,熊猫为0.7.0。下面是我运行以下命令时遇到的错误

import pandas
print pandas.__version__
df = pandas.DataFrame.from_dict(
    {
     'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 
     'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 
     'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
    }
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()


Traceback (most recent call last):
  File "/home/ubuntu/workspace/devops/reports/groupby_sub.py", line 9, in <module>
    df.set_index(['date', 'category'], inplace=True)
  File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1927, in set_index
    raise Exception('Index has duplicate keys: %s' % duplicates)
Exception: Index has duplicate keys: [('2013-11-04', 'Celebs'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies')]

问题答案:

您可以在现有数据框上创建索引。使用提供的数据子集,这对我有用:

import pandas
df = pandas.DataFrame.from_dict(
    {
     'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 
     'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 
     'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
    }
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()

如果整个数据集存在重复索引问题,则需要稍微清理一下数据。如果可以,请删除重复的行。如果重复的行有效,那么是什么使它们彼此分开?如果可以将其添加到数据框并将其包含在索引中,那是理想的选择。如果不是,只需创建一个默认列为1的虚拟列,但是N如果有N重复则可以为2或3或…-然后将该字段也包含在索引中。

另外,我很确定您可以跳过索引创建并直接groupby使用列:

df.groupby(by=['date', 'category']).sum()

同样,这适用于您发布的数据子集。



 类似资料:
  • 问题内容: 我想对pandas进行一次透视,索引是两列,而不是一列。例如,一个字段用于年份,一个字段用于月份,一个“ item”字段显示“ item 1”和“ item 2”,以及一个“ value”字段和数值。我希望索引为年+月。 我设法做到这一点的唯一方法是将两个字段合并为一个,然后再次将其分开。有没有更好的办法? 最少的代码复制到下面。非常感谢! PS:是的,我知道关键字“ pivot”和“

  • 本文向大家介绍对Pandas MultiIndex(多重索引)详解,包括了对Pandas MultiIndex(多重索引)详解的使用技巧和注意事项,需要的朋友参考一下 创建多重索引 获得索引信息 get_level_values 基本索引 使用reindex对齐数据 数据准备 s序列加(0~-2)索引的值,因为s[:-2]没有最后两个的索引,所以为NaN.s[::2]意思是步长为1. 以上这篇对P

  • 有两个问题看起来很相似,但它们不是同一个问题:这里和这里。它们都调用的方法,例如或,我知道这会返回一个。我要问的是如何将(class)对象本身转换为。我将在下面举例说明。 构建一个示例,如下所示。 上面的应该如下所示(显然有不同的数字)。 我想做的是按列名称和采取分组(按此顺序),这样我就可以得到一个由列名称和采取构建的多索引索引,如下所示。 我如何实现这一点?如果我做了,那么是一个实例。正确的做

  • 现在,我要检索一个值: Q1:在[3.3,6.6]范围内-预期返回值:[3.3,5.5,6.6]或[3.3,3.3,5.5,6.6](包括最后一个),如果没有,则为[3.3,5.5]或[3.3,3.3,5.5]。 Q2:在[2.0,4.0]范围内-预期返回值:[3.3]或[3.3,3.3] 对于任何其他多索引维度都是相同的,例如B值: Q3:在范围[111,500]中有重复,作为范围中的数据行数-

  • 主要内容:创建分层索引,应用分层索引,分层索引切片取值,聚合函数应用,局部索引,行索引层转换为列索引,列索引实现分层,交换层和层排序分层索引(Multiple Index)是 Pandas 中非常重要的索引类型,它指的是在一个轴上拥有多个(即两个以上)索引层数,这使得我们可以用低维度的结构来处理更高维的数据。比如,当想要处理三维及以上的高维数据时,就需要用到分层索引。 分层索引的目的是用低维度的结构(Series 或者 DataFrame)更好地处理高维数据。通过分层索引,我们可以像处理二维数据

  • 问题内容: 使用以下DataFrame,如何在不让Pandas将移位后的值分配给其他索引值的情况下基于索引来移位“ beyer”列? 生产… 问题在于,佩恩特得到了分配最后一位战士的啤酒。相反,我希望它像这样… 问题答案: 使用应用转移到各组分别:(感谢Jeff指出这个简化) 如果您有一个多索引,则可以通过将一系列或级别名称传递给 参数来对多个级别进行分组。