pandas: Why is pandas.Series.std() different from numpy.std()?

温镜

2023-03-14

Question:

Another update: solved (see the comments and my own answer).

Update: this is what I want to have explained.

>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575

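Answer: this is explained by Bessel's correction. pandas divides by N-1 in the standard deviation formula, while numpy divides by N. I wish pandas used the same convention as numpy.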
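To make the difference concrete, here is a minimal sketch that recomputes both numbers by hand with only the standard library. The variable names (`prices`, `population_std`, `sample_std`) are chosen for illustration and do not come from the original post.

```python
import math

prices = [7, 20, 22, 22]
n = len(prices)
mean = sum(prices) / n

# Sum of squared deviations from the mean, shared by both formulas.
sum_sq = sum((p - mean) ** 2 for p in prices)

# Denominator N: what np.std returns by default (ddof=0).
population_std = math.sqrt(sum_sq / n)

# Denominator N-1 (Bessel's correction): what pandas returns by default (ddof=1).
sample_std = math.sqrt(sum_sq / (n - 1))

print(population_std)
print(sample_std)
```

The two results differ only by the factor sqrt(N / (N-1)), which is why the gap is so visible for a group of just four values.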

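There is a related discussion here, but none of its suggestions worked for me.

I have data on many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):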

>>> df
    restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

Problem: r.mi.groupby('restaurant_id')['price'].mean() returns the mean price for each restaurant. I want the standard deviation. However, r.mi.groupby('restaurant_id')['price'].std() returns the wrong values.

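As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure: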

>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575

We can get the same (correct) values with

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Of course, disregard the mean restaurant ID.) Obviously, np.std(df) is not a solution when I have more than one restaurant, so I am using groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What?! 7.228416 is not 6.259992.

Let's try again.

>>> df.groupby('restaurant_id').std()

Same result.

>>> df.groupby('restaurant_id')['price'].std()

Same result.

>>> df.groupby('restaurant_id').apply(lambda x: x.std())

Same result.

However, this works:

for restaurant_id, group in df.groupby('restaurant_id'):
    # np.std divides by N (ddof=0), so this matches the numpy value above
    print(restaurant_id, np.std(group['price']))

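Question: is there a proper way to aggregate the dataframe so that I get a new series with the standard deviation for each restaurant?

Answer:

I see. pandas uses Bessel's correction by default, that is, the standard deviation formula uses N-1 rather than N in the denominator. The ddof argument controls this, as behzad.nouri pointed out in the comments: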

pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
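The same ddof argument can be passed to the groupby aggregation, which gives the per-restaurant result the question asks for. The following is a sketch that rebuilds the example dataframe from the question. The names `per_restaurant` and `via_agg` are illustrative, and it assumes a pandas version whose grouped std() accepts ddof.

```python
import numpy as np
import pandas as pd

# Rebuild the single-restaurant example from the question.
df = pd.DataFrame(
    {"restaurant_id": [10407] * 4, "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# Population standard deviation (denominator N) per restaurant.
per_restaurant = df.groupby("restaurant_id")["price"].std(ddof=0)
print(per_restaurant)

# Equivalent form using agg with an explicit function.
via_agg = df.groupby("restaurant_id")["price"].agg(lambda s: s.std(ddof=0))
print(via_agg)
```

Both results are Series indexed by restaurant_id, so they extend naturally to a dataframe with many restaurants.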
