Question:

Why is DataFrame.loc[[1]] 1,800x slower than df.ix[[1]] and 3,500x slower than df.loc[1]?

慎俊艾

2023-03-14

Try this yourself:

import pandas as pd
s=pd.Series(xrange(5000000))
%timeit s.loc[[0]] # You need pandas 0.15.1 or newer for it to be that slow
1 loops, best of 3: 445 ms per loop

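Update: this is a legitimate bug in pandas, probably introduced around August 2014 in version 0.15.1. Workarounds: stay on an older version of pandas until a new release ships; get the latest development version from GitHub; manually make a one-line modification in your installed release of pandas; or temporarily use .ix instead of .loc.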
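One more possible workaround, not part of the original list and offered only as a sketch: translate the label to a position with `Index.get_loc`, then select by position with `.iloc`. The helper name `loc_single_row` and the toy frame are made up for illustration, the code assumes a unique index (so `get_loc` returns a single integer), and it is written for Python 3 and has not been benchmarked against 0.15.1.

```python
import pandas as pd

# Toy frame with an integer index that has gaps, like the one in the question.
df = pd.DataFrame({"value": [10.0, 20.0, 30.0]}, index=[1, 2, 5])


def loc_single_row(frame, label):
    """Return the row for `label` as a one-row DataFrame, like frame.loc[[label]].

    Hypothetical helper; assumes frame.index is unique.
    """
    pos = frame.index.get_loc(label)  # raises KeyError if the label is missing
    return frame.iloc[[pos]]          # positional, list-style selection


row = loc_single_row(df, 5)
```

The result has the same shape and index as `df.loc[[5]]`, so it can be dropped in wherever a one-row frame is expected.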

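I have a DataFrame with 4.8 million rows. Selecting a single row with .loc[[id]] (a single-element list) takes 489 ms, almost half a second. That is 1,800x slower than the identical .ix[[id]] and 3,500x slower than .loc[id] (passing the id as a value rather than as a list). To be fair, .loc[list] takes about the same time regardless of the length of the list, but I don't want to spend 489 ms on it, especially when .ix is over a thousand times faster and produces an identical result. My understanding was that .ix is supposed to be the slower one, isn't it?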

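I am using pandas 0.15.1. The excellent tutorial on Indexing and Selecting Data suggests that .ix is somewhat more general, and presumably slower, than .loc and .iloc. Specifically, it says: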

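However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.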
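To make the label versus position distinction in that quote concrete, here is a small sketch on a toy Series whose integer labels do not match the row positions. The data and variable names are invented for illustration, and the code is written for Python 3.

```python
import pandas as pd

# Integer labels 10, 20, 30 sit at positions 0, 1, 2.
s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

by_label = s.loc[20]     # label-based: the row whose label is 20
by_position = s.iloc[1]  # position-based: the second row

# The label 1 does not exist, so .loc raises KeyError.
try:
    s.loc[1]
    label_1_found = True
except KeyError:
    label_1_found = False

# There is no 21st row, so .iloc raises IndexError.
try:
    s.iloc[20]
    position_20_found = True
except IndexError:
    position_20_found = False
```

Both lookups return the same element here, but through different routes, which is why the check near the end of the benchmark session expects .iloc to fail for an id larger than the number of rows.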

Here is an IPython session with the benchmarks:

    print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
    print 'df.index begins with ', df.index[:20]
    print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())

    # First extract one element directly. Expected result, no issues here.
    id=5965356
    print 'Extract one element with id %d' % id
    %timeit df.loc[id]
    %timeit df.ix[id]
    print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result

    # Now extract this one element as a list.
    %timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
    %timeit df.ix[[id]] 
    print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]]))  # this one should be True
    # Let's double-check that in this case .ix is the same as .loc, not .iloc, 
    # as this would explain the difference.
    try:
        print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
    except:
        print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))

    # Finally, for the sake of completeness, let's take a look at iloc
    %timeit df.iloc[3456789]    # this is still 100+ times faster than the next version
    %timeit df.iloc[[3456789]]

Output:

The dataframe has 4826616 entries, indexed by integers that are less than 6177817
df.index begins with  Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
The index is sorted: True
Extract one element with id 5965356
10000 loops, best of 3: 139 µs per loop
10000 loops, best of 3: 141 µs per loop
True
1 loops, best of 3: 489 ms per loop
1000 loops, best of 3: 270 µs per loop
True
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows
10000 loops, best of 3: 98.9 µs per loop
100 loops, best of 3: 12 ms per loop

2 answers

吴飞语

2023-03-14

Pandas indexing was far too slow for me, so I switched to NumPy indexing:

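import numpy as np
import pandas as pd

df = pd.DataFrame(some_content)  # some_content: placeholder for your own data

# takes forever!! every .iloc call goes through the pandas indexing layer
for iPer in np.arange(-df.shape[0], 0, 1):
    x = df.iloc[iPer, :].values
    y = df.iloc[-1, :].values

# fast! pull out the underlying values once, then index the matrix directly
vals = np.matrix(df.values)
for iPer in np.arange(-vals.shape[0], 0, 1):
    x = vals[iPer, :]
    y = vals[-1, :]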

柴丰

2023-03-14

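It looks like the issue was not present in pandas 0.14. I profiled it with line_profiler and I think I know what happened. Since pandas 0.15.1, a KeyError is raised if a given index is not present. When you use the .loc[list] syntax, it appears to do an exhaustive search for the index along the entire axis, even after the index has been found. In other words: first, there is no early termination when an element is found, and second, the search in this case is brute-force.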

/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py

Line #      Hits         Time  Per Hit   % Time  Line Contents
  1278                                                       # require at least 1 element in the index
  1279         1          241    241.0      0.1              idx = _ensure_index(key)
  1280         1       391040 391040.0     99.9              if len(idx) and not idx.isin(ax).any():
  1281                                           
  1282                                                           raise KeyError("None of [%s] are in the [%s]" %
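The two problems named in this answer (no early termination, and a brute-force scan) can be illustrated in plain Python. This is only a conceptual sketch, not the pandas implementation: the function names are invented, the "axis" is a simple range of integer labels, and the code is written for Python 3. It counts how many axis labels each strategy has to inspect.

```python
def exhaustive_membership(keys, axis):
    """Scan the whole axis and never stop early, even after a hit."""
    wanted = set(keys)
    inspected = 0
    found = False
    for label in axis:
        inspected += 1
        if label in wanted:
            found = True  # keep going: no early termination
    return found, inspected


def early_exit_membership(keys, axis):
    """Same scan, but stop at the first hit."""
    wanted = set(keys)
    inspected = 0
    for label in axis:
        inspected += 1
        if label in wanted:
            return True, inspected
    return False, inspected


def hashed_membership(keys, lookup):
    """Ask a prebuilt hash set; the cost does not grow with the axis length."""
    return any(key in lookup for key in keys)


axis = range(1, 100001)  # labels 1..100000
lookup = set(axis)       # built once and reused for every query

exhaustive = exhaustive_membership([5], axis)
early = early_exit_membership([5], axis)
hashed = hashed_membership([5], lookup)
```

For a key near the start of the axis, the exhaustive scan still inspects all 100,000 labels while the early-exit scan inspects 5. The hashed lookup avoids scanning altogether, which is roughly why a scalar lookup such as .loc[id] can stay in the microsecond range.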
