Question:

Look up values by corresponding column header in pandas 1.2.0 or newer

朱华皓

2023-03-14

This post attempts to serve as a canonical resource for looking up corresponding row-column pairs in pandas 1.2.0 and newer.

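Some previous answers to this type of question (now outdated):

Vectorized lookup on a DataFrame

Some current answers to this question:

Reference DataFrame value corresponding to column header
Pandas/Python: How to create a new column based on values from other columns and apply an extra condition to this new column

Given the following DataFrame: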

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
  Col  A  B
0   B  1  5
1   A  2  6
2   A  3  7
3   B  4  8

I want to be able to look up the corresponding value in the column specified in Col.

I would like my result to look like this:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8

Given the following DataFrame with a non-contiguous index:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]}, 
                  index=[0, 2, 8, 9])

  Col  A  B
0   B  1  5
2   A  2  6
8   A  3  7
9   B  4  8

I would like to preserve the index but still find the correct corresponding value:

  Col  A  B  Val
0   B  1  5    5
2   A  2  6    2
8   A  3  7    3
9   B  4  8    8

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

    Col  A  B
C E   B  1  5
  F   A  2  6
D E   A  3  7
  F   B  4  8

With a MultiIndex, I would likewise like to preserve the index but still find the correct corresponding value:

    Col  A  B  Val
C E   B  1  5    5
  F   A  2  6    2
D E   A  3  7    3
  F   B  4  8    8

Given the following DataFrame:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

  Col  A  B
0   B  1  5
1   A  2  6
2   A  3  7
3   C  4  8  # Column C does not correspond with any column

I would like to look up the corresponding value if one exists, and otherwise default it to 0:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   C  4  8    0  # Default value 0 since C does not correspond

Given the following DataFrame:

   Col  A  B
0    B  1  5
1    A  2  6
2    A  3  7
3  NaN  4  8  # <- Missing Lookup Key

I would like any NaN values in Col to result in a NaN value in Val:

   Col  A  B  Val
0    B  1  5  5.0
1    A  2  6  2.0
2    A  3  7  3.0
3  NaN  4  8  NaN  # NaN to indicate missing

3 answers

闾丘树

2023-03-14

Another option is to build tuples from the lookup column, pivot the DataFrame, and select the relevant columns with those tuples:

cols = [(ent, ent) for ent in df.Col.unique()]

df.assign(Val = df.pivot(index = None, columns = 'Col')
                  .reindex(columns = cols)
                  .ffill(axis=1)
                  .iloc[:, -1])

  Col  A  B  Val
0   B  1  5  5.0
2   A  2  6  2.0
8   A  3  7  3.0
9   B  4  8  8.0

何玺

2023-03-14

There are 2 other approaches for performing this operation:

apply can be used with axis=1 in order to use the column value as the key:

import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

df['Val'] = df.apply(lambda row: row[row['Col']], axis=1)

df

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8

This operation works regardless of index type:

import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

#   Col  A  B
# 0   B  1  5
# 2   A  2  6
# 8   A  3  7
# 9   B  4  8

df['Val'] = df.apply(lambda row: row[row['Col']], axis=1)

df:

  Col  A  B  Val
0   B  1  5    5
2   A  2  6    2
8   A  3  7    3
9   B  4  8    8

When dealing with missing or non-corresponding values, Series.get can be used to handle the issue:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'C', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

#    Col  A  B
# 0    B  1  5
# 1    A  2  6
# 2    C  3  7 <- Non Corresponding
# 3  NaN  4  8 <- Missing

df['Val'] = df.apply(lambda row: row.get(row['Col']), axis=1)

   Col  A  B  Val
0    B  1  5  5.0
1    A  2  6  2.0
2    C  3  7  NaN  # Missing value
3  NaN  4  8  NaN  # Missing value

With a default value:

df['Val'] = df.apply(lambda row: row.get(row['Col'], default=-1), axis=1)

   Col  A  B  Val
0    B  1  5    5
1    A  2  6    2
2    C  3  7   -1  # Default -1
3  NaN  4  8   -1  # Default -1

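apply is very flexible and easy to modify; however, the general iterative approach, together with all the individual Series lookups, can become very expensive on large DataFrames.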

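Index.get_indexer can be used to convert the column into column positions for the DataFrame. This means there is no reason to reindex the DataFrame, since the indexer corresponds to the DataFrame in its entirety.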

import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]

df

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8
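As a rough way to check that the two approaches in this answer agree and to time them on more rows, the sketch below builds a hypothetical 10,000-row frame. The names big, via_apply and via_get_indexer are illustrative only, and np.arange is used for the row positions instead of df.index so that the snippet does not depend on the index type. No timings are quoted because they vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame, generated only for this comparison
rng = np.random.default_rng(0)
n = 10_000
big = pd.DataFrame({'Col': rng.choice(['A', 'B'], size=n),
                    'A': rng.integers(0, 100, size=n),
                    'B': rng.integers(0, 100, size=n)})


def via_apply(frame):
    # One Python-level call and one Series lookup per row
    return frame.apply(lambda row: row[row['Col']], axis=1).to_numpy()


def via_get_indexer(frame):
    # A single fancy-indexing operation on the converted array
    rows = np.arange(len(frame))
    return frame.to_numpy()[rows, frame.columns.get_indexer(frame['Col'])]


# Both approaches should return the same values, row for row
same = all(a == b for a, b in zip(via_apply(big), via_get_indexer(big)))
print(same)

# Single-run timings in seconds
print(timeit.timeit(lambda: via_apply(big), number=1))
print(timeit.timeit(lambda: via_get_indexer(big), number=1))
```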

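This approach is reasonably fast; however, missing values are represented by -1, which means that if a key has no matching column, the value is taken from column -1 (the last column in the DataFrame).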

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'Col': ['B', 'A', 'A', 'C']})

#    A  B Col <- Col is now the Last Col
# 0  1  5   B
# 1  2  6   A
# 2  3  7   A
# 3  4  8   C <- Notice Col `C` does not correspond to a Valid Column Header
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]

df:

   A  B Col Val
0  1  5   B   5
1  2  6   A   2
2  3  7   A   3
3  4  8   C   C  # <- Value from the last column in the DataFrame (index -1)
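One possible guard, sketched here on the assumption that unmatched keys should become NaN, is to mask the rows where get_indexer returned -1. The names col_pos and result are illustrative, and pd.to_numeric is used only to turn the object-typed result back into numbers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'Col': ['B', 'A', 'A', 'C']})

# Column position of each key; -1 marks a key with no matching column
col_pos = df.columns.get_indexer(df['Col'])
values = df.to_numpy()[np.arange(len(df)), col_pos]

# Replace whatever was pulled from the last column with NaN
masked = np.where(col_pos == -1, np.nan, values)
result = pd.to_numeric(pd.Series(masked, index=df.index))

df['Val'] = result
```

With this mask the last row holds NaN instead of the stray 'C' taken from the Col column.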

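It is also worth noting that not reindexing the DataFrame means converting the entire DataFrame to NumPy. This can be very expensive if there are many unrelated columns that all need to be converted: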

import numpy as np
import pandas as pd

df = pd.DataFrame({1: 10,
                   2: 20,
                   3: 't',
                   4: 40,
                   5: np.nan,
                   'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]

df.to_numpy()

[[10 20 't' 40 nan 'B' 1 5 5]
 [10 20 't' 40 nan 'A' 2 6 2]
 [10 20 't' 40 nan 'A' 3 7 3]
 [10 20 't' 40 nan 'B' 4 8 8]]

Compare this with the reindexing approach, which only contains the columns relevant to the values in Col:

df.reindex(columns=['B', 'A']).to_numpy()
[[5 1]
 [6 2]
 [7 3]
 [8 4]]
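To make the difference concrete, the following sketch compares the arrays produced by the two conversions. The Note column is a hypothetical stand-in for unrelated data; the point is only that one string column forces the whole array to object dtype, while the reindexed subset stays numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Note': 't',  # hypothetical unrelated column
                   'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

whole = df.to_numpy()                               # converts every column
subset = df.reindex(columns=['B', 'A']).to_numpy()  # converts only the lookup targets

print(whole.shape, whole.dtype)    # all 4 columns, object dtype
print(subset.shape, subset.dtype)  # 2 columns, integer dtype
```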

夏侯瑞

2023-03-14

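The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize and reindex as the replacement for the deprecated DataFrame.lookup.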

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df

  Col  A  B  Val
0   B  1  5    5
2   A  2  6    2
8   A  3  7    3
9   B  4  8    8

factorize is used to convert the column, encoding its values as an "enumerated type".

idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')

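Notice that B corresponds to 0 and A corresponds to 1. reindex is used to ensure that the columns appear in the same order as the enumeration: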

df.reindex(columns=col)

   B  A  # B appears first (location 0), A appears second (location 1)
0  5  1
2  6  2
8  7  3
9  8  4

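We need to create an appropriate range indexer that is compatible with NumPy indexing.

The standard approach is to use np.arange based on the length of the DataFrame: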

np.arange(len(df))

[0 1 2 3]

Now NumPy indexing can be used to select the values from the DataFrame:

df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

[5 2 3 8]

*Note: This approach always works, regardless of the type of index.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

    Col  A  B  Val
C E   B  1  5    5
  F   A  2  6    2
D E   A  3  7    3
  F   B  4  8    8
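The factorize, reindex and np.arange steps walked through above can be wrapped in a small function. This is only a sketch: lookup_by_column is a hypothetical helper name, not a pandas API, and it does not handle NaN keys in the lookup column.

```python
import numpy as np
import pandas as pd


def lookup_by_column(df, key_col, fill_value=np.nan):
    """Hypothetical helper: for each row, return the value found in the
    column named by df[key_col]. Keys without a matching column yield
    fill_value. NaN keys are NOT handled here."""
    idx, cols = pd.factorize(df[key_col])           # codes and unique keys
    values = df.reindex(columns=cols, fill_value=fill_value).to_numpy()
    return values[np.arange(len(df)), idx]          # positional row/column pairs


df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

df['Val'] = lookup_by_column(df, 'Col')  # Val -> [5, 2, 3, 8]
```

Because the rows are addressed by position, the same call works unchanged for a default, non-contiguous or MultiIndex frame.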

import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

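Only in this case (a default RangeIndex) is there no error when indexing with df.index instead of np.arange, because the result of np.arange is the same as df.index.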

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8

With a non-contiguous index, the same indexing raises an IndexError:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

IndexError: index 8 is out of bounds for axis 0 with size 4

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

Raises an IndexError:

df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
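The difference between index labels and row positions can be seen side by side in a short sketch. The variable names arr, raised and vals are illustrative, and the try/except is there only so that the snippet runs to completion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
arr = df.reindex(columns=col).to_numpy()  # shape (4, 2): valid row positions are 0..3

try:
    arr[df.index, idx]        # labels 8 and 9 are not valid row positions
    raised = False
except IndexError:
    raised = True

vals = arr[np.arange(len(df)), idx]  # positions 0..3 always fit the array
print(raised, vals)
```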

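There are a few approaches for handling values in Col that do not correspond to any column.

First, let's look at what happens by default if there is a non-corresponding value: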

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
#   Col  A  B
# 0   B  1  5
# 1   A  2  6
# 2   A  3  7
# 3   C  4  8

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

  Col  A  B  Val
0   B  1  5  5.0
1   A  2  6  2.0
2   A  3  7  3.0
3   C  4  8  NaN  # NaN Represents the Missing Value in C

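If we look at why the NaN values are introduced, we find that when factorize goes through the column it enumerates every group present, regardless of whether it corresponds to a column or not.

For this reason, when we reindex the DataFrame we end up with the following result: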

idx, col = pd.factorize(df['Col'])
df.reindex(columns=col)

idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col)
   B  A   C
0  5  1 NaN
1  6  2 NaN
2  7  3 NaN
3  8  4 NaN  # Reindex adds the missing column with the Default `NaN`

If we want to specify a default value, we can use the fill_value argument of reindex, which lets us modify the behaviour for missing column values:

idx, col = pd.factorize(df['Col'])
df.reindex(columns=col, fill_value=0)

idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col, fill_value=0)
   B  A  C
0  5  1  0
1  6  2  0
2  7  3  0
3  8  4  0  # Notice reindex adds missing column with specified value `0`

This means that we can do:

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
    columns=col, 
    fill_value=0  # Default value for Missing column values
).to_numpy()[np.arange(len(df)), idx]

df:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   C  4  8    0

*Notice that the dtype of the column is int, since NaN was never introduced and, therefore, the column type was not changed.
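The dtype effect can be checked directly. The sketch below runs the same lookup with and without fill_value; the names rows, with_nan and with_zero are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])
rows = np.arange(len(df))

# Default reindex: the missing column C is filled with NaN, forcing floats
with_nan = df.reindex(columns=col).to_numpy()[rows, idx]

# fill_value=0: the missing column C is filled with 0, so integers survive
with_zero = df.reindex(columns=col, fill_value=0).to_numpy()[rows, idx]

print(with_nan.dtype, with_zero.dtype)
```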

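factorize has a default of na_sentinel=-1, meaning that when NaN values appear in the column being factorized, the resulting idx value is -1. (In pandas 2.0 and later the na_sentinel parameter was replaced by use_na_sentinel, but NaN is still encoded as -1 by default.)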

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
#    Col  A  B
# 0    B  1  5
# 1    A  2  6
# 2    A  3  7
# 3  NaN  4  8  # <- Missing Lookup Key

idx, col = pd.factorize(df['Col'])
# idx = array([ 0,  1,  1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
#    Col  A  B  Val
# 0    B  1  5    5
# 1    A  2  6    2
# 2    A  3  7    3
# 3  NaN  4  8    4 <- Value From A

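This -1 means that, by default, we'll be pulling from the last column when we reindex. Notice that col still only contains the values B and A, so we end up with the value from A in Val for the last row.

The easiest way to handle this is to fillna Col with some value that cannot be found in the column headers.

Here I use the empty string '':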

idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')

Now, when I reindex, the '' column will contain NaN values, meaning that the lookup produces the desired result:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df:

   Col  A  B  Val
0    B  1  5  5.0
1    A  2  6  2.0
2    A  3  7  3.0
3  NaN  4  8  NaN  # Missing as expected
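A variation on the same idea, offered only as a sketch and not taken from the answer above, is to leave Col untouched and mask the rows where factorize returned -1. It avoids having to pick a placeholder string that is guaranteed not to be a column header.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])  # NaN keys are encoded as -1
values = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

# Rows with idx == -1 pulled from the last column; overwrite them with NaN
df['Val'] = np.where(idx == -1, np.nan, values)
```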
