Factorizing a column of strings in pandas

郗缪文

2023-03-14

Question:

As the title says, I have a large DataFrame, df_original, that looks like this:

        ID    Count   Column 2   Column 3  Column 4
RowX    1      234.     255.       yes.      452
RowY    1      123.     135.       no.       342
RowW    1      234.     235.       yes.      645
RowJ    1      123.     115.       no.       342
RowA    1      234.     285.       yes.      233
RowR    1      123.     165.       no.       342
RowX    2      234.     255.       yes.      234
RowY    2      123.     135.       yes.      342
RowW    2      234.     235.       yes.      233
RowJ    2      123.     115.       yes.      342
RowA    2      234.     285.       yes.      312
RowR    2      123.     165.       no.       342
.
.
.
RowX    1233   234.     255.       yes.      133
RowY    1233   123.     135.       no.       342
RowW    1233   234.     235.       no.       253
RowJ    1233   123.     115.       yes.      342
RowA    1233   234.     285.       yes.      645
RowR    1233   123.     165.       no.       342

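I am trying to get rid of the text data and replace it with predefined numeric equivalents. For example, in this case I would like to replace the `yes` and `no` values in Column 3 with `1` and `0`, respectively. Is there a way to do this without manually going in and changing the values?

Answer: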

With `v` as the `Column 3` Series:

v

RowX    yes
RowY     no
RowW    yes
RowJ     no
RowA    yes
RowR     no
RowX    yes
RowY    yes
RowW    yes
RowJ    yes
RowA    yes
RowR     no
Name: Column 3, dtype: object

`pd.factorize`

1 - pd.factorize(v)[0]
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
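
Note that `pd.factorize` numbers labels in order of first appearance, so the `1 - ...` trick above only yields `yes` -> 1 because the first value happens to be `yes`. A minimal sketch of that behaviour, rebuilding the sample values by hand (an assumption for illustration; the real Series would come from `df_original`):

```python
import pandas as pd

# Rebuild the sample values shown above (row labels omitted for brevity)
v = pd.Series(
    ["yes", "no", "yes", "no", "yes", "no",
     "yes", "yes", "yes", "yes", "yes", "no"],
    name="Column 3",
)

codes, uniques = pd.factorize(v)
# Labels are numbered in order of first appearance: "yes" -> 0, "no" -> 1
flipped = 1 - codes  # invert so that "yes" -> 1 and "no" -> 0
```

If the column started with `no`, the codes would already be `no` -> 0 and `yes` -> 1, and the subtraction would flip them the wrong way, so checking `uniques` first is a reasonable precaution.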

`np.where`

np.where(v == 'yes', 1, 0)
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
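
Since `np.where` returns a plain NumPy array, the result can be assigned straight back to the column. A small sketch, using a hypothetical four-row stand-in for `df_original` (the frame and its contents are assumptions, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_original with only two of its columns
df = pd.DataFrame(
    {"ID": [1, 1, 2, 2], "Column 3": ["yes", "no", "yes", "yes"]},
    index=["RowX", "RowY", "RowX", "RowY"],
)

# Anything equal to "yes" becomes 1; every other value becomes 0
df["Column 3"] = np.where(df["Column 3"] == "yes", 1, 0)
```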

`pd.Categorical`/`astype('category')`

pd.Categorical(v).codes
array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0], dtype=int8)



v.astype('category').cat.codes

RowX    1
RowY    0
RowW    1
RowJ    0
RowA    1
RowR    0
RowX    1
RowY    1
RowW    1
RowJ    1
RowA    1
RowR    0
dtype: int8
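
By default the categories are sorted, which is why `no` gets code 0 and `yes` gets code 1 here. To make the mapping explicit rather than rely on sort order, the categories can be passed in by hand. A sketch with made-up sample values (the `maybe` entry is only there to show what happens to unlisted values):

```python
import pandas as pd

v = pd.Series(["yes", "no", "maybe", "yes"])

# Explicit category order: "no" -> 0, "yes" -> 1
cat = pd.Categorical(v, categories=["no", "yes"])
codes = cat.codes  # values outside the listed categories get code -1
```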

`pd.Series.replace`

v.replace({'yes' : 1, 'no' : 0})

RowX    1
RowY    0
RowW    1
RowJ    0
RowA    1
RowR    0
RowX    1
RowY    1
RowW    1
RowJ    1
RowA    1
RowR    0
Name: Column 3, dtype: int64
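
A closely related option, not part of the answer above, is `Series.map` with the same dictionary. The difference is that `replace` leaves unmatched values untouched, while `map` turns them into `NaN`, which can help surface unexpected labels. A sketch with made-up values:

```python
import pandas as pd

v = pd.Series(["yes", "no", "yes", "unknown"])
mapping = {"yes": 1, "no": 0}

# "unknown" is not a key in the dict, so it maps to NaN
mapped = v.map(mapping)
unmatched = v[mapped.isna()]  # the original labels that had no mapping
```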

An interesting generalized version of the above:

v.replace({r'^(?!yes).*$' : 0}, regex=True).astype(bool).astype(int)

RowX    1
RowY    0
RowW    1
RowJ    0
RowA    1
RowR    0
RowX    1
RowY    1
RowW    1
RowJ    1
RowA    1
RowR    0
Name: Column 3, dtype: int64

Everything that is not "yes" becomes 0.
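
The same "everything except yes is 0" rule can also be written as a plain boolean comparison. The sample table in the question shows values such as `yes.` with a trailing period; if those periods are really present in the data (an assumption, they may only be a display artifact), they would need to be stripped first. A sketch under that assumption, with made-up values:

```python
import pandas as pd

raw = pd.Series(["yes.", "no.", "yes", "n/a"])

# Drop any trailing periods, then compare against "yes"
cleaned = raw.str.rstrip(".")
flags = (cleaned == "yes").astype(int)  # True -> 1, False -> 0
```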
